It was 5:17pm today, just as I was wrapping up work for the day, when my manager pinged me with the following chat:
: Hello Jeremy, we have a ticket - escalated, etc. I'm also trying to help. If diagnostics are needed, are you available tonight?
No obligation, just checking whether I was available. After a quick think, I realized I didn't have any plans tonight. So why not? If I could help, then other members of the team wouldn't have to - and I had nothing better to do.
: I'm available all night
I synced up with my coworker and then joined the bridge, where the front-line tech team was troubleshooting.
Last Wednesday I was talking with some software engineers I work with, and I shared my opinion that crashes are not the most interesting problem: a crash (which usually leaves a core dump) can be comparatively simple to debug. The nastier problems are vague brown-outs. They don't trigger alarms or health monitoring. All the synthetic user transactions succeed. The logs contain no errors. But one particular function or module of your application that usually completes in a few seconds suddenly starts taking 10 minutes to finish. It might be caused by a single SQL query suddenly taking longer to complete - yet if you run that SQL yourself, it finishes in seconds. Meanwhile you can see database connections accumulating, and unrelated parts of the application start experiencing delays acquiring new connections. Nothing is failing, but something isn't working for your users. If it isn't addressed quickly, an ominous backlog starts to build up in the work queue, which can lead to much larger problems. The freight train is coming toward you... unavoidable, and torturously slow.
This was exactly the situation I found myself in today. One part of their software architecture uses PostgreSQL, with multiple deployments of the application all over the globe. This morning, in one geography, they performed a major version upgrade of a PostgreSQL database to version 11 - and after the upgrade, one part of the application began experiencing problems.
By the time I joined the call, the investigation was well underway. The team had identified the affected part of the application, and had narrowed the problem down to one particular SQL statement executed by that module. They had also discovered a throttle that could slow down or pause this specific module, so they temporarily disabled it - avoiding further adverse effects on the database except while actively troubleshooting. When the module was enabled, the PostgreSQL view pg_stat_activity showed hundreds of connections concurrently running this SQL. The view showed no wait events, which indicated the SQL was running on the CPU - and system CPU utilization spiked to around 100%, so it was pretty obvious this wasn't a matter of some un-instrumented off-CPU wait.
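A sketch of the kind of pg_stat_activity check that tells you this - counting active backends by wait event, where a NULL wait event means the backend is on CPU (or in an un-instrumented code path):

```sql
SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
 GROUP BY wait_event_type, wait_event
 ORDER BY count(*) DESC;
```

A pile of active rows with NULL wait events, alongside 100% system CPU, points away from lock and I/O waits.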
The query itself was quite simple, but there's one thing you need to know: it used a dynamic in-list containing anywhere between one and a thousand elements. It's a very common pattern, and an unfortunate one, because it makes it difficult for both humans and computers to group related queries together. For example, the pg_stat_statements view ends up with several hundred different entries, each with a different number of in-list elements. But the query operated on a single table and had only one filter in addition to the in-list (a timestamp).
Here is a rough outline of the query:
SELECT ... FROM the_table
WHERE (text_col1, numeric_col3) IN (('first1',1),('second1',2), ... )
  AND timestamp_col1 = :1
That's all! Simple! According to pg_relation_size(), the table was 845GB. Two indexes are relevant for the query at hand: a 229GB btree "index1" on (text_col1, text_col2) and a 206GB btree "index2" on (numeric_col1, numeric_col2, numeric_col3, timestamp_col1). Note that the query references only the last two columns of this second index.
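For reference, the two indexes described above would look roughly like this (a sketch using the anonymized names from this post, not the actual DDL):

```sql
CREATE INDEX index1 ON the_table (text_col1, text_col2);
CREATE INDEX index2 ON the_table (numeric_col1, numeric_col2, numeric_col3, timestamp_col1);
```

Because the query touches only the last two columns of index2, a btree can't use its leading columns to navigate - a detail worth keeping in mind.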
The most obvious theory was that the planner had produced a poor execution plan. Major version upgrades of any relational database can bring this about: a major version includes all the accumulated improvements and tweaks to the planner, optimizer, costing algorithms, and so on. Most of your queries either stay the same or get faster - but there might be one or two regressions, and it takes just one to ruin your weeknight.
Nonetheless, when we executed the query from psql it returned in a few milliseconds - the good performance we expected. We connected to another PostgreSQL deployment in a different geography and confirmed the same execution time. The theory being floated as I joined the call was that the application used prepared statements: perhaps after the upgrade, the application had connected and cached prepared statements with poor plans before updated statistics were available.
Side note: as many people learn the hard way, a major version upgrade in PostgreSQL does not carry object statistics over - you need to run ANALYZE after the upgrade to rebuild them.
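The fix is simple but easy to forget - after the upgrade completes, rebuild the statistics before putting traffic on the database:

```sql
-- refresh optimizer statistics for every table in the current database
ANALYZE;
```

The vacuumdb utility's --analyze-in-stages option does the same thing across all databases, producing rough-but-usable statistics quickly and then refining them in further passes.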
The team was about to begin a rolling restart of the application server fleet in the geography with the newly upgraded database. They had checked that the database statistics were up to date, and wanted to make sure the application servers got fresh connections - and newly prepared statements with the best execution plans.
I listened carefully. I knew there wasn't much evidence that this would solve the problem, though it was a good guess. In the meantime, I continued reading the ticket history and double-checked the work that had been done around testing execution plans.
Unlike some other relational databases, PostgreSQL doesn't give much visibility into execution plans at runtime. There's auto_explain, which can log the actual plan of any SQL that successfully completes with a runtime above some threshold. The pg_stat_activity view gained query_id in PostgreSQL 14, and I'm hopeful a plan_id might come someday - but that work hasn't been done yet. With a plan hash in both pg_stat_activity and the EXPLAIN output, it would be trivial to know whether the query your app server is currently executing has the same plan as the one you just ran in psql - even for low-latency, high-frequency queries that you couldn't log due to volume. Dynamic in-lists would still cause problems, but I don't know of any relational database that has a great solution for those.
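For completeness, here's roughly what an auto_explain setup looks like; the thresholds below are illustrative and would need tuning for a real workload (log_analyze in particular adds measurement overhead):

```ini
# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '5s'  # log plans for statements slower than this
auto_explain.log_analyze = on         # include actual row counts and timings
```

The catch is the one noted above: auto_explain only logs statements that run to completion, so a query that errors out never shows you its plan.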
(Edit 10-Feb-2022: Joshua left a comment reminding me about Cybertec and Hironobu Suzuki's extension pg_show_plans. I'm kicking myself for not mentioning it in this article. I'd seen the extension before, but I haven't yet had a chance to install and test it - it's now a priority for me to take a closer look!)
The next best solution I know of is using pg_stat_statements as a proxy of sorts. I asked the DBA on the team to run the following query:
select substr(query,1,40), calls, total_time, mean_time, stddev_time,
       (shared_blks_hit+shared_blks_read)/calls blks, rows
from pg_stat_statements
where query like '%(text_col1, numeric_col3)%'
order by total_time desc;
And here's what we got back:
         substr          | calls |  total_time  |    mean_time     |   stddev_time    | blks | rows
-------------------------+-------+--------------+------------------+------------------+------+------
 select from table whe   |    57 | 21904.791707 | 384.294591350877 | 759.553887634356 | 2444 |   24
 select from table whe   |  2383 | 20875.809368 | 8.76030607133866 |  39.355973386603 | 2610 |   10
 select from table whe   |    98 | 14309.699707 |  146.01734394898 | 396.703422675129 | 1846 |   10
 select from table whe   |    78 | 14282.730893 | 183.111934525641 |  430.76318896177 | 1769 |   43
 select from table whe   |    41 | 14053.415115 | 342.766222317073 | 848.723086074263 | 2893 |   28
 select from table whe   |   106 | 12767.352093 | 120.446717858491 | 293.357452072648 | 1756 |   24
 select from table whe   |    51 | 12442.307684 | 243.966817333333 | 530.104152533396 | 2077 |   22
 ...
 explain analyze verbose |     1 |    23.031251 |        23.031251 |                0 |  241 |    0
 select from table whe   |     2 |    11.987225 |        5.9936125 |        0.7968815 |   46 |    2
 select from table whe   |     1 |     6.697005 |         6.697005 |                0 | 3026 |    1
 select from table whe   |     1 |     6.032849 |         6.032849 |                0 | 2704 |    0
(315 rows)
I asked them to execute the same query in another geography that was still running the older major version of PostgreSQL. The other geography had a much higher call count, because pg_stat_statements shows cumulative totals since the last reset and the problem database had been restarted during the upgrade.
The first thing I noticed was the drastically different ratio of rows to blocks between the regions - the good geography had more rows. Could this mean a different execution plan was being used? Then I realized I had made a mistake interpreting my own query: I was reporting blocks per call, but total rows across all calls. Once I accounted for that, the numbers in the good and bad geographies didn't really look remarkably different. The same was true for the execution plans and execution times when we ran the problem queries at a psql prompt - the same "good" values in both the good and bad geographies.
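For anyone copying this approach: the mistake was dividing blocks by calls but not rows. A corrected version of the earlier query would normalize both the same way (same hypothetical filter as before):

```sql
SELECT substr(query,1,40), calls, total_time, mean_time, stddev_time,
       (shared_blks_hit + shared_blks_read) / calls AS blks_per_call,
       rows / calls AS rows_per_call
  FROM pg_stat_statements
 WHERE query LIKE '%(text_col1, numeric_col3)%'
 ORDER BY total_time DESC;
```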
Here is the healthy, fast execution plan we were seeing:
Bitmap Heap Scan on table  (cost=3791.09..80234213.68 rows=6713 width=273) (actual time=74.911..74.953 rows=0 loops=1)
  Output: ...
  Recheck Cond: (((text_col1)::text = 'first1'::text) OR ((text_col1)::text = 'second1'::text) OR ...)
  Filter: ((timestamp_col1 <= '2022-02-09 00:00:00'::timestamp without time zone) AND (((text_col1)::text = 'first1'::text) AND ... AND (numeric_col3 = '2'::numeric)))
  ->  BitmapOr  (cost=3791.09..3791.09 rows=212565 width=0) (actual time=74.905..74.946 rows=0 loops=1)
        ->  Bitmap Index Scan on index1  (cost=0.00..38.65 rows=2261 width=0) (actual time=0.026..0.026 rows=0 loops=1)
              Index Cond: ((text_col1)::text = 'first1'::text)
        ->  Bitmap Index Scan on index1  (cost=0.00..38.65 rows=2261 width=0) (actual time=0.446..0.446 rows=0 loops=1)
              Index Cond: ((text_col1)::text = 'second1'::text)
        ->  Bitmap Index Scan on index1  (cost=0.00..38.65 rows=2261 width=0) (actual time=0.447..0.447 rows=0 loops=1)
              Index Cond: ((text_col1)::text = 'third1'::text)
Listening on the call, I heard someone mention that queries were timing out due to a statement_timeout setting of around 2 minutes. This raised a question: did the pg_stat_statements data simply not reflect the long-running queries, because those queries were erroring out? I believe that's exactly what was happening. It also gave me another idea: I asked whether anyone had access to the PostgreSQL log files and could calculate the rate of timeout errors. Errors are the "E" common to both the RED and USE methods - you should be paying attention to them! The error rate confirmed that the problem started with the upgrade: it was clearly present in the upgraded region and absent from the others.
At this point I started to feel a little worried. If it wasn't a plan shift, then what? A hardware problem, like a defective CPU running at the wrong frequency? (Seen it before.) A hard-to-find bug in the Linux kernel? (Seen those too, and they're no fun.) A colleague had already done a quick-and-dirty run of perf and taken a screenshot, but I wanted to up the ante. We have a script of our own, very similar to Tanel Poder's run_xcpu.sh from his 0x.tools collection, which uses perf for conservative, low-frequency sampling of stacks from running processes. It can run continuously in the background, keeps a week of data, and can safely be left running for months without overloading a server - even one with many cores. Like Tanel's tool, it's safe to use in production. We flipped the switch and started profiling while the database was pegged at 100% CPU.
After about 10 minutes, I generated a flame graph. Here's what I saw:
When I first looked at the CPU profile of the bad system, I didn't think it would help much. The call stack matched the execution plan we saw above - the one that performed well in our tests. ExecutorRun calls ExecScan, which corresponds to the "Bitmap Heap Scan", which in turn calls the bitmap index scan. I already knew that both a seqscan and a classic index scan were slow (we had tested them using session GUCs), but neither of those appeared to be happening here. It was clearly doing the bitmap scan - which seemed to confirm my fear that the problem wasn't a bad plan.
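Ruling plans in or out with session GUCs looks something like this - a sketch, with the query abbreviated:

```sql
SET enable_seqscan = off;      -- steer the planner away from sequential scans
SET enable_bitmapscan = off;   -- ...and away from bitmap scans
EXPLAIN (ANALYZE, VERBOSE)
SELECT ... FROM the_table WHERE ... ;  -- the problem query
RESET enable_seqscan;
RESET enable_bitmapscan;
```

These settings only affect the current session, so it's a safe way to time alternative plans by hand while production traffic keeps its usual plans.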
To generate a comparison, I created a flame graph with the same script against the older-major-version PostgreSQL in the good geography, and highlighted the ExecScan functions. This is what it looked like:
The first thing to notice here is the sample count. Remember that both profiles gathered samples at the same frequency for about the same duration (10 minutes), so the difference in sample count reflects a real difference in CPU time. All the data we see in the healthy system is likely still there on the bad system - but it's so tiny you can't see it, because it's just a few data points next to the 400 thousand samples coming from the one bad query that's killing the CPU.
Now, I thought to myself, this query must still be running somewhere on the bad system. Although the call stack might differ slightly from the older major version of PostgreSQL, I was confident it would be nearly identical. I started with the ExecScan nodes, looking at the ones that appeared lowest in the stack. It took no time at all to find it - the only one not nested under some kind of loop or sort:
When I zoomed in on this node, I noticed something I didn't quite understand. It was almost identical to the unhealthy system - I was confident my query was there - but there was more. I saw the function MultiExecBitmapAnd with a mere 23 samples, and alongside it the function MultiExecBitmapOr with 63 samples. Some other, very similar execution plan was in play - closely related to the one causing problems.
Now, I'm a big fan of the Oak Table Network, one reason being their dogged advocacy of an evidence-based, scientific, methodical approach to database problem solving. (You'd be surprised how common it is to "find the smartest/loudest person and let them guess until the problem disappears...") But when dealing with complex problems, the diagnostic data you need is often unavailable, and you have to deduce your way forward step by step. This seemed like one of those situations. I took my headset off, walked away from the desk for a bit, and tried to think outside the box.
I was thinking, "I wonder if these are related?" I searched the bad flame graph for MultiExecBitmapOr - lo and behold, it was there - with a mere 366 samples, a tiny invisible sliver next to the bad SQL. And when I looked back at the execution plan from the good runs of the SQL, I noticed something important: it contained a node called "BitmapOr". The more I compared the two flame graphs and the data from the other systems, the more plausible it seemed that the good SQL plan corresponded to the healthy stacks. So what if this good SQL plan is not the one that's killing the CPU?
It was high time to learn something new about PostgreSQL: what's the difference between "BitmapOr" and "BitmapAnd", and how could the latter kill my CPU? I started with the PostgreSQL documentation, which turned up a few hits but nothing particularly useful. Next I tried Google, which led me to a little blurb on the PgMustard site - and I grew more confident that my CPU profile could indeed be the signature of a bad plan. Finally, I hopped over to GitHub for a quick search of the PostgreSQL codebase. After reading a few comments and looking at the code hits, I was confident: the call stacks in the bad database's flame graph reflected a different plan from the one we got when we ran the query ourselves.
Then I had a hunch.
The query contained an in-list, which amounts to a series of OR conditions. It also contained one additional filter joined by an AND condition: the timestamp. I thought that just maybe, if the planner expected lower cardinality out of that date filter, it might try to pull that data from an index and bui