PostgreSQL Tuning – pg_stat_statements

Although I’ve dabbled in Postgres since 2002, I’ve only been working with it heavily since 2019. My favorite challenge was tuning a 37 TB AWS RDS instance that ran a flagship feature for a previous employer. It was a very busy system, much like an IoT system, enduring millions of reads/writes and billions of updates and deletes every day. It was so busy I was only able to investigate and tune from 9pm to 2am. At any other time, even the simplest queries wouldn’t return.

When you embark on a big tuning effort like the one just described and are tasked with figuring out where the sluggishness is coming from, it is invaluable to peer into all the unique SQL statements the system has had to process over time and see where inefficiencies are lurking. The pg_stat_statements extension is crucial to such an effort. It was first added in version 8.4, and its statement monitoring has continued to improve with every release.

It is a good idea to learn how to use it to your advantage.

Getting Started

Assuming the user/role you are using to connect has the appropriate level of privileges, query your statement monitoring cache:

> select * from pg_stat_statements;

If this errors out, then either you don’t have the right privileges, or it still needs to be installed by your DBA (which might be you). First check that it is included in the preloaded libraries. Open your postgresql.conf file and look for this line:

shared_preload_libraries = 'pg_stat_statements'

There might be other libraries in that list of preloads. If you have to add pg_stat_statements to that parameter, restart Postgres. Once that is done, enable the feature:

> CREATE EXTENSION pg_stat_statements;

This will create the pg_stat_statements view (and also pg_stat_statements_info on v14.x). A number of columns were renamed in version 13, and the examples below were written for version 14. If they do not run on your system, your version is 12 or below; just remove the word “exec” from total_exec_time, min_exec_time, max_exec_time, mean_exec_time and stddev_exec_time. I won’t spend any time explaining each column. Most are self-explanatory, and you can read the official Postgres doc linked above to get a better understanding of each one.
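If you want to verify the setup from a session before going further, a quick sanity check like this should work (exact output will vary with your version and configuration):

-- should list pg_stat_statements among the preloaded libraries
SHOW shared_preload_libraries;

-- column names differ on version 12 and below
SHOW server_version;

-- confirm the extension is installed in the current database
select extname, extversion
  from pg_extension
 where extname = 'pg_stat_statements';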

There are a handful of postgresql.conf parameters that pertain to statement tracking. I like to reveal their values using this:

select name, setting, unit, min_val, max_val, source, vartype, enumvals, boot_val
  from pg_settings
 where name like 'pg_stat%' or name in ('track_io_timing','track_activity_query_size');
[Screenshot: pg_stat_statements gathering parameters]

The default settings are typically good enough. However, if your queries are quite large, you might want to set track_activity_query_size to 2048 or 4096. Multi-page queries might require even more bytes to accommodate the whole thing. Also, if you find that the blk_read_time and blk_write_time columns are zero for your statements, it means that the track_io_timing parameter is off. You can turn it on for the entire server in the postgresql.conf file, or just for the current database, like this:

> ALTER DATABASE mydb SET track_io_timing = on;

It does consume some resources to track I/O. You may wish to turn it back off once your heaviest physical disk-touching loads have been identified and corrected.
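For example, to confirm the current value and later revert the per-database override (mydb is the placeholder database name from above), something like this should do; the track_activity_query_size line is only an illustration and requires a server restart to take effect:

-- check whether I/O timing is currently being collected
SHOW track_io_timing;

-- revert the per-database override once the heavy hitters are fixed
ALTER DATABASE mydb RESET track_io_timing;

-- illustrative only: allow longer statements to be captured (needs a restart)
ALTER SYSTEM SET track_activity_query_size = 4096;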

I’d like to introduce you to a handful of queries, picked up here and there over years of tuning, that I find handy for quickly identifying sore spots that need attention.


Top SQL Consumers: Time

Below is my favorite query for working with tracked statements. It is multi-purpose: you can add WHERE predicates, filter on a partial query string to narrow the results to just the query or two you are looking for, or switch out the column(s) being ordered by to change the focus. This version is the one I use the most, bubbling the biggest consumers of time (and therefore resources) on the system to the top.

with stmts as (
   select query
        , round((total_exec_time/1000/60/60)::numeric, 2) as total_hours
        , round((100 * total_exec_time / sum(total_exec_time::numeric) OVER ())::numeric, 2) AS pct_of_total_hours
        , calls
        , rows
        , (round(blk_read_time::numeric,0))/1000 blk_read_seconds
        , (round(blk_write_time::numeric,0))/1000 as blk_write_seconds
        , (blk_read_time+blk_write_time)/calls as physical_io_per_call_ms
        , round(min_exec_time::numeric, 2) as min_call_ms
        , round(max_exec_time::numeric, 2) as max_call_ms
        , round(mean_exec_time::numeric, 2) as mean_call_ms
        , temp_blks_read
        , temp_blks_written
     from pg_stat_statements
    where dbid = (select oid from pg_database where datname = 'postgres')
)
select *
  from stmts
 where 1=1
   --and query like '%:partial_query_string%'
order by total_hours desc, calls desc;

Be sure to replace “postgres” in the dbid subquery with the name of your database of interest if it uses something other than the default.
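Alternatively, if you always run the query while connected to the database you care about, a small tweak is to let Postgres resolve the name for you:

-- same predicate, but pinned to whatever database you are connected to
where dbid = (select oid from pg_database where datname = current_database())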

If I already know the statement I’m looking for, or a table name or keyword found in that statement, I can uncomment the filter query like '%:partial_query_string%' and re-run to find only the statements matching my string. You can get fancier with regular expressions too, as shown below.
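For instance, plugged into the final SELECT of the query above, a case-insensitive regular-expression filter might look like this (order_items is a made-up table name for illustration):

select *
  from stmts
 where query ~* 'insert\s+into\s+order_items'
order by total_hours desc, calls desc;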

You may find yourself working on a system where each of thousands of unique statements differs by only a few characters. To reduce the noise, you could truncate each query down to its first 40 or 50 characters and group the measurements by that. This aggregation may reveal the primary culprit, occupying 40, 50, or even 70% of the total time for all statements. From there you can jump back to the query above, using the partial_query_string filter to find all statements sharing the same signature as seen in the grouped version below, and zero in on the individual statements causing the majority of the load on the database.

-- bucket statements by simple truncation
with stmts as (
   select query
        , round((total_exec_time/1000/60/60)::numeric, 2) as total_hours
        , round((100 * total_exec_time / sum(total_exec_time::numeric) OVER ())::numeric, 2) AS pct_of_total_hours
        , calls
        , rows
        , (round(blk_read_time::numeric,0))/1000 blk_read_seconds
        , (round(blk_write_time::numeric,0))/1000 as blk_write_seconds
        , (blk_read_time+blk_write_time)/calls as physical_io_per_call_ms
        , round(mean_exec_time::numeric, 2) as mean_call_ms
        , temp_blks_read
        , temp_blks_written
     from pg_stat_statements
    where dbid = (select oid from pg_database where datname = 'postgres')
)
select substring(query,1,50) as query_first_50chars, sum(total_hours) as hours, sum(pct_of_total_hours) as pct
     , sum(calls) as calls, sum(rows) as rows, round(avg(mean_call_ms),2) as mean_call_ms
     , sum(temp_blks_read) as temp_blks_read, sum(temp_blks_written) temp_blks_written
  from stmts
 group by query_first_50chars
order by hours desc, calls desc;

There are times in your tuning efforts when one query is so horrendous, consuming most of the resources or taking so long to complete, that it skews the results. This will usually stand out like a beacon: a gigantic difference between min and max execution time, and a large standard deviation. In the example below, ordered by total time, the queries behind rows 3 and 4 have really huge stddev. Their minimum execution time is .02ms, and their averages are 19s and 8s respectively, but look at the maximum! 105 hours and 51 hours to execute!

[Screenshot: example output from a trimmed version of the pg_stat_statements query]

If those queries took days to run every time, that would be the first thing you’d fix. But they don’t. Here’s where you dig a little deeper and exercise judgment. Pursuing queries with a large max time or stddev could be a waste of your time. There might have been one anomalous incident where resources were locked, preventing the statements from completing. Steps should definitely be taken to log and alert when queries take this long in the future (subject of a future blog post), instead of letting them hang indefinitely. But if this only happened once or twice, it is a distraction. In fact, queries 3 and 4 aren’t executed much at all compared to queries 1 and 2, and might have even stopped being used. Perhaps they were only used during an initial migration. The mean tells us queries 3 and 4 are not too bad. Indeed, these queries, which I can’t show to protect my client, are part of a backend data pipeline that does large load and deletion operations, so 8s to 19s is OK. It is for reasons like these that I’ll glance at stddev, but I get more value from focusing on total time taken.
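That said, if you do want to deliberately scan for these outliers, a quick sort straight against the view works; this is just a sketch using the version 13+ column names:

select substring(query, 1, 60) as query_start
     , calls
     , round(min_exec_time::numeric, 2)    as min_call_ms
     , round(max_exec_time::numeric, 2)    as max_call_ms
     , round(mean_exec_time::numeric, 2)   as mean_call_ms
     , round(stddev_exec_time::numeric, 2) as stddev_call_ms
  from pg_stat_statements
 order by stddev_exec_time desc
 limit 20;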

I don’t have an example to show at the moment, but imagine that query 1 in the screenshot above took a mean time of 1 second to return. That seems pretty good if this were a query supporting a UI screen. But it is not. It was called over six million times in the week-long period over which these stats were gathered. A one-second response time for a query called millions of times could break the business or bring the system to its knees. Luckily the top two statements both take less than 100ms and aren’t a concern (the second query is a stored procedure, and Postgres doesn’t appear to track rows affected by stored routines; otherwise the rows column would be informative).

Top SQL Consumers: Temp

One indication of a wasteful query, or of a work_mem setting that is too small, is heavy use of temp space: there wasn’t enough work_mem in the session to handle the sort operations, so they spilled to disk. Use this to identify those types of statements:

with stmts as (
   select query
        , round((total_exec_time/1000/60/60)::numeric, 2) as total_hours
        , round((100 * total_exec_time / sum(total_exec_time::numeric) OVER ())::numeric, 2) AS pct_of_total_hours
        , calls
        , rows
        , (round(blk_read_time::numeric,0))/1000 blk_read_seconds
        , (round(blk_write_time::numeric,0))/1000 as blk_write_seconds
        , (blk_read_time+blk_write_time)/calls as physical_io_per_call_ms
        , round(min_exec_time::numeric, 2) as min_call_ms
        , round(max_exec_time::numeric, 2) as max_call_ms
        , round(mean_exec_time::numeric, 2) as mean_call_ms
        , round(stddev_exec_time::numeric, 2) as stddev_call_ms
        --, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written
        , temp_blks_read
        , temp_blks_written
     from pg_stat_statements
    where dbid = (select oid from pg_database where datname = 'postgres')
)
select *
  from stmts
order by (temp_blks_read + temp_blks_written) desc, calls desc;

Having identified some big users of temp space, you now need to know the business context behind these queries. Many top users of temp space will be large INSERT, CREATE INDEX, DELETE and other bulk-type operations that don’t happen very often. These require lots of in-memory work that spills over to temp and probably don’t need tuning. If you find a big, bloated query, though, one that takes 34 minutes every execution and inefficiently spills over into disk access and temp space (and nobody knew because it was running at night and not affecting end users), then it probably deserves attention and tuning.
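If rewriting such a query isn’t immediately possible, one interim option is to raise work_mem only for the session or role that runs it rather than server-wide; the 256MB value and the etl_user role below are made up for illustration:

-- for the current session only
SET work_mem = '256MB';

-- or for every future session of one role
ALTER ROLE etl_user SET work_mem = '256MB';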

Top SQL Consumers: I/O

In some circumstances, you may be more concerned about queries that have to spend inordinate effort reading from or writing to disk. To highlight these operations, just take the same query above, but order by physical_io_per_call_ms DESC.
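In other words, keep the same CTE and just change the final SELECT to something like:

select *
  from stmts
 order by physical_io_per_call_ms desc, calls desc;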

In Closing…

It isn’t enough just to query pg_stat_statements and look for big time hogs. That’s a great first step and may very well net you some “low-hanging fruit” and quick wins. But you could also spin your wheels and tune the wrong thing.

Knowing the business, knowing the app or service context where the query is used, asking the right questions, and figuring out what the query should be doing, how much data should be processed at each step in the query plan, and the order in which the data should be joined and sorted: this is all part of being a good performance tuner. “Know thy data” is the first principle I teach in my classes on data design and performance tuning. It helps you quickly zero in on true waste of resources, ignoring the fluff and noise, and it is how you earn your living as a rock-star data engineer.

Now that I’m writing again, I’d like to encourage you to subscribe to the blog for future posts. Feel free to contact me if you have questions or a desire to engage, or leave a comment if you need a clarification or find a technical error.


Tips

  • Create a weekly job, perhaps orchestrated by cron, which saves a snapshot of everything in pg_stat_statements to a local table. Perhaps create a new schema in the database named monitor to hold this table and anything else you create to watch and learn about the database’s behavior over time (see the sketch after this list).
  • If you intend to use the pg_stat_statements_reset() function to reset statement stats for the database, a user, or a query, save a copy of the pg_stat_statements data for that query, user, or entire database BEFORE you reset, then make your change and re-test to see whether performance is better.
  • pg_stat_statements only records statements that ran to completion. To see queries currently in flight, which is typically useful when a query isn’t returning and you’d like to know what it is doing, query pg_stat_activity instead (subject of a future blog post).
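Here is a minimal sketch of the first tip; the monitor schema and stmt_snapshot table are names I made up, so adjust them to your conventions:

create schema if not exists monitor;

-- first run: copy the structure of pg_stat_statements plus a capture timestamp
create table if not exists monitor.stmt_snapshot as
select now() as captured_at, *
  from pg_stat_statements
  with no data;

-- weekly job (cron, pg_cron, etc.): append the current stats
insert into monitor.stmt_snapshot
select now(), *
  from pg_stat_statements;

-- optionally start each week with a clean slate afterward
-- select pg_stat_statements_reset();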