Timed wait events in PostgreSQL

This is something that doesn't exist in PostgreSQL at the time of writing. However, it's extremely easy to add it, in a way that has negligible overhead. The overhead is the "classic" reason that PostgreSQL does not add it.

The way to add timing is to indirectly add it, by adding a "probe" to the wait event start and end functions (pgstat_report_wait_start() and pgstat_report_wait_end() functions) before the compilation and linking of the source. By having such a probe, we can use bpftrace (or SystemTap) to pick up the wait event start and end occasions, and see and time the wait events.

Adding probes is very easy. To add the two probes for wait event start and wait event end, it's the addition of 5 lines to the source of PostgreSQL:

Add the probes to the file that contains the pgstat_report_wait_start() and pgstat_report_wait_end(); src/include/utils/wait_event.h:

diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 009b03a520..f3b06113e9 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -28,6 +28,9 @@
 /* enums for wait events */
 #include "utils/wait_event_types.h"

+/* pg_trace for TRACE_POSTGRESQL macros */
+#include "pg_trace.h"
+
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
 extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
 static inline void pgstat_report_wait_start(uint32 wait_event_info);
@@ -87,6 +90,7 @@ extern char **GetWaitEventExtensionNames(int *nwaitevents);
 static inline void
 pgstat_report_wait_start(uint32 wait_event_info)
 {
+       TRACE_POSTGRESQL_WAIT_EVENT_START(wait_event_info);
        /*
         * Since this is a four-byte field which is always read and written as
         * four-bytes, updates are atomic.
@@ -103,6 +107,7 @@ pgstat_report_wait_start(uint32 wait_event_info)
 static inline void
 pgstat_report_wait_end(void)
 {
+       TRACE_POSTGRESQL_WAIT_EVENT_END();
        /* see pgstat_report_wait_start() */
        *(volatile uint32 *) my_wait_event_info = 0;
 }

And add the probes to the probe list in src/backend/utils/probes.d:

diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 0af275587b..a1e53e8e24 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -91,4 +91,8 @@ provider postgresql {
        probe wal__switch();
        probe wal__buffer__write__dirty__start();
        probe wal__buffer__write__dirty__done();
+
+        probe wait__event__start(uint32);
+        probe wait__event__end();
+
 };

Next up is set the switch to enable "dtrace" for compilation, and compile and link the PostgreSQL source.

To validate the addition of the new probes, you can use the bpftrace -l command to list the probes:

bpftrace -l 'usdt:/usr/local/pgsql/bin/postgres:wait*'
usdt:/usr/local/pgsql/bin/postgres:postgresql:wait__event__end
usdt:/usr/local/pgsql/bin/postgres:postgresql:wait__event__start

There they are!

I extended the script query-analyzer.bt with wait events: query-analyzer-waits.bt.

Now I can use bpftrace and the probes to see where the time is spent, including the wait events:

root@84fe326249c0:/postgres-bpftrace# ./query-analyzer-waits.bt -p 18183
Attaching 200 probes...
PostgreSQL statement execution analyzer.
Time in microseconds (us).
pid   :Phase      :time to phase :time in phase : query
------|-----------|--------------|--------------|------
[21964] WAIT      : (          ) :    3878097701: .
[21964]Query start:              :              : select now();
[21964] parse     : (       113) :           318: select now();
[21964] rewrite   : (       199) :           302: select now();
[21964] plan      : (       102) :           488: select now();
[21964] execute   : (       647) :           294: select now();
[21964]Query done : (       505) :          2972: select now();

I executed select now(); in psql, we can see it's using the simple query protocol (Query start/Query done). The execution ended a wait, but because the wait class and name info is only with the wait start, I cannot see it here. It is a client.CLIENT_READ wait.

For such a small, simple query there is no wait event that is encountered at all. This is a bit odd, I am sure the result of the execution is sent back to my client (psql), so this send is not timed?

I am pretty sure this is deliberately, because if I sent back a lot of data to the client, I can generate client.CLIENT_WRITE waits. The event client.CLIENT_WRITE seems to be the event that indicates time spent on sending data to the client (turn off the pager in psql for that: \pset pager off):

[21964] WAIT      : (          ) :      22160788: client.CLIENT_READ
[21964]Query start:              :              : select repeat('x',1000000);
[21964] parse     : (        90) :            49: select repeat('x',1000000);
[21964] rewrite   : (       422) :            48: select repeat('x',1000000);
[21964] plan      : (       669) :          5788: select repeat('x',1000000);
[21964] WAIT      : (          ) :           444: client.CLIENT_WRITE
[21964] WAIT      : (          ) :           119: client.CLIENT_WRITE
[21964] WAIT      : (          ) :           266: client.CLIENT_WRITE
[21964] execute   : (        98) :          1776: select repeat('x',1000000);
[21964]Query done : (       166) :          9109: select repeat('x',1000000);

This is interesting too: when I create a table, the WAL write(io.WAL_WRITE) and fsync() call (io.WAL_SYNC) is happening when the execute phase is done (the execute phase reports its time before the io.WAL_WRITE and io.WAL_SYNC):

[21964] WAIT      : (          ) :      11254040: client.CLIENT_READ
[21964]Query start:              :              : create table t(id int, f1 text);
[21964] parse     : (        71) :            83: create table t(id int, f1 text);
[21964] rewrite   : (       283) :             8: create table t(id int, f1 text);
[21964] WAIT      : (          ) :           292: io.DATA_FILE_EXTEND
[21964] WAIT      : (          ) :           116: io.DATA_FILE_EXTEND
[21964] WAIT      : (          ) :          4512: io.DATA_FILE_IMMEDIATE_SYNC
[21964] execute   : (         0) :         12127: create table t(id int, f1 text);
[21964] WAIT      : (          ) :            67: io.WAL_WRITE
[21964] WAIT      : (          ) :          1778: io.WAL_SYNC
[21964]Query done : (      2425) :         15252: create table t(id int, f1 text);

And if I insert data, the fsync() is also happening outside of the execute phase:

[21964] WAIT      : (          ) :      45872174: client.CLIENT_READ
[21964]Query start:              :              : insert into t select id, repeat('x',10) from generate_series(1,
[21964] parse     : (       361) :           645: insert into t select id, repeat('x',10) from generate_series(1,
[21964] rewrite   : (       935) :          1296: insert into t select id, repeat('x',10) from generate_series(1,
[21964] plan      : (        97) :          2486: insert into t select id, repeat('x',10) from generate_series(1,
[21964] WAIT      : (          ) :           638: io.DATA_FILE_EXTEND
[21964] execute   : (         0) :          1969: insert into t select id, repeat('x',10) from generate_series(1,
[21964] execute   : (         0) :          2131: insert into t select id, repeat('x',10) from generate_series(1,
[21964] WAIT      : (          ) :            98: io.WAL_WRITE
[21964] WAIT      : (          ) :          2448: io.WAL_SYNC
[21964]Query done : (      3401) :         11518: insert into t select id, repeat('x',10) from generate_series(1,

This is how a temp spill looks like:

[21964] WAIT      : (          ) :       5325897: client.CLIENT_READ
[21964]Query start:              :              : select count(*) from generate_series(1,85_350);
[21964] parse     : (        94) :            41: select count(*) from generate_series(1,85_350);
[21964] rewrite   : (       115) :            58: select count(*) from generate_series(1,85_350);
[21964] plan      : (        49) :            62: select count(*) from generate_series(1,85_350);
[21964] WAIT      : (          ) :            33: io.BUFFILE_WRITE
[21964] WAIT      : (          ) :             8: io.BUFFILE_WRITE
[21964] WAIT      : (          ) :            12: io.BUFFILE_WRITE
[21964] WAIT      : (          ) :            12: io.BUFFILE_WRITE
[21964] WAIT      : (          ) :             7: io.BUFFILE_WRITE
...many more io.BUFFILE_WRITEs...
[21964] WAIT      : (          ) :            10: io.BUFFILE_READ
[21964] WAIT      : (          ) :             6: io.BUFFILE_READ
[21964] WAIT      : (          ) :             6: io.BUFFILE_READ
[21964] WAIT      : (          ) :             4: io.BUFFILE_READ
[21964] WAIT      : (          ) :             4: io.BUFFILE_READ
...many more io.BUFFILE_READs...
[21964] execute   : (        67) :         35538: select count(*) from generate_series(1,85_350);
[21964]Query done : (       275) :         36303: select count(*) from generate_series(1,85_350);

Almost the same number of temp IOs can be seen to with explain analyse (buffers).

The first session to logon to a database must (re)create the relcache. This is done before the session gets to execute anything. This is how it looks like on my database:

[14454] WAIT      : (          ) :            32: io.RELATION_MAP_READ
[14454] WAIT      : (          ) :            66: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            10: io.RELATION_MAP_READ
[14454] WAIT      : (          ) :            68: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            45: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            21: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            42: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            77: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            39: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            19: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            49: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            14: io.DATA_FILE_READ
[14454] WAIT      : (          ) :           120: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            16: io.DATA_FILE_READ
...more io.DATA_FILE_READs

On my database with no user tables, this generates 2 io.RELATION_MAP_READ events, and 71 io.DATA_FILE_READ events. These are cached on the OS, and therefore they have the above latency (the latency is in microsecond!).

It's quite probably this happens not only for the first session to logon, but also if the relations change, requiring a refresh of the relcaches, which are files. These are quite probably outside of the execution phases and therefore potentially hard to spot.

In fact, the relcache reads do also write the relcache init files at first access to a database. However, there is no event io.RELATION_MAP_WRITE. I wondered if there is some kind of mechanism to validate if the existing files are still valid, but a quick strace -fp shows that the relcache init files are written, which is not visible.

When I execute a first query, the backend must complete it's catalog, which also generates some io.DATA_FILE_READ events:

[14454]Query start:              :              : select now();
[14454] parse     : (       316) :           811: select now();
[14454] WAIT      : (          ) :           200: io.DATA_FILE_READ
[14454] WAIT      : (          ) :           105: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            95: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            42: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            57: io.DATA_FILE_READ
[14454] WAIT      : (          ) :           140: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            46: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            60: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            36: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            57: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            65: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            66: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            30: io.DATA_FILE_READ
[14454] WAIT      : (          ) :            89: io.DATA_FILE_READ
[14454] rewrite   : (       510) :          7532: select now();
[14454] plan      : (       502) :           612: select now();
[14454] execute   : (       453) :           134: select now();
[14454]Query done : (       322) :         11197: select now();

The most important thing to see and realise is that these are executed in the rewrite phase. and therefore make the rewrite phase take longer. This phase is mostly not reported, and not visible in pg_stat_statements.