Instrument usage stack v13 #58

Open
lfittl wants to merge 23 commits into master from instrument-usage-stack-v13

Conversation

Owner

@lfittl lfittl commented Apr 4, 2026

TODO

  • 0004: Add tests covering Parallel BHS and parallel IOS, and check coverage
  • 0004: Make separate bugfix commit

  • 0005: Revise commit message to explain overhead after the change
  • 0005: standard_ExplainOneQuery, pull forward some of the changes into 0002
  • 0005: Revise README
  • 0005: Review if we can do better with regard to query->totaltime handling (can the brittleness when an extension doesn't set it to INSTRUMENT_ALL be fixed?)
  • 0005: Respond to "why do we not just associate the NodeInstrumentation's with the QueryInstrumentation during the creation of the NodeInstrumentation?" in the context of execMain.c (AFAIK that's to avoid doing work when we don't actually call ExecutorRun)
  • 0005: Answer "Haven't looked into how this all works in sufficient detail, so I'm just asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE) showing per-worker "subtrees"?"
  • 0005: Trigger instrumentation, "Why do we not need to pass down the instrument_options anymore? I guess the assumption is that we always are going to use the flags from qinstr?"
  • 0005: Add a comment to ExecGetTriggerResultRel re trigger instrumentation
  • 0005: Review double counting comment re: pgWalUsage ("I'm not sure I understand why this doesn't also lead to double counting, given that InstrAccumParallelQuery() does also add the worker's usage to pgWalUsage?")
  • 0005: Explain why "Recurse into children first (bottom-up accumulation), passing our instrumentation as the parent context." is needed (this should be because instrumentation was never initialized -- but does that still hold true?)
  • 0005: Review InstrStopFinalize comment
  • 0005: Review/comment on "seems like it'd be neater if the need_timer thing weren't duplicated, but implemented by calling InstrStop()"
  • 0005: Review/comment on "Why are these Instr[Query]StopFinalize() rather than just Instr[Query]Finalize()?"
  • 0005: Review/comment on "Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?"

  • 0007: Check how much benefit happens by only moving ExecProcNodeInstr
  • 0007: Add pg_attribute_always_inline to ExecProcNodeInstr

  • 0008: Review if we can do "Index Buffers" instead of "Table Buffers"
  • 0008: Move InstrInitOptions calls to the respective files
  • 0008: Put together a nocfbot patch that sits on top of the index prefetching patch instead, to show how this evolves with prefetching in the picture

Changes beyond review feedback:

  • Simplified InstrAggNode, because it is never called with running=true

petergeoghegan and others added 9 commits April 4, 2026 13:49
Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
(xs_vmbuffer) within heapam_index_fetch_reset.  Retaining these pins
saves cycles during certain nested loop joins and merge joins that
frequently restore a saved mark: cases where the next tuple fetched
after a reset often falls on the same heap page will now avoid the cost
of repeated pinning and unpinning.

Avoiding dropping the scan's heap page buffer pin is preparation for an
upcoming patch that will add I/O prefetching to index scans.  Testing of
that patch (which makes heapam tend to pin more buffers concurrently
than was typical before now) shows that the aforementioned cases get a
small but clearly measurable benefit from this optimization.

Upcoming work to add a slot-based table AM interface for index scans
(which is further preparation for prefetching) will move VM checks for
index-only scans out of the executor and into heapam.  That will expand
the role of xs_vmbuffer to include VM lookups for index-only scans (the
field won't just be used for setting pages all-visible during on-access
pruning via the enhancement recently introduced by commit b46e1e5).
Avoiding dropping the xs_vmbuffer pin will preserve the historical
behavior of nodeIndexonlyscan.c, which always kept this pin on a rescan;
that aspect of this commit isn't really new.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
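The pin-retention idea above can be sketched in a few lines. This is only an illustration of the pattern, with invented names (`FetchState`, `fetch_reset`, `fetch_block`) and a counter standing in for the real pin/unpin cost; it is not heapam's actual code.

```c
#include <assert.h>

/* Illustrative sketch: keep the buffer pinned across a reset and reuse it
 * when the next fetch lands on the same page. All names are invented. */
typedef struct FetchState
{
    int pinned_block;   /* -1 when nothing is pinned */
    int pin_ops;        /* counts actual pin/unpin work, for illustration */
} FetchState;

static void
fetch_reset(FetchState *s)
{
    /* Previously the pin would be dropped here; now we deliberately keep
     * it, so a subsequent fetch of the same page does no pin work. */
    (void) s;
}

static void
fetch_block(FetchState *s, int block)
{
    if (s->pinned_block == block)
        return;             /* same page: reuse the retained pin */
    if (s->pinned_block != -1)
        s->pin_ops++;       /* unpin the old page */
    s->pinned_block = block;
    s->pin_ops++;           /* pin the new page */
}
```

A mark/restore-heavy merge join corresponds to repeated `fetch_reset` calls followed by fetches that often hit the same page, which is exactly the case that now avoids repeated pinning.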
Also rename it to index_create_copy.  Add a 'boolean concurrent' option,
and make it work for both cases: in concurrent mode, just create the
catalog entries; caller is responsible for the actual building later.
In non-concurrent mode, the index is built right away.

This allows it to be reused for other purposes -- specifically, for
concurrent REPACK.

(With the CONCURRENTLY option, REPACK cannot simply swap the heap file and
rebuild its indexes.  Instead, it needs to build a separate set of
indexes, including their system catalog entries, *before* the actual
swap, to reduce the time AccessExclusiveLock needs to be held for.  This
approach is different from what CREATE INDEX CONCURRENTLY does.)

Per a suggestion from Mihail Nikalayeu.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/41104.1754922120@localhost
Guard definition pg_pmull_available() on compile-time availability
of PMULL. Oversight in fbc57f2. In passing, remove "inline" hint
for consistency.

Reported-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/f153d5a4-a9be-4211-b0b2-7e99b56d68d5@vondra.me
io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO depth is a bit larger. That heuristic is important when doing
buffered IO from the kernel page cache, to allow parallelizing of the memory
copy, as otherwise io_method=io_uring would be a lot slower than
io_method=worker in that case.

An upcoming commit will make read_stream.c only increase the read-ahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async.  Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.

Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.

The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.

The facts from the two prior paragraphs show a way out: Use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.

One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case: with deep look-ahead
distances, being able to parallelize the memory copy is important even with
smaller IOs.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
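The combined heuristic described above (queue depth OR IO size) can be sketched as follows. The threshold names and values are invented for illustration; the real values live in PostgreSQL's io_uring method code and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thresholds, chosen only to make the sketch concrete. */
#define ASYNC_QUEUE_DEPTH_THRESHOLD 4
#define ASYNC_IO_SIZE_THRESHOLD (128 * 1024)

/*
 * Flag an IO for asynchronous (kernel-thread) processing when either the
 * submission queue is already deep, or the IO itself is large enough that
 * parallelizing the page-cache memory copy outweighs the dispatch overhead.
 */
static bool
io_should_be_async(int queue_depth, size_t io_size)
{
    if (queue_depth >= ASYNC_QUEUE_DEPTH_THRESHOLD)
        return true;
    if (io_size >= ASYNC_IO_SIZE_THRESHOLD)
        return true;
    return false;
}
```

A small random IO at low queue depth stays synchronous (avoiding dispatch latency), while large sequential IOs get the async flag even before the queue fills up.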
The long if statements were hard to read and hard to document. Splitting them
into inline helpers makes it much easier to explain each part separately.

This is done in preparation for making the logic more complicated...

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
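The refactoring pattern described above can be sketched like this. The struct fields, helper names, and conditions are invented to show the shape of the change, not read_stream.c's actual logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state; read_stream.c's real struct has many more fields. */
typedef struct ReadStreamState
{
    int  pending_ios;
    int  max_ios;
    int  distance;
    int  pinned_buffers;
    bool end_of_stream;
} ReadStreamState;

/* Each inline helper documents one reason to stop looking ahead, which is
 * easier to comment on individually than one long combined condition. */
static inline bool
have_io_capacity(const ReadStreamState *s)
{
    return s->pending_ios < s->max_ios;
}

static inline bool
within_distance(const ReadStreamState *s)
{
    return s->pinned_buffers < s->distance;
}

/* The former single long if condition, now readable one clause at a time. */
static bool
should_look_ahead(const ReadStreamState *s)
{
    return !s->end_of_stream &&
           have_io_capacity(s) &&
           within_distance(s);
}
```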
In a subsequent commit the read-ahead distance will only be increased when
waiting for IO. Without further work that would cause a regression: As IO
combining and read-ahead are currently controlled by the same mechanism, we
would end up not allowing IO combining when never needing to wait for IO (as
the distance ends up too small to allow for full sized IOs), which can
increase CPU overhead. A typical reason to not have to wait for IO completion
at a low look-ahead distance is use of io_uring with the to-be-read data in
the page cache. But even with worker the IO submission rate may be low enough
for the worker to keep up.

One might think that we could just always perform IO combining, but doing so
at the start of a scan can cause performance regressions:

1) Performing a large IO commonly has a higher latency than smaller IOs. That
   is not a problem once reading ahead far enough, but at the start of a stream
   it can lead to longer waits for IO completion.

2) Sometimes read streams will not be read to completion. Immediately starting
   with full sized IOs leads to more wasted effort. This is not commonly an
   issue with existing read stream users, but the upcoming use of read streams
   to fetch table pages as part of an index scan frequently encounters this.

Solve this issue by splitting ReadStream->distance into ->combine_distance and
->readahead_distance. Right now they are increased/decreased at the same time,
but that will change in the next commit.

One of the comments in read_stream_should_look_ahead() refers to a motivation
that only really exists as of the next commit, but without it the code doesn't
make sense on its own.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
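The two-distance split can be sketched as below. Field and function names are illustrative, not read_stream.c's actual code; at this stage of the series both distances still move in lockstep, as the commit message notes.

```c
#include <assert.h>

/* Hypothetical sketch of splitting ReadStream->distance in two. */
typedef struct ReadStream
{
    int combine_distance;   /* how far ahead IOs may be combined */
    int readahead_distance; /* how far ahead IOs may be issued at all */
    int max_distance;
} ReadStream;

static int
clamp_distance(int v, int max)
{
    return v > max ? max : v;
}

/*
 * After having had to wait for an IO, grow both distances (capped). The
 * next commit in the series lets them be adjusted independently, so that
 * combining can stay enabled even when read-ahead stays shallow.
 */
static void
stream_distance_on_io_wait(ReadStream *s)
{
    s->readahead_distance = clamp_distance(s->readahead_distance * 2,
                                           s->max_distance);
    s->combine_distance = clamp_distance(s->combine_distance * 2,
                                         s->max_distance);
}
```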
This avoids increasing the distance to the maximum in cases where the I/O
subsystem is already keeping up. This turns out to be important for
performance for two reasons:

- Pinning a lot of buffers is not cheap. If additional pins allow us to avoid
  IO waits, it's definitely worth it, but if we can already do all the
  necessary readahead at a distance of 16, reading ahead 512 buffers can
  increase the CPU overhead substantially.  This is particularly noticeable
  when the to-be-read blocks are already in the kernel page cache.

- If the read stream is read to completion, reading in data earlier than
  needed is of limited consequence, leaving aside the CPU costs mentioned
  above. But if the read stream will not be fully consumed, e.g. because it is
  on the inner side of a nested loop join, the additional IO can be a serious
  performance issue. This is not that commonly a problem for current read
  stream users, but the upcoming work, to use a read stream to fetch table
  pages as part of an index scan, frequently encounters this.

Note that this commit would have substantial performance downsides without
earlier commits:

- Commit 6e36930, which avoids decreasing the readahead distance when
  there was recent IO, is crucial, as otherwise we very often would end up not
  reading ahead aggressively enough anymore with this commit, due to
  increasing the distance less often.

- "read stream: Split decision about look ahead for AIO and combining" is
  important as we would otherwise not perform IO combining when the IO
  subsystem can keep up.

- "aio: io_uring: Trigger async processing for large IOs" is important to
  continue to benefit from memory copy parallelism when using fewer IOs.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Merge pgaio_worker_submit_internal() and pgaio_worker_submit().  The
separation didn't serve any purpose.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
@lfittl lfittl force-pushed the instrument-usage-stack-v13 branch from f088bf8 to 5294160 on April 5, 2026 at 06:59
READ ONLY transactions should prevent modifications to foreign data as
well as local data, but postgres_fdw transactions declared as READ ONLY
that reference foreign tables mapped to a remote view executing volatile
functions would modify data on remote servers, as it would open remote
transactions in READ WRITE mode.

Similarly, DEFERRABLE transactions should not abort due to a
serialization failure even when accessing foreign data, but postgres_fdw
transactions declared as DEFERRABLE would abort due to that failure in a
remote server, as it would open remote transactions in NOT DEFERRABLE
mode.

To fix, modify postgres_fdw to open remote transactions in the same
access/deferrable modes as the local transaction.  This commit also
modifies it to open remote subtransactions in the same access mode as
the local subtransaction.

This commit changes the behavior of READ ONLY/DEFERRABLE transactions
using postgres_fdw; in particular, it doesn't allow the READ ONLY
transactions to modify data on remote servers anymore, so such
transactions should be redeclared as READ WRITE or rewritten using other
tools like dblink.  The release notes should note this as an
incompatibility.

These issues exist since the introduction of postgres_fdw, but to avoid
the incompatibility in the back branches, fix them in master only.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
Discussion: https://postgr.es/m/E1uLe9X-000zsY-2g%40gemulon.postgresql.org
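The core of the fix above is propagating the local transaction's modes into the remote START TRANSACTION command. A minimal sketch, assuming a hypothetical helper and a hardcoded SERIALIZABLE isolation level (postgres_fdw's real code chooses the isolation level and builds the command differently):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: build the remote transaction-start command using the
 * same access/deferrable modes as the local transaction, so that a local
 * READ ONLY or DEFERRABLE transaction behaves the same on the remote side.
 */
static void
build_remote_begin(char *buf, size_t len, bool read_only, bool deferrable)
{
    snprintf(buf, len,
             "START TRANSACTION ISOLATION LEVEL SERIALIZABLE%s%s",
             read_only ? " READ ONLY" : " READ WRITE",
             deferrable ? " DEFERRABLE" : "");
}
```

Previously the remote side was effectively always opened READ WRITE and NOT DEFERRABLE, which is what allowed a local READ ONLY transaction to modify remote data through a volatile function.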
@lfittl lfittl force-pushed the instrument-usage-stack-v13 branch 3 times, most recently from 17b7338 to ac5e506 on April 5, 2026 at 11:23
alvherre and others added 13 commits April 5, 2026 13:34
A future REPACK patch wants a way to suppress index_build doing its
progress reports when building an index, because that would interfere
with repack's own reporting; so add an INDEX_CREATE_SUPPRESS_PROGRESS
bit that enables this.

Furthermore, change the index_create_copy() API so that it takes flag
bits for index_create() and passes them unchanged.  This gives its
callers more direct control, which eases the interface -- now its
callers can pass the INDEX_CREATE_SUPPRESS_PROGRESS bit directly.  We
use it for the current caller in REINDEX CONCURRENTLY, which is likewise
not interested in progress reporting because it doesn't want
index_build() to be called at all in the first place.

One thing to keep in mind, pointed out by Mihail, is that we're not
suppressing the index-AM-specific progress report updates which happen
during ambuild().  At present this is not a problem, because the values
updated by those don't overlap with those used by commands other than
CREATE INDEX; but maybe in the future we'll want the ability to suppress
them also.  (Alternatively we might want to display how each
index-build-subcommand progresses during REPACK and others.)

Author: Antonin Houska <ah@cybertec.at>
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Discussion: https://postgr.es/m/102906.1773668762@localhost
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.

Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.

In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct. Additionally, clarify that
InstrAggNode is expected to only run after InstrEndLoop (as it does in
practice), and drop unused code.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
This replaces different repeated code blocks that read pgBufferUsage /
pgWalUsage, and may have also been running a timer to measure activity,
with the new Instrumentation struct and associated helpers.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
…ith INSTR_* macros

This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
Fix the missing accumulation of "Heap Blocks" from parallel query workers
to the leader, which caused EXPLAIN (ANALYZE) to show only the leader's
statistics, significantly undercounting the true value.

Additionally, add a regression test covering EXPLAIN (ANALYZE) of a
Parallel Bitmap Heap Scan, which previously was not tested at all.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
The functions dealing with copying back parallel worker instrumentation
such as ExecIndexOnlyScanRetrieveInstrumentation were not exercised
at all in the regression tests, leading to a gap in coverage. Add a
query that verifies we correctly copy back "Index Searches" for
EXPLAIN ANALYZE of a Parallel Index Only Scan.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Discussion:
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.

Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.

Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrEndNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.

To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct with its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.

In tests, the stack-based instrumentation mechanism reduces the overhead
of EXPLAIN (ANALYZE, BUFFERS ON, TIMING OFF) for a large COUNT(*) query
from about 50% to 22% on top of the actual runtime.

This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.

The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
This simplifies the DSM allocations a bit, since we no longer need to
allocate WAL and buffer usage separately, and it eases the future addition
of a third stack-based struct that is under discussion.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.

Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals being checked for each tuple being emitted add up, and cause
non-optimal set of instructions to be generated by the compiler.

Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
from ~ 20% to ~ 10% on top of actual runtime.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
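The specialization technique can be sketched as follows: pick a variant function once at initialization, so the per-tuple path contains no option checks. All names here are invented for illustration; the real code specializes ExecProcNodeInstr over PostgreSQL's INSTRUMENT_* flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative node with a swappable per-tuple entry point. */
typedef struct Node
{
    int (*exec)(struct Node *);
    long tuples;
    long timed_calls;
    long buffer_calls;
} Node;

static int
node_exec_real(Node *n)
{
    n->tuples++;
    return 1;
}

/* One specialized wrapper per option combination: no branches per tuple. */
static int
exec_instr_timing(Node *n)
{
    n->timed_calls++;           /* stands in for timer start/stop work */
    return node_exec_real(n);
}

static int
exec_instr_buffers(Node *n)
{
    n->buffer_calls++;          /* stands in for buffer-usage stack work */
    return node_exec_real(n);
}

/* Chosen once, when instrumentation is enabled for the node. */
static void
node_set_instr(Node *n, bool timing, bool buffers)
{
    if (timing)
        n->exec = exec_instr_timing;
    else if (buffers)
        n->exec = exec_instr_buffers;
    else
        n->exec = node_exec_real;
}
```

Because each wrapper is a straight-line function, the compiler can generate tight code for the common case instead of re-testing the option flags on every call.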
This sets up a separate instrumentation stack that is used while an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.

EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.

Author: Lukas Fittl <lukas@fittl.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack that is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner, and especially during abort situations this helps ensure we don't
lose activity due to incorrect handling of unfinalized node stacks.
@lfittl lfittl force-pushed the instrument-usage-stack-v13 branch from ac5e506 to ff71ea6 on April 5, 2026 at 12:18
