Instrument usage stack v13 #58

Open
lfittl wants to merge 23 commits into master from instrument-usage-stack-v13

Conversation

Owner

@lfittl lfittl commented Apr 4, 2026

TODO

  • 0004: Add tests covering Parallel BHS and parallel IOS, and check coverage
  • 0004: Make separate bugfix commit

  • 0005: Revise commit message to explain overhead after the change
  • 0005: standard_ExplainOneQuery, pull forward some of the changes into 0002
  • 0005: Revise README
  • 0005: Review if we can do better with regard to query->totaltime handling (can the brittleness when an extension doesn't set it to INSTRUMENT_ALL be fixed?)
  • 0005: Respond to "why do we not just associate the NodeInstrumentation's with the QueryInstrumentation during the creation of the NodeInstrumentation?" in the context of execMain.c (AFAIK that's to avoid doing work when we don't actually call ExecutorRun)
  • 0005: Answer "Haven't looked into how this all works in sufficient detail, so I'm just asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE) showing per-worker "subtrees"?"
  • 0005: Trigger instrumentation, "Why do we not need to pass down the instrument_options anymore? I guess the assumption is that we always are going to use the flags from qinstr?"
  • 0005: Add a comment to ExecGetTriggerResultRel re trigger instrumentation
  • 0005: Review double counting comment re: pgWalUsage ("I'm not sure I understand why this doesn't also lead to double counting, given that InstrAccumParallelQuery() does also add the worker's usage to pgWalUsage?")
  • 0005: Explain why "Recurse into children first (bottom-up accumulation), passing our instrumentation as the parent context." is needed (this should be because instrumentation was never initialized -- but does that still hold true?)
  • 0005: Review InstrStopFinalize comment
  • 0005: Review/comment on "seems like it'd be neater if the need_timer thing weren't duplicated, but implemented by calling InstrStop()"
  • 0005: Review/comment on "Why are these Instr[Query]StopFinalize() rather than just Instr[Query]Finalize()?"
  • 0005: Review/comment on "Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?"

  • 0007: Check how much benefit happens by only moving ExecProcNodeInstr
  • 0007: Add pg_attribute_always_inline to ExecProcNodeInstr

  • 0008: Review if we can do "Index Buffers" instead of "Table Buffers"
  • 0008: Move InstrInitOptions calls to the respective files
  • 0008: Put together a nocfbot patch that sits on top of the index prefetching patch instead, to show how this evolves with prefetching in the picture

Changes beyond review feedback:

  • Simplified InstrAggNode, because it is never called with running=true

petergeoghegan and others added 9 commits April 4, 2026 13:49
Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
(xs_vmbuffer) within heapam_index_fetch_reset.  Retaining these pins
saves cycles during certain nested loop joins and merge joins that
frequently restore a saved mark: cases where the next tuple fetched
after a reset often falls on the same heap page will now avoid the cost
of repeated pinning and unpinning.

Avoiding dropping the scan's heap page buffer pin is preparation for an
upcoming patch that will add I/O prefetching to index scans.  Testing of
that patch (which makes heapam tend to pin more buffers concurrently
than was typical before now) shows that the aforementioned cases get a
small but clearly measurable benefit from this optimization.

Upcoming work to add a slot-based table AM interface for index scans
(which is further preparation for prefetching) will move VM checks for
index-only scans out of the executor and into heapam.  That will expand
the role of xs_vmbuffer to include VM lookups for index-only scans (the
field won't just be used for setting pages all-visible during on-access
pruning via the enhancement recently introduced by commit b46e1e5).
Avoiding dropping the xs_vmbuffer pin will preserve the historical
behavior of nodeIndexonlyscan.c, which always kept this pin on a rescan;
that aspect of this commit isn't really new.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
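The pin-retention idea above can be sketched in a few lines. This is only an illustration of the pattern, with invented names (`FetchState`, `fetch_reset`, `fetch_block`) and a counter standing in for the real pin/unpin cost; it is not heapam's actual code.

```c
#include <assert.h>

/* Illustrative sketch: keep the buffer pinned across a reset and reuse it
 * when the next fetch lands on the same page. All names are invented. */
typedef struct FetchState
{
    int pinned_block;   /* -1 when nothing is pinned */
    int pin_ops;        /* counts actual pin/unpin work, for illustration */
} FetchState;

static void
fetch_reset(FetchState *s)
{
    /* Previously the pin would be dropped here; now we deliberately keep
     * it, so a subsequent fetch of the same page does no pin work. */
    (void) s;
}

static void
fetch_block(FetchState *s, int block)
{
    if (s->pinned_block == block)
        return;             /* same page: reuse the retained pin */
    if (s->pinned_block != -1)
        s->pin_ops++;       /* unpin the old page */
    s->pinned_block = block;
    s->pin_ops++;           /* pin the new page */
}
```

A mark/restore-heavy merge join corresponds to repeated `fetch_reset` calls followed by fetches that often hit the same page, which is exactly the case that now avoids repeated pinning.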
Also rename it to index_create_copy.  Add a 'boolean concurrent' option,
and make it work for both cases: in concurrent mode, just create the
catalog entries; caller is responsible for the actual building later.
In non-concurrent mode, the index is built right away.

This allows it to be reused for other purposes -- specifically, for
concurrent REPACK.

(With the CONCURRENTLY option, REPACK cannot simply swap the heap file and
rebuild its indexes.  Instead, it needs to build a separate set of
indexes, including their system catalog entries, *before* the actual
swap, to reduce the time AccessExclusiveLock needs to be held for.  This
approach is different from what CREATE INDEX CONCURRENTLY does.)

Per a suggestion from Mihail Nikalayeu.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/41104.1754922120@localhost
Guard definition pg_pmull_available() on compile-time availability
of PMULL. Oversight in fbc57f2. In passing, remove "inline" hint
for consistency.

Reported-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/f153d5a4-a9be-4211-b0b2-7e99b56d68d5@vondra.me
io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO depth is a bit larger. That heuristic is important when doing
buffered IO from the kernel page cache, to allow parallelizing of the memory
copy, as otherwise io_method=io_uring would be a lot slower than
io_method=worker in that case.

An upcoming commit will make read_stream.c only increase the read-ahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async.  Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.

Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.

The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.

The facts from the two prior paragraphs show a way out: Use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.

One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case: with deep look-ahead
distances, being able to parallelize the memory copy is important even with
smaller IOs.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
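The combined heuristic described above (queue depth OR IO size) can be sketched as follows. The threshold names and values are invented for illustration; the real values live in PostgreSQL's io_uring method code and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thresholds, chosen only to make the sketch concrete. */
#define ASYNC_QUEUE_DEPTH_THRESHOLD 4
#define ASYNC_IO_SIZE_THRESHOLD (128 * 1024)

/*
 * Flag an IO for asynchronous (kernel-thread) processing when either the
 * submission queue is already deep, or the IO itself is large enough that
 * parallelizing the page-cache memory copy outweighs the dispatch overhead.
 */
static bool
io_should_be_async(int queue_depth, size_t io_size)
{
    if (queue_depth >= ASYNC_QUEUE_DEPTH_THRESHOLD)
        return true;
    if (io_size >= ASYNC_IO_SIZE_THRESHOLD)
        return true;
    return false;
}
```

A small random IO at low queue depth stays synchronous (avoiding dispatch latency), while large sequential IOs get the async flag even before the queue fills up.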
The long if statements were hard to read and hard to document. Splitting them
into inline helpers makes it much easier to explain each part separately.

This is done in preparation for making the logic more complicated...

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
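The refactoring pattern described above can be sketched like this. The struct fields, helper names, and conditions are invented to show the shape of the change, not read_stream.c's actual logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state; read_stream.c's real struct has many more fields. */
typedef struct ReadStreamState
{
    int  pending_ios;
    int  max_ios;
    int  distance;
    int  pinned_buffers;
    bool end_of_stream;
} ReadStreamState;

/* Each inline helper documents one reason to stop looking ahead, which is
 * easier to comment on individually than one long combined condition. */
static inline bool
have_io_capacity(const ReadStreamState *s)
{
    return s->pending_ios < s->max_ios;
}

static inline bool
within_distance(const ReadStreamState *s)
{
    return s->pinned_buffers < s->distance;
}

/* The former single long if condition, now readable one clause at a time. */
static bool
should_look_ahead(const ReadStreamState *s)
{
    return !s->end_of_stream &&
           have_io_capacity(s) &&
           within_distance(s);
}
```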
In a subsequent commit the read-ahead distance will only be increased when
waiting for IO. Without further work that would cause a regression: As IO
combining and read-ahead are currently controlled by the same mechanism, we
would end up not allowing IO combining when never needing to wait for IO (as
the distance ends up too small to allow for full sized IOs), which can
increase CPU overhead. A typical reason to not have to wait for IO completion
at a low look-ahead distance is use of io_uring with the to-be-read data in
the page cache. But even with worker the IO submission rate may be low enough
for the worker to keep up.

One might think that we could just always perform IO combining, but doing so
at the start of a scan can cause performance regressions:

1) Performing a large IO commonly has a higher latency than smaller IOs. That
   is not a problem once reading ahead far enough, but at the start of a stream
   it can lead to longer waits for IO completion.

2) Sometimes read streams will not be read to completion. Immediately starting
   with full sized IOs leads to more wasted effort. This is not commonly an
   issue with existing read stream users, but the upcoming use of read streams
   to fetch table pages as part of an index scan frequently encounters this.

Solve this issue by splitting ReadStream->distance into ->combine_distance and
->readahead_distance. Right now they are increased/decreased at the same time,
but that will change in the next commit.

One of the comments in read_stream_should_look_ahead() refers to a motivation
that only really exists as of the next commit, but without it the code doesn't
make sense on its own.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
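The two-distance split can be sketched as below. Field and function names are illustrative, not read_stream.c's actual code; at this stage of the series both distances still move in lockstep, as the commit message notes.

```c
#include <assert.h>

/* Hypothetical sketch of splitting ReadStream->distance in two. */
typedef struct ReadStream
{
    int combine_distance;   /* how far ahead IOs may be combined */
    int readahead_distance; /* how far ahead IOs may be issued at all */
    int max_distance;
} ReadStream;

static int
clamp_distance(int v, int max)
{
    return v > max ? max : v;
}

/*
 * After having had to wait for an IO, grow both distances (capped). The
 * next commit in the series lets them be adjusted independently, so that
 * combining can stay enabled even when read-ahead stays shallow.
 */
static void
stream_distance_on_io_wait(ReadStream *s)
{
    s->readahead_distance = clamp_distance(s->readahead_distance * 2,
                                           s->max_distance);
    s->combine_distance = clamp_distance(s->combine_distance * 2,
                                         s->max_distance);
}
```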
This avoids increasing the distance to the maximum in cases where the I/O
subsystem is already keeping up. This turns out to be important for
performance for two reasons:

- Pinning a lot of buffers is not cheap. If additional pins allow us to avoid
  IO waits, it's definitely worth it, but if we can already do all the
  necessary readahead at a distance of 16, reading ahead 512 buffers can
  increase the CPU overhead substantially.  This is particularly noticeable
  when the to-be-read blocks are already in the kernel page cache.

- If the read stream is read to completion, reading in data earlier than
  needed is of limited consequence, leaving aside the CPU costs mentioned
  above. But if the read stream will not be fully consumed, e.g. because it is
  on the inner side of a nested loop join, the additional IO can be a serious
  performance issue. This is not that commonly a problem for current read
  stream users, but the upcoming work, to use a read stream to fetch table
  pages as part of an index scan, frequently encounters this.

Note that this commit would have substantial performance downsides without
earlier commits:

- Commit 6e36930, which avoids decreasing the readahead distance when
  there was recent IO, is crucial, as otherwise we very often would end up not
  reading ahead aggressively enough anymore with this commit, due to
  increasing the distance less often.

- "read stream: Split decision about look ahead for AIO and combining" is
  important as we would otherwise not perform IO combining when the IO
  subsystem can keep up.

- "aio: io_uring: Trigger async processing for large IOs" is important to
  continue to benefit from memory copy parallelism when using fewer IOs.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Merge pgaio_worker_submit_internal() and pgaio_worker_submit().  The
separation didn't serve any purpose.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
@lfittl lfittl force-pushed the instrument-usage-stack-v13 branch from f088bf8 to 5294160 on April 5, 2026 at 06:59
READ ONLY transactions should prevent modifications to foreign data as
well as local data, but postgres_fdw transactions declared as READ ONLY
that reference foreign tables mapped to a remote view executing volatile
functions would modify data on remote servers, as it would open remote
transactions in READ WRITE mode.

Similarly, DEFERRABLE transactions should not abort due to a
serialization failure even when accessing foreign data, but postgres_fdw
transactions declared as DEFERRABLE would abort due to that failure in a
remote server, as it would open remote transactions in NOT DEFERRABLE
mode.

To fix, modify postgres_fdw to open remote transactions in the same
access/deferrable modes as the local transaction.  This commit also
modifies it to open remote subtransactions in the same access mode as
the local subtransaction.

This commit changes the behavior of READ ONLY/DEFERRABLE transactions
using postgres_fdw; in particular, it doesn't allow the READ ONLY
transactions to modify data on remote servers anymore, so such
transactions should be redeclared as READ WRITE or rewritten using other
tools like dblink.  The release notes should note this as an
incompatibility.

These issues exist since the introduction of postgres_fdw, but to avoid
the incompatibility in the back branches, fix them in master only.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
Discussion: https://postgr.es/m/E1uLe9X-000zsY-2g%40gemulon.postgresql.org
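The core of the fix above is propagating the local transaction's modes into the remote START TRANSACTION command. A minimal sketch, assuming a hypothetical helper and a hardcoded SERIALIZABLE isolation level (postgres_fdw's real code chooses the isolation level and builds the command differently):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: build the remote transaction-start command using the
 * same access/deferrable modes as the local transaction, so that a local
 * READ ONLY or DEFERRABLE transaction behaves the same on the remote side.
 */
static void
build_remote_begin(char *buf, size_t len, bool read_only, bool deferrable)
{
    snprintf(buf, len,
             "START TRANSACTION ISOLATION LEVEL SERIALIZABLE%s%s",
             read_only ? " READ ONLY" : " READ WRITE",
             deferrable ? " DEFERRABLE" : "");
}
```

Previously the remote side was effectively always opened READ WRITE and NOT DEFERRABLE, which is what allowed a local READ ONLY transaction to modify remote data through a volatile function.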
@lfittl lfittl force-pushed the instrument-usage-stack-v13 branch 3 times, most recently from 17b7338 to ac5e506 on April 5, 2026 at 11:23
alvherre and others added 13 commits April 5, 2026 13:34
A future REPACK patch wants a way to suppress index_build doing its
progress reports when building an index, because that would interfere
with repack's own reporting; so add an INDEX_CREATE_SUPPRESS_PROGRESS
bit that enables this.

Furthermore, change the index_create_copy() API so that it takes flag
bits for index_create() and passes them unchanged.  This gives its
callers more direct control, which eases the interface -- now its
callers can pass the INDEX_CREATE_SUPPRESS_PROGRESS bit directly.  We
use it for the current caller in REINDEX CONCURRENTLY, which is likewise
not interested in progress reporting because it doesn't want
index_build() to be called at all in the first place.

One thing to keep in mind, pointed out by Mihail, is that we're not
suppressing the index-AM-specific progress report updates which happen
during ambuild().  At present this is not a problem, because the values
updated by those don't overlap with those used by commands other than
CREATE INDEX; but maybe in the future we'll want the ability to suppress
them also.  (Alternatively we might want to display how each
index-build-subcommand progresses during REPACK and others.)

Author: Antonin Houska <ah@cybertec.at>
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Discussion: https://postgr.es/m/102906.1773668762@localhost
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.

Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.

In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct. Additionally, clarify that
InstrAggNode is expected to only run after InstrEndLoop (as it does in
practice), and drop unused code.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
This replaces different repeated code blocks that read pgBufferUsage /
pgWalUsage, and may have also been running a timer to measure activity,
with the new Instrumentation struct and associated helpers.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
…ith INSTR_* macros

This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
Fix the missing accumulation of "Heap Blocks" from parallel query workers
to the leader, which caused EXPLAIN (ANALYZE) to show only the leader's
statistics, significantly undercounting the true value.

Additionally, add a regression test covering EXPLAIN (ANALYZE) of a
Parallel Bitmap Heap Scan, which previously was not tested at all.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
The functions dealing with copying back parallel worker instrumentation
such as ExecIndexOnlyScanRetrieveInstrumentation were not exercised
at all in the regression tests, leading to a gap in coverage. Add a
query that verifies we correctly copy back "Index Searches" for
EXPLAIN ANALYZE of a Parallel Index Only Scan.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Discussion:
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.

Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.

Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrEndNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.

To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct with its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.

In tests, the stack-based instrumentation mechanism reduces the overhead
of EXPLAIN (ANALYZE, BUFFERS ON, TIMING OFF) for a large COUNT(*) query
from about 50% to 22% on top of the actual runtime.

This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.

The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
This simplifies the DSM allocations a bit, since we no longer need to
allocate WAL and buffer usage separately, and it eases the future addition
of a third stack-based struct that is under discussion.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.

Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals being checked for each tuple being emitted add up, and cause
non-optimal set of instructions to be generated by the compiler.

Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
from ~ 20% to ~ 10% on top of actual runtime.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
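The specialization technique can be sketched as follows: pick a variant function once at initialization, so the per-tuple path contains no option checks. All names here are invented for illustration; the real code specializes ExecProcNodeInstr over PostgreSQL's INSTRUMENT_* flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative node with a swappable per-tuple entry point. */
typedef struct Node
{
    int (*exec)(struct Node *);
    long tuples;
    long timed_calls;
    long buffer_calls;
} Node;

static int
node_exec_real(Node *n)
{
    n->tuples++;
    return 1;
}

/* One specialized wrapper per option combination: no branches per tuple. */
static int
exec_instr_timing(Node *n)
{
    n->timed_calls++;           /* stands in for timer start/stop work */
    return node_exec_real(n);
}

static int
exec_instr_buffers(Node *n)
{
    n->buffer_calls++;          /* stands in for buffer-usage stack work */
    return node_exec_real(n);
}

/* Chosen once, when instrumentation is enabled for the node. */
static void
node_set_instr(Node *n, bool timing, bool buffers)
{
    if (timing)
        n->exec = exec_instr_timing;
    else if (buffers)
        n->exec = exec_instr_buffers;
    else
        n->exec = node_exec_real;
}
```

Because each wrapper is a straight-line function, the compiler can generate tight code for the common case instead of re-testing the option flags on every call.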
This sets up a separate instrumentation stack that is used while an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.

EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.

Author: Lukas Fittl <lukas@fittl.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack that is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner, and especially during abort situations this helps ensure we don't
lose activity due to incorrect handling of unfinalized node stacks.
@lfittl lfittl force-pushed the instrument-usage-stack-v13 branch from ac5e506 to ff71ea6 on April 5, 2026 at 12:18
