Reduce timing overhead of EXPLAIN ANALYZE using rdtsc (PG19, v11) #39

Open
lfittl wants to merge 32 commits into master from fast-timing-v11

Conversation

lfittl (Owner) commented Mar 10, 2026

TODO

  • Review John's feedback re: pg_cpu (split it out?)
  • Add a define for the cycle-to-nanoseconds conversion (instead of checking for x86 directly)

michaelpq and others added 20 commits March 9, 2026 07:15
The test mentioned pg_stat_ext_exprs, but the correct catalog name is
pg_stats_ext_exprs.

Thinko in ba97bf9.

Discussion: https://postgr.es/m/CADkLM=eEhxJpSUP+eC=eMGZZsVOpnfKDvVkuCbsFg9CajYwDsA@mail.gmail.com
The tests of stats_import.sql include a set of queries to do
differential checks of the three statistics catalog relations, based on
the comparison of a source relation and a target relation, used for the
copy of the stats data with the restore functions:
- pg_statistic
- pg_stats_ext
- pg_stats_ext_exprs

This commit refactors the tests to reduce the bloat of such differential
queries, by creating a set of objects that make the differential queries
smaller:
- View for a base relation type.
- First function to retrieve stats data, that returns a type based on
the view previously created.
- Second function that checks the difference, based on two calls of the
first function.

This change leads to a nice reduction of stats_import.sql, with a larger
effect on the output file.

While at it, this adds some sanity checks for the three catalogs, to
warn developers that the stats import facilities may need to be updated
if any of the three catalogs change.  Such changes are rare in practice;
see 918eee0 as one example.  Another stylistic change is the use of the
extended output format for the differential queries, so that we avoid
long lines of output if a diff is caught.

Author: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CADkLM=eEhxJpSUP+eC=eMGZZsVOpnfKDvVkuCbsFg9CajYwDsA@mail.gmail.com
When make_new_segment() creates an odd-sized segment, the pagemap was
only sized based on a number of usable_pages entries, forgetting that a
segment also contains metadata pages, and that the FreePageManager uses
absolute page indices that cover the entire segment.  This
miscalculation could cause accesses to pagemap entries to be out of
bounds.  During subsequent reuse of the allocated segment, allocations
landing on pages with indices higher than usable_pages could cause
out-of-bounds pagemap reads and/or writes.  On write, 'span' pointers
are stored into the data area, corrupting the allocated objects.  On
read (aka during a dsa_free), garbage is interpreted as a span pointer,
typically crashing the server in dsa_get_address().

The normal geometric path correctly sizes the pagemap for all pages in
the segment.  The odd-sized path needs to do the same, but it works
forward from usable_pages rather than backward from total_size.

This commit fixes the sizing of the odd-sized case by adding pagemap
entries for the metadata pages after the initial metadata_bytes
calculation, using an integer ceiling division to compute the exact
number of additional entries needed in one go, avoiding any iteration in
the calculation.

An assertion is added in the code path for odd-sized segments, ensuring
that the pagemap includes the metadata area, and that the result is
appropriately sized.
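
The ceiling division mentioned above can be sketched as follows. This is an illustrative stand-alone version, not the actual dsa.c code; FPM_PAGE_SIZE matches the free-page-manager page size, but the function name and variable are assumptions.

```c
#include <stddef.h>

/* Compute how many pagemap entries are needed to cover the metadata area,
 * in one go and without iteration. Illustrative sketch only; the real
 * dsa.c code uses its own variable names. */
#define FPM_PAGE_SIZE 4096

static size_t
metadata_pagemap_entries(size_t metadata_bytes)
{
    /* (a + b - 1) / b computes ceil(a / b) without floating point */
    return (metadata_bytes + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE;
}
```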

This problem would show up depending on the size requested for the
allocation of a DSA segment.  The reporter has noticed this issue when a
parallel hash join makes a DSA allocation large enough to trigger the
odd-sized segment path, but it could happen for anything that does a DSA
allocation.

A regression test is added to test_dsa, down to v17 where the test
module has been introduced.  This adds a set of cheap tests to check the
problem, the new assertion being useful for this purpose.  Sami has
proposed a test that took a longer time than what I have done here; the
test committed is faster and good enough to check the odd-sized
allocation path.

Author: Paul Bunn <paul.bunn@icloud.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
Backpatch-through: 14
Commit 2cd40ad added the IF NOT EXISTS option to ALTER TABLE ADD COLUMN.
This also enabled IF NOT EXISTS for ALTER FOREIGN TABLE ADD COLUMN,
but the ALTER FOREIGN TABLE documentation was not updated to mention it.

This commit updates the documentation to describe the IF NOT EXISTS option for
ALTER FOREIGN TABLE ADD COLUMN.

While updating that section, this commit also clarifies that the COLUMN
keyword is optional in ALTER FOREIGN TABLE ADD/DROP COLUMN. Previously,
part of the documentation could be read as if COLUMN were required.

This commit adds regression tests covering these ALTER FOREIGN TABLE syntaxes.

Backpatch to all supported versions.

Suggested-by: Fujii Masao <masao.fujii@gmail.com>
Author: Chao Li <lic@highgo.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFk=rrhrwGwPtQxBesbT4DzSZ86Q3ftcwCu3AR5bOiXLw@mail.gmail.com
Backpatch-through: 14
Commit f014b1b inserted some new code in between existing code and
a trailing comment.  Move the comment back to near the code it belongs
to.
When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length.  In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)

Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.

To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.
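
The reason the VM limit block differs from the main fork's is the dense mapping between heap blocks and VM pages, which a helper along these lines can capture. This is a hedged sketch, not the actual visibilitymap function; the constant matches the default 8kB BLCKSZ, and the names are assumptions.

```c
#include <stdint.h>

/* Each VM page tracks many heap blocks (2 bits per heap block), so the
 * VM fork's limit block must be derived from the main fork's truncation
 * length rather than copied from it. Illustrative names; the constant
 * corresponds to (BLCKSZ - page header) * 8 / 2 for 8kB pages. */
#define HEAPBLOCKS_PER_VM_PAGE 32672

static uint32_t
vm_block_for_heap_block(uint32_t heap_blk)
{
    /* the VM block that contains the bits for the given heap block */
    return heap_blk / HEAPBLOCKS_PER_VM_PAGE;
}
```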

Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17
Previously, the comments stated that there was no purpose to considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query but yet we don't want all the result rows).

Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e222534
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.
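
A simplified sketch of the fuzzy dominance test described above: a path dominates another only if it is fuzzily cheaper on both startup and total cost. The 1% fuzz factor mirrors the planner's STD_FUZZ_FACTOR, but the function name and shape here are illustrative, not the actual compare_path_costs_fuzzily() code (which also handles ties, disabled_nodes, and pathkeys).

```c
#include <stdbool.h>

/* Fuzzy dominance under a 1% fuzz factor: illustrative only. */
#define STD_FUZZ_FACTOR 1.01

static bool
path_dominates(double startup1, double total1,
               double startup2, double total2)
{
    /* path 1 dominates path 2 only if fuzzily cheaper on both measures */
    return total1 * STD_FUZZ_FACTOR < total2 &&
           startup1 * STD_FUZZ_FACTOR < startup2;
}
```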

This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com
For a long time, PostgreSQL has had a get_relation_info_hook which
plugins can use to editorialize on the information that
get_relation_info obtains from the catalogs. However, this hook is
only called for baserels of type RTE_RELATION, and there is
potential utility in a similar call back for other types of
RTEs. This might have had utility even before commit
4020b37 added pgs_mask to
RelOptInfo, but it certainly has utility now.

So, move the callback up one level, deleting get_relation_info_hook and
adding build_simple_rel_hook instead. The new callback is called just
slightly later than before and with slightly different arguments, but it
should be fairly straightforward to adjust existing code that currently
uses get_relation_info_hook: the values previously available as
relationObjectId and inhparent are now available via rte->relid and
rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if
desired.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com
This commit makes use of the function added by commit b2898ba
for these applications' handling of conflicting options.  It
doesn't fix any bugs, but it does trim several lines of code.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CACJufxHDYn%2B3-2jR_kwYB0U7UrNP%2B0EPvAWzBBD5EfUzzr1uiw%40mail.gmail.com
What should be used is not "volatile foo *ptr" but "foo *volatile ptr".
The incorrect (former) style means that what the pointer variable points
to is volatile.  The correct (latter) style means that the pointer
variable itself needs to be treated as volatile.  The latter style is
required to ensure consistent treatment of these variables after a
longjmp with the TRY/CATCH blocks.

Some casts can be removed thanks to this change.

Issue introduced by 2e94721, so no backpatch is required.  A
similar set of issues has been fixed in 9300188 for contrib/xml2/.

Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_5BE8DAD985EE140ED62EA728C8D4E1311F0A@qq.com
Crash recovery started without a backup_label previously crashed with a
PANIC if the checkpoint record could not be found.  This commit lowers
the report generated to be a FATAL instead.

With recovery methods being more imaginative these days, this should
provide more flexibility when handling PostgreSQL recovery processing in
the event of a driver error, similarly to 15f68ce.  An extra
benefit of this change is that it becomes possible to add a test to
check that a FATAL is hit with an expected error message pattern.  With
the recovery code becoming more complicated over the last couple of
years, I suspect that this will be benefitial to cover in the long-term.

The original PANIC behavior has been introduced in the early days of
crash recovery, as of 4d14fe0 (PANIC did not exist yet, the code
used STOP).

Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/CAMm1aWZbQ-Acp_xAxC7mX9uZZMH8+NpfepY9w=AOxbBVT9E=uA@mail.gmail.com
Up until now, the only way for a loadable module to disable the use of a
particular index was to use build_simple_rel_hook (or, previous to
yesterday's commit, get_relation_info_hook) to remove it from the index
list. While that works, it has some disadvantages. First, the index
becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.

Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it may be better to fall back to a scan using a disabled
index rather than falling back to an also-disabled sequential scan.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com
A list of expressions with optional AS-labels is useful in a few
different places.  Right now, this is available as xml_attribute_list
because it was first used in the XMLATTRIBUTES construct, but it is
already used elsewhere, and there are other possible future uses.  To
reduce possible confusion going forward, rename it to
labeled_expr_list (like existing expr_list plus ColLabel).

Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
Commit dae761a added initialization of some BrinBuildState fields
in initialize_brin_buildstate(). Later, commit b437571 inadvertently
added the same initialization again.

This commit removes that redundant initialization. No behavioral
change is intended.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nmrca6-9SNChDvRYD6+r==fs9qg5J93kahS7vpoq8QVg@mail.gmail.com
There's no need for a StringInfo when all you want is a string
being constructed in a single pass.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Yang Yuanzhuo <1197620467@qq.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CAEudQAq2wyXZRdsh+wVHcOrungPU+_aQeQU12wbcgrmE0bQovA@mail.gmail.com
Previously heap_inplace_update_and_unlock() used an operation order similar to
MarkBufferDirty(), to reduce the number of different approaches used for
updating buffers.  However, in an upcoming patch, MarkBufferDirtyHint() will
switch to using the update protocol used by most other places (enabled by hint
bits only being set while holding a share-exclusive lock).

Luckily it's pretty easy to adjust heap_inplace_update_and_unlock(). As a
comment already foresaw, we can use the normal order, with the slight change
of updating the buffer contents after WAL logging.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
Align with the convention of using third-person singular (e.g.,
"Shows" instead of "Show") for GUC parameter descriptions.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com
REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.

This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newest XID actually
frozen on the page.

Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
hlinnaka and others added 4 commits March 11, 2026 00:06
Commit 17f51ea introduced a new pendingRecoveryConflicts field in
PGPROC to replace the various ProcSignals. The new field was cleared
in ProcArrayEndTransaction(), which makes sense for conflicts with
e.g. locks or buffer pins which are gone at end of transaction. But it
is not appropriate for conflicts on a database, or a logical slot.

Because of this, the 035_standby_logical_decoding.pl test was
occasionally getting stuck in the buildfarm. It happens if the startup
process signals recovery conflict with the logical slot just when the
walsender process using the slot calls ProcArrayEndTransaction().

To fix, don't clear pendingRecoveryConflicts in
ProcArrayEndTransaction(). We could still clear certain conflict
flags, like conflicts on locks, but we didn't try to do that before
commit 17f51ea either.

In passing, fix a misspelled comment, and make InitAuxiliaryProcess()
also clear pendingRecoveryConflicts. I don't think aux processes can
have recovery conflicts, but it seems best to initialize the field and
keep InitAuxiliaryProcess() as close to InitProcess() as possible.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/3e07149d-060b-48a0-8f94-3d5e4946ae45@gmail.com
c2a23dc removed use of PruneState.frz_conflict_horizon but
neglected to actually remove the member. Do that now.
This commit replaces the per-page buffer read loop in blgetbitmap() with
a read stream, to improve scan efficiency, which is particularly useful
for large bloom indexes.  Some benchmarking with a large number of rows
has shown a very nice improvement in runtime and IO read reduction with
test cases up to 10M rows for a bloom index scan.

For the io_uring method, the author has reported a 3x improvement in
runtime, while I measured close to 7x.  For the worker method with 3
workers, the author has reported better runtime numbers than I did, with
the reduction in IO stats being appealing in all the cases measured.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658d, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic for implementing AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).

It is also just plain odd to allow modification of buffers that are only
share-locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c97) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
@lfittl lfittl force-pushed the fast-timing-v11 branch 5 times, most recently from 911b6ef to a467956 on March 11, 2026 08:17
table_open() is a wrapper around relation_open() that checks that the
relkind is table-like and gives a user-facing error message if not.
It is best used in directly user-facing areas to check that the user
used the right kind of command for the relkind.  In internal uses
where the relkind was previously checked from the user's perspective,
table_open() is not necessary and might even be confusing if it were
to give out-of-context error messages.

In rewriteHandler.c, there were several such table_open() calls, which
this changes to relation_open().  This currently doesn't make a
difference, but there are plans to have other relkinds that could
appear in the rewriter but that shouldn't be accessible via
table-specific commands, and this clears the way for that.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
@lfittl lfittl force-pushed the fast-timing-v11 branch 2 times, most recently from acae748 to d983834 on March 11, 2026 08:38
lfittl added 7 commits March 11, 2026 02:06
Introduce two helpers for CPUID, pg_cpuid and pg_cpuid_subleaf, that
wrap the platform-specific __get_cpuid/__cpuid and
__get_cpuid_count/__cpuidex functions.

Additionally, introduce the CPUIDResult struct to make code working with
CPUID easier to read by referencing the register name (e.g. ECX) instead
of a numeric index.
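
A minimal sketch of what such a wrapper and struct could look like, assuming the GCC/clang <cpuid.h> path; the bool return and exact shape are assumptions, not the committed code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Register names instead of numeric indices, per the commit text. */
typedef struct CPUIDResult
{
    uint32_t eax;
    uint32_t ebx;
    uint32_t ecx;
    uint32_t edx;
} CPUIDResult;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Returns true if the leaf is supported. Sketch of the GCC/clang path. */
static bool
pg_cpuid(uint32_t leaf, CPUIDResult *result)
{
    return __get_cpuid(leaf, &result->eax, &result->ebx,
                       &result->ecx, &result->edx) != 0;
}
#else
/* Non-x86 fallback: CPUID is unavailable, zero the result */
static bool
pg_cpuid(uint32_t leaf, CPUIDResult *result)
{
    (void) leaf;
    memset(result, 0, sizeof(*result));
    return false;
}
#endif
```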

Author: Lukas Fittl <lukas@fittl.com>
Suggested-By: John Naylor <john.naylor@postgresql.org>
Reviewed-by:
Discussion:
Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform. But there
are cases where we want to be able to call __cpuidex as the only viable
option, specifically, when accessing a high leaf like VM Hypervisor
information (0x40000000), which __get_cpuid_count does not allow.

This will be used in a future commit to access Hypervisor information
about the TSC frequency of x86 CPUs, where available.

Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
adjust existing code to check for _MSC_VER when including intrin.h.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an
absolute instr_time value in order to do a diff, which goes against the spirit
of how the GET_* macros are supposed to be used, and will cause overhead in a
future change that assumes these macros are typically used on intervals only.

Additionally the program was doing unnecessary work in the test loop by
measuring the time elapsed, instead of checking the existing current time
measurement against a target end time. To support that, introduce a new
INSTR_TIME_SET_NANOSEC macro that allows initializing an instr_time
variable from a user-defined interval.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
…tforms

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
a new pg_ticks_to_ns() function that gets called on all platforms.

This function relies on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.
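
The conversion described above can be sketched with a fixed-point scale factor. This is a hedged sketch under the assumption of 2^32 fixed-point scaling; the committed code may choose a different representation, but the zero-means-nanoseconds convention follows the commit text.

```c
#include <stdint.h>

/* ns per tick, scaled by 2^32; zero means ticks already are nanoseconds */
static uint64_t ticks_per_ns_scaled;

/* Initialize the conversion ratio once at program start, e.g. from
 * QueryPerformanceFrequency() on Windows. Illustrative names. */
static void
pg_initialize_timing(uint64_t ticks_per_second)
{
    if (ticks_per_second == 0)
        ticks_per_ns_scaled = 0;
    else
        ticks_per_ns_scaled =
            (UINT64_C(1000000000) << 32) / ticks_per_second;
}

static uint64_t
pg_ticks_to_ns(uint64_t ticks)
{
    if (ticks_per_ns_scaled == 0)
        return ticks;           /* clock_gettime() path: already ns */
    return (uint64_t) (((__uint128_t) ticks * ticks_per_ns_scaled) >> 32);
}
```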

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
…asurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using the RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that instrumented runtime can be
reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
in case the system clocksource is reported as "tsc". The use of the system
APIs can be enforced by setting "system", or on x86-64 architectures the
use of TSC can be enforced by explicitly setting "tsc".

In order to use the TSC the frequency is first determined by use of CPUID,
and if not available, by running a short calibration loop at program start,
falling back to the system time if TSC values are not stable.

Note that we split TSC usage between the RDTSC CPU instruction, which does
not wait for out-of-order execution (faster, less precise), and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() /
InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To
separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is
introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.
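
The two read paths can be sketched with the compiler intrinsics; mapping the macros onto these intrinsics here is an assumption about the implementation, shown for the GCC/clang x86-64 case only.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>

/* Fast path: RDTSC does not wait for prior instructions to retire */
static inline uint64_t
instr_time_current_fast(void)
{
    return __rdtsc();           /* INSTR_TIME_SET_CURRENT_FAST()-style */
}

/* Precise path: RDTSCP waits for outstanding instructions to retire */
static inline uint64_t
instr_time_current(void)
{
    unsigned int aux;           /* receives IA32_TSC_AUX (CPU id) */

    return __rdtscp(&aux);      /* INSTR_TIME_SET_CURRENT()-style */
}
#endif
```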

Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
use of the cntvct_el0 instruction on ARM systems to access the generic
timer, which provides a synchronized tick value across CPUs.

Note this adds an exception for Apple Silicon CPUs, due to the observed
fact that M3 and newer chips have different timer frequencies for the
Efficiency and the Performance cores, and we can't be sure where we get
scheduled.

To simplify the implementation this does not support Windows on ARM,
since it's quite rare and hard to test.

Relies on the existing timing_clock_source GUC to control whether the
TSC-like timer gets used instead of the system timer.
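
Reading the generic timer can be sketched with inline assembly; the exact barrier usage (isb before the read) and helper names here are assumptions, shown for AArch64 GCC/clang only.

```c
#include <stdint.h>

#if defined(__aarch64__)
/* Read the virtual counter; isb prevents the read from being hoisted
 * ahead of earlier instructions. */
static inline uint64_t
read_cntvct(void)
{
    uint64_t ticks;

    __asm__ __volatile__("isb; mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
}

/* Read the counter frequency (ticks per second) for ticks-to-ns use */
static inline uint64_t
read_cntfrq(void)
{
    uint64_t freq;

    __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
}
#endif
```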

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: