Reduce timing overhead of EXPLAIN ANALYZE using rdtsc (PG19, v12) by lfittl · Pull Request #47 · lfittl/postgres

lfittl · 2026-03-21T22:23:10Z

No description provided.

Introduce two helpers for CPUID, pg_cpuid and pg_cpuid_subleaf, that wrap the platform-specific __get_cpuid/__cpuid and __get_cpuid_count/__cpuidex functions. Additionally, introduce the CPUIDResult struct to make code working with CPUID easier to read by referencing the register name (e.g. ECX) instead of a numeric index.

Author: Lukas Fittl <lukas@fittl.com>
Suggested-By: John Naylor <john.naylor@postgresql.org>
Reviewed-by:
Discussion:
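To illustrate the idea of naming registers instead of indexing into an array, here is a minimal sketch for GCC/clang on x86. The struct layout and the `sketch_` names are assumptions made for illustration, not the definitions from the patch, and MSVC's `__cpuid` path is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/* Hypothetical struct: one named field per CPUID output register. */
typedef struct SketchCPUIDResult
{
    unsigned int eax;
    unsigned int ebx;
    unsigned int ecx;
    unsigned int edx;
} SketchCPUIDResult;

/*
 * Query a CPUID leaf. Returns false when the leaf is not supported, or when
 * this is not an x86 build with GCC/clang; the result is zeroed in that case.
 */
static bool
sketch_cpuid(unsigned int leaf, SketchCPUIDResult *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
#ifdef SKETCH_HAVE_CPUID
    return __get_cpuid(leaf, &r->eax, &r->ebx, &r->ecx, &r->edx) == 1;
#else
    (void) leaf;
    return false;
#endif
}

/* Example caller: RDTSCP support is reported in CPUID.80000001H:EDX bit 27. */
static bool
sketch_cpu_has_rdtscp(void)
{
    SketchCPUIDResult r;

    if (!sketch_cpuid(0x80000001, &r))
        return false;
    return (r.edx & (1u << 27)) != 0;
}
```

A caller reads `r.edx` rather than something like `regs[3]`, which is the readability gain the commit message describes.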

Previously we would only check for the availability of __cpuidex if the related __get_cpuid_count was not available on a platform. But there are cases where __cpuidex is the only viable option, specifically when accessing a high leaf like VM Hypervisor information (0x40000000), which __get_cpuid_count does not allow. This will be used in a future commit to access Hypervisor information about the TSC frequency of x86 CPUs, where available.

Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available, adjust existing code to check for _MSC_VER when including intrin.h.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
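As a rough sketch of reading the hypervisor leaf on GCC/clang, the example below uses the raw `__cpuid_count` macro from cpuid.h, which performs no maximum-leaf check, as a stand-in for `__cpuidex`. The function name is hypothetical, and checking the "hypervisor present" bit first is a choice made for this sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/*
 * Copy the 12-byte hypervisor vendor signature from leaf 0x40000000 into
 * vendor (NUL-terminated). Returns false, leaving vendor empty, when no
 * hypervisor is reported or CPUID is not usable in this build.
 */
static bool
sketch_hypervisor_vendor(char vendor[13])
{
    assert(vendor != NULL);
    memset(vendor, 0, 13);
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.1:ECX bit 31 is the "hypervisor present" flag. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 31)))
        return false;

    /* Raw CPUID instruction: works for leaves above the basic maximum. */
    __cpuid_count(0x40000000, 0, eax, ebx, ecx, edx);
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &ecx, 4);
    memcpy(vendor + 8, &edx, 4);
    return true;
#else
    return false;
#endif
}
```

On a KVM guest this would typically yield a signature such as "KVMKVMKVM"; on bare metal the function returns false.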

The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an absolute instr_time value in order to do a diff, which goes against the spirit of how the GET_* macros are supposed to be used, and will cause overhead in a future change that assumes these macros are typically used on intervals only.

Additionally, the program was doing unnecessary work in the test loop by measuring the time elapsed, instead of checking the existing current time measurement against a target end time. To support that, introduce a new INSTR_TIME_ADD_NANOSEC macro that allows adding user-defined nanoseconds to an instr_time variable.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
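The loop change described above can be sketched as follows: compute the end time once, then compare each new reading against it, with no subtraction per iteration. The `sketch_` type and macros are stand-ins invented for this example and only loosely mirror instr_time; the clock is POSIX clock_gettime().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for instr_time: a single 64-bit tick count. */
typedef struct sketch_instr_time
{
    int64_t ticks;
} sketch_instr_time;

static int64_t
sketch_now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

#define SKETCH_TIME_SET_CURRENT(t)    ((t).ticks = sketch_now_ns())
#define SKETCH_TIME_ADD_NANOSEC(t, n) ((t).ticks += (int64_t) (n))

/* Count how many clock readings fit into duration_ns. */
static uint64_t
sketch_count_loops(int64_t duration_ns)
{
    sketch_instr_time cur;
    sketch_instr_time end;
    uint64_t loops = 0;

    assert(duration_ns >= 0);
    SKETCH_TIME_SET_CURRENT(cur);
    end = cur;
    SKETCH_TIME_ADD_NANOSEC(end, duration_ns);  /* target end time, set once */

    while (cur.ticks < end.ticks)
    {
        SKETCH_TIME_SET_CURRENT(cur);  /* the only per-iteration work */
        loops++;
    }
    return loops;
}
```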

…tforms

The timing infrastructure (INSTR_* macros) measures time elapsed using clock_gettime() on POSIX systems, which returns the time as nanoseconds, and QueryPerformanceCounter() on Windows, which is a specialized timing clock source that returns a tick counter that needs to be converted to nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad hoc on Windows, e.g. when calling INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every invocation, despite the frequency being stable after program start, incurring unnecessary overhead. It also causes a fractured implementation where macros are defined differently between platforms.

To ease code readability, and to prepare for a future change that intends to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce a new pg_ticks_to_ns() function that gets called on all platforms. This function relies on a separately initialized ticks_per_ns_scaled value that represents the conversion ratio. This value is initialized from QueryPerformanceFrequency() on Windows, and set to zero on x86-64 POSIX systems, which results in the ticks being treated as nanoseconds. Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now mandatory for both the backend and any frontend programs to call before utilizing INSTR_* macros.

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
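One way to picture a precomputed conversion ratio is a fixed-point multiply and shift, with zero meaning "ticks are already nanoseconds". The sketch below is an assumption about the general technique, not the patch's code: the shift width, the `sketch_` names, and the initialization signature are all hypothetical, and it ignores multiplication overflow for very long intervals.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NS_PER_S 1000000000ULL
#define SKETCH_SHIFT    14      /* assumed fixed-point precision */

/* Scaled nanoseconds-per-tick ratio; 0 means ticks are already nanoseconds. */
static uint64_t sketch_scale = 0;

/*
 * Compute the ratio once at startup. ticks_per_second would come from
 * something like QueryPerformanceFrequency(); pass 0 for clocks that already
 * report nanoseconds.
 */
static void
sketch_initialize_timing(uint64_t ticks_per_second)
{
    if (ticks_per_second == 0)
        sketch_scale = 0;
    else
        sketch_scale = (SKETCH_NS_PER_S << SKETCH_SHIFT) / ticks_per_second;
}

/* Convert a tick interval to nanoseconds without a division per call. */
static int64_t
sketch_ticks_to_ns(int64_t ticks)
{
    assert(ticks >= 0);
    if (sketch_scale == 0)
        return ticks;
    /* NOTE: overflows for large tick counts; a real version must handle it. */
    return (int64_t) (((uint64_t) ticks * sketch_scale) >> SKETCH_SHIFT);
}
```

With a 10 MHz counter, `sketch_initialize_timing(10000000)` followed by `sketch_ticks_to_ns(10000000)` returns 1000000000, i.e. one second.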

…asurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved from the CPU using RDTSC/RDTSCP instructions, instead of APIs like clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced, whose default ("auto") automatically uses the TSC when running on Linux/x86-64, in case the system clocksource is reported as "tsc". The use of the system APIs can be enforced by setting "system", or on x86-64 architectures the use of TSC can be enforced by explicitly setting "tsc".

In order to use the TSC, the frequency is first determined by use of CPUID, and if not available, by running a short calibration loop at program start, falling back to the system time if TSC values are not stable.

Note that we split TSC usage into the RDTSC CPU instruction, which does not wait for out-of-order execution (faster, less precise), and the RDTSCP instruction, which waits for outstanding instructions to retire. RDTSCP is deemed to have little benefit in the typical InstrStartNode() / InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC. The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be used when precision is more important than performance. When the system timing clock source is used, both of these macros instead utilize the system APIs (clock_gettime / QueryPerformanceCounter) like before.

Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
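The fast/precise split can be sketched with the `__rdtsc()` and `__rdtscp()` intrinsics that GCC and clang provide in x86intrin.h. The function names are hypothetical, the non-x86 fallback to clock_gettime() exists only so the sketch compiles everywhere, and the sketch assumes the CPU supports RDTSCP; real code would check CPUID and TSC stability first, as the commit message describes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <x86intrin.h>
#define SKETCH_HAVE_TSC 1
#endif

/* System clock path, used when the TSC is not available in this build. */
static inline uint64_t
sketch_system_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
}

/* Fast read: RDTSC may execute before earlier instructions have retired. */
static inline uint64_t
sketch_read_fast(void)
{
#ifdef SKETCH_HAVE_TSC
    return __rdtsc();
#else
    return sketch_system_ns();
#endif
}

/* Precise read: RDTSCP waits for earlier instructions to execute first. */
static inline uint64_t
sketch_read_precise(void)
{
#ifdef SKETCH_HAVE_TSC
    unsigned int aux;   /* receives IA32_TSC_AUX, unused here */

    return __rdtscp(&aux);
#else
    return sketch_system_ns();
#endif
}
```

Both functions return raw ticks on x86-64, so an interval taken from them still needs a ticks-to-nanoseconds conversion based on the TSC frequency.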

Author: David Geier <geidav.pg@gmail.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces reading the cntvct_el0 register on ARM systems to access the generic timer that provides a synchronized ticks value across CPUs.

Note this adds an exception for Apple Silicon CPUs, due to the observed fact that M3 and newer have different timer frequencies for the Efficiency and the Performance cores, and we can't be sure where we get scheduled.

To simplify the implementation, this does not support Windows on ARM, since it's quite rare and hard to test.

Relies on the existing timing_clock_source GUC to control whether the TSC-like timer gets used instead of the system timer.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
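For readers unfamiliar with the ARM generic timer, a minimal sketch of reading it with GCC/clang inline assembly might look like this. The function names are hypothetical, reading the frequency from cntfrq_el0 is an assumption of this sketch rather than something stated above, and the Apple exclusion simply mirrors the exception described in the commit message. Other platforms fall back to clock_gettime() so the sketch compiles everywhere.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__aarch64__) && defined(__GNUC__) && !defined(__APPLE__)
#define SKETCH_HAVE_GENERIC_TIMER 1
#endif

/* Read the current tick count of the timer. */
static inline uint64_t
sketch_timer_ticks(void)
{
#ifdef SKETCH_HAVE_GENERIC_TIMER
    uint64_t ticks;

    /* Virtual counter register; no barrier, so the read may be reordered. */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
#endif
}

/* Ticks per second for the values returned by sketch_timer_ticks(). */
static inline uint64_t
sketch_timer_frequency(void)
{
#ifdef SKETCH_HAVE_GENERIC_TIMER
    uint64_t freq;

    __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
#else
    return 1000000000u;     /* fallback clock already counts nanoseconds */
#endif
}
```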

lfittl force-pushed the fast-timing-v12 branch 4 times, most recently from a087350 to c93527f Compare March 22, 2026 00:48

lfittl added 2 commits March 21, 2026 18:21

lfittl force-pushed the fast-timing-v12 branch from c93527f to f4b13ad Compare March 22, 2026 01:22

lfittl added 5 commits March 22, 2026 09:35

pg_test_timing: Also test RDTSC/RDTSCP timing and report time source

62f4abd

Author: David Geier <geidav.pg@gmail.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

lfittl force-pushed the fast-timing-v12 branch from 3d7f4c2 to 2d48b04 Compare March 22, 2026 16:41

lfittl added awaiting-review archived and removed awaiting-review labels Mar 22, 2026

lfittl wants to merge 7 commits into master from fast-timing-v12
