Reduce timing overhead of EXPLAIN ANALYZE using rdtsc (PG19, v12) #47

Open
lfittl wants to merge 7 commits into master from fast-timing-v12

Conversation


lfittl commented Mar 21, 2026

No description provided.

lfittl force-pushed the fast-timing-v12 branch 4 times, most recently from a087350 to c93527f on March 22, 2026 at 00:48
lfittl added 2 commits March 21, 2026 18:21
Introduce two helpers for CPUID, pg_cpuid and pg_cpuid_subleaf, that wrap
the platform-specific __get_cpuid/__cpuid and __get_cpuid_count/__cpuidex
functions.

Additionally, introduce the CPUIDResult struct to make code working with
CPUID easier to read by referencing the register name (e.g. ECX) instead
of a numeric index.
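A minimal sketch of what such a struct and wrapper could look like; the struct layout follows the description above, but the function body and return convention here are assumptions, not the committed code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Registers returned by one CPUID invocation, addressable by name
 * (result.ecx) instead of a numeric array index. */
typedef struct CPUIDResult
{
	uint32_t	eax;
	uint32_t	ebx;
	uint32_t	ecx;
	uint32_t	edx;
} CPUIDResult;

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Wrap GCC/clang's __get_cpuid(); returns false if the leaf is not
 * supported. An MSVC build would call __cpuid() here instead. */
static bool
pg_cpuid(uint32_t leaf, CPUIDResult *result)
{
	return __get_cpuid(leaf, &result->eax, &result->ebx,
					   &result->ecx, &result->edx) != 0;
}
#endif
```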

Author: Lukas Fittl <lukas@fittl.com>
Suggested-By: John Naylor <john.naylor@postgresql.org>
Reviewed-by:
Discussion:
Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform. But there
are cases where we want to call __cpuidex as the only viable option,
specifically when accessing a high leaf like the VM Hypervisor
information leaf (0x40000000), which __get_cpuid_count does not allow.

This will be used in a future commit to access Hypervisor information
about the TSC frequency of x86 CPUs, where available.

Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
adjust existing code to check for _MSC_VER when including intrin.h.
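A sketch of the include logic this describes; the macro spellings follow the commit message, but treat the exact arrangement as an assumption:

```c
#include <assert.h>

/* __cpuidex lives in intrin.h for MSVC but in cpuid.h for GCC/clang,
 * so gate the intrin.h include on _MSC_VER rather than on
 * HAVE__CPUIDEX alone (which GCC/clang builds may now also set). */
#if defined(_MSC_VER)
#include <intrin.h>				/* __cpuidex for MSVC */
#elif defined(HAVE__CPUIDEX)
#include <cpuid.h>				/* __cpuidex for GCC/clang */
#endif

/* Hypervisor information leaf: above the maximum leaf reported by
 * CPUID leaf 0, so __get_cpuid_count refuses it and __cpuidex is the
 * only way to query it. */
#define CPUID_HV_LEAF 0x40000000
```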

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
lfittl added 5 commits March 22, 2026 09:35
The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an
absolute instr_time value in order to do a diff, which goes against the spirit
of how the GET_* macros are supposed to be used, and will cause overhead in a
future change that assumes these macros are typically used on intervals only.

Additionally the program was doing unnecessary work in the test loop by
measuring the time elapsed, instead of checking the existing current time
measurement against a target end time. To support that, introduce a new
INSTR_TIME_ADD_NANOSEC macro that allows adding user-defined nanoseconds
to an instr_time variable.
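A minimal sketch of such a macro, assuming the nanosecond-based instr_time representation used on POSIX systems (the struct shown here is simplified): precomputing a target end time once lets the test loop compare against it instead of diffing on every iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified instr_time: on POSIX systems the ticks field holds
 * nanoseconds directly. */
typedef struct instr_time
{
	int64_t		ticks;
} instr_time;

/* Add a caller-supplied number of nanoseconds to an instr_time. */
#define INSTR_TIME_ADD_NANOSEC(t, ns) \
	((t).ticks += (int64_t) (ns))
```

Usage in the spirit of the loop described above: set `end = start`, apply `INSTR_TIME_ADD_NANOSEC(end, test_duration_ns)` once, then compare each fresh time measurement against `end`.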

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
…tforms

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To improve code readability, and to prepare for a future change that
intends to use a ticks-to-nanoseconds conversion on x86-64 for TSC use,
introduce a new pg_ticks_to_ns() function that gets called on all platforms.

This function relies on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.
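A minimal sketch of the conversion, assuming a 64.32 fixed-point multiplier; the variable name follows the commit message, but the fixed-point layout, shift, and initialization shown here are assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define TICKS_SHIFT 32

/* Scaled ticks-to-nanoseconds conversion factor, set once by
 * pg_initialize_timing(): derived from QueryPerformanceFrequency() on
 * Windows, zero on x86-64 POSIX systems (ticks already are ns). */
static uint64_t ticks_per_ns_scaled = 0;

static inline int64_t
pg_ticks_to_ns(int64_t ticks)
{
	/* Zero means the ticks are already nanoseconds (clock_gettime). */
	if (ticks_per_ns_scaled == 0)
		return ticks;

	/* 128-bit intermediate avoids overflow for large tick counts. */
	return (int64_t) (((__uint128_t) ticks * ticks_per_ns_scaled)
					  >> TICKS_SHIFT);
}
```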

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
…asurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using the RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64
and the system clocksource is reported as "tsc". The use of the system
APIs can be enforced by setting "system", or on x86-64 architectures the
use of TSC can be enforced by explicitly setting "tsc".

To use the TSC, its frequency is first determined via CPUID and, if not
available there, by running a short calibration loop at program start,
falling back to the system time if TSC values are not stable.

Note that we split TSC usage between the RDTSC CPU instruction, which does
not wait for out-of-order execution (faster, less precise), and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() / InstrStopNode()
use case of EXPLAIN, and can be up to twice as slow. To separate these use
cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.
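A sketch of the two read paths on x86-64; the macro names come from the commit, but the helper function names and bodies are assumptions:

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* Fast path: RDTSC does not wait for earlier instructions to retire.
 * Would back INSTR_TIME_SET_CURRENT_FAST(). */
static inline uint64_t
pg_read_tsc_fast(void)
{
	return __rdtsc();
}

/* Precise path: RDTSCP waits for outstanding instructions to retire,
 * which can be up to twice as slow. Would back INSTR_TIME_SET_CURRENT(). */
static inline uint64_t
pg_read_tsc(void)
{
	unsigned int aux;			/* receives IA32_TSC_AUX (processor id) */

	return __rdtscp(&aux);
}
#endif
```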

Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
use of the cntvct_el0 instruction on ARM systems to access the generic
timer that provides a synchronized ticks value across CPUs.

Note this adds an exception for Apple Silicon CPUs, because M3 and newer
chips have been observed to use different timer frequencies for the
Efficiency and the Performance cores, and we can't be sure where we get scheduled.

To simplify the implementation this does not support Windows on ARM,
since it's quite rare and hard to test.

Relies on the existing timing_clock_source GUC to control whether the
TSC-like timer gets used instead of the system timer.
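A sketch of how the generic timer can be read on AArch64 (assumed, not the committed code): cntvct_el0 gives the virtual count and cntfrq_el0 its frequency in Hz.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
/* Read the generic timer's virtual count, synchronized across cores. */
static inline uint64_t
pg_read_cntvct(void)
{
	uint64_t	val;

	__asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (val));
	return val;
}

/* Read the generic timer's frequency in Hz, as reported by firmware. */
static inline uint64_t
pg_read_cntfrq(void)
{
	uint64_t	val;

	__asm__ __volatile__("mrs %0, cntfrq_el0" : "=r" (val));
	return val;
}
#endif
```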

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: