Fast timing v16#54

Open
lfittl wants to merge 5 commits into master from fast-timing-v16

@lfittl (Owner) commented Apr 2, 2026

TODO

  • Add reviewers

0002 patch:

  • Split out more of the architecture dependent code that's not in instr_time.c into a separate commit
  • Review comment re: "doesn't that mean the effort to synchronize the tsc freq on EXEC_BACKEND is futile?"
  • Remove "Initialize timing infrastructure" comment change
  • Revise name of "use_tsc" global ("seems a tad too short a name for my personal taste. Too likely to also be used by something else.")
  • Review where "tsc_frequency_khz" being 0 vs -1 is documented (it should be in the include file?)
  • Always check that TSC is invariant
  • Put __rdtscp in a helper
  • Explain TSC_CALIBRATION_SKIPS
  • Review how we can have pg_test_timing always run the TSC calibration (and comment whether we should move up the pg_test_timing change to show the frequency?)
  • Add a debug log in pg_initialize_timing_tsc when tsc_detect_frequency is run
  • Revise 16H.EAX comment in x86_tsc_frequency_khz

0004 patch:

  • Review if we can somehow detect the mixed core situation in a more architecture agnostic way

@lfittl force-pushed the fast-timing-v16 branch 3 times, most recently from 7300f17 to b8ee5c1 (April 2, 2026 22:49)
lfittl added 5 commits April 2, 2026 15:54
…tforms

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
a new pg_ticks_to_ns() function that gets called on all platforms.

This function relies on a separately initialized ticks_per_ns_scaled
value that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.
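
To make the conversion concrete, here is a minimal sketch of a scaled-integer ticks-to-nanoseconds conversion. It is not the patch's code: the `sketch_` function names and the `SKETCH_TICKS_TO_NS_SHIFT` constant are hypothetical, and treating `ticks_per_ns_scaled` as a fixed-point nanoseconds-per-tick multiplier is an assumption. Only the "zero means ticks are already nanoseconds" behavior is taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point precision; the real patch may use a different value. */
#define SKETCH_TICKS_TO_NS_SHIFT 14

/*
 * Zero means "ticks are already nanoseconds", as described for x86-64 POSIX
 * systems. Otherwise this sketch assumes the value holds nanoseconds per
 * tick, scaled by 2^SKETCH_TICKS_TO_NS_SHIFT.
 */
static uint64_t ticks_per_ns_scaled = 0;

/*
 * Initialize the ratio from a tick frequency in Hz, for example the result
 * of QueryPerformanceFrequency() on Windows.
 */
static inline void
sketch_initialize_timing(uint64_t ticks_per_second)
{
    assert(ticks_per_second > 0);
    ticks_per_ns_scaled =
        (UINT64_C(1000000000) << SKETCH_TICKS_TO_NS_SHIFT) / ticks_per_second;
}

/*
 * Convert ticks to nanoseconds with one multiply and one shift. The multiply
 * can overflow for very large tick counts; a production version would have
 * to handle that case, this sketch does not.
 */
static inline uint64_t
sketch_ticks_to_ns(uint64_t ticks)
{
    if (ticks_per_ns_scaled == 0)
        return ticks;
    return (ticks * ticks_per_ns_scaled) >> SKETCH_TICKS_TO_NS_SHIFT;
}
```

With a 10 MHz counter, `sketch_initialize_timing(10000000)` followed by `sketch_ticks_to_ns(10000000)` yields one second in nanoseconds. Computing the ratio once at startup is what removes the per-call frequency lookup.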

Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
This adds additional x86 specific CPUID checks for flags needed for
determining whether the Time-Stamp Counter (TSC) is usable on a given
system, as well as a helper function to retrieve the TSC frequency from
CPUID.
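
As an illustration of the kind of CPUID checks described here, the following sketch tests the invariant TSC flag (CPUID leaf 0x80000007, EDX bit 8) and derives a frequency from leaf 0x15. All `sketch_` names are hypothetical and this is not the patch's implementation. It assumes GCC or Clang, whose `<cpuid.h>` provides `__get_cpuid()`, and it reports "unknown" on other platforms.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/*
 * Decode CPUID leaf 0x15: EAX is the denominator and EBX the numerator of
 * the TSC/crystal ratio, ECX is the crystal frequency in Hz. Returns the
 * TSC frequency in kHz, or 0 if any field is not reported.
 */
static inline uint64_t
sketch_tsc_khz_from_leaf15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    if (eax == 0 || ebx == 0 || ecx == 0)
        return 0;
    return (uint64_t) ecx * ebx / eax / 1000;
}

/* True if the CPU advertises an invariant TSC; false if unknown. */
static inline bool
sketch_tsc_is_invariant(void)
{
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 8)) != 0;
#else
    return false;
#endif
}

/* TSC frequency in kHz from CPUID leaf 0x15, or 0 if it is unavailable. */
static inline uint64_t
sketch_tsc_khz_from_cpuid(void)
{
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx))
        return 0;
    return sketch_tsc_khz_from_leaf15(eax, ebx, ecx);
#else
    return 0;
#endif
}
```

For example, a 24 MHz crystal with a 250/2 ratio decodes to 3,000,000 kHz. Many CPUs leave some of these fields at zero, which is why a zero return has to be treated as "not available" rather than as a frequency.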

This is intended for a future patch that will utilize the TSC to lower
the overhead of timing instrumentation.

Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: John Naylor <john.naylor@postgresql.org>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
…asurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64
if the system clocksource is reported as "tsc". Use of the system APIs can be
enforced by setting "system", and on x86-64 architectures use of the TSC can
be enforced by explicitly setting "tsc".
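
The selection rule described above could be sketched as follows. The enum, the function names, and the use of the Linux sysfs file `current_clocksource` to detect the system clocksource are assumptions made for illustration, not the patch's code.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical representation of the three timing_clock_source settings. */
typedef enum
{
    SKETCH_CLOCK_AUTO,
    SKETCH_CLOCK_SYSTEM,
    SKETCH_CLOCK_TSC
} SketchClockSetting;

/*
 * Returns true if Linux reports "tsc" as the current clocksource. Returns
 * false on any error, including platforms where the file does not exist.
 */
static inline bool
sketch_linux_clocksource_is_tsc(void)
{
    char  buf[64];
    bool  result = false;
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");

    if (f == NULL)
        return false;
    if (fgets(buf, sizeof(buf), f) != NULL)
        result = strncmp(buf, "tsc", 3) == 0 &&
            (buf[3] == '\n' || buf[3] == '\0');
    fclose(f);
    return result;
}

/* Decide whether the TSC should be used for a given setting. */
static inline bool
sketch_use_tsc(SketchClockSetting setting, bool is_linux_x86_64,
               bool clocksource_is_tsc)
{
    switch (setting)
    {
        case SKETCH_CLOCK_SYSTEM:
            return false;       /* system APIs enforced */
        case SKETCH_CLOCK_TSC:
            return true;        /* TSC explicitly enforced */
        case SKETCH_CLOCK_AUTO:
            return is_linux_x86_64 && clocksource_is_tsc;
    }
    return false;
}
```

A caller would pass `sketch_linux_clocksource_is_tsc()` as the last argument. Keeping the decision separate from the file read makes the rule easy to test without touching the host system.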

In order to use the TSC the frequency is first determined by use of CPUID,
and if not available, by running a short calibration loop at program start,
falling back to the system time if TSC values are not stable.
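
A calibration loop of the kind mentioned above might look like the following sketch: sample the tick counter and a reference clock together until enough reference time has passed, then derive the frequency from the two deltas. The function names and the callback-based interface are hypothetical, and the stability check that triggers the fallback is not modeled. The fake clocks exist only so the arithmetic can be exercised deterministically.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback type: returns a tick count or a time in nanoseconds. */
typedef uint64_t (*sketch_read_fn) (void);

/*
 * Spin until at least min_elapsed_ns of reference time has passed, then
 * return the observed tick frequency in kHz (ticks per ns times 1e6).
 */
static inline uint64_t
sketch_calibrate_tsc_khz(sketch_read_fn read_ticks, sketch_read_fn read_ns,
                         uint64_t min_elapsed_ns)
{
    uint64_t ns_start = read_ns();
    uint64_t ticks_start = read_ticks();
    uint64_t ns_now;
    uint64_t ticks_now;

    assert(min_elapsed_ns > 0);
    do
    {
        ns_now = read_ns();
        ticks_now = read_ticks();
    } while (ns_now - ns_start < min_elapsed_ns);

    return (ticks_now - ticks_start) * UINT64_C(1000000) / (ns_now - ns_start);
}

/* Deterministic fake clocks for demonstration: 3 ticks per nanosecond. */
static uint64_t sketch_fake_ns_value = 0;
static uint64_t sketch_fake_ticks_value = 0;

static inline uint64_t
sketch_fake_ns(void)
{
    return sketch_fake_ns_value += 1000;
}

static inline uint64_t
sketch_fake_ticks(void)
{
    return sketch_fake_ticks_value += 3000;
}
```

With the fake clocks, `sketch_calibrate_tsc_khz(sketch_fake_ticks, sketch_fake_ns, 10000)` reports 3,000,000 kHz. A real version would pass a TSC read and a `clock_gettime()` wrapper, and would likely discard early samples and repeat the measurement to check that the result is stable.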

Note that we split TSC usage into the RDTSC CPU instruction, which does not
wait for out-of-order execution (faster, less precise), and the RDTSCP instruction,
which waits for outstanding instructions to retire. RDTSCP is deemed to have
little benefit in the typical InstrStartNode() / InstrStopNode() use case of
EXPLAIN, and can be up to twice as slow. To separate these use cases, the new
macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used, both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.
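
The difference between the two read paths can be sketched with the compiler intrinsics `__rdtsc()` and `__rdtscp()` from `<x86intrin.h>` (GCC and Clang). The `sketch_` function names are hypothetical stand-ins for what the two macros might expand to, and the `clock_gettime()` fallback models only the system clock source case on POSIX.

```c
/* clock_gettime() is POSIX; request its declaration if nothing else has. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>
#define SKETCH_HAVE_TSC 1
#endif

/* System clock path: monotonic time in nanoseconds. */
static inline uint64_t
sketch_system_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (uint64_t) ts.tv_sec * UINT64_C(1000000000) + (uint64_t) ts.tv_nsec;
}

/* Fast path: RDTSC does not wait for earlier instructions to finish. */
static inline uint64_t
sketch_ticks_fast(void)
{
#ifdef SKETCH_HAVE_TSC
    return __rdtsc();
#else
    return sketch_system_ns();
#endif
}

/* Precise path: RDTSCP waits for outstanding instructions to retire. */
static inline uint64_t
sketch_ticks_precise(void)
{
#ifdef SKETCH_HAVE_TSC
    unsigned int aux;           /* receives IA32_TSC_AUX, unused here */

    return __rdtscp(&aux);
#else
    return sketch_system_ns();
#endif
}
```

On x86-64 both functions return raw TSC ticks, which still need a ticks-to-nanoseconds conversion before they can be compared with values from the system clock.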

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an earlier version)
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com> (in an earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (in an earlier version)
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
…and TSC frequency

Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
reads of the cntvct_el0 register on ARM systems to access the generic
timer that provides a synchronized ticks value across CPUs.
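
For illustration, reading the generic timer on AArch64 with GCC or Clang inline assembly could look like the sketch below. The function names are hypothetical. Taking the tick frequency from the `cntfrq_el0` register is an assumption, since the commit message does not say how the frequency is obtained.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
/*
 * Read the virtual counter. No barrier is issued, so the read may be
 * reordered relative to surrounding instructions.
 */
static inline uint64_t
sketch_read_cntvct(void)
{
    uint64_t value;

    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(value));
    return value;
}

/* Read the counter frequency in Hz (assumed source of the tick rate). */
static inline uint64_t
sketch_read_cntfrq(void)
{
    uint64_t value;

    __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(value));
    return value;
}
#endif

/*
 * Convert generic timer ticks to nanoseconds. Splitting into whole seconds
 * and a remainder avoids overflow in the multiplication for realistic
 * frequencies.
 */
static inline uint64_t
sketch_generic_timer_ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz > 0);
    return (ticks / freq_hz) * UINT64_C(1000000000) +
        (ticks % freq_hz) * UINT64_C(1000000000) / freq_hz;
}
```

At a 24 MHz timer frequency, 24 ticks correspond to 1000 ns. The conversion only holds if every core reports the same frequency.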

Note this adds an exception for Apple Silicon CPUs, due to the observed
fact that M3 and newer have different timer frequencies for the Efficiency
and the Performance cores, and we can't be sure where we get scheduled.

To simplify the implementation, this does not support Windows on ARM,
since it's quite rare and hard to test.

This relies on the existing timing_clock_source GUC to control whether
the TSC-like timer is used instead of the system timer.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: