Reduce timing overhead of EXPLAIN ANALYZE using rdtsc (PG19, v12)#47
Open
Reduce timing overhead of EXPLAIN ANALYZE using rdtsc (PG19, v12)#47
Conversation
a087350 to
c93527f
Compare
Introduce two helpers for CPUID, pg_cpuid and pg_cpuid_subleaf that wrap the platform specific __get_cpuid/__cpuid and __get_cpuid_count/__cpuidex functions. Additionally, introduce the CPUIDResult struct to make code working with CPUID easier to read by referencing the register name (e.g. ECX) instead of a numeric index. Author: Lukas Fittl <lukas@fittl.com> Suggested-By: John Naylor <john.naylor@postgresql.org> Reviewed-by: Discussion:
Previously we would only check for the availability of __cpuidex if the related __get_cpuid_count was not available on a platform. But there are cases where we want to be able to call __cpuidex as the only viable option, specifically, when accessing a high leaf like VM Hypervisor information (0x40000000), which __get_cpuid_count does not allow. This will be used in an future commit to access Hypervisor information about the TSC frequency of x86 CPUs, where available. Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available, adjust existing code to check for _MSC_VER when including intrin.h. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an absolute instr_time value in order to do a diff, which goes against the spirit of how the GET_* macros are supposed to be used, and will cause overhead in a future change that assumes these macros are typically used on intervals only. Additionally the program was doing unnecessary work in the test loop by measuring the time elapsed, instead of checking the existing current time measurement against a target end time. To support that, introduce a new INSTR_TIME_ADD_NANOSEC macro that allows adding user-defined nanoseconds to an instr_time variable. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Discussion:
…tforms The timing infrastructure (INSTR_* macros) measures time elapsed using clock_gettime() on POSIX systems, which returns the time as nanoseconds, and QueryPerformanceCounter() on Windows, which is a specialized timing clock source that returns a tick counter that needs to be converted to nanoseconds using the result of QueryPerformanceFrequency(). This conversion currently happens ad-hoc on Windows, e.g. when calling INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every invocation, despite the frequency being stable after program start, incurring unnecessary overhead. It also causes a fractured implementation where macros are defined differently between platforms. To ease code readability, and prepare for a future change that intends to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce a new pg_ticks_to_ns() function that gets called on all platforms. This function relies on a separately initialized ticks_per_ns_scaled value, that represents the conversion ratio. This value is initialized from QueryPerformanceFrequency() on Windows, and set to zero on x86-64 POSIX systems, which results in the ticks being treated as nanoseconds. Other architectures always directly return the original ticks. To support this, pg_initialize_timing() is introduced, and is now mandatory for both the backend and any frontend programs to call before utilizing INSTR_* macros. Author: Lukas Fittl <lukas@fittl.com> Author: Andres Freund <andres@anarazel.de> Author: David Geier <geidav.pg@gmail.com> Reviewed-by: Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
…asurements
This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSC instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
reduced by up to 10% for queries moving lots of rows through the plan.
To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
in case the system clocksource is reported as "tsc". The use of the system
APIs can be enforced by setting "system", or on x86-64 architectures the
use of TSC can be enforced by explicitly setting "tsc".
In order to use the TSC the frequency is first determined by use of CPUID,
and if not available, by running a short calibration loop at program start,
falling back to the system time if TSC values are not stable.
Note, that we split TSC usage into the RDTSC CPU instruction which does not
wait for out-of-order execution (faster, less precise) and the RDTSCP instruction,
which waits for outstanding instructions to retire. RDTSCP is deemed to have
little benefit in the typical InstrStartNode() / InstrStopNode() use case of
EXPLAIN, and can be up to twice as slow. To separate these use cases, the new
macro INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.
The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Author: David Geier <geidav.pg@gmail.com> Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Similar to the RDTSC/RDTSCP instructions on x68-64, this introduces use of the cntvct_el0 instruction on ARM systems to access the generic timer that provides a synchronized ticks value across CPUs. Note this adds an exception for Apple Silicon CPUs, due to the observed fact that M3 and newer has different timer frequencies for the Efficiency and the Performance cores, and we can't be sure where we get scheduled. To simplify the implementation this does not support Windows on ARM, since its quite rare and hard to test. Relies on the existing timing_clock_source GUC to control whether TSC-like timer gets used, instead of system timer. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Discussion:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.