Reduce timing overhead of EXPLAIN ANALYZE using rdtsc (PG19, v11) #39

Open
lfittl wants to merge 32 commits into master from fast-timing-v11

Conversation

lfittl (Owner) commented Mar 10, 2026

TODO

  • Review John's feedback re: pg_cpu (split it out?)
  • Add a define for the cycle-to-nanoseconds conversion (instead of checking for x86 directly)

michaelpq and others added 20 commits March 9, 2026 07:15
The test mentioned pg_stat_ext_exprs, but the correct catalog name is
pg_stats_ext_exprs.

Thinko in ba97bf9.

Discussion: https://postgr.es/m/CADkLM=eEhxJpSUP+eC=eMGZZsVOpnfKDvVkuCbsFg9CajYwDsA@mail.gmail.com
The tests of stats_import.sql include a set of queries to do
differential checks of the three statistics catalog relations, based on
the comparison of a source relation and a target relation, used for the
copy of the stats data with the restore functions:
- pg_statistic
- pg_stats_ext
- pg_stats_ext_exprs

This commit refactors the tests to reduce the bloat of such differential
queries, by creating a set of objects that make the differential queries
smaller:
- View for a base relation type.
- First function to retrieve stats data, that returns a type based on
the view previously created.
- Second function that checks the difference, based on two calls of the
first function.

This change leads to a nice reduction of stats_import.sql, with a larger
effect on the output file.

While at it, this adds some sanity checks for the three catalogs, to
warn developers that the stats import facilities may need to be updated
if any of the three catalogs change.  Such changes are rare in practice;
see 918eee0 as one example.  Another stylistic change is the use of the
extended output format for the differential queries, so that we avoid
long lines of output if a diff is caught.

Author: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CADkLM=eEhxJpSUP+eC=eMGZZsVOpnfKDvVkuCbsFg9CajYwDsA@mail.gmail.com
When make_new_segment() creates an odd-sized segment, the pagemap was
only sized based on a number of usable_pages entries, forgetting that a
segment also contains metadata pages, and that the FreePageManager uses
absolute page indices that cover the entire segment.  This
miscalculation could cause accesses to pagemap entries to be out of
bounds.  During subsequent reuse of the allocated segment, allocations
landing on pages with indices higher than usable_pages could cause
out-of-bounds pagemap reads and/or writes.  On write, 'span' pointers
are stored into the data area, corrupting the allocated objects.  On
read (aka during a dsa_free), garbage is interpreted as a span pointer,
typically crashing the server in dsa_get_address().

The normal geometric path correctly sizes the pagemap for all pages in
the segment.  The odd-sized path needs to do the same, but it works
forward from usable_pages rather than backward from total_size.

This commit fixes the sizing of the odd-sized case by adding pagemap
entries for the metadata pages after the initial metadata_bytes
calculation, using an integer ceiling division to compute the exact
number of additional entries needed in one go, avoiding any iteration in
the calculation.

An assertion is added in the code path for odd-sized segments, ensuring
that the pagemap includes the metadata area, and that the result is
appropriately sized.
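
The ceiling division mentioned above can be sketched as follows. This is an illustrative stand-alone version, not the actual dsa.c code; FPM_PAGE_SIZE matches the free-page-manager page size, but the function name and variable are assumptions.

```c
#include <stddef.h>

/* Compute how many pagemap entries are needed to cover the metadata area,
 * in one go and without iteration. Illustrative sketch only; the real
 * dsa.c code uses its own variable names. */
#define FPM_PAGE_SIZE 4096

static size_t
metadata_pagemap_entries(size_t metadata_bytes)
{
    /* (a + b - 1) / b computes ceil(a / b) without floating point */
    return (metadata_bytes + FPM_PAGE_SIZE - 1) / FPM_PAGE_SIZE;
}
```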

This problem would show up depending on the size requested for the
allocation of a DSA segment.  The reporter has noticed this issue when a
parallel hash join makes a DSA allocation large enough to trigger the
odd-sized segment path, but it could happen for anything that does a DSA
allocation.

A regression test is added to test_dsa, down to v17 where the test
module has been introduced.  This adds a set of cheap tests to check the
problem, the new assertion being useful for this purpose.  Sami has
proposed a test that took a longer time than what I have done here; the
test committed is faster and good enough to check the odd-sized
allocation path.

Author: Paul Bunn <paul.bunn@icloud.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
Backpatch-through: 14
Commit 2cd40ad added the IF NOT EXISTS option to ALTER TABLE ADD COLUMN.
This also enabled IF NOT EXISTS for ALTER FOREIGN TABLE ADD COLUMN,
but the ALTER FOREIGN TABLE documentation was not updated to mention it.

This commit updates the documentation to describe the IF NOT EXISTS option for
ALTER FOREIGN TABLE ADD COLUMN.

While updating that section, this commit also clarifies that the COLUMN
keyword is optional in ALTER FOREIGN TABLE ADD/DROP COLUMN. Previously,
part of the documentation could be read as if COLUMN were required.

This commit adds regression tests covering these ALTER FOREIGN TABLE syntaxes.

Backpatch to all supported versions.

Suggested-by: Fujii Masao <masao.fujii@gmail.com>
Author: Chao Li <lic@highgo.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFk=rrhrwGwPtQxBesbT4DzSZ86Q3ftcwCu3AR5bOiXLw@mail.gmail.com
Backpatch-through: 14
Commit f014b1b inserted some new code in between existing code and
a trailing comment.  Move the comment back to near the code it belongs
to.
When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length.  In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)

Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.

To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.
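
The reason the VM limit block differs from the main fork's is the dense mapping between heap blocks and VM pages, which a helper along these lines can capture. This is a hedged sketch, not the actual visibilitymap function; the constant matches the default 8kB BLCKSZ, and the names are assumptions.

```c
#include <stdint.h>

/* Each VM page tracks many heap blocks (2 bits per heap block), so the
 * VM fork's limit block must be derived from the main fork's truncation
 * length rather than copied from it. Illustrative names; the constant
 * corresponds to (BLCKSZ - page header) * 8 / 2 for 8kB pages. */
#define HEAPBLOCKS_PER_VM_PAGE 32672

static uint32_t
vm_block_for_heap_block(uint32_t heap_blk)
{
    /* the VM block that contains the bits for the given heap block */
    return heap_blk / HEAPBLOCKS_PER_VM_PAGE;
}
```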

Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17
Previously, the comments stated that there was no purpose to considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query but yet we don't want all the result rows).

Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e222534
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.
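
A simplified sketch of the fuzzy dominance test described above: a path dominates another only if it is fuzzily cheaper on both startup and total cost. The 1% fuzz factor mirrors the planner's STD_FUZZ_FACTOR, but the function name and shape here are illustrative, not the actual compare_path_costs_fuzzily() code (which also handles ties, disabled_nodes, and pathkeys).

```c
#include <stdbool.h>

/* Fuzzy dominance under a 1% fuzz factor: illustrative only. */
#define STD_FUZZ_FACTOR 1.01

static bool
path_dominates(double startup1, double total1,
               double startup2, double total2)
{
    /* path 1 dominates path 2 only if fuzzily cheaper on both measures */
    return total1 * STD_FUZZ_FACTOR < total2 &&
           startup1 * STD_FUZZ_FACTOR < startup2;
}
```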

This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com
For a long time, PostgreSQL has had a get_relation_info_hook which
plugins can use to editorialize on the information that
get_relation_info obtains from the catalogs. However, this hook is
only called for baserels of type RTE_RELATION, and there is
potential utility in a similar call back for other types of
RTEs. This might have had utility even before commit
4020b37 added pgs_mask to
RelOptInfo, but it certainly has utility now.

So, move the callback up one level, deleting get_relation_info_hook and
adding build_simple_rel_hook instead. The new callback is called just
slightly later than before and with slightly different arguments, but it
should be fairly straightforward to adjust existing code that currently
uses get_relation_info_hook: the values previously available as
relationObjectId and inhparent are now available via rte->relid and
rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if
desired.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com
This commit makes use of the function added by commit b2898ba
for these applications' handling of conflicting options.  It
doesn't fix any bugs, but it does trim several lines of code.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CACJufxHDYn%2B3-2jR_kwYB0U7UrNP%2B0EPvAWzBBD5EfUzzr1uiw%40mail.gmail.com
What should be used is not "volatile foo *ptr" but "foo *volatile ptr".
The incorrect (former) style means that what the pointer variable points
to is volatile.  The correct (latter) style means that the pointer
variable itself needs to be treated as volatile.  The latter style is
required to ensure consistent treatment of these variables after a
longjmp with the TRY/CATCH blocks.

Some casts can be removed thanks to this change.

Issue introduced by 2e94721, so no backpatch is required.  A
similar set of issues has been fixed in 9300188 for contrib/xml2/.

Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_5BE8DAD985EE140ED62EA728C8D4E1311F0A@qq.com
Crash recovery started without a backup_label previously crashed with a
PANIC if the checkpoint record could not be found.  This commit lowers
the report generated to be a FATAL instead.

With recovery methods being more imaginative these days, this should
provide more flexibility when handling PostgreSQL recovery processing in
the event of a driver error, similarly to 15f68ce.  An extra
benefit of this change is that it becomes possible to add a test to
check that a FATAL is hit with an expected error message pattern.  With
the recovery code becoming more complicated over the last couple of
years, I suspect that this will be benefitial to cover in the long-term.

The original PANIC behavior has been introduced in the early days of
crash recovery, as of 4d14fe0 (PANIC did not exist yet, the code
used STOP).

Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/CAMm1aWZbQ-Acp_xAxC7mX9uZZMH8+NpfepY9w=AOxbBVT9E=uA@mail.gmail.com
Up until now, the only way for a loadable module to disable the use of a
particular index was to use build_simple_rel_hook (or, previous to
yesterday's commit, get_relation_info_hook) to remove it from the index
list. While that works, it has some disadvantages. First, the index
becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.

Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it may be better to fall back to a scan using a disabled
index rather than falling back to an also-disabled sequential scan.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com
A list of expressions with optional AS-labels is useful in a few
different places.  Right now, this is available as xml_attribute_list
because it was first used in the XMLATTRIBUTES construct, but it is
already used elsewhere, and there are other possible future uses.  To
reduce possible confusion going forward, rename it to
labeled_expr_list (like existing expr_list plus ColLabel).

Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
Commit dae761a added initialization of some BrinBuildState fields
in initialize_brin_buildstate(). Later, commit b437571 inadvertently
added the same initialization again.

This commit removes that redundant initialization. No behavioral
change is intended.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nmrca6-9SNChDvRYD6+r==fs9qg5J93kahS7vpoq8QVg@mail.gmail.com
There's no need for a StringInfo when all you want is a string
being constructed in a single pass.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Yang Yuanzhuo <1197620467@qq.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CAEudQAq2wyXZRdsh+wVHcOrungPU+_aQeQU12wbcgrmE0bQovA@mail.gmail.com
Previously heap_inplace_update_and_unlock() used an operation order similar to
MarkBufferDirty(), to reduce the number of different approaches used for
updating buffers.  However, in an upcoming patch, MarkBufferDirtyHint() will
switch to using the update protocol used by most other places (enabled by hint
bits only being set while holding a share-exclusive lock).

Luckily it's pretty easy to adjust heap_inplace_update_and_unlock(). As a
comment already foresaw, we can use the normal order, with the slight change
of updating the buffer contents after WAL logging.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
Align with the convention of using third-person singular (e.g.,
"Shows" instead of "Show") for GUC parameter descriptions.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com
REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.

This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newest XID actually
frozen on the page.

Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
hlinnaka and others added 4 commits March 11, 2026 00:06
Commit 17f51ea introduced a new pendingRecoveryConflicts field in
PGPROC to replace the various ProcSignals. The new field was cleared
in ProcArrayEndTransaction(), which makes sense for conflicts with
e.g. locks or buffer pins which are gone at end of transaction. But it
is not appropriate for conflicts on a database, or a logical slot.

Because of this, the 035_standby_logical_decoding.pl test was
occasionally getting stuck in the buildfarm. It happens if the startup
process signals recovery conflict with the logical slot just when the
walsender process using the slot calls ProcArrayEndTransaction().

To fix, don't clear pendingRecoveryConflicts in
ProcArrayEndTransaction(). We could still clear certain conflict
flags, like conflicts on locks, but we didn't try to do that before
commit 17f51ea either.

In passing, fix a misspelled comment, and make InitAuxiliaryProcess()
also clear pendingRecoveryConflicts. I don't think aux processes can
have recovery conflicts, but it seems best to initialize the field and
keep InitAuxiliaryProcess() as close to InitProcess() as possible.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/3e07149d-060b-48a0-8f94-3d5e4946ae45@gmail.com
c2a23dc removed use of PruneState.frz_conflict_horizon but
neglected to actually remove the member. Do that now.
This commit replaces the per-page buffer read loop in blgetbitmap() with
a read stream, to improve scan efficiency, which is particularly useful
for large bloom indexes.  Some benchmarking with a large number of rows
has shown a very nice improvement in runtime and IO read reduction with
test cases up to 10M rows for a bloom index scan.

For the io_uring method, the author has reported a 3x improvement in
runtime, while I measured close to 7x.  For the worker method with 3
workers, the author has reported better runtime numbers than I did, with
the reduction in IO stats being appealing in all the cases measured.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658d, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic for implementing AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).

It is also just plain odd to allow modification of buffers that are only
share-locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c97) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
@lfittl lfittl force-pushed the fast-timing-v11 branch 5 times, most recently from 911b6ef to a467956 on March 11, 2026 08:17
table_open() is a wrapper around relation_open() that checks that the
relkind is table-like and gives a user-facing error message if not.
It is best used in directly user-facing areas to check that the user
used the right kind of command for the relkind.  In internal uses
where the relkind was previously checked from the user's perspective,
table_open() is not necessary and might even be confusing if it were
to give out-of-context error messages.

In rewriteHandler.c, there were several such table_open() calls, which
this changes to relation_open().  This currently doesn't make a
difference, but there are plans to have other relkinds that could
appear in the rewriter but that shouldn't be accessible via
table-specific commands, and this clears the way for that.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
@lfittl lfittl force-pushed the fast-timing-v11 branch 2 times, most recently from acae748 to d983834 on March 11, 2026 08:38
lfittl added 7 commits March 11, 2026 02:06
Introduce two helpers for CPUID, pg_cpuid and pg_cpuid_subleaf, that
wrap the platform-specific __get_cpuid/__cpuid and
__get_cpuid_count/__cpuidex functions.

Additionally, introduce the CPUIDResult struct to make code working with
CPUID easier to read by referencing the register name (e.g. ECX) instead
of a numeric index.
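
A minimal sketch of what such a wrapper and struct could look like, assuming the GCC/clang <cpuid.h> path; the bool return and exact shape are assumptions, not the committed code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Register names instead of numeric indices, per the commit text. */
typedef struct CPUIDResult
{
    uint32_t eax;
    uint32_t ebx;
    uint32_t ecx;
    uint32_t edx;
} CPUIDResult;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Returns true if the leaf is supported. Sketch of the GCC/clang path. */
static bool
pg_cpuid(uint32_t leaf, CPUIDResult *result)
{
    return __get_cpuid(leaf, &result->eax, &result->ebx,
                       &result->ecx, &result->edx) != 0;
}
#else
/* Non-x86 fallback: CPUID is unavailable, zero the result */
static bool
pg_cpuid(uint32_t leaf, CPUIDResult *result)
{
    (void) leaf;
    memset(result, 0, sizeof(*result));
    return false;
}
#endif
```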

Author: Lukas Fittl <lukas@fittl.com>
Suggested-By: John Naylor <john.naylor@postgresql.org>
Reviewed-by:
Discussion:
Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform. But there
are cases where we want to be able to call __cpuidex as the only viable
option, specifically, when accessing a high leaf like VM Hypervisor
information (0x40000000), which __get_cpuid_count does not allow.

This will be used in a future commit to access Hypervisor information
about the TSC frequency of x86 CPUs, where available.

Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
adjust existing code to check for _MSC_VER when including intrin.h.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
The pg_test_timing program was previously using INSTR_TIME_GET_NANOSEC on an
absolute instr_time value in order to do a diff, which goes against the spirit
of how the GET_* macros are supposed to be used, and will cause overhead in a
future change that assumes these macros are typically used on intervals only.

Additionally the program was doing unnecessary work in the test loop by
measuring the time elapsed, instead of checking the existing current time
measurement against a target end time. To support that, introduce a new
INSTR_TIME_SET_NANOSEC macro that allows initializing an instr_time
variable from a user-defined interval.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion:
…tforms

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
a new pg_ticks_to_ns() function that gets called on all platforms.

This function relies on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.
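
The conversion described above can be sketched with a fixed-point scale factor. This is a hedged sketch under the assumption of 2^32 fixed-point scaling; the committed code may choose a different representation, but the zero-means-nanoseconds convention follows the commit text.

```c
#include <stdint.h>

/* ns per tick, scaled by 2^32; zero means ticks already are nanoseconds */
static uint64_t ticks_per_ns_scaled;

/* Initialize the conversion ratio once at program start, e.g. from
 * QueryPerformanceFrequency() on Windows. Illustrative names. */
static void
pg_initialize_timing(uint64_t ticks_per_second)
{
    if (ticks_per_second == 0)
        ticks_per_ns_scaled = 0;
    else
        ticks_per_ns_scaled =
            (UINT64_C(1000000000) << 32) / ticks_per_second;
}

static uint64_t
pg_ticks_to_ns(uint64_t ticks)
{
    if (ticks_per_ns_scaled == 0)
        return ticks;           /* clock_gettime() path: already ns */
    return (uint64_t) (((__uint128_t) ticks * ticks_per_ns_scaled) >> 32);
}
```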

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
…asurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using the RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that instrumented runtime can be
reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
in case the system clocksource is reported as "tsc". The use of the system
APIs can be enforced by setting "system", or on x86-64 architectures the
use of TSC can be enforced by explicitly setting "tsc".

In order to use the TSC the frequency is first determined by use of CPUID,
and if not available, by running a short calibration loop at program start,
falling back to the system time if TSC values are not stable.

Note that we split TSC usage between the RDTSC CPU instruction, which does
not wait for out-of-order execution (faster, less precise), and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() /
InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To
separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is
introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.
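
The two read paths can be sketched with the compiler intrinsics; mapping the macros onto these intrinsics here is an assumption about the implementation, shown for the GCC/clang x86-64 case only.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>

/* Fast path: RDTSC does not wait for prior instructions to retire */
static inline uint64_t
instr_time_current_fast(void)
{
    return __rdtsc();           /* INSTR_TIME_SET_CURRENT_FAST()-style */
}

/* Precise path: RDTSCP waits for outstanding instructions to retire */
static inline uint64_t
instr_time_current(void)
{
    unsigned int aux;           /* receives IA32_TSC_AUX (CPU id) */

    return __rdtscp(&aux);      /* INSTR_TIME_SET_CURRENT()-style */
}
#endif
```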

Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Similar to the RDTSC/RDTSCP instructions on x86-64, this introduces
use of the cntvct_el0 instruction on ARM systems to access the generic
timer, which provides a synchronized tick value across CPUs.

Note this adds an exception for Apple Silicon CPUs, due to the observed
fact that M3 and newer chips have different timer frequencies for the
Efficiency and the Performance cores, and we can't be sure where we get
scheduled.

To simplify the implementation this does not support Windows on ARM,
since it's quite rare and hard to test.

Relies on the existing timing_clock_source GUC to control whether the
TSC-like timer gets used instead of the system timer.
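
Reading the generic timer can be sketched with inline assembly; the exact barrier usage (isb before the read) and helper names here are assumptions, shown for AArch64 GCC/clang only.

```c
#include <stdint.h>

#if defined(__aarch64__)
/* Read the virtual counter; isb prevents the read from being hoisted
 * ahead of earlier instructions. */
static inline uint64_t
read_cntvct(void)
{
    uint64_t ticks;

    __asm__ __volatile__("isb; mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
}

/* Read the counter frequency (ticks per second) for ticks-to-ns use */
static inline uint64_t
read_cntfrq(void)
{
    uint64_t freq;

    __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
}
#endif
```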

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: