SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348
SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348mason-sharp merged 4 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds/changes spock.wait_for_sync_event procedures and public signature, advances replication origin for empty commits in apply logic, adds a regression test for sync_event/wait behavior, updates Makefile test targets, and adds installation/limitations documentation (including a duplicated subsection). Changes
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
sql/spock--6.0.0-devel.sql (1)
444-508: Same timeout precision and NULL progress_lsn observations apply here as in the migration SQL.This is the base install version of the same procedure. The recommended refactor to use
clock_timestamp()for timeout tracking (instead of the incrementalelapsed_timecounter) applies here as well for consistency.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--6.0.0-devel.sql` around lines 444 - 508, The wait loop in procedure spock.wait_for_sync_event uses an incremental elapsed_time counter and compares progress_lsn directly to lsn without handling NULL; change timeout handling to use clock_timestamp() by recording a start_ts := clock_timestamp() before the loop and replace elapsed_time checks with clock_timestamp() - start_ts >= (timeout || ' seconds')::interval, and guard the progress_lsn vs lsn comparison to handle NULL (e.g. only compare when progress_lsn IS NOT NULL, otherwise treat as not reached). Update references in the procedure to use start_ts, progress_lsn and timeout interval comparison and leave the pg_sleep loop otherwise unchanged.sql/spock--5.0.4--6.0.0-devel.sql (2)
236-238: No row inpg_replication_origin_statussilently leavesprogress_lsnas NULL.If the apply worker hasn't started yet or the origin doesn't exist in
pg_replication_origin_status,progress_lsnstays NULL and the loop spins until timeout. This is likely acceptable, but a debug RAISE NOTICE on the first miss (or a NOT FOUND check) would help users diagnose whywait_for_sync_eventis timing out.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 236 - 238, The SELECT INTO from pg_replication_origin_status can leave progress_lsn NULL when no row exists; update the wait_for_sync_event logic to detect that case by checking the PL/pgSQL NOT FOUND condition (or testing progress_lsn IS NULL) immediately after the SELECT and emit a single RAISE NOTICE on the first miss to aid debugging; introduce a local boolean flag (e.g., first_miss) initialized true, RAISE NOTICE including sub_slot and that pg_replication_origin_status had no row, then set first_miss := false so subsequent loop iterations don’t spam notices, and otherwise continue the existing timeout/loop behavior.
190-254: Timeout tracking uses loop-count-based estimation, not wall-clock time.
elapsed_timeincrements by 0.2 each iteration, but the actual time per iteration includes query execution overhead (the subscription check + the origin status query + ROLLBACK). Under load, the real elapsed time could significantly exceed the requestedtimeout. Consider usingclock_timestamp()for accurate wall-clock tracking, similar towait_for_apply_workerin the same extension.♻️ Proposed fix for accurate timeout tracking
DECLARE target_id oid; - elapsed_time numeric := 0; + start_time timestamptz := clock_timestamp(); progress_lsn pg_lsn; sub_is_enabled bool; sub_slot name; ... - elapsed_time := elapsed_time + .2; - IF timeout <> 0 AND elapsed_time >= timeout THEN + IF timeout <> 0 AND EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= timeout THEN result := false; RETURN; END IF;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 190 - 254, The timeout logic in spock.wait_for_sync_event is using a loop-count-based elapsed_time += 0.2 which underestimates real wall-clock time; change to record a start timestamp (e.g., start_ts := clock_timestamp()) at the top of the procedure and replace the elapsed_time numeric increments with a wall-clock check like extract(epoch FROM (clock_timestamp() - start_ts)) to compare against timeout (seconds), keeping the existing timeout semantics and the rest of the loop (subscription checks, progress_lsn query, ROLLBACK, pg_sleep) unchanged; update or remove the elapsed_time variable and ensure comparisons use the new timestamp-based elapsed calculation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 236-238: The SELECT INTO from pg_replication_origin_status can
leave progress_lsn NULL when no row exists; update the wait_for_sync_event logic
to detect that case by checking the PL/pgSQL NOT FOUND condition (or testing
progress_lsn IS NULL) immediately after the SELECT and emit a single RAISE
NOTICE on the first miss to aid debugging; introduce a local boolean flag (e.g.,
first_miss) initialized true, RAISE NOTICE including sub_slot and that
pg_replication_origin_status had no row, then set first_miss := false so
subsequent loop iterations don’t spam notices, and otherwise continue the
existing timeout/loop behavior.
- Around line 190-254: The timeout logic in spock.wait_for_sync_event is using a
loop-count-based elapsed_time += 0.2 which underestimates real wall-clock time;
change to record a start timestamp (e.g., start_ts := clock_timestamp()) at the
top of the procedure and replace the elapsed_time numeric increments with a
wall-clock check like extract(epoch FROM (clock_timestamp() - start_ts)) to
compare against timeout (seconds), keeping the existing timeout semantics and
the rest of the loop (subscription checks, progress_lsn query, ROLLBACK,
pg_sleep) unchanged; update or remove the elapsed_time variable and ensure
comparisons use the new timestamp-based elapsed calculation.
In `@sql/spock--6.0.0-devel.sql`:
- Around line 444-508: The wait loop in procedure spock.wait_for_sync_event uses
an incremental elapsed_time counter and compares progress_lsn directly to lsn
without handling NULL; change timeout handling to use clock_timestamp() by
recording a start_ts := clock_timestamp() before the loop and replace
elapsed_time checks with clock_timestamp() - start_ts >= (timeout || '
seconds')::interval, and guard the progress_lsn vs lsn comparison to handle NULL
(e.g. only compare when progress_lsn IS NOT NULL, otherwise treat as not
reached). Update references in the procedure to use start_ts, progress_lsn and
timeout interval comparison and leave the pg_sleep loop otherwise unchanged.
danolivo
left a comment
There was a problem hiding this comment.
It is necessary to address the general issues mentioned before we allow this core integration to spread further throughout our code.
| if (replorigin_session_origin != InvalidRepOriginId && | ||
| replorigin_session_origin != DoNotReplicateId) | ||
| { | ||
| replorigin_session_advance(end_lsn, InvalidXLogRecPtr); |
There was a problem hiding this comment.
Using this machinery, Spock refers to the maximum number of replication states defined by the GUC max_active_replication_origins (which can be changed only at restart). We need to document it and provide recommendations for its use, as we do for other important parameters (see, for example, max_replication_slots).
There was a problem hiding this comment.
Ah this is added in PG 18+. Added a note in docs.
| * pg_replication_origin_status.remote_lsn is updated. | ||
| */ | ||
| if (replorigin_session_origin != InvalidRepOriginId && | ||
| replorigin_session_origin != DoNotReplicateId) |
There was a problem hiding this comment.
replorigin_session_origin refers to shared memory. The same does Postgres logical replication. The address it points to is based on an OriginID value. The originID spaces for pgoutput and Spock are separate, and generated values can accidentally overlap. In this case, we would have concurrent access to and rewriting of local_lsn and remote_lsn with values from entirely different address spaces.
It looks like to use replorigin_session_advance() correctly, you must guarantee that the originIDs of the two tools do not intersect, or disable pgoutput altogether.
There was a problem hiding this comment.
A very low probability of actually happening. Spock and pgoutput both use PostgreSQL's pg_replication_origin catalog—there's no separate origin ID space. They also use different naming prefixes (spk_* vs pg_*), so no chance of collision. PostgreSQL protects against concurrent access via ReplicationOriginLock. The replorigin_session_advance() call only executes within Spock's apply worker and advances the slot owned by that worker. Each subscription creates its own slot with a unique origin; there's no chance of collision.
There was a problem hiding this comment.
@rasifr , just look how the session_replication_state is initialised. It refers to the shared array element - replication_states[i]. The 'ownership' is identified by the originID value. The 'uniqueness' of the Origin ID is separate in Spock and Native LR.
Aside from the question of collision probability between the Spock and upstream algorithms, there is an obvious case in which the native LR will 'advance' the same shmem area as Spock, with totally unrelated data.
It seems quite a typical case when a user downloads some data via LR from a legacy cloud provider to a cutting-edge multimaster Spock configuration and experiences strange replication failures. I do strictly recommend at least reducing the chance to the bare minimum, and let us identify potential cases by using a specific range of origin IDs.
But generally, we should avoid referring to any native LR structures to avoid potential collisions.
There was a problem hiding this comment.
Thinking deeper into this, I think I understand your concern now. You are pointing out how Spock replaces PG native origin id with Spock's node_id, right?
That's a valid concern. This is a limitation of the architecture. a design decision since 4.x (commit 3165452). I will create a ticket to track this. For now, I've documented the limitation in docs/limitations.md.
There was a problem hiding this comment.
@rasifr , commit, you're referring to is a good commit that have no correlation to the issue I pointed out.
The problem I want to draw your attention to is the unavoidable conflict between native LR and the Spock one, introduced recently with reference to replication_states[].
Just for clarity, if it is ok to have potential interference with native LR, there is no reason to generate subscription IDs separately - just reuse SubscriptionRelationId/GetNewOidWithIndex() and have standard subscription OIDs instead of dangerously unstable hash values, right? But we don't do it and manage subscriptions on our own. So, why do we allow the danger of inconsistent slot advancement?
4ed0d10 to
563616f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
sql/spock--5.0.4--6.0.0-devel.sql (1)
186-250: Same concerns as inspock--6.0.0-devel.sqlapply here.The procedure body is identical to the base extension file. The issues flagged there (replication origin
external_idvssub_slotassumption, timeout drift, NULL error message) apply equally to this migration copy.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 186 - 250, The spock.wait_for_sync_event procedure duplicates the same issues: it assumes pg_replication_origin_status.external_id equals sub_slot, uses elapsed_time with sleep causing timeout drift, and raises exceptions with NULL values; fix by (1) resolving the correct origin lookup (match pg_replication_origin_status using external_id = sub_slot only after validating sub_slot is NOT NULL and/or use the appropriate origin identifier stored for the subscription), (2) replace the elapsed_time accumulation with a monotonic timestamp check (record start := clock_timestamp() and compare clock_timestamp() - start >= make_interval(secs => timeout)), and (3) avoid including raw possibly-NULL identifiers in RAISE EXCEPTION messages (coalesce or format the message to handle NULLs), updating references in this procedure to sub_slot, progress_lsn, elapsed_time, timeout, and the SELECT from pg_replication_origin_status accordingly.
🧹 Nitpick comments (1)
sql/spock--6.0.0-devel.sql (1)
498-505: Timeout tracks only sleep time, not wall-clock time.
elapsed_timeaccumulates only the fixed 0.2s sleep increments and ignores the time spent executing queries and theROLLBACK. Under load, the actual wall-clock duration could significantly exceed the requestedtimeout. Note that the sibling functionwait_for_apply_worker(Line 668) usesclock_timestamp()for accurate wall-clock tracking.Use clock_timestamp() for accurate timeout
DECLARE target_id oid; - elapsed_time numeric := 0; + start_time timestamptz := clock_timestamp(); progress_lsn pg_lsn; sub_is_enabled bool; sub_slot name; ... - elapsed_time := elapsed_time + .2; - IF timeout <> 0 AND elapsed_time >= timeout THEN + IF timeout <> 0 AND + EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= timeout THEN result := false; RETURN; END IF;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--6.0.0-devel.sql` around lines 498 - 505, The loop currently increments elapsed_time by fixed 0.2 and thus tracks only sleep time; change the timeout check to use wall-clock time via clock_timestamp(): record a start_ts (e.g., start_ts := clock_timestamp()) before the loop and inside the loop compute elapsed := EXTRACT(EPOCH FROM (clock_timestamp() - start_ts)) (or update elapsed_time from that expression) and use that elapsed value for the IF timeout <> 0 AND elapsed >= timeout THEN check so ROLLBACK and query execution time are included (see the sibling function wait_for_apply_worker for the precise approach).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 457-459: The exception message uses origin_id when origin_id IS
NULL so it always prints '<NULL>'; update the RAISE EXCEPTION to include a
useful identifier instead (for example use the name-based variable origin or
other contextual variable already computed earlier) or omit the % placeholder
and provide a clear static message; locate the block checking origin_id (the
variables origin_id and origin are in scope) and change the RAISE EXCEPTION to
reference origin (the node name) or add both origin_id and origin name to the
message so it no longer prints only '<NULL>'.
---
Duplicate comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 186-250: The spock.wait_for_sync_event procedure duplicates the
same issues: it assumes pg_replication_origin_status.external_id equals
sub_slot, uses elapsed_time with sleep causing timeout drift, and raises
exceptions with NULL values; fix by (1) resolving the correct origin lookup
(match pg_replication_origin_status using external_id = sub_slot only after
validating sub_slot is NOT NULL and/or use the appropriate origin identifier
stored for the subscription), (2) replace the elapsed_time accumulation with a
monotonic timestamp check (record start := clock_timestamp() and compare
clock_timestamp() - start >= make_interval(secs => timeout)), and (3) avoid
including raw possibly-NULL identifiers in RAISE EXCEPTION messages (coalesce or
format the message to handle NULLs), updating references in this procedure to
sub_slot, progress_lsn, elapsed_time, timeout, and the SELECT from
pg_replication_origin_status accordingly.
---
Nitpick comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 498-505: The loop currently increments elapsed_time by fixed 0.2
and thus tracks only sleep time; change the timeout check to use wall-clock time
via clock_timestamp(): record a start_ts (e.g., start_ts := clock_timestamp())
before the loop and inside the loop compute elapsed := EXTRACT(EPOCH FROM
(clock_timestamp() - start_ts)) (or update elapsed_time from that expression)
and use that elapsed value for the IF timeout <> 0 AND elapsed >= timeout THEN
check so ROLLBACK and query execution time are included (see the sibling
function wait_for_apply_worker for the precise approach).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/regress/expected/sync_event.outis excluded by!**/*.out
📒 Files selected for processing (6)
Makefiledocs/install_spock.mdsql/spock--5.0.4--6.0.0-devel.sqlsql/spock--6.0.0-devel.sqlsrc/spock_apply.ctests/regress/sql/sync_event.sql
✅ Files skipped from review due to trivial changes (1)
- docs/install_spock.md
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/regress/sql/sync_event.sql
- src/spock_apply.c
- Makefile
563616f to
cec39c9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
sql/spock--5.0.4--6.0.0-devel.sql (2)
241-244:⚠️ Potential issue | 🟡 MinorSame
timeout IS NULLsilent infinite-loop risk as in the fresh-install script.The identical procedure body at Line 241 carries the same
timeout <> 0NULL-evaluation bug flagged insql/spock--6.0.0-devel.sqlLines 499-503. Apply the same fix here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 241 - 244, The IF that checks timeout currently uses "timeout <> 0" which evaluates to NULL when timeout is NULL and silently skips the timeout branch; update the condition in the procedure body that uses variables timeout, start_time, result and clock_timestamp() to explicitly guard against NULL (e.g. use "timeout IS NOT NULL AND timeout <> 0" or "timeout IS DISTINCT FROM 0") so the timeout check behaves identically to the fix in spock--6.0.0-devel.sql; keep the rest of the block (setting result := false; RETURN;) unchanged.
205-207: Same deadsub_is_enabledinitial fetch as in the fresh-install script.Applies identically here; see the optional refactor comment on
sql/spock--6.0.0-devel.sqlLines 463-465.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 205 - 207, The SELECT that fetches sub_enabled and sub_slot_name into variables sub_is_enabled and sub_slot is redundant here (same dead fetch as in the fresh-install script); remove or refactor it so you only read the subscription fields when actually needed—either delete the SELECT entirely if sub_is_enabled isn't used later, or change it to fetch only sub_slot_name INTO sub_slot (or perform the lookup lazily where used). Adjust references to sub_is_enabled/sub_slot accordingly to match the optional refactor used in sql/spock--6.0.0-devel.sql.
🧹 Nitpick comments (1)
sql/spock--6.0.0-devel.sql (1)
463-465:sub_is_enabledfetched initially but never used before it is overwritten.The pre-loop
SELECT sub_enabled, sub_slot_name INTO sub_is_enabled, sub_slotpopulatessub_is_enabled, but the loop unconditionally re-fetchessub_is_enabledat Line 474 before the value is ever read. The initial fetch ofsub_is_enabledis dead — onlysub_slotmatters from that query. Droppingsub_is_enabledfrom the initial SELECT removes the misleading dead assignment.♻️ Proposed refactor
- SELECT sub_enabled, sub_slot_name INTO sub_is_enabled, sub_slot - FROM spock.subscription - WHERE sub_origin = origin_id AND sub_target = target_id; + SELECT sub_slot_name INTO sub_slot + FROM spock.subscription + WHERE sub_origin = origin_id AND sub_target = target_id;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--6.0.0-devel.sql` around lines 463 - 465, The initial SELECT into sub_is_enabled and sub_slot from spock.subscription (SELECT sub_enabled, sub_slot_name INTO sub_is_enabled, sub_slot WHERE sub_origin = origin_id AND sub_target = target_id) assigns sub_is_enabled but it is overwritten later in the loop; remove sub_is_enabled from this pre-loop SELECT so only sub_slot is fetched (SELECT sub_slot_name INTO sub_slot ...), leaving the later in-loop retrieval of sub_is_enabled intact; update the INTO list and selected columns accordingly to avoid the dead assignment while keeping the later logic that re-fetches sub_is_enabled unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/limitations.md`:
- Around line 170-174: The file contains a duplicated subsection titled "Mixed
Spock and Native Logical Replication"; locate the second (identical) occurrence
of that subsection and remove it so the document only contains a single
instance. Ensure you remove the entire duplicated block including the heading
and the recommendation paragraph, and keep the first original occurrence intact
to preserve the intended content.
In `@sql/spock--6.0.0-devel.sql`:
- Around line 499-503: Replace the NULL-sensitive condition so explicit NULL
timeouts don't bypass the timeout check: change the clause that currently reads
"IF timeout <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >=
timeout THEN" to use COALESCE(timeout, 0) (or explicitly check timeout IS NOT
NULL) — i.e., use "IF COALESCE(timeout, 0) <> 0 AND EXTRACT(EPOCH FROM
(clock_timestamp() - start_time)) >= COALESCE(timeout, 0) THEN" (referencing the
timeout, start_time and result variables in this procedure) so NULL passed for
timeout is treated as 0 and the infinite loop is prevented.
---
Duplicate comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 241-244: The IF that checks timeout currently uses "timeout <> 0"
which evaluates to NULL when timeout is NULL and silently skips the timeout
branch; update the condition in the procedure body that uses variables timeout,
start_time, result and clock_timestamp() to explicitly guard against NULL (e.g.
use "timeout IS NOT NULL AND timeout <> 0" or "timeout IS DISTINCT FROM 0") so
the timeout check behaves identically to the fix in spock--6.0.0-devel.sql; keep
the rest of the block (setting result := false; RETURN;) unchanged.
- Around line 205-207: The SELECT that fetches sub_enabled and sub_slot_name
into variables sub_is_enabled and sub_slot is redundant here (same dead fetch as
in the fresh-install script); remove or refactor it so you only read the
subscription fields when actually needed—either delete the SELECT entirely if
sub_is_enabled isn't used later, or change it to fetch only sub_slot_name INTO
sub_slot (or perform the lookup lazily where used). Adjust references to
sub_is_enabled/sub_slot accordingly to match the optional refactor used in
sql/spock--6.0.0-devel.sql.
---
Nitpick comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 463-465: The initial SELECT into sub_is_enabled and sub_slot from
spock.subscription (SELECT sub_enabled, sub_slot_name INTO sub_is_enabled,
sub_slot WHERE sub_origin = origin_id AND sub_target = target_id) assigns
sub_is_enabled but it is overwritten later in the loop; remove sub_is_enabled
from this pre-loop SELECT so only sub_slot is fetched (SELECT sub_slot_name INTO
sub_slot ...), leaving the later in-loop retrieval of sub_is_enabled intact;
update the INTO list and selected columns accordingly to avoid the dead
assignment while keeping the later logic that re-fetches sub_is_enabled
unchanged.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/regress/expected/sync_event.outis excluded by!**/*.out
📒 Files selected for processing (4)
docs/install_spock.mddocs/limitations.mdsql/spock--5.0.4--6.0.0-devel.sqlsql/spock--6.0.0-devel.sql
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/install_spock.md
| ### Mixed Spock and Native Logical Replication | ||
|
|
||
| Spock uses its own 16-bit node_id (derived from node name hash) to track transaction origins in commit timestamps, rather than PostgreSQL's RepOriginId. Both ID spaces overlap (0-65535), which can cause ambiguity if Spock and native PostgreSQL logical replication run on the same database. | ||
|
|
||
| **Recommendation:** Avoid running Spock and native logical replication subscriptions on the same database. |
There was a problem hiding this comment.
Duplicate subsection — remove the second copy.
The AI-generated summary for this file explicitly notes that this "Mixed Spock and Native Logical Replication" subsection is inserted a second time later in the document with identical content. Please remove the duplicate occurrence.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/limitations.md` around lines 170 - 174, The file contains a duplicated
subsection titled "Mixed Spock and Native Logical Replication"; locate the
second (identical) occurrence of that subsection and remove it so the document
only contains a single instance. Ensure you remove the entire duplicated block
including the heading and the recommendation paragraph, and keep the first
original occurrence intact to preserve the intended content.
| IF timeout <> 0 AND | ||
| EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN | ||
| result := false; | ||
| RETURN; | ||
| END IF; |
There was a problem hiding this comment.
Silent infinite loop when timeout is explicitly passed as NULL.
timeout <> 0 evaluates to NULL (not TRUE) when timeout IS NULL, so the timeout branch is never taken and the procedure loops forever. The parameter has DEFAULT 0, so ordinary callers are unaffected, but a caller who explicitly passes NULL hits a deadlock-style hang with no error.
🛡️ Proposed fix
- IF timeout <> 0 AND
- EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN
+ IF (timeout IS NOT NULL AND timeout <> 0) AND
+ EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| IF timeout <> 0 AND | |
| EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN | |
| result := false; | |
| RETURN; | |
| END IF; | |
| IF (timeout IS NOT NULL AND timeout <> 0) AND | |
| EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN | |
| result := false; | |
| RETURN; | |
| END IF; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sql/spock--6.0.0-devel.sql` around lines 499 - 503, Replace the
NULL-sensitive condition so explicit NULL timeouts don't bypass the timeout
check: change the clause that currently reads "IF timeout <> 0 AND EXTRACT(EPOCH
FROM (clock_timestamp() - start_time)) >= timeout THEN" to use COALESCE(timeout,
0) (or explicitly check timeout IS NOT NULL) — i.e., use "IF COALESCE(timeout,
0) <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >=
COALESCE(timeout, 0) THEN" (referencing the timeout, start_time and result
variables in this procedure) so NULL passed for timeout is treated as 0 and the
infinite loop is prevented.
Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn to track apply progress. This required Spock to maintain its own progress table, adding complexity. This change switches to using PostgreSQL's native pg_replication_origin_status view. The key insight is that PostgreSQL's RecordTransactionCommit() only calls replorigin_session_advance() when there is actual WAL written. For empty transactions (like sync_event), this never happens. The fix adds explicit replorigin_session_advance() call in handle_commit() for empty transactions, ensuring pg_replication_origin_status.remote_lsn is advanced even when no WAL is written. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cec39c9 to
ef65317
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
sql/spock--6.0.0-devel.sql (1)
499-503: Silent infinite loop whentimeoutis explicitly passed asNULL.
timeout <> 0evaluates toNULL(notTRUE) whentimeout IS NULL, so the timeout branch is never taken and the procedure loops forever.Proposed fix
- IF timeout <> 0 AND - EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN + IF COALESCE(timeout, 0) <> 0 AND + EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--6.0.0-devel.sql` around lines 499 - 503, The IF branch that checks timeout currently uses "timeout <> 0" which yields NULL when timeout IS NULL and prevents the timeout logic from running; update the condition in that IF (the block referencing timeout, start_time, result and clock_timestamp()) to handle NULL explicitly — e.g. use COALESCE(timeout, 0) <> 0 AND EXTRACT(...) >= timeout or add an explicit "timeout IS NOT NULL AND timeout <> 0 AND ..." so a NULL timeout no longer causes a silent infinite loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@sql/spock--6.0.0-devel.sql`:
- Around line 499-503: The IF branch that checks timeout currently uses "timeout
<> 0" which yields NULL when timeout IS NULL and prevents the timeout logic from
running; update the condition in that IF (the block referencing timeout,
start_time, result and clock_timestamp()) to handle NULL explicitly — e.g. use
COALESCE(timeout, 0) <> 0 AND EXTRACT(...) >= timeout or add an explicit
"timeout IS NOT NULL AND timeout <> 0 AND ..." so a NULL timeout no longer
causes a silent infinite loop.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/regress/expected/sync_event.outis excluded by!**/*.out
📒 Files selected for processing (7)
Makefiledocs/install_spock.mddocs/limitations.mdsql/spock--5.0.4--6.0.0-devel.sqlsql/spock--6.0.0-devel.sqlsrc/spock_apply.ctests/regress/sql/sync_event.sql
🚧 Files skipped from review as they are similar to previous changes (4)
- sql/spock--5.0.4--6.0.0-devel.sql
- Makefile
- tests/regress/sql/sync_event.sql
- docs/limitations.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The two wait_for_sync_event() procedure overloads differed only in how they received the origin parameter (name vs OID). Refactor to have the name-based version resolve the node name to OID, then delegate to the OID-based version, reducing code duplication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add note about max_active_replication_origins for PostgreSQL 18+ (separate GUC from max_replication_slots) - Document limitation: avoid mixing Spock with native logical replication on the same database due to overlapping origin ID spaces Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ef65317 to
893c530
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
sql/spock--6.0.0-devel.sql (1)
499-500:⚠️ Potential issue | 🟡 MinorGuard against explicit
NULLtimeout to avoid silent unbounded waits.On Line 499,
timeout <> 0evaluates toNULLwhentimeoutis explicitlyNULL, so the timeout path is skipped forever.🛡️ Proposed fix
BEGIN IF origin_id IS NULL THEN RAISE EXCEPTION 'Invalid NULL origin_id'; END IF; + IF timeout IS NULL THEN + RAISE EXCEPTION 'Invalid NULL timeout'; + END IF; target_id := node_id FROM spock.node_info();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--6.0.0-devel.sql` around lines 499 - 500, The IF guard currently uses "timeout <> 0" which yields NULL when timeout is explicitly NULL and therefore skips the timeout check; update the condition to explicitly handle NULL (e.g., check timeout IS NOT NULL AND timeout <> 0, or use COALESCE(timeout, 0) <> 0) so the branch executes only for a real non-zero timeout before evaluating EXTRACT(EPOCH FROM (clock_timestamp() - start_time)); reference the timeout variable and start_time/clock_timestamp() call in the IF condition to locate and update the logic.sql/spock--5.0.4--6.0.0-devel.sql (1)
241-243:⚠️ Potential issue | 🟡 MinorKeep upgrade path consistent: same nullable-timeout trap exists here.
Line 241 has the same
timeout <> 0NULL-sensitive behavior, so explicitNULLcan bypass timeout handling in upgraded installations too.🛡️ Proposed fix
BEGIN IF origin_id IS NULL THEN RAISE EXCEPTION 'Invalid NULL origin_id'; END IF; + IF timeout IS NULL THEN + RAISE EXCEPTION 'Invalid NULL timeout'; + END IF; target_id := node_id FROM spock.node_info();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 241 - 243, The timeout check is NULL-sensitive and can be bypassed if timeout is NULL; update the condition in the block that currently reads "IF timeout <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN" to explicitly handle NULLs (e.g., "IF timeout IS NOT NULL AND timeout <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN" or use COALESCE(timeout, 0) in the comparison) so upgraded installations cannot bypass the timeout; modify the conditional where variables timeout, start_time, and result are used accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 241-243: The timeout check is NULL-sensitive and can be bypassed
if timeout is NULL; update the condition in the block that currently reads "IF
timeout <> 0 AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout
THEN" to explicitly handle NULLs (e.g., "IF timeout IS NOT NULL AND timeout <> 0
AND EXTRACT(EPOCH FROM (clock_timestamp() - start_time)) >= timeout THEN" or use
COALESCE(timeout, 0) in the comparison) so upgraded installations cannot bypass
the timeout; modify the conditional where variables timeout, start_time, and
result are used accordingly.
In `@sql/spock--6.0.0-devel.sql`:
- Around line 499-500: The IF guard currently uses "timeout <> 0" which yields
NULL when timeout is explicitly NULL and therefore skips the timeout check;
update the condition to explicitly handle NULL (e.g., check timeout IS NOT NULL
AND timeout <> 0, or use COALESCE(timeout, 0) <> 0) so the branch executes only
for a real non-zero timeout before evaluating EXTRACT(EPOCH FROM
(clock_timestamp() - start_time)); reference the timeout variable and
start_time/clock_timestamp() call in the IF condition to locate and update the
logic.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/regress/expected/sync_event.outis excluded by!**/*.out
📒 Files selected for processing (6)
Makefiledocs/install_spock.mddocs/limitations.mdsql/spock--5.0.4--6.0.0-devel.sqlsql/spock--6.0.0-devel.sqltests/regress/sql/sync_event.sql
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/regress/sql/sync_event.sql
- Makefile
- docs/install_spock.md
Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn
to track apply progress. This required Spock to maintain its own progress
table, adding complexity.
This change switches to using PostgreSQL's native pg_replication_origin_status
view. The key insight is that PostgreSQL's RecordTransactionCommit() only
calls replorigin_session_advance() when there is actual WAL written. For
empty transactions (like sync_event), this never happens.
The fix adds explicit replorigin_session_advance() call in handle_commit()
for empty transactions, ensuring pg_replication_origin_status.remote_lsn
is advanced even when no WAL is written.