Skip to content

Fix interfaces test to wait for replicating state and slot confirmation#360

Merged
mason-sharp merged 1 commit intomainfrom
spoc-449
Feb 27, 2026
Merged

Fix interfaces test to wait for replicating state and slot confirmation#360
mason-sharp merged 1 commit intomainfrom
spoc-449

Conversation

@danolivo
Copy link
Copy Markdown
Contributor

  • After sub_alter_interface, wait for the apply worker to reach replicating status instead of just checking it is not down. The previous check could pass while the worker was still in a transitional state (e.g., initializing), causing subsequent assertions to fail intermittently.
  • Add wait_slot_confirm_lsn(NULL, NULL) on the provider before checking replication slot activity, ensuring all pending changes are flushed and preventing a race where the slot appears momentarily inactive.

@danolivo danolivo self-assigned this Feb 24, 2026
@danolivo danolivo added the enhancement New feature or request label Feb 24, 2026
@danolivo danolivo requested a review from mason-sharp February 24, 2026 13:37
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 24, 2026

📝 Walkthrough

Walkthrough

Revised SQL test to change replication status polling from status = 'down' to status != 'replicating', added SELECT spock.wait_slot_confirm_lsn(NULL, NULL); calls with a duplicated \c :provider_dsn to ensure the wait executes in the intended session, and reorganized the wait/poll sections.

Changes

Cohort / File(s) Summary
Test Regression Suite - SQL
tests/regress/sql/interfaces.sql
Replaced status = 'down' check with status != 'replicating'. Inserted SELECT spock.wait_slot_confirm_lsn(NULL, NULL); after reconnecting with \c :provider_dsn (duplicated to ensure session context). Added an explicit wait-loop section and minor formatting adjustments.

Poem

🐰 I hopped into the replication glen,
Swapped a strict 'down' for broader ken,
I waited on LSN with patient cheer,
Reconnected once — then once more near,
Now tests can hop forward, loud and clear. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding waits for replicating state and slot confirmation in the interfaces test.
Description check ✅ Passed The description clearly explains both key changes: waiting for replicating status after sub_alter_interface and adding wait_slot_confirm_lsn to prevent race conditions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch spoc-449

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/regress/sql/interfaces.sql (1)

65-73: ⚠️ Potential issue | 🟠 Major

Second wait loop was not updated — same early-exit race as the one fixed above

The PR objective is to wait until the worker reaches 'replicating' status, yet this loop (after reverting back to 'test_provider') still exits as soon as status is no longer 'down', leaving the same transitional-state race condition that motivated this PR.

🔧 Proposed consistency fix
 DO $$
 BEGIN
-    WHILE EXISTS (SELECT 1 FROM spock.sub_show_status() WHERE status = 'down')
+    WHILE EXISTS (SELECT 1 FROM spock.sub_show_status()
+                  WHERE status NOT IN ('replicating', 'disabled', 'apply_error'))
 	LOOP
 	-- TODO: The multimaster testing buildfarm should have a general parameter
 	-- like 'wait_change_timeout' that may be used to take control over infinite
 	-- waiting cycles.
     END LOOP;
 END;$$;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/interfaces.sql` around lines 65 - 73, The WHILE loop that
queries spock.sub_show_status() currently exits as soon as status != 'down',
which recreates the transitional-state race; change the loop condition to wait
until the worker reaches 'replicating' (e.g., use WHILE NOT EXISTS (SELECT 1
FROM spock.sub_show_status() WHERE status = 'replicating') LOOP or WHILE EXISTS
(... WHERE status <> 'replicating') LOOP) so the loop only finishes when status
is explicitly 'replicating', updating the loop surrounding the
revert-to-'test_provider' block accordingly.
🧹 Nitpick comments (1)
tests/regress/sql/interfaces.sql (1)

49-51: Redundant \c :provider_dsn at line 50

Line 48 (unchanged) already reconnects to :provider_dsn. The newly added line 50 reconnects to the same DSN immediately after, making it a no-op that adds noise.

🧹 Proposed cleanup
 \c :provider_dsn
-
-\c :provider_dsn
 SELECT spock.wait_slot_confirm_lsn(NULL, NULL);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/interfaces.sql` around lines 49 - 51, Remove the redundant
reconnect command by deleting the extra "\c :provider_dsn" that immediately
precedes the call to spock.wait_slot_confirm_lsn(NULL, NULL) in
tests/regress/sql/interfaces.sql; if a different connection was intended,
replace that line with the correct DSN, otherwise just remove the duplicate
reconnect so the file only reconnects once to :provider_dsn before calling
spock.wait_slot_confirm_lsn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/regress/sql/interfaces.sql`:
- Around line 38-42: The loop condition using spock.sub_show_status() currently
checks WHERE status != 'replicating', which can hang if status becomes
'initializing' or 'disabled'; change those WHERE clauses (including the similar
one later) to explicitly only wait for the non-terminal 'down' state (e.g. WHERE
status = 'down' or WHERE status NOT IN
('replicating','initializing','disabled')) so the loop can exit on terminal
states; also remove the redundant connection directive "\c :provider_dsn" (the
earlier \c :provider_dsn already switches to the provider DSN).

---

Outside diff comments:
In `@tests/regress/sql/interfaces.sql`:
- Around line 65-73: The WHILE loop that queries spock.sub_show_status()
currently exits as soon as status != 'down', which recreates the
transitional-state race; change the loop condition to wait until the worker
reaches 'replicating' (e.g., use WHILE NOT EXISTS (SELECT 1 FROM
spock.sub_show_status() WHERE status = 'replicating') LOOP or WHILE EXISTS (...
WHERE status <> 'replicating') LOOP) so the loop only finishes when status is
explicitly 'replicating', updating the loop surrounding the
revert-to-'test_provider' block accordingly.

---

Nitpick comments:
In `@tests/regress/sql/interfaces.sql`:
- Around line 49-51: Remove the redundant reconnect command by deleting the
extra "\c :provider_dsn" that immediately precedes the call to
spock.wait_slot_confirm_lsn(NULL, NULL) in tests/regress/sql/interfaces.sql; if
a different connection was intended, replace that line with the correct DSN,
otherwise just remove the duplicate reconnect so the file only reconnects once
to :provider_dsn before calling spock.wait_slot_confirm_lsn.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c94cb28 and 0f7cdcf.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/interfaces.out is excluded by !**/*.out
📒 Files selected for processing (1)
  • tests/regress/sql/interfaces.sql

Comment thread tests/regress/sql/interfaces.sql
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/regress/sql/interfaces.sql (1)

38-39: ⚠️ Potential issue | 🟠 Major

Guard the “wait for replicating” loops with a timeout or explicit failure path.

Right now the loops at Line 38–39 and Line 65 can spin forever if the apply worker never reaches replicating (e.g., stalled or disabled), which will hang the test indefinitely. Consider adding a short sleep plus a timeout (and/or raising an exception) so failures are explicit and bounded.

🔧 Example timeout guard (apply to both loops)
 DO $$
+DECLARE
+    start_ts timestamptz := clock_timestamp();
 BEGIN
     WHILE EXISTS (SELECT 1 FROM spock.sub_show_status()
 				  WHERE status != 'replicating')
 	LOOP
+        PERFORM pg_sleep(0.1);
+        IF clock_timestamp() - start_ts > interval '60 seconds' THEN
+            RAISE EXCEPTION 'Timed out waiting for subscription to reach replicating status';
+        END IF;
     END LOOP;
 END;$$;

Also applies to: 65-65

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/interfaces.sql` around lines 38 - 39, The WHILE loops that
poll spock.sub_show_status() for status != 'replicating' should be bounded:
modify the polling logic in the procedure containing the WHILE (the loop that
calls spock.sub_show_status()) to record a start time (e.g.,
now()/clock_timestamp()), sleep briefly inside each iteration (e.g., pg_sleep),
and if elapsed time exceeds a configurable timeout threshold then RAISE
EXCEPTION (or RETURN/FAIL the test) so the test does not spin forever; apply the
same timeout/sleep pattern to both occurrences that poll spock.sub_show_status()
so failures are explicit and bounded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/regress/sql/interfaces.sql`:
- Around line 38-39: The WHILE loops that poll spock.sub_show_status() for
status != 'replicating' should be bounded: modify the polling logic in the
procedure containing the WHILE (the loop that calls spock.sub_show_status()) to
record a start time (e.g., now()/clock_timestamp()), sleep briefly inside each
iteration (e.g., pg_sleep), and if elapsed time exceeds a configurable timeout
threshold then RAISE EXCEPTION (or RETURN/FAIL the test) so the test does not
spin forever; apply the same timeout/sleep pattern to both occurrences that poll
spock.sub_show_status() so failures are explicit and bounded.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7cdcf and 0de57e5.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/interfaces.out is excluded by !**/*.out
📒 Files selected for processing (1)
  • tests/regress/sql/interfaces.sql

@mason-sharp mason-sharp merged commit 337f07b into main Feb 27, 2026
10 checks passed
@mason-sharp mason-sharp deleted the spoc-449 branch February 27, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants