fix(node): race on caught up #3133

Merged — julienrbrt merged 1 commit into main from alex/sync_race, Mar 4, 2026

Conversation

@alpe (Contributor) commented Mar 4, 2026

Overview

Add Syncer.PendingCount and use it to ensure the sync pipeline is drained before marking the node as caught up in failover.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced catch-up completion detection to ensure the system verifies all pending synchronization tasks are fully processed before reporting a caught-up state and initiating block production. This improves reliability of the synchronization mechanism.

@coderabbitai (Contributor) bot commented Mar 4, 2026

📝 Walkthrough

Two modifications enable the block syncing pipeline to track pending events: a new PendingCount() method exposes unprocessed height events from the Syncer, and the catch-up completion logic now requires the syncing pipeline to be fully drained alongside DA and P2P consensus before resuming block production.

Changes

  • block/internal/syncing/syncer.go (Syncer Pipeline Introspection): adds a public PendingCount() method returning the count of unprocessed height events in the heightInCh channel.
  • node/failover.go (Catch-up Completion Gate): extends the catch-up completion check to require pipeline drainage (Syncer is nil or PendingCount() returns 0) in addition to DA and P2P consensus status.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Node as Node
    participant Failover as Failover Loop
    participant Syncer as Syncer
    participant DA as DA Status
    participant P2P as P2P Status

    loop Catch-up Check
        Failover->>DA: Check if caught up
        DA-->>Failover: daCaughtUp status
        Failover->>P2P: Check if caught up
        P2P-->>Failover: p2pCaughtUp status
        Failover->>Syncer: PendingCount()
        Syncer-->>Failover: pending height events count
        alt All conditions met: daCaughtUp && p2pCaughtUp && pipelineDrained
            Failover->>Node: Report fully caught up
            Note over Node: Block production can resume
        else Any condition false
            Note over Failover: Continue waiting for sync
        end
    end
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A syncer counts its pending heap,
While failover doth watch and keep,
The pipeline drains, events all clear,
Before consensus reappear—
Now catch-up waits for all three signs,
Where coordination truly aligns! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check: ❓ Inconclusive. The description provides a clear overview of the changes but lacks context, background, rationale for the fix, and any linked issue reference. Resolution: add background context explaining the race condition, the rationale for using PendingCount, and link the related issue (e.g., Closes #).
✅ Passed checks (2 passed)
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Title check: ✅ Passed. The title 'fix(node): race on caught up' is specific and directly related to the main changes: it addresses a race condition in the node's catch-up logic, which aligns with adding PendingCount and using it to ensure the pipeline drains before marking the node caught up.


@claude (Contributor) bot commented Mar 4, 2026

Claude finished @alpe's task in 2m 21s


Code Review

Summary

This is a targeted, minimal fix for a race condition in the failover catchup phase. The previous condition declared "caught up" as soon as daCaughtUp && p2pCaughtUp, but there could still be unprocessed events queued in heightInCh that the processLoop goroutine hadn't consumed yet. The fix adds a PendingCount() check to ensure the channel is drained.


What's Good

  • Minimal and correct: len(s.heightInCh) is safe to call concurrently in Go — channel length reads are atomic.
  • Well-placed: The daHeadReached flag is already a one-way latch (atomic.Bool, once true stays true), so when both daCaughtUp and pipelineDrained are true together, there's a strong signal that the pipeline is truly quiesced.
  • No interface bloat: PendingCount is correctly added only on the concrete *syncing.Syncer type (accessed through f.bc.Syncer which is already a concrete type). No interface change required.
  • Clear docstring on the new method.
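
For concreteness, the new accessor amounts to a one-line channel-length read. The sketch below uses a trimmed stand-in for the Syncer type (only heightInCh and PendingCount() mirror the PR; the event struct is an assumption):

```go
package main

import "fmt"

// daHeightEvent stands in for the syncer's height event type.
type daHeightEvent struct{ height uint64 }

// Syncer is a trimmed stand-in for block/internal/syncing.Syncer.
type Syncer struct {
	heightInCh chan daHeightEvent
}

// PendingCount returns the number of height events queued in the
// syncing pipeline that have not yet been processed. Calling len on
// a channel is safe for concurrent use.
func (s *Syncer) PendingCount() int {
	return len(s.heightInCh)
}

func main() {
	s := &Syncer{heightInCh: make(chan daHeightEvent, 8)}
	s.heightInCh <- daHeightEvent{height: 42}
	s.heightInCh <- daHeightEvent{height: 43}
	fmt.Println(s.PendingCount()) // 2
	<-s.heightInCh
	fmt.Println(s.PendingCount()) // 1
}
```

Because the read is a momentary snapshot, the caller (the failover loop) re-polls it rather than treating a single zero as a permanent state.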

Observations / Concerns

1. Pipeline partially covered — pending cache not counted

PendingCount() returns len(s.heightInCh), but there is a second level of buffering: s.cache.pendingEventsCache. Out-of-order blocks that arrive before their predecessors are stored there by processHeightEvent via s.cache.SetPendingEvent(), and processPendingEvents() drains the cache back into heightInCh every 10ms.

A scenario where this matters:

1. heightInCh is empty (PendingCount() == 0)
2. But pendingEventsCache holds blocks at heights N+2, N+3 waiting for N+1
3. waitForCatchup declares "caught up" prematurely
4. Block N+1 arrives later via a priority DA hint — never processed

In practice the pendingWorkerLoop ticks every 10ms while waitForCatchup polls at daBlockTime (usually ≥1s), so the cache is drained many times per poll. The risk is low but not zero, especially on startup with many out-of-order P2P events.

A more complete guard could be:

```go
pipelineDrained := f.bc.Syncer == nil ||
    (f.bc.Syncer.PendingCount() == 0 && f.bc.Syncer.PendingCacheCount() == 0)
```
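
PendingCacheCount() does not exist in this PR; the following is a hypothetical sketch of what it could look like, assuming the pending-events cache is a height-keyed map behind a mutex (all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// pendingCache is a stand-in for the syncer's pendingEventsCache:
// out-of-order events keyed by height, protected by a mutex.
type pendingCache struct {
	mu     sync.Mutex
	events map[uint64]struct{}
}

// SetPendingEvent parks an out-of-sequence height until its
// predecessors arrive.
func (c *pendingCache) SetPendingEvent(h uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.events == nil {
		c.events = make(map[uint64]struct{})
	}
	c.events[h] = struct{}{}
}

// PendingCacheCount is the hypothetical second half of the guard:
// how many out-of-order events are parked in the cache.
func (c *pendingCache) PendingCacheCount() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.events)
}

func main() {
	c := &pendingCache{}
	c.SetPendingEvent(12) // height N+2 arrived before N+1
	c.SetPendingEvent(13)
	fmt.Println(c.PendingCacheCount()) // 2
}
```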

2. Mild TOCTOU window (accepted)

Even after PendingCount() == 0, the DA worker could wake up from its blockTime sleep, fetch new blocks, and pipe events into the channel before the node transitions to aggregator mode. This is inherent to a polling approach and is acceptable — the aggregator mode will handle any new blocks produced on DA after the catchup window.

3. No regression test

The PR has no test covering the pipelineDrained condition. Given this is a concurrency fix, a test (even a unit test with a mock Syncer) would help prevent regression. Something like:

```go
func TestWaitForCatchupDrainsChannel(t *testing.T) {
    // Set daCaughtUp=true, p2pCaughtUp=true, but PendingCount > 0
    // Assert that waitForCatchup does NOT return caught-up
    // Then drain the channel and assert it returns true
}
```
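
One way to make that skeleton concrete without a full node is to factor the gate condition out of waitForCatchup into a pure helper. The helper name and signature below are illustrative, not the PR's API:

```go
package main

import "fmt"

// caughtUp factors out the catch-up gate from waitForCatchup so the
// pipelineDrained condition can be exercised in isolation. A nil
// Syncer in the real code also counts as drained; here that case is
// modeled by passing pendingCount == 0.
func caughtUp(daCaughtUp, p2pCaughtUp bool, pendingCount int) bool {
	pipelineDrained := pendingCount == 0
	return daCaughtUp && p2pCaughtUp && pipelineDrained
}

func main() {
	// DA and P2P caught up, but an event is still queued: not done yet.
	fmt.Println(caughtUp(true, true, 1)) // false
	// Channel drained: the node may now resume block production.
	fmt.Println(caughtUp(true, true, 0)) // true
}
```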

4. Minor: PendingCount() placement

The method is placed between fetchDAUntilCaughtUp and pendingWorkerLoop, slightly interrupting the flow of worker-related methods. Placing it near HasReachedDAHead() (both are status/observability methods) would improve readability. Minor nit.


Verdict

The fix is correct and directly addresses the stated race condition. The concern about the pending cache is a real edge case worth discussing — depending on how the node behaves during catchup with out-of-order blocks, it may or may not be worth a follow-up. The core of the PR is sound and ready to merge with this in mind.

@github-actions bot commented Mar 4, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

  • Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Mar 4, 2026, 5:00 PM

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node/failover.go`:
- Around line 318-319: The current catchup check uses Syncer.PendingCount()
which only looks at channel buffer and misses pending cache entries and
in-flight processing; add a dedicated idle mechanism on Syncer (e.g., WaitIdle()
or IdleChan/IsIdle()) that atomically ensures: heightInCh is empty, an in-flight
counter (incremented/decremented in processHeightEvent and when dequeuing from
heightInCh) is zero, and the pending cache (populated by SetPendingEvent and
polled in pendingWorkerLoop) is empty; update Syncer to maintain the in-flight
counter and a condition or channel that is signaled when all three are zero,
expose a blocking WaitIdle() or boolean IsIdle() used in failover.go (replace
pipelineDrained := f.bc.Syncer == nil || f.bc.Syncer.PendingCount() == 0 with a
call to the new idle API) so catchup waits for true quiescence.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f0c2f1f-a1ec-4274-a820-a1c7a3703f46

📥 Commits

Reviewing files that changed from the base of the PR and between 2c75e9e and 933a6d7.

📒 Files selected for processing (2)
  • block/internal/syncing/syncer.go
  • node/failover.go

Comment on lines +318 to +319
```go
pipelineDrained := f.bc.Syncer == nil || f.bc.Syncer.PendingCount() == 0
if daCaughtUp && p2pCaughtUp && pipelineDrained {
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) PendingCount implementation ==="
rg -n -A5 'func \(s \*Syncer\) PendingCount\(\) int' block/internal/syncing/syncer.go

echo
echo "=== 2) Syncer struct definition and state fields ==="
rg -n -B2 -A10 'type Syncer struct' block/internal/syncing/syncer.go

echo
echo "=== 3) heightInCh and pipeline queue operations ==="
rg -n -B2 -A2 'heightInCh' block/internal/syncing/syncer.go | head -80

echo
echo "=== 4) Failover.go catchup logic ==="
rg -n -B5 -A5 'pipelineDrained' node/failover.go
```

Repository: evstack/ev-node

Length of output: 3017


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) processHeightEvent implementation ==="
rg -n -A20 'func \(s \*Syncer\) processHeightEvent' block/internal/syncing/syncer.go

echo
echo "=== 2) Cache operations and pending events ==="
rg -n -B2 -A2 'cache\.GetNextPendingEvent\|SetPendingEvent\|pendingWorkerLoop' block/internal/syncing/syncer.go | head -100

echo
echo "=== 3) Check if Syncer has any Done/Idle/Quiescent signals ==="
rg -n 'sync\.WaitGroup\|chan.*done\|chan.*idle\|chan.*quiescent' block/internal/syncing/syncer.go

echo
echo "=== 4) Any existing waiters for pipeline quiescence ==="
rg -n 'WaitForIdle\|WaitForCompletion\|WaitForQuiescence' block/internal/syncing/syncer.go
```

Repository: evstack/ev-node

Length of output: 869


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Full processHeightEvent implementation ==="
rg -n -A50 'func \(s \*Syncer\) processHeightEvent' block/internal/syncing/syncer.go | head -80

echo
echo "=== 2) pendingWorkerLoop - cache to channel pipeline ==="
rg -n -A30 'func \(s \*Syncer\) pendingWorkerLoop' block/internal/syncing/syncer.go

echo
echo "=== 3) Check Syncer fields for tracking work state ==="
sed -n '46,150p' block/internal/syncing/syncer.go
```

Repository: evstack/ev-node

Length of output: 6599


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) processPendingEvents to confirm cache polling ==="
rg -n -A15 'func \(s \*Syncer\) processPendingEvents' block/internal/syncing/syncer.go

echo
echo "=== 2) Check cache.GetNextPendingEvent calls ==="
rg -n 'GetNextPendingEvent' block/internal/syncing/syncer.go

echo
echo "=== 3) How failover uses/waits for syncer ==="
rg -n -B5 -A10 'bc\.Syncer.*PendingCount\|bc\.Stop\|Syncer.*idle\|Syncer.*done' node/failover.go | head -80
```

Repository: evstack/ev-node

Length of output: 874


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Failover.bc.Stop and syncer lifecycle ==="
rg -n -B10 -A10 'bc\.Stop\(\)' node/failover.go

echo
echo "=== 2) Check if syncer has any Stop or WaitAll methods ==="
rg -n 'func \(s \*Syncer\) Stop\|func \(s \*Syncer\) Wait' block/internal/syncing/syncer.go

echo
echo "=== 3) Full catchup logic context in failover ==="
sed -n '300,330p' node/failover.go

echo
echo "=== 4) Verify cache.GetNextPendingEvent can be non-nil when heightInCh empty ==="
rg -n -B5 -A10 'cache.GetNextPendingEvent\|pendingEvents\|cache.*map' block/internal/syncing/syncer.go | head -60
```

Repository: evstack/ev-node

Length of output: 1013


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Syncer Stop method ==="
rg -n 'func \(s \*Syncer\) Stop' block/internal/syncing/syncer.go

echo
echo "=== 2) All public Syncer methods ==="
rg -n '^func \(s \*Syncer\)' block/internal/syncing/syncer.go | head -30

echo
echo "=== 3) Look at sync.WaitGroup usage in syncer ==="
rg -n 'wg\.Add\|wg\.Wait\|wg\.Done' block/internal/syncing/syncer.go

echo
echo "=== 4) Full catchup decision logic in failover ==="
sed -n '300,330p' node/failover.go
```

Repository: evstack/ev-node

Length of output: 2363


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Syncer.Stop() method ==="
sed -n '209,250p' block/internal/syncing/syncer.go

echo
echo "=== 2) Syncer.Start() - understand wg usage ==="
sed -n '166,210p' block/internal/syncing/syncer.go

echo
echo "=== 3) Full failover catchup check context ==="
sed -n '295,330p' node/failover.go

echo
echo "=== 4) Understanding cache state - pending events structure ==="
rg -n 'pendingEventsMu\|pendingEvents\s*map' block/internal/syncing/ -A2
```

Repository: evstack/ev-node

Length of output: 3638


Pipeline can complete catchup before pending cache events are processed.

PendingCount() only measures channel buffer length (line 471-472) and misses two important pipeline states:

  1. Pending cache events: When processHeightEvent() encounters an out-of-sequence block, it stores it in cache via SetPendingEvent() (line 599). The pendingWorkerLoop() polls this cache every 10ms (line 481) and re-queues events to heightInCh when the sequence is complete (line 1109-1111). These cache events don't appear in PendingCount().

  2. In-flight processing: Events dequeued from heightInCh (line 354) remain in-flight during processHeightEvent() execution, which performs I/O and state validation (lines 584-602). This work isn't tracked by channel depth.

The evidence: Stop() calls wg.Wait() (line 216) to wait for all worker goroutines, then after goroutine completion, drains heightInCh in a loop (lines 221-237). This pattern shows the channel can be empty while the pipeline hasn't reached true quiescence.

Line 318 in failover.go declares catchup complete as soon as PendingCount() == 0, which creates a race where pending cache events or in-flight processing aren't accounted for.

Implement a dedicated idle signal in Syncer that waits for all three stages: empty channel, no in-flight processing, and empty pending cache.
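
A minimal sketch of such an idle signal, assuming an in-flight counter and a cache-size counter (all names are illustrative, not the repository's API; note the three checks below are not taken under a single lock, so a production version would need a mutex or condition variable to be truly atomic):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// idleTracker sketches the suggested quiescence signal: the pipeline
// is idle only when the channel is empty, nothing is in flight, and
// the pending cache is empty.
type idleTracker struct {
	heightInCh chan struct{}
	inFlight   atomic.Int64
	cacheLen   atomic.Int64 // stands in for the pending-events cache size
}

// beginEvent is called when an event is dequeued from heightInCh,
// endEvent when processing of that event finishes.
func (t *idleTracker) beginEvent() { t.inFlight.Add(1) }
func (t *idleTracker) endEvent()   { t.inFlight.Add(-1) }

// IsIdle reports quiescence across all three stages. The three reads
// are individually safe but not jointly atomic in this sketch.
func (t *idleTracker) IsIdle() bool {
	return len(t.heightInCh) == 0 &&
		t.inFlight.Load() == 0 &&
		t.cacheLen.Load() == 0
}

func main() {
	tr := &idleTracker{heightInCh: make(chan struct{}, 4)}
	tr.heightInCh <- struct{}{}
	fmt.Println(tr.IsIdle()) // false: event queued in channel
	<-tr.heightInCh
	tr.beginEvent()
	fmt.Println(tr.IsIdle()) // false: event in flight
	tr.endEvent()
	fmt.Println(tr.IsIdle()) // true: all three stages empty
}
```

The same shape extends naturally to a blocking WaitIdle() by having endEvent and the cache-drain path signal a condition variable whenever all three counters reach zero.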


@codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.82%. Comparing base (2c75e9e) to head (933a6d7).
⚠️ Report is 1 commits behind head on main.

  • block/internal/syncing/syncer.go: patch 0.00%, 2 lines missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3133      +/-   ##
==========================================
+ Coverage   60.81%   60.82%   +0.01%     
==========================================
  Files         113      113              
  Lines       11692    11695       +3     
==========================================
+ Hits         7111     7114       +3     
  Misses       3777     3777              
  Partials      804      804              
```

  • Flag combined: 60.82% <50.00%> (+0.01%) ⬆️


@julienrbrt changed the title from "fix: race on caught up" to "fix(node): race on caught up" Mar 4, 2026
@julienrbrt julienrbrt added this pull request to the merge queue Mar 4, 2026
Merged via the queue into main with commit c0bc141 Mar 4, 2026
37 checks passed
@julienrbrt julienrbrt deleted the alex/sync_race branch March 4, 2026 18:08