fix(node): race on caught up #3133

Merged — julienrbrt merged 1 commit into main from alex/sync_race, Mar 4, 2026

Conversation

@alpe (Contributor) commented Mar 4, 2026

Overview

Add Syncer.PendingCount and use it to ensure the sync pipeline is drained before marking the node as caught up in failover.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced catch-up completion detection to ensure the system verifies all pending synchronization tasks are fully processed before reporting a caught-up state and initiating block production. This improves reliability of the synchronization mechanism.

@coderabbitai (Contributor) bot commented Mar 4, 2026

📝 Walkthrough

Two modifications enable the block syncing pipeline to track pending events: a new PendingCount() method exposes unprocessed height events from the Syncer, and the catch-up completion logic now requires the syncing pipeline to be fully drained alongside DA and P2P consensus before resuming block production.

Changes

  • block/internal/syncing/syncer.go (Syncer Pipeline Introspection): adds a public PendingCount() method returning the count of unprocessed height events in the heightInCh channel.
  • node/failover.go (Catch-up Completion Gate): extends the catch-up completion check to require pipeline drainage (Syncer is nil or PendingCount() returns 0) in addition to DA and P2P consensus status.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Node as Node
    participant Failover as Failover Loop
    participant Syncer as Syncer
    participant DA as DA Status
    participant P2P as P2P Status

    loop Catch-up Check
        Failover->>DA: Check if caught up
        DA-->>Failover: daCaughtUp status
        Failover->>P2P: Check if caught up
        P2P-->>Failover: p2pCaughtUp status
        Failover->>Syncer: PendingCount()
        Syncer-->>Failover: pending height events count
        alt All conditions met: daCaughtUp && p2pCaughtUp && pipelineDrained
            Failover->>Node: Report fully caught up
            Note over Node: Block production can resume
        else Any condition false
            Note over Failover: Continue waiting for sync
        end
    end
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A syncer counts its pending heap,
While failover doth watch and keep,
The pipeline drains, events all clear,
Before consensus reappear—
Now catch-up waits for all three signs,
Where coordination truly aligns! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check: ❓ Inconclusive. The description provides a clear overview of the changes but lacks context, background, rationale for the fix, and any linked issue reference. Resolution: add background context explaining the race condition, the rationale for using PendingCount, and link the related issue (e.g., Closes #).
✅ Passed checks (2 passed)
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Title check: ✅ Passed. The title 'fix(node): race on caught up' is specific and directly related to the main changes: it addresses a race condition in the node's catch-up logic, which aligns with adding PendingCount and using it to ensure the pipeline drains before marking the node caught up.


@claude (Contributor) bot commented Mar 4, 2026

Claude finished @alpe's task in 2m 21s


Code Review

Summary

This is a targeted, minimal fix for a race condition in the failover catchup phase. The previous condition declared "caught up" as soon as daCaughtUp && p2pCaughtUp, but there could still be unprocessed events queued in heightInCh that the processLoop goroutine hadn't consumed yet. The fix adds a PendingCount() check to ensure the channel is drained.


What's Good

  • Minimal and correct: len(s.heightInCh) is safe to call concurrently in Go — channel length reads are atomic.
  • Well-placed: The daHeadReached flag is already a one-way latch (atomic.Bool, once true stays true), so when both daCaughtUp and pipelineDrained are true together, there's a strong signal that the pipeline is truly quiesced.
  • No interface bloat: PendingCount is correctly added only on the concrete *syncing.Syncer type (accessed through f.bc.Syncer which is already a concrete type). No interface change required.
  • Clear docstring on the new method.
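
For concreteness, the new accessor amounts to a one-line channel-length read. The sketch below uses a trimmed stand-in for the Syncer type (only heightInCh and PendingCount() mirror the PR; the event struct is an assumption):

```go
package main

import "fmt"

// daHeightEvent stands in for the syncer's height event type.
type daHeightEvent struct{ height uint64 }

// Syncer is a trimmed stand-in for block/internal/syncing.Syncer.
type Syncer struct {
	heightInCh chan daHeightEvent
}

// PendingCount returns the number of height events queued in the
// syncing pipeline that have not yet been processed. Calling len on
// a channel is safe for concurrent use.
func (s *Syncer) PendingCount() int {
	return len(s.heightInCh)
}

func main() {
	s := &Syncer{heightInCh: make(chan daHeightEvent, 8)}
	s.heightInCh <- daHeightEvent{height: 42}
	s.heightInCh <- daHeightEvent{height: 43}
	fmt.Println(s.PendingCount()) // 2
	<-s.heightInCh
	fmt.Println(s.PendingCount()) // 1
}
```

Because the read is a momentary snapshot, the caller (the failover loop) re-polls it rather than treating a single zero as a permanent state.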

Observations / Concerns

1. Pipeline partially covered — pending cache not counted

PendingCount() returns len(s.heightInCh), but there is a second level of buffering: s.cache.pendingEventsCache. Out-of-order blocks that arrive before their predecessors are stored there by processHeightEvent via s.cache.SetPendingEvent(), and processPendingEvents() drains the cache back into heightInCh every 10ms.

A scenario where this matters:

1. heightInCh is empty (PendingCount() == 0)
2. But pendingEventsCache holds blocks at heights N+2, N+3 waiting for N+1
3. waitForCatchup declares "caught up" prematurely
4. Block N+1 arrives later via a priority DA hint — never processed

In practice the pendingWorkerLoop ticks every 10ms while waitForCatchup polls at daBlockTime (usually ≥1s), so the cache is drained many times per poll. The risk is low but not zero, especially on startup with many out-of-order P2P events.

A more complete guard could be:

```go
pipelineDrained := f.bc.Syncer == nil ||
    (f.bc.Syncer.PendingCount() == 0 && f.bc.Syncer.PendingCacheCount() == 0)
```
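
PendingCacheCount() does not exist in this PR; the following is a hypothetical sketch of what it could look like, assuming the pending-events cache is a height-keyed map behind a mutex (all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// pendingCache is a stand-in for the syncer's pendingEventsCache:
// out-of-order events keyed by height, protected by a mutex.
type pendingCache struct {
	mu     sync.Mutex
	events map[uint64]struct{}
}

// SetPendingEvent parks an out-of-sequence height until its
// predecessors arrive.
func (c *pendingCache) SetPendingEvent(h uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.events == nil {
		c.events = make(map[uint64]struct{})
	}
	c.events[h] = struct{}{}
}

// PendingCacheCount is the hypothetical second half of the guard:
// how many out-of-order events are parked in the cache.
func (c *pendingCache) PendingCacheCount() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.events)
}

func main() {
	c := &pendingCache{}
	c.SetPendingEvent(12) // height N+2 arrived before N+1
	c.SetPendingEvent(13)
	fmt.Println(c.PendingCacheCount()) // 2
}
```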

2. Mild TOCTOU window (accepted)

Even after PendingCount() == 0, the DA worker could wake up from its blockTime sleep, fetch new blocks, and pipe events into the channel before the node transitions to aggregator mode. This is inherent to a polling approach and is acceptable — the aggregator mode will handle any new blocks produced on DA after the catchup window.

3. No regression test

The PR has no test covering the pipelineDrained condition. Given this is a concurrency fix, a test (even a unit test with a mock Syncer) would help prevent regression. Something like:

```go
func TestWaitForCatchupDrainsChannel(t *testing.T) {
    // Set daCaughtUp=true, p2pCaughtUp=true, but PendingCount > 0
    // Assert that waitForCatchup does NOT return caught-up
    // Then drain the channel and assert it returns true
}
```
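
One way to make that skeleton concrete without a full node is to factor the gate condition out of waitForCatchup into a pure helper. The helper name and signature below are illustrative, not the PR's API:

```go
package main

import "fmt"

// caughtUp factors out the catch-up gate from waitForCatchup so the
// pipelineDrained condition can be exercised in isolation. A nil
// Syncer in the real code also counts as drained; here that case is
// modeled by passing pendingCount == 0.
func caughtUp(daCaughtUp, p2pCaughtUp bool, pendingCount int) bool {
	pipelineDrained := pendingCount == 0
	return daCaughtUp && p2pCaughtUp && pipelineDrained
}

func main() {
	// DA and P2P caught up, but an event is still queued: not done yet.
	fmt.Println(caughtUp(true, true, 1)) // false
	// Channel drained: the node may now resume block production.
	fmt.Println(caughtUp(true, true, 0)) // true
}
```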

4. Minor: PendingCount() placement

The method is placed between fetchDAUntilCaughtUp and pendingWorkerLoop, slightly interrupting the flow of worker-related methods. Placing it near HasReachedDAHead() (both are status/observability methods) would improve readability. Minor nit.


Verdict

The fix is correct and directly addresses the stated race condition. The concern about the pending cache is a real edge case worth discussing — depending on how the node behaves during catchup with out-of-order blocks, it may or may not be worth a follow-up. The core of the PR is sound and ready to merge with this in mind.

@github-actions bot commented Mar 4, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

  • Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Mar 4, 2026, 5:00 PM

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node/failover.go`:
- Around line 318-319: The current catchup check uses Syncer.PendingCount()
which only looks at channel buffer and misses pending cache entries and
in-flight processing; add a dedicated idle mechanism on Syncer (e.g., WaitIdle()
or IdleChan/IsIdle()) that atomically ensures: heightInCh is empty, an in-flight
counter (incremented/decremented in processHeightEvent and when dequeuing from
heightInCh) is zero, and the pending cache (populated by SetPendingEvent and
polled in pendingWorkerLoop) is empty; update Syncer to maintain the in-flight
counter and a condition or channel that is signaled when all three are zero,
expose a blocking WaitIdle() or boolean IsIdle() used in failover.go (replace
pipelineDrained := f.bc.Syncer == nil || f.bc.Syncer.PendingCount() == 0 with a
call to the new idle API) so catchup waits for true quiescence.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f0c2f1f-a1ec-4274-a820-a1c7a3703f46

📥 Commits

Reviewing files that changed from the base of the PR and between 2c75e9e and 933a6d7.

📒 Files selected for processing (2)
  • block/internal/syncing/syncer.go
  • node/failover.go

Comment on lines +318 to +319
```go
pipelineDrained := f.bc.Syncer == nil || f.bc.Syncer.PendingCount() == 0
if daCaughtUp && p2pCaughtUp && pipelineDrained {
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) PendingCount implementation ==="
rg -n -A5 'func \(s \*Syncer\) PendingCount\(\) int' block/internal/syncing/syncer.go

echo
echo "=== 2) Syncer struct definition and state fields ==="
rg -n -B2 -A10 'type Syncer struct' block/internal/syncing/syncer.go

echo
echo "=== 3) heightInCh and pipeline queue operations ==="
rg -n -B2 -A2 'heightInCh' block/internal/syncing/syncer.go | head -80

echo
echo "=== 4) Failover.go catchup logic ==="
rg -n -B5 -A5 'pipelineDrained' node/failover.go
```

Repository: evstack/ev-node

Length of output: 3017


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) processHeightEvent implementation ==="
rg -n -A20 'func \(s \*Syncer\) processHeightEvent' block/internal/syncing/syncer.go

echo
echo "=== 2) Cache operations and pending events ==="
rg -n -B2 -A2 'cache\.GetNextPendingEvent\|SetPendingEvent\|pendingWorkerLoop' block/internal/syncing/syncer.go | head -100

echo
echo "=== 3) Check if Syncer has any Done/Idle/Quiescent signals ==="
rg -n 'sync\.WaitGroup\|chan.*done\|chan.*idle\|chan.*quiescent' block/internal/syncing/syncer.go

echo
echo "=== 4) Any existing waiters for pipeline quiescence ==="
rg -n 'WaitForIdle\|WaitForCompletion\|WaitForQuiescence' block/internal/syncing/syncer.go
```

Repository: evstack/ev-node

Length of output: 869


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Full processHeightEvent implementation ==="
rg -n -A50 'func \(s \*Syncer\) processHeightEvent' block/internal/syncing/syncer.go | head -80

echo
echo "=== 2) pendingWorkerLoop - cache to channel pipeline ==="
rg -n -A30 'func \(s \*Syncer\) pendingWorkerLoop' block/internal/syncing/syncer.go

echo
echo "=== 3) Check Syncer fields for tracking work state ==="
sed -n '46,150p' block/internal/syncing/syncer.go
```

Repository: evstack/ev-node

Length of output: 6599


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) processPendingEvents to confirm cache polling ==="
rg -n -A15 'func \(s \*Syncer\) processPendingEvents' block/internal/syncing/syncer.go

echo
echo "=== 2) Check cache.GetNextPendingEvent calls ==="
rg -n 'GetNextPendingEvent' block/internal/syncing/syncer.go

echo
echo "=== 3) How failover uses/waits for syncer ==="
rg -n -B5 -A10 'bc\.Syncer.*PendingCount\|bc\.Stop\|Syncer.*idle\|Syncer.*done' node/failover.go | head -80
```

Repository: evstack/ev-node

Length of output: 874


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Failover.bc.Stop and syncer lifecycle ==="
rg -n -B10 -A10 'bc\.Stop\(\)' node/failover.go

echo
echo "=== 2) Check if syncer has any Stop or WaitAll methods ==="
rg -n 'func \(s \*Syncer\) Stop\|func \(s \*Syncer\) Wait' block/internal/syncing/syncer.go

echo
echo "=== 3) Full catchup logic context in failover ==="
sed -n '300,330p' node/failover.go

echo
echo "=== 4) Verify cache.GetNextPendingEvent can be non-nil when heightInCh empty ==="
rg -n -B5 -A10 'cache.GetNextPendingEvent\|pendingEvents\|cache.*map' block/internal/syncing/syncer.go | head -60
```

Repository: evstack/ev-node

Length of output: 1013


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Syncer Stop method ==="
rg -n 'func \(s \*Syncer\) Stop' block/internal/syncing/syncer.go

echo
echo "=== 2) All public Syncer methods ==="
rg -n '^func \(s \*Syncer\)' block/internal/syncing/syncer.go | head -30

echo
echo "=== 3) Look at sync.WaitGroup usage in syncer ==="
rg -n 'wg\.Add\|wg\.Wait\|wg\.Done' block/internal/syncing/syncer.go

echo
echo "=== 4) Full catchup decision logic in failover ==="
sed -n '300,330p' node/failover.go
```

Repository: evstack/ev-node

Length of output: 2363


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1) Syncer.Stop() method ==="
sed -n '209,250p' block/internal/syncing/syncer.go

echo
echo "=== 2) Syncer.Start() - understand wg usage ==="
sed -n '166,210p' block/internal/syncing/syncer.go

echo
echo "=== 3) Full failover catchup check context ==="
sed -n '295,330p' node/failover.go

echo
echo "=== 4) Understanding cache state - pending events structure ==="
rg -n 'pendingEventsMu\|pendingEvents\s*map' block/internal/syncing/ -A2
```

Repository: evstack/ev-node

Length of output: 3638


Pipeline can complete catchup before pending cache events are processed.

PendingCount() only measures channel buffer length (line 471-472) and misses two important pipeline states:

  1. Pending cache events: When processHeightEvent() encounters an out-of-sequence block, it stores it in cache via SetPendingEvent() (line 599). The pendingWorkerLoop() polls this cache every 10ms (line 481) and re-queues events to heightInCh when the sequence is complete (line 1109-1111). These cache events don't appear in PendingCount().

  2. In-flight processing: Events dequeued from heightInCh (line 354) remain in-flight during processHeightEvent() execution, which performs I/O and state validation (lines 584-602). This work isn't tracked by channel depth.

The evidence: Stop() calls wg.Wait() (line 216) to wait for all worker goroutines, then after goroutine completion, drains heightInCh in a loop (lines 221-237). This pattern shows the channel can be empty while the pipeline hasn't reached true quiescence.

Line 318 in failover.go declares catchup complete as soon as PendingCount() == 0, which creates a race where pending cache events or in-flight processing aren't accounted for.

Implement a dedicated idle signal in Syncer that waits for all three stages: empty channel, no in-flight processing, and empty pending cache.
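
A minimal sketch of such an idle signal, assuming an in-flight counter and a cache-size counter (all names are illustrative, not the repository's API; note the three checks below are not taken under a single lock, so a production version would need a mutex or condition variable to be truly atomic):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// idleTracker sketches the suggested quiescence signal: the pipeline
// is idle only when the channel is empty, nothing is in flight, and
// the pending cache is empty.
type idleTracker struct {
	heightInCh chan struct{}
	inFlight   atomic.Int64
	cacheLen   atomic.Int64 // stands in for the pending-events cache size
}

// beginEvent is called when an event is dequeued from heightInCh,
// endEvent when processing of that event finishes.
func (t *idleTracker) beginEvent() { t.inFlight.Add(1) }
func (t *idleTracker) endEvent()   { t.inFlight.Add(-1) }

// IsIdle reports quiescence across all three stages. The three reads
// are individually safe but not jointly atomic in this sketch.
func (t *idleTracker) IsIdle() bool {
	return len(t.heightInCh) == 0 &&
		t.inFlight.Load() == 0 &&
		t.cacheLen.Load() == 0
}

func main() {
	tr := &idleTracker{heightInCh: make(chan struct{}, 4)}
	tr.heightInCh <- struct{}{}
	fmt.Println(tr.IsIdle()) // false: event queued in channel
	<-tr.heightInCh
	tr.beginEvent()
	fmt.Println(tr.IsIdle()) // false: event in flight
	tr.endEvent()
	fmt.Println(tr.IsIdle()) // true: all three stages empty
}
```

The same shape extends naturally to a blocking WaitIdle() by having endEvent and the cache-drain path signal a condition variable whenever all three counters reach zero.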


@codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.82%. Comparing base (2c75e9e) to head (933a6d7).
⚠️ Report is 1 commits behind head on main.

  • block/internal/syncing/syncer.go: patch 0.00%, 2 lines missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3133      +/-   ##
==========================================
+ Coverage   60.81%   60.82%   +0.01%     
==========================================
  Files         113      113              
  Lines       11692    11695       +3     
==========================================
+ Hits         7111     7114       +3     
  Misses       3777     3777              
  Partials      804      804              
```

  • Flag combined: 60.82% <50.00%> (+0.01%) ⬆️


@julienrbrt changed the title from "fix: race on caught up" to "fix(node): race on caught up" Mar 4, 2026
@julienrbrt julienrbrt added this pull request to the merge queue Mar 4, 2026
Merged via the queue into main with commit c0bc141 Mar 4, 2026
37 checks passed
@julienrbrt julienrbrt deleted the alex/sync_race branch March 4, 2026 18:08