epoch/finder: use GetEpochInfo for authoritative epoch at boundaries #3227

Draft
snormore wants to merge 2 commits into main from snor/epoch-finder-stale-slot-test

@snormore
Contributor

Summary

  • Replace the GetSlot-based epoch approximation with GetEpochInfo, which returns the authoritative epoch directly. This fixes epoch misassignment at epoch boundaries
  • For recent targets (within the current epoch's slot index), return the authoritative epoch directly without slot math
  • Fall back to slot math for targets in prior epochs, using the authoritative AbsoluteSlot from GetEpochInfo

Context

On 2026-03-10, "Account Not Found" alerts fired on both devnet and testnet for all circuits after an epoch rollover. The collector's epoch finder used GetSlot(finalized) to approximate the epoch, but the RPC returned a stale slot in epoch 192 for ~51 minutes after epoch 193 started. This caused all records to be assigned epoch 192, so no epoch 193 accounts were ever initialized. The monitor (which uses GetEpochInfo) saw epoch 193 and flagged all circuits as missing.
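
To make the failure mode concrete, here is a small arithmetic sketch. It assumes the epoch is approximated as `slot / slotsInEpoch` with fixed-length epochs counted from slot 0, which may not match the previous finder exactly. The function name, the 432000-slot epoch length, and the 400ms slot duration are assumptions for illustration.

```go
package main

import "fmt"

// epochFromSlot approximates an epoch by integer division. It assumes
// fixed-length epochs counted from slot 0 (illustrative only).
func epochFromSlot(slot, slotsInEpoch uint64) uint64 {
	return slot / slotsInEpoch
}

func main() {
	const slotsInEpoch = 432000 // assumed epoch length

	boundary := uint64(193 * slotsInEpoch) // first slot of epoch 193
	staleSlot := boundary - 1              // what a lagging GetSlot(finalized) might report
	actualSlot := boundary + 7650          // about 51 minutes later at an assumed 400ms per slot

	fmt.Println(epochFromSlot(staleSlot, slotsInEpoch))  // 192
	fmt.Println(epochFromSlot(actualSlot, slotsInEpoch)) // 193
}
```

Any slot that lags behind the boundary maps to the previous epoch, however long the network has already been in the new one.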

See plans/internet-latency-collector-2026-03-10.md for the full incident analysis.

Testing Verification

  • All existing epoch finder tests updated and passing
  • New test TestEpochFinder_GetEpochInfoFixesStaleSlotBug verifies that the authoritative epoch is used for recent targets, that slot math is the fallback for prior epochs, and that behavior is correct at epoch boundaries
  • Downstream exporter tests passing
  • All packages that consume epoch.SolanaRPCClient build cleanly (production callers use *solanarpc.Client which already implements GetEpochInfo)

@snormore snormore force-pushed the snor/epoch-finder-stale-slot-test branch 2 times, most recently from 1bd0a8c to 7884926 on March 10, 2026 17:31
Add tests that reproduce the 2026-03-10 incident where the epoch finder
returned epoch 192 for ~51 minutes after epoch 193 started because
GetSlot(finalized) was returning a stale slot. Three subtests cover:

- Stale GetSlot causing wrong epoch for post-boundary records
- Fresh GetSlot returning correct epoch (control case)
- Cache amplification: stale result persists for 30min per minute-bucket
…poch

Replace the slot-based epoch approximation with GetEpochInfo which
returns the authoritative epoch directly from the RPC. This fixes an
issue where a stale GetSlot(finalized) response caused the epoch finder
to return the wrong epoch for ~51 minutes after an epoch boundary,
leading to Account Not Found alerts on all circuits.

For recent targets (within the current epoch), the authoritative epoch
from GetEpochInfo is returned directly. For targets in prior epochs,
slot math is used as before but with the authoritative slot from
GetEpochInfo.
@snormore snormore force-pushed the snor/epoch-finder-stale-slot-test branch from 7884926 to e810f8e on March 14, 2026 00:39