feat(evm): handle nonce gaps before resubmission#726
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR introduces comprehensive nonce management for EVM relayers by extracting nonce synchronization into a dedicated module, implementing nonce-gap detection and resolution with health-check actions, adding RPC error classification for transaction submission flows, and extending repositories with nonce occupancy queries. The changes unify error pattern matching, escalate nonce-too-high failures to health-check workflows, and enable fast-path nonce reconciliation during status checks. Changes
Sequence DiagramssequenceDiagram
actor Client
participant EVM Relayer
participant Transaction Handler
participant Repository
participant Provider
Client->>EVM Relayer: detect_nonce_gaps(on_chain_nonce)
EVM Relayer->>Provider: get_transaction_count (on-chain nonce)
Provider-->>EVM Relayer: nonce
EVM Relayer->>Repository: get_nonce_occupancy(on_chain..tx_nonce)
Repository-->>EVM Relayer: Vec<(nonce, Option<Status>)>
EVM Relayer-->>Client: gaps_detected: Vec<u64>
Note over Client,EVM Relayer: Gap resolution triggered by health action
Client->>EVM Relayer: resolve_nonce_gaps(nonce_hint)
EVM Relayer->>EVM Relayer: optionally raise counter via nonce_hint
EVM Relayer->>EVM Relayer: pause to mitigate race conditions
EVM Relayer->>EVM Relayer: sync_nonce (re-sync with chain)
EVM Relayer->>EVM Relayer: detect_nonce_gaps (re-check)
loop For each remaining gap
EVM Relayer->>EVM Relayer: create_gap_filling_noop(nonce)
EVM Relayer->>Repository: persist NOOP as Pending
EVM Relayer->>Client: enqueue transaction + status-check job
end
EVM Relayer-->>Client: gaps_filled: usize
sequenceDiagram
actor Job Handler
participant Health Check Handler
participant EVM Relayer
participant Nonce Module
Job Handler->>Health Check Handler: handle_relayer_health_check(metadata)
alt metadata present and action = nonce_health
Health Check Handler->>EVM Relayer: handle_health_action(metadata)
EVM Relayer->>Nonce Module: resolve_nonce_gaps(nonce_hint from metadata)
Nonce Module->>Nonce Module: detect, fill, sync workflow
Nonce Module-->>EVM Relayer: Result<usize, Error>
EVM Relayer-->>Health Check Handler: Ok(true) - action handled
Health Check Handler-->>Job Handler: Ok(())
else no metadata or action unhandled
Health Check Handler->>Health Check Handler: run normal health check
Health Check Handler-->>Job Handler: Ok(())
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
Suggested Labels
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR improves EVM transaction reliability by proactively detecting nonce gaps (and externally-consumed nonces) and routing recovery through targeted relayer “nonce health” jobs instead of repeatedly resubmitting transactions that cannot be mined.
Changes:
- Add nonce-gap-ahead detection in EVM status checking to schedule nonce health jobs when a stale Submitted tx is blocked by missing/terminal nonces below it.
- Introduce repository-level
get_nonce_occupancyfor efficient nonce-slot scans (batched Redis MGET; in-memory implementation for tests). - Enhance nonce-related error classification (provider retry behavior + EVM submission/resubmission handling), plus add health-check metadata and a dedicated EVM nonce health implementation.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/services/provider/retry.rs |
Adds a regression test ensuring non-retriable tx-level RPC errors don’t trigger retries/failover. |
src/services/provider/mod.rs |
Classifies certain RPC error-code messages as non-retriable “tx-level” rejections; adds unit tests. |
src/repositories/transaction/mod.rs |
Extends TransactionRepository with get_nonce_occupancy and wires it through storage/mock. |
src/repositories/transaction/transaction_redis.rs |
Implements get_nonce_occupancy using batched Redis MGET; adds integration tests. |
src/repositories/transaction/transaction_in_memory.rs |
Implements get_nonce_occupancy for the in-memory repository; adds unit tests. |
src/models/transaction/repository.rs |
Adds nonce_too_high_retries tracking + helper reset method; adds backward-compat tests. |
src/jobs/status_check_context.rs |
Adds optional job metadata to StatusCheckContext to support one-shot recovery hints. |
src/jobs/job.rs |
Adds metadata support to RelayerHealthCheck and constructors for nonce-health jobs (with optional nonce hint). |
src/jobs/handlers/transaction_status_handler.rs |
Plumbs job metadata into StatusCheckContext. |
src/jobs/handlers/relayer_health_check_handler.rs |
Executes targeted health actions (like nonce health) based on job metadata. |
src/domain/transaction/stellar/submit.rs |
Updates tests/metadata structs to include the new nonce_too_high_retries field. |
src/domain/transaction/common.rs |
Adds is_active_nonce_status helper used for nonce-gap logic and tests it. |
src/domain/transaction/evm/status.rs |
Adds nonce-gap-ahead detection + nonce reconciliation path triggered by job metadata; updates circuit-breaker behavior for Sent. |
src/domain/transaction/evm/evm_transaction.rs |
Adds nonce error classification, nonce-too-high retry escalation, and status-check metadata scheduling. |
src/domain/relayer/mod.rs |
Adds a default handle_health_action method to the Relayer trait and forwards in NetworkRelayer. |
src/domain/relayer/evm/mod.rs |
Registers the new EVM nonce module. |
src/domain/relayer/evm/evm_relayer.rs |
Exposes internal fields needed by the new nonce module and removes the old sync_nonce impl from this file. |
src/domain/relayer/evm/nonce.rs |
New module implementing nonce sync/gap detection/gap filling and targeted nonce-health action handling. |
src/constants/evm_transaction.rs |
Adds shared nonce/tx error pattern constants and health-check metadata keys. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is Additional details and impacted files
Flags with carried forward coverage won't be shown. Click here to find out more. @@ Coverage Diff @@
## main #726 +/- ##
==========================================
+ Coverage 90.21% 90.31% +0.10%
==========================================
Files 290 291 +1
Lines 122099 124258 +2159
==========================================
+ Hits 110147 112219 +2072
- Misses 11952 12039 +87
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
src/repositories/transaction/transaction_in_memory.rs (1)
323-334: Take one store snapshot for the whole occupancy scan.This loops through
find_by_nonce, so each nonce reacquires the mutex and rescans the full map. A single lock keeps the returned window consistent and avoidsO(range × store_size)work in the reference implementation.♻️ Possible simplification
async fn get_nonce_occupancy( &self, relayer_id: &str, from_nonce: u64, to_nonce: u64, ) -> Result<Vec<(u64, Option<TransactionStatus>)>, RepositoryError> { - let mut results = Vec::new(); - for nonce in from_nonce..to_nonce { - let tx = self.find_by_nonce(relayer_id, nonce).await?; - results.push((nonce, tx.map(|t| t.status))); - } + let store = Self::acquire_lock(&self.store).await?; + let mut results = Vec::with_capacity(to_nonce.saturating_sub(from_nonce) as usize); + for nonce in from_nonce..to_nonce { + let status = store + .values() + .find(|tx| { + tx.relayer_id == relayer_id + && matches!( + &tx.network_data, + NetworkTransactionData::Evm(data) if data.nonce == Some(nonce) + ) + }) + .map(|tx| tx.status.clone()); + results.push((nonce, status)); + } Ok(results) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/repositories/transaction/transaction_in_memory.rs` around lines 323 - 334, The get_nonce_occupancy implementation currently calls find_by_nonce inside the loop, re-locking and rescanning the store for each nonce; instead acquire the in-memory store lock once, take a snapshot (clone or iterate the map while holding the lock) and then release the lock and build the Vec<(u64, Option<TransactionStatus>)> from that snapshot for nonces in from_nonce..to_nonce. Replace per-iteration find_by_nonce calls with lookups into the snapshot, keeping the function signature and returned TransactionStatus mapping unchanged.src/jobs/job.rs (1)
267-270:with_metadatais destructive on routed health-check jobs.Because this replaces the whole map, chaining it after
nonce_health()/nonce_health_with_hint()would discard the routing keys those builders add, andschedule_relayer_nonce_health_jobinsrc/domain/transaction/evm/evm_transaction.rs:277-301would stop routing the job as a nonce-health action. Consider merging into the existing map here, or rename this toreplace_metadataand add a short doc comment so the behavior is explicit.
As per coding guidelines "Include relevant doc comments (///) on public functions, structs, and modules. Use comments for 'why', not 'what'. Avoid redundant doc comments".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/jobs/job.rs` around lines 267 - 270, The with_metadata method currently overwrites the entire metadata map which discards routing keys set by nonce_health / nonce_health_with_hint and breaks schedule_relayer_nonce_health_job; change with_metadata to merge the incoming HashMap into the existing self.metadata (preserving existing keys and adding/updating entries) or alternatively rename it to replace_metadata and add a /// doc comment on the public method explaining that it replaces metadata (and mention nonce_health/nonce_health_with_hint behavior), and update callers if you rename the method; reference with_metadata, nonce_health, nonce_health_with_hint, and schedule_relayer_nonce_health_job to locate relevant code.src/repositories/transaction/mod.rs (1)
91-101: Give the occupancy result a named type and explicit invariants.
Vec<(u64, Option<TransactionStatus>)>is repeated across the trait, mock, and storage adapter, but the nonce-gap scanner depends on stronger semantics than the signature shows. A small alias or entry type would make the API self-describing and give you one place to document “ascending order, one entry per nonce, and empty whenfrom_nonce >= to_nonce,” which will help keep the two backends aligned.
As per coding guidelines "Use type aliases for complex types to improve readability".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/repositories/transaction/mod.rs` around lines 91 - 101, Replace the opaque return type Vec<(u64, Option<TransactionStatus>)> on get_nonce_occupancy with a named type (e.g., NonceOccupancyEntry or type alias NonceOccupancy = Vec<NonceOccupancyEntry>) and update the trait signature plus the mock and storage adapter to use that type; document the invariants on that named type: entries are in ascending nonce order, contain exactly one entry per nonce in [from_nonce, to_nonce) and the vector is empty when from_nonce >= to_nonce, so callers (like the nonce-gap scanner) and implementors rely on the same contract.src/domain/relayer/evm/nonce.rs (1)
62-68: Consider logging counter service errors before falling back.The
.ok().flatten().unwrap_or(0)pattern silently swallows any error fromget(). While defaulting to 0 is safe (the sync will useon_chain_nonce), logging the error would help diagnose counter service issues.🔧 Suggested change
let transaction_counter_nonce = self .transaction_counter_service .get() .await - .ok() - .flatten() - .unwrap_or(0); + .map_err(|e| { + debug!(error = %e, "failed to get counter, defaulting to 0"); + e + }) + .ok() + .flatten() + .unwrap_or(0);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/domain/relayer/evm/nonce.rs` around lines 62 - 68, The call to transaction_counter_service.get() currently swallows errors via .ok().flatten().unwrap_or(0); change this to capture the Result from transaction_counter_service.get().await (or inspect the Option<Result<...>>) and log any Err or error-containing Option before falling back to 0, then assign the fallback to transaction_counter_nonce; reference the transaction_counter_service.get() call and the transaction_counter_nonce assignment so you add a processLogger/debug!("...") or similar logging on error and preserve the existing unwrap_or(0) behavior.src/domain/transaction/evm/evm_transaction.rs (1)
237-241: Minor: Use importedNetworkTypeinstead of full path.
NetworkTypeis already imported at line 32, so the full path here is unnecessary.Suggested fix
let mut job = TransactionStatusCheck::new( tx.id.clone(), tx.relayer_id.clone(), - crate::models::NetworkType::Evm, + NetworkType::Evm, );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/domain/transaction/evm/evm_transaction.rs` around lines 237 - 241, Replace the fully-qualified enum path when constructing the TransactionStatusCheck: in the TransactionStatusCheck::new call that passes tx.id.clone() and tx.relayer_id.clone(), change crate::models::NetworkType::Evm to the imported NetworkType::Evm to use the existing import and avoid the redundant full path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/constants/evm_transaction.rs`:
- Around line 166-180: Test case includes "future nonce" which isn't covered by
the configured NONCE_TOO_HIGH_PATTERNS; update the test fixtures in
test_nonce_too_high_patterns_match_expected_strings to use a supported substring
(e.g. replace "future nonce" with "nonce too far in the future") or
alternatively add the broader pattern "future nonce" to the
NONCE_TOO_HIGH_PATTERNS constant so the test matches actual provider responses;
locate NONCE_TOO_HIGH_PATTERNS and the test function name to apply the chosen
change.
In `@src/repositories/transaction/transaction_redis.rs`:
- Around line 1614-1621: The get_nonce_occupancy path currently obtains a
connection from self.get_connection(self.connections.reader(),
"get_nonce_occupancy") which risks replica lag and fabricated nonce gaps; change
it to acquire a primary/writer connection (e.g.
self.get_connection(self.connections.writer() or self.connections.primary(),
"get_nonce_occupancy")) so the MGET against nonce_keys reads the latest
committed data; keep the existing error mapping (map_redis_error call) unchanged
except ensure the same context string ("get_nonce_occupancy:mget_nonces") is
used.
In `@src/services/provider/mod.rs`:
- Around line 1311-1319: The test is asserting against a string ("future nonce")
that isn't in the shared NONCE_TOO_HIGH_PATTERNS used by
is_non_retriable_transaction_rpc_message; update the test to use the canonical
pattern present in NONCE_TOO_HIGH_PATTERNS (e.g., "nonce too far in the future")
or remove the incorrect assertion so the test matches production behavior
checked by is_non_retriable_transaction_rpc_message.
---
Nitpick comments:
In `@src/domain/relayer/evm/nonce.rs`:
- Around line 62-68: The call to transaction_counter_service.get() currently
swallows errors via .ok().flatten().unwrap_or(0); change this to capture the
Result from transaction_counter_service.get().await (or inspect the
Option<Result<...>>) and log any Err or error-containing Option before falling
back to 0, then assign the fallback to transaction_counter_nonce; reference the
transaction_counter_service.get() call and the transaction_counter_nonce
assignment so you add a processLogger/debug!("...") or similar logging on error
and preserve the existing unwrap_or(0) behavior.
In `@src/domain/transaction/evm/evm_transaction.rs`:
- Around line 237-241: Replace the fully-qualified enum path when constructing
the TransactionStatusCheck: in the TransactionStatusCheck::new call that passes
tx.id.clone() and tx.relayer_id.clone(), change crate::models::NetworkType::Evm
to the imported NetworkType::Evm to use the existing import and avoid the
redundant full path.
In `@src/jobs/job.rs`:
- Around line 267-270: The with_metadata method currently overwrites the entire
metadata map which discards routing keys set by nonce_health /
nonce_health_with_hint and breaks schedule_relayer_nonce_health_job; change
with_metadata to merge the incoming HashMap into the existing self.metadata
(preserving existing keys and adding/updating entries) or alternatively rename
it to replace_metadata and add a /// doc comment on the public method explaining
that it replaces metadata (and mention nonce_health/nonce_health_with_hint
behavior), and update callers if you rename the method; reference with_metadata,
nonce_health, nonce_health_with_hint, and schedule_relayer_nonce_health_job to
locate relevant code.
In `@src/repositories/transaction/mod.rs`:
- Around line 91-101: Replace the opaque return type Vec<(u64,
Option<TransactionStatus>)> on get_nonce_occupancy with a named type (e.g.,
NonceOccupancyEntry or type alias NonceOccupancy = Vec<NonceOccupancyEntry>) and
update the trait signature plus the mock and storage adapter to use that type;
document the invariants on that named type: entries are in ascending nonce
order, contain exactly one entry per nonce in [from_nonce, to_nonce) and the
vector is empty when from_nonce >= to_nonce, so callers (like the nonce-gap
scanner) and implementors rely on the same contract.
In `@src/repositories/transaction/transaction_in_memory.rs`:
- Around line 323-334: The get_nonce_occupancy implementation currently calls
find_by_nonce inside the loop, re-locking and rescanning the store for each
nonce; instead acquire the in-memory store lock once, take a snapshot (clone or
iterate the map while holding the lock) and then release the lock and build the
Vec<(u64, Option<TransactionStatus>)> from that snapshot for nonces in
from_nonce..to_nonce. Replace per-iteration find_by_nonce calls with lookups
into the snapshot, keeping the function signature and returned TransactionStatus
mapping unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b8f60d94-69c3-49fa-8f2e-228c7067de17
📒 Files selected for processing (19)
src/constants/evm_transaction.rssrc/domain/relayer/evm/evm_relayer.rssrc/domain/relayer/evm/mod.rssrc/domain/relayer/evm/nonce.rssrc/domain/relayer/mod.rssrc/domain/transaction/common.rssrc/domain/transaction/evm/evm_transaction.rssrc/domain/transaction/evm/status.rssrc/domain/transaction/stellar/submit.rssrc/jobs/handlers/relayer_health_check_handler.rssrc/jobs/handlers/transaction_status_handler.rssrc/jobs/job.rssrc/jobs/status_check_context.rssrc/models/transaction/repository.rssrc/repositories/transaction/mod.rssrc/repositories/transaction/transaction_in_memory.rssrc/repositories/transaction/transaction_redis.rssrc/services/provider/mod.rssrc/services/provider/retry.rs
Summary
checks if it's blocked by nonce gaps below it. If gaps found, schedules a nonce health job instead of
futile resubmission.
instead of relying on the transaction counter, which can be stale after restarts.
below existing tx nonces.
Testing Process
Checklist
Note
If you are using Relayer in your stack, consider adding your team or organization to our list of Relayer Users in the Wild!
Summary by CodeRabbit
New Features
Bug Fixes
Refactor