fix: bound per-attempt RPC timeout and retry empty get_code responses by flyq · Pull Request #129 · megaeth-labs/stateless-validator

flyq · 2026-04-30T06:19:40Z

Summary

Two production reliability fixes for the shared RpcClient used by both stateless-validator and debug-trace-server.

1. Per-attempt timeout for stalled providers

A provider that accepts the TCP connection but never replies would wedge the retry loop forever on the chain-sync path (deadline = None).
alloy + reqwest's default Client::new() has no request timeout, and alloy_rpc_client::ClientBuilder exposes none either, so without an external bound the retry loop has no way to know an attempt is hung.

round_robin_with_backoff now applies tokio::time::timeout(min(per_attempt_timeout, remaining-to-deadline)) on every attempt.
An attempt-level timeout synthesizes a normal Err and falls into the same "rotate to next provider, log retry" path returned errors take.
A deadline-level timeout still surfaces as RpcDeadlineExceeded for callers that pass a deadline (the trace server's user-facing path).

Default per_attempt_timeout = 30s, derived from "longest healthy attempt is ~10s (witness fetch near tip), 3× safety margin".
Configurable via:

--rpc-per-attempt-timeout-ms / STATELESS_VALIDATOR_RPC_PER_ATTEMPT_TIMEOUT_MS (validator)
--rpc-per-attempt-timeout-ms / DEBUG_TRACE_SERVER_RPC_PER_ATTEMPT_TIMEOUT_MS (trace server)

2. Retry empty `eth_getCodeByHash` responses

mega-reth's eth_getCodeByHash handler uses state.bytecode_by_hash(&code_hash)?.unwrap_or_default(), returning "0x" (empty) when the node hasn't synced the relevant state yet — not a JSON-RPC error.
Before this fix, that empty response landed in get_codes(verify=true), failed keccak verification (keccak("") != requested_hash), and halted the validator as a fatal VerificationFailure.

Production trace that motivated this:

ERROR ... Contract hash mismatch: expected 0x06df675a...,
       got 0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470

The got value is exactly KECCAK_EMPTY.

get_code_with_deadline now treats an empty response for any non-KECCAK_EMPTY codehash as a transient error, so round_robin_with_backoff rotates / backs off and retries.
The KECCAK_EMPTY case is whitelisted — a genuine empty-bytecode request must accept "0x", otherwise the call would loop forever.
The synthesized error message includes "(provider may not be sync-ready)" so the round-3+ WARN log carries an operator-actionable diagnostic.

Test plan

cargo test --workspace --lib — 117 tests pass
cargo clippy --workspace --all-targets — clean
cargo check --workspace --all-targets — clean
New unit tests in crates/stateless-common/src/rpc_client.rs:
- test_per_attempt_timeout_rotates_past_stalled_provider — pairs a TCP listener that accepts but never replies with a healthy provider; verifies the call returns successfully via the healthy provider on the deadline = None path.
- test_get_code_retries_empty_response_until_deadline — mock server always returns "0x"; verifies the call retries (hit count > 1) and surfaces RpcDeadlineExceeded rather than a VerificationFailure.
- test_get_code_accepts_empty_response_for_keccak_empty — guards the whitelist branch so KECCAK_EMPTY requests don't infinite-loop.

claude · 2026-04-30T06:20:33Z

The PR currently has no labels, but it should have at least the bug label.

The title starts with fix: and the description details two production reliability fixes:

Bounding per-attempt RPC timeout to prevent stalled providers from wedging the retry loop.
Retrying empty eth_getCodeByHash responses to avoid a fatal VerificationFailure halt.

Both are bug fixes for real production issues (the second even includes a production trace). The bug label applies here.

claude · 2026-04-30T06:23:10Z

Two well-motivated fixes. Logic in round_robin_with_backoff is correct — the per-attempt/deadline interplay is handled cleanly, permit drop paths are sound, and metrics attribution is right. The get_code KECCAK_EMPTY whitelist correctly prevents an infinite retry loop. Tests cover the important cases. Two nits inline.

claude · 2026-04-30T07:12:47Z

Both previous nits addressed. One new issue inline.

The per-attempt timeout logic is correct — the permit drop paths, deadline/timeout interplay, and metrics attribution are all sound. The KECCAK_EMPTY whitelist correctly prevents the infinite retry loop that would otherwise result. Three tests are well-structured and cover the right cases.

`client.get_code` retries forever (deadline = None). If the `hash != KECCAK_EMPTY` guard in `get_code_with_deadline` ever regresses (e.g. `!=` flipped to `==`), the call would loop instead of short-circuit, and the test would hang CI rather than fail. Wrap the call in `tokio::time::timeout(500ms, ..)` so a regression surfaces as a deterministic test failure. Nit from claude[bot] review on #129. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-04-30T07:19:58Z

Both fixes are correct and well-motivated. All three previous review threads have been addressed. LGTM.

claude · 2026-05-06T16:28:45Z

Both fixes are correct and well-implemented. The per-attempt timeout logic correctly handles the three cases (deadline expired before attempt, attempt times out with budget remaining, attempt completes normally), and the KECCAK_EMPTY whitelist for the empty-code retry is the right design. All prior review nits have been addressed.

LGTM.

batatmosfer-byte · 2026-05-06T20:40:47Z

https://deepwiki.com/megaeth-labs/stateless-validator/13.2-validation-functions

batatmosfer-byte · 2026-05-06T20:47:43Z

+    /// even when the caller passes no overall deadline (e.g. background chain sync), so
+    /// a provider that accepts the TCP connection but never replies is detected and
+    /// rotated past instead of wedging the retry loop.
+    #[clap(long, env = "DEBUG_TRACE_SERVER_RPC_PER_ATTEMPT_TIMEOUT_MS")]


claude · 2026-05-07T00:20:58Z

Previous review suggestions have all been addressed. The implementation is correct:

Per-attempt timeout logic correctly handles all three cases: deadline = None (stalled provider), deadline = Some with budget remaining (stalled provider rotates), and deadline = Some exhausted (surfaces RpcDeadlineExceeded). The permit is properly dropped in every path.
The KECCAK_EMPTY whitelist is correctly placed — empty responses for non-empty hashes become transient errors, avoiding the VerificationFailure halt without risking an infinite retry loop on legitimate empty-bytecode requests.
Tests cover the three scenarios and are regression-safe (the KECCAK_EMPTY test now has an outer timeout that would fail CI instead of hanging if the guard regressed).

LGTM.

vincent-k2026

LGTM. Two production-shaped fixes, both well-tested. The KECCAK_EMPTY whitelist + the regression-guard timeout in test_get_code_accepts_empty_response_for_keccak_empty are particularly nice — easy to get wrong, easy to break later, and you've pinned both directions.

Two non-blocking nits, follow-up if you want:

The per-attempt-timeout branch and the "closure returned Err" branch are operationally indistinguishable today — both increment on_rpc_retry. A steadily-stalled provider (the exact failure mode this PR detects) won't show up in any counter. A dedicated on_rpc_attempt_timeout metric, or at minimum a WARN log in the timeout arm, would close the loop on the observability side.
Default per_attempt_timeout = 30s is on the loose side — back-to-back stalled providers stack 30s each before rotation. Worth either tightening to 10–15s or nudging operators in the doc string to lower it when their upstream P99 is well under 10s.

fix get_code & fix request hang

36a3d57

flyq requested review from Troublor, abelmega and vincent-k2026 as code owners April 30, 2026 06:19

claude Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread crates/stateless-common/src/rpc_client.rs Outdated

claude Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread crates/stateless-common/src/rpc_client.rs Outdated

flyq added the bug Something isn't working label Apr 30, 2026

test: address review nits in rpc_client tests

78683a0

claude Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread crates/stateless-common/src/rpc_client.rs Outdated

add gap info

20ff8e7

batatmosfer-byte approved these changes May 6, 2026

View reviewed changes

batatmosfer-byte reviewed May 6, 2026

View reviewed changes

revert the debug log

9a5d652

vincent-k2026 approved these changes May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: bound per-attempt RPC timeout and retry empty get_code responses#129

fix: bound per-attempt RPC timeout and retry empty get_code responses#129
flyq wants to merge 5 commits intomainfrom
liquan/opt_rpc

flyq commented Apr 30, 2026

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

claude Bot commented May 6, 2026

Uh oh!

batatmosfer-byte commented May 6, 2026

Uh oh!

batatmosfer-byte May 6, 2026

Uh oh!

claude Bot commented May 7, 2026

Uh oh!

vincent-k2026 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

flyq commented Apr 30, 2026

Summary

1. Per-attempt timeout for stalled providers

2. Retry empty eth_getCodeByHash responses

Test plan

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

claude Bot commented Apr 30, 2026

Uh oh!

claude Bot commented May 6, 2026

Uh oh!

batatmosfer-byte commented May 6, 2026

Uh oh!

batatmosfer-byte May 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 7, 2026

Uh oh!

vincent-k2026 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

2. Retry empty `eth_getCodeByHash` responses