fix: bound per-attempt RPC timeout and retry empty get_code responses#129
fix: bound per-attempt RPC timeout and retry empty get_code responses#129
Conversation
|
The PR currently has no labels, but it should have at least the bug label. The title starts with
Both are bug fixes for real production issues (the second even includes a production trace). The bug label applies here. |
|
Two well-motivated fixes. Logic in |
|
Both previous nits addressed. One new issue inline. The per-attempt timeout logic is correct — the permit drop paths, deadline/timeout interplay, and metrics attribution are all sound. The KECCAK_EMPTY whitelist correctly prevents the infinite retry loop that would otherwise result. Three tests are well-structured and cover the right cases. |
`client.get_code` retries forever (deadline = None). If the `hash != KECCAK_EMPTY` guard in `get_code_with_deadline` ever regresses (e.g. `!=` flipped to `==`), the call would loop instead of short-circuit, and the test would hang CI rather than fail. Wrap the call in `tokio::time::timeout(500ms, ..)` so a regression surfaces as a deterministic test failure. Nit from claude[bot] review on #129. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Both fixes are correct and well-motivated. All three previous review threads have been addressed. LGTM. |
|
Both fixes are correct and well-implemented. The per-attempt timeout logic correctly handles the three cases (deadline expired before attempt, attempt times out with budget remaining, attempt completes normally), and the LGTM. |
| /// even when the caller passes no overall deadline (e.g. background chain sync), so | ||
| /// a provider that accepts the TCP connection but never replies is detected and | ||
| /// rotated past instead of wedging the retry loop. | ||
| #[clap(long, env = "DEBUG_TRACE_SERVER_RPC_PER_ATTEMPT_TIMEOUT_MS")] |
|
Previous review suggestions have all been addressed. The implementation is correct:
LGTM. |
vincent-k2026
left a comment
There was a problem hiding this comment.
LGTM. Two production-shaped fixes, both well-tested. The KECCAK_EMPTY whitelist + the regression-guard timeout in test_get_code_accepts_empty_response_for_keccak_empty are particularly nice — easy to get wrong, easy to break later, and you've pinned both directions.
Two non-blocking nits, follow-up if you want:
-
The per-attempt-timeout branch and the "closure returned Err" branch are operationally indistinguishable today — both increment
on_rpc_retry. A steadily-stalled provider (the exact failure mode this PR detects) won't show up in any counter. A dedicatedon_rpc_attempt_timeoutmetric, or at minimum a WARN log in the timeout arm, would close the loop on the observability side. -
Default
per_attempt_timeout = 30sis on the loose side — back-to-back stalled providers stack 30s each before rotation. Worth either tightening to 10–15s or nudging operators in the doc string to lower it when their upstream P99 is well under 10s.
Summary
Two production reliability fixes for the shared
RpcClientused by bothstateless-validatoranddebug-trace-server.1. Per-attempt timeout for stalled providers
A provider that accepts the TCP connection but never replies would wedge the retry loop forever on the chain-sync path (
deadline = None).alloy + reqwest's default
Client::new()has no request timeout, andalloy_rpc_client::ClientBuilderexposes none either, so without an external bound the retry loop has no way to know an attempt is hung.round_robin_with_backoffnow appliestokio::time::timeout(min(per_attempt_timeout, remaining-to-deadline))on every attempt.An attempt-level timeout synthesizes a normal
Errand falls into the same "rotate to next provider, log retry" path returned errors take.A deadline-level timeout still surfaces as
RpcDeadlineExceededfor callers that pass a deadline (the trace server's user-facing path).Default
per_attempt_timeout = 30s, derived from "longest healthy attempt is ~10s (witness fetch near tip), 3× safety margin".Configurable via:
--rpc-per-attempt-timeout-ms/STATELESS_VALIDATOR_RPC_PER_ATTEMPT_TIMEOUT_MS(validator)--rpc-per-attempt-timeout-ms/DEBUG_TRACE_SERVER_RPC_PER_ATTEMPT_TIMEOUT_MS(trace server)2. Retry empty
eth_getCodeByHashresponsesmega-reth's
eth_getCodeByHashhandler usesstate.bytecode_by_hash(&code_hash)?.unwrap_or_default(), returning"0x"(empty) when the node hasn't synced the relevant state yet — not a JSON-RPC error.Before this fix, that empty response landed in
get_codes(verify=true), failed keccak verification (keccak("") != requested_hash), and halted the validator as a fatalVerificationFailure.Production trace that motivated this:
The
gotvalue is exactlyKECCAK_EMPTY.get_code_with_deadlinenow treats an empty response for any non-KECCAK_EMPTYcodehash as a transient error, soround_robin_with_backoffrotates / backs off and retries.The
KECCAK_EMPTYcase is whitelisted — a genuine empty-bytecode request must accept"0x", otherwise the call would loop forever.The synthesized error message includes
"(provider may not be sync-ready)"so the round-3+ WARN log carries an operator-actionable diagnostic.Test plan
cargo test --workspace --lib— 117 tests passcargo clippy --workspace --all-targets— cleancargo check --workspace --all-targets— cleancrates/stateless-common/src/rpc_client.rs:test_per_attempt_timeout_rotates_past_stalled_provider— pairs a TCP listener that accepts but never replies with a healthy provider; verifies the call returns successfully via the healthy provider on thedeadline = Nonepath.test_get_code_retries_empty_response_until_deadline— mock server always returns"0x"; verifies the call retries (hit count > 1) and surfacesRpcDeadlineExceededrather than aVerificationFailure.test_get_code_accepts_empty_response_for_keccak_empty— guards the whitelist branch soKECCAK_EMPTYrequests don't infinite-loop.