…ontention

This commit addresses RPS concerns by reducing lock contention in the block height tracking hot path.

Changes:
- Change perceivedBlockNumber from uint64 to atomic.Uint64
- Remove locks from basicEndpointValidation() and isBlockNumberValid()
- Use atomic Load() for reads and CompareAndSwap() for writes
- Optimize filterValidEndpointsWithDetails() to copy data under lock then release lock before iterating (O(1) lock hold instead of O(n))
- UpdateFromExtractedData() now uses atomic CAS instead of mutex

Before: Request path held RLock for entire endpoint filtering loop, blocking observation writes and causing cascading delays.

After: Lock held only briefly to copy endpoint data, atomic reads for perceivedBlockNumber are lock-free.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
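As an illustration of the atomic update described in the commit above, here is a minimal Go sketch. The `blockTracker` type and its `observe` method are hypothetical names, not the PR's code, and the rule that the value only moves forward is an assumption. The commit itself states only that reads use `Load()` and writes use `CompareAndSwap()`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// blockTracker is a hypothetical stand-in for the QoS state that owns
// perceivedBlockNumber. Only the field name comes from the commit message.
type blockTracker struct {
	perceivedBlockNumber atomic.Uint64
}

// observe stores candidate if it is higher than the current value and
// reports whether it was stored. Assumption: the perceived block number
// should never move backwards.
func (t *blockTracker) observe(candidate uint64) bool {
	for {
		current := t.perceivedBlockNumber.Load() // lock-free read
		if candidate <= current {
			return false // stale or duplicate observation
		}
		// CompareAndSwap fails if another goroutine wrote in between;
		// in that case reload and re-check.
		if t.perceivedBlockNumber.CompareAndSwap(current, candidate) {
			return true
		}
	}
}

func main() {
	var t blockTracker
	var wg sync.WaitGroup

	// Eight concurrent writers report heights 1..100 without any mutex.
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for h := uint64(1); h <= 100; h++ {
				t.observe(h)
			}
		}()
	}
	wg.Wait()

	fmt.Println(t.perceivedBlockNumber.Load()) // prints 100
}
```

The retry loop is what makes the write safe without a mutex: a failed `CompareAndSwap` means another writer got in first, so the candidate is compared against the fresh value before trying again.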
This PR combines and extends the work from #505 with additional bug fixes and improvements for hedge racing and retry reliability.

Closes #505

## Features (from #505)

### Protocol Error Propagation
- Add `SetProtocolError` to `RequestQoSContext` interface for specific error messages
- Replace generic "no endpoint responses received" with specific errors like "no valid endpoints available for service"

### Hedge Racing (New Feature)
- Spawn parallel "hedge" request after configurable delay if primary hasn't responded
- First successful response wins; the other is cancelled
- Configurable via `retry_config.hedge_delay` and `retry_config.connect_timeout`
- Track outcomes via `X-Hedge-Result` header

### Retry Enhancements
- **Time Budget**: `max_retry_latency` skips retries when failed request already took too long
- **Endpoint Rotation**: Each retry attempt uses a different endpoint
- **Heuristic Detection**: Retry on JSON-RPC errors hidden in HTTP 200 responses
- **Observability**: Track via `X-Retry-Count` and `X-Suppliers-Tried` headers

### Heuristic Response Analysis
- Detect errors in response payloads despite HTTP 200 status
- Identify: JSON-RPC errors, HTML error pages, empty responses, malformed JSON
- Record correcting reputation signals for detected failures

### Response Metadata Headers

| Header | Description |
|--------|-------------|
| `X-Retry-Count` | Number of retry attempts (0 = first attempt succeeded) |
| `X-Suppliers-Tried` | Comma-separated list of attempted supplier addresses |
| `X-Hedge-Result` | Hedge racing outcome: `primary_only`, `primary_won`, `hedge_won`, `both_failed` |
| `X-App-Address` | Application address used for the relay |
| `X-Supplier-Address` | Supplier address of the responding endpoint |
| `X-Session-ID` | Session ID for the relay |

### Health Check & Sync Check
- **Sync check validation**: Health checks now validate endpoint block height against QoS perceived block number using `sync_allowance` config
- Consolidated block height validation directly into health check executor (removed standalone `BlockHeightValidator`, `BlockHeightReferenceCache`)
- Simplified health check config structure
- Fix defer pattern in solana.go for mutex unlock
- Add nil map initialization safety check in solana.go

## Bug Fixes (this PR)
- **X-Suppliers-Tried header**: Pre-register both primary and hedge suppliers when racing starts
- **selectTopRankedEndpoint**: Return original endpoint address instead of reputation key (fixes 'endpoint not available' errors)
- **Retry blockchain errors**: Detect and retry node-specific errors (missing trie node, unhealthy node) even in valid JSON-RPC responses
- **Health check refactor**: Simplify block height validation and consolidate into health checks

## Contributions from @oten91
- Prioritized endpoint inclusion during reputation filtering (mitigates race conditions)
- Request-awareness for data extraction methods
- Enhanced JSON-RPC response analysis with stricter error classification
- Heuristic-based error classification with unit tests
- Improved supplier tracking and debugging
- JSON-RPC error handling to prevent retries for valid client errors

## Configuration

```yaml
services:
  - service_id: eth
    retry_config:
      enabled: true
      max_retries: 2
      hedge_delay: 500ms
      connect_timeout: 200ms
      max_retry_latency: 5s
      retry_on_5xx: true
      retry_on_timeout: true
      retry_on_connection: true
```

Includes #508 and #507

### Testing
- [x] Unit tests
- [x] E2E tests (eth service 74.33% success rate)
- [x] Local hedge testing verified with `scripts/test_hedge.sh`

---------

Co-authored-by: Otto V <ottoevargas@gmail.com>
Remove unused functions (calculateRetryBackoff, mock getBlock/setBlock), simplify embedded field selectors flagged by staticcheck, and handle unchecked Encode error.
No suppliers stake comet_bft endpoints for xrplevm, so the example config should match production which only supports json_rpc and websocket. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jorgecuesta approved these changes Mar 12, 2026
#505 #506 #507 #508 into main