Merge Staging with Main for new release #510

Merged
oten91 merged 6 commits into main from staging on Mar 12, 2026

@oten91 (Contributor) commented Mar 11, 2026

Merges #505, #506, #507, and #508 into main.

oten91 and others added 2 commits January 23, 2026 12:36
…ontention

This commit addresses RPS concerns by reducing lock contention in the
block height tracking hot path.

Changes:
- Change perceivedBlockNumber from uint64 to atomic.Uint64
- Remove locks from basicEndpointValidation() and isBlockNumberValid()
- Use atomic Load() for reads and CompareAndSwap() for writes
- Optimize filterValidEndpointsWithDetails() to copy data under lock
  then release lock before iterating (O(1) lock hold instead of O(n))
- UpdateFromExtractedData() now uses atomic CAS instead of mutex

Before: Request path held RLock for entire endpoint filtering loop,
blocking observation writes and causing cascading delays.

After: Lock held only briefly to copy endpoint data, atomic reads
for perceivedBlockNumber are lock-free.
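
The lock-free pattern described in this commit can be sketched as follows. This is a minimal illustration, not the code from the commit: `blockTracker`, `observe`, `snapshot`, and `raceObserve` are hypothetical names, and only `perceivedBlockNumber` comes from the message above. It assumes the perceived height should only ever increase. Note that copying a map is still linear in the number of endpoints; the point of the sketch is that per-endpoint work happens after the lock is released.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// blockTracker is a hypothetical stand-in for the QoS state in this commit.
type blockTracker struct {
	perceivedBlockNumber atomic.Uint64 // read and written without a lock

	mu        sync.RWMutex
	endpoints map[string]uint64 // endpoint address -> last observed height
}

// observe raises perceivedBlockNumber to height when height is larger.
// The loop repeats only if another goroutine changed the value between
// Load and CompareAndSwap.
func (b *blockTracker) observe(height uint64) {
	for {
		cur := b.perceivedBlockNumber.Load()
		if height <= cur || b.perceivedBlockNumber.CompareAndSwap(cur, height) {
			return
		}
	}
}

// snapshot copies the endpoint data under the read lock. The lock is
// released when snapshot returns, so callers iterate without holding it.
func (b *blockTracker) snapshot() map[string]uint64 {
	b.mu.RLock()
	defer b.mu.RUnlock()
	out := make(map[string]uint64, len(b.endpoints))
	for addr, h := range b.endpoints {
		out[addr] = h
	}
	return out
}

// raceObserve runs n concurrent observers with heights 1..n and returns the
// final perceived height.
func raceObserve(n uint64) uint64 {
	b := &blockTracker{}
	var wg sync.WaitGroup
	for h := uint64(1); h <= n; h++ {
		wg.Add(1)
		go func(h uint64) {
			defer wg.Done()
			b.observe(h)
		}(h)
	}
	wg.Wait()
	return b.perceivedBlockNumber.Load()
}

func main() {
	fmt.Println(raceObserve(100)) // the highest observed height wins

	b := &blockTracker{endpoints: map[string]uint64{"endpoint-a": 98}}
	for addr, h := range b.snapshot() { // no lock is held during this loop
		fmt.Println(addr, h)
	}
}
```

Whatever order the goroutines run in, the CAS loop leaves the highest height in place, so `raceObserve(100)` returns 100.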

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This PR combines and extends the work from #505 with additional bug
fixes and improvements for hedge racing and retry reliability.

Closes #505

## Features (from #505)

### Protocol Error Propagation
- Add `SetProtocolError` to `RequestQoSContext` interface for specific
error messages
- Replace generic "no endpoint responses received" with specific errors
like "no valid endpoints available for service"

### Hedge Racing (New Feature)
- Spawn parallel "hedge" request after configurable delay if primary
hasn't responded
- First successful response wins; the other is cancelled
- Configurable via `retry_config.hedge_delay` and
`retry_config.connect_timeout`
- Track outcomes via `X-Hedge-Result` header
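
The hedge flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `attempt`, `raceWithHedge`, `after`, and `demo` are hypothetical names, attempts are assumed to honor context cancellation, and a primary that fails before the hedge delay elapses is assumed to be reported as `primary_only` and left to the retry path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// attempt is a hypothetical signature for sending one relay to one endpoint.
type attempt func(ctx context.Context) (string, error)

type result struct {
	body   string
	err    error
	source string // "primary" or "hedge"
}

// raceWithHedge starts primary, starts hedge if primary is still pending
// after hedgeDelay, and returns the first successful body together with an
// outcome label that matches the X-Hedge-Result values.
func raceWithHedge(ctx context.Context, primary, hedge attempt, hedgeDelay time.Duration) (string, string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever attempt is still in flight

	results := make(chan result, 2) // buffered so a late loser never blocks
	launch := func(source string, fn attempt) {
		go func() {
			body, err := fn(ctx)
			results <- result{body, err, source}
		}()
	}

	launch("primary", primary)
	timer := time.NewTimer(hedgeDelay)
	defer timer.Stop()

	hedged, pending := false, 1
	var lastErr error
	for pending > 0 {
		select {
		case <-timer.C: // primary has not responded within hedgeDelay
			hedged = true
			pending++
			launch("hedge", hedge)
		case r := <-results:
			pending--
			if r.err != nil {
				lastErr = r.err
				continue
			}
			if !hedged {
				return r.body, "primary_only", nil
			}
			return r.body, r.source + "_won", nil
		}
	}
	if hedged {
		return "", "both_failed", lastErr
	}
	return "", "primary_only", lastErr
}

// after returns an attempt that waits d (or until ctx is cancelled) and then
// yields body, or an error when body is empty.
func after(d time.Duration, body string) attempt {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(d):
		case <-ctx.Done():
			return "", ctx.Err()
		}
		if body == "" {
			return "", errors.New("endpoint failed")
		}
		return body, nil
	}
}

// demo runs three scenarios and returns their outcome labels.
func demo() [3]string {
	ctx := context.Background()
	ms := time.Millisecond
	_, fast, _ := raceWithHedge(ctx, after(0, "p"), after(0, "h"), 200*ms)
	_, slow, _ := raceWithHedge(ctx, after(2*time.Second, "p"), after(0, "h"), 20*ms)
	_, bad, _ := raceWithHedge(ctx, after(100*ms, ""), after(0, ""), 20*ms)
	return [3]string{fast, slow, bad}
}

func main() {
	fmt.Println(demo())
}
```

With these delays the three scenarios yield `primary_only`, `hedge_won`, and `both_failed`. The buffered channel is the main design choice: the losing goroutine can always deliver its result and exit, even after the function has returned.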

### Retry Enhancements
- **Time Budget**: `max_retry_latency` skips retries when failed request
already took too long
- **Endpoint Rotation**: Each retry attempt uses a different endpoint
- **Heuristic Detection**: Retry on JSON-RPC errors hidden in HTTP 200
responses
- **Observability**: Track via `X-Retry-Count` and `X-Suppliers-Tried`
headers
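
The time budget and endpoint rotation can be sketched together. This is a minimal sketch, assuming the caller supplies a pre-ranked endpoint list and that each attempt reports its own duration; `retryConfig`, `sendFunc`, `relayWithRetry`, and `demo` are hypothetical names, and only `max_retries` and `max_retry_latency` come from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// retryConfig mirrors two retry_config keys; the Go field names are hypothetical.
type retryConfig struct {
	MaxRetries      int
	MaxRetryLatency time.Duration
}

// sendFunc is a hypothetical hook that relays to one endpoint and reports
// how long the attempt took.
type sendFunc func(endpoint string) (body string, took time.Duration, err error)

// relayWithRetry tries endpoints in order, one per attempt, and stops early
// when a failed attempt has already used more than MaxRetryLatency.
func relayWithRetry(cfg retryConfig, endpoints []string, send sendFunc) (string, []string, error) {
	if len(endpoints) == 0 {
		return "", nil, errors.New("no valid endpoints available for service")
	}
	var tried []string
	var lastErr error
	for i := 0; i <= cfg.MaxRetries && i < len(endpoints); i++ {
		ep := endpoints[i] // endpoint rotation: each attempt uses a new endpoint
		tried = append(tried, ep)
		body, took, err := send(ep)
		if err == nil {
			return body, tried, nil
		}
		lastErr = err
		if took > cfg.MaxRetryLatency {
			break // time budget exhausted: skip the remaining retries
		}
	}
	return "", tried, lastErr
}

// demo fails the first attempt after firstTookMs milliseconds and lets any
// later attempt succeed. It returns the retry count and the tried list.
func demo(firstTookMs int) (int, string) {
	cfg := retryConfig{MaxRetries: 2, MaxRetryLatency: 5 * time.Second}
	calls := 0
	send := func(ep string) (string, time.Duration, error) {
		calls++
		if calls == 1 {
			return "", time.Duration(firstTookMs) * time.Millisecond, errors.New("timeout")
		}
		return "ok", 10 * time.Millisecond, nil
	}
	_, tried, _ := relayWithRetry(cfg, []string{"supplier1", "supplier2", "supplier3"}, send)
	return len(tried) - 1, strings.Join(tried, ",")
}

func main() {
	fmt.Println(demo(100))  // fast failure: retried on a second endpoint
	fmt.Println(demo(6000)) // slow failure: the retry is skipped
}
```

In this sketch, a first attempt that fails after 100 ms is retried on `supplier2`, while one that fails after 6 s exceeds the 5 s budget and returns the error without a retry.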

### Heuristic Response Analysis
- Detect errors in response payloads despite HTTP 200 status
- Identify: JSON-RPC errors, HTML error pages, empty responses,
malformed JSON
- Record corrective reputation signals for detected failures
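
The payload checks listed above can be sketched with the standard library alone. This is an illustrative sketch, not the PR's analyzer: `classify` and its return labels are hypothetical, HTML is detected only by a leading `<`, and batch (array) responses are passed through without inspection.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// classify inspects a response body that arrived with HTTP 200 and returns a
// label for the failure it hides, or "ok". The labels are hypothetical.
func classify(body []byte) string {
	trimmed := bytes.TrimSpace(body)
	if len(trimmed) == 0 {
		return "empty_response"
	}
	if trimmed[0] == '<' {
		return "html_error_page"
	}
	if !json.Valid(trimmed) {
		return "malformed_json"
	}
	// Only single JSON-RPC objects are inspected; batch arrays pass through.
	var resp struct {
		Error json.RawMessage `json:"error"`
	}
	if trimmed[0] == '{' {
		if err := json.Unmarshal(trimmed, &resp); err != nil {
			return "malformed_json"
		}
	}
	// An absent or null "error" member means the call succeeded.
	if len(resp.Error) > 0 && !bytes.Equal(resp.Error, []byte("null")) {
		return "jsonrpc_error"
	}
	return "ok"
}

func main() {
	samples := []string{
		`{"jsonrpc":"2.0","id":1,"result":"0x10"}`,
		`{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"missing trie node"}}`,
		`<html><body>502 Bad Gateway</body></html>`,
		``,
		`{"jsonrpc":"2.0","id":1,"result":`,
	}
	for _, s := range samples {
		fmt.Println(classify([]byte(s)))
	}
}
```

Keeping `error` as a `json.RawMessage` avoids a type error when an endpoint returns a non-standard error value such as a bare string.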

### Response Metadata Headers
| Header               | Description                                                                     |
|----------------------|---------------------------------------------------------------------------------|
| `X-Retry-Count`      | Number of retry attempts (0 = first attempt succeeded)                          |
| `X-Suppliers-Tried`  | Comma-separated list of attempted supplier addresses                            |
| `X-Hedge-Result`     | Hedge racing outcome: `primary_only`, `primary_won`, `hedge_won`, `both_failed` |
| `X-App-Address`      | Application address used for the relay                                          |
| `X-Supplier-Address` | Supplier address of the responding endpoint                                     |
| `X-Session-ID`       | Session ID for the relay                                                        |
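
As a rough illustration of how these headers might be attached to a response, the sketch below writes one header per table row with `net/http`. The `relayMetadata` type, its fields, and the sample addresses are hypothetical; only the header names and the `X-Hedge-Result` values come from the table.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// relayMetadata is a hypothetical container for the values in the table.
type relayMetadata struct {
	RetryCount      int
	SuppliersTried  []string
	HedgeResult     string // primary_only, primary_won, hedge_won, or both_failed
	AppAddress      string
	SupplierAddress string
	SessionID       string
}

// apply writes one response header per table row.
func (m relayMetadata) apply(h http.Header) {
	h.Set("X-Retry-Count", strconv.Itoa(m.RetryCount))
	h.Set("X-Suppliers-Tried", strings.Join(m.SuppliersTried, ","))
	h.Set("X-Hedge-Result", m.HedgeResult)
	h.Set("X-App-Address", m.AppAddress)
	h.Set("X-Supplier-Address", m.SupplierAddress)
	h.Set("X-Session-ID", m.SessionID)
}

// demo builds the headers for a relay that the hedge request won.
func demo() http.Header {
	m := relayMetadata{
		RetryCount:      0,
		SuppliersTried:  []string{"supplier1", "supplier2"},
		HedgeResult:     "hedge_won",
		AppAddress:      "app1",
		SupplierAddress: "supplier2",
		SessionID:       "session1",
	}
	h := http.Header{}
	m.apply(h)
	return h
}

func main() {
	h := demo()
	fmt.Println(h.Get("X-Suppliers-Tried"), h.Get("X-Hedge-Result"))
}
```

The retry count is kept as its own field instead of being derived from the supplier list, because a hedged request registers two suppliers without any retry having happened.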

### Health Check & Sync Check
- **Sync check validation**: Health checks now validate endpoint block
height against QoS perceived block number using `sync_allowance` config
- Consolidated block height validation directly into health check
executor (removed standalone `BlockHeightValidator`,
`BlockHeightReferenceCache`)
- Simplified health check config structure
- Fix defer pattern in solana.go for mutex unlock
- Add nil map initialization safety check in solana.go
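
The sync check and the two solana.go fixes can be sketched as below. This is a minimal sketch under assumptions: `withinSyncAllowance`, `heightStore`, and `record` are hypothetical names, an endpoint ahead of the perceived height is assumed to pass, and a `sync_allowance` of 0 is assumed to tolerate no lag.

```go
package main

import (
	"fmt"
	"sync"
)

// withinSyncAllowance reports whether an endpoint is close enough to the
// perceived block number. Assumption: endpoints at or ahead of the perceived
// height always pass.
func withinSyncAllowance(endpointHeight, perceived, syncAllowance uint64) bool {
	if endpointHeight >= perceived {
		return true
	}
	// endpointHeight < perceived here, so the subtraction cannot underflow.
	return perceived-endpointHeight <= syncAllowance
}

// heightStore is a hypothetical store that shows the defer and nil-map fixes.
type heightStore struct {
	mu      sync.Mutex
	heights map[string]uint64
}

// record stores the latest height for an endpoint.
func (s *heightStore) record(endpoint string, height uint64) {
	s.mu.Lock()
	defer s.mu.Unlock() // released on every return path, including panics
	if s.heights == nil {
		s.heights = make(map[string]uint64) // writing to a nil map would panic
	}
	s.heights[endpoint] = height
}

func main() {
	fmt.Println(withinSyncAllowance(995, 1000, 10)) // 5 blocks behind, allowed
	fmt.Println(withinSyncAllowance(980, 1000, 10)) // 20 blocks behind, rejected

	var s heightStore // zero value: heights is nil until the first record
	s.record("endpoint-a", 995)
	fmt.Println(s.heights["endpoint-a"])
}
```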

## Bug Fixes (this PR)

- **X-Suppliers-Tried header**: Pre-register both primary and hedge
suppliers when racing starts
- **selectTopRankedEndpoint**: Return original endpoint address instead
of reputation key (fixes 'endpoint not available' errors)
- **Retry blockchain errors**: Detect and retry node-specific errors
(missing trie node, unhealthy node) even in valid JSON-RPC responses
- **Health check refactor**: Simplify block height validation and
consolidate into health checks

## Contributions from @oten91 

- Prioritized endpoint inclusion during reputation filtering (mitigates
race conditions)
- Request-awareness for data extraction methods
- Enhanced JSON-RPC response analysis with stricter error classification
- Heuristic-based error classification with unit tests
- Improved supplier tracking and debugging
- JSON-RPC error handling to prevent retries for valid client errors
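
The retry rules mentioned in the bug fixes and contributions (retry node-specific errors, do not retry valid client errors) could look like the sketch below. The marker strings come from the PR description and the numeric codes are the reserved JSON-RPC 2.0 error codes. Everything else is an assumption: `shouldRetry` is a hypothetical name, and which codes the project actually treats as client errors is not stated in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeErrorMarkers are message fragments that point at the node, not the
// request. Both strings are taken from the PR description.
var nodeErrorMarkers = []string{"missing trie node", "unhealthy node"}

// shouldRetry decides whether a JSON-RPC error is worth retrying on another
// endpoint. The policy is an assumption made for illustration.
func shouldRetry(code int, message string) bool {
	msg := strings.ToLower(message)
	for _, marker := range nodeErrorMarkers {
		if strings.Contains(msg, marker) {
			return true // node-specific failure: another endpoint may succeed
		}
	}
	switch code {
	case -32700, -32600, -32601, -32602:
		// Parse error, invalid request, method not found, invalid params:
		// the request itself is at fault, so a retry would fail again.
		return false
	case -32603:
		return true // internal error on the serving node
	}
	return false // conservative default: surface the error to the client
}

func main() {
	fmt.Println(shouldRetry(-32000, "missing trie node 0xabc"))
	fmt.Println(shouldRetry(-32602, "invalid argument 0"))
	fmt.Println(shouldRetry(-32000, "execution reverted"))
}
```

The message check runs before the code check because node-specific failures are often reported under a generic server error code.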

## Configuration

```yaml
services:
  - service_id: eth
    retry_config:
      enabled: true
      max_retries: 2
      hedge_delay: 500ms
      connect_timeout: 200ms
      max_retry_latency: 5s
      retry_on_5xx: true
      retry_on_timeout: true
      retry_on_connection: true
```

Includes #508 and #507

### Testing
- [x] Unit tests
- [x] E2E tests (eth service: 74.33% success rate)
- [x] Local hedge testing verified with `scripts/test_hedge.sh`

---------

Co-authored-by: Otto V <ottoevargas@gmail.com>
@oten91 oten91 requested a review from jorgecuesta March 11, 2026 23:25
oten91 and others added 4 commits March 12, 2026 00:34
Remove unused functions (calculateRetryBackoff, mock getBlock/setBlock),
simplify embedded field selectors flagged by staticcheck, and handle
unchecked Encode error.
No suppliers stake comet_bft endpoints for xrplevm, so the example
config should match production which only supports json_rpc and websocket.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 merged commit 2646ab1 into main Mar 12, 2026
17 of 30 checks passed
@oten91 oten91 deleted the staging branch March 12, 2026 11:55
@oten91 oten91 restored the staging branch March 12, 2026 11:55