Skip to content

RPC: Debug* streaming approach#11693

Open
svlachakis wants to merge 17 commits into
masterfrom
debug-streaming
Open

RPC: Debug* streaming approach#11693
svlachakis wants to merge 17 commits into
masterfrom
debug-streaming

Conversation

@svlachakis
Copy link
Copy Markdown
Contributor

@svlachakis svlachakis commented May 19, 2026

Fixes Closes Resolves #11718

Overview

Default-tracer struct-log traces are now streamed per-opcode through:

IStreamableResult → Utf8JsonWriter → PipeWriter

instead of being accumulated into a List<GethTxTraceEntry> and serialized at the end.

Peak memory is now O(1) entries regardless of trace length, and time-to-first-byte drops from “after trace completes” to “first opcode boundary”.

Changes

Scope

Streaming activates only when options.Tracer is null/empty (default struct-log tracer).

Custom tracers still use the existing buffered path:

  • callTracer
  • prestateTracer
  • 4byteTracer
  • JS tracers

Streamed Methods

  • debug_traceTransaction
  • debug_traceCall
  • debug_traceTransactionByBlockAndIndex
  • debug_traceTransactionByBlockhashAndIndex
  • debug_traceTransactionInBlockByHash
  • debug_traceTransactionInBlockByIndex
  • debug_traceBlock
  • debug_traceBlockByNumber
  • debug_traceBlockByHash
  • debug_traceCallMany
    • Simple path only
    • Overrides path remains buffered

Behavior Changes

JSON field order for the default struct-log tracer: streamed responses emit structLogs first, then gas / failed / returnValue. The non-streaming (custom-tracer) path and Geth's reference implementation emit gas / failed / returnValue / structLogs. The order is forced by the streaming design — the array must be opened before EVM execution begins so entries can be emitted per opcode. Clients that parse by field name are unaffected; byte-for-byte comparators against Geth (Hive conformance, vendor dashboards) will see a diff.

Pre-flight failures on debug_traceCall / debug_traceCallMany: previously returned a JSON-RPC error envelope; now return a result envelope with failed: true, error: "tracing failed: …", errorCode: -32000. On-chain trace methods (debug_traceTransaction*, debug_traceBlock*) are unaffected — their txs already passed validation at block-creation time.

Partial structLogs on mid-execution failure: failing traces may now contain the N opcode entries that executed before the failure, instead of an empty array. Inherent to streaming.

Follow-up — error/message architecture

Geth-compat parity on error semantics is scoped as separate work, not bundled here:

  1. Surface TransactionResult validation failures (nonce, intrinsic gas, EIP-3607, etc.) through to the tracer — fixes a pre-existing silent-failure gap that affects buffered too.
  2. Lazy-envelope architecture so pre-flight failures propagate as JSON-RPC error envelope instead of mid-stream failed: true.
  3. Per-scenario error-message mirroring of Geth's exact format strings ("insufficient funds for gas * price + value: address X have N want M", etc.).
  4. Differentiated errorCodes beyond the current InvalidInput / InternalError split.

These touch ITxTracer / TransactionProcessor and exception classes used by sequencer/simulate/MEV adapters.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Claude finished @svlachakis's task in 5m 57s —— View job


PR Review: RPC Debug* Streaming Approach

Tasks

  • Gather context and read changed files
  • Review new streaming tracer classes
  • Review GethStyleTracer changes
  • Review DebugRpcModule and DebugBridge changes
  • Review streaming result types
  • Check test coverage
  • Post findings with inline comments

Summary

The architecture is well-designed: GethLikeTxStreamingMemoryTracer correctly overrides AddTraceEntry to serialise per-opcode, GethLikeBlockEnvelopeStreamingTracer wraps each tx in its envelope, and the dual-mode result types (GethLikeTxTraceStreamingSingleResult inheriting GethLikeTxTrace, StreamingResultBase hierarchy) cleanly satisfy the existing return-type contracts while enabling IStreamableResult-aware callers to stream. The cancellation composition (CancellationTokenSource.CreateLinkedTokenSource) is correct and the CTS ownership/lifetime model is sound.

Two High severity issues need addressing before merge, plus several Medium findings.


Findings

High

1. Unclosed JSON array/object on cancellationGethLikeTxTraceStreamingBlockResult.EmitContent (line 55) and StructLogEnvelopeWriter.EmitTraceObject (line 40) both use catch (Exception ex) when (!cancellationToken.IsCancellationRequested). When the token IS cancelled and the trace throws (e.g. OperationCanceledException), the exception propagates without reaching writer.WriteEndArray(). The client receives a truncated JSON fragment (missing ]). HTTP clients that buffer before parsing (e.g. some proxies, test infrastructure) will see a parse error. Fix: wrap with finally { writer.WriteEndArray(); }. Inline comments posted on both sites.

2. debug_traceCall InsufficientBalanceException behavior change — the non-streaming path (line 127) catches InsufficientBalanceException before any response bytes are emitted and returns a proper JSON-RPC error object. On the streaming path the exception happens inside StructLogEnvelopeWriter, after {"structLogs":[ is already written to the wire, so the only possible response is {"structLogs":[],"failed":true,"error":"...","errorCode":-32000}. This is acknowledged in the PR description, but callers checking response.error at the JSON-RPC envelope level (rather than result.failed) will break silently. Investigation needed: if InsufficientBalanceException is always thrown before EVM execution starts, it could be pre-checked and surfaced correctly before committing to the streaming path.

Medium

3. Sync-over-async on EVM threadpipeWriter.FlushAsync(_cancellationToken).GetAwaiter().GetResult() in GethLikeTxStreamingMemoryTracer.FlushToWire (called every 256 opcodes) blocks EVM execution waiting on network I/O. Adversarial inputs (tight loops against slow clients) will stall execution repeatedly. The comment documents this, but the flush interval could be made configurable or defaulted higher (e.g. 4096) to reduce exposure. Same pattern in GethLikeBlockEnvelopeStreamingTracer.FlushPerTxEnvelope and GethLikeTxTraceStreamingBundleResult.FlushBetweenBundles.

4. Count = 0 / empty enumerator contract mismatchGethLikeTxTraceStreamingBlockResult and GethLikeTxTraceStreamingBundleResult implement IReadOnlyCollection<T> with Count = 0 and empty enumerators. Any caller that checks Count or calls ToList() will silently get nothing. The test infrastructure calls the JSON converter fallback (WriteAsJson), which is fine, but this is a subtle footgun. XML <remarks> explaining the invariant would prevent future misuse.

5. No streaming-specific tests — tests verify the final serialised JSON output via RpcTest.TestSerializedRequest, but there are no tests that:

  • assert struct-log entries are absent from the GethLikeTxTrace Entries list (proving O(1) heap)
  • verify streaming delivers bytes before the trace completes (TTFB regression)
  • test mid-stream timeout or cancellation and inspect the resulting fragment

Low

6. WriteRawLong duplicated — identical logic in GethLikeBlockEnvelopeStreamingTracer and StructLogEnvelopeWriter. Extract to a shared internal static helper. Inline comment posted.

7. GetBlockTrace outside try-catch in debug_traceBlockByNumber/debug_traceBlockByHash — pre-existing issue preserved by this PR: an ArgumentNullException thrown by GetBlockTrace itself won't be caught. Inline comment posted.


What's good

  • The StreamingResultBase / GethLikeTxTraceStreamingSingleResult dual-mode design is elegant: the same object works through both the IStreamableResult async path and the JSON-converter synchronous fallback with no duplication of the trace logic.
  • CTS ownership model (created by BuildStreamingResult, transferred to the result wrapper, disposed on completion) is correct.
  • SkipValidation: true on JsonWriterOptions is the right call — the writer is used mid-stream and the state machine would fire false positives otherwise.
  • TraceCallManyWithOverrides correctly isolates the overrides path to a SimulateTxExecutor so streaming doesn't need to handle state/block overrides.
  • Error surfacing (failed:true, error, errorCode) on mid-trace failures is a meaningful improvement over silently truncating the stream.

Comment thread src/Nethermind/Nethermind.JsonRpc/Modules/DebugModule/DebugRpcModule.cs Outdated
@svlachakis
Copy link
Copy Markdown
Contributor Author

@claude review again carefully, check possible functionality regressions.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Claude finished @svlachakis's task in 10m 41s —— View job


PR Review: RPC Debug* Streaming Approach — Regression Analysis

Tasks

  • Load rules files
  • Read all changed files carefully
  • Check for functionality regressions in debug_trace* methods
  • Check serialization correctness vs. non-streaming path
  • Check cancellation/timeout behavior
  • Check error propagation paths
  • Post findings

Summary

The core streaming architecture is sound, and several issues from the prior review (unclosed JSON arrays, finally fixes) have already been addressed. This pass focused specifically on functionality regressions — cases where the streaming path produces different observable behavior from the non-streaming path.

4 High findings and 2 Medium findings that need addressing before merge.


High Severity

1. JSON field order changed — breaks Geth-compatible tooling → inline

File: StructLogEnvelopeWriter.cs:33

Non-streaming path (GethLikeTxTraceConverter.Write) emits:

{"gas":…,"failed":…,"returnValue":"0x","structLogs":[…]}

Streaming path emits:

{"structLogs":[],"gas":…,"failed":…,"returnValue":"0x"}

Geth's reference implementation uses the first order. Any conformance tooling, Hive test, or vendor dashboard that compares traces byte-for-byte against Geth will fail. The PR ticks Optimization but not Breaking change — this is a silent format regression. The rewritten tests use field-by-field assertion so they cannot detect a future regression back. The field order is forced by the streaming design (the array must be opened before execution), but the breaking-change checkbox must be ticked and the change documented.

2. Block body not found silently returns [] instead of an error → inline

File: GethStyleTracer.cs:130

When blockTree.FindBlock() returns null (header present but body pruned), the non-streaming path catches the resulting ArgumentNullException in debug_traceBlockByNumber/debug_traceBlockByHash and returns a proper JSON-RPC error. In the streaming path, FindBlock is called inside the deferred _runBlockTrace lambda. The ArgumentNullException is caught by GethLikeTxTraceStreamingBlockResult.EmitContent's exception filter and swallowed — the client gets {"result":[]} (success, empty array) instead of an error. Pruned-node users silently receive a wrong response. Fix: call FindBlock eagerly in TraceBlockStreaming before deferring to the lambda.

3. Mid-block exception leaves inner JSON object unclosed → inline

File: GethLikeTxTraceStreamingBlockResult.cs:50

If the block processor throws after GethLikeBlockEnvelopeStreamingTracer.OnStart (which writes {"result":{"structLogs":[) but before OnEnd closes it, EmitContent's finally only writes ] for the outer array. The inner result object is never closed — the client receives invalid JSON. The single-tx path handles this correctly via StructLogEnvelopeWriter.EmitTraceObject's finally which always closes ] + calls WriteFooter. The block path lacks equivalent per-tx protection.

4. OperationCanceledException from timeout propagates out of WriteToAsync — truncated response instead of timeout error → inline

File: StreamingResultBase.cs:53

When the timeout fires, the catch filter when (!cancellationToken.IsCancellationRequested) evaluates to false and the OperationCanceledException propagates out of WriteToAsync. The HTTP response is already committed so the connection is aborted mid-stream. The non-streaming path returns {"error":{"code":-32017,"message":"Method execution timed out"}}. Clients on the streaming path see only a truncated response — no actionable error. A try/catch in WriteToAsync that swallows OperationCanceledException (and optionally logs it) would restore observable parity with the non-streaming behavior.


Medium Severity

5. Partial struct-log entries committed to wire before error detected → inline

File: GethLikeTxTraceStreamingBundleResult.cs:102

The non-streaming GetBundleTrace catches all tracing exceptions and returns a clean {failed:true, gas:X, returnValue:"0x", structLogs:[]}. The streaming path has already written opcodes 0..N-1 by the time exception at opcode N fires — those cannot be retracted. Clients receive {"structLogs":[…N partial entries…],"failed":true,"error":"…"}. Code that inspects structLogs.length > 0 to determine whether execution was attempted will give a different answer than before. No code fix is possible without buffering; this should be documented as an accepted behavior change.

6. errorCode always -32000 regardless of exception type → inline

File: StructLogEnvelopeWriter.cs:88

ErrorCodes.InvalidInput = -32000 is hardcoded for every streaming failure. Clients that branch on errorCode to distinguish timeouts (-32016), internal errors (-32603), or resource-not-found (-32000) from user errors cannot do so. At minimum consider passing the code explicitly to WriteFooter rather than hardcoding.


Previously flagged and resolved ✓

  • finally blocks now correctly close WriteEndArray() in both GethLikeTxTraceStreamingBlockResult.EmitContent and StructLogEnvelopeWriter.EmitTraceObject — the earlier High finding from the prior review is fixed.
  • CTS ownership/lifetime model is correct.
  • SkipValidation: true on JsonWriterOptions is correct.

Comment thread src/Nethermind/Nethermind.Consensus/Tracing/GethStyleTracer.cs Outdated
Comment thread src/Nethermind/Nethermind.JsonRpc/Modules/DebugModule/StreamingResultBase.cs Outdated
@svlachakis svlachakis changed the title RPC: Debug* streaming approach [WIP] - RPC: Debug* streaming approach May 19, 2026
@svlachakis
Copy link
Copy Markdown
Contributor Author

@claude review again.

Follow-up:

Errors and messages need to be architected properly as separate work - surfacing TransactionResult validation failures, lazy-envelope for pre-flight error parity with Geth, per-scenario Geth-format messages, and differentiated errorCodes. Touches ITxTracer / TransactionProcessor and exception classes used by sequencer/simulate/MEV adapters; scoped to its own ticket.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 19, 2026

Claude finished @svlachakis's task in 5m 15s —— View job


PR Review: RPC Debug* Streaming Approach — Third Pass

Tasks

  • Read rules files and CLAUDE.md
  • Read all changed files carefully
  • Re-evaluate prior High/Medium findings in light of deferred error-architecture ticket
  • Check for remaining correctness/regression issues
  • Check serialization correctness
  • Check resource management and cancellation
  • Post findings with inline comments

Note: Author has explicitly scoped error/message architecture (errorCode differentiation, TransactionResult surfacing, lazy-envelope parity with Geth, differentiated errorCodes) to a separate follow-up ticket. Those findings from the prior review are not re-raised here.


What has been fixed since the previous passes ✓

  • Finally blocks now correctly close JSON arrays on cancellationStructLogEnvelopeWriter.EmitTraceObject and GethLikeTxTraceStreamingBlockResult.EmitContent both have finally { writer.WriteEndArray(); }.
  • OperationCanceledException is caught in WriteToAsync — Both StreamingResultBase and GethLikeTxTraceStreamingSingleResult catch OperationCanceledException when (combinedToken.IsCancellationRequested) and swallow it, preventing the HTTP layer from seeing an unhandled exception.
  • Block-not-found now returns a proper error for streaming debug_traceBlockByNumber/debug_traceBlockByHashDebugRpcModule now eagerly calls blockFinder.FindBlock(...) before entering the streaming path and returns ResourceNotFound if the block body is missing.
  • Mid-block exception closes the inner envelope correctlyGethLikeBlockEnvelopeStreamingTracer implements IDisposable; TraceBlockWithFactory calls tracer.TryDispose() on exception, which writes the closing tokens for any in-progress tx envelope.
  • ResolveErrorCode differentiates exception typesInsufficientBalanceException / InvalidBlockExceptionInvalidInput; everything else → InternalError.
  • New streaming-specific testsGethLikeTxTraceStreamingSingleResult_WhenCancelledMidTrace_ClosesJsonEnvelope and WhenTraceThrows_WritesErrorAndErrorCode cover the two most important streaming failure modes.

Remaining Findings

High

1. JSON field order is a silent breaking change — PR checkbox not ticked → inline

The streaming path emits:

{"structLogs":[], "gas":…, "failed":…, "returnValue":"0x"}

The non-streaming path (and Geth's reference format) emits:

{"gas":…, "failed":…, "returnValue":"0x", "structLogs":[…]}

The field order is forced by the streaming design (structLogs must be opened before the EVM starts), so it cannot be fixed without buffering. The new test pins the new behavior explicitly. However, the PR type-of-change checklist still does not tick "Breaking change", and the "Changes" section does not document this format shift. Any tooling that compares traces byte-for-byte against Geth (Hive conformance, vendor dashboards) will see failures. At minimum: tick the checkbox and add a sentence in the PR description.

Medium

2. GethLikeBlockEnvelopeStreamingTracer.Dispose() omits txHash in the error-close path → inline

OnEnd (normal path) writes txHash after the result object. Dispose() (exception path) does not. When a mid-block exception fires after OnStart and before OnEnd, the per-tx envelope is {"result":{"structLogs":[],"gas":0,"failed":true,"returnValue":"0x"}} — missing txHash. The JSON is valid but the envelope shape is inconsistent with every other entry in the array. Callers that correlate results by txHash will break.

Fix: store tx?.Hash in OnStart as a field (e.g. _currentTxHash) and write it in Dispose() before the outer WriteEndObject().

3. TraceBlockStreaming(BlockParameter) / IDebugBridge.GetBlockTraceStreaming(BlockParameter) still has the silent-empty-array bug → inline

DebugRpcModule no longer calls this overload for streaming (it resolves to Block first). But the interface method is still public. If called directly, a pruned block returns nullArgumentNullException inside the lambda → caught by the non-cancellation filter → swallowed as []. Future callers would hit this silently.

Simplest fix: add an eager null-guard in GethStyleTracer.TraceBlockStreaming(BlockParameter) to throw before the lambda is deferred.

4. No streaming tests for block-trace error paths

The new tests cover single-tx cancellation and mid-trace throws. Missing:

  • Block trace with a mid-block exception (verifying outer array is closed and txHash is present on all entries including the failed one)
  • Block trace cancellation mid-block
  • Streaming vs. non-streaming produce the same tx-count for a complete trace (regression guard)

Low

5. IDebugBridge.GetBlockTraceStreaming(BlockParameter) is unused from DebugRpcModule

Since the streaming path always resolves to a Block first, the BlockParameter overload is dead code from the RPC module's perspective. Consider removing it from the interface to prevent misuse, or leave it with a prominent <remarks> warning.


Summary

Second pass raised 6 High/Medium issues; 5 are now resolved. One High remains (breaking-change checkbox + documentation for field order). Two Mediums remain (missing txHash in Dispose(), TraceBlockStreaming(BlockParameter) footgun). The deferred error-architecture work is explicitly acknowledged. Inline comments posted on all three outstanding issues.

Recommended before merge:

  1. Tick "Breaking change" and document the structLogs-first field order in the PR body.
  2. Add txHash to GethLikeBlockEnvelopeStreamingTracer.Dispose().
  3. Add null guard in GethStyleTracer.TraceBlockStreaming(BlockParameter).

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @svlachakis's task in 9m 31s —— View job


PR Review: RPC Debug* Streaming — Sixth Pass (Geth Parity Error Messages)

Tasks

  • Read rules files
  • Read all changed files (StructLogEnvelopeWriter, GethLikeTxTraceStreamingBlockResult, tests)
  • Trace exception hierarchy (InvalidTransactionException vs InsufficientBalanceException)
  • Verify Geth-compat message mappings are correct
  • Check all prior findings
  • Post findings

Status of All Prior Findings

All 17 previously raised High/Medium findings remain resolved. The two Low nits from pass 5 (using on test disposables, dead InvalidInput assignment) are addressed below.


Geth Parity Error Messages — Analysis

The new FormatErrorMessage / FormatErrorDescription / ReplacePrefix architecture in StructLogEnvelopeWriter.cs is correct. Here is the verified reasoning:

For debug_traceCall (streaming path), the execution goes through TraceTransactionProcessorAdaptertransactionProcessor.Trace()ExecutionOptions.SkipValidationAndCommit. Despite SkipValidation being set, the BuyGas check at TransactionProcessor.cs:781 is unconditional:

if (UInt256.SubtractUnderflow(in senderBalance, in tx.ValueRef, out UInt256 balanceLeft))
    return InsufficientFundsForTransfer(tx, senderBalance); // returns TransactionResult

This returns TransactionResult.InsufficientSenderBalanceBlockValidationTransactionsExecutor.ThrowInvalidTransactionException wraps it → InvalidTransactionException propagates into StructLogEnvelopeWriter.EmitTraceObject → caught → FormatErrorMessage(InvalidTransactionException)FormatErrorDescription(result)ReplacePrefix("insufficient sender balance for transfer...", ..., "insufficient funds for transfer...") → Geth-compatible. All three test cases in TraceCallGethCompatFailureCases should pass. ✓

ResolveErrorCode is correct: InvalidTransactionException : InvalidBlockException so the inner-exception loop hits the is ... or InvalidBlockException branch and returns ErrorCodes.InvalidInput. ✓


New Findings

Low

1. EmitFailedTrace error message format inconsistency — bundle pre-flight failures write a raw validationError string (e.g. "Gas cap exceeded"), while EmitTraceObject's failure path always adds "tracing failed: " prefix. Clients parsing error from debug_traceCallMany responses will see different shapes depending on whether the failure was pre-trace or mid-trace.

2. BlockGasLimitExceeded unmappedFormatErrorDescription's _ => detail fallback emits "Block gas limit exceeded" (capital B, different phrasing). Geth emits "exceeds block gas limit". A one-line entry in the switch would fix it; unlike the other deferred error-arch items this requires no architecture change.

3. Two tests still missing using on timeoutCts/result — pass-5 Low finding, not yet addressed. The GethLikeTxTraceStreamingBlockResult_WhenTraceThrows_EmitsSentinelErrorEntry and GethLikeTxTraceStreamingSingleResult_WhenCancelledMidTrace_ClosesJsonEnvelope tests create disposable objects without using.


What's Good

  • InvalidTransactionException + TransactionResult decomposition is well-designed. The FormatErrorDescription switch covers the six Geth-significant error types; all others sensibly fall through to the raw ErrorDescription which often matches Geth's format already (GasLimitBelowIntrinsicGas, TransactionNonceTooHigh/Low).
  • ReplacePrefix is defensive: if the ErrorDescription is extended with context (address, amounts, etc.), only the prefix is replaced and the rest is preserved — producing the full Geth format "insufficient funds for transfer: address 0xXXX have Y want Z".
  • ResolveErrorCode ArgumentNullException.ThrowIfNull replaces the former dead InvalidInput default — correct.
  • Test coverage for the three main Geth-compat failure scenarios is new and directly exercises the expected prefixes and error codes.
  • debug_traceCallMany_fails_when_not_enough_balance test uses Contain("insufficient funds") which correctly exercises the prefix translation path.

Summary

All High and Medium issues from all five prior passes remain resolved. Three Low nits remain (error prefix inconsistency in EmitFailedTrace, unmapped BlockGasLimitExceeded wording, two test using leaks). The PR is mergeable.

@svlachakis
Copy link
Copy Markdown
Contributor Author

@claude do one last pass, check streaming design, any possible race conditions, memory leaks, or lower possible allocation paths.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @svlachakis's task in 9m 9s —— View job


PR Review: RPC Debug* Streaming — Final Pass (Race Conditions, Memory Leaks, Allocations)

Tasks

  • Post initial todo list
  • Read rules files
  • Read all new streaming files
  • Check for race conditions in streaming path
  • Check for memory leaks (CTS, IDisposable, event handlers)
  • Analyze allocation paths and identify reduction opportunities
  • Review streaming design correctness
  • Post findings

Race Conditions

None introduced by this PR. All new streaming code runs serially within WriteToAsync — there is no concurrent access to any shared mutable state that wasn't already present in the non-streaming path.

The one pre-existing race: ChangeableTransactionProcessorAdapter.CurrentAdapter is swapped without synchronisation in TraceStreaming(BlockParameter, Transaction, ...) (GethStyleTracer.cs:260-270), which is identical to the non-streaming Trace(BlockParameter, Transaction, ...). Concurrent debug_traceCall requests would interfere with each other. Pre-existing; this PR does not introduce or worsen it.

The deferred-execution model (_runTrace lambda called from WriteToAsync) doesn't introduce a new race: BuildStreamingResult captures references (immutable GethTraceOptions, BlockParameter, IDebugBridge) and constructs the result object synchronously. By the time WriteToAsync runs on the HTTP thread, those captures are finalized.


Resource Leaks

None found. The CTS ownership chain is sound:

  • BuildStreamingResult creates timeoutCts, passes to GethLikeTxTraceStreamingSingleResult constructor, then transfers ownership via : base(timeoutCts) → stored as GethLikeTxTrace._disposable
  • JsonRpcSuccessResponse.Dispose()Result.TryDispose()GethLikeTxTrace.Dispose()timeoutCts.Dispose()
  • This fires via using (result) in Startup.cs:343, which wraps the entire response write — WriteToAsync is awaited at line 394, so the CTS is always disposed after WriteToAsync completes

GethLikeBlockEnvelopeStreamingTracer.Dispose() is correctly reachable on both paths:

  • Normal completion: Dispose is not needed (_innerEnvelopeOpen = false after OnEnd)
  • Exception in TraceBlockWithFactory: catch { tracer.TryDispose(); throw; }object.TryDispose sees IDisposable and calls Dispose(), which writes the closing JSON for any in-progress tx envelope

Memory / Allocation Findings

Medium

1. Block-trace streaming is O(N_transactions) in result-object memory, not O(1)

GethLikeBlockEnvelopeStreamingTracer inherits BlockTracerBase<GethLikeTxTrace, ...>. BlockTracerBase.EndTxTrace() calls AddTrace(OnEnd(CurrentTxTracer)), which appends the GethLikeTxTrace returned by OnEnd into TxTraces. For a 300-tx block, 300 GethLikeTxTrace objects accumulate in TxTraces for the entire duration of block processing.

TraceBlockStreamingImpl discards the return value of TraceBlockWithFactory, so the GethLikeTxTraceCollection wrapping those 300 objects is immediately eligible for GC after the block completes — but the 300 trace objects are all live simultaneously until that point.

Each GethLikeTxTrace carries an allocated-but-empty List<GethTxTraceEntry> (see below), a Stack<Dictionary<string, string>> StoragesByDepth, plus the Gas/Failed/ReturnValue/TxHash fields. For a 300-tx block that is ~45 KB of trace scaffolding that cannot be freed until the last transaction completes.

The PR's O(1) claim is accurate for struct-log entries — entries are correct. Block-level result objects remain O(N_tx). Fix: override AddTrace in GethLikeBlockEnvelopeStreamingTracer to discard the trace once OnEnd has written the envelope bytes: Fix this →

Low

2. GethLikeTxTrace.Entries = new() allocated but never populated in streaming path

GethLikeTxTrace.cs field initializer allocates new List<GethTxTraceEntry>() for every trace object. In the streaming path GethLikeTxStreamingMemoryTracer.AddTraceEntry serialises directly to the Utf8JsonWriter without calling base.AddTraceEntry, so Entries is always empty. For block traces this is one empty List<> per tx. Could be lazy-initialized (= null! with ??= on write) or a factory overridden on the streaming tracer to return a subclass with no list: Fix this →

3. new byte[32] in GethLikeTxMemoryTracer.SetOperationStorage — can be stackalloc

byte[] bigEndian = new byte[32];
storageIndex.ToBigEndian(bigEndian);

bigEndian is only used locally (for the ToHexString call). This is a heap allocation on every storage-modifying opcode. Replace with Span<byte> bigEndian = stackalloc byte[32]. Pre-existing issue, but in the streaming path (where entry allocations are otherwise short-lived) this is the next notable per-opcode heap hit: Fix this →

4. new Dictionary<string, string>(previousTraceEntry.Storage) per-opcode copy

In GethLikeTxMemoryTracer.StartOperation, at every opcode at the same call depth, the entire previous storage dictionary is deep-copied into CurrentTraceEntry.Storage. In the buffered path, the previous entry and its storage live until the trace completes. In the streaming path, the previous entry is serialized and de-referenced immediately, but the copy is still made unconditionally before de-referencing. Architecturally, once the previous entry has been serialized (in the streaming override of AddTraceEntry), its storage dictionary could be moved into the new entry rather than copied, eliminating the allocation. This would require the streaming tracer to know it's safe to destructively mutate the previous entry's storage, which requires a design change. Pre-existing issue; calling it out as the dominant remaining per-opcode allocation in the streaming path.


Streaming Design

5. SkipValidation = true removes all JSON writer state checks across four cooperating classes

StreamingWriterOptions.SkipValidation = true is the correct setting for mid-stream writers. However, the four classes that write to the shared Utf8JsonWriter (GethLikeBlockEnvelopeStreamingTracer, GethLikeTxStreamingMemoryTracer, StructLogEnvelopeWriter, GethLikeTxTraceStreamingBlockResult) must maintain consistent open/close nesting. Any future regression — a missed WriteEndObject, an extra WriteStartArray — will produce silently malformed JSON with no exception or assertion to catch it. Consider adding an integration test that round-trips a known block trace and parses the full JSON output end-to-end (not just field-by-field assertions), so that nesting errors produce a parse failure rather than a silent wrong answer.


All Prior Findings

All High and Medium findings from passes 1–6 remain resolved. The two previously-open Low nits (test using wrappers, dead InvalidInput expression) are not re-raised.


Summary

No race conditions or resource leaks introduced. The dominant actionable finding is that block-trace streaming still holds O(N_tx) GethLikeTxTrace result objects during processing (#1 above); overriding AddTrace to discard after OnEnd serialises the envelope would achieve true O(1) for block traces matching the O(1) already achieved for single-tx traces. Items #2–4 are straightforward allocation micro-optimisations.

@svlachakis
Copy link
Copy Markdown
Contributor Author

@claude review again

1 - fixed
2 - not fixed in this PR requires modifiying the base gethliketxtrace - out of scope
3 - not fixed, pre existing in the buffered tracer path worth maybe a separate cleanup
4 - same pre existing
5 - already covered debug_traceblock tests in debugrpcmoduletests.traceblock use jsondocument parse.rootelement so any nesting regression throws at parse time

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @svlachakis's task in 5m 55s —— View job


PR Review: RPC Debug* Streaming — Eighth Pass

Tasks


Status of All Pass-7 Findings

# Finding Status
1 O(N_tx) block trace accumulation FIXEDGethLikeBlockEnvelopeStreamingTracer:78 has protected override void AddTrace(GethLikeTxTrace trace) { }. Per-tx GethLikeTxTrace objects are no longer accumulated in TxTraces; each is discarded immediately after OnEnd serializes the envelope.
2 GethLikeTxTrace.Entries = new() allocated but never populated ACKNOWLEDGED — requires modifying the shared base class; out of scope for this PR. The allocation is O(1) per tx and is now promptly GC-eligible after OnEnd (since AddTrace is a no-op).
3 new byte[32] in GethLikeTxMemoryTracer.SetOperationStorage ACKNOWLEDGED — pre-existing in the buffered tracer path (GethLikeTxMemoryTracer), shared by both streaming and non-streaming. Not introduced by this PR.
4 Per-opcode Dictionary deep copy ACKNOWLEDGED — same as above, pre-existing in GethLikeTxMemoryTracer.StartOperation.
5 SkipValidation = true nesting coverage ACKNOWLEDGEDGethLikeTxTraceStreamingBlockResult_WhenCancelledMidBlock and WhenTraceThrows both call JToken.Parse (via Newtonsoft) and JArray.Parse on the raw streamed bytes, which throw on any structural malformation. Adequate coverage.

Status of Pass-6 Low Nits

Nit Status
BlockGasLimitExceeded unmapped wording FIXEDFormatErrorDescription now has ErrorType.BlockGasLimitExceeded => ReplacePrefix(detail, "Block gas limit exceeded", "exceeds block gas limit")
EmitFailedTrace missing "tracing failed: " prefix FIXEDEmitFailedTrace now applies the prefix internally via string? formattedMessage = errorMessage is null ? null : $"tracing failed: {errorMessage}"
Test using wrappers on timeoutCts/result FIXED — all result objects (GethLikeTxTraceStreamingSingleResult, GethLikeTxTraceStreamingBlockResult) have using. timeoutCts without explicit using is safe because ownership transfers to result via StreamingResultBase._timeoutCts / GethLikeTxTrace._disposable, both disposed on result.Dispose().

New Findings (this pass)

None. After reviewing the complete set of changed files:

  • FormatErrorDescription + ResolveErrorCode correctness verified: InvalidTransactionException inherits InvalidBlockException, so the is ... or InvalidBlockException guard in ResolveErrorCode correctly returns ErrorCodes.InvalidInput for all transaction-validation failures including IntrinsicGasTooLow. FormatErrorMessage correctly dispatches InvalidTransactionException through FormatErrorDescription. The three new TraceCallGethCompatFailureCases test cases are well-formed and will exercise the prefix-translation path end-to-end.
  • Bundle result EmitContent non-cancellation exception path: EmitTraceForTx delegates to StructLogEnvelopeWriter.EmitTraceObject, which catches all non-cancellation exceptions internally and writes failed:true. The only surface where a non-cancellation exception could escape EmitBundle is FlushBetweenBundles (IO error from FlushAsync). GethLikeTxTraceStreamingBlockResult has an explicit catch for this case; the bundle result does not. This is an edge case (IO error) and the asymmetry is justified — _runBlockTrace has a much broader throw surface than FlushBetweenBundles.
  • debug_traceCall log message uses nameof(debug_traceTransaction) (line 138): pre-existing on master; not introduced by this PR.
  • CTS ownership and disposal chain re-verified correct across all four streaming result types.
  • Exception propagation paths (cancellation, non-cancellation, mid-block, mid-bundle) all correctly produce closed JSON structures.

What's Good

  • The O(1) claim for struct-log entries holds end-to-end (streaming path): GethLikeTxStreamingMemoryTracer.AddTraceEntry serializes directly to the Utf8JsonWriter and the GethLikeTxTrace object is discarded after each tx.
  • The O(1) claim now also holds for block-level result objects: AddTrace no-op prevents accumulation.
  • Error-message Geth parity (InsufficientFunds, IntrinsicGasTooLow, InsufficientFundsForGas*Price+Value, SenderNotEOA, NonceAtMaxValue, MaxPriorityFeeHigherThanMaxFee, MaxInitCodeSizeExceeded, ExceedsBlockGasLimit) is correctly handled by the FormatErrorDescription switch.
  • ResolveErrorCode exception hierarchy is correct: InvalidTransactionException : InvalidBlockException means all transaction validation errors map to ErrorCodes.InvalidInput; all others map to ErrorCodes.InternalError.
  • GethLikeBlockEnvelopeStreamingTracer.Dispose() correctly seals in-flight tx envelopes on mid-block exceptions, including txHash.
  • Sentinel error entry ({"error": "...", "errorCode": N}) in GethLikeTxTraceStreamingBlockResult.EmitContent.finally gives clients an in-band failure signal without buffering.
  • Five streaming-specific tests cover all four major failure paths (single cancel, single throw, block cancel, block throw) and the txHash on dispose invariant.

Summary

All High and Medium findings from all seven prior passes are resolved. All Low nits from passes 5–7 are fixed or reasonably acknowledged. No new issues found in this pass. The PR is mergeable.

@svlachakis svlachakis changed the title [WIP] - RPC: Debug* streaming approach RPC: Debug* streaming approach May 21, 2026
@svlachakis svlachakis marked this pull request as ready for review May 21, 2026 23:35
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @svlachakis's task in 7m 35s —— View job


PR Review: RPC Debug* Streaming — Ninth Pass (ready_for_review)

Tasks

  • Load tools
  • Read rules files
  • Read all changed files in current state
  • Verify all prior findings are resolved
  • Check for new issues
  • Post findings

Status of All Prior Findings

All 17+ High and Medium findings from passes 1–8 remain resolved. The three Low nits raised in pass 8 (test using wrappers, EmitFailedTrace prefix inconsistency, BlockGasLimitExceeded wording) have been verified fixed in the current code:

  • EmitFailedTrace now wraps the error string with $"tracing failed: {errorMessage}"
  • FormatErrorDescription has the BlockGasLimitExceeded => "exceeds block gas limit" case ✓
  • All result objects in tests are wrapped in using; timeoutCts ownership transfers to result and is disposed transitively ✓

Architecture Correctness — Verified

  • Cancellation path: StructLogEnvelopeWriter.EmitTraceObject.finally always closes structLogs array and writes the footer; WriteToAsync swallows OperationCanceledException. Client receives structurally valid JSON. ✓
  • Mid-trace throw (single-tx): Same finally path — failed:true + error + differentiated errorCode. ✓
  • Mid-block throw: GethLikeBlockEnvelopeStreamingTracer.Dispose() seals in-flight per-tx envelope including txHash; sentinel error entry appended by GethLikeTxTraceStreamingBlockResult.EmitContent.finally. ✓
  • Block-not-found: Eager null checks in both DebugRpcModule and GethStyleTracer.TraceBlockStreaming(BlockParameter) surface errors before any bytes are written. ✓
  • Flush/cancellation interaction: When FlushPerTxEnvelope throws OperationCanceledException, _innerEnvelopeOpen is already false at that point (set in OnEnd before the flush call), so Dispose() correctly returns early. The exception propagates to the block-result's catch filter, which sees cancellationToken.IsCancellationRequested = true and lets it pass to StreamingResultBase.WriteToAsync's outer catch. ✓
  • O(1) block streaming: GethLikeBlockEnvelopeStreamingTracer.AddTrace is a no-op (line 78), so no GethLikeTxTrace objects accumulate during block processing. ✓
  • CTS ownership chain: BuildStreamingResult/BuildStreamingBlockResult create the CTS with a defensive catch { timeoutCts.Dispose(); throw; }, then transfer ownership to the result wrapper whose Dispose (via using in Startup.cs) releases it after WriteToAsync completes. ✓

New Findings (Low only)

1. TraceStreaming(Hash256 txHash) returning null produces failed:true without an error field — diverges from non-streaming path

GethStyleTracer.TraceStreaming(Hash256 txHash, ...) returns null if FindBlockHash returns null or FindBlock returns null. In the streaming path the lambda passes this null back to StructLogEnvelopeWriter.EmitTraceObject, where WriteFooter(writer, trace: null, errorMessage: null, ...) emits {"structLogs":[],"gas":0,"failed":true,"returnValue":"0x"} — no error field and no errorCode.

The non-streaming path in debug_traceTransaction checks if (transactionTrace is null) and returns ResourceNotFound. debug_traceTransaction does validate blockHash and the block header before entering the streaming path, so this null-return scenario requires the block body to disappear between the header check and the trace call (race on a pruning node). The scenario is rare but the behavior difference is observable to conformance tooling. No code change needed for merge; a <remarks> on IDebugBridge.GetTransactionTraceStreaming noting the null-return semantic would future-proof the contract.


What's Good

  • Geth parity error messages: FormatErrorDescription switch handles all six significant error types with correct prefix transformations. ReplacePrefix is defensive — preserves the suffix (address, amounts) so Geth's full message format ("insufficient funds for transfer: address 0x… have N want M") is reproduced verbatim when ErrorDescription contains it.
  • ResolveErrorCode hierarchy: InvalidTransactionException : InvalidBlockException ensures all tx-validation errors correctly resolve to ErrorCodes.InvalidInput; unexpected failures go to InternalError.
  • SkipValidation = true: Correct on the shared Utf8JsonWriter — four cooperating classes write to it; a stateful validator would fire false positives.
  • GethLikeBlockEnvelopeStreamingTracer.Dispose(): Writes txHash from _currentTxHash (stored in OnStart) so every entry in a block-trace array has the field regardless of whether the tx completed or was interrupted.
  • Sentinel error entry: {"error":"…","errorCode":N} appended by EmitContent.finally gives clients in-band failure detection without buffering.
  • Test coverage: 5 targeted streaming tests pin all major failure paths (single cancel, single throw, block cancel, block throw, Dispose-mid-tx). The TraceCallGethCompatFailureCases parameterized tests exercise the three Geth-prefix-translation cases end-to-end.
  • Breaking-change documented: [x] Breaking change is ticked; "Behavior Changes" section explicitly calls out structLogs-first field order, pre-flight failure envelope change, and partial structLogs on mid-execution failure.

Summary

All High and Medium findings from all eight prior passes are resolved. This ninth pass finds only one Low-severity observation (null trace produces failed:true without error field on an edge case race, diverging from the non-streaming path's ResourceNotFound). The PR is mergeable.

@svlachakis
Copy link
Copy Markdown
Contributor Author

@claude review again

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @svlachakis's task in 3m 39s —— View job


PR Review: RPC Debug* Streaming — Tenth Pass

Tasks

  • Load tools
  • Read rules files
  • Read all changed files in current state
  • Verify all prior findings remain resolved
  • Check for new issues
  • Post findings

Status of All Prior Findings

All 17+ High and Medium findings from passes 1–9 remain resolved. The three Low nits from pass 9 are confirmed fixed:

Pass-9 Nit Status
BlockGasLimitExceeded wording ("Block gas limit exceeded""exceeds block gas limit") ✅ Fixed — FormatErrorDescription line 103
EmitFailedTrace missing "tracing failed: " prefix ✅ Fixed — $"tracing failed: {errorMessage}" at line 70
Test using wrappers on disposable timeoutCts / result ✅ Transitively correct — all four test instances either use using directly or transfer ownership to a using-wrapped result that disposes the CTS

Key Correctness Paths — Verified ✓

  • Cancellation path: StructLogEnvelopeWriter.EmitTraceObject's finally always closes the structLogs array and writes the footer; WriteToAsync swallows OperationCanceledException. Client receives structurally valid JSON.
  • Mid-trace throw (single-tx): Same finally path — failed:true + error + differentiated errorCode.
  • Mid-block throw: GethLikeBlockEnvelopeStreamingTracer.Dispose() seals the in-flight per-tx envelope including txHash from _currentTxHash; sentinel {"error":"…","errorCode":N} entry appended by GethLikeTxTraceStreamingBlockResult.EmitContent.finally.
  • Block-not-found: Eager null checks in DebugRpcModule (lines 358-362, 399-402) and GethStyleTracer.TraceBlockStreaming(BlockParameter) (line 131-133) surface errors before any bytes are written.
  • O(1) block streaming: GethLikeBlockEnvelopeStreamingTracer.AddTrace is a no-op (line 78), preventing per-tx result accumulation in TxTraces.
  • errorCode differentiation: ResolveErrorCode correctly returns InvalidInput for InsufficientBalanceException/InvalidBlockException and InternalError for everything else; ArgumentNullException.ThrowIfNull(failure) guards the null case.
  • Geth parity messages: FormatErrorDescription switch covers all 7 Geth-significant error types; ReplacePrefix is defensive and preserves suffix context (address, amounts).
  • CTS ownership chain: Created in BuildStreamingResult/BuildStreamingBlockResult, transferred to result wrapper, disposed via using in Startup.cs after WriteToAsync completes.

New Findings

None. No new High, Medium, or Low findings in this pass.


What's Good

The PR is clean and correct across all failure modes. The architecture delivers on its stated goals:

  • O(1) peak memory for struct-log entries (serialised per-opcode, never accumulated)
  • O(1) block-level objects (per-tx GethLikeTxTrace discarded immediately after OnEnd serialises the envelope)
  • Time-to-first-byte improvement (first opcode entries reach the client before the trace completes)
  • All error paths produce closed, valid JSON regardless of how or when failure occurs
  • Breaking changes properly documented: [x] Breaking change ticked, "Behavior Changes" section explicitly calls out structLogs-first field order, pre-flight failure envelope change, and partial structLogs on mid-execution failure
  • Comprehensive tests: 5 streaming-specific tests cover single-tx cancel, single-tx throw, block cancel, block throw, and dispose-mid-tx; TraceCallGethCompatFailureCases parameterized tests exercise the three main Geth prefix-translation paths; debug_traceBlock integration tests parse the full JSON output end-to-end

Summary

All High and Medium findings from all nine prior passes are resolved. No new issues found. The PR is mergeable.

Copy link
Copy Markdown
Member

@LukaszRozmej LukaszRozmej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Any benchmarks? (Including memory pressure and node total memory!)
  2. Do the same for trace* endpoints
  3. To some extent this should be configurable, not sure if on the node, or maybe in the Options on the call itself?
  4. A bit too much delegates flying around? Maybe better to have proper types.

Comment thread src/Nethermind/Nethermind.Consensus/Tracing/GethStyleTracer.cs Outdated
Comment thread src/Nethermind/Nethermind.Consensus/Tracing/IGethStyleTracer.cs Outdated
Comment thread src/Nethermind/Nethermind.JsonRpc.Test/Modules/DebugModuleTests.cs Outdated
Comment thread src/Nethermind/Nethermind.JsonRpc/Modules/DebugModule/DebugBridge.cs Outdated
@LukaszRozmej
Copy link
Copy Markdown
Member

As for benchmark I would love to see real-life benchmark on real node, not only synthetic one

@svlachakis
Copy link
Copy Markdown
Contributor Author

@LukaszRozmej sure will do it and share results before merging

@svlachakis svlachakis mentioned this pull request May 22, 2026
16 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming on debug* endpoints

2 participants