Fix #2804: bound tool result size to prevent SDK buffer-overflow crashes#2810
Fix #2804: bound tool result size to prevent SDK buffer-overflow crashes#2810jwbron wants to merge 5 commits into
Conversation
The Claude Agent SDK message-reader has a 1 MB JSON buffer cap; tool results that exceed it kill the agent with exit 255 and the consensus wrapper wastes its full restart budget retrying a deterministic failure. Fix layers three defenses so the failure cannot reach the restart loop in the first place — and when it does, it stops the wrapper immediately instead of burning retries. 1. CLI-side PostToolUse hook (sandbox/hooks/posttooluse_truncate.py) suppresses tool results above EGG_TOOL_RESULT_CAP_BYTES (default 200 KB) by returning a decision=block with truncation guidance. Runs inside the CLI subprocess so it intercepts before the result crosses the SDK JSON channel. Mutating tools (Edit/Write/MultiEdit/ NotebookEdit) get a reason that warns against retrying the call. 2. ClaudeAgentOptions.max_buffer_size bumped to 4 MB (env-overridable via EGG_AGENT_MAX_BUFFER_SIZE) as belt-and-suspenders. 3. consensus_wrapper greps captured agent output for the SDK overflow marker and exits immediately without consuming the restart budget, surfacing the failure as buffer-overflow rather than a generic AGENT_FAILED. Verified against the issue-2777 16:30:47 UTC crash signature.
There was a problem hiding this comment.
Review
The defense-in-depth motivation is sound, and layers (2) buffer bump and (3) wrapper detection are clear wins. However, the claim that layer (1) — the PostToolUse hook — actually prevents the SDK buffer overflow does not appear to be supported by the Claude Agent SDK contract. The end-to-end test for the central claim is unchecked in the PR description, and the unit tests only verify the hook's own JSON output, not whether the CLI suppresses the oversized payload on the wire.
Blocking
1. PostToolUse decision: block does not suppress the tool_response payload — primary defense is likely a no-op
sandbox/hooks/posttooluse_truncate.py:131-138
The hook returns {"decision": "block", "reason": "..."}. The PR's central claim is:
The hook runs inside the CLI subprocess so it intercepts before the oversized result crosses the SDK JSON channel.
The Claude Agent SDK documentation and types contradict this:
claude_agent_sdk/types.py:379-384—PostToolUseHookSpecificOutputexposes onlyadditionalContext(append) andupdatedMCPToolOutput(MCP-only). There is no field for suppressing or replacing the originaltool_responsefor built-in tools likeEdit/Read/Bash.- Official docs (code.claude.com/docs/en/hooks) state explicitly: "the tool has already run successfully, so this cannot prevent the action but can provide feedback for future actions."
decision: block"automatically prompts Claude with the reason" — it does not redact what the model sees. - For PostToolUse the table shows "Can block? No" — the tool already ran, and the payload still flows to the conversation.
If this reading is correct, then for the exact incident the PR cites — an Edit on phases.py whose tool_response exceeded the buffer — the hook will fire, return decision: block, and the oversized tool_response will still be serialized into a tool_result message on the CLI→SDK JSON stream, which is where the 1 MB buffer trip happens. The hook then becomes pure telemetry: it adds a reason the model never sees (the turn dies before the next inference), and the agent crashes anyway.
What this means for the PR:
- The "200 KB cap is the real ceiling" rationale collapses — the real ceiling is whatever
max_buffer_sizeresolves to (4 MB after this PR, env-overridable). Anything between 200 KB and 4 MB is now wasted retry-prevention; anything > 4 MB still crashes; the cited 1 MB → 4 MB headroom is the actual protection. - The "Why CLI-side hook, not programmatic SDK hook" rationale ("intercepts before the oversized result crosses the SDK JSON channel") is the load-bearing claim and appears to be false.
- The PR's own test plan has this exact verification step unchecked: "confirm... oversized tool results return a block-with-reason payload to the agent instead of crashing." That is the experiment that decides whether layer (1) works.
If the goal is to actually prevent the payload from crossing the channel, the standard SDK pattern is a PreToolUse hook returning permissionDecision: "deny" (or updatedInput to narrow the call) — which fires before the tool runs and can prevent the result from existing at all. For Read/Bash/Grep/Glob the input is predictive enough (file size, command, pattern) to estimate output size. For Edit/Write the failure surface is the input payload (giant new_string), which is visible at PreToolUse — not the tool_response.
Required before merge: Run the actual end-to-end test (the unchecked checkbox in the PR description) against a payload > 200 KB but < 4 MB and confirm whether the agent crashes or survives. If it crashes, the hook should be rewritten as a PreToolUse hook (or removed and replaced with documentation of the buffer-bump + wrapper-detection as the real defenses). If it survives, please update the PR description with the experiment so future maintainers don't repeat this analysis.
2. The hook's tests do not exercise the production code path
sandbox/tests/test_posttooluse_truncate.py — every test calls evaluate() or main() and asserts on the returned JSON. None of them invoke the Claude CLI with the hook installed and verify that an oversized tool_response does not reach the SDK reader. This is the canonical "hand-built fixtures that bypass the production code path" pattern called out in the review rules:
A test that directly constructs a manifest entry, serialised payload, or cache key rather than going through the production helper does not exercise that helper. A regression there would not break the test.
Given the uncertainty in finding #1, this gap is what allows the hook to ship without anyone noticing it's a no-op. At minimum, add one test that:
- Spawns the actual
claudeCLI (or a faithful mock that mirrors the JSON-over-stdout protocol) with the hook registered. - Feeds a tool that produces a > cap
tool_response. - Asserts that the SDK Python process does not raise
CLIJSONDecodeError.
Without this, every future change to the SDK's PostToolUse semantics silently breaks the hook and no CI signal fires.
Non-blocking
3. tee via process substitution is not synchronized with the parent shell
orchestrator/consensus_wrapper.py:138-146 —
{agent_command_prefix} "$prompt" \
> >(tee -a "$AGENT_OUTPUT_LOG") \
2> >(tee -a "$AGENT_OUTPUT_LOG" >&2)Bash does not wait for process-substitution subshells. When the agent exits, the parent shell continues immediately and may call is_buffer_overflow (a grep) before the tee subshell has flushed its tail to disk. In the initial-exit path this is masked by the intervening egg-orch pipeline status call (network round-trip gives tee plenty of time), but in the restart-loop path (consensus_wrapper.py:587-590) the grep runs sooner after exit and could race.
The shape of the marker (single-line, written atomically by the SDK's logger as the process is dying) makes a real miss unlikely in practice, but if you want to harden it the standard fix is to capture the tee PIDs with $! after the substitution and wait for them, or rewrite as ( cmd 2>&1 ) | tee -a "$AGENT_OUTPUT_LOG" (sacrifices stderr/stdout separation but pipefail preserves the agent's exit code).
4. except TypeError, ValueError: relies on PEP 758 (3.14+) and the branch is dead code
sandbox/hooks/posttooluse_truncate.py:65 — This is valid in Python 3.14 (PEP 758) and ruff 0.15 accepts it, but it reads like a Python-2-era bug to anyone scanning the file. More importantly, with json.dumps(value, default=str) the only way to reach this branch is an object whose __str__/__repr__ itself raises — at which point the str(value) fallback raises too and the hook crashes. The test test_falls_back_to_str_for_unserializable does not actually trigger the except (the Weird class serializes fine via default=str).
Either make it except (TypeError, ValueError): for readability across linters/IDEs that don't yet know about PEP 758, or just drop the try/except entirely and let an unserializable object crash the hook (which fails open — the CLI gets no hook output and the tool_result passes through, which is the same outcome you wanted for the error path).
5. Retry guidance uses CLI-flag spelling for tool-input parameter names
sandbox/hooks/posttooluse_truncate.py:91 — the Grep guidance says --head-limit. The actual parameter name is head_limit (no dashes, no hyphens) — that's what the model sees in its tool schema and what it has to emit. Same risk in spirit for the Read guidance, though offset/limit happen to be spelled the same way the model uses them. Fix: head_limit everywhere; clarify "parameter head_limit" so the model isn't tempted to pass --head-limit as a string argument.
6. Restart-loop overflow detection is not behaviorally tested
orchestrator/tests/test_consensus_wrapper.py — test_buffer_overflow_check_in_restart_loop only asserts the string is_buffer_overflow appears ≥ 2 times in the template. The behavioral test test_buffer_overflow_aborts_without_retry exercises only the initial-exit handler (consensus not reached → first crash → abort). There's no test where:
- Initial run succeeds (consensus reached)
- A restart is triggered (e.g., clean exit) and the recovery run crashes with the overflow marker
- The restart-loop's
is_buffer_overflowcheck fires
This is the harder path to reach and the one most likely to regress silently if someone reshuffles the restart loop.
7. tool_name fallback to "unknown" may misclassify mutating tools
sandbox/hooks/posttooluse_truncate.py:133 — if tool_name is missing or null in the event, the hook uses "unknown", which falls through to _FALLBACK_RETRY_GUIDANCE ("re-issue the call with narrower input"). If the underlying tool was a mutating tool (e.g., a future MCP tool that's effectively a Write), telling the model to retry could double-apply the side effect. PostToolUse events should always include tool_name, so this is largely theoretical — but worth either erroring out or biasing the fallback toward "do NOT retry" since that's the safer default for unknown tools.
8. Hook process overhead on every tool call
sandbox/entrypoint.py:1077-1088 — matcher "*" runs a fresh Python subprocess for every tool call, including small Reads and MCP calls. Python cold-start is ~50–100 ms; for a session with hundreds of tool calls this adds up. Consider narrowing the matcher to the tools that have actually overflowed in practice (Edit|Write|MultiEdit|NotebookEdit|Read|Bash|Grep|Glob) — the current "*" is broader than the failure surface, and MCP tools have their own size cap planned in the follow-up #2805.
9. Wrapper detection relies on a string match against the SDK error message
orchestrator/consensus_wrapper.py:158-161 — greps for "exceeded maximum buffer size". This matches the current SDK source (subprocess_cli.py:654) but the SDK can change the wording at any minor version and the wrapper would silently fall through to is_transient_crash, burning the restart budget again. There's a comment that says "Match-string kept stable; the wrapper greps for it" but no enforcement. Consider either:
- A pinned SDK-version check in CI that re-asserts the marker substring exists in the installed SDK, or
- Asking upstream to expose a stable error code / exit code / structured marker the wrapper can match on instead of substring-matching free text.
This isn't a regression — it's just adding a new substring dependency the codebase will have to maintain.
Nits
sandbox/entrypoint.py:1083— the parenthesized one-element value("/opt/egg-runtime/sandbox/hooks/posttooluse_truncate.py")is just a parenthesized string, not a tuple, but the formatting suggests someone might think it is. Drop the parens.shared/egg_agent/client.py:65-77— the comment block is good context; consider tightening "Belt-and-suspenders" to a single line and moving the rationale into the issue/PR body so the source file stays scannable.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Address reviewer feedback on PR #2810. The PostToolUse hook layer doesn't work: the Claude Agent SDK contract (PostToolUseHookSpecificOutput exposes only additionalContext and updatedMCPToolOutput; "Can block? No" in docs) means decision:block cannot suppress the tool_response payload on the wire. The hook fires, the oversized payload still crosses the JSON channel, the buffer crashes anyway — pure telemetry on a channel the crash kills. Also drop the 4 MB max_buffer_size bump. Allowing larger contexts is the wrong escape hatch; the real fix is tool-layer truncation (#2805) so massive payloads never get produced in the first place. What this PR now does (scoped to the wrapper): - Detect the SDK buffer-overflow signature in the agent's output and short-circuit the restart budget — the overflow is deterministic, so retries just re-hit the same crash and burn the budget for no gain. - Restructure run_agent to pipe through a single tee instead of two process-substitution tees; bash waits on pipelines synchronously, so the post-mortem grep can no longer race a still-flushing tee. - Add a behavioral test for the restart-loop overflow path (initial clean exit triggers restart, recovery run crashes with the marker, wrapper aborts without further retries). What goes elsewhere: - Tool-layer truncation in egg MCP tools — #2805. - Buffer-overflow crashes on built-in tools (Edit on large files, etc.) — remain a known terminal failure until tool-layer caps land. The wrapper now fails fast instead of burning the restart budget.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Review
Re-review at ddd9b01ca (delta = 3068b6120 Fix #2804: drop broken PostToolUse hook, keep fail-fast on overflow on top of the previous review at a31fe0fa5; merge commits 9897b1e27 and ddd9b01ca are noise from main).
Previously blocking concerns — verified addressed
- PostToolUse hook entirely removed.
sandbox/hooks/posttooluse_truncate.py,sandbox/tests/test_posttooluse_truncate.py, and thehooks: PostToolUseblock insandbox/entrypoint.py:1064-1090are all gone. The fix-commit message correctly cites the SDK contract (PostToolUseHookSpecificOutputexposes onlyadditionalContext/updatedMCPToolOutput, "Can block? No") — the layer was telemetry on a channel the crash kills, and dropping it is the right call. - Restart-loop behavioral test added.
orchestrator/tests/test_consensus_wrapper.py:1466(test_buffer_overflow_in_restart_loop_aborts_without_further_retries) drives a mockpython3that exits 0 on call 1 (triggers the restart loop) and emits the SDK overflow signature + exits 255 on call 2, then assertsresult.returncode == 255, "on restart" in stderr, andcall_count == 2. This is the regression guard the previous review (#6) asked for — the restart-loop short-circuit atconsensus_wrapper.py:599is now exercised end-to-end, not just structurally. teerace fixed via single pipeline.run_agent()atconsensus_wrapper.py:137-150now usescmd 2>&1 | tee -a "$AGENT_OUTPUT_LOG"withreturn ${PIPESTATUS[0]}. Bash waits on pipelines synchronously, so the post-mortemgrepinis_buffer_overflowcan no longer race a still-flushing process-substitutiontee. The inline comment at lines 138-145 correctly explains why this matters.max_buffer_size=4*1024*1024bump dropped.shared/egg_agent/client.pyno longer raises the limit. Per the commit message, the right fix is tool-layer truncation (#2805) — burning context-window budget so an over-payload can succeed is the wrong direction.
Blocking
B1. sandbox/tests/test_entrypoint_settings.py deletion drops coverage for an unrelated, still-live feature (3068b6120 removes the entire 129-line file).
The deleted file contained two test classes:
-
TestPostToolUseHook(3 tests) — correctly deleted because the production hook is gone. -
TestDisallowedToolsPrivateMode(5 tests) — coveredEGG_PRIVATE_MODE→settings["disallowedTools"] = ["WebFetch", "WebSearch"]written into~/.claude/settings.json. That production code still exists atsandbox/entrypoint.py:1071-1075:private_mode_env = os.environ.get("EGG_PRIVATE_MODE", "").lower() if private_mode_env in ("true", "1"): settings["disallowedTools"] = ["WebFetch", "WebSearch"] logger.info("Private mode: disallowed WebFetch/WebSearch in agent settings")
Originally landed in PR #1402. The 5 deleted tests covered:
EGG_PRIVATE_MODE=true,=1,=false,=0, and unset.
There's no replacement coverage. tests/sandbox/test_entrypoint.py is the only other entrypoint test file and grep -i "disallow\|private_mode\|webfetch" against it returns zero matches. The SDK-layer test at tests/shared/egg_agent/test_client.py:759-821 covers ClaudeAgentOptions.disallowed_tools (a different code path inside run_agent_async) — that's defense-in-depth #2, not the settings.json layer.
The PR commit message is scoped to "drop broken PostToolUse hook, keep fail-fast on overflow." Dropping coverage for a wholly unrelated settings.json write is out of scope for this PR and out of scope for the cited issue. Either:
- (preferred) Restore the
TestDisallowedToolsPrivateModeclass to a new file (e.g.tests/sandbox/test_entrypoint_settings.py), or - explicitly remove the
disallowedToolscode inentrypoint.py:1069-1075along with the tests — though that's a separate decision deserving its own PR, since the SDK-layer disallow insiderun_agent_asyncalready exists alongside it as documented defense-in-depth (client.py:147-153: "settings.json may also contain disallowedTools (set by the entrypoint), but passing them here as a CLI flag is more reliable and eliminates the gateway log noise").
Silently dropping the tests while leaving the production code in place is the worst of the three options — the next contributor who refactors setup_claude won't have a guard rail, and the existing dual-layer comment in client.py will be a lie about what's tested.
Non-blocking
N1. _BUFFER_OVERFLOW_MARKER in shared/egg_agent/client.py:72 is unused. The constant is declared at module scope but no code reads it. The second textual mention at line 133 is inside a comment block, not a code reference. Either reference the constant from the test (e.g. parameterize test_buffer_overflow_returns_failure_with_marker against it so a rename would flag breakage) or drop the declaration — declaring an unused constant only invites the next reader to assume it's load-bearing.
N2. Misleading comment in shared/egg_agent/client.py:131-137. The comment says: "issue #2804 relies on its error message preserving the exceeded maximum buffer size marker so the consensus-wrapper can short-circuit retry on this failure class. We don't import it explicitly here (the marker stability is verified by tests)."
The marker stability is not verified by any test on this branch. test_buffer_overflow_returns_failure_with_marker (tests/shared/egg_agent/test_client.py:833) hand-constructs the error: mock_query.side_effect = CLIJSONDecodeError("JSON message exceeded maximum buffer size of 1048576 bytes..."). The test verifies that if the SDK produces that string, it propagates to result.error — but the SDK could change the marker tomorrow and this test still passes. Either:
- Add a real-SDK smoke test that triggers the overflow (e.g. by exhausting a synthetic stream) and asserts the marker substring — the right answer per previous review #9, deferrable to a follow-up but worth a tracking issue/TODO, or
- Soften the comment to "preserved into
result.erroris verified by tests; stability against SDK changes is not — see #2804 for the SDK-version coupling."
The current wording overstates the test coverage.
N3. /tmp/agent-output-$$.log with tee -a is symlink-followable in /tmp. consensus_wrapper.py:127 defines AGENT_OUTPUT_LOG="${AGENT_OUTPUT_LOG:-/tmp/agent-output-$$.log}" and tee -a follows symlinks — a co-tenant on the same host who can predict the wrapper's PID ($$) could pre-create a symlink at that path and capture or redirect agent output. The container is single-tenant in normal operation, so this is theoretical; in a multi-tenant sandbox setup it would be a real issue. Consider mktemp for the default path, or document the single-tenant assumption.
N4. SDK marker dependency without CI pin. Carried from previous review #9, still not addressed. The wrapper greps for a literal substring of a third-party error message. If claude-agent-sdk changes that string (e.g. drops "maximum"), the wrapper silently regresses to burning the full restart budget on every overflow crash. A version pin in pyproject.toml + a CHANGELOG-watch on SDK releases, or a smoke test that imports the SDK and asserts the marker, would prevent silent regression. The fix-commit message acknowledges this as a known tradeoff; a tracking issue would make the deferral explicit.
N5. stderr merged with stdout in agent log. cmd 2>&1 | tee -a "$AGENT_OUTPUT_LOG" makes the log a single interleaved stream, which is what is_buffer_overflow needs (the SDK writes the overflow signature to stderr), but it also blurs ordering for any future log analysis. The inline comment at lines 138-145 explains the choice; if anyone post-mortems agent crashes by reading these logs, a note in docs/ or the wrapper docstring that the merge is intentional would help.
Verdict
The buffer-overflow short-circuit changes themselves are well-scoped and properly tested. B1 is the blocker — the deletion of TestDisallowedToolsPrivateMode is unrelated to the stated PR scope and removes the only coverage for a still-live production code path. Once the tests are restored (or the production code is removed in a separate PR), this is good to merge.
— Authored by egg
This comment has been minimized.
This comment has been minimized.
| egg is addressing review feedback... |
B1 (blocking): restore TestDisallowedToolsPrivateMode coverage. The previous fix-commit deleted sandbox/tests/test_entrypoint_settings.py wholesale to drop TestPostToolUseHook, but that file also covered the EGG_PRIVATE_MODE -> settings["disallowedTools"] branch in sandbox/entrypoint.py:1071-1075, which is still live. Re-add the 5 private-mode tests as tests/sandbox/test_entrypoint_settings.py (adjacent to the existing entrypoint tests). Verified all 5 pass. N1: wire _BUFFER_OVERFLOW_MARKER from shared/egg_agent/client.py into the test that asserts the SDK error message round-trips into result.error, and add a new test in orchestrator/tests/test_consensus_wrapper.py that asserts the bash wrapper script's grep substring matches the same constant — so a rename of the constant fails both tests rather than silently desyncing the client from the wrapper. N2: rewrite the comment in client.py to stop claiming "marker stability is verified by tests." It isn't — the tests verify round-trip propagation, not stability against a future SDK rename. Point at #2823 (filed below) for the real fix. N3: switch consensus_wrapper.py's default AGENT_OUTPUT_LOG path from the predictable /tmp/agent-output-$$.log to mktemp(1) when the env var isn't set. tee -a follows symlinks, so the predictable path is a theoretical pre-create attack in multi-tenant sandbox setups; mktemp is defense-in-depth (the container is single-tenant in normal operation). Falls back to the old path only if mktemp is missing from PATH. N4: filed #2823 to track pinning the SDK marker against silent drift. Couldn't fix in-PR — claude-agent-sdk isn't installed in the egg test environment / CI, so a smoke test that imports the SDK and asserts the marker substring would skip everywhere it could catch the regression. Three approaches enumerated in the issue. N5: extend the run_agent docstring comment in consensus_wrapper.py to explain why 2>&1 | tee -a interleaves stderr into the captured log (the SDK's overflow marker is emitted on stderr by logger.error in claude_agent_sdk.query, and the grep needs to see it in the same file as stdout), and note the trade-off for future log analysis. Tests: 137 passed across the touched files (tests/sandbox/ test_entrypoint_settings.py + tests/shared/egg_agent/test_client.py + orchestrator/tests/test_consensus_wrapper.py).
Re-review response — addressing the 2nd round (a31fe0f...ddd9b01)Pushed commit cb35179 on top of ddd9b01. Disposition per item: BlockingB1. Non-blockingN1. Unused
A rename of the constant now fails both tests immediately rather than silently desyncing the client from the wrapper. N2. Misleading comment about marker stability — N3. N4. SDK marker dependency without CI pin — N5. Document intentional stderr/stdout merge — — Authored by egg |
Summary
Fixes #2804. The Claude Agent SDK message-reader has a 1 MB JSON
buffer cap; tool results that exceed it kill the agent with exit 255,
and the consensus-wrapper would then waste its full restart budget
retrying a deterministic failure (the same codebase produces the same
overflow on every retry).
This PR makes the overflow a clean fail-fast condition in the
wrapper. The actual prevention is tool-layer truncation, which lives
in #2805 — agents shouldn't be passing megabyte-scale context around
in the first place, so raising the SDK buffer would be the wrong
escape hatch.
What changed
Wrapper-side buffer-overflow detection
(
orchestrator/consensus_wrapper.py): captures agent stdout+stderrvia
teeand greps for the SDK marker"exceeded maximum buffer size". When matched, the wrapper exitsimmediately — both in the initial-exit handler and inside the
restart loop — instead of classifying the crash as a transient
signal-255 and consuming the restart budget. Logged as a structured
buffer overflow (issue #2804)message rather than a genericAGENT_FAILED.teepipe restructured for synchronous wait: wascmd > >(tee ...) 2> >(tee ...)(process substitution, bash doesNOT wait for the tee subshells), is now
cmd 2>&1 | tee -a "$AGENT_OUTPUT_LOG"withreturn ${PIPESTATUS[0]}. Bash waits on pipelines synchronously, sothe post-mortem
is_buffer_overflowgrep can no longer race astill-flushing tee — the previous form happened to work in practice
because of intervening network calls, but the structure didn't
guarantee it.
What was dropped from the original PR
After verifying against the SDK source + docs during code review, two
of the original three layers were removed:
PostToolUse
decision: blockhook — does not suppress thetool_responsepayload.PostToolUseHookSpecificOutputinclaude_agent_sdk/types.pyexposes onlyadditionalContext(append-only) and
updatedMCPToolOutput(MCP-only); the officialhook docs state "Can block? No · tool already ran." The hook would
fire, return
decision: block, and the oversized payload wouldstill cross the SDK JSON channel and crash the reader. Pure
telemetry on a channel the crash kills.
max_buffer_sizebump to 4 MB — wrong escape hatch. Raisingthe buffer trades one symptom (exit-255) for another (eventual
context-budget exhaustion) and tacitly allows megabyte-scale tool
results to flow into the agent. The right ceiling is tool-layer
truncation that bounds output before it crosses any buffer.
Failure modes after this PR
(tool-layer truncation, separate PR).
Editon a large file,
Grep --output_mode contenton broad patterns):agent still crashes with exit 255. Wrapper now classifies this as
terminal and exits without burning the restart budget; operator
receives a structured
buffer overflow (issue #2804)log line andcan recover via
restart_phase. Out of scope here — best handledupstream or via a separate PreToolUse predictive-deny layer if it
becomes a recurring source of failures.
Verified
pytest orchestrator/tests/test_consensus_wrapper.py tests/shared/egg_agent/test_client.py— 131 passed, including the existing initial-exit overflow test and a
new behavioral test for the restart-loop overflow path (clean
initial exit → restart → recovery crash with overflow marker →
wrapper aborts without consuming the remaining budget).
Related
@tooloutput sizes at the toollayer. With Fix #2804: bound tool result size to prevent SDK buffer-overflow crashes #2810 landed (fail-fast), Cap MCP @tool output sizes at the tool layer (follow-up to #2804) #2805 then becomes
"actually prevent the crash for the tools we own."
crashes mid-task.
Test plan
exits the wrapper with the structured
buffer overflow (issue #2804)message and no
Restarting (N/M)...lines in the orchestrator log.