Skip to content

Fix #2804: bound tool result size to prevent SDK buffer-overflow crashes#2810

Open
jwbron wants to merge 5 commits into
mainfrom
egg/issue-2804
Open

Fix #2804: bound tool result size to prevent SDK buffer-overflow crashes#2810
jwbron wants to merge 5 commits into
mainfrom
egg/issue-2804

Conversation

@jwbron
Copy link
Copy Markdown
Owner

@jwbron jwbron commented May 27, 2026

Summary

Fixes #2804. The Claude Agent SDK message-reader has a 1 MB JSON
buffer cap; tool results that exceed it kill the agent with exit 255,
and the consensus-wrapper would then waste its full restart budget
retrying a deterministic failure (the same codebase produces the same
overflow on every retry).

This PR makes the overflow a clean fail-fast condition in the
wrapper. The actual prevention is tool-layer truncation, which lives
in #2805 — agents shouldn't be passing megabyte-scale context around
in the first place, so raising the SDK buffer would be the wrong
escape hatch.

What changed

  1. Wrapper-side buffer-overflow detection
    (orchestrator/consensus_wrapper.py): captures agent stdout+stderr
    via tee and greps for the SDK marker
    "exceeded maximum buffer size". When matched, the wrapper exits
    immediately — both in the initial-exit handler and inside the
    restart loop — instead of classifying the crash as a transient
    signal-255 and consuming the restart budget. Logged as a structured
    buffer overflow (issue #2804) message rather than a generic
    AGENT_FAILED.

  2. tee pipe restructured for synchronous wait: was
    cmd > >(tee ...) 2> >(tee ...) (process substitution, bash does
    NOT wait for the tee subshells), is now
    cmd 2>&1 | tee -a "$AGENT_OUTPUT_LOG" with
    return ${PIPESTATUS[0]}. Bash waits on pipelines synchronously, so
    the post-mortem is_buffer_overflow grep can no longer race a
    still-flushing tee — the previous form happened to work in practice
    because of intervening network calls, but the structure didn't
    guarantee it.

What was dropped from the original PR

After verifying against the SDK source + docs during code review, two
of the original three layers were removed:

  • PostToolUse decision: block hook — does not suppress the
    tool_response payload. PostToolUseHookSpecificOutput in
    claude_agent_sdk/types.py exposes only additionalContext
    (append-only) and updatedMCPToolOutput (MCP-only); the official
    hook docs state "Can block? No · tool already ran." The hook would
    fire, return decision: block, and the oversized payload would
    still cross the SDK JSON channel and crash the reader. Pure
    telemetry on a channel the crash kills.

  • max_buffer_size bump to 4 MB — wrong escape hatch. Raising
    the buffer trades one symptom (exit-255) for another (eventual
    context-budget exhaustion) and tacitly allows megabyte-scale tool
    results to flow into the agent. The right ceiling is tool-layer
    truncation that bounds output before it crosses any buffer.

Failure modes after this PR

  • egg-owned MCP tool returns oversized payload: fixed by Cap MCP @tool output sizes at the tool layer (follow-up to #2804) #2805
    (tool-layer truncation, separate PR).
  • Built-in Claude Code tool returns oversized payload (e.g. Edit
    on a large file, Grep --output_mode content on broad patterns):
    agent still crashes with exit 255. Wrapper now classifies this as
    terminal and exits without burning the restart budget; operator
    receives a structured buffer overflow (issue #2804) log line and
    can recover via restart_phase. Out of scope here — best handled
    upstream or via a separate PreToolUse predictive-deny layer if it
    becomes a recurring source of failures.

Verified

  • pytest orchestrator/tests/test_consensus_wrapper.py tests/shared/egg_agent/test_client.py
    — 131 passed, including the existing initial-exit overflow test and a
    new behavioral test for the restart-loop overflow path (clean
    initial exit → restart → recovery crash with overflow marker →
    wrapper aborts without consuming the remaining budget).

Related

Test plan

  • Targeted test suite green on the wrapper + client modules.
  • Deploy to staging; confirm a real overflow crash in a sandbox
    exits the wrapper with the structured buffer overflow (issue #2804)
    message and no Restarting (N/M)... lines in the orchestrator log.

The Claude Agent SDK message-reader has a 1 MB JSON buffer cap; tool
results that exceed it kill the agent with exit 255 and the consensus
wrapper wastes its full restart budget retrying a deterministic
failure. Fix layers three defenses so the failure cannot reach the
restart loop in the first place — and when it does, it stops the wrapper
immediately instead of burning retries.

1. CLI-side PostToolUse hook (sandbox/hooks/posttooluse_truncate.py)
   suppresses tool results above EGG_TOOL_RESULT_CAP_BYTES (default
   200 KB) by returning a decision=block with truncation guidance.
   Runs inside the CLI subprocess so it intercepts before the result
   crosses the SDK JSON channel. Mutating tools (Edit/Write/MultiEdit/
   NotebookEdit) get a reason that warns against retrying the call.
2. ClaudeAgentOptions.max_buffer_size bumped to 4 MB (env-overridable
   via EGG_AGENT_MAX_BUFFER_SIZE) as belt-and-suspenders.
3. consensus_wrapper greps captured agent output for the SDK overflow
   marker and exits immediately without consuming the restart budget,
   surfacing the failure as buffer-overflow rather than a generic
   AGENT_FAILED.

Verified against the issue-2777 16:30:47 UTC crash signature.
Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The defense-in-depth motivation is sound, and layers (2) buffer bump and (3) wrapper detection are clear wins. However, the claim that layer (1) — the PostToolUse hook — actually prevents the SDK buffer overflow does not appear to be supported by the Claude Agent SDK contract. The end-to-end test for the central claim is unchecked in the PR description, and the unit tests only verify the hook's own JSON output, not whether the CLI suppresses the oversized payload on the wire.

Blocking

1. PostToolUse decision: block does not suppress the tool_response payload — primary defense is likely a no-op

sandbox/hooks/posttooluse_truncate.py:131-138

The hook returns {"decision": "block", "reason": "..."}. The PR's central claim is:

The hook runs inside the CLI subprocess so it intercepts before the oversized result crosses the SDK JSON channel.

The Claude Agent SDK documentation and types contradict this:

  • claude_agent_sdk/types.py:379-384PostToolUseHookSpecificOutput exposes only additionalContext (append) and updatedMCPToolOutput (MCP-only). There is no field for suppressing or replacing the original tool_response for built-in tools like Edit / Read / Bash.
  • Official docs (code.claude.com/docs/en/hooks) state explicitly: "the tool has already run successfully, so this cannot prevent the action but can provide feedback for future actions." decision: block "automatically prompts Claude with the reason" — it does not redact what the model sees.
  • For PostToolUse the table shows "Can block? No" — the tool already ran, and the payload still flows to the conversation.

If this reading is correct, then for the exact incident the PR cites — an Edit on phases.py whose tool_response exceeded the buffer — the hook will fire, return decision: block, and the oversized tool_response will still be serialized into a tool_result message on the CLI→SDK JSON stream, which is where the 1 MB buffer trip happens. The hook then becomes pure telemetry: it adds a reason the model never sees (the turn dies before the next inference), and the agent crashes anyway.

What this means for the PR:

  • The "200 KB cap is the real ceiling" rationale collapses — the real ceiling is whatever max_buffer_size resolves to (4 MB after this PR, env-overridable). Anything between 200 KB and 4 MB is now wasted retry-prevention; anything > 4 MB still crashes; the cited 1 MB → 4 MB headroom is the actual protection.
  • The "Why CLI-side hook, not programmatic SDK hook" rationale ("intercepts before the oversized result crosses the SDK JSON channel") is the load-bearing claim and appears to be false.
  • The PR's own test plan has this exact verification step unchecked: "confirm... oversized tool results return a block-with-reason payload to the agent instead of crashing." That is the experiment that decides whether layer (1) works.

If the goal is to actually prevent the payload from crossing the channel, the standard SDK pattern is a PreToolUse hook returning permissionDecision: "deny" (or updatedInput to narrow the call) — which fires before the tool runs and can prevent the result from existing at all. For Read/Bash/Grep/Glob the input is predictive enough (file size, command, pattern) to estimate output size. For Edit/Write the failure surface is the input payload (giant new_string), which is visible at PreToolUse — not the tool_response.

Required before merge: Run the actual end-to-end test (the unchecked checkbox in the PR description) against a payload > 200 KB but < 4 MB and confirm whether the agent crashes or survives. If it crashes, the hook should be rewritten as a PreToolUse hook (or removed and replaced with documentation of the buffer-bump + wrapper-detection as the real defenses). If it survives, please update the PR description with the experiment so future maintainers don't repeat this analysis.

2. The hook's tests do not exercise the production code path

sandbox/tests/test_posttooluse_truncate.py — every test calls evaluate() or main() and asserts on the returned JSON. None of them invoke the Claude CLI with the hook installed and verify that an oversized tool_response does not reach the SDK reader. This is the canonical "hand-built fixtures that bypass the production code path" pattern called out in the review rules:

A test that directly constructs a manifest entry, serialised payload, or cache key rather than going through the production helper does not exercise that helper. A regression there would not break the test.

Given the uncertainty in finding #1, this gap is what allows the hook to ship without anyone noticing it's a no-op. At minimum, add one test that:

  1. Spawns the actual claude CLI (or a faithful mock that mirrors the JSON-over-stdout protocol) with the hook registered.
  2. Feeds a tool that produces a > cap tool_response.
  3. Asserts that the SDK Python process does not raise CLIJSONDecodeError.

Without this, every future change to the SDK's PostToolUse semantics silently breaks the hook and no CI signal fires.

Non-blocking

3. tee via process substitution is not synchronized with the parent shell

orchestrator/consensus_wrapper.py:138-146

{agent_command_prefix} "$prompt" \
    > >(tee -a "$AGENT_OUTPUT_LOG") \
    2> >(tee -a "$AGENT_OUTPUT_LOG" >&2)

Bash does not wait for process-substitution subshells. When the agent exits, the parent shell continues immediately and may call is_buffer_overflow (a grep) before the tee subshell has flushed its tail to disk. In the initial-exit path this is masked by the intervening egg-orch pipeline status call (network round-trip gives tee plenty of time), but in the restart-loop path (consensus_wrapper.py:587-590) the grep runs sooner after exit and could race.

The shape of the marker (single-line, written atomically by the SDK's logger as the process is dying) makes a real miss unlikely in practice, but if you want to harden it the standard fix is to capture the tee PIDs with $! after the substitution and wait for them, or rewrite as ( cmd 2>&1 ) | tee -a "$AGENT_OUTPUT_LOG" (sacrifices stderr/stdout separation but pipefail preserves the agent's exit code).

4. except TypeError, ValueError: relies on PEP 758 (3.14+) and the branch is dead code

sandbox/hooks/posttooluse_truncate.py:65 — This is valid in Python 3.14 (PEP 758) and ruff 0.15 accepts it, but it reads like a Python-2-era bug to anyone scanning the file. More importantly, with json.dumps(value, default=str) the only way to reach this branch is an object whose __str__/__repr__ itself raises — at which point the str(value) fallback raises too and the hook crashes. The test test_falls_back_to_str_for_unserializable does not actually trigger the except (the Weird class serializes fine via default=str).

Either make it except (TypeError, ValueError): for readability across linters/IDEs that don't yet know about PEP 758, or just drop the try/except entirely and let an unserializable object crash the hook (which fails open — the CLI gets no hook output and the tool_result passes through, which is the same outcome you wanted for the error path).

5. Retry guidance uses CLI-flag spelling for tool-input parameter names

sandbox/hooks/posttooluse_truncate.py:91 — the Grep guidance says --head-limit. The actual parameter name is head_limit (no dashes, no hyphens) — that's what the model sees in its tool schema and what it has to emit. Same risk in spirit for the Read guidance, though offset/limit happen to be spelled the same way the model uses them. Fix: head_limit everywhere; clarify "parameter head_limit" so the model isn't tempted to pass --head-limit as a string argument.

6. Restart-loop overflow detection is not behaviorally tested

orchestrator/tests/test_consensus_wrapper.pytest_buffer_overflow_check_in_restart_loop only asserts the string is_buffer_overflow appears ≥ 2 times in the template. The behavioral test test_buffer_overflow_aborts_without_retry exercises only the initial-exit handler (consensus not reached → first crash → abort). There's no test where:

  • Initial run succeeds (consensus reached)
  • A restart is triggered (e.g., clean exit) and the recovery run crashes with the overflow marker
  • The restart-loop's is_buffer_overflow check fires

This is the harder path to reach and the one most likely to regress silently if someone reshuffles the restart loop.

7. tool_name fallback to "unknown" may misclassify mutating tools

sandbox/hooks/posttooluse_truncate.py:133 — if tool_name is missing or null in the event, the hook uses "unknown", which falls through to _FALLBACK_RETRY_GUIDANCE ("re-issue the call with narrower input"). If the underlying tool was a mutating tool (e.g., a future MCP tool that's effectively a Write), telling the model to retry could double-apply the side effect. PostToolUse events should always include tool_name, so this is largely theoretical — but worth either erroring out or biasing the fallback toward "do NOT retry" since that's the safer default for unknown tools.

8. Hook process overhead on every tool call

sandbox/entrypoint.py:1077-1088 — matcher "*" runs a fresh Python subprocess for every tool call, including small Reads and MCP calls. Python cold-start is ~50–100 ms; for a session with hundreds of tool calls this adds up. Consider narrowing the matcher to the tools that have actually overflowed in practice (Edit|Write|MultiEdit|NotebookEdit|Read|Bash|Grep|Glob) — the current "*" is broader than the failure surface, and MCP tools have their own size cap planned in the follow-up #2805.

9. Wrapper detection relies on a string match against the SDK error message

orchestrator/consensus_wrapper.py:158-161 — greps for "exceeded maximum buffer size". This matches the current SDK source (subprocess_cli.py:654) but the SDK can change the wording at any minor version and the wrapper would silently fall through to is_transient_crash, burning the restart budget again. There's a comment that says "Match-string kept stable; the wrapper greps for it" but no enforcement. Consider either:

  • A pinned SDK-version check in CI that re-asserts the marker substring exists in the installed SDK, or
  • Asking upstream to expose a stable error code / exit code / structured marker the wrapper can match on instead of substring-matching free text.

This isn't a regression — it's just adding a new substring dependency the codebase will have to maintain.

Nits

  • sandbox/entrypoint.py:1083 — the parenthesized one-element value ("/opt/egg-runtime/sandbox/hooks/posttooluse_truncate.py") is just a parenthesized string, not a tuple, but the formatting suggests someone might think it is. Drop the parens.
  • shared/egg_agent/client.py:65-77 — the comment block is good context; consider tightening "Belt-and-suspenders" to a single line and moving the rationale into the issue/PR body so the source file stays scannable.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

jwbron and others added 2 commits May 27, 2026 11:26
Address reviewer feedback on PR #2810. The PostToolUse hook layer
doesn't work: the Claude Agent SDK contract (PostToolUseHookSpecificOutput
exposes only additionalContext and updatedMCPToolOutput; "Can block? No"
in docs) means decision:block cannot suppress the tool_response payload
on the wire. The hook fires, the oversized payload still crosses the
JSON channel, the buffer crashes anyway — pure telemetry on a channel
the crash kills.

Also drop the 4 MB max_buffer_size bump. Allowing larger contexts is
the wrong escape hatch; the real fix is tool-layer truncation (#2805)
so massive payloads never get produced in the first place.

What this PR now does (scoped to the wrapper):
- Detect the SDK buffer-overflow signature in the agent's output and
  short-circuit the restart budget — the overflow is deterministic, so
  retries just re-hit the same crash and burn the budget for no gain.
- Restructure run_agent to pipe through a single tee instead of two
  process-substitution tees; bash waits on pipelines synchronously,
  so the post-mortem grep can no longer race a still-flushing tee.
- Add a behavioral test for the restart-loop overflow path (initial
  clean exit triggers restart, recovery run crashes with the marker,
  wrapper aborts without further retries).

What goes elsewhere:
- Tool-layer truncation in egg MCP tools — #2805.
- Buffer-overflow crashes on built-in tools (Edit on large files,
  etc.) — remain a known terminal failure until tool-layer caps land.
  The wrapper now fails fast instead of burning the restart budget.
@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Re-review at ddd9b01ca (delta = 3068b6120 Fix #2804: drop broken PostToolUse hook, keep fail-fast on overflow on top of the previous review at a31fe0fa5; merge commits 9897b1e27 and ddd9b01ca are noise from main).

Previously blocking concerns — verified addressed

  1. PostToolUse hook entirely removed. sandbox/hooks/posttooluse_truncate.py, sandbox/tests/test_posttooluse_truncate.py, and the hooks: PostToolUse block in sandbox/entrypoint.py:1064-1090 are all gone. The fix-commit message correctly cites the SDK contract (PostToolUseHookSpecificOutput exposes only additionalContext / updatedMCPToolOutput, "Can block? No") — the layer was telemetry on a channel the crash kills, and dropping it is the right call.
  2. Restart-loop behavioral test added. orchestrator/tests/test_consensus_wrapper.py:1466 (test_buffer_overflow_in_restart_loop_aborts_without_further_retries) drives a mock python3 that exits 0 on call 1 (triggers the restart loop) and emits the SDK overflow signature + exits 255 on call 2, then asserts result.returncode == 255, "on restart" in stderr, and call_count == 2. This is the regression guard the previous review (#6) asked for — the restart-loop short-circuit at consensus_wrapper.py:599 is now exercised end-to-end, not just structurally.
  3. tee race fixed via single pipeline. run_agent() at consensus_wrapper.py:137-150 now uses cmd 2>&1 | tee -a "$AGENT_OUTPUT_LOG" with return ${PIPESTATUS[0]}. Bash waits on pipelines synchronously, so the post-mortem grep in is_buffer_overflow can no longer race a still-flushing process-substitution tee. The inline comment at lines 138-145 correctly explains why this matters.
  4. max_buffer_size=4*1024*1024 bump dropped. shared/egg_agent/client.py no longer raises the limit. Per the commit message, the right fix is tool-layer truncation (#2805) — burning context-window budget so an over-payload can succeed is the wrong direction.

Blocking

B1. sandbox/tests/test_entrypoint_settings.py deletion drops coverage for an unrelated, still-live feature (3068b6120 removes the entire 129-line file).

The deleted file contained two test classes:

  • TestPostToolUseHook (3 tests) — correctly deleted because the production hook is gone.

  • TestDisallowedToolsPrivateMode (5 tests) — covered EGG_PRIVATE_MODEsettings["disallowedTools"] = ["WebFetch", "WebSearch"] written into ~/.claude/settings.json. That production code still exists at sandbox/entrypoint.py:1071-1075:

    private_mode_env = os.environ.get("EGG_PRIVATE_MODE", "").lower()
    if private_mode_env in ("true", "1"):
        settings["disallowedTools"] = ["WebFetch", "WebSearch"]
        logger.info("Private mode: disallowed WebFetch/WebSearch in agent settings")

    Originally landed in PR #1402. The 5 deleted tests covered: EGG_PRIVATE_MODE=true, =1, =false, =0, and unset.

There's no replacement coverage. tests/sandbox/test_entrypoint.py is the only other entrypoint test file and grep -i "disallow\|private_mode\|webfetch" against it returns zero matches. The SDK-layer test at tests/shared/egg_agent/test_client.py:759-821 covers ClaudeAgentOptions.disallowed_tools (a different code path inside run_agent_async) — that's defense-in-depth #2, not the settings.json layer.

The PR commit message is scoped to "drop broken PostToolUse hook, keep fail-fast on overflow." Dropping coverage for a wholly unrelated settings.json write is out of scope for this PR and out of scope for the cited issue. Either:

  • (preferred) Restore the TestDisallowedToolsPrivateMode class to a new file (e.g. tests/sandbox/test_entrypoint_settings.py), or
  • explicitly remove the disallowedTools code in entrypoint.py:1069-1075 along with the tests — though that's a separate decision deserving its own PR, since the SDK-layer disallow inside run_agent_async already exists alongside it as documented defense-in-depth (client.py:147-153: "settings.json may also contain disallowedTools (set by the entrypoint), but passing them here as a CLI flag is more reliable and eliminates the gateway log noise").

Silently dropping the tests while leaving the production code in place is the worst of the three options — the next contributor who refactors setup_claude won't have a guard rail, and the existing dual-layer comment in client.py will be a lie about what's tested.

Non-blocking

N1. _BUFFER_OVERFLOW_MARKER in shared/egg_agent/client.py:72 is unused. The constant is declared at module scope but no code reads it. The second textual mention at line 133 is inside a comment block, not a code reference. Either reference the constant from the test (e.g. parameterize test_buffer_overflow_returns_failure_with_marker against it so a rename would flag breakage) or drop the declaration — declaring an unused constant only invites the next reader to assume it's load-bearing.

N2. Misleading comment in shared/egg_agent/client.py:131-137. The comment says: "issue #2804 relies on its error message preserving the exceeded maximum buffer size marker so the consensus-wrapper can short-circuit retry on this failure class. We don't import it explicitly here (the marker stability is verified by tests)."

The marker stability is not verified by any test on this branch. test_buffer_overflow_returns_failure_with_marker (tests/shared/egg_agent/test_client.py:833) hand-constructs the error: mock_query.side_effect = CLIJSONDecodeError("JSON message exceeded maximum buffer size of 1048576 bytes..."). The test verifies that if the SDK produces that string, it propagates to result.error — but the SDK could change the marker tomorrow and this test still passes. Either:

  • Add a real-SDK smoke test that triggers the overflow (e.g. by exhausting a synthetic stream) and asserts the marker substring — the right answer per previous review #9, deferrable to a follow-up but worth a tracking issue/TODO, or
  • Soften the comment to "preserved into result.error is verified by tests; stability against SDK changes is not — see #2804 for the SDK-version coupling."

The current wording overstates the test coverage.

N3. /tmp/agent-output-$$.log with tee -a is symlink-followable in /tmp. consensus_wrapper.py:127 defines AGENT_OUTPUT_LOG="${AGENT_OUTPUT_LOG:-/tmp/agent-output-$$.log}" and tee -a follows symlinks — a co-tenant on the same host who can predict the wrapper's PID ($$) could pre-create a symlink at that path and capture or redirect agent output. The container is single-tenant in normal operation, so this is theoretical; in a multi-tenant sandbox setup it would be a real issue. Consider mktemp for the default path, or document the single-tenant assumption.

N4. SDK marker dependency without CI pin. Carried from previous review #9, still not addressed. The wrapper greps for a literal substring of a third-party error message. If claude-agent-sdk changes that string (e.g. drops "maximum"), the wrapper silently regresses to burning the full restart budget on every overflow crash. A version pin in pyproject.toml + a CHANGELOG-watch on SDK releases, or a smoke test that imports the SDK and asserts the marker, would prevent silent regression. The fix-commit message acknowledges this as a known tradeoff; a tracking issue would make the deferral explicit.

N5. stderr merged with stdout in agent log. cmd 2>&1 | tee -a "$AGENT_OUTPUT_LOG" makes the log a single interleaved stream, which is what is_buffer_overflow needs (the SDK writes the overflow signature to stderr), but it also blurs ordering for any future log analysis. The inline comment at lines 138-145 explains the choice; if anyone post-mortems agent crashes by reading these logs, a note in docs/ or the wrapper docstring that the merge is intentional would help.

Verdict

The buffer-overflow short-circuit changes themselves are well-scoped and properly tested. B1 is the blocker — the deletion of TestDisallowedToolsPrivateMode is unrelated to the stated PR scope and removes the only coverage for a still-live production code path. Once the tests are restored (or the production code is removed in a separate PR), this is good to merge.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Copy Markdown
Contributor

egg is addressing review feedback...

B1 (blocking): restore TestDisallowedToolsPrivateMode coverage. The
previous fix-commit deleted sandbox/tests/test_entrypoint_settings.py
wholesale to drop TestPostToolUseHook, but that file also covered the
EGG_PRIVATE_MODE -> settings["disallowedTools"] branch in
sandbox/entrypoint.py:1071-1075, which is still live. Re-add the 5
private-mode tests as tests/sandbox/test_entrypoint_settings.py
(adjacent to the existing entrypoint tests). Verified all 5 pass.

N1: wire _BUFFER_OVERFLOW_MARKER from shared/egg_agent/client.py into
the test that asserts the SDK error message round-trips into
result.error, and add a new test in
orchestrator/tests/test_consensus_wrapper.py that asserts the bash
wrapper script's grep substring matches the same constant — so a
rename of the constant fails both tests rather than silently desyncing
the client from the wrapper.

N2: rewrite the comment in client.py to stop claiming "marker
stability is verified by tests." It isn't — the tests verify
round-trip propagation, not stability against a future SDK rename.
Point at #2823 (filed below) for the real fix.

N3: switch consensus_wrapper.py's default AGENT_OUTPUT_LOG path from
the predictable /tmp/agent-output-$$.log to mktemp(1) when the env
var isn't set. tee -a follows symlinks, so the predictable path is a
theoretical pre-create attack in multi-tenant sandbox setups; mktemp
is defense-in-depth (the container is single-tenant in normal
operation). Falls back to the old path only if mktemp is missing
from PATH.

N4: filed #2823 to track pinning the SDK marker against silent drift.
Couldn't fix in-PR — claude-agent-sdk isn't installed in the egg test
environment / CI, so a smoke test that imports the SDK and asserts
the marker substring would skip everywhere it could catch the
regression. Three approaches enumerated in the issue.

N5: extend the run_agent docstring comment in consensus_wrapper.py to
explain why 2>&1 | tee -a interleaves stderr into the captured log
(the SDK's overflow marker is emitted on stderr by logger.error in
claude_agent_sdk.query, and the grep needs to see it in the same
file as stdout), and note the trade-off for future log analysis.

Tests: 137 passed across the touched files (tests/sandbox/
test_entrypoint_settings.py + tests/shared/egg_agent/test_client.py +
orchestrator/tests/test_consensus_wrapper.py).
@james-in-a-box
Copy link
Copy Markdown
Contributor

Re-review response — addressing the 2nd round (a31fe0f...ddd9b01)

Pushed commit cb35179 on top of ddd9b01. Disposition per item:

Blocking

B1. TestDisallowedToolsPrivateMode restorationfixed-in-PR (commit cb351797c).
Restored the 5 private-mode tests as tests/sandbox/test_entrypoint_settings.py (preferred path — adjacent to the existing entrypoint tests in tests/sandbox/, the same directory you suggested). The TestPostToolUseHook class stays gone, matching the production hook removal. Verified the 5 restored tests pass.

Non-blocking

N1. Unused _BUFFER_OVERFLOW_MARKERfixed-in-PR (commit cb351797c).
The constant is now referenced from two tests:

  • tests/shared/egg_agent/test_client.py::test_buffer_overflow_returns_failure_with_marker substitutes the constant into the mocked SDK error and asserts against the constant in result.error.
  • New orchestrator/tests/test_consensus_wrapper.py::test_script_marker_matches_client_constant asserts the bash wrapper grep substring equals _BUFFER_OVERFLOW_MARKER.

A rename of the constant now fails both tests immediately rather than silently desyncing the client from the wrapper.

N2. Misleading comment about marker stabilityfixed-in-PR (commit cb351797c).
Rewrote the comment in shared/egg_agent/client.py to state explicitly that tests verify propagation of the marker into result.error, not stability of the SDK wording across releases. Comment now points at #2823 for the real fix.

N3. /tmp symlink-followable log pathfixed-in-PR (commit cb351797c).
consensus_wrapper.py now uses mktemp -t agent-output.XXXXXX for the default $AGENT_OUTPUT_LOG path when the env var isn't set, with the old /tmp/agent-output-$$.log only as a fallback if mktemp isn't on PATH. The single-tenant assumption was the right call for "shouldn't block merge," but it's a cheap upgrade to remove the predictable path entirely.

N4. SDK marker dependency without CI pindeferred-to #2823.
Filed #2823 before posting this response. Reason for deferral: claude-agent-sdk isn't installed in the egg test environment / CI, so a smoke test that imports the SDK and asserts the marker substring would skip everywhere it could actually catch the regression. The issue enumerates three real options (in-image smoke test, version pin + CHANGELOG-watch, upstream stable error class) and asks for a pick. This matches your suggestion that "a tracking issue would make the deferral explicit."

N5. Document intentional stderr/stdout mergefixed-in-PR (commit cb351797c).
Extended the comment block in run_agent() to explain the merge is intentional (the SDK overflow marker comes via logger.error on stderr, and the grep needs to see it in the same file as stdout), and noted the trade-off for future log analysis.

— Authored by egg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agent SDK message reader has 1MB JSON buffer limit — large tool results crash agents with exit code 255

1 participant