Problem
When running the live eval with --inflate 400000, the Lore baseline's conversation flow triggers hundreds of HTTP 400 errors from the Anthropic API with the message "tool_use ids were found without tool_result blocks". This corrupts the eval results: Lore scored 3.1 on CM-1 at 400K (down from 3.5 pre-#417), while compaction scored 4.5.
Evidence
From the eval run on 2026-05-20 after merging #423:
[lore] upstream error: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"messages.1: `tool_use` ids were found without `tool_result` blocks
immediately after: toolu_eval_000010. Each `tool_use` block must have a corresponding
`tool_result` block in the next message."}}
This error repeated for hundreds of turns during the inflated session replay through the gateway. The non-inflated eval (same code) scored Lore at 4.8/5.0 — confirming the code changes work correctly when content fits in context.
Root Cause Hypothesis
The inflated filler turns likely contain tool_use/tool_result pairs. When the gradient context manager strips or reorders messages at higher layers, it can separate a pair, violating the Anthropic API's requirement that every tool_use block in an assistant message be answered by a tool_result block in the immediately following message. The gateway's message sanitization may not handle this edge case for inflated/synthetic conversations.
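To check this hypothesis before a request reaches the API, the pairing rule can be reproduced locally. The sketch below is illustrative only: `findOrphanedToolUses`, the `Message`/`Block` types, and the tool name `read` are hypothetical and not taken from the Lore codebase. The block shapes are simplified from the Anthropic Messages API content blocks.

```typescript
// Simplified content-block shapes (assumption: only the fields needed here).
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message {
  role: "user" | "assistant";
  content: string | Block[];
}

interface Orphan {
  messageIndex: number; // index of the assistant message holding the tool_use
  toolUseId: string;
}

// Hypothetical helper: returns every tool_use id that is not answered by a
// tool_result block in the immediately following user message.
function findOrphanedToolUses(messages: Message[]): Orphan[] {
  const orphans: Orphan[] = [];
  messages.forEach((msg, i) => {
    if (msg.role !== "assistant" || typeof msg.content === "string") return;

    // Collect the tool_use ids that the next message actually answers.
    const next = messages[i + 1];
    const answered = new Set<string>();
    if (next && next.role === "user" && typeof next.content !== "string") {
      for (const block of next.content) {
        if (block.type === "tool_result") answered.add(block.tool_use_id);
      }
    }

    // Any tool_use in this message without a matching answer is an orphan.
    for (const block of msg.content) {
      if (block.type === "tool_use" && !answered.has(block.id)) {
        orphans.push({ messageIndex: i, toolUseId: block.id });
      }
    }
  });
  return orphans;
}

// Example: the tool_result that should follow message 1 has been stripped.
const stripped: Message[] = [
  { role: "user", content: "Read the config file." },
  {
    role: "assistant",
    content: [{ type: "tool_use", id: "toolu_eval_000010", name: "read", input: {} }],
  },
  { role: "user", content: "Continue." },
];

console.log(findOrphanedToolUses(stripped));
// → [ { messageIndex: 1, toolUseId: "toolu_eval_000010" } ]
```

Running a check like this on the message list after the context manager has trimmed it would show whether the stripping step is what orphans the tool_use blocks, independent of the API round trip.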
Impact
Cannot measure the eval impact of #423 (distillation detail retention improvements) at 400K tokens until this is fixed.
Reproduction
ANTHROPIC_API_KEY=... bun packages/core/eval/run.ts --mode live --dimensions context --inflate 400000
Related