bug: inflated eval produces tool_use/tool_result mismatch errors through gateway #424

@BYK

Problem

When running the live eval with --inflate 400000, the Lore baseline's conversation flow triggers hundreds of 400 errors from the Anthropic API, each reporting that tool_use ids were found without tool_result blocks. This corrupts the eval results: Lore scored 3.1 on CM-1 at 400K (down from 3.5 pre-#417), while compaction scored 4.5.

Evidence

From the eval run on 2026-05-20 after merging #423:

[lore] upstream error: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"messages.1: `tool_use` ids were found without `tool_result` blocks
immediately after: toolu_eval_000010. Each `tool_use` block must have a corresponding
`tool_result` block in the next message."}}

This error repeated for hundreds of turns during the inflated session replay through the gateway. The non-inflated eval (same code) scored Lore at 4.8/5.0, which confirms that the code changes work correctly when the content fits in context.
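
To find which messages trip this rule before a request leaves the gateway, the API's pairing check can be approximated locally. The following is a minimal sketch, not the gateway's actual code: the `ContentBlock` and `Message` shapes are simplified assumptions modeled on the Anthropic Messages API, and `findOrphanedToolUses` is a hypothetical helper name.

```typescript
// Simplified message shapes (assumption: modeled on Anthropic Messages API content blocks).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

interface Orphan {
  messageIndex: number;
  toolUseId: string;
}

// Hypothetical helper: lists every tool_use id that has no matching
// tool_result block in the message immediately after it.
function findOrphanedToolUses(messages: Message[]): Orphan[] {
  const orphans: Orphan[] = [];
  messages.forEach((msg, i) => {
    if (msg.role !== "assistant" || typeof msg.content === "string") return;

    // Collect the tool_result ids carried by the next message, if it is a user turn.
    const next = messages[i + 1];
    const answered = new Set<string>();
    if (next && next.role === "user" && typeof next.content !== "string") {
      for (const block of next.content) {
        if (block.type === "tool_result") answered.add(block.tool_use_id);
      }
    }

    for (const block of msg.content) {
      if (block.type === "tool_use" && !answered.has(block.id)) {
        orphans.push({ messageIndex: i, toolUseId: block.id });
      }
    }
  });
  return orphans;
}

// Illustrative conversation: the tool_result for toolu_eval_000010 has been stripped.
const sample: Message[] = [
  { role: "user", content: "filler turn" },
  {
    role: "assistant",
    content: [{ type: "tool_use", id: "toolu_eval_000010", name: "read", input: {} }],
  },
  { role: "user", content: "next filler turn" },
];

console.log(findOrphanedToolUses(sample));
// [ { messageIndex: 1, toolUseId: "toolu_eval_000010" } ]
```

Logging the result of such a check just before the upstream call would show whether the orphaned ids come from the inflated filler itself or only appear after context management has run.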

Root Cause Hypothesis

The inflated filler turns likely contain tool_use/tool_result pairs that, when the gradient context manager strips or reorders messages at higher layers, break the Anthropic API's requirement that every tool_use block has an immediately following tool_result block. The gateway's message sanitization may not handle this edge case for inflated/synthetic conversations.
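
If the hypothesis holds, one possible mitigation is a repair pass that runs after the context manager and before the upstream request. The sketch below is an assumption about how such a pass could look, not the gateway's existing sanitization: `Block`, `ChatMessage`, and `repairToolPairs` are hypothetical names, and content is assumed to be normalized to block arrays. It keeps a tool_use/tool_result pair only when both halves sit in adjacent messages, and drops everything else.

```typescript
// Simplified shapes (assumption: content already normalized to block arrays).
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface ChatMessage {
  role: "user" | "assistant";
  content: Block[];
}

// Ids of all tool_use blocks in a message (empty set if the message is missing).
function toolUseIds(msg?: ChatMessage): Set<string> {
  const ids = new Set<string>();
  for (const b of msg?.content ?? []) if (b.type === "tool_use") ids.add(b.id);
  return ids;
}

// Ids referenced by all tool_result blocks in a message.
function toolResultIds(msg?: ChatMessage): Set<string> {
  const ids = new Set<string>();
  for (const b of msg?.content ?? []) if (b.type === "tool_result") ids.add(b.tool_use_id);
  return ids;
}

// Hypothetical repair pass: drops tool_use blocks that the next user message
// does not answer, and tool_result blocks that the previous assistant message
// did not request. Messages left with no content are removed.
function repairToolPairs(messages: ChatMessage[]): ChatMessage[] {
  const repaired = messages.map((msg, i) => {
    const prev = messages[i - 1];
    const next = messages[i + 1];
    const answered = next?.role === "user" ? toolResultIds(next) : new Set<string>();
    const requested = prev?.role === "assistant" ? toolUseIds(prev) : new Set<string>();

    const content = msg.content.filter((b) => {
      if (b.type === "tool_use") return msg.role === "assistant" && answered.has(b.id);
      if (b.type === "tool_result") return msg.role === "user" && requested.has(b.id);
      return true; // text blocks are always kept
    });
    return { ...msg, content };
  });
  return repaired.filter((m) => m.content.length > 0);
}
```

Dropping blocks loses information, so a real fix might instead keep the pair together when stripping, or insert a placeholder tool_result. This sketch also does not recover pairs that were reordered rather than removed, and removing emptied messages can leave two same-role messages next to each other, which the caller would need to merge or tolerate.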

Impact

Cannot measure the eval impact of #423 (distillation detail retention improvements) at 400K tokens until this is fixed.

Reproduction

ANTHROPIC_API_KEY=... bun packages/core/eval/run.ts --mode live --dimensions context --inflate 400000
