bug: inflated eval produces tool_use/tool_result mismatch errors through gateway #424

## Problem

Running the live eval with `--inflate 400000` causes the Lore baseline's conversation flow to trigger hundreds of HTTP 400 errors from the Anthropic API (`tool_use ids were found without tool_result blocks`). This corrupts the eval results: Lore scored **3.1** on CM-1 at 400K (down from 3.5 pre-#417), while compaction scored 4.5.

## Evidence

From the eval run on 2026-05-20 after merging #423:

```
[lore] upstream error: 400 {"type":"error","error":{"type":"invalid_request_error",
"message":"messages.1: `tool_use` ids were found without `tool_result` blocks
immediately after: toolu_eval_000010. Each `tool_use` block must have a corresponding
`tool_result` block in the next message."}}
```

This error repeated for hundreds of turns during the inflated session replay through the gateway. The non-inflated eval (same code) scored Lore at 4.8/5.0, which indicates the code changes work correctly when the content fits in context.

## Root Cause Hypothesis

The inflated filler turns likely contain `tool_use`/`tool_result` pairs. When the gradient context manager strips or reorders messages at higher layers, it can separate such a pair, which violates the Anthropic API's requirement that every `tool_use` block be followed immediately by a matching `tool_result` block in the next message. The gateway's message sanitization may not handle this edge case for inflated/synthetic conversations.
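
One way to test this hypothesis is to check the pairing invariant on the outgoing message list before a request leaves the gateway. The following is a minimal sketch, not the gateway's actual sanitization code: the `Message` and `ContentBlock` types are simplified assumptions modeled on the Anthropic Messages API, and `findOrphanedToolUses` and the `read_file` tool name are hypothetical.

```typescript
// Simplified, assumed message shapes (not the gateway's real types).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

interface Orphan {
  messageIndex: number; // index of the assistant message holding the tool_use
  toolUseId: string;
}

// Hypothetical helper: returns every tool_use id that has no matching
// tool_result block in the message immediately after it.
function findOrphanedToolUses(messages: Message[]): Orphan[] {
  const orphans: Orphan[] = [];
  messages.forEach((msg, i) => {
    if (msg.role !== "assistant" || typeof msg.content === "string") return;

    // Collect the tool_use ids answered by the next message, if any.
    const answered = new Set<string>();
    const next = messages[i + 1];
    if (next && next.role === "user" && typeof next.content !== "string") {
      for (const block of next.content) {
        if (block.type === "tool_result") answered.add(block.tool_use_id);
      }
    }

    for (const block of msg.content) {
      if (block.type === "tool_use" && !answered.has(block.id)) {
        orphans.push({ messageIndex: i, toolUseId: block.id });
      }
    }
  });
  return orphans;
}

// Example: the tool_result for toolu_eval_000010 was stripped from message 2.
const stripped: Message[] = [
  { role: "user", content: "filler turn" },
  {
    role: "assistant",
    content: [{ type: "tool_use", id: "toolu_eval_000010", name: "read_file", input: {} }],
  },
  { role: "user", content: "next filler turn" },
];

console.log(findOrphanedToolUses(stripped));
// prints one orphan: messageIndex 1, toolUseId "toolu_eval_000010"
```

Logging the orphans just before the upstream call would show whether the mismatch is introduced by the context manager or is already present in the inflated input.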

## Impact

The eval impact of #423 (distillation detail retention improvements) cannot be measured at 400K tokens until this is fixed.

## Reproduction

```bash
ANTHROPIC_API_KEY=... bun packages/core/eval/run.ts --mode live --dimensions context --inflate 400000
```

## Related

- #417 — the distillation retention issue this eval was meant to validate
- #423 — the merged fix (budget increase, tool output visibility, recent segment protection)
