
Add deliberate-eval: framework for measuring planning impact on agent outcomes #11

Open
tavian-dev wants to merge 9 commits into main from feat/deliberate-eval

Conversation


tavian-dev (Owner) commented Apr 4, 2026

Summary

  • deliberate-eval: A standalone eval framework that measures whether AI planning tools improve coding agent outcomes using paired comparison (with-planning vs without-planning)
  • 10 curated pilot tasks from real GitHub issues across humanize, boltons, more-itertools, attrs, isort, click, black, and flask
  • Pilot results: Planning shows a +10% pass-rate lift, 19% fewer tokens, 27% faster runs, and 15% lower cost

What's included

Eval framework (deliberate_eval/)

  • Data models (Task, Run, Trajectory) with JSONL serialization
  • Agent adapters for Claude Code headless and Codex CLI
  • Treatment prompt templates (class_a: no planning, class_b: plan-first)
  • Runner with per-run isolation (fresh clone/worktree, venv, cleanup)
  • Planning-specific metrics (Planning ROI, Waste Reduction Ratio, per-task paired deltas)
  • CLI with run, report, and validate subcommands
  • 151 tests
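As a rough illustration of the data-model layer, a `Run` record with JSONL round-tripping might look like this (field names are illustrative, not the framework's actual schema; the defensive copy in `from_dict` mirrors the side-effect fix noted in the commit log):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Run:
    # Illustrative fields only; the real model carries more metadata.
    task_id: str
    treatment: str     # "class_a" (no planning) or "class_b" (plan-first)
    seed: int
    passed: bool
    fresh_tokens: int
    cost_usd: float

    def to_jsonl(self) -> str:
        # One JSON object per line, suitable for appending to a .jsonl file.
        return json.dumps(asdict(self))

    @classmethod
    def from_dict(cls, d: dict) -> "Run":
        d = dict(d)  # copy before mutating to avoid caller side effects
        return cls(**d)

run = Run("click-3019", "class_b", 1, True, 14994, 0.26)
restored = Run.from_dict(json.loads(run.to_jsonl()))
```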

Pilot task set (tasks/)

  • 10 tasks: 2 trivial, 4 medium, 3 hard, 1 unsolvable (black fmt:off)
  • SWE-bench-style test patches for automated pass/fail verification
  • All tasks validated: test patches fail at parent ref, proving bug exists

Results (results/)

  • 20 runs across 10 tasks (1 seed × 2 treatments)
  • 6-run variance analysis on click-3019 (3 seeds × 2 treatments)
  • Key finding: planning is more reliable (90% vs 80% pass), more efficient, and prevents expensive thrashing on failed attempts

Key metrics (10 tasks, 1 seed)

| Metric              | Baseline | Planning |
| ------------------- | -------- | -------- |
| Pass Rate           | 80%      | 90%      |
| Median Fresh Tokens | 18,563   | 14,994   |
| Median Cost         | $0.30    | $0.26    |
| Median Duration     | 107s     | 78s      |

Test plan

  • 151 tests passing across both packages
  • Pilot eval completed end-to-end
  • 3-seed variance analysis on differentiating task
  • CI green on all Python versions (3.10, 3.11, 3.12)

tavian-dev and others added 9 commits April 4, 2026 04:24
Initial implementation of the eval framework for measuring whether
planning tools improve AI agent outcomes.

Phase 1 (Core):
- Data models: Task, Trajectory, Run with JSONL serialization
- Task loading/validation from JSONL files
- Agent adapters: Claude Code (headless -p --output-format json),
  Codex CLI (exec --dangerously-bypass-approvals-and-sandbox)

Phase 2 (Metrics):
- Planning ROI = (pass rate lift) / (extra planning tokens)
- Waste Reduction Ratio = (baseline waste - planned waste) / baseline
- Treatment stats: pass rate, median tokens/cost/duration
- Treatment prompt templates (Class A: no planning, Class B: brief)
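The two metrics above can be sketched as plain functions. This is a hedged reconstruction from the formulas as stated here; the actual implementation (including the later symmetric ROI variant that replaced the 999.0 cap) may differ:

```python
def planning_roi(pass_rate_lift: float, extra_planning_tokens: int) -> float:
    # Pass-rate lift per 1K extra planning tokens, per the stated formula.
    return pass_rate_lift / (extra_planning_tokens / 1000)

def waste_reduction_ratio(baseline_waste: float, planned_waste: float) -> float:
    # Fraction of baseline waste (spend on failed attempts) that planning
    # eliminated; 1.0 means all baseline waste was removed.
    return (baseline_waste - planned_waste) / baseline_waste
```

For example, a +10% pass-rate lift bought with 500 extra planning tokens would score an ROI of 0.2 under this sketch.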

23 new tests, all passing. 122 total tests across both packages.

Campaign artifacts at .deliberate/active/eval-framework/:
- research.md: Literature review (SWE-bench, SWE-Effi, AgentDiet,
  LLM-as-Judge, agent-eval-harness)
- spec.md: 4 user stories, 10 requirements
- plan.md: 4 phases, 6 technical decisions
- tasks.md: 29 tasks across 4 phases
Both Sonnet and Codex independently flagged critical methodology issues.
Fixed the most impactful ones:

Methodology fixes:
- Per-task paired aggregation: compute per-task pass rate deltas then
  average, instead of aggregating across all tasks (prevents task
  difficulty from dominating signal)
- Treatment prompts now isolate planning as a concept, not the deliberate
  tool itself. Class A is neutral (no anti-planning instruction); Class B
  asks the agent to plan free-hand (not invoke deliberate)
- Removed arbitrary 999.0 ROI cap — now uses symmetric formula
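The per-task paired aggregation described above could be sketched like this (a minimal version assuming run records with `task_id`, `treatment`, and `passed` fields; the real code surely handles more edge cases):

```python
from statistics import mean

def per_task_pass_delta(runs: list[dict]) -> float:
    # Group runs by task, compute the within-task pass-rate delta
    # (planning minus baseline), then average the deltas, so that one
    # task's difficulty cannot dominate the aggregate signal.
    by_task: dict[str, dict[str, list[bool]]] = {}
    for r in runs:
        by_task.setdefault(r["task_id"], {}).setdefault(r["treatment"], []).append(r["passed"])
    deltas = [
        mean(groups["class_b"]) - mean(groups["class_a"])
        for groups in by_task.values()
        if "class_a" in groups and "class_b" in groups  # only paired tasks count
    ]
    return mean(deltas)
```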

Code fixes:
- run_claude: fix token counting (input_tokens + cache_read + cache_creation)
- run_codex: fix token parsing (was grabbing any digit line from stdout,
  now specifically finds "tokens used\nNNNN" pattern)
- Run.from_dict: copy dict before mutating to avoid caller side effects

Tests:
- Added test_per_task_aggregation to verify paired comparison
- Updated test_planning_saves_tokens for new ROI formula
- 123 tests passing

Issues deferred for later:
- Treatment compliance verification (detect if agent actually planned)
- Order randomization across treatments
- Seed count / statistical power analysis
Complete eval framework implementation:
- Runner with worktree isolation, treatment injection, agent invocation,
  test execution, and trajectory capture
- Treatment prompt renderer (class_a/class_b templates)
- CLI with run, report, and validate subcommands
- Comparison report with per-task breakdown
- 10 curated pilot tasks (3 trivial, 4 medium, 3 hard) from
  humanize, boltons, more-itertools, attrs, isort, click, black, flask
- Test-only patches for SWE-bench-style validation
- Task helper (fetch_github_issue, task_from_issue, append_task)
- 151 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test patch modifies test_echo_writing_to_standard_error, not test_prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --dangerously-skip-permissions to claude headless mode (required
  for tool use in -p mode)
- Pass venv PATH/VIRTUAL_ENV to agent subprocess so it uses the right
  Python/pip instead of wasting turns fighting the environment
- Commit test patches before agent runs so git diff only shows agent work
- Track fresh input tokens separately from cache reads (cache was 95%
  of reported tokens, inflating the metric)
- Add median_turns to TreatmentStats and report output
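The fresh-token accounting change above can be sketched as follows (usage field names are assumed from Claude's JSON output format, not confirmed by this PR):

```python
def fresh_tokens(usage: dict) -> int:
    # "Fresh" input excludes cache reads, which were ~95% of the raw
    # count and inflated the token metric.
    return usage.get("input_tokens", 0) + usage.get("cache_creation_input_tokens", 0)

def total_input_tokens(usage: dict) -> int:
    # Full input count per the earlier fix: input + cache read + cache creation.
    return fresh_tokens(usage) + usage.get("cache_read_input_tokens", 0)
```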

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results across 5 tasks (1 trivial, 3 medium, 1 hard):
- Planning: 5/5 pass (100%), Baseline: 4/5 pass (80%)
- Planning ROI: +0.14 (pass rate lift per 1K extra tokens)
- Planning used fewer median tokens (-1,467) and was faster (71s vs 96s)
- click-3019 was the differentiator: baseline timed out, planning solved in 55s
- On tasks both solve, planning is marginally more efficient

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 seeds × 2 treatments on click-3019 (prompt_suffix bug):
- Baseline: 2/3 pass (66.7%), planning: 3/3 pass (100%)
- Planning ROI: +0.20, Waste Reduction: 100%
- Baseline seed=1 thrashed for 51 turns/$1.36 and failed;
  planning solved same seed in 16 turns/$0.41
- Planning is more reliable but not always faster

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 tasks (1 trivial, 4 medium, 1 hard), 1 seed each:
- Planning: 5/6 pass (83.3%), Baseline: 4/6 pass (66.7%)
- Planning ROI: +0.047, Waste Reduction: 100%
- Planning uses fewer median tokens (-3,569) and is faster (78s vs 107s)
- click-3019: planning solved, baseline failed (key differentiator)
- black-4841: neither solved (genuinely hard, 2-file fix in Black)
- 3-seed variance on click-3019: baseline 2/3 pass, planning 3/3 pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 10 pilot tasks evaluated (2 trivial, 4 medium, 3 hard + 1 unsolvable):
- Planning: 9/10 pass (90%), Baseline: 8/10 pass (80%)
- Planning uses 19% fewer fresh tokens (14,994 vs 18,563 median)
- Planning is 27% faster (78s vs 107s median wall time)
- Planning is 15% cheaper ($0.26 vs $0.30 median)
- Waste Reduction: 100% — all baseline waste eliminated
- click-3019 key differentiator: baseline fails, planning passes
- black-4841 defeats both treatments
- Total pilot cost: $9.50 across 20 runs

Also fixes boltons-389 setup to use pytest<8 for compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>