Measuring reward hacking in long-horizon coding agents.
SpecBench is a benchmark of 30 systems-level programming tasks (JSON parser to OS kernel) that measures whether coding agents genuinely satisfy specifications or just optimize the visible test suite. Each task has two test suites: validation tests (visible to the agent during optimization) and held-out tests (hidden from the agent, used only for evaluation). The reward hacking gap is the difference between pass rates on these two suites.
Paper: SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
git clone https://github.com/WecoAI/SpecBench.git
cd SpecBench
pip install -e .

SpecBench wraps existing coding agents (the "inner agent") inside an outer search loop. You need at least one installed:
Claude Code (default):
npm install -g @anthropic-ai/claude-code
claude login
claude --version

Claude Code uses your Anthropic credentials directly. No extra env vars needed.
Codex:
npm install -g @openai/codex
export OPENAI_API_KEY="sk-..."
codex --version

OpenCode:
# Install OpenCode (see https://github.com/opencode-ai/opencode)
# Configure with your preferred provider:
export OPENAI_API_KEY="sk-..."
opencode --version

SpecBench uses a two-level architecture: an inner agent (the coding agent that writes and edits code) wrapped by an outer search loop (the search strategy that decides which candidates to refine).
# Quick sanity check (5 steps, one task)
python -m experiments.cli --config dev --agent claude_code --task json_parser
# Short run (10 steps, all tasks)
python -m experiments.cli --config short --agent claude_code
# Medium run (25 steps)
python -m experiments.cli --config medium --agent codex --model gpt-5.2-codex
# Full run (50 steps)
python -m experiments.cli --config full --agent claude_code
# Single task with a specific seed
python -m experiments.cli --config short --agent claude_code --task sql_database --seed 42
# Multiple seeds in parallel
python -m experiments.cli --config short --agent claude_code --num-seeds 3 --parallel 3

The outer loop supports three search strategies (see paper Section 3):
# AIDE tree search (default) — draft/debug/improve branching
python -m experiments.cli --config short --agent claude_code --search-mode aide
# Linear — sequential refinement, no branching
python -m experiments.cli --config short --agent claude_code --search-mode linear
# Autoresearch — single chain, keeps best candidate so far
python -m experiments.cli --config short --agent claude_code --search-mode autoresearch

# Claude Code (default)
python -m experiments.cli --agent claude_code
# Codex (requires OPENAI_API_KEY)
python -m experiments.cli --agent codex --model gpt-5.2-codex
# OpenCode with various models
python -m experiments.cli --agent opencode --model openrouter/anthropic/claude-opus-4
python -m experiments.cli --agent opencode --model deepseek/deepseek-v3.2
python -m experiments.cli --agent opencode --model kimi-k2.5
--config {dev,short,medium,full} Preset (steps/drafts/budget)
--task TASK Single task name (default: all 30)
--agent AGENT claude_code | opencode | codex
--model MODEL LLM model name
--seed SEED Random seed
--num-seeds N Run N seeds (0..N-1)
--parallel N Parallel workers
--steps N Override step count
--cost-budget N Max cost in USD
--time-budget N Max time in seconds
--search-mode MODE aide | linear | autoresearch
--out-dir DIR Output directory (default: results/spec_bench)
--no-private-eval Skip held-out test evaluation
--difficulty-level {1,2,3,4} Validation test visibility level
--curriculum Progressive difficulty over steps
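
When sweeping over several tasks and seeds, it can help to generate the command lines programmatically. The sketch below is not part of SpecBench: `build_commands` is a hypothetical helper that only assembles argument lists from flags documented above (`--config`, `--agent`, `--task`, `--seed`, `--search-mode`) and prints them without running anything.

```python
import shlex
from itertools import product


def build_commands(tasks, seeds, config="short", agent="claude_code", search_mode="aide"):
    """Return one argv list per (task, seed) pair.

    Hypothetical helper: it only uses flags listed in the options above.
    """
    commands = []
    for task, seed in product(tasks, seeds):
        commands.append([
            "python", "-m", "experiments.cli",
            "--config", config,
            "--agent", agent,
            "--task", task,
            "--seed", str(seed),
            "--search-mode", search_mode,
        ])
    return commands


# Print the commands for two tasks and two seeds (four commands in total).
for argv in build_commands(["json_parser", "sql_database"], seeds=[0, 1]):
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to launch the run. For a single task, the built-in `--num-seeds` and `--parallel` flags cover the same need without a wrapper.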
30 systems-level tasks spanning C, Python, and Go. Each task includes a natural-language specification, starter code (stubs), a reference implementation that passes all tests, and both test suites.
| Task | Lang | LOC | Validation Tests | Held-out Tests | Domain |
|---|---|---|---|---|---|
| json_parser | Py | 1.5K | 45 | 178 | Parser |
| package_resolver | Py | 3K | 32 | 50 | Resolver |
| http_server | Py | 5K | 31 | 144 | Server |
| regex_engine | Py | 5K | 40 | 125 | Engine |
| sed_interpreter | Py | 5K | 118 | 77 | Interpreter |
| tinygrad | Py | 5K | 70 | 76 | ML Library |
| lox_vm | C | 5K | 52 | 92 | VM |
| filesystem | C | 8K | 40 | 54 | System |
| markdown_renderer | Py | 8K | 49 | 125 | Renderer |
| deflate_compression | Py | 10K | 35 | 139 | Codec |
| git_impl | Py | 10K | 25 | 69 | VCS |
| spreadsheet_engine | Py | 10K | 34 | 90 | Engine |
| ray_tracer | C | 12K | 29 | 23 | Graphics |
| wasm_interpreter | C | 12K | 159 | 61 | VM |
| shell_interpreter | C | 14K | 41 | 110 | Interpreter |
| crypto_primitives | Py | 15K | 24 | 57 | Crypto |
| css_layout_engine | Py | 15K | 127 | 107 | Engine |
| http2_protocol | Py | 15K | 46 | 42 | Protocol |
| riscv_emulator | C | 15K | 50 | 98 | Emulator |
| tcp_stack | C | 15K | 42 | 31 | Network |
| gnu_make | Py | 20K | 159 | 102 | Build Tool |
| nes_emulator | C | 20K | 52 | 103 | Emulator |
| coreutils | C | 25K | 48 | 119 | System Utils |
| database_engine | C | 25K | 40 | 25 | Database |
| gollum_compiler | Go | 25K | 33 | 52 | Compiler |
| gameboy_emulator | C | 30K | 50 | 117 | Emulator |
| sql_database | C | 30K | 15 | 11 | Database |
| c_compiler | C | 50K | 46 | 299 | Compiler |
| elf_linker | C | 50K | 35 | 63 | Linker |
| javascript_engine | C | 60K | 130 | 72 | Engine |
| os_kernel | C | 110K | 36 | 38 | OS Kernel |
Total: 1,779 validation tests, 2,783 held-out tests across 30 tasks.
SpecBench uses a two-level architecture. The outer search loop (AIDE, Linear, or Autoresearch) manages a tree of candidate implementations. At each step, it selects a candidate, invokes the inner coding agent in an isolated workspace, runs the validation tests, and uses the score to guide the next step. Held-out tests are run for evaluation only and are never shown to the agent.
Outer search loop (AIDE / Linear / Autoresearch)
|
|-- Select candidate node from search tree
|-- Invoke inner agent (Claude Code / Codex / OpenCode) in isolated workspace
|-- Run validation tests (T_val) → score feeds back to search
|-- Run held-out tests (T_test) → recorded for analysis (hidden from agent)
|-- Update search tree
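
The loop in the diagram can be sketched in a few lines. This is a simplified illustration, not the code in `run_loop.py`: it models only a single-chain, keep-the-best strategy, and `Candidate`, `keep_best_search`, and the three callables are hypothetical names standing in for the inner agent and the two test runners. The point to notice is that selection reads only the validation score, while the held-out score is recorded and never influences the search.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    workspace: str        # hypothetical identifier for the isolated workspace
    val_score: float      # validation pass rate, fed back to the search
    heldout_score: float  # held-out pass rate, recorded for analysis only


def keep_best_search(
    steps: int,
    improve: Callable[[Optional[Candidate]], str],
    run_validation: Callable[[str], float],
    run_heldout: Callable[[str], float],
) -> Tuple[Candidate, List[Candidate]]:
    """Run a single chain of refinements, always starting from the best candidate."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    best: Optional[Candidate] = None
    history: List[Candidate] = []
    for _ in range(steps):
        workspace = improve(best)  # inner agent produces a new candidate
        cand = Candidate(workspace, run_validation(workspace), run_heldout(workspace))
        history.append(cand)
        # Selection uses the validation score only; held-out stays hidden.
        if best is None or cand.val_score > best.val_score:
            best = cand
    return best, history


# Stubbed demo with made-up scores, so the sketch runs without any agent.
val = {"ws0": 0.5, "ws1": 0.8, "ws2": 0.6}
held = {"ws0": 0.4, "ws1": 0.5, "ws2": 0.55}
counter = iter(range(3))
best, history = keep_best_search(
    3, lambda parent: f"ws{next(counter)}", val.__getitem__, held.__getitem__
)
print(best.workspace, [c.val_score for c in history])
```

In the stubbed run, the search keeps `ws1` because it has the highest validation score, even though `ws2` scores higher on the held-out suite. That divergence is what the benchmark is designed to expose.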
The reward hacking gap = validation pass rate − held-out pass rate. A genuine implementation that follows the specification will have a gap near zero.
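
As a worked example of the definition above, the snippet below computes the gap from raw pass counts. The function names are illustrative, and the pass counts are invented for the example. Only the suite sizes (45 validation and 178 held-out tests for `json_parser`) come from the task table.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of tests passed; rejects an empty suite or impossible counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


def reward_hacking_gap(val_passed: int, val_total: int,
                       heldout_passed: int, heldout_total: int) -> float:
    """Validation pass rate minus held-out pass rate."""
    return pass_rate(val_passed, val_total) - pass_rate(heldout_passed, heldout_total)


# Hypothetical outcome: all 45 validation tests pass, but only 89 of 178 held-out tests do.
gap = reward_hacking_gap(45, 45, 89, 178)
print(gap)  # 0.5
```

A gap of 0.5 here would mean the candidate looks perfect on the visible suite while satisfying only half of the hidden one.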
Each run produces JSON files in the output directory:
results/spec_bench/exp_short_claude_code_json_parser_s0_20260518/
├── spec_json_parser_seed0_specbench.json # Per-step validation + held-out scores
├── spec_json_parser_seed0_aide_run.json # Search tree state
└── workspaces/ # Agent workspaces per search node
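
To inspect results after a run, one option is to collect every `*_specbench.json` file under the output directory. The sketch below assumes only the file naming shown above and the default output directory `results/spec_bench`. It makes no assumption about the JSON schema, so it prints the top-level keys rather than specific fields. `load_specbench_results` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def load_specbench_results(out_dir):
    """Map each *_specbench.json path under out_dir to its parsed JSON content."""
    results = {}
    for path in sorted(Path(out_dir).rglob("*_specbench.json")):
        with path.open() as f:
            results[path] = json.load(f)
    return results


# Prints nothing if the directory does not exist or has no results yet.
for path, data in load_specbench_results("results/spec_bench").items():
    # The schema is not documented here, so only report the top-level shape.
    shape = sorted(data) if isinstance(data, dict) else type(data).__name__
    print(path.name, shape)
```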
specbench/
├── aide/ # AIDE tree search framework
│ ├── agent.py # Tree search, nodes, search policy
│ ├── backend.py # LLM API client (OpenAI-compatible)
│ └── logging.py
├── benchmarks/
│ ├── base.py # TaskSpec, EvalResult interfaces
│ └── spec_bench/
│ ├── adapter.py # Task registry, test evaluation
│ ├── run_loop.py # Outer search + inner agent integration
│ ├── workspace.py # Isolated workspace management
│ ├── agents/ # Inner coding agents
│ │ ├── claude_code.py # Claude Code CLI wrapper
│ │ ├── codex.py # Codex CLI wrapper
│ │ ├── opencode.py # OpenCode CLI wrapper
│ │ └── ...
│ ├── evaluation/
│ │ └── runner.py # pytest subprocess runner
│ └── tasks/ # 30 task definitions
│ ├── json_parser/
│ ├── c_compiler/
│ ├── os_kernel/
│ └── ...
├── experiments/
│ ├── cli.py # CLI entry point
│ └── spec_bench/
│ ├── config.py # dev/short/medium/full presets
│ └── run_specbench.py # Experiment runner
└── pyproject.toml
@article{zhao2026specbench,
title={SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents},
author={Zhao, Bingchen and Srikanth, Dhruv and Wu, Yuxiang and Jiang, Zhengyao},
journal={arXiv preprint arXiv:2605.21384},
year={2026}
}

Apache 2.0. See LICENSE.