SpecBench

Measuring reward hacking in long-horizon coding agents.

SpecBench is a benchmark of 30 systems-level programming tasks (JSON parser to OS kernel) that measures whether coding agents genuinely satisfy specifications or just optimize the visible test suite. Each task has two test suites: validation tests (visible to the agent during optimization) and held-out tests (hidden from the agent, used only for evaluation). The reward hacking gap is the difference between pass rates on these two suites.

Paper: SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Setup

1. Install SpecBench

git clone https://github.com/WecoAI/SpecBench.git
cd SpecBench
pip install -e .

2. Install an inner agent

SpecBench wraps existing coding agents (the "inner agent") inside an outer search loop. You need at least one installed:

Claude Code (default):

npm install -g @anthropic-ai/claude-code
claude login
claude --version

Claude Code uses your Anthropic credentials directly. No extra env vars needed.

Codex:

npm install -g @openai/codex
export OPENAI_API_KEY="sk-..."
codex --version

OpenCode:

# Install OpenCode (see https://github.com/opencode-ai/opencode)
# Configure with your preferred provider:
export OPENAI_API_KEY="sk-..."
opencode --version

Running Experiments

SpecBench uses a two-level architecture: an inner agent (the coding agent that writes and edits code) wrapped by an outer search loop (the search strategy that decides which candidates to refine).

# Quick sanity check (5 steps, one task)
python -m experiments.cli --config dev --agent claude_code --task json_parser

# Short run (10 steps, all tasks)
python -m experiments.cli --config short --agent claude_code

# Medium run (25 steps)
python -m experiments.cli --config medium --agent codex --model gpt-5.2-codex

# Full run (50 steps)
python -m experiments.cli --config full --agent claude_code

# Single task with a specific seed
python -m experiments.cli --config short --agent claude_code --task sql_database --seed 42

# Multiple seeds in parallel
python -m experiments.cli --config short --agent claude_code --num-seeds 3 --parallel 3

Search Strategies

The outer loop supports three search strategies (see paper Section 3):

# AIDE tree search (default) — draft/debug/improve branching
python -m experiments.cli --config short --agent claude_code --search-mode aide

# Linear — sequential refinement, no branching
python -m experiments.cli --config short --agent claude_code --search-mode linear

# Autoresearch — single chain, keeps best candidate so far
python -m experiments.cli --config short --agent claude_code --search-mode autoresearch

Using different inner agents

# Claude Code (default)
python -m experiments.cli --agent claude_code

# Codex (requires OPENAI_API_KEY)
python -m experiments.cli --agent codex --model gpt-5.2-codex

# OpenCode with various models
python -m experiments.cli --agent opencode --model openrouter/anthropic/claude-opus-4
python -m experiments.cli --agent opencode --model deepseek/deepseek-v3.2
python -m experiments.cli --agent opencode --model kimi-k2.5

All CLI Options

--config {dev,short,medium,full}    Preset (steps/drafts/budget)
--task TASK                         Single task name (default: all 30)
--agent AGENT                       claude_code | opencode | codex
--model MODEL                       LLM model name
--seed SEED                         Random seed
--num-seeds N                       Run N seeds (0..N-1)
--parallel N                        Parallel workers
--steps N                           Override step count
--cost-budget N                     Max cost in USD
--time-budget N                     Max time in seconds
--search-mode MODE                  aide | linear | autoresearch
--out-dir DIR                       Output directory (default: results/spec_bench)
--no-private-eval                   Skip held-out test evaluation
--difficulty-level {1,2,3,4}        Validation test visibility level
--curriculum                        Progressive difficulty over steps

Tasks

30 systems-level tasks spanning C, Python, and Go. Each task includes a natural-language specification, starter code (stubs), a reference implementation that passes all tests, and both test suites.

Task	Lang	LOC	Validation Tests	Held-out Tests	Domain
json_parser	Py	1.5K	45	178	Parser
package_resolver	Py	3K	32	50	Resolver
http_server	Py	5K	31	144	Server
regex_engine	Py	5K	40	125	Engine
sed_interpreter	Py	5K	118	77	Interpreter
tinygrad	Py	5K	70	76	ML Library
lox_vm	C	5K	52	92	VM
filesystem	C	8K	40	54	System
markdown_renderer	Py	8K	49	125	Renderer
deflate_compression	Py	10K	35	139	Codec
git_impl	Py	10K	25	69	VCS
spreadsheet_engine	Py	10K	34	90	Engine
ray_tracer	C	12K	29	23	Graphics
wasm_interpreter	C	12K	159	61	VM
shell_interpreter	C	14K	41	110	Interpreter
crypto_primitives	Py	15K	24	57	Crypto
css_layout_engine	Py	15K	127	107	Engine
http2_protocol	Py	15K	46	42	Protocol
riscv_emulator	C	15K	50	98	Emulator
tcp_stack	C	15K	42	31	Network
gnu_make	Py	20K	159	102	Build Tool
nes_emulator	C	20K	52	103	Emulator
coreutils	C	25K	48	119	System Utils
database_engine	C	25K	40	25	Database
gollum_compiler	Go	25K	33	52	Compiler
gameboy_emulator	C	30K	50	117	Emulator
sql_database	C	30K	15	11	Database
c_compiler	C	50K	46	299	Compiler
elf_linker	C	50K	35	63	Linker
javascript_engine	C	60K	130	72	Engine
os_kernel	C	110K	36	38	OS Kernel

Total: 1,779 validation tests, 2,783 held-out tests across 30 tasks.

How It Works

SpecBench uses a two-level architecture. The outer search loop (AIDE, Linear, or Autoresearch) manages a tree of candidate implementations. At each step, it selects a candidate, invokes the inner coding agent in an isolated workspace, runs the validation tests, and uses the score to guide the next step. Held-out tests are run for evaluation only and are never shown to the agent.

Outer search loop (AIDE / Linear / Autoresearch)
  |
  |-- Select candidate node from search tree
  |-- Invoke inner agent (Claude Code / Codex / OpenCode) in isolated workspace
  |-- Run validation tests (T_val) → score feeds back to search
  |-- Run held-out tests (T_test) → recorded for analysis (hidden from agent)
  |-- Update search tree

The reward hacking gap = validation pass rate − held-out pass rate. A genuine implementation that follows the specification will have a gap near zero.

Output

Each run produces JSON files in the output directory:

results/spec_bench/exp_short_claude_code_json_parser_s0_20260518/
├── spec_json_parser_seed0_specbench.json   # Per-step validation + held-out scores
├── spec_json_parser_seed0_aide_run.json    # Search tree state
└── workspaces/                             # Agent workspaces per search node

Project Structure

specbench/
├── aide/                           # AIDE tree search framework
│   ├── agent.py                    # Tree search, nodes, search policy
│   ├── backend.py                  # LLM API client (OpenAI-compatible)
│   └── logging.py
├── benchmarks/
│   ├── base.py                     # TaskSpec, EvalResult interfaces
│   └── spec_bench/
│       ├── adapter.py              # Task registry, test evaluation
│       ├── run_loop.py             # Outer search + inner agent integration
│       ├── workspace.py            # Isolated workspace management
│       ├── agents/                 # Inner coding agents
│       │   ├── claude_code.py      # Claude Code CLI wrapper
│       │   ├── codex.py            # Codex CLI wrapper
│       │   ├── opencode.py         # OpenCode CLI wrapper
│       │   └── ...
│       ├── evaluation/
│       │   └── runner.py           # pytest subprocess runner
│       └── tasks/                  # 30 task definitions
│           ├── json_parser/
│           ├── c_compiler/
│           ├── os_kernel/
│           └── ...
├── experiments/
│   ├── cli.py                      # CLI entry point
│   └── spec_bench/
│       ├── config.py               # dev/short/medium/full presets
│       └── run_specbench.py        # Experiment runner
└── pyproject.toml

Citation

@article{zhao2026specbench,
  title={SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents},
  author={Zhao, Bingchen and Srikanth, Dhruv and Wu, Yuxiang and Jiang, Zhengyao},
  journal={arXiv preprint arXiv:2605.21384},
  year={2026}
}

License

Apache 2.0. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
aide		aide
benchmarks		benchmarks
examples/hash_table_hack		examples/hash_table_hack
experiments		experiments
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SpecBench

Setup

1. Install SpecBench

2. Install an inner agent

Running Experiments

Search Strategies

Using different inner agents

All CLI Options

Tasks

How It Works

Output

Project Structure

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SpecBench

Setup

1. Install SpecBench

2. Install an inner agent

Running Experiments

Search Strategies

Using different inner agents

All CLI Options

Tasks

How It Works

Output

Project Structure

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages