Trace-Bench provides five commands via the trace-bench entry point.
List all discoverable tasks.
trace-bench list-tasks --root benchmarks/LLM4AD/benchmark_tasks
trace-bench list-tasks --root benchmarks/LLM4AD/benchmark_tasks --bench llm4ad
trace-bench list-tasks --root benchmarks/LLM4AD/benchmark_tasks --bench veribench
trace-bench list-tasks --root benchmarks/LLM4AD/benchmark_tasks --bench trace_examples

The --bench flag filters by suite. Accepted values: llm4ad, trace_examples, internal, veribench. Combine with commas: --bench llm4ad,internal.
List available optimization algorithms.
trace-bench list-trainers
trace-bench list-trainers --all # include unavailable trainers

Output is tab-separated: trainer_id\tavailable|unavailable.
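The tab-separated listing is easy to consume from a script. The following is a minimal sketch, assuming one trainer per line in the trainer_id\tavailable|unavailable form described above; the function name and the second trainer in the sample are hypothetical and not part of Trace-Bench.

```python
def parse_trainer_listing(text: str) -> dict[str, bool]:
    """Map trainer_id -> True when the listing marks the trainer as available."""
    trainers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first tab: left is the id, right is the status.
        trainer_id, _, status = line.partition("\t")
        trainers[trainer_id] = status.strip() == "available"
    return trainers


# Hypothetical listing for illustration; real output depends on your install.
sample = "PrioritySearch\tavailable\nSomeOtherTrainer\tunavailable\n"
print(parse_trainer_listing(sample))
```

In practice the text would come from capturing the stdout of trace-bench list-trainers --all.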
Dry-run a config: checks that all trainers exist and all tasks can build their bundles.
trace-bench validate --config configs/smoke.yaml --root benchmarks/LLM4AD/benchmark_tasks
trace-bench validate --config configs/smoke.yaml --root benchmarks/LLM4AD/benchmark_tasks --strict

With --strict, validation also checks that every task has trainable parameters, verifies trainer kwargs against the allow-list, expands the full job matrix, and writes a manifest to --runs-dir.
Execute a benchmark configuration.
trace-bench run --config configs/smoke.yaml --root benchmarks/LLM4AD/benchmark_tasks
trace-bench run --config configs/m2_coverage.yaml --root benchmarks/LLM4AD/benchmark_tasks \
--runs-dir results/m2 --max-workers 4
trace-bench run --config configs/smoke_real.yaml --root benchmarks/LLM4AD/benchmark_tasks \
--job-timeout 300 --logger ConsoleLogger

Key flags:
| Flag | Default | Description |
|---|---|---|
| --runs-dir | runs | Output directory for run artifacts |
| --max-workers | 1 | Parallel job threads |
| --force | off | Re-run all jobs even if results exist |
| --resume | auto | Resume mode: auto, failed, none |
| --job-timeout | mode-dependent when unset | Per-job timeout in seconds (CLI defaults to 0 in stub mode, 600 in real mode) |
| --logger | config | Override logger for all trainers (ConsoleLogger, none, etc.) |
Launch the Gradio results dashboard.
trace-bench ui --runs-dir runs
trace-bench ui --runs-dir runs --share --port 7860

A benchmark run is a matrix: tasks x trainers x params_variants x seeds. To isolate one variable, change only that axis.
| Comparison | What to change | What to hold fixed |
|---|---|---|
| LLM provider/model | llm.model / llm.base_url in config | Same tasks, trainers, seeds |
| Trainer algorithm | trainers[].id | Same tasks, LLM, seeds |
| Optimizer params | trainers[].params_variants | Same trainer, tasks, LLM, seeds |
| Task set | tasks[] | Same trainers, LLM, seeds |
| Seed | seeds[] | Everything else identical |
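To get a feel for how quickly the matrix grows, the expansion can be sketched in a few lines. This is an illustration only, not Trace-Bench's own expansion code: it assumes params_variants are scoped to each trainer and that a trainer without variants runs once with empty params. The authoritative expansion is the one recorded in meta/manifest.json.

```python
from itertools import product


def expand_jobs(config: dict) -> list[tuple]:
    """Return one (task_id, trainer_id, params, seed) tuple per job."""
    jobs = []
    for trainer in config["trainers"]:
        # Assumption: no params_variants means a single run with empty params.
        variants = trainer.get("params_variants") or [{}]
        for task, params, seed in product(config["tasks"], variants, config["seeds"]):
            jobs.append((task["id"], trainer["id"], params, seed))
    return jobs


config = {
    "seeds": [123, 124],
    "tasks": [{"id": "internal:numeric_param"}],
    "trainers": [
        {
            "id": "PrioritySearch",
            "params_variants": [
                {"ps_steps": 1, "ps_batches": 1},
                {"ps_steps": 3, "ps_batches": 1},
            ],
        },
    ],
}

# 1 task x 1 trainer x 2 variants x 2 seeds = 4 jobs
print(len(expand_jobs(config)))
```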
run_id values include a timestamp plus a hash of the config snapshot and git SHA. Identical configs do not guarantee identical run_id values. Use config_hash in meta/manifest.json to detect duplicate configs.
After a run starts, Trace-Bench writes a resolved config snapshot to:
runs/<run_id>/meta/config.snapshot.yaml
The expanded job matrix, resolved trainer kwargs, and config_hash live in:
runs/<run_id>/meta/manifest.json
This is the authoritative view of what actually executed.
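Since run_id does not identify a config, duplicate runs can be found by grouping run directories on config_hash. The sketch below assumes config_hash is a top-level key in meta/manifest.json (the exact layout of the manifest is not specified here) and that each run lives in its own subdirectory of the runs directory; both function names are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_runs_by_config_hash(runs_dir: str) -> dict[str, list[str]]:
    """Map config_hash -> run directory names that share it."""
    groups = defaultdict(list)
    for manifest_path in sorted(Path(runs_dir).glob("*/meta/manifest.json")):
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        # Assumption: config_hash is stored as a top-level key.
        config_hash = manifest.get("config_hash")
        if config_hash is not None:
            run_id = manifest_path.parent.parent.name  # runs/<run_id>/meta/manifest.json
            groups[config_hash].append(run_id)
    return dict(groups)


def duplicate_configs(runs_dir: str) -> dict[str, list[str]]:
    """Keep only the hashes that appear in more than one run."""
    groups = group_runs_by_config_hash(runs_dir)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}
```

Calling duplicate_configs("runs") would then return the hashes that were executed more than once, together with the matching run directories.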
Create two configs that differ only in the llm block, then compare results:
# base.yaml
mode: real
seeds: [123]
tasks:
  - id: internal:numeric_param
trainers:
  - id: PrioritySearch
    params_variants:
      - ps_steps: 1
        ps_batches: 1
llm:
  provider: openrouter
  base_url: https://openrouter.ai/api/v1
  model: openrouter/x-ai/grok-4.1-fast

# variant.yaml (only model differs)
llm:
  provider: openrouter
  base_url: https://openrouter.ai/api/v1
  model: openrouter/openai/gpt-4o-mini

Keep tasks/trainers/seeds identical, then compare results.csv and leaderboard.csv.
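One way to compare the two runs is to join their results.csv files on the job identity and look at the score difference. This is a sketch under assumptions: the column names task, trainer, seed, and score are guesses at the header names (check the header of your own results.csv), and the scores in the sample are made up.

```python
import csv
import io


def load_scores(csv_file) -> dict[tuple, float]:
    """Read a results.csv-like file into {(task, trainer, seed): score}."""
    scores = {}
    for row in csv.DictReader(csv_file):
        if not row["score"]:
            continue  # skip jobs that produced no score
        key = (row["task"], row["trainer"], row["seed"])
        scores[key] = float(row["score"])
    return scores


def score_deltas(base_file, variant_file) -> dict[tuple, float]:
    """Return variant - base for every job present in both runs."""
    base, variant = load_scores(base_file), load_scores(variant_file)
    return {key: variant[key] - base[key] for key in base.keys() & variant.keys()}


# Made-up scores for illustration only.
base_csv = io.StringIO(
    "task,trainer,seed,score\ninternal:numeric_param,PrioritySearch,123,0.5\n"
)
variant_csv = io.StringIO(
    "task,trainer,seed,score\ninternal:numeric_param,PrioritySearch,123,0.75\n"
)
print(score_deltas(base_csv, variant_csv))
```

With real runs, pass open("runs/<run_id>/results.csv", newline="") for each of the two run directories instead of the in-memory samples.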
Only change one axis at a time:
- Trainer: change trainers[].id, keep tasks + llm + seeds the same.
- Optimizer params: change one field inside trainers[].params_variants.
- Task set: change only tasks[].
- Seed: change only seeds[].
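The one-axis rule can be checked mechanically before launching a comparison. The helper below is a hypothetical sketch, not part of Trace-Bench: it takes two configs already loaded as dictionaries and reports which top-level axes differ. It treats everything under trainers as one axis, so it does not distinguish a trainer swap from a params_variants change.

```python
AXES = ("tasks", "trainers", "seeds", "llm")


def changed_axes(base: dict, variant: dict) -> list[str]:
    """List the top-level axes whose values differ between two configs."""
    return [axis for axis in AXES if base.get(axis) != variant.get(axis)]


base = {
    "seeds": [123],
    "tasks": [{"id": "internal:numeric_param"}],
    "trainers": [{"id": "PrioritySearch",
                  "params_variants": [{"ps_steps": 1, "ps_batches": 1}]}],
}
# Copy the base config and change only the seed.
variant = {**base, "seeds": [124]}

diff = changed_axes(base, variant)
print(diff)
if len(diff) != 1:
    print("warning: comparison does not isolate a single axis")
```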
Example: Only change optimizer params:
# base.yaml
trainers:
  - id: PrioritySearch
    params_variants:
      - ps_steps: 1
        ps_batches: 1

# variant.yaml (only ps_steps differs)
trainers:
  - id: PrioritySearch
    params_variants:
      - ps_steps: 3
        ps_batches: 1

Example: Only change seed:
# base.yaml
seeds: [123]

# variant.yaml
seeds: [124]

After trace-bench run completes, the run directory contains:
runs/<run_id>/
  meta/
    config.snapshot.yaml    # resolved config snapshot
    manifest.json           # full job matrix + resolved kwargs + config_hash
    env.json                # Python version, platform, packages
    git.json                # repo commit info (if in a git repo)
    files_index.json        # index of all produced files
  jobs/<job_id>/
    job_meta.json           # job metadata (task, trainer, seed, params)
    results.json            # score, feedback, timing, token usage
    events.jsonl            # timestamped event log
    stdout.log              # captured stdout (tail shown in UI)
    artifacts/
      initial_state.yaml    # model state before training
      best_state.yaml       # best model state observed
      final_state.yaml      # model state after training
      state_history.jsonl   # step-by-step state changes
  results.csv               # one row per job, all scores
  summary.json              # aggregate counts + token totals
  leaderboard.csv           # best score per task
- run_id: timestamp + hash. Find it in summary.json or the directory name.
- config_hash: deterministic hash of the config snapshot, stored in meta/manifest.json for dedupe.
- job_id: unique per (task, trainer, params, seed). Found in manifest.json and results.csv.
- results.csv: the primary results table. One row per job with columns for task, trainer, seed, score, duration, tokens.
- summary.json: aggregate counts and token totals across jobs.
- leaderboard.csv: best score per task (excludes failed jobs).
- manifest.json: the full expanded matrix with resolved kwargs -- useful for debugging which parameters were actually used.
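To make the relationship between results.csv and leaderboard.csv concrete, the following sketch recomputes a best-score-per-task table from job rows. It is not the Trace-Bench implementation. It assumes columns named task, score, and status, that failed jobs are marked with status "failed" or an empty score, and that a higher score is better; all of these are assumptions to verify against your own files, and the sample rows are made up.

```python
import csv
import io


def best_score_per_task(csv_file, higher_is_better: bool = True) -> dict[str, float]:
    """Return {task: best score}, skipping failed or unscored jobs."""
    best = {}
    for row in csv.DictReader(csv_file):
        # Assumption: failures carry status "failed" or leave score empty.
        if row.get("status") == "failed" or not row.get("score"):
            continue
        task, score = row["task"], float(row["score"])
        if task not in best:
            best[task] = score
        elif higher_is_better:
            best[task] = max(best[task], score)
        else:
            best[task] = min(best[task], score)
    return best


# Made-up rows for illustration only.
sample = io.StringIO(
    "task,trainer,seed,score,status\n"
    "internal:numeric_param,PrioritySearch,123,0.5,ok\n"
    "internal:numeric_param,PrioritySearch,124,0.75,ok\n"
    "internal:numeric_param,PrioritySearch,125,0.9,failed\n"
)
print(best_score_per_task(sample))
```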
- notebooks/03_task_coverage.ipynb -- explore which tasks are available and their properties
- notebooks/05_full_benchmark.ipynb -- running and analyzing a full benchmark matrix