
# Trace-Bench Documentation

Trace-Bench is a benchmarking framework for evaluating LLM-based optimization algorithms built on OpenTrace. It provides a reproducible harness that pairs tasks (benchmark problems) with trainers (optimization algorithms), runs them across seeds, and produces structured artifacts for comparison.

## How to Use This Documentation

Start with Overview to learn the core concepts and how a run is structured. If you want to execute experiments, read Running Experiments and Config Reference next. For UI workflows, jump to UI Guide. For extension points, follow Adding a Task, Adding an Agent, Adding a Trainer, and Adding a Benchmark.

## Validation Evidence

See Validation Evidence for the exact commands and transcripts used to validate the repository after the layout changes.

## Quick Start

```sh
# Trace-Bench requires Trace/OpenTrace (`opto`) to be installed.
# Option A: pip install trace-opt
# Option B: editable sibling checkout: pip install -e ../OpenTrace

pip install -e .
trace-bench list-tasks
trace-bench run --config configs/smoke.yaml --runs-dir runs
```

See the root README for full install options.
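A run config pairs tasks with trainers and sweeps them across seeds. As a rough sketch of what such a YAML could look like — every key below (`tasks`, `trainers`, `seeds`, `runs_dir`) is an illustrative assumption, not the actual schema; consult Config Reference for the real field names:

```yaml
# Illustrative sketch only; see Config Reference for the real schema.
# All keys here are assumed names, not Trace-Bench's documented fields.
tasks:
  - gsm8k
trainers:
  - opto_trainer
seeds: [0, 1, 2]   # each (task, trainer) pair runs once per seed
runs_dir: runs     # structured artifacts are written here
```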

## Table of Contents

| Page | Description |
| --- | --- |
| Overview | What Trace-Bench is, concept glossary, and pointers to intro notebooks |
| Running Experiments | CLI reference, fair comparisons, reading results |
| Agents and Tasks | Technical distinction between agents/models and tasks |
| Adding an Agent | How to optimize a new agent with Trace-Bench |
| Adding a Task | How to contribute a new benchmark task |
| Adding a Trainer | How trainers are discovered and registered |
| Adding a Benchmark | How to integrate an external benchmark suite |
| UI Guide | Gradio UI tabs, workflows, and screenshots |
| MLflow Integration | Enabling MLflow, what is logged |
| Task Inventory | All available tasks by suite (LLM4AD, VeriBench, examples, internal) |
| Config Reference | YAML schema, matrix expansion, resume modes, output artifacts |
| Result Analysis | Reading results via CLI, UI, and Python |
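Since runs produce structured artifacts per seed, the typical analysis step is aggregating scores per (task, trainer) pair. The sketch below shows that pattern; the `results.json` filename and its keys are assumptions for illustration, not Trace-Bench's actual artifact schema (see Result Analysis for the supported APIs):

```python
# Hedged sketch: aggregate per-seed scores from a runs directory.
# Assumed layout: runs/<run_id>/results.json with "task", "trainer",
# "seed", and "final_score" keys — illustrative, not the real schema.
import json
import statistics
import tempfile
from pathlib import Path


def summarize(runs_dir: Path) -> dict:
    """Group final scores by (task, trainer) and average over seeds."""
    groups: dict[tuple, list[float]] = {}
    for result_file in sorted(runs_dir.glob("*/results.json")):
        record = json.loads(result_file.read_text())
        key = (record["task"], record["trainer"])
        groups.setdefault(key, []).append(record["final_score"])
    return {key: statistics.mean(scores) for key, scores in groups.items()}


# Build a toy runs directory to demonstrate the aggregation.
runs = Path(tempfile.mkdtemp())
for seed, score in enumerate([0.5, 0.75, 1.0]):
    run_dir = runs / f"demo-seed{seed}"
    run_dir.mkdir()
    (run_dir / "results.json").write_text(json.dumps(
        {"task": "demo_task", "trainer": "demo_trainer",
         "seed": seed, "final_score": score}))

summary = summarize(runs)
print(summary)  # {('demo_task', 'demo_trainer'): 0.75}
```

Averaging over seeds like this is what makes trainer comparisons fair: each trainer is judged on the same task set and seed set, not on a single lucky run.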

## Notebooks

| Notebook | Topic |
| --- | --- |
| `notebooks/01_quick_start.ipynb` | First run in under 5 minutes |
| `notebooks/02_api_walkthrough.ipynb` | Python API and config objects |
| `notebooks/03_task_coverage.ipynb` | Exploring available tasks |
| `notebooks/04_gradio_ui.ipynb` | Interactive results dashboard |
| `notebooks/05_full_benchmark.ipynb` | Running a full benchmark matrix |
| `notebooks/06_multiobjective_convex.ipynb` | Multi-objective optimization (convex) |
| `notebooks/07_multiobjective_bbeh.ipynb` | Multi-objective optimization (BBEH) |
| `notebooks/08_multiobjective_gsm8k.ipynb` | Multi-objective optimization (GSM8K) |

## Project Layout

```
trace_bench/    Python package (CLI, runner, registry, config)
benchmarks/     Benchmark suites (LLM4AD, KernelBench, VeriBench)
configs/        YAML run configurations
notebooks/      Jupyter notebooks with worked examples
runs/           Default output directory (created on first run)
```