Skip to content

AgentOpt/Trace-Bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

110 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Trace-Bench

A benchmarking framework for evaluating LLM-as-optimizer algorithms, built on OpenTrace.

Trace-Bench provides a CLI, Gradio UI, and notebook workflows to run reproducible experiments across multiple benchmarks (LLM4AD, VeriBench, KernelBench) with fair comparisons between trainers, optimizers, and LLM backends.

Full documentation: docs/

Install

git clone https://github.com/AgentOpt/Trace-Bench.git
cd Trace-Bench

# Trace-Bench depends on Trace/OpenTrace (`opto`).
# Install one of the following before running commands:

# Option A (preferred if published in your environment):
pip install trace-opt

# Option B (editable sibling checkout):
git clone https://github.com/AgentOpt/Trace.git ../OpenTrace
pip install -e ../OpenTrace

# Install Trace-Bench
pip install -e .

Quick Start

# List available tasks
trace-bench list-tasks

# Run a smoke test (stub mode, no API keys needed)
trace-bench run --config configs/smoke.yaml --runs-dir runs

# Launch the Gradio UI
trace-bench ui --runs-dir runs

Real-Mode Setup

To run benchmarks with actual LLM calls, configure an API provider:

export OPENROUTER_API_KEY="sk-or-v1-..."
export OPENAI_API_KEY="$OPENROUTER_API_KEY"
export OPENAI_API_BASE="https://openrouter.ai/api/v1"
export TRACE_DEFAULT_LLM_BACKEND="LiteLLM"
export TRACE_LITELLM_MODEL="openrouter/x-ai/grok-4.1-fast"

Then run with a real-mode config:

trace-bench run --config configs/smoke_real.yaml --runs-dir runs

Repository Layout

trace_bench/           # Python package
  trainers/            # External / user-customized trainers only.
                       # Built-in trainers (PrioritySearch, GEPA-*) live in
                       # OpenTrace and are discovered automatically.
benchmarks/
  LLM4AD/                  # 65 algorithm design tasks
  KernelBench/             # CUDA kernel optimization
  Veribench/               # Lean 4 formal verification
configs/                   # YAML experiment configs
notebooks/                 # Jupyter notebooks (Colab-ready)
docs/                      # Full documentation
tests/                     # Test suite (m0/ m1/ m2/ m3/)
runs/                      # Output directory (gitignored)

Benchmarks

LLM4AD (65 tasks)

Algorithm design tasks from LLM4AD: optimization (basic, constructive, CO-Bench), machine learning, and scientific discovery.

VeriBench (~140 tasks)

Formal verification tasks translating Python to Lean 4. Integrated via trace_bench/veribench_adapter.py with dual discovery: local entrypoint module or HuggingFace dataset fallback.

KernelBench

CUDA kernel optimization with remote evaluation server. See benchmarks/KernelBench/README.md.

CLI Reference

trace-bench list-tasks      List discoverable tasks
trace-bench list-trainers    List available trainers
trace-bench validate         Validate a config file
trace-bench run              Run a benchmark experiment
trace-bench ui               Launch the Gradio UI

See docs/running-experiments.md for full CLI usage.

Dependencies

Core:

  • Python >= 3.9, Graphviz (system package)
  • graphviz, pyyaml, pytest, litellm, aiohttp, tensorboard, tensorboardX, scikit-learn

Full coverage (all benchmarks):

  • pandas, datasets, sympy, pymoo, gymnasium, scipy, networkx

Tests

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q

License

MIT

About

Benchmark to evaluate LLM as optimizers

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors