Complete reference for using the beyondbench (BeyondBench) evaluation framework.
```bash
pip install beyondbench

# OpenAI support
pip install beyondbench[openai]

# Google Gemini support
pip install beyondbench[gemini]

# Anthropic Claude support
pip install beyondbench[anthropic]

# All API clients
pip install beyondbench[all-apis]

# vLLM support (requires CUDA)
pip install beyondbench[vllm]

# Full installation (everything)
pip install beyondbench[full]
```

To install from source:

```bash
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .
```

Launch the interactive setup wizard:

```bash
beyondbench
```

The wizard guides you through:
- Selecting backend (API/Local)
- Choosing model
- Configuring API keys
- Selecting task suite
- Setting evaluation parameters
Run model evaluation:

```bash
beyondbench evaluate [OPTIONS]
```

Model selection:

| Option | Description |
|---|---|
| `--model-id MODEL` | Model identifier (e.g., `gpt-4o`, `meta-llama/Llama-3-8B-Instruct`) |

API options:

| Option | Description |
|---|---|
| `--backend BACKEND` | Backend: `vllm` (default, fast), `transformers`, `openai`, or `gemini`. For API models, prefer `--api-provider` |
| `--api-provider PROVIDER` | API backend: `openai`, `gemini`, or `anthropic` |
| `--api-key KEY` | API key (or set via environment variable) |

Task selection:

| Option | Description |
|---|---|
| `--suite SUITE` | Task suite: `easy`, `medium`, `hard`, or `all` (default: `all`) |
| `--tasks TASKS` | Specific tasks (comma-separated): `--tasks sum,sorting,median` |

Generation parameters:

| Option | Default | Description |
|---|---|---|
| `--datapoints N` | 100 | Number of data points per task |
| `--temperature T` | 0.7 | Sampling temperature |
| `--top-p P` | 0.9 | Top-p (nucleus) sampling |
| `--max-tokens N` | 32768 | Maximum tokens to generate (falls back to 8192 on error) |
| `--seed SEED` | None | Random seed for reproducibility |
| `--folds N` | 1 | Number of evaluation folds |

Output options:

| Option | Default | Description |
|---|---|---|
| `--output-dir DIR` | `./beyondbench_results` | Output directory for results |
| `--store-details` | False | Store detailed per-example results |

Reasoning options:

| Option | Description |
|---|---|
| `--reasoning-effort EFFORT` | OpenAI GPT-5 reasoning effort: `minimal`, `low`, `medium`, `high` |
| `--thinking-budget N` | Gemini thinking budget: integer, `0` to disable, `-1` for dynamic |

vLLM options:

| Option | Default | Description |
|---|---|---|
| `--tensor-parallel-size N` | 1 | Number of GPUs for tensor parallelism |
| `--gpu-memory-utilization F` | 0.96 | GPU memory utilization (0.0-1.0) |
| `--trust-remote-code` | False | Trust remote code from HuggingFace |
| `--cuda-device DEVICE` | `cuda:0` | CUDA device for local models |

Advanced options:

| Option | Default | Description |
|---|---|---|
| `--list-sizes TEXT` | `8,16,32` | Comma-separated list sizes for scalable tasks |
| `--range-min N` | -100 | Minimum value for number generation |
| `--range-max N` | 100 | Maximum value for number generation |
| `--batch-size N` | 1 | Batch size for local model inference |
| `--max-retries N` | 3 | Maximum retries for failed operations |
| `--timeout N` | 300 | Timeout for individual operations (seconds) |
| `--log-level LEVEL` | INFO | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
List available tasks:

```bash
beyondbench list-tasks [--suite SUITE] [--format FORMAT]
```

Options:

- `--suite`: Filter by suite (`easy`, `medium`, `hard`, `all`)
- `--format`: Output format (`table`, `json`, `yaml`)
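The same inventory is available programmatically through `TaskRegistry` (used in the Python API examples below); a minimal sketch, assuming `get_tasks_for_suite` returns a list of task names:

```python
from beyondbench import TaskRegistry

# Enumerate the tasks in each suite via the documented registry API
registry = TaskRegistry()
for suite in ("easy", "medium", "hard"):
    tasks = registry.get_tasks_for_suite(suite)
    print(f"{suite} ({len(tasks)} tasks): {', '.join(tasks)}")
```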
Start the API server:

```bash
# Install serve dependencies
pip install beyondbench[serve]

# Start server
beyondbench serve --port 8000

# With auto-reload for development
beyondbench serve --reload
```

Create a config file interactively:

```bash
beyondbench init
beyondbench init --output my_config.yaml
```

Get task details:

```bash
beyondbench info sorting
beyondbench info tower_hanoi
```

View and compare results:

```bash
# List past results
beyondbench results list

# Show details
beyondbench results show ./beyondbench_results/final_results.json

# Compare two runs
beyondbench results compare ./results_a/final_results.json ./results_b/final_results.json
```

```bash
# Run from YAML config
beyondbench run-config beyondbench/configs/default.yaml
beyondbench run-config beyondbench/configs/openai_example.yaml
```
For OpenAI models:

```bash
# Set API key via environment
export OPENAI_API_KEY="sk-..."

# Or pass directly
beyondbench evaluate \
  --model-id gpt-4o \
  --api-provider openai \
  --api-key "sk-..." \
  --suite easy
```

Supported models:

- `gpt-4o` - GPT-4 Optimized
- `gpt-4o-mini` - Smaller GPT-4 variant
- `gpt-5` - Latest GPT-5 (with reasoning)
- `gpt-5-mini` - Smaller GPT-5
- `gpt-5-nano` - Smallest GPT-5

For GPT-5 models with reasoning:

```bash
beyondbench evaluate \
  --model-id gpt-5 \
  --api-provider openai \
  --reasoning-effort high \
  --suite hard
```

For Google Gemini:

```bash
export GEMINI_API_KEY="..."

beyondbench evaluate \
  --model-id gemini-2.5-pro \
  --api-provider gemini \
  --suite medium
```

With thinking configuration:

```bash
beyondbench evaluate \
  --model-id gemini-2.5-pro \
  --api-provider gemini \
  --thinking-budget 16384 \
  --suite hard
```

For Anthropic Claude:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."

beyondbench evaluate \
  --model-id claude-sonnet-4-20250514 \
  --api-provider anthropic \
  --suite all
```

Supported models:

- `claude-sonnet-4-20250514` - Claude Sonnet 4
- `claude-opus-4-20250514` - Claude Opus 4
vLLM provides fast batch inference with GPU parallelism:
```bash
beyondbench evaluate \
  --model-id Qwen/Qwen2.5-3B-Instruct \
  --backend vllm \
  --suite all
```

With multi-GPU:

```bash
beyondbench evaluate \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --backend vllm \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --suite all
```

For CPU or single-GPU inference:

```bash
beyondbench evaluate \
  --model-id Qwen/Qwen2.5-3B-Instruct \
  --backend transformers \
  --suite easy
```

Evaluations can also be run from Python:

```python
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key"
)

# Run evaluation
engine = EvaluationEngine(
    model_handler=model,
    output_dir="./results",
    store_details=True
)

results = engine.run_evaluation(
    suite="easy",
    datapoints=100,
    temperature=0.1,
    max_tokens=32768
)
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
```

With a local model on the vLLM backend:

```python
from beyondbench import ModelHandler, EvaluationEngine

# vLLM backend (fast, batched inference)
model = ModelHandler(
    model_id="Qwen/Qwen2.5-3B-Instruct",
    backend="vllm",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.96
)

engine = EvaluationEngine(
    model_handler=model,
    output_dir="./results"
)

results = engine.run_evaluation(suite="all")
```

Or on the Transformers backend:

```python
from beyondbench import ModelHandler, EvaluationEngine

# Transformers backend
model = ModelHandler(
    model_id="Qwen/Qwen2.5-3B-Instruct",
    backend="transformers",
    trust_remote_code=True
)

engine = EvaluationEngine(
    model_handler=model,
    output_dir="./results"
)

results = engine.run_evaluation(suite="easy")
```

Run a subset of tasks:

```python
# Get specific tasks
registry = TaskRegistry()
tasks = registry.get_tasks_for_suite("easy")

# Run only selected tasks
results = engine.run_evaluation(
    tasks=["sum", "sorting", "median"]
)
```

Inspect the returned results:

```python
# Overall metrics
print(f"Total Tasks: {results['summary']['total_tasks']}")
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
print(f"Total Tokens: {results['summary']['total_tokens']}")

# Per-task metrics
for task_name, metrics in results['task_results'].items():
    if isinstance(metrics, dict) and 'summary' in metrics:
        print(f"{task_name}: {metrics['summary'].get('avg_accuracy', 0):.2%}")
```

Fundamental operations with clear numerical answers:
```bash
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy
```

Tasks:
- Arithmetic: sum, multiplication, subtraction, division, absolute_difference, alternating_sum
- Statistics: mean, median, mode, range
- Counting: odd_count, even_count, count_negative, count_unique, count_multiples, count_perfect_squares, count_palindromic, count_greater_than_previous
- Extrema: find_maximum, find_minimum, second_maximum, index_of_maximum, local_maxima_count
- Ordering: sorting
- Sequences: longest_increasing_subsequence, sum_of_digits, sum_of_max_indices
- Difference: max_adjacent_difference
- Comparison: comparison
Sequence pattern recognition:
```bash
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite medium
```

Tasks:
- fibonacci_sequence (6 variations): Tribonacci, Lucas, Modified recursive
- algebraic_sequence (10 variations): Polynomial, arithmetic, quadratic
- geometric_sequence (10 variations): Exponential, compound, factorial
- prime_sequence (11 variations): Prime gaps, twin primes, Sophie Germain
- complex_pattern (12 variations): Interleaved, conditional, multi-rule
NP-complete and constraint satisfaction problems:
```bash
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite hard
```

Tasks:
- tower_hanoi (6 variations): Classic, bidirectional, cyclic
- n_queens (4 variations): Standard, modified constraints
- graph_coloring (10 variations): Various graph types
- boolean_sat (5 variations): 2-SAT, 3-SAT, Horn clauses
- sudoku_solving (8 variations): Standard, diagonal, irregular
- cryptarithmetic (12 variations): Various equation types
- matrix_chain_multiplication (5 variations): Multiplication ordering
- modular_systems (5 variations): Chinese remainder theorem
- constraint_optimization (5 variations): Knapsack, scheduling
- logic_grid_puzzles (8 variations): Einstein puzzles, zebra
For reproducible runs, fix the seed and use greedy decoding:

```bash
beyondbench evaluate \
  --model-id gpt-4o \
  --api-provider openai \
  --suite all \
  --seed 42 \
  --temperature 0.0
```

Store per-example results for analysis:
```bash
beyondbench evaluate \
  --model-id gpt-4o \
  --api-provider openai \
  --suite easy \
  --store-details \
  --output-dir ./detailed_results
```

Run multiple evaluation folds:

```bash
beyondbench evaluate \
  --model-id gpt-4o \
  --api-provider openai \
  --suite easy \
  --folds 3
```

Increase the number of data points per task for a larger sample:

```bash
beyondbench evaluate \
  --model-id gpt-4o \
  --api-provider openai \
  --suite hard \
  --datapoints 200
```

A full benchmark run against an API model:

```bash
export OPENAI_API_KEY="sk-..."

beyondbench evaluate \
  --model-id gpt-4o \
  --api-provider openai \
  --suite all \
  --datapoints 100 \
  --temperature 0.1 \
  --max-tokens 32768 \
  --seed 42 \
  --store-details \
  --output-dir ./results/gpt-4o
```

A full run against a local model:

```bash
beyondbench evaluate \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --backend vllm \
  --suite all \
  --datapoints 50 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --output-dir ./results/qwen-7b
```

A quick smoke test on a few tasks:

```bash
beyondbench evaluate \
  --model-id gpt-4o-mini \
  --api-provider openai \
  --suite easy \
  --datapoints 10 \
  --tasks sum,sorting,median
```
```bash
# Script to compare models; the API provider must be chosen per model
for model in "gpt-4o" "gpt-4o-mini" "claude-sonnet-4-20250514"; do
  case "$model" in
    claude-*) provider=anthropic ;;
    gemini-*) provider=gemini ;;
    *)        provider=openai ;;
  esac
  beyondbench evaluate \
    --model-id "$model" \
    --api-provider "$provider" \
    --suite easy \
    --datapoints 50 \
    --output-dir "./results/$model"
done
```
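To compare the runs afterwards, the per-run `final_results.json` files can be summarized in a few lines of Python; a minimal sketch, assuming each run writes `final_results.json` into its `--output-dir` with the layout shown in the results-format example below:

```python
import json
from pathlib import Path

# Summarize avg_accuracy for every run directory produced by the loop above
for run_dir in sorted(Path("./results").iterdir()):
    results_file = run_dir / "final_results.json"
    if not results_file.exists():
        continue
    summary = json.loads(results_file.read_text())["summary"]
    print(f"{run_dir.name}: accuracy={summary['avg_accuracy']:.2%}, "
          f"tokens={summary['total_tokens']}")
```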
With Gemini thinking enabled:

```bash
export GEMINI_API_KEY="..."

beyondbench evaluate \
  --model-id gemini-2.5-pro \
  --api-provider gemini \
  --thinking-budget 16384 \
  --suite hard \
  --datapoints 50 \
  --output-dir ./results/gemini-thinking
```

With GPT-5 high reasoning effort:

```bash
export OPENAI_API_KEY="sk-..."

beyondbench evaluate \
  --model-id gpt-5 \
  --api-provider openai \
  --reasoning-effort high \
  --suite hard \
  --datapoints 50 \
  --output-dir ./results/gpt5-high-reasoning
```

Results are saved in JSON format:
```json
{
  "summary": {
    "total_duration": 123.4,
    "total_tasks": 29,
    "completed_tasks": 29,
    "failed_tasks": 0,
    "total_evaluations": 87,
    "successful_evaluations": 80,
    "success_rate": 0.92,
    "avg_accuracy": 0.85,
    "avg_success_rate": 0.95,
    "total_tokens": 150432,
    "evaluations_per_second": 0.71
  },
  "task_results": {
    "sum": { "summary": { "avg_accuracy": 0.98, "success_rate": 1.0 } },
    "sorting": { "summary": { "avg_accuracy": 0.95, "success_rate": 0.97 } }
  },
  "model_info": {
    "model_id": "gpt-4o",
    "backend": "openai"
  },
  "evaluation_config": {
    "suite": "multiple",
    "tasks": ["sum", "sorting"],
    "output_dir": "./beyondbench_results"
  }
}
```
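A saved results file can be post-processed directly; a short sketch that prints the per-task breakdown, assuming the layout shown above:

```python
import json

# Load a saved results file and print the per-task breakdown
with open("./beyondbench_results/final_results.json") as f:
    results = json.load(f)

info = results["model_info"]
print(f"Model: {info['model_id']} ({info['backend']})")
print(f"Overall accuracy: {results['summary']['avg_accuracy']:.2%}")

for task, data in results["task_results"].items():
    task_summary = data.get("summary", {})
    print(f"  {task}: accuracy={task_summary.get('avg_accuracy', 0):.2%}, "
          f"success_rate={task_summary.get('success_rate', 0):.2%}")
```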
If you hit GPU out-of-memory errors, reduce GPU memory utilization:

```bash
--gpu-memory-utilization 0.7
```

Or use smaller batch sizes by reducing datapoints:

```bash
--datapoints 20
```

The framework automatically handles rate limiting with exponential backoff (illustrated in the sketch below). For heavy usage, consider:

- Using a lower `--datapoints` value
- Running tasks sequentially
- Using multiple API keys
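For intuition, exponential backoff waits progressively longer between retries of a failed call. The sketch below is illustrative only, not the framework's internal code, and `call_api` is a hypothetical stand-in for one API request:

```python
import random
import time

def call_with_backoff(call_api, max_retries=3, base_delay=1.0):
    """Retry a callable with exponential backoff plus jitter.

    Illustrative sketch: `call_api` is a hypothetical zero-argument
    callable standing in for a single API request.
    """
    for attempt in range(max_retries + 1):
        try:
            return call_api()
        except Exception:
            if attempt == max_retries:
                raise
            # Sleep 1s, 2s, 4s, ... plus random jitter before retrying
            time.sleep(base_delay * (2 ** attempt) + random.random())
```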
For HuggingFace models, ensure the model ID is correct:

```bash
# Correct format
--model-id Qwen/Qwen2.5-3B-Instruct

# Not correct
--model-id qwen2.5-3b
```

Some models require trusting remote code:

```bash
--trust-remote-code
```
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key |
| `GEMINI_API_KEY` | Google Gemini API key |
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `CUDA_VISIBLE_DEVICES` | GPU selection for local models |
| `HF_TOKEN` | HuggingFace token for gated models |
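When keys are provided via the environment, they can also be read explicitly and passed to the Python API; a small sketch using the documented `ModelHandler` arguments:

```python
import os

from beyondbench import ModelHandler

# Read the documented environment variable and pass the key explicitly
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key=os.environ["OPENAI_API_KEY"],
)
```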
- GitHub Issues: https://github.com/ctrl-gaurav/BeyondBench/issues
- Documentation: https://github.com/ctrl-gaurav/BeyondBench/wiki