A Framework for Evaluating and Developing Next-Generation Unified Agents
- CocoaBench Dataset — Benchmark tasks included directly in this repo: `cocoabench-v1.0/` (stable) and `cocoabench-head/` (community contributions, continuously merged)
- CocoaAgent Framework — Model-agnostic agent executor that equips agents with general tools (browser, terminal, file operations, code interpreter) via AIO Sandbox
Note
cocoabench-head/ contains community contributions that are continuously merged. For reproducible evaluation, use a stable release like v1.0.
- Python 3.13+
- Docker & Docker Compose
- uv (recommended) or pip
```bash
# Browse v1.0 tasks (already in repo)
ls cocoabench-v1.0/

# Decrypt tasks (if encrypted)
python decrypt.py --tasks-dir cocoabench-v1.0/
```
Note
v0.1 is still available as a historical archive: https://cocoabench.github.io/assets/data/cocoa-bench-v0.1.zip
Each task directory contains:
| File | Purpose |
|---|---|
| `task.yaml` | Task instruction to give your agent |
| `test.py` | Evaluation script with a `test(result)` function |
| `Dockerfile` | Task environment setup |
| `docker-compose.yaml` | Docker config |
| `assets/` | Additional files for the task (optional) |
Evaluation: Each `test.py` exports a `test(result)` function. If you're using your own agent, you typically just need to pass `{"task_result": "<agent's final answer>"}`. See Evaluation for details.
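If you run your own agent outside the CocoaAgent framework, you can still reuse a task's `test.py` directly. The sketch below shows one way to do that, assuming only that your agent produced a final answer string; the loading helper and the task directory name are illustrative, not part of the framework API.

```python
# Minimal sketch: evaluate your own agent's answer against a task's test.py.
# The loading approach and the task directory below are illustrative only.
import importlib.util
from pathlib import Path

def run_task_test(task_dir: str, agent_answer: str) -> dict:
    """Load <task_dir>/test.py and call its test() with a minimal result dict."""
    test_path = Path(task_dir) / "test.py"
    spec = importlib.util.spec_from_file_location("task_test", test_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Most test.py scripts only need the agent's final answer.
    return module.test({"task_result": agent_answer})

verdict = run_task_test("cocoabench-v1.0/some-task", "42")  # hypothetical task dir
print(verdict["passed"], verdict["feedback"])
```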
```bash
# 1. Install
git clone https://github.com/cocoabench/cocoa-agent.git && cd cocoa-agent
uv sync  # or: pip install -r requirements.txt

# 2. Choose tasks
# See included example tasks: cocoabench-example-tasks/
# Or download full benchmark dataset: follow Option A above

# 3. Configure
cp configs/default_gpt.json configs/my-config.json
# Edit my-config.json: set your API key

# 4. Run with example tasks
python inference_main.py \
    --config configs/my-config.json \
    --tasks-dir cocoabench-example-tasks/ \
    --output-dir results/

# Or run with full v1.0 dataset (decryption is handled automatically):
# python inference_main.py \
#     --config configs/my-config.json \
#     --tasks-dir cocoabench-v1.0/ \
#     --output-dir results/
```

To run tasks in parallel across multiple workers (each with its own Docker sandbox port):
```bash
python parallel_inference.py \
    --config <config_path> \
    --tasks-dir cocoabench-v1.0/ \
    --output-dir <results_dir> \
    --workers 8
```

| Arg | Required | Description |
|---|---|---|
| `--config` | Yes | Model config file path (JSON) |
| `--tasks-dir` | Yes | Directory containing task subdirectories |
| `--output-dir` | Yes | Final output directory for result JSONs |
| `--workers` | No | Number of parallel workers (default: 4) |
| `--base-port` | No | Starting Docker sandbox port (default: 8084); auto-scans for available ports |
| `--model` | No | Override model name from config |
| `--run-all` | No | Run all tasks, including previously passed ones. Default: skip passed, rerun failed/missing only |
| `--work-dir` | No | Temp directory for worker configs/logs (default: `.parallel_run`) |
By default, tasks that already have a successful result in `--output-dir` are skipped, so you can rerun the same command to retry only failed or missing tasks. Use `--run-all` to force a rerun of everything.
Output:
- `output-dir/{task_name}.json` — result file per task
- `output-dir/statistics.txt` — pass rate, failure list, and API cost summary
- `work-dir/` — per-session logs and intermediate files for debugging
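Beyond reading `statistics.txt`, you can post-process the per-task result JSONs yourself. The sketch below assumes each result file records the evaluation verdict under a top-level `passed` key; inspect one of your own result files first and adjust the key if the schema differs.

```python
# Sketch: summarize per-task result JSONs from --output-dir.
# Assumes each result file exposes the verdict under a top-level "passed" key;
# check one of your own result files and adjust the key if it differs.
import json
from pathlib import Path

def summarize(output_dir: str) -> None:
    result_files = sorted(Path(output_dir).glob("*.json"))
    failed = []
    for path in result_files:
        data = json.loads(path.read_text())
        if not data.get("passed"):
            failed.append(path.stem)
    print(f"{len(result_files) - len(failed)}/{len(result_files)} tasks passed")
    for name in failed:
        print(f"  FAILED: {name}")

summarize("results/")
```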
Edit your config file to customize the agent:
```json
{
  "controller": {
    "type": "llm",
    "args": {
      "model": "gpt-5.2",
      "api_key": "sk-...",
      "base_url": ""
    }
  },
  "sandbox": {
    "docker_port": 8080,
    "max_iterations": 30
  }
}
```

| Key | Description |
|---|---|
| `controller.args.model` | Model name (e.g., `gpt-5.2`) |
| `controller.args.api_key` | Your API key |
| `controller.args.base_url` | Custom endpoint for local models (optional) |
| `sandbox.docker_port` | Port for sandbox container (default: 8080) |
| `sandbox.max_iterations` | Max agent iterations per task (default: 30) |
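If you serve a model locally behind an OpenAI-compatible endpoint, point `base_url` at it. The snippet below writes such a config programmatically; the model name, endpoint URL, and output filename are placeholders, and only the keys documented above are set.

```python
# Sketch: generate a config for a locally served, OpenAI-compatible model.
# "my-local-model", the endpoint URL, and the output filename are placeholders.
import json

config = {
    "controller": {
        "type": "llm",
        "args": {
            "model": "my-local-model",               # placeholder model name
            "api_key": "EMPTY",                      # many local servers ignore the key
            "base_url": "http://localhost:8000/v1",  # placeholder local endpoint
        },
    },
    "sandbox": {"docker_port": 8080, "max_iterations": 30},
}

with open("configs/local-model.json", "w") as f:
    json.dump(config, f, indent=2)
```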
Each task includes a `test.py` that runs on the host machine after the agent completes. The framework calls `test(result)` with the full execution result and expects a pass/fail verdict.
```python
def test(result: dict) -> dict:
    """Evaluate task results after execution.

    Args:
        result: Complete execution result containing:
            - task_result: Agent's final answer
            - conversation: Full message history with controller
            - execution_trace: All actions and their outputs
            - status: Task status ("success" or "failed")
            - instruction: Original task instruction
            - iterations: Number of iterations completed
            - sandbox: Sandbox configuration (docker_port, etc.)

    Returns:
        Dictionary with:
            - passed (bool): Whether task passed evaluation
            - feedback (str): Human-readable evaluation message
            - details (dict, optional): Additional metrics
    """
```

Tip
Most `test.py` scripts first try to extract the answer from `task_result`, then fall back to searching the conversation history. If you're using your own agent, you can typically just pass `task_result` with the agent's final answer.
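For illustration, a task's `test.py` might look like the sketch below. The expected answer and the fallback over the conversation history are made up for this example and don't reflect any real task.

```python
# Sketch of a hypothetical test.py: check the agent's final answer,
# falling back to the conversation history if it isn't in task_result.
def test(result: dict) -> dict:
    """Pass if the expected answer appears in the agent's output."""
    expected = "42"  # made-up expected answer for illustration
    answer = str(result.get("task_result") or "")

    if expected not in answer:
        # Fall back: scan the conversation history for the expected answer.
        # (Assumes messages can be rendered with str(); adjust as needed.)
        answer = " ".join(str(m) for m in result.get("conversation", []))

    passed = expected in answer
    feedback = "Correct answer found." if passed else f"Expected '{expected}' in the agent's output."
    return {"passed": passed, "feedback": feedback}
```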
Results are saved to results/<task-name>.json when using the CocoaAgent framework.
Learn more:
- Evaluation Guide — Complete result dictionary structure and return format
- Sandbox API Reference — How to access files and state inside the sandbox container
We welcome new benchmark tasks! See contrib/CONTRIBUTING.md for guidelines.
Important
Please encrypt your task before submitting a PR so the benchmark data cannot be found by the agent.
```bibtex
@article{team2026cocoabench,
  title={CocoaBench: Evaluating Unified Digital Agents in the Wild},
  author={Team, CocoaBench and Hao, Shibo and Zhang, Zhining and Liang, Zhiqi and Liu, Tianyang and Zha, Yuheng and Gao, Qiyue and Chen, Jixuan and Wang, Zilong and Cheng, Zhoujun and others},
  journal={arXiv preprint arXiv:2604.11201},
  year={2026}
}
```