A Framework for Evaluating and Developing Next-Generation Unified Agents
- CocoaBench Dataset — Benchmark tasks included directly in this repo: `cocoabench-v1.0/` (stable) and `cocoabench-head/` (community contributions, continuously merged)
- CocoaAgent Framework — Model-agnostic agent executor that equips agents with general tools (browser, terminal, file operations, code interpreter) via AIO Sandbox
Note
cocoabench-head/ contains community contributions that are continuously merged. For reproducible evaluation, use a stable release like v1.0.
- Python 3.13+
- Docker & Docker Compose
- uv (recommended) or pip
```bash
# Browse v1.0 tasks (already in repo)
ls cocoabench-v1.0/

# Decrypt tasks (if encrypted)
python decrypt.py --tasks-dir cocoabench-v1.0/
```
Note
v0.1 is still available as a historical archive: https://cocoabench.github.io/assets/data/cocoa-bench-v0.1.zip
Each task directory contains:
| File | Purpose |
|---|---|
| `task.yaml` | Task instruction to give your agent |
| `test.py` | Evaluation script with a `test(result)` function |
| `Dockerfile` | Task environment setup |
| `docker-compose.yaml` | Docker config |
| `assets/` | Additional files for the task (optional) |
Evaluation: Each `test.py` exports a `test(result)` function. If you're using your own agent, you typically just need to pass `{"task_result": "<agent's final answer>"}`. See Evaluation for details.
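If you run your own agent outside the CocoaAgent framework, you can still reuse a task's `test.py` directly. The sketch below shows one way to do that, assuming only that your agent produced a final answer string; the loading helper and the task directory name are illustrative, not part of the framework API.

```python
# Minimal sketch: evaluate your own agent's answer against a task's test.py.
# The loading approach and the task directory below are illustrative only.
import importlib.util
from pathlib import Path

def run_task_test(task_dir: str, agent_answer: str) -> dict:
    """Load <task_dir>/test.py and call its test() with a minimal result dict."""
    test_path = Path(task_dir) / "test.py"
    spec = importlib.util.spec_from_file_location("task_test", test_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Most test.py scripts only need the agent's final answer.
    return module.test({"task_result": agent_answer})

verdict = run_task_test("cocoabench-v1.0/some-task", "42")  # hypothetical task dir
print(verdict["passed"], verdict["feedback"])
```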
```bash
# 1. Install
git clone https://github.com/cocoabench/cocoa-agent.git && cd cocoa-agent
uv sync  # or: pip install -r requirements.txt

# 2. Choose tasks
# See included example tasks: cocoabench-example-tasks/
# Or download full benchmark dataset: follow Option A above

# 3. Configure
cp configs/default_gpt.json configs/my-config.json
# Edit my-config.json: set your API key

# 4. Run with example tasks
python inference_main.py \
    --config configs/my-config.json \
    --tasks-dir cocoabench-example-tasks/ \
    --output-dir results/

# Or run with full v1.0 dataset (decryption is handled automatically):
# python inference_main.py \
#     --config configs/my-config.json \
#     --tasks-dir cocoabench-v1.0/ \
#     --output-dir results/
```

To run tasks in parallel across multiple workers (each with its own Docker sandbox port):
```bash
python parallel_inference.py \
    --config <config_path> \
    --tasks-dir cocoabench-v1.0/ \
    --output-dir <results_dir> \
    --workers 8
```

| Arg | Required | Description |
|---|---|---|
| `--config` | Yes | Model config file path (JSON) |
| `--tasks-dir` | Yes | Directory containing task subdirectories |
| `--output-dir` | Yes | Final output directory for result JSONs |
| `--workers` | No | Number of parallel workers (default: 4) |
| `--base-port` | No | Starting Docker sandbox port (default: 8084); auto-scans for available ports |
| `--model` | No | Override model name from config |
| `--run-all` | No | Run all tasks, including previously passed ones. Default: skip passed, rerun failed/missing only |
| `--work-dir` | No | Temp directory for worker configs/logs (default: `.parallel_run`) |
By default, tasks that already have a successful result in `--output-dir` are skipped, so you can rerun the same command to retry only failed or missing tasks. Use `--run-all` to force a rerun of everything.
Output:
- `output-dir/{task_name}.json` — result file per task
- `output-dir/statistics.txt` — pass rate, failure list, and API cost summary
- `work-dir/` — per-session logs and intermediate files for debugging
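Beyond reading `statistics.txt`, you can post-process the per-task result JSONs yourself. The sketch below assumes each result file records the evaluation verdict under a top-level `passed` key; inspect one of your own result files first and adjust the key if the schema differs.

```python
# Sketch: summarize per-task result JSONs from --output-dir.
# Assumes each result file exposes the verdict under a top-level "passed" key;
# check one of your own result files and adjust the key if it differs.
import json
from pathlib import Path

def summarize(output_dir: str) -> None:
    result_files = sorted(Path(output_dir).glob("*.json"))
    failed = []
    for path in result_files:
        data = json.loads(path.read_text())
        if not data.get("passed"):
            failed.append(path.stem)
    print(f"{len(result_files) - len(failed)}/{len(result_files)} tasks passed")
    for name in failed:
        print(f"  FAILED: {name}")

summarize("results/")
```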
Edit your config file to customize the agent:
```json
{
  "controller": {
    "type": "llm",
    "args": {
      "model": "gpt-5.2",
      "api_key": "sk-...",
      "base_url": ""
    }
  },
  "sandbox": {
    "docker_port": 8080,
    "max_iterations": 30
  }
}
```

| Key | Description |
|---|---|
| `controller.args.model` | Model name (e.g., `gpt-5.2`) |
| `controller.args.api_key` | Your API key |
| `controller.args.base_url` | Custom endpoint for local models (optional) |
| `sandbox.docker_port` | Port for sandbox container (default: 8080) |
| `sandbox.max_iterations` | Max agent iterations per task (default: 30) |
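If you serve a model locally behind an OpenAI-compatible endpoint, point `base_url` at it. The snippet below writes such a config programmatically; the model name, endpoint URL, and output filename are placeholders, and only the keys documented above are set.

```python
# Sketch: generate a config for a locally served, OpenAI-compatible model.
# "my-local-model", the endpoint URL, and the output filename are placeholders.
import json

config = {
    "controller": {
        "type": "llm",
        "args": {
            "model": "my-local-model",               # placeholder model name
            "api_key": "EMPTY",                      # many local servers ignore the key
            "base_url": "http://localhost:8000/v1",  # placeholder local endpoint
        },
    },
    "sandbox": {"docker_port": 8080, "max_iterations": 30},
}

with open("configs/local-model.json", "w") as f:
    json.dump(config, f, indent=2)
```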
Each task includes a `test.py` that runs on the host machine after the agent completes. The framework calls `test(result)` with the full execution result and expects a pass/fail verdict.
```python
def test(result: dict) -> dict:
    """Evaluate task results after execution.

    Args:
        result: Complete execution result containing:
            - task_result: Agent's final answer
            - conversation: Full message history with controller
            - execution_trace: All actions and their outputs
            - status: Task status ("success" or "failed")
            - instruction: Original task instruction
            - iterations: Number of iterations completed
            - sandbox: Sandbox configuration (docker_port, etc.)

    Returns:
        Dictionary with:
            - passed (bool): Whether task passed evaluation
            - feedback (str): Human-readable evaluation message
            - details (dict, optional): Additional metrics
    """
```

Tip
Most `test.py` scripts first try to extract the answer from `task_result`, then fall back to searching the conversation history. If you're using your own agent, you can typically just pass `task_result` with the agent's final answer.
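For illustration, a task's `test.py` might look like the sketch below. The expected answer and the fallback over the conversation history are made up for this example and don't reflect any real task.

```python
# Sketch of a hypothetical test.py: check the agent's final answer,
# falling back to the conversation history if it isn't in task_result.
def test(result: dict) -> dict:
    """Pass if the expected answer appears in the agent's output."""
    expected = "42"  # made-up expected answer for illustration
    answer = str(result.get("task_result") or "")

    if expected not in answer:
        # Fall back: scan the conversation history for the expected answer.
        # (Assumes messages can be rendered with str(); adjust as needed.)
        answer = " ".join(str(m) for m in result.get("conversation", []))

    passed = expected in answer
    feedback = "Correct answer found." if passed else f"Expected '{expected}' in the agent's output."
    return {"passed": passed, "feedback": feedback}
```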
Results are saved to results/<task-name>.json when using the CocoaAgent framework.
Learn more:
- Evaluation Guide — Complete result dictionary structure and return format
- Sandbox API Reference — How to access files and state inside the sandbox container
We welcome new benchmark tasks! See contrib/CONTRIBUTING.md for guidelines.
Important
Please encrypt your task before submitting a PR so the benchmark data cannot be found by the agent.
```bibtex
@article{team2026cocoabench,
  title={CocoaBench: Evaluating Unified Digital Agents in the Wild},
  author={Team, CocoaBench and Hao, Shibo and Zhang, Zhining and Liang, Zhiqi and Liu, Tianyang and Zha, Yuheng and Gao, Qiyue and Chen, Jixuan and Wang, Zilong and Cheng, Zhoujun and others},
  journal={arXiv preprint arXiv:2604.11201},
  year={2026}
}
```