CocoaBench & CocoaAgent

A Framework for Evaluating and Developing Next-Generation Unified Agents

What's Inside

  • CocoaBench Dataset — Benchmark tasks included directly in this repo: cocoabench-v1.0/ (stable) and cocoabench-head/ (community contributions, continuously merged)
  • CocoaAgent Framework — Model-agnostic agent executor that equips agents with general tools (browser, terminal, file operations, code interpreter) via AIO Sandbox

Note

cocoabench-head/ contains community contributions that are continuously merged. For reproducible evaluation, use a stable release like v1.0.

Prerequisites

  • Python 3.13+
  • Docker & Docker Compose
  • uv (recommended) or pip

Quick Start

Option A: Use the Dataset Only (with your own agent)

# Browse v1.0 tasks (already in repo)
ls cocoabench-v1.0/

# Decrypt tasks (if encrypted)
python decrypt.py --tasks-dir cocoabench-v1.0/

Note

v0.1 is still available as a historical archive: https://cocoabench.github.io/assets/data/cocoa-bench-v0.1.zip

Each task directory contains:

File                  Purpose
task.yaml             Task instruction to give your agent
test.py               Evaluation script with a test(result) function
Dockerfile            Task environment setup
docker-compose.yaml   Docker Compose configuration
assets/               Additional files for the task (optional)

Evaluation: Each test.py exports a test(result) function. If you're using your own agent, you typically just need to pass {"task_result": "<agent's final answer>"}. See Evaluation for details.
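
As a sketch of what that looks like in practice, the snippet below loads a task's test.py and evaluates an answer produced by your own agent. Only the test(result) contract comes from this repo; the task path and answer are placeholders.

import importlib.util
from pathlib import Path

def evaluate(task_dir: str, final_answer: str) -> dict:
    """Load a task's test.py and run its test() against a minimal result dict."""
    test_path = Path(task_dir) / "test.py"
    spec = importlib.util.spec_from_file_location("task_test", test_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.test({"task_result": final_answer})

verdict = evaluate("cocoabench-v1.0/<task-name>", "<agent's final answer>")
print(verdict["passed"], verdict["feedback"])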

Option B: Run with CocoaAgent Framework

# 1. Install
git clone https://github.com/cocoabench/cocoa-agent.git && cd cocoa-agent
uv sync  # or: pip install -r requirements.txt

# 2. Choose tasks
# See included example tasks: cocoabench-example-tasks/
# Or download full benchmark dataset: follow Option A above

# 3. Configure
cp configs/default_gpt.json configs/my-config.json
# Edit my-config.json: set your API key

# 4. Run with example tasks
python inference_main.py \
  --config configs/my-config.json \
  --tasks-dir cocoabench-example-tasks/ \
  --output-dir results/

# Or run with full v1.0 dataset (decryption is handled automatically):
# python inference_main.py \
#   --config configs/my-config.json \
#   --tasks-dir cocoabench-v1.0/ \
#   --output-dir results/

Parallel Inference

To run tasks in parallel across multiple workers (each with its own Docker sandbox port):

python parallel_inference.py \
  --config <config_path> \
  --tasks-dir cocoabench-v1.0/ \
  --output-dir <results_dir> \
  --workers 8

Arg            Required   Description
--config       Yes        Model config file path (JSON)
--tasks-dir    Yes        Directory containing task subdirectories
--output-dir   Yes        Final output directory for result JSONs
--workers      No         Number of parallel workers (default: 4)
--base-port    No         Starting Docker sandbox port (default: 8084); auto-scans for available ports
--model        No         Override model name from config
--run-all      No         Run all tasks, including previously passed ones (default: skip passed, rerun failed/missing only)
--work-dir     No         Temp directory for worker configs/logs (default: .parallel_run)

By default, tasks that already have a successful result in --output-dir are skipped, so you can rerun the same command to retry only failed/missing tasks. Use --run-all to force rerun everything.

Output:

  • output-dir/{task_name}.json — result file per task
  • output-dir/statistics.txt — pass rate, failure list, and API cost summary
  • work-dir/ — per-session logs and intermediate files for debugging
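
If you want to post-process results yourself, a minimal sketch like the one below can aggregate the per-task JSONs. The exact schema of these files isn't documented here; the snippet assumes each file records the evaluation verdict under a "passed" key, so adjust the field name to what your result files actually contain.

import json
from pathlib import Path

results_dir = Path("results/")
verdicts = {}
for path in sorted(results_dir.glob("*.json")):
    with open(path) as f:
        data = json.load(f)
    # Assumed field name; check your result files for the real key.
    verdicts[path.stem] = bool(data.get("passed", False))

passed = sum(verdicts.values())
print(f"{passed}/{len(verdicts)} tasks passed")
for name, ok in verdicts.items():
    if not ok:
        print("FAILED:", name)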

Configuration

Edit your config file to customize the agent:

{
  "controller": {
    "type": "llm",
    "args": {
      "model": "gpt-5.2",
      "api_key": "sk-...",
      "base_url": ""
    }
  },
  "sandbox": {
    "docker_port": 8080,
    "max_iterations": 30
  }
}

Key                        Description
controller.args.model      Model name (e.g., gpt-5.2)
controller.args.api_key    Your API key
controller.args.base_url   Custom endpoint for local models (optional)
sandbox.docker_port        Port for the sandbox container (default: 8080)
sandbox.max_iterations     Max agent iterations per task (default: 30)
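
If you'd rather not hard-code credentials, one way to generate a config is a small script like the one below. The JSON structure mirrors the example above; the environment variable name, output path, and local base_url are illustrative choices, not part of the framework.

import json
import os

config = {
    "controller": {
        "type": "llm",
        "args": {
            "model": "gpt-5.2",
            "api_key": os.environ["COCOA_API_KEY"],   # read the key from the environment
            "base_url": "http://localhost:8000/v1",   # leave "" to use the default endpoint
        },
    },
    "sandbox": {
        "docker_port": 8080,
        "max_iterations": 30,
    },
}

with open("configs/my-config.json", "w") as f:
    json.dump(config, f, indent=2)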

Evaluation

Each task includes a test.py that runs on the host machine after the agent completes. The framework calls test(result) with the full execution result and expects a pass/fail verdict.

def test(result: dict) -> dict:
    """Evaluate task results after execution.

    Args:
        result: Complete execution result containing:
            - task_result: Agent's final answer
            - conversation: Full message history with controller
            - execution_trace: All actions and their outputs
            - status: Task status ("success" or "failed")
            - instruction: Original task instruction
            - iterations: Number of iterations completed
            - sandbox: Sandbox configuration (docker_port, etc.)

    Returns:
        Dictionary with:
            - passed (bool): Whether task passed evaluation
            - feedback (str): Human-readable evaluation message
            - details (dict, optional): Additional metrics
    """

Tip

Most test.py scripts first try to extract the answer from task_result, then fall back to searching the conversation history. If you're using your own agent, you can typically just pass task_result with the agent's final answer.
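
As an illustration, a test.py following that pattern might look like the sketch below. The expected answer and fallback logic are invented for the example; real tasks define their own checks, and the message format in conversation is assumed here.

EXPECTED = "42"  # hypothetical expected answer for this task

def test(result: dict) -> dict:
    """Pass if the expected answer appears in task_result, else search the conversation."""
    answer = result.get("task_result") or ""
    if EXPECTED in str(answer):
        return {"passed": True, "feedback": "Correct answer found in task_result."}
    # Fall back to scanning the message history with the controller.
    # Assumes each message is a dict with a "content" field.
    for message in result.get("conversation", []):
        if EXPECTED in str(message.get("content", "")):
            return {"passed": True, "feedback": "Correct answer found in conversation."}
    return {"passed": False, "feedback": f"Expected '{EXPECTED}' not found."}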

Results are saved to results/<task-name>.json when using the CocoaAgent framework.


Contributing New Tasks

We welcome new benchmark tasks! See contrib/CONTRIBUTING.md for guidelines.

Important

Please encrypt your task before submitting a PR so that the benchmark data cannot be discovered by the agent under evaluation.

Citation

@article{team2026cocoabench,
  title={CocoaBench: Evaluating Unified Digital Agents in the Wild},
  author={Team, CocoaBench and Hao, Shibo and Zhang, Zhining and Liang, Zhiqi and Liu, Tianyang and Zha, Yuheng and Gao, Qiyue and Chen, Jixuan and Wang, Zilong and Cheng, Zhoujun and others},
  journal={arXiv preprint arXiv:2604.11201},
  year={2026}
}
