🤖 VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models


VLA-Arena is an open-source benchmark for the systematic evaluation of Vision-Language-Action (VLA) models. It provides a full toolchain covering scene modeling, demonstration collection, model training, and evaluation. It features 170 tasks across 11 specialized suites, hierarchical difficulty levels (L0–L2), and comprehensive metrics for assessing safety, generalization, and efficiency.

VLA-Arena focuses on four key domains:

  • Safety: Operate reliably and safely in the physical world.
  • Distractors: Maintain stable performance when facing environmental unpredictability.
  • Extrapolation: Generalize learned knowledge to novel situations.
  • Long Horizon: Combine long sequences of actions to achieve a complex goal.

📰 News

  • [2025.12.27] 📄 Our paper is now available!
  • [2025.09.29] 🚀 VLA-Arena is officially released!

🔥 Highlights

  • 🚀 End-to-End & Out-of-the-Box: We provide a complete, unified toolchain covering everything from scene modeling and behavior collection to model training and evaluation. Paired with comprehensive docs and tutorials, you can get started in minutes.
  • 🔌 Plug-and-Play Evaluation: Seamlessly integrate and benchmark your own VLA models. The framework exposes a unified API, so evaluating a new architecture requires minimal code changes.
  • 🛠️ Effortless Task Customization: Use the Constrained Behavior Domain Definition Language (CBDDL) to rapidly define new tasks and safety constraints. Its declarative nature lets you achieve broad scenario coverage with minimal effort.
  • 📊 Systematic Difficulty Scaling: Assess model capabilities across three distinct difficulty levels (L0→L1→L2). Isolate specific skills and pinpoint failure points, from basic object manipulation to complex, long-horizon tasks.
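To illustrate the plug-and-play idea, here is a minimal sketch of what a model adapter could look like. The class and method names (`reset`, `predict_action`) are assumptions made for illustration only, not the actual VLA-Arena API, which is documented in the repository:

```python
import numpy as np


class RandomPolicy:
    """A stand-in VLA 'model' that emits random 7-DoF arm actions.

    The interface below (reset / predict_action) is hypothetical;
    consult the VLA-Arena docs for the real adapter contract.
    """

    def __init__(self, action_dim: int = 7, seed: int = 0):
        self.action_dim = action_dim
        self.rng = np.random.default_rng(seed)

    def reset(self) -> None:
        """Called once at the start of each episode."""
        pass

    def predict_action(self, observation: dict, instruction: str) -> np.ndarray:
        """Map an observation and a language instruction to one action."""
        return self.rng.uniform(-1.0, 1.0, size=self.action_dim)


# One fake evaluation step, just to show the call pattern.
policy = RandomPolicy()
policy.reset()
obs = {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}
action = policy.predict_action(obs, "pick up the red block")
print(action.shape)  # (7,)
```

A real adapter would load your checkpoint in `__init__` and run inference in `predict_action`; the benchmark then drives the same loop for every model.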

Quick Start

Prerequisite: install uv: https://docs.astral.sh/uv/

Step 1: Clone

git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

Step 2: Run (Evaluate or Train)

You can evaluate directly using our official fine-tuned models, or train your own. (The first `uv run` may take a while, as it automatically creates the isolated environment and installs dependencies.)

To Evaluate:

uv run --project envs/openvla \
  vla-arena eval --model openvla --config vla_arena/configs/evaluation/openvla.yaml

To Train:

uv run --project envs/openvla \
  vla-arena train --model openvla --config vla_arena/configs/train/openvla.yaml

โš™๏ธ Configuration

Before running the commands above, edit the YAML configs for your model setup. Example (OpenVLA):

  • Training Config (vla_arena/configs/train/openvla.yaml): Set vla_path, data_root_dir, and dataset_name.
  • Evaluation Config (vla_arena/configs/evaluation/openvla.yaml): Set pretrained_checkpoint, task_suite_name, and task_level.

Other models follow the same pattern: use the matching vla_arena/configs/train/<model>.yaml, vla_arena/configs/evaluation/<model>.yaml, and envs/<model>.
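As a concrete sketch, the evaluation YAML might look like the following. The field names come from the list above, but the values are placeholders you must replace with your own paths and choices:

```yaml
# vla_arena/configs/evaluation/openvla.yaml (illustrative values only)
pretrained_checkpoint: /path/to/openvla/checkpoint   # your fine-tuned weights
task_suite_name: safety_static_obstacles             # one of the 11 suites
task_level: 0                                        # 0, 1, or 2 (L0-L2)
```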

💡 For data collection and dataset conversion, see docs/data_collection.md.

Task Suites Overview

VLA-Arena provides 11 specialized task suites with 170 tasks total, organized into four domains:

๐Ÿ›ก๏ธ Safety (5 suites, 75 tasks)

Suite Description L0 L1 L2 Total
safety_static_obstacles Static collision avoidance 5 5 5 15
safety_cautious_grasp Safe grasping strategies 5 5 5 15
safety_hazard_avoidance Hazard area avoidance 5 5 5 15
safety_state_preservation Object state preservation 5 5 5 15
safety_dynamic_obstacles Dynamic collision avoidance 5 5 5 15

๐Ÿ”„ Distractor (2 suites, 30 tasks)

Suite Description L0 L1 L2 Total
distractor_static_distractors Cluttered scene manipulation 5 5 5 15
distractor_dynamic_distractors Dynamic scene manipulation 5 5 5 15

๐ŸŽฏ Extrapolation (3 suites, 45 tasks)

Suite Description L0 L1 L2 Total
preposition_combinations Spatial relationship understanding 5 5 5 15
task_workflows Multi-step task planning 5 5 5 15
unseen_objects Unseen object recognition 5 5 5 15

๐Ÿ“ˆ Long Horizon (1 suite, 20 tasks)

Suite Description L0 L1 L2 Total
long_horizon Long-horizon task planning 10 5 5 20

Difficulty Levels:

  • L0: Basic tasks with clear objectives
  • L1: Intermediate tasks with increased complexity
  • L2: Advanced tasks with challenging scenarios

๐Ÿ›ก๏ธ Safety Suites Visualization

Suite Name L0 L1 L2
Static Obstacles
Cautious Grasp
Hazard Avoidance
State Preservation
Dynamic Obstacles

๐Ÿ”„ Distractor Suites Visualization

Suite Name L0 L1 L2
Static Distractors
Dynamic Distractors

๐ŸŽฏ Extrapolation Suites Visualization

Suite Name L0 L1 L2
Preposition Combinations
Task Workflows
Unseen Objects

๐Ÿ“ˆ Long Horizon Suite Visualization

Suite Name L0 L1 L2
Long Horizon

Installation

System Requirements

  • OS: Ubuntu 20.04+ or macOS 12+
  • Python: 3.11.x (==3.11.*)
  • CUDA: 11.8+ (for GPU acceleration)

Install from Source (Recommended)

# Clone repository
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Install uv: https://docs.astral.sh/uv/

# (Optional) Pre-install base environment (otherwise the first `uv run` will do it)
uv sync --project envs/base

# (Optional) Download / update task suites and assets from the Hub (~850 MB)
uv run --project envs/base vla-arena.download-tasks install-all --repo vla-arena/tasks

Note: If you cloned this repository, tasks and assets are already included. You can skip the download step unless you want to update from the Hub.

Install from PyPI (Alternative)

python3 -m pip install vla-arena

# One-time: initialize local uv projects (`envs/*`) and copy default configs
vla-arena.init-workspace --force

# (Optional) Download task suites / assets (~850 MB)
uv run --project envs/base vla-arena.download-tasks install-all --repo vla-arena/tasks

# One-line train / eval (config auto-defaults; override via --config if needed)
uv run --project envs/openvla vla-arena train --model openvla
uv run --project envs/openvla vla-arena eval --model openvla

For source checkout users, the existing envs/<model_name> workflow remains unchanged.

Documentation

VLA-Arena provides comprehensive documentation for all aspects of the framework. Choose the guide that best fits your needs:

📖 Core Guides

Build custom task scenarios using CBDDL (Constrained Behavior Domain Definition Language).

  • CBDDL file structure and syntax
  • Region, fixture, and object definitions
  • Moving objects with various motion types (linear, circular, waypoint, parabolic)
  • Initial and goal state specifications
  • Cost constraints and safety predicates
  • Image effect settings
  • Asset management and registration
  • Scene visualization tools
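Since task files use a declarative, .bddl-style format, the elements listed above fit together in a single problem definition. The following is a purely hypothetical sketch for orientation; the actual CBDDL grammar, keywords, and predicates are defined in the task design guide:

```lisp
; Hypothetical sketch only; consult the CBDDL docs for the real grammar.
(define (problem pick_bottle_safely)
  (:regions  (table_center))          ; where objects may be placed
  (:fixtures (table))                 ; static scene elements
  (:objects  (bottle basket glass))   ; manipulable objects
  (:init     (on bottle table_center))
  (:goal     (in bottle basket))
  (:cost     (not (contact robot glass))))  ; safety constraint
```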

Collect demonstrations in custom scenes and convert data formats.

  • Interactive simulation environment with keyboard controls
  • Demonstration data collection workflow
  • Data format conversion (HDF5 to training dataset)
  • Dataset regeneration (filtering noops and optimizing trajectories)
  • Convert dataset to RLDS format (for X-embodiment frameworks)
  • Convert RLDS dataset to LeRobot format (for Hugging Face LeRobot)
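One of the regeneration steps above, filtering no-ops, can be sketched in plain NumPy. This is only an illustration of the idea; the actual VLA-Arena scripts may use a different criterion (e.g. per-dimension thresholds or gripper-state checks):

```python
import numpy as np


def filter_noops(actions: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Drop timesteps whose action magnitude is below `eps`.

    Illustrative version of 'filtering noops' during dataset
    regeneration; not the actual VLA-Arena implementation.
    """
    keep = np.linalg.norm(actions, axis=1) >= eps
    return actions[keep]


# Usage: a 5-step toy trajectory where steps 1 and 3 are (near-)noops.
traj = np.array([
    [0.2, 0.0, 0.1],
    [0.0, 0.0, 0.0],   # exact noop
    [0.0, 0.3, 0.0],
    [1e-4, 0.0, 0.0],  # below eps, treated as noop
    [0.0, 0.0, 0.5],
])
filtered = filter_noops(traj)
print(filtered.shape)  # (3, 3)
```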

Fine-tune and evaluate VLA models using VLA-Arena generated datasets.

  • Unified uv-only workflow for all supported models
  • Per-model isolated environments (envs/openvla, envs/openvla_oft, envs/univla, envs/smolvla, envs/openpi)
  • Training configuration and hyperparameter settings
  • Evaluation scripts and metrics
  • Policy server setup for inference (OpenPi)

🔜 Quick Reference

Common Commands

  • Train: uv run --project envs/<model_name> vla-arena train --model <model_cli_name> (optional override: --config ...)
  • Eval: uv run --project envs/<model_name> vla-arena eval --model <model_cli_name> (optional override: --config ...)
  • See the Model Fine-tuning and Evaluation Guide.

Documentation Index

  • English: README_EN.md - Complete English documentation index
  • 中文 (Chinese): README_ZH.md - Complete Chinese documentation index

📦 Download Task Suites

Method 1: Using CLI Tool (Recommended)

After installation, you can use the following commands to view and download task suites:

# View installed tasks
uv run --project envs/base vla-arena.download-tasks installed

# List available task suites
uv run --project envs/base vla-arena.download-tasks list --repo vla-arena/tasks

# Install a single task suite
uv run --project envs/base vla-arena.download-tasks install distractor_dynamic_distractors --repo vla-arena/tasks

# Install multiple task suites at once
uv run --project envs/base vla-arena.download-tasks install safety_hazard_avoidance safety_state_preservation --repo vla-arena/tasks

# Install all task suites (recommended)
uv run --project envs/base vla-arena.download-tasks install-all --repo vla-arena/tasks

Method 2: Using Python Script

# View installed tasks
uv run --project envs/base python -m scripts.download_tasks installed

# Install all tasks
uv run --project envs/base python -m scripts.download_tasks install-all --repo vla-arena/tasks

🔧 Custom Task Repository

If you want to use your own task repository:

# Use custom HuggingFace repository
uv run --project envs/base vla-arena.download-tasks install-all --repo your-username/your-task-repo

๐Ÿ“ Create and Share Custom Tasks

You can create and share your own task suites:

# Package a single task
uv run --project envs/base vla-arena.manage-tasks pack path/to/task.bddl --output ./packages

# Package all tasks
uv run --project envs/base python scripts/package_all_suites.py --output ./packages

# Upload to HuggingFace Hub
uv run --project envs/base vla-arena.manage-tasks upload ./packages/my_task.vlap --repo your-username/your-repo

Leaderboard

Performance Evaluation of VLA Models on the VLA-Arena Benchmark

We compare VLA models across four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Performance trends over the three difficulty levels (L0–L2) are shown on a unified scale (0.0–1.0) for cross-model comparison. Detailed results and comparisons are available in our leaderboard.


Sharing Research Results

VLA-Arena provides tools and interfaces to help you share your research results so the community can understand and reproduce your work. This section explains how to use them.

🤖 Sharing Model Results

To share your model results with the community:

  1. Evaluate Your Model: Evaluate your model on VLA-Arena tasks
  2. Submit Results: Follow the submission guidelines in our leaderboard repository
  3. Create Pull Request: Submit a pull request containing your model results

🎯 Sharing Task Designs

Share your custom tasks through the following steps, enabling the community to reproduce your task configurations:

  1. Design Tasks: Use CBDDL to design your custom tasks
  2. Package Tasks: Follow our guide to package and submit your tasks to your custom HuggingFace repository
  3. Update Task Store: Open a Pull Request to update your tasks in the VLA-Arena task store

Contributing

  • Report Issues: Found a bug? Open an issue
  • Improve Documentation: Help us make the docs better
  • Feature Requests: Suggest new features or improvements

Citing VLA-Arena

If you find VLA-Arena useful, please cite it in your publications.

@misc{zhang2025vlaarena,
  title={VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models},
  author={Borong Zhang and Jiahao Li and Jiachen Shen and Yishuai Cai and Yuhao Zhang and Yuanpei Chen and Juntao Dai and Jiaming Ji and Yaodong Yang},
  year={2025},
  eprint={2512.22539},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.22539}
}

License

This project is licensed under the Apache 2.0 license - see LICENSE for details.

Acknowledgments

  • RoboSuite, LIBERO, and VLABench teams for their foundational frameworks
  • OpenVLA, UniVLA, Openpi, and lerobot teams for pioneering VLA research
  • All contributors and the robotics community

Made with ❤️ by the VLA-Arena Team
