Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
Haiweng Xu, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Ziheng Xi, Zongqing Lu
Peking University, Tsinghua University, BeingBeyond
This is the official repository for the paper "Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models", which introduces the BeTTER benchmark.
Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, projecting an illusion of robust semantic grounding and sequential planning. BeTTER is a diagnostic benchmark designed to break this illusion. By applying targeted causal interventions while enforcing kinematic isolation, BeTTER explicitly decouples high-level reasoning failures from low-level execution limits, unmasking severe cognitive deficits such as behavioral inertia and semantic feature collapse in state-of-the-art VLAs.
We are actively working to clean up and open-source the codebase. To ensure high quality, we will release the components progressively. Watch 👀 and Star ⭐ this repository to stay updated!
- Paper Release: arXiv preprint available.
- Phase 1: Asset Curation & Task Generation Pipeline
  - VLM-guided task instantiation templates.
  - Open-vocabulary 3D asset retrieval and integration (via Objaverse).
- Phase 2: The BeTTER Benchmark Suite & Evaluation
  - The complete suite of 10 base manipulation tasks and 60 diagnostic variations.
  - Standardized evaluation scripts and testing environments.
- Phase 3: Data Augmentation & Privileged Logging
  - Teleoperation trajectory amplification pipeline (incorporating MimicGen).
  - Deterministic privileged state logging and VQA generation scripts.
(Code and instructions are coming soon. Please stay tuned!)
If you find our benchmark, analysis, or data pipelines useful in your research, please consider citing our work:
@article{xu2026unmasking,
  title={Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models},
  author={Xu, Haiweng and Zheng, Sipeng and Luo, Hao and Zhang, Wanpeng and Xi, Ziheng and Lu, Zongqing},
  journal={arXiv preprint arXiv:2604.18000},
  year={2026}
}
We would like to thank the open-source community, particularly the developers of Objaverse and MimicGen, whose foundational tools greatly facilitated the development of this benchmark.
