Evaluating VLM-augmented kinematic constraints for General Motion Tracking in humanoid loco-manipulation.
Can off-the-shelf Vision-Language Models generate useful kinematic constraints from egocentric observations? This project tests that question by combining Kimodo (text + constraints → motion) with a pretrained GMT policy on the Unitree G1, via ProtoMotions. Three conditions are compared:
| Condition | Input to Kimodo | Purpose |
|---|---|---|
| Baseline | Text only | Lower bound |
| VLM | Text + VLM-predicted kinematic constraints | Main experiment |
| GT | Text + ground-truth kinematic constraints | Upper bound |
```shell
git clone https://github.com/NVlabs/ProtoMotions.git
uv pip install -e ProtoMotions
uv pip install dm_control easydict matplotlib
git clone https://github.com/AlexandreBrown/VLM-GMT.git
```

IsaacLab is required locally (capture, playback, eval). Follow the IsaacLab installation guide.

Kimodo is required on the cluster for motion generation (>=17 GB VRAM). Follow the Kimodo docs.

VLM deps (VLM condition only):

```shell
uv pip install "transformers>=4.50.0" accelerate qwen-vl-utils Pillow
```

Environment variables:

```shell
# Local
export VLMGMT=~/Documents/vlm_project/VLM-GMT
export PROTOMOTIONS=~/Documents/vlm_project/ProtoMotions
export CKPT=$PROTOMOTIONS/data/pretrained_models/motion_tracker/g1-bones-deploy/last.ckpt

# Cluster
export VLMGMT=~/kimodo_test/VLM-GMT
export PROTOMOTIONS=~/kimodo_test/ProtoMotions
export HF_HOME=$SCRATCH/huggingface_cache
```

Commands that need IsaacLab must be run from `$PROTOMOTIONS` (relative asset paths).
```
VLM-GMT/
├── prompts/
│   └── system.txt                        # Shared VLM system prompt
├── tasks/                                # Each task contains: create_scene.py,
│   │                                     #   metrics.py, kimodo_prompt.txt, vlm_prompt.txt
│   ├── manip_reach_obj/                  # Red cube on a table
│   ├── walk_to_obj/                      # Green box on the ground
│   ├── navigate_maze/                    # Two staggered walls (maze-like)
│   ├── point_at_obj_with_right_hand/     # Blue object on pedestal (left), point with right hand
│   ├── point_at_obj_with_left_hand/      # Blue object on pedestal (right), point with left hand
│   ├── raise_right_hand/                 # Text-only: raise right hand above head
│   ├── raise_left_hand/                  # Text-only: raise left hand above head
│   ├── kneel_down_1_knee/                # Text-only: kneel on one knee (fullbody constraint)
│   ├── touch_left_leg_with_right_hand/   # Text-only: touch left knee with right hand
│   └── touch_right_leg_with_left_hand/   # Text-only: touch right knee with left hand
├── pipeline/
│   ├── generate_motion.py                # Constraints → Kimodo → motion.pt (cluster)
│   ├── generate_constraints.py           # GT / VLM → Kimodo constraint objects
│   ├── capture_egocentric.py             # Capture G1 head-camera image (local)
│   ├── egocentric_camera.py              # Camera sensor utilities
│   └── vlm/
│       ├── __init__.py                   # Registry + load_vlm()
│       ├── base.py                       # Abstract VLMBase
│       └── qwen.py                       # Qwen2.5-VL (7B / 32B / 72B)
├── eval/
│   ├── run_eval.py                       # GMT inference + metrics (local)
│   ├── video_recorder.py                 # Optional video capture
│   └── metrics/
│       ├── distance_to_target.py         # Supports 2D, 3D, fixed target pos
│       └── navigate_maze.py              # Trajectory-based line-compliance metric
├── scripts/                              # Per-task command scripts
│   ├── manip_reach_obj.sh
│   ├── walk_to_obj.sh
│   ├── navigate_maze.sh
│   ├── point_at_obj_with_right_hand.sh
│   ├── point_at_obj_with_left_hand.sh
│   ├── raise_right_hand.sh
│   ├── raise_left_hand.sh
│   ├── kneel_down_1_knee.sh
│   ├── touch_left_leg_with_right_hand.sh
│   └── touch_right_leg_with_left_hand.sh
└── outputs/                              # Generated data (gitignored)
```
Each task has a script in scripts/ with all commands (create scene, capture, generate motion, playback, eval). Edit the variables at the top of each script to match your paths.
| Task | Script | Success metric |
|---|---|---|
| manip_reach_obj | `scripts/manip_reach_obj.sh` | dist(right_wrist, cube) < 0.15m at episode end |
| walk_to_obj | `scripts/walk_to_obj.sh` | dist_2d(pelvis, box) < 0.5m at episode end |
| navigate_maze | `scripts/navigate_maze.sh` | Avoids both walls laterally AND final x past wall 2 + 0.5m |
| point_at_obj_with_right_hand | `scripts/point_at_obj_with_right_hand.sh` | dist(right_hand, obj) < 0.15m at episode end |
| point_at_obj_with_left_hand | `scripts/point_at_obj_with_left_hand.sh` | dist(left_hand, obj) < 0.15m at episode end |
| raise_right_hand | `scripts/raise_right_hand.sh` | right_hand_z > 1.3m at episode end (text-only) |
| raise_left_hand | `scripts/raise_left_hand.sh` | left_hand_z > 1.3m at episode end (text-only) |
| kneel_down_1_knee | `scripts/kneel_down_1_knee.sh` | pelvis_z < 0.5m at episode end (text-only, fullbody constraint) |
| touch_left_leg_with_right_hand | `scripts/touch_left_leg_with_right_hand.sh` | dist(right_hand, left_knee) < 0.15m (text-only) |
| touch_right_leg_with_left_hand | `scripts/touch_right_leg_with_left_hand.sh` | dist(left_hand, right_knee) < 0.15m (text-only) |
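The distance-based success criteria above reduce to a simple end-of-episode threshold check. A minimal sketch, assuming illustrative body/object positions (only the thresholds come from the table; `reached` is not the repo's actual metric API):

```python
import numpy as np

def reached(pos_a: np.ndarray, pos_b: np.ndarray,
            threshold: float = 0.15, planar: bool = False) -> bool:
    """End-of-episode success check: Euclidean distance between two
    body/object positions. With planar=True only x/y are compared
    (walk_to_obj's dist_2d)."""
    if planar:
        pos_a, pos_b = pos_a[:2], pos_b[:2]
    return float(np.linalg.norm(pos_a - pos_b)) < threshold

# manip_reach_obj-style check: right wrist within 0.15m of the cube
print(reached(np.array([0.40, 0.10, 0.95]), np.array([0.45, 0.12, 1.00])))  # True

# walk_to_obj-style check: pelvis within 0.5m of the box in the ground plane
print(reached(np.array([2.1, 0.3, 0.8]), np.array([2.4, 0.1, 0.0]),
              threshold=0.5, planar=True))  # True
```

The text-only tasks (`raise_*_hand`, `kneel_down_1_knee`) use a single-coordinate threshold instead of a distance, but follow the same at-episode-end pattern.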
The per-task workflow is:

- Create scene (local)
- Generate baseline motion (cluster)
- Capture ego image (local, from `$PROTOMOTIONS`); requires the baseline `motion.pt`
- Generate GT and VLM motions (cluster); the VLM condition requires `ego.png` transferred from local
- Kinematic playback (local); optional sanity check
- Eval (local, from `$PROTOMOTIONS`)
To add a new task:

- Create `tasks/<task_name>/create_scene.py`
- Create `tasks/<task_name>/metrics.py` with `get_metrics() → list[Metric]`
- Create `tasks/<task_name>/kimodo_prompt.txt` and `vlm_prompt.txt`
- Add GT constraint logic to `pipeline/generate_constraints.py`
- Add GT arg handling to `pipeline/generate_motion.py`
- Create `scripts/<task_name>.sh`
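A rough sketch of the `get_metrics()` contract, using hypothetical stand-ins for the repo's types (the real `Metric` base class and the constructor of whatever lives in `eval/metrics/distance_to_target.py` may differ):

```python
from dataclasses import dataclass

@dataclass
class Metric:                    # stand-in for the repo's metric base type
    name: str
    threshold: float

@dataclass
class DistanceToTarget(Metric):  # hypothetical; mirrors eval/metrics/distance_to_target.py
    body_name: str
    planar: bool = False         # True -> 2D distance, as in walk_to_obj

def get_metrics() -> list[Metric]:
    """Called by the eval runner to score the episode."""
    return [DistanceToTarget(name="dist_right_wrist_to_cube",
                             threshold=0.15,
                             body_name="right_wrist")]
```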
To add a new VLM backend:

- Subclass `pipeline/vlm/base.py:VLMBase`
- Implement `load()` and `query_constraints(image_rgb, task_description) → list[dict]`; `image_rgb` may be `None` for text-only tasks (the VLM reasons from body knowledge alone)
- Register it in `pipeline/vlm/__init__.py`: `REGISTRY` and `HF_MODEL_IDS`
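A minimal sketch of the subclassing step. The `VLMBase` stub below only mirrors the method names from the checklist (it stands in for `pipeline/vlm/base.py`), and `MyVLM`'s constraint dicts are illustrative, not Kimodo's actual constraint schema:

```python
from abc import ABC, abstractmethod

class VLMBase(ABC):  # stand-in for pipeline/vlm/base.py:VLMBase
    @abstractmethod
    def load(self) -> None: ...
    @abstractmethod
    def query_constraints(self, image_rgb, task_description: str) -> list[dict]: ...

class MyVLM(VLMBase):
    def load(self) -> None:
        # Load model weights / processor here (e.g. via transformers).
        self.ready = True

    def query_constraints(self, image_rgb, task_description: str) -> list[dict]:
        # image_rgb may be None for text-only tasks: the VLM then reasons
        # purely from its knowledge of the humanoid body.
        if image_rgb is None:
            return [{"joint": "right_wrist", "target": [0.0, 0.0, 1.4]}]
        return [{"joint": "right_wrist", "target": [0.4, 0.1, 1.0]}]

# Registration sketch (pipeline/vlm/__init__.py):
#   REGISTRY["my_vlm"] = MyVLM
#   HF_MODEL_IDS["my_vlm"] = "<hf-model-id>"
```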