This repository contains the code for *Learning from Human Gaze: Human-like Robot Social Navigation in Dense Crowds* (AAAI 2026). Gaze2Nav learns from egocentric RGB video, human gaze, pedestrian instance masks, and 2D trajectories to produce more human-like navigation in dense crowds.
Gaze2Nav follows a three-stage pipeline (a minimal data-flow sketch follows this list):
- Gaze Predictor: predicts where a human would look in the current egocentric frame, given recent RGB frames and previous gaze maps.
- Semantic Saliency Matching: matches predicted gaze peaks against tracked pedestrian masks and keeps recently attended pedestrians in a sliding window.
- Motion Planner: predicts future waypoints from the RGB history plus either gaze maps or salient pedestrian masks.
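To make the data flow concrete, here is a minimal sketch of the three stages. All function names, shapes, and the sliding-window length are illustrative assumptions, not the repository's actual API; the placeholder bodies stand in for the learned networks.

```python
import numpy as np

WINDOW = 5  # sliding-window length for recently attended pedestrians (assumed)

def predict_gaze(rgb_history, gaze_history):
    """Stage 1 (illustrative): predict a gaze heatmap for the current frame."""
    h, w = rgb_history[-1].shape[:2]
    heatmap = np.random.rand(h, w)          # stands in for the gaze network's output
    return heatmap / heatmap.sum()

def match_salient_ids(gaze_map, masks_by_id, attended):
    """Stage 2 (illustrative): match the gaze peak to tracked pedestrian masks."""
    py, px = np.unravel_index(np.argmax(gaze_map), gaze_map.shape)
    for pid, mask in masks_by_id.items():
        if mask[py, px]:                     # gaze peak falls inside this person's mask
            attended.append(pid)
    return sorted(set(attended[-WINDOW:]))   # recently attended pedestrian IDs

def plan_motion(rgb_history, attention_maps):
    """Stage 3 (illustrative): predict future 2D waypoints."""
    return np.zeros((8, 2))                  # stands in for the planner's waypoint output
```

A real rollout would call these three functions once per frame, threading the attended-ID window through time.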
The code also includes ViNT/GNM baselines and gaze/person-aware variants.
| Path | Description |
| --- | --- |
| `configs/` | Training and detector/tracker configs |
| `gaze2nav/data/` | Dataset loaders and preprocessing scripts |
| `gaze2nav/models/` | Gaze, saliency-matching, and action models |
| `gaze2nav/training/` | Losses, training loops, logging, visualization |
| `detector/`, `deep_sort/` | Mask R-CNN and Deep SORT tracking components |
| `tracking_utils/` | Shared detector/tracker drawing, IO, and logging helpers |
| `run_deep_sort.py` | Batch tracking script that exports mask CSV files |
| `train.py` | Main train/generate entrypoint |
| `evaluate_salient_ids.py` | Evaluation for generated salient-person IDs |
| `create_data_splits.py` | Utility for creating train/test trajectory splits |
Each trajectory folder should contain frame-level data (a loading sketch follows the layout):

```
data/<trajectory_name>/
    0.jpg, 1.jpg, ...
    0.csv, 1.csv, ...   # first row: tracked person IDs; following rows: flattened masks
    traj_data.pkl       # 2D trajectory dataframe
    fixations.pkl       # fixation dataframe, x/y per frame
    person_ids.pkl      # generated from the mask CSVs
    select_ids.pkl      # ground-truth salient IDs
```
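As a concrete example, a minimal loader for one frame's mask CSV might look like the sketch below. It assumes one row-major flattened binary mask per row after the ID row, plus a hypothetical frame size; check the preprocessing scripts in `gaze2nav/data/` for the exact layout.

```python
import csv
import numpy as np

H, W = 480, 640  # hypothetical frame size; use your recording's resolution

def load_frame_masks(csv_path):
    """Read one frame's mask CSV: first row = tracked person IDs,
    each following row = one person's flattened binary mask (assumed row-major)."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    person_ids = [int(float(v)) for v in rows[0]]
    masks = [np.asarray(row, dtype=float).reshape(H, W) > 0 for row in rows[1:]]
    return dict(zip(person_ids, masks))  # {person_id: (H, W) boolean mask}
```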
Split folders each contain a `traj_names.txt` file:

```
data_splits/train/traj_names.txt
data_splits/test/traj_names.txt
```
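Each `traj_names.txt` is expected to list one trajectory folder name per line (an assumption based on the layout above; the names here are placeholders):

```
traj_000
traj_001
traj_002
```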
Create splits with:

```bash
python create_data_splits.py --data_dir /path/to/data --data_splits_dir /path/to --dataset_name data_splits
```

Parts of the navigation model structure and training workflow are inspired by [robodhruv/visualnav-transformer](https://github.com/robodhruv/visualnav-transformer). The tracking stack includes Deep SORT-derived components together with Mask R-CNN detection code.