This repository contains the refactored framework that implements a 4D (spatiotemporal) visual backbone for the AutoVLA model.
Instead of a static image encoder (e.g., DINOv2 or SigLIP), it uses MCG-NJU/videomae-base to ingest 16-frame video clips and passes the resulting features to the open-source Qwen/Qwen2.5-VL-3B-Instruct vision-language model.
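
To get a feel for how much the video encoder hands to the language model, the sketch below counts the patch tokens produced for one 16-frame clip. It assumes the published VideoMAE base defaults (224x224 input, 16x16 patches, tubelet size 2); whether this repository keeps those defaults is an assumption, and `videomae_token_count` is a hypothetical helper, not part of the codebase.

```python
def videomae_token_count(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count the spatiotemporal patch tokens for one clip.

    Defaults follow the published VideoMAE base configuration; this
    repository may use different values (assumption).
    """
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    temporal_tokens = num_frames // tubelet_size        # frames grouped into tubelets
    spatial_tokens = (image_size // patch_size) ** 2    # patches per frame
    return temporal_tokens * spatial_tokens


print(videomae_token_count())  # 8 tubelets * 196 patches = 1568
```

Under these assumptions each clip yields 1568 encoder tokens, which is the sequence a projector would have to map into the language model's embedding space.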
- Install the requirements:

      pip install -r requirements.txt

- Download the Waymo subset (see Dataset Setup below):

      python download_waymo_subset.py

  This writes `data/waymo_subset/` with tar shards.

- Extract the tar shards into the layout expected by the training dataset (`dataset_dir/subset/split/*.mp4` and `*.npy`):

      python scripts/extract_waymo_subset.py --dataset_dir data/waymo_subset

  Output: `data/waymo_subset/Unconventional Dynamic Obstacles/train/` and `.../val/`, with `{id}.mp4` and `{id}.npy` per sample.

- Generate the K-disk action vocabulary (required once before training). The tokenizer fits from either the tar files or the extracted `.npy` files:

      python data/tokenizer.py --dataset_dir data/waymo_subset --output_file data/action_centers.pt

- Train with SFT (cross-entropy over action tokens, QLoRA):

      python scripts/train.py --dataset_dir data/waymo_subset --split train --batch_size 4 --lr 1e-5 --epochs 3

  Optional: `--subset "Unconventional Dynamic Obstacles"` (default) must match the subset name used in step 3. Checkpoints: `models/projector_weights.pt`, `models/action_head_weights.pt`.

- Evaluate (e.g. PDMS in `navsim`):

      python scripts/eval.py

- Visualize model predictions:

      PYTHONPATH=$PWD python scripts/visualize_comparisons.py

  Generates bird's-eye-view (BEV) trajectory plots comparing ground truth, BaselineVLA, and AutoVLA4D model predictions. Results are saved in the `visualizations/` directory.
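
The K-disk vocabulary step above can be pictured with a small greedy sketch: accept an action as a new center only if it lies at least `radius` away from every center chosen so far, and stop once `k` centers exist. This is an illustration of the general idea in pure Python, not the code in `data/tokenizer.py`; the function names, the 2D toy actions, and the values of `k` and `radius` are all hypothetical.

```python
import math
import random


def k_disk_centers(actions, k, radius, seed=0):
    """Greedily pick up to k centers that are pairwise at least `radius` apart."""
    rng = random.Random(seed)
    order = list(range(len(actions)))
    rng.shuffle(order)  # visit candidates in a reproducible random order
    centers = []
    for i in order:
        candidate = actions[i]
        # Keep the candidate only if it is outside every existing disk.
        if all(math.dist(candidate, c) >= radius for c in centers):
            centers.append(candidate)
            if len(centers) == k:
                break
    return centers


def tokenize(action, centers):
    """Map an action to the index of its nearest center (its action token)."""
    return min(range(len(centers)), key=lambda j: math.dist(action, centers[j]))


# Toy 2D "actions" on a 5x5 grid with 0.5 spacing (made-up data).
actions = [(x * 0.5, y * 0.5) for x in range(5) for y in range(5)]
centers = k_disk_centers(actions, k=4, radius=1.0)
token = tokenize((0.1, 0.2), centers)
print(len(centers), token)
```

Each continuous action is then represented by the index of its nearest center, which is what lets training use cross-entropy over a discrete action vocabulary.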
The full Waymo Open Dataset and the full Impromptu-VLA QA data are large, so this project uses a small subset from Hugging Face (aaaaaap/unstructed) to keep the download under roughly 50–100 GB.

- Download (step 2 above):

      python download_waymo_subset.py

  Creates `data/waymo_subset/` with `waymo/waymo_train_shard_*.tar` and `waymo/waymo_val_shard_*.tar`.

- Extract (step 3 above) so the dataloader sees `.mp4` and `.npy` per sample. Then use the same `--dataset_dir data/waymo_subset` for the tokenizer and for training.
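
Before training, it can help to confirm that extraction produced a matching `.mp4` and `.npy` for every sample. The sketch below is a hypothetical helper (not part of the repository) that assumes the `dataset_dir/subset/split/` layout described above and reports sample ids that are missing one of the two files.

```python
from pathlib import Path


def find_unpaired_samples(dataset_dir, subset="Unconventional Dynamic Obstacles", split="train"):
    """Return the sorted sample ids that lack either the .mp4 or the .npy file."""
    split_dir = Path(dataset_dir) / subset / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"expected extracted samples in {split_dir}")
    videos = {p.stem for p in split_dir.glob("*.mp4")}
    actions = {p.stem for p in split_dir.glob("*.npy")}
    # Symmetric difference: ids present in one set but not the other.
    return sorted(videos ^ actions)
```

For example, `find_unpaired_samples("data/waymo_subset", split="val")` would return an empty list when every validation sample has both files.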