AutoVLA 4D

This repository contains a refactored framework that implements a 4D (spatiotemporal) visual backbone for the AutoVLA model.

Instead of a static image encoder (e.g., DINOv2 or SigLIP), this repository uses MCG-NJU/videomae-base to encode 16-frame video clips into spatiotemporal features, which are then passed to the open-source Qwen/Qwen2.5-VL-3B-Instruct vision-language model.
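As a rough illustration of what a 16-frame input implies, the sketch below picks 16 evenly spaced frame indices from a clip of arbitrary length. The function name `sample_frame_indices` and the uniform-spacing policy are assumptions made for this example; the repository's dataloader may select frames differently.

```python
def sample_frame_indices(num_frames_in_clip: int, num_samples: int = 16) -> list[int]:
    """Return `num_samples` evenly spaced frame indices spanning the clip.

    Hypothetical helper: uniform spacing is an assumption, not the
    repository's confirmed sampling policy.
    """
    if num_frames_in_clip < 1:
        raise ValueError("clip must contain at least one frame")
    if num_samples < 1:
        raise ValueError("num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames_in_clip - 1
    # Spread indices from the first to the last frame.
    # Clips shorter than `num_samples` simply repeat some frames.
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]
```

With this policy, a 31-frame clip yields every second frame, and a clip shorter than 16 frames repeats frames rather than failing.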

Getting Started

Install the requirements:
```
pip install -r requirements.txt
```
Download the Waymo subset (see Dataset Setup below).
```
python download_waymo_subset.py
```
This creates data/waymo_subset/ containing the tar shards.
Extract tar shards into the layout expected by the training dataset (dataset_dir/subset/split/*.mp4 and *.npy):
```
python scripts/extract_waymo_subset.py --dataset_dir data/waymo_subset
```
Output: data/waymo_subset/Unconventional Dynamic Obstacles/train/ and .../val/ with {id}.mp4 and {id}.npy per sample.
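To confirm the extraction produced the expected layout, a small check like the following can pair each {id}.mp4 with its {id}.npy. This is a minimal sketch: `find_samples` is a hypothetical helper, not part of the repository, and it assumes only the dataset_dir/subset/split layout described above.

```python
from pathlib import Path


def find_samples(dataset_dir, subset, split):
    """Pair {id}.mp4 with {id}.npy under dataset_dir/subset/split.

    Returns (pairs, unpaired_ids). Hypothetical helper for sanity checks.
    """
    split_dir = Path(dataset_dir) / subset / split
    videos = {p.stem: p for p in split_dir.glob("*.mp4")}
    actions = {p.stem: p for p in split_dir.glob("*.npy")}
    # Ids present in both sets form complete samples.
    paired_ids = sorted(videos.keys() & actions.keys())
    # Ids present in only one set indicate an incomplete extraction.
    unpaired_ids = sorted(videos.keys() ^ actions.keys())
    return [(videos[i], actions[i]) for i in paired_ids], unpaired_ids
```

Calling `find_samples("data/waymo_subset", "Unconventional Dynamic Obstacles", "train")` should return an empty `unpaired_ids` list if every sample was extracted completely.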
Generate the K-disk action vocabulary (required once before training). The tokenizer fits from either the tar files or the extracted .npy files:
```
python data/tokenizer.py --dataset_dir data/waymo_subset --output_file data/action_centers.pt
```
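For intuition, K-disk clustering is commonly described as greedily picking centers so that no two centers lie within a fixed radius of each other, then mapping each action to its nearest center. The sketch below illustrates that idea on 2-D points; it is not the code in data/tokenizer.py, and the function names, the 2-D action format, and the radius value are all assumptions.

```python
import math
import random


def k_disk_centers(points, k, radius, seed=0):
    """Greedily pick up to k centers that are pairwise more than `radius` apart.

    Illustrative only: the repository's tokenizer may differ in distance
    metric, action dimensionality, and stopping rule.
    """
    rng = random.Random(seed)
    candidates = list(points)
    rng.shuffle(candidates)  # visit candidates in random order
    centers = []
    for candidate in candidates:
        if len(centers) == k:
            break
        # Keep the candidate only if it lies outside every existing disk.
        if all(math.dist(candidate, c) > radius for c in centers):
            centers.append(candidate)
    return centers


def tokenize(action, centers):
    """Map an action to the index of its nearest center (its token id)."""
    return min(range(len(centers)), key=lambda i: math.dist(action, centers[i]))
```

The radius controls how coarse the vocabulary is: a larger radius gives fewer, more widely spaced action tokens.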
Train with SFT (cross-entropy over action tokens, QLoRA):
```
python scripts/train.py --dataset_dir data/waymo_subset --split train --batch_size 4 --lr 1e-5 --epochs 3
```
Optional: --subset (default: "Unconventional Dynamic Obstacles") must match the subset directory name created by the extraction step. Checkpoints are saved as models/projector_weights.pt and models/action_head_weights.pt.
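The SFT objective mentioned above, cross-entropy over action tokens, can be sketched in plain Python as follows. This is a conceptual illustration rather than the code in scripts/train.py: the function name is hypothetical, and masking non-action positions with -100 is an assumption that mirrors the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
import math


def action_token_cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over positions whose target is an action token.

    logits:  list of per-position score lists over the action vocabulary
    targets: list of target token ids; `ignore_index` marks positions
             (e.g. prompt tokens) that are excluded from the loss
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # not an action token, so it contributes no loss
        peak = max(row)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    if count == 0:
        raise ValueError("no action-token positions to supervise")
    return total / count
```

With uniform logits over a vocabulary of size V, the loss equals log V, which is a useful reference point when checking early training logs.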
Evaluate (e.g., PDMS in NAVSIM):
```
python scripts/eval.py
```
Visualize Model Predictions:
```
PYTHONPATH=$PWD python scripts/visualize_comparisons.py
```
Generates Bird's-Eye-View (BEV) trajectory plots comparing Ground Truth, BaselineVLA, and AutoVLA4D model predictions. Results are saved in the visualizations/ directory.

Dataset Setup: Waymo Subset (< 300 GB)

Both the full Waymo Open Dataset and the full Impromptu-VLA QA data are large. This project instead uses a small subset hosted on Hugging Face (aaaaaap/unstructed), which keeps the download to roughly 50–100 GB.

Download (the download step under Getting Started):
```
python download_waymo_subset.py
```
Creates data/waymo_subset/ with waymo/waymo_train_shard_*.tar and waymo/waymo_val_shard_*.tar.
Extract the shards (the extraction step under Getting Started) so that the dataloader finds one .mp4 and one .npy file per sample. Then pass the same --dataset_dir data/waymo_subset to both the tokenizer and the training script.
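Before extracting, it can help to confirm that the download produced shards for both splits. The sketch below only globs for the file name patterns listed above; `list_shards` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def list_shards(dataset_dir):
    """Return the downloaded tar shards grouped by split.

    Assumes the waymo/waymo_<split>_shard_*.tar naming described above.
    """
    shard_dir = Path(dataset_dir) / "waymo"
    return {
        split: sorted(shard_dir.glob(f"waymo_{split}_shard_*.tar"))
        for split in ("train", "val")
    }
```

An empty list for either split after calling `list_shards("data/waymo_subset")` suggests the download was incomplete.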
