Official implementation for GeoWorld-VLM: Geometry from World Models for Vision-Language Models.
This repository contains the training and evaluation code for GeoWorld-VLM, a VLM-side distillation framework that transfers geometry-aware structure from frozen camera-conditioned world models into VLM visual pathways. The method aligns post-projector image features with intermediate world-model representations while keeping the language backbone frozen, so spatial reasoning improves without changing core linguistic behavior. The codebase is organized for reproducibility: data preparation, model paths, training scripts, checkpoint export, and evaluation are separated into configurable entry points.
Paper: arXiv:2605.16713
Given an input image and a spatial reasoning question, GeoWorld-VLM enhances the spatial understanding of standard VLMs by injecting world-model priors at the feature-map level. Compared with original VLM features, GeoWorld-VLM produces more geometry-aware representations that improve spatial grounding and answer accuracy across diverse benchmarks.
Figure 1. Overview. GeoWorld-VLM injects world-model priors into VLM visual features, producing more geometry-aware representations and stronger spatial reasoning performance than strong baselines (raw Gemma-4, task-only fine-tuning, and DINO-augmented fine-tuning) on benchmarks such as What’sUp and VSR.
During training, GeoWorld-VLM fine-tunes only the VLM vision stack, including the vision encoder and multimodal projector. It aligns latent features from the VLM encoder with intermediate world-model representations, where the world model takes the input image, text prompt, and randomly sampled camera poses as input. At inference time, the world model is no longer needed, and GeoWorld-VLM runs as standard VLM inference.
Figure 2. GeoWorld-VLM. Training aligns VLM visual features with world-model intermediates while updating only vision-related blocks; inference does not require the world model.
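The freezing scheme described above (update the vision encoder and multimodal projector, keep the language backbone frozen) can be illustrated with a name-based filter. This is a minimal sketch rather than the repository's code: the prefixes `vision_tower.` and `multi_modal_projector.` are hypothetical placeholders, and the real parameter names depend on the Gemma-4 checkpoint layout.

```python
# Hypothetical name prefixes for the vision stack (assumption, not taken from the repo).
TRAINABLE_PREFIXES = ("vision_tower.", "multi_modal_projector.")


def is_trainable(param_name: str) -> bool:
    """Return True for parameters that belong to the vision encoder or projector."""
    return param_name.startswith(TRAINABLE_PREFIXES)


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [name for name in param_names if is_trainable(name)]
    frozen = [name for name in param_names if not is_trainable(name)]
    return trainable, frozen
```

In a PyTorch training loop, the same predicate would typically drive `requires_grad` for each named parameter, so that only the trainable group reaches the optimizer.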
On the What’sUp + VSR suite, GeoWorld-VLM consistently improves spatial reasoning across sub-benchmarks, outperforming the original VLM, task-only fine-tuning, and static DINOv3-feature distillation.
Figure 4. Overall comparison on the What’sUp + VSR suite. GeoWorld-VLM shows consistent gains across diverse spatial reasoning sub-benchmarks.
.
├── README.md
├── requirements.txt
├── configs/
│   └── paths.example.env
├── figures/
│   └── README.md
├── scripts/
│   ├── export_conda_env.sh
│   ├── train_single_image_whatsup_vsr.sh
│   ├── train_single_image_embspatial.sh
│   ├── train_double_image_sat.sh
│   ├── export_hf_checkpoint.sh
│   ├── eval_raw_model.sh
│   └── eval_ours_model.sh
└── code/
    ├── training/
    │   ├── train_gemma4_lingbot_spatial.py
    │   ├── train_embspatial_gemma4_lingbot_spatial.py
    │   ├── train_sat_gemma4_lingbot_double_image_qtype.py
    │   ├── gemma4_lingbot_spatial_model.py
    │   ├── export_hf_model.py
    │   └── dataset_*.py
    └── evaluation/
        ├── eval_gemma.py
        ├── eval_embspatial_gemma.py
        └── eval_sat_gemma_qtype.py
We recommend using a fresh conda environment with CUDA-enabled PyTorch.
conda create -n adaptvis python=3.12 -y
conda activate adaptvis
pip install -r requirements.txt
Install any extra dependencies required by your local LingBot-World-Fast checkout, following the LingBot-World instructions.
Note: some script and variable names still use the legacy adaptvis prefix for compatibility with earlier experiments.
Download or prepare the following local model directories:
- Gemma-4 VLM: base VLM used as the student and for raw evaluation.
- LingBot-World-Fast: frozen world-model teacher used for feature alignment.
Create a local path config:
cp configs/paths.example.env configs/paths.local.env
Then edit:
export GEMMA_MODEL="/path/to/gemma-4-E4B-it"
export LINGBOT_MODEL="/path/to/lingbot-world-fast"
export LINGBOT_CODE="/path/to/lingbot-world"
export OUTPUT_DIR="/path/to/outputs"
export RESULTS_DIR="/path/to/results"
The code assumes offline/local loading by default (HF_DATASETS_OFFLINE=1, TRANSFORMERS_OFFLINE=1, HF_HUB_OFFLINE=1). If you want to download models through Hugging Face at runtime, disable those environment variables in your shell.
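Before launching a run, it can help to confirm that every path in the local config actually exists. The sketch below is illustrative and not part of the repository: `parse_env_file` and `missing_paths` are hypothetical helpers that read `export KEY="value"` lines like the ones above, without performing any shell variable expansion.

```python
import os
import shlex


def parse_env_file(text: str) -> dict:
    """Parse lines of the form `export KEY="value"` into a dict.

    Blank lines and `#` comments are skipped; no variable expansion is done.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # shlex.split strips surrounding quotes the way a POSIX shell would.
        parts = shlex.split(value)
        values[key.strip()] = parts[0] if parts else ""
    return values


def missing_paths(values: dict, keys) -> list:
    """Return the keys whose value is unset or does not exist on disk."""
    return [k for k in keys if not values.get(k) or not os.path.exists(values[k])]
```

A typical use would be to read `configs/paths.local.env`, pass the result to `missing_paths` with the model keys (`GEMMA_MODEL`, `LINGBOT_MODEL`, `LINGBOT_CODE`), and stop early if the returned list is non-empty.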
Set the following paths in configs/paths.local.env.
Download path:
Expected variables:
export ADAPTVIS_DATA_DIR="/path/to/data"
export ADAPTVIS_PROMPTS_DIR="/path/to/prompts"
The training/evaluation code expects the data and prompt files used by dataset_adaptvis_mcq.py, including:
- Controlled_Images_A
- Controlled_Images_B
- COCO_QA_one_obj
- COCO_QA_two_obj
- VG_QA_one_obj
- VG_QA_two_obj
- VSR
The split file should be placed at:
splits/data_split.json
or passed through SPLIT_FILE=/path/to/data_split.json.
Download path:
Expected variables:
export EMBSPATIAL_JSON="/path/to/embspatial.json"
The split file should be placed at:
splits/embspatial_split.json
or passed through SPLIT_FILE=/path/to/embspatial_split.json.
Download path:
Expected variables:
export SAT_ROOT="/path/to/sat_DATA"
The SAT root should contain the val and test splits loadable by datasets.load_from_disk. The split file should be placed at:
splits/sat_split.json
or passed through SPLIT_FILE=/path/to/sat_split.json.
To create a SAT split:
PYTHONPATH=code:code/training python code/training/make_sat_split.py \
--sat-root "$SAT_ROOT" \
--output splits/sat_split.json \
--train-size 2000 \
--val-eval-size 1000 \
--seed 42
All scripts read configs/paths.local.env. Most hyperparameters can be overridden through environment variables.
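The environment-variable override pattern can be sketched as follows. This is an illustration under stated assumptions, not the scripts' actual logic: `read_overrides` is a hypothetical helper, the variable names mirror the overrides used with the training scripts, the `EPOCHS` and `BATCH_SIZE` defaults follow the values reported for the released scripts, and the `GPUS` and `EXP_NAME` defaults are placeholders.

```python
import os


def read_overrides(env=None) -> dict:
    """Read training settings from the environment, falling back to defaults.

    The GPUS and EXP_NAME defaults are assumptions made for illustration.
    """
    env = os.environ if env is None else env
    gpus = env.get("GPUS", "0,1")
    return {
        "gpus": [int(g) for g in gpus.split(",") if g.strip()],
        "exp_name": env.get("EXP_NAME", "gemma4_lingbot_whatsup_vsr"),
        "epochs": int(env.get("EPOCHS", "3")),
        "batch_size": int(env.get("BATCH_SIZE", "4")),
    }
```

Passing a plain dict as `env` keeps the function easy to test without touching the real process environment.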
bash scripts/train_single_image_whatsup_vsr.sh
Useful overrides:
GPUS=0,1 EXP_NAME=gemma4_lingbot_whatsup_vsr EPOCHS=3 BATCH_SIZE=4 \
bash scripts/train_single_image_whatsup_vsr.sh
Default key settings:
teacher_mode=i2v
i2v_num_frames=9
num_teacher_steps=2
wan_hook_block_index=24
lambda_align=0.1
lambda_preserve=0.05
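One plausible reading of `lambda_align` and `lambda_preserve` is that they weight two auxiliary terms added to the task loss. The sketch below assumes that form, with the preservation term taken as a mean squared error between fine-tuned and original visual features. Both the additive form and the choice of MSE are assumptions for illustration; the exact objective is defined in the training code and the paper.

```python
LAMBDA_ALIGN = 0.1      # lambda_align from the default settings
LAMBDA_PRESERVE = 0.05  # lambda_preserve from the default settings


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(task_loss, align_loss, tuned_feat, original_feat,
               lambda_align=LAMBDA_ALIGN, lambda_preserve=LAMBDA_PRESERVE):
    """Assumed objective: task + lambda_align * align + lambda_preserve * preserve."""
    preserve_loss = mse(tuned_feat, original_feat)
    return task_loss + lambda_align * align_loss + lambda_preserve * preserve_loss
```

With these small weights, the task loss dominates while the alignment and preservation terms act as regularizers on the visual features.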
bash scripts/train_single_image_embspatial.sh
Override the split file if needed:
SPLIT_FILE=/path/to/embspatial_split.json bash scripts/train_single_image_embspatial.sh
bash scripts/train_double_image_sat.sh
The default double-image method is:
student: image encoder + projector + MLP for two image features
teacher: first two frames are the two input images; remaining frames are blank/noise
alignment: mean-pool the two student features and align to one teacher feature
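The alignment step for two images can be sketched on plain Python lists. This is a simplified illustration: the student MLP is omitted, each feature is treated as a single vector, and cosine distance is an assumed choice of alignment metric, not necessarily the one used in the training code.

```python
import math


def mean_pool(feat_a, feat_b):
    """Element-wise mean of two equal-length feature vectors."""
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    return [(a + b) / 2.0 for a, b in zip(feat_a, feat_b)]


def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity; close to 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)


def double_image_alignment(student_a, student_b, teacher):
    """Pool the two student features, then compare the result to one teacher feature."""
    pooled = mean_pool(student_a, student_b)
    return cosine_distance(pooled, teacher)
```

Pooling before comparison means the teacher supervises the pair jointly and does not need a separate target for each image.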
Useful overrides:
SAT_QTYPE_FILTER=all bash scripts/train_double_image_sat.sh
SAT_QTYPE_FILTER=action_sequence bash scripts/train_double_image_sat.sh
SAT_QTYPE_FILTER=non_action_sequence bash scripts/train_double_image_sat.sh
Training saves trainable_state.pt files that contain only updated trainable weights and alignment modules. To evaluate with standard Hugging Face loading, merge the trainable weights into the base model:
CKPT=/path/to/outputs/gemma4_lingbot_whatsup_vsr/epoch_3/trainable_state.pt \
EXPORT_DIR=/path/to/outputs/gemma4_lingbot_whatsup_vsr_hf \
bash scripts/export_hf_checkpoint.sh
The exported directory can be passed directly to the evaluation scripts as MODEL_PATH or OURS_MODEL.
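Conceptually, the export step overlays the saved trainable weights on the base model's weights and leaves out the training-only alignment modules. The sketch below shows that idea on plain dictionaries. It is not the logic of export_hf_model.py: `merge_trainable_state` is a hypothetical helper, and the `align_mlp.` key prefix is an assumed naming convention.

```python
# Hypothetical key prefix for the training-only alignment modules (assumption).
ALIGN_PREFIX = "align_mlp."


def merge_trainable_state(base_state: dict, trainable_state: dict,
                          align_prefix: str = ALIGN_PREFIX) -> dict:
    """Overlay fine-tuned weights on the base weights, dropping alignment modules."""
    merged = dict(base_state)  # shallow copy; the base dict is left untouched
    for key, value in trainable_state.items():
        if key.startswith(align_prefix):
            continue  # training-only module, not part of the exported VLM
        if key not in base_state:
            raise KeyError(f"unexpected key not present in base model: {key}")
        merged[key] = value
    return merged
```

Raising on unknown keys is a deliberate choice in this sketch: a silent mismatch between checkpoint and base model would otherwise produce an export that loads but ignores the fine-tuned weights.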
We provide two public evaluation entry points:
- scripts/eval_raw_model.sh: evaluates the original base model.
- scripts/eval_ours_model.sh: evaluates an exported GeoWorld-VLM model.
TASK=whatsup_vsr bash scripts/eval_raw_model.sh
TASK=embspatial bash scripts/eval_raw_model.sh
TASK=sat bash scripts/eval_raw_model.sh
OURS_MODEL=/path/to/exported_hf_model TASK=whatsup_vsr bash scripts/eval_ours_model.sh
OURS_MODEL=/path/to/exported_hf_model TASK=embspatial bash scripts/eval_ours_model.sh
OURS_MODEL=/path/to/exported_hf_model TASK=sat bash scripts/eval_ours_model.sh
SAT supports optional subset evaluation:
OURS_MODEL=/path/to/exported_hf_model TASK=sat SAT_QTYPE_FILTER=action_sequence bash scripts/eval_ours_model.sh
OURS_MODEL=/path/to/exported_hf_model TASK=sat SAT_QTYPE_FILTER=non_action_sequence bash scripts/eval_ours_model.sh
- The default seed is 42 unless overridden by script arguments.
- We train for 3 epochs with batch size 4 in the released scripts.
- We use two GPUs for LingBot teacher alignment: one for the student VLM and one for the teacher.
- Use export_hf_checkpoint.sh before evaluation whenever you train with alignment modules.
- The alignment MLPs are used only during training and are not exported into the final Hugging Face VLM.
@misc{gu2026geoworldvlmgeometryworldmodels,
title={GeoWorld-VLM: Geometry from World Models for Vision-Language Models},
author={Renjie Gu and Kaichen Zhou and Yan Luo and Mengyu Wang},
year={2026},
eprint={2605.16713},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.16713},
}
This code is released for research use. Please check the licenses of Gemma-4, LingBot-World-Fast, What’sUp, VSR, EmbSpatial, and SAT before redistributing models or datasets.


