Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk
Imperial College London
UniScene3D learns transferable 3D scene representations from multi-view colored pointmaps, unifying RGB appearance and world-aligned geometry within a single ViT encoder. We evaluate its effectiveness across diverse 3D scene understanding tasks under zero-shot, few-shot, and task-specific fine-tuning settings.
- News
- Key Takeaways
- Expected Repository Structure
- Installation
- Data Preparation
- Pretraining
- Evaluation
- Acknowledgements
- Citation
- License
## News

- 🚀 **2026-04-02**: Code, pretrained model, and pretraining/evaluation data are now available.
## Key Takeaways

- **Core Question:** Unlike 2D vision, 3D scene understanding still lacks a generalizable encoder like CLIP, largely due to the scarcity of large-scale 3D pretraining data. This raises the question: can a 2D vision encoder be extended into a general 3D scene encoder without extensive 3D pretraining?
- **Preliminary Finding:** Pointmaps encode world-frame geometry like point clouds while preserving an image-like format compatible with 2D vision models. Our initial study shows that pretrained 2D vision weights are also beneficial for learning pointmap features.
- **Model Contribution:** UniScene3D extends pretrained CLIP models to learn unified 3D scene representations from pixel-aligned, multi-view colored pointmaps by jointly encoding geometry and appearance.
- **Key Training Idea:** We introduce cross-view geometric alignment and grounded view alignment to enforce geometric and semantic consistency across viewpoints.
- **Result:** The learned representations effectively combine complementary information from images and pointmaps, generalize across diverse scenes, and transfer well to a broad range of downstream 3D tasks.
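To make the "image-like format" point concrete, here is a minimal sketch (not the authors' implementation) of how a colored pointmap keeps per-pixel world-frame XYZ alongside RGB, so the result can be patchified exactly like an image for a ViT. The image size, patch size, and random values are illustrative assumptions.

```python
import numpy as np

# Assumed image size and ViT patch size (illustrative values only).
H, W, P = 224, 224, 16

rgb = np.random.rand(H, W, 3).astype(np.float32)        # per-pixel appearance
xyz = np.random.rand(H, W, 3).astype(np.float32)        # per-pixel world-frame geometry
colored_pointmap = np.concatenate([rgb, xyz], axis=-1)  # (H, W, 6), still image-shaped

# Standard ViT-style patchification: non-overlapping P x P patches,
# each flattened into one token vector.
tokens = (
    colored_pointmap
    .reshape(H // P, P, W // P, P, 6)
    .transpose(0, 2, 1, 3, 4)
    .reshape((H // P) * (W // P), P * P * 6)
)
print(tokens.shape)  # → (196, 1536)
```

Because the geometry never leaves the pixel grid, the same patch-embedding machinery (and pretrained 2D weights) used for RGB images applies to the 6-channel input with only the first projection layer widened.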
## Expected Repository Structure

```
UniScene3D/
├── checkpoints/           # Downloaded pretrained checkpoints for evaluation/fine-tuning
├── configs/
│   ├── all_pretrain.yaml
│   └── finetune/
├── dataset/               # Language data for pretraining and evaluation
├── scripts/               # Runnable shell entry points
├── src/
│   ├── data/
│   ├── evaluator/
│   ├── fg-clip/           # Local FG-CLIP code/assets
│   ├── model/
│   ├── modules/
│   ├── optim/
│   └── trainer/
├── launch.py              # Launcher for python / accelerate / submitit
├── run.py                 # Main training/evaluation entry point
└── requirements.txt
```
## Installation

```bash
conda create -n uniscene3d python=3.10 -y
conda activate uniscene3d
```

Install a PyTorch build that matches your CUDA setup. The pinned versions used in this repo are:

```
torch==2.5.1
torchvision==0.20.1
```

For example, if you use pip wheels from PyTorch:

```bash
pip install torch==2.5.1 torchvision==0.20.1
pip install -r requirements.txt
```

## Data Preparation

Please download the `dataset/` folder from Hugging Face at MatchLab/ScenePoint and place it at the repository root. This folder provides the language data required for pretraining and evaluation:

- `dataset/refer`
- `dataset/retrieval`
- `dataset/classification`
- dataset metadata used by the training and evaluation scripts
The scene data are hosted on the same Hugging Face dataset. When you run the training/evaluation scripts, the required scene assets will be downloaded automatically and cached locally.
The processed scene data are derived from the original ScanNet, 3RScan, and ARKitScenes datasets. Please also refer to their official websites for the original data access terms and licenses.
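A quick sanity check on the manual part of the setup can save a confusing failure later. The helper below is a hypothetical sketch (not part of the repo) that verifies the `dataset/` subfolders named above exist at the expected location; the function name and return convention are assumptions.

```python
from pathlib import Path

def check_dataset_root(root: str = "dataset") -> list:
    """Return the names of required dataset/ subfolders that are missing.

    Subfolder names follow the README: refer, retrieval, classification.
    """
    required = ["refer", "retrieval", "classification"]
    root_path = Path(root)
    return [name for name in required if not (root_path / name).is_dir()]

missing = check_dataset_root("dataset")
if missing:
    print("Missing dataset subfolders:", missing)
```

Run it from the repository root after placing `dataset/`; an empty result means the language data is where the training and evaluation scripts expect it.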
## Pretraining

The default pretraining recipe is defined in `configs/all_pretrain.yaml`.

Run:

```bash
bash scripts/pretraining/pretrain.sh
```

By default, experiment outputs are written under `results/`, and the runtime config is saved into each experiment directory by `run.py`.
## Evaluation

Please download the released model checkpoint and place it under `checkpoints/` before running evaluation or fine-tuning.
```bash
bash scripts/view_retrieval/view_ret.sh
bash scripts/scene_retrieval/scene_ret.sh
```

Zero-shot:

```bash
bash scripts/scene_classification/zero_shot_scene_cls.sh
```

Few-shot:

```bash
bash scripts/scene_classification/few_shot_scene_cls.sh
```

The shared evaluation environment is configured in `scripts/spatial_bench_common.sh`. Important environment variables include:
- `UNISCENE3D_CKPT`: path to the UniScene3D checkpoint
- `HF_REPO_ID`: Hugging Face dataset repo id for scene assets, default `MatchLab/ScenePoint`
- `PM_KEY`: default pointmap key, `point_map`
- `RGB_KEY`: default RGB key, `color_images`
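For example, the shared environment can be overridden before launching any evaluation script. The variable names come from `scripts/spatial_bench_common.sh`; the checkpoint filename below is an illustrative placeholder, not a released filename.

```shell
# Illustrative overrides; only the variable names are taken from the repo.
export UNISCENE3D_CKPT="checkpoints/uniscene3d.pth"   # path to the downloaded checkpoint (placeholder name)
export HF_REPO_ID="MatchLab/ScenePoint"               # Hugging Face dataset repo for scene assets
export PM_KEY="point_map"                             # pointmap key in the scene data
export RGB_KEY="color_images"                         # RGB key in the scene data
```

With these exported, the evaluation scripts above pick up the custom checkpoint path and data keys instead of their defaults.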
Run the provided launchers:

```bash
bash scripts/vqa3d/scanqa.sh
bash scripts/vqa3d/sqa3d.sh
bash scripts/vqa3d/hypo3d.sh
```

## Acknowledgements

We sincerely thank the authors and maintainers of SceneVerse, 3D-VisTA, and FG-CLIP for releasing their code, models, and research resources. UniScene3D builds on ideas and infrastructure from these prior projects, and their open-source contributions have been invaluable to this work.
## Citation

If you find this repository useful, please cite the paper:

```bibtex
@inproceedings{mao2026uniscene3d,
  title     = {Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding},
  author    = {Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  booktitle = {arXiv},
  year      = {2026}
}
```

## License

This project is released under the license in `LICENSE`.

