Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk
Imperial College London
UniScene3D learns transferable 3D scene representations from multi-view colored pointmaps, unifying RGB appearance and world-aligned geometry within a single ViT encoder. We evaluate its effectiveness across diverse 3D scene understanding tasks under zero-shot, few-shot, and task-specific fine-tuning settings.
- News
- Key Takeaways
- Expected Repository Structure
- Installation
- Data Preparation
- Pretraining
- Evaluation
- Acknowledgements
- Citation
- License
## News

- 🚀 **2026-04-02**: Code, pretrained model, and pretraining/evaluation data are now available.
## Key Takeaways

- **Core Question:** Unlike 2D vision, 3D scene understanding still lacks a generalizable encoder like CLIP, largely due to the scarcity of large-scale 3D pretraining data. This raises the question: can a 2D vision encoder be extended into a general 3D scene encoder without extensive 3D pretraining?
- **Preliminary Finding:** Pointmaps encode world-frame geometry like point clouds while preserving an image-like format compatible with 2D vision models. Our initial study shows that pretrained 2D vision weights are also beneficial for learning pointmap features.
- **Model Contribution:** UniScene3D extends pretrained CLIP models to learn unified 3D scene representations from pixel-aligned, multi-view colored pointmaps by jointly encoding geometry and appearance.
- **Key Training Idea:** We introduce cross-view geometric alignment and grounded view alignment to enforce geometric and semantic consistency across viewpoints.
- **Result:** The learned representations effectively combine complementary information from images and pointmaps, generalize across diverse scenes, and transfer well to a broad range of downstream 3D tasks.
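To make the "image-like format" point concrete, here is a minimal sketch (not the authors' implementation) of how a colored pointmap keeps per-pixel world-frame XYZ alongside RGB, so the result can be patchified exactly like an image for a ViT. The image size, patch size, and random values are illustrative assumptions.

```python
import numpy as np

# Assumed image size and ViT patch size (illustrative values only).
H, W, P = 224, 224, 16

rgb = np.random.rand(H, W, 3).astype(np.float32)        # per-pixel appearance
xyz = np.random.rand(H, W, 3).astype(np.float32)        # per-pixel world-frame geometry
colored_pointmap = np.concatenate([rgb, xyz], axis=-1)  # (H, W, 6), still image-shaped

# Standard ViT-style patchification: non-overlapping P x P patches,
# each flattened into one token vector.
tokens = (
    colored_pointmap
    .reshape(H // P, P, W // P, P, 6)
    .transpose(0, 2, 1, 3, 4)
    .reshape((H // P) * (W // P), P * P * 6)
)
print(tokens.shape)  # → (196, 1536)
```

Because the geometry never leaves the pixel grid, the same patch-embedding machinery (and pretrained 2D weights) used for RGB images applies to the 6-channel input with only the first projection layer widened.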
## Expected Repository Structure

```
UniScene3D/
├── checkpoints/           # Downloaded pretrained checkpoints for evaluation/fine-tuning
├── configs/
│   ├── all_pretrain.yaml
│   └── finetune/
├── dataset/               # Language data for pretraining and evaluation
├── scripts/               # Runnable shell entry points
├── src/
│   ├── data/
│   ├── evaluator/
│   ├── fg-clip/           # Local FG-CLIP code/assets
│   ├── model/
│   ├── modules/
│   ├── optim/
│   └── trainer/
├── launch.py              # Launcher for python / accelerate / submitit
├── run.py                 # Main training/evaluation entry point
└── requirements.txt
```
## Installation

```bash
conda create -n uniscene3d python=3.10 -y
conda activate uniscene3d
```

Install a PyTorch build that matches your CUDA setup. The pinned versions used in this repo are:

```
torch==2.5.1
torchvision==0.20.1
```

For example, if you use pip wheels from PyTorch:

```bash
pip install torch==2.5.1 torchvision==0.20.1
pip install -r requirements.txt
```

## Data Preparation

Please download the `dataset/` folder from Hugging Face at MatchLab/ScenePoint and place it at the repository root. This folder provides the language data required for pretraining and evaluation:

- `dataset/refer`
- `dataset/retrieval`
- `dataset/classification`
- dataset metadata used by the training and evaluation scripts
The scene data are hosted on the same Hugging Face dataset. When you run the training/evaluation scripts, the required scene assets will be downloaded automatically and cached locally.
The processed scene data are derived from the original ScanNet, 3RScan, and ARKitScenes datasets. Please also refer to their official websites for the original data access terms and licenses.
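A quick sanity check on the manual part of the setup can save a confusing failure later. The helper below is a hypothetical sketch (not part of the repo) that verifies the `dataset/` subfolders named above exist at the expected location; the function name and return convention are assumptions.

```python
from pathlib import Path

def check_dataset_root(root: str = "dataset") -> list:
    """Return the names of required dataset/ subfolders that are missing.

    Subfolder names follow the README: refer, retrieval, classification.
    """
    required = ["refer", "retrieval", "classification"]
    root_path = Path(root)
    return [name for name in required if not (root_path / name).is_dir()]

missing = check_dataset_root("dataset")
if missing:
    print("Missing dataset subfolders:", missing)
```

Run it from the repository root after placing `dataset/`; an empty result means the language data is where the training and evaluation scripts expect it.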
## Pretraining

The default pretraining recipe is defined in `configs/all_pretrain.yaml`.

Run:

```bash
bash scripts/pretraining/pretrain.sh
```

By default, experiment outputs are written under `results/`, and the runtime config is saved into each experiment directory by `run.py`.
## Evaluation

Please download the released model checkpoint and place it under `checkpoints/` before running evaluation or fine-tuning.
```bash
bash scripts/view_retrieval/view_ret.sh
bash scripts/scene_retrieval/scene_ret.sh
```

Zero-shot:

```bash
bash scripts/scene_classification/zero_shot_scene_cls.sh
```

Few-shot:

```bash
bash scripts/scene_classification/few_shot_scene_cls.sh
```

The shared evaluation environment is configured in `scripts/spatial_bench_common.sh`. Important environment variables include:
- `UNISCENE3D_CKPT`: path to the UniScene3D checkpoint
- `HF_REPO_ID`: Hugging Face dataset repo id for scene assets, default `MatchLab/ScenePoint`
- `PM_KEY`: default pointmap key, `point_map`
- `RGB_KEY`: default RGB key, `color_images`
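For example, the shared environment can be overridden before launching any evaluation script. The variable names come from `scripts/spatial_bench_common.sh`; the checkpoint filename below is an illustrative placeholder, not a released filename.

```shell
# Illustrative overrides; only the variable names are taken from the repo.
export UNISCENE3D_CKPT="checkpoints/uniscene3d.pth"   # path to the downloaded checkpoint (placeholder name)
export HF_REPO_ID="MatchLab/ScenePoint"               # Hugging Face dataset repo for scene assets
export PM_KEY="point_map"                             # pointmap key in the scene data
export RGB_KEY="color_images"                         # RGB key in the scene data
```

With these exported, the evaluation scripts above pick up the custom checkpoint path and data keys instead of their defaults.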
Run the provided launchers:

```bash
bash scripts/vqa3d/scanqa.sh
bash scripts/vqa3d/sqa3d.sh
bash scripts/vqa3d/hypo3d.sh
```

## Acknowledgements

We sincerely thank the authors and maintainers of SceneVerse, 3D-VisTA, and FG-CLIP for releasing their code, models, and research resources. UniScene3D builds on ideas and infrastructure from these prior projects, and their open-source contributions have been invaluable to this work.
## Citation

If you find this repository useful, please cite the paper:

```bibtex
@inproceedings{mao2026uniscene3d,
  title     = {Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding},
  author    = {Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  booktitle = {arXiv},
  year      = {2026}
}
```

## License

This project is released under the license in `LICENSE`.

