Group3D

MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Youbin Kim¹ · Jinho Park¹ · Hogun Park¹ · Eunbyung Park²

¹ Sungkyunkwan University ² Yonsei University

ECCV 2026


Installation

1. Clone the repository

git clone https://github.com/Ubin108/Group3D.git --recursive
cd Group3D

2. Install dependencies

conda create -n group3d python=3.12
conda activate group3d

pip install torch==2.7.0+cu118 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e third_party/Depth-Anything-3
pip install git+https://github.com/QitaoZhao/gsplat.git --no-build-isolation
pip install -e third_party/sam3

3. HuggingFace login (for SAM3 weights)

SAM3 model weights are gated on HuggingFace. Visit the SAM3 model page, agree to share your contact information with Meta, then log in:

hf auth login

4. Set up API keys

Create a .env file in the project root:

OPENAI_API_KEY=sk-...
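
How the pipeline reads this file is not described here. As a minimal sketch, assuming a plain KEY=VALUE file, the hypothetical helper `load_env_file` below (not part of this repository) loads the entries so you can confirm the key is visible before launching a run:

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: read KEY=VALUE lines from a .env file.

    Values are copied into os.environ only if the variable is not
    already set. Returns the parsed entries as a dict.
    """
    env_path = Path(path)
    loaded = {}
    if not env_path.is_file():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

After calling `load_env_file()` from the project root, `os.environ.get("OPENAI_API_KEY")` should return your key; if it returns `None`, the file is missing or misnamed.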

Data Preparation

1. Download ScanNetv2

Download ScanNetv2 and place it as follows:

data/ScanNetv2/scans/
└── scene0011_00/
    ├── color/
    ├── depth/
    ├── intrinsic/
    └── pose/

2. Frame sampling

Sample 128 frames uniformly from each scene. This creates video_color_128/ under each scene directory.

python preprocess/sample_frames.py \
    --root data/ScanNetv2/scans \
    --target_n 128
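
To illustrate what uniform sampling means here, the sketch below picks evenly spaced frame indices. It is not the code in preprocess/sample_frames.py: the function name `uniform_frame_indices`, the inclusion of both the first and last frame, and the behavior for scenes with fewer than `target_n` frames are all assumptions.

```python
def uniform_frame_indices(num_frames, target_n=128):
    """Return up to target_n indices spread evenly over range(num_frames).

    Assumption: the first and last frames are always included, and a
    scene with fewer than target_n frames keeps every frame.
    """
    if num_frames <= 0 or target_n <= 0:
        return []
    if num_frames <= target_n:
        return list(range(num_frames))
    if target_n == 1:
        return [0]
    # Integer arithmetic avoids floating-point rounding and guarantees
    # strictly increasing indices, since the spacing exceeds 1.
    return [(i * (num_frames - 1)) // (target_n - 1) for i in range(target_n)]
```

For example, `uniform_frame_indices(10, 4)` selects frames 0, 3, 6, and 9.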

3. Depth estimation & point cloud building

Run Depth-Anything-3 on the sampled frames. This creates point_cloud/ with depth maps, confidence maps, and estimated camera extrinsics.

python preprocess/build_pointcloud_da3.py \
    --root data/ScanNetv2/scans \
    --target_n 128
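
For intuition about how depth maps, confidence maps, and camera extrinsics combine into a point cloud, here is a generic pinhole back-projection sketch in plain Python. It is not the implementation in preprocess/build_pointcloud_da3.py. The function name `depth_to_points`, the `min_conf` threshold, and the convention that the extrinsics are given as a 4x4 camera-to-world matrix are assumptions made for illustration.

```python
def depth_to_points(depth_map, conf_map, intrinsics, cam_to_world, min_conf=0.5):
    """Back-project a depth map into world-space 3D points.

    depth_map, conf_map: nested lists indexed as [row][col].
    intrinsics: (fx, fy, cx, cy) of a pinhole camera.
    cam_to_world: 4x4 matrix as nested lists (assumed convention).
    min_conf: hypothetical threshold below which pixels are dropped.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for v, (depth_row, conf_row) in enumerate(zip(depth_map, conf_map)):
        for u, (depth, conf) in enumerate(zip(depth_row, conf_row)):
            # Ignore invalid depth and low-confidence pixels.
            if depth <= 0 or conf < min_conf:
                continue
            # Pinhole model: pixel (u, v) at this depth in camera coordinates.
            x = (u - cx) * depth / fx
            y = (v - cy) * depth / fy
            # Apply rotation and translation from the top three matrix rows.
            points.append(tuple(
                row[0] * x + row[1] * y + row[2] * depth + row[3]
                for row in cam_to_world[:3]
            ))
    return points
```

A real implementation would vectorize this over whole images; the loop form is used here only to keep each step visible.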

4. Point cloud alignment

Align the point cloud to the scene coordinate frame. This creates point_cloud_aligned_pose_free/ (or point_cloud_aligned_pose_known/).

# Pose-free (uses DA3 estimated poses)
python preprocess/align_pointcloud.py \
    --dataset_root data/ScanNetv2/scans \
    --pose_mode pose_free

# Pose-known (uses ground-truth poses)
python preprocess/align_pointcloud.py \
    --dataset_root data/ScanNetv2/scans \
    --pose_mode pose_known

Final dataset structure

data/ScanNetv2/scans/
└── scene0011_00/
    ├── color/
    ├── depth/
    ├── intrinsic/
    ├── pose/
    ├── video_color_128/
    ├── point_cloud/
    ├── point_cloud_aligned_pose_free/
    └── point_cloud_aligned_pose_known/
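
Before running the pipeline it can help to confirm that preprocessing produced every directory listed above. The sketch below is a hypothetical check, not a script shipped with the repository; the helper name `missing_subdirs` is an assumption, while the directory names are taken from the tree above.

```python
from pathlib import Path

# Directories every preprocessed scene should contain (from the tree above).
COMMON_SUBDIRS = (
    "color", "depth", "intrinsic", "pose", "video_color_128", "point_cloud",
)
# The aligned point cloud directory depends on the pose mode.
ALIGNED_SUBDIRS = {
    "pose_free": "point_cloud_aligned_pose_free",
    "pose_known": "point_cloud_aligned_pose_known",
}


def missing_subdirs(scene_dir, pose_mode="pose_free"):
    """Return the expected subdirectories that are absent from scene_dir."""
    scene_dir = Path(scene_dir)
    required = COMMON_SUBDIRS + (ALIGNED_SUBDIRS[pose_mode],)
    return [name for name in required if not (scene_dir / name).is_dir()]
```

For example, `missing_subdirs("data/ScanNetv2/scans/scene0011_00")` returns an empty list once all four preprocessing steps have completed in pose-free mode.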

Running Group3D

Run the full pipeline

# All scenes
python run.py --config configs/pose_free.yaml

# Single scene
python run.py --config configs/pose_free.yaml --scene scene0011_00
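
If you only want to process a subset of scenes, such as those in a split file, one option is to call the single-scene command in a loop. The sketch below assumes the split file contains one scene ID per line; the helper names `read_scene_ids`, `build_run_command`, and `run_scenes` are hypothetical. Only the `--config` and `--scene` flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def read_scene_ids(split_txt):
    """Read scene IDs from a text file, assuming one ID per line."""
    lines = Path(split_txt).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_run_command(scene_id, config="configs/pose_free.yaml"):
    """Build the single-scene run.py command as an argument list."""
    return [sys.executable, "run.py", "--config", config, "--scene", scene_id]


def run_scenes(split_txt, config="configs/pose_free.yaml"):
    """Run the pipeline scene by scene, stopping at the first failure."""
    for scene_id in read_scene_ids(split_txt):
        subprocess.run(build_run_command(scene_id, config), check=True)
```

Calling `run_scenes("data/ScanNetv2/scannet_val.txt")` from the project root would then process each listed scene in turn.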

Evaluation

ScanNet20

Before evaluating, generate the ground-truth (GT) bounding boxes; see eval/README.md for the required files and instructions.

python eval/evaluation.py \
    --pred_dir results \
    --gt_dir data/ScanNetv2/scannet_20 \
    --split_txt data/ScanNetv2/scannet_val.txt \
    [--gt_label_mode ov3det] [--per_class] [--verbose]

--gt_dir: directory containing GT .npy files ({scene_id}_bbox_scannet20.npy or {scene_id}_aligned_bbox.npy)
--split_txt: val split scene list (312 scenes)
--gt_label_mode: ov3det (default, class IDs 0–19) or nyu40
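
A missing GT file is easy to overlook until evaluation fails. The sketch below checks that each scene has one of the two GT file names listed for --gt_dir. The helper names `find_gt_file` and `scenes_missing_gt` are hypothetical, and the order in which the two file names are tried is an assumption, not necessarily what eval/evaluation.py does.

```python
from pathlib import Path

# GT file name suffixes documented for --gt_dir; lookup order is an assumption.
GT_SUFFIXES = ("_bbox_scannet20.npy", "_aligned_bbox.npy")


def find_gt_file(gt_dir, scene_id):
    """Return the first existing GT file for scene_id, or None."""
    for suffix in GT_SUFFIXES:
        candidate = Path(gt_dir) / f"{scene_id}{suffix}"
        if candidate.is_file():
            return candidate
    return None


def scenes_missing_gt(gt_dir, scene_ids):
    """Return the scene IDs that have no GT file in gt_dir."""
    return [sid for sid in scene_ids if find_gt_file(gt_dir, sid) is None]
```

Running `scenes_missing_gt("data/ScanNetv2/scannet_20", scene_ids)` with the IDs from the val split should return an empty list before you start the evaluation.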

ScanNet200

python eval/evaluation.py --benchmark scannet200 \
    --pred_dir results \
    --scan_root /path/to/scans_200/val \
    --gt_dir /path/to/scannet/scans \
    --split_txt data/ScanNetv2/scannet_val.txt \
    [--per_class] [--verbose]

--scan_root: directory containing scene PLY files (e.g. scans_200/val/)
--gt_dir: ScanNet scans directory containing {scene_id}/ subdirs with .aggregation.json and .segs.json
--split_txt: val split scene list (optional; if omitted, scenes are inferred from --pred_dir)

Visualization

Render detected instances as 3D bounding boxes:

python visualization.py --scene scene0011_00 --config configs/pose_free.yaml

Citation

If you find this work useful, please cite:

@article{kim2026group3d,
  title     = {Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection},
  author    = {Kim, Youbin and Park, Jinho and Park, Hogun and Park, Eunbyung},
  journal   = {arXiv preprint arXiv:2603.21944},
  year      = {2026}
}

Acknowledgements

We thank the authors of SAM3 and Depth-Anything-3 for their excellent work and open-source contributions.
