MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Youbin Kim1 · Jinho Park1 · Hogun Park1 · Eunbyung Park2
1 Sungkyunkwan University 2 Yonsei University
ECCV 2026
git clone https://github.com/Ubin108/Group3D.git --recursive
cd Group3D
conda create -n group3d python=3.12
conda activate group3d
pip install torch==2.7.0+cu118 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e third_party/Depth-Anything-3
pip install git+https://github.com/QitaoZhao/gsplat.git --no-build-isolation
pip install -e third_party/sam3
SAM3 model weights are gated on HuggingFace. Visit the SAM3 model page, agree to share your contact information with Meta, then log in:
hf auth login
Create a .env file in the project root:
OPENAI_API_KEY=sk-...
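To confirm the key is visible before a run, a small stdlib-only check can parse the file. This is a sketch, not part of the repository: `read_env` is a hypothetical helper, it assumes simple `KEY=value` lines, and how Group3D itself loads `.env` is not specified here.

```python
from pathlib import Path


def read_env(path=".env"):
    """Parse simple KEY=value lines; hypothetical helper, not part of Group3D."""
    values = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for raw in env_file.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    key = read_env().get("OPENAI_API_KEY", "")
    print("OPENAI_API_KEY found" if key else "OPENAI_API_KEY missing from .env")
```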
Download ScanNetv2 and place it as follows:
data/ScanNetv2/scans/
└── scene0011_00/
    ├── color/
    ├── depth/
    ├── intrinsic/
    └── pose/
Sample 128 frames uniformly from each scene. This creates video_color_128/ under each scene directory.
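One common way to pick frames uniformly is to take evenly spaced indices across the sequence. The sketch below illustrates that idea only: `uniform_indices` is a hypothetical helper, and the exact rule used by `preprocess/sample_frames.py` may differ.

```python
def uniform_indices(num_frames, target_n=128):
    """Return up to target_n evenly spaced frame indices (illustrative only)."""
    if num_frames <= 0 or target_n <= 0:
        return []
    if num_frames <= target_n:
        return list(range(num_frames))  # short scenes: keep every frame
    if target_n == 1:
        return [0]
    # Spread target_n indices from the first to the last frame inclusive.
    step = (num_frames - 1) / (target_n - 1)
    return [round(i * step) for i in range(target_n)]
```

Under this assumption, a scene with fewer than 128 frames keeps all of them, and longer scenes always include both the first and the last frame.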
python preprocess/sample_frames.py \
--root data/ScanNetv2/scans \
--target_n 128
Run Depth-Anything-3 on the sampled frames. This creates point_cloud/ with depth maps, confidence maps, and estimated camera extrinsics.
python preprocess/build_pointcloud_da3.py \
--root data/ScanNetv2/scans \
--target_n 128
Align the point cloud to the scene coordinate frame. This creates point_cloud_aligned_pose_free/ (or point_cloud_aligned_pose_known/).
# Pose-free (uses DA3 estimated poses)
python preprocess/align_pointcloud.py \
--dataset_root data/ScanNetv2/scans \
--pose_mode pose_free
# Pose-known (uses ground-truth poses)
python preprocess/align_pointcloud.py \
--dataset_root data/ScanNetv2/scans \
--pose_mode pose_known
After preprocessing, each scene directory looks like this:
data/ScanNetv2/scans/
└── scene0011_00/
    ├── color/
    ├── depth/
    ├── intrinsic/
    ├── pose/
    ├── video_color_128/
    ├── point_cloud/
    ├── point_cloud_aligned_pose_free/
    └── point_cloud_aligned_pose_known/
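To see which preprocessing stages have completed for a scene, the directory names above can be checked directly. This is a minimal sketch that assumes only the layout shown. `stage_status` is a hypothetical helper, and it checks whether each directory exists, not what it contains.

```python
from pathlib import Path

# Directory names taken from the layout above.
RAW_DIRS = ("color", "depth", "intrinsic", "pose")
STAGE_DIRS = (
    "video_color_128",
    "point_cloud",
    "point_cloud_aligned_pose_free",
    "point_cloud_aligned_pose_known",
)


def stage_status(scene_dir):
    """Map each expected subdirectory name to True if it exists."""
    scene = Path(scene_dir)
    return {name: (scene / name).is_dir() for name in RAW_DIRS + STAGE_DIRS}


if __name__ == "__main__":
    root = Path("data/ScanNetv2/scans")
    scenes = sorted(p for p in root.glob("scene*") if p.is_dir()) if root.is_dir() else []
    for scene in scenes:
        missing = [n for n, ok in stage_status(scene).items() if not ok]
        print(scene.name, "complete" if not missing else f"missing: {', '.join(missing)}")
```

A missing aligned directory is expected when only one of the two pose modes has been run.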
# All scenes
python run.py --config configs/pose_free.yaml
# Single scene
python run.py --config configs/pose_free.yaml --scene scene0011_00
Before evaluating, generate GT bounding boxes. See eval/README.md for required files and instructions.
python eval/evaluation.py \
--pred_dir results \
--gt_dir data/ScanNetv2/scannet_20 \
--split_txt data/ScanNetv2/scannet_val.txt \
[--gt_label_mode ov3det] [--per_class] [--verbose]
--gt_dir: directory containing GT .npy files ({scene_id}_bbox_scannet20.npy or {scene_id}_aligned_bbox.npy)
--split_txt: val split scene list (312 scenes)
--gt_label_mode: ov3det (default, class IDs 0–19) or nyu40
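Before running the evaluation, it can help to confirm that every scene in the split has a GT file. The sketch below relies only on the two file-name patterns listed for `--gt_dir` and assumes the split file holds one scene ID per line. `missing_gt` is a hypothetical helper, not part of `eval/`.

```python
from pathlib import Path

# File-name patterns listed for --gt_dir.
GT_PATTERNS = ("{scene_id}_bbox_scannet20.npy", "{scene_id}_aligned_bbox.npy")


def missing_gt(gt_dir, split_txt):
    """Return scene IDs from the split that have no GT .npy file in gt_dir."""
    gt_root = Path(gt_dir)
    # Assumption: one scene ID per line in the split file.
    scene_ids = [s.strip() for s in Path(split_txt).read_text().splitlines() if s.strip()]
    return [
        sid for sid in scene_ids
        if not any((gt_root / p.format(scene_id=sid)).is_file() for p in GT_PATTERNS)
    ]


if __name__ == "__main__":
    gt_dir, split = "data/ScanNetv2/scannet_20", "data/ScanNetv2/scannet_val.txt"
    if Path(gt_dir).is_dir() and Path(split).is_file():
        missing = missing_gt(gt_dir, split)
        print(f"{len(missing)} scenes without GT files")
```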
python eval/evaluation.py --benchmark scannet200 \
--pred_dir results \
--scan_root /path/to/scans_200/val \
--gt_dir /path/to/scannet/scans \
--split_txt data/ScanNetv2/scannet_val.txt \
[--per_class] [--verbose]
--scan_root: directory containing scene PLY files (e.g. scans_200/val/)
--gt_dir: ScanNet scans directory containing {scene_id}/ subdirs with .aggregation.json and .segs.json
--split_txt: val split scene list (optional; if omitted, scenes are inferred from --pred_dir)
Render detected instances as 3D bounding boxes:
python visualization.py --scene scene0011_00 --config configs/pose_free.yaml
If you find this work useful, please cite:
@article{kim2026group3d,
title = {Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection},
author = {Kim, Youbin and Park, Jinho and Park, Hogun and Park, Eunbyung},
journal = {arXiv preprint arXiv:2603.21944},
year = {2026}
}
We thank the authors of SAM3 and Depth-Anything-3 for their excellent work and open-source contributions.