Group3D

MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Youbin Kim1  ·  Jinho Park1  ·  Hogun Park1  ·  Eunbyung Park2

1 Sungkyunkwan University    2 Yonsei University

ECCV 2026

Installation

1. Clone the repository

git clone https://github.com/Ubin108/Group3D.git --recursive
cd Group3D

2. Install dependencies

conda create -n group3d python=3.12
conda activate group3d

pip install torch==2.7.0+cu118 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install -e third_party/Depth-Anything-3
pip install git+https://github.com/QitaoZhao/gsplat.git --no-build-isolation
pip install -e third_party/sam3

3. HuggingFace login (for SAM3 weights)

SAM3 model weights are gated on HuggingFace. Visit the SAM3 model page, agree to share your contact information with Meta, then log in:

hf auth login

4. Set up API keys

Create a .env file in the project root:

OPENAI_API_KEY=sk-...

Data Preparation

1. Download ScanNetv2

Download ScanNetv2 and place it as follows:

data/ScanNetv2/scans/
└── scene0011_00/
    ├── color/
    ├── depth/
    ├── intrinsic/
    └── pose/

2. Frame sampling

Sample 128 frames uniformly from each scene. This creates video_color_128/ under each scene directory.

python preprocess/sample_frames.py \
    --root data/ScanNetv2/scans \
    --target_n 128
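For intuition, uniform sampling can be sketched as picking evenly spaced indices over the frame sequence. This is a minimal illustration, not the code in preprocess/sample_frames.py; the function name `uniform_indices` and the choice to always include the first and last frame are assumptions.

```python
def uniform_indices(num_frames: int, target_n: int = 128) -> list[int]:
    """Pick up to target_n evenly spaced frame indices from range(num_frames)."""
    if num_frames <= 0 or target_n <= 0:
        return []
    if num_frames <= target_n:
        # Fewer frames than requested: keep every frame.
        return list(range(num_frames))
    if target_n == 1:
        return [0]
    # Spread the indices so that both the first and last frame are included.
    step = (num_frames - 1) / (target_n - 1)
    return [int(i * step + 0.5) for i in range(target_n)]


if __name__ == "__main__":
    idx = uniform_indices(1000, 128)
    print(len(idx), idx[0], idx[-1])
```

Because the step is larger than 1 whenever a scene has more frames than `target_n`, the rounded indices are strictly increasing and no frame is picked twice.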

3. Depth estimation & point cloud building

Run Depth-Anything-3 on the sampled frames. This creates point_cloud/ with depth maps, confidence maps, and estimated camera extrinsics.

python preprocess/build_pointcloud_da3.py \
    --root data/ScanNetv2/scans \
    --target_n 128
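The geometry behind turning a depth map into a point cloud can be sketched with standard pinhole back-projection. This is an illustration only: the file formats written to point_cloud/ and the pose convention used by build_pointcloud_da3.py are not documented here, so the camera-to-world convention and both function names below are assumptions.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth along the optical axis to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, cam_to_world):
    """Apply a 4x4 camera-to-world matrix (nested lists, row-major) to a 3D point.

    Assumption: the extrinsic maps camera coordinates to world coordinates.
    If the stored matrix is world-to-camera, invert it first.
    """
    x, y, z = point
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )


if __name__ == "__main__":
    # Hypothetical intrinsics and a pose that shifts the camera 1 unit along x.
    pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    p_cam = backproject_pixel(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(camera_to_world(p_cam, pose))
```

In practice the same operation is applied to every pixel at once with array operations, and low-confidence pixels would typically be dropped using the confidence map.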

4. Point cloud alignment

Align the point cloud to the scene coordinate frame. This creates point_cloud_aligned_pose_free/ or point_cloud_aligned_pose_known/, depending on --pose_mode.

# Pose-free (uses DA3 estimated poses)
python preprocess/align_pointcloud.py \
    --dataset_root data/ScanNetv2/scans \
    --pose_mode pose_free

# Pose-known (uses ground-truth poses)
python preprocess/align_pointcloud.py \
    --dataset_root data/ScanNetv2/scans \
    --pose_mode pose_known

Final dataset structure

data/ScanNetv2/scans/
└── scene0011_00/
    ├── color/
    ├── depth/
    ├── intrinsic/
    ├── pose/
    ├── video_color_128/
    ├── point_cloud/
    ├── point_cloud_aligned_pose_free/
    └── point_cloud_aligned_pose_known/
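Before running the pipeline, it can help to confirm that each scene has the directories listed above. The following is a small sketch using only the standard library; the helper name `missing_subdirs` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Directories expected in every scene after preprocessing (from the tree above).
COMMON_SUBDIRS = [
    "color", "depth", "intrinsic", "pose",
    "video_color_128", "point_cloud",
]


def missing_subdirs(scene_dir, pose_mode="pose_free"):
    """Return the expected subdirectories that are absent from scene_dir."""
    scene_dir = Path(scene_dir)
    expected = COMMON_SUBDIRS + [f"point_cloud_aligned_{pose_mode}"]
    return [name for name in expected if not (scene_dir / name).is_dir()]


if __name__ == "__main__":
    root = Path("data/ScanNetv2/scans")
    if root.is_dir():
        for scene in sorted(p for p in root.iterdir() if p.is_dir()):
            missing = missing_subdirs(scene)
            if missing:
                print(f"{scene.name}: missing {', '.join(missing)}")
```

Pass `pose_mode="pose_known"` to check for the pose-known alignment output instead.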

Running Group3D

Run the full pipeline

# All scenes
python run.py --config configs/pose_free.yaml

# Single scene
python run.py --config configs/pose_free.yaml --scene scene0011_00

Evaluation

ScanNet20

Before evaluating, generate the GT bounding boxes; see eval/README.md for the required files and instructions.

python eval/evaluation.py \
    --pred_dir results \
    --gt_dir data/ScanNetv2/scannet_20 \
    --split_txt data/ScanNetv2/scannet_val.txt \
    [--gt_label_mode ov3det] [--per_class] [--verbose]
  • --gt_dir: directory containing GT .npy files ({scene_id}_bbox_scannet20.npy or {scene_id}_aligned_bbox.npy)
  • --split_txt: val split scene list (312 scenes)
  • --gt_label_mode: ov3det (default, class IDs 0–19) or nyu40
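A quick pre-flight check can list the scenes in the split that have no GT file under either naming scheme. This is a sketch, not part of eval/evaluation.py; it assumes the split file holds one scene ID per line, and the helper names are hypothetical.

```python
from pathlib import Path

# The two GT file name patterns mentioned above.
GT_SUFFIXES = ("_bbox_scannet20.npy", "_aligned_bbox.npy")


def read_split(split_txt):
    """Read scene IDs from a split file (assumed: one ID per line)."""
    lines = Path(split_txt).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def scenes_missing_gt(scene_ids, gt_dir):
    """Return the scene IDs that have no GT .npy file in gt_dir."""
    gt_dir = Path(gt_dir)
    return [
        scene_id
        for scene_id in scene_ids
        if not any((gt_dir / f"{scene_id}{suffix}").is_file() for suffix in GT_SUFFIXES)
    ]


if __name__ == "__main__":
    split = Path("data/ScanNetv2/scannet_val.txt")
    if split.is_file():
        scenes = read_split(split)
        missing = scenes_missing_gt(scenes, "data/ScanNetv2/scannet_20")
        print(f"{len(scenes)} scenes in split, {len(missing)} without GT boxes")
```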

ScanNet200

python eval/evaluation.py --benchmark scannet200 \
    --pred_dir results \
    --scan_root /path/to/scans_200/val \
    --gt_dir /path/to/scannet/scans \
    --split_txt data/ScanNetv2/scannet_val.txt \
    [--per_class] [--verbose]
  • --scan_root: directory containing scene PLY files (e.g. scans_200/val/)
  • --gt_dir: ScanNet scans directory containing {scene_id}/ subdirs with .aggregation.json and .segs.json
  • --split_txt: val split scene list (optional; if omitted, scenes are inferred from --pred_dir)

Visualization

Render detected instances as 3D bounding boxes:

python visualization.py --scene scene0011_00 --config configs/pose_free.yaml

Citation

If you find this work useful, please cite:

@article{kim2026group3d,
  title     = {Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection},
  author    = {Kim, Youbin and Park, Jinho and Park, Hogun and Park, Eunbyung},
  journal   = {arXiv preprint arXiv:2603.21944},
  year      = {2026}
}

Acknowledgements

We thank the authors of SAM3 and Depth-Anything-3 for their excellent work and open-source contributions.
