Built on top of SceneRF (ICCV 2023) by Cao & de Charette, Inria Paris
| Task | Improvement over Baseline |
|---|---|
| Novel Depth Synthesis | +16% |
| 3D Scene Reconstruction | +7% |
| Novel View Synthesis (RGB) | +2% |
We improve SceneRF's indoor 3D reconstruction pipeline by introducing Random Fourier Feature positional encoding and hierarchical volume sampling, achieving improved results on the BundleFusion dataset while extending the framework to support TUM RGB-D scenes.
SceneRF reconstructs 3D scenes from a single monocular image by synthesizing novel depth maps and views at predicted camera poses, then fusing them into a coherent 3D volume.
SceneRF uses a NeRF-based pipeline: a 2D feature extractor (Spherical U-Net with EfficientNet-B7 backbone) produces per-pixel features, which are decoded by an MLP into density and color along sampled rays. A Self-Organizing Map (SOM) on Gaussians guides where along each ray to sample. We targeted two bottlenecks in this pipeline:
The original model uses standard sinusoidal positional encoding. We replaced it with Random Fourier Features, random projections drawn from
Standard PE: [sin(2^0 x), cos(2^0 x), ..., sin(2^L x), cos(2^L x)]
RFF: sqrt(2) * cos(Wx + b), W ~ N(0, sigma^2), b ~ U[0, 2pi]
Impact: Improved depth synthesis metrics across the board, particularly RMSE and RMSE log. Combined with hierarchical sampling, the two changes complement each other for the best overall results.
For each ray cast from the camera, the model must decide where along the ray to place 3D sample points before querying the MLP for density and color. The original SceneRF uses two sampling strategies:
- Uniform sampling: Evenly spaces points along the ray from near to far. Provides broad coverage but wastes most samples in empty space.
- Gaussian sampling: A small secondary network predicts a set of Gaussian distributions (mean + std) along each ray, trained via SOM-KL loss to settle near surfaces. Points are drawn randomly from these Gaussians, concentrating samples where the network believes surfaces exist. However, this is a learned estimate that can be biased or over-concentrate on a single dominant surface.
We introduced a third strategy:
- Hierarchical sampling (ours): First, the uniform samples are rendered through the main MLP to produce density weights along the ray. These weights reveal where the network actually found density. New sample points are then drawn from this weight distribution via inverse CDF sampling, placing more points in high-density regions and almost none in empty space.
All three sample sets are then merged, sorted by depth, and rendered together in a single final pass.
Why they complement each other: Uniform provides baseline coverage everywhere. Gaussian brings a learned prior about likely surface locations. Hierarchical brings direct empirical evidence from the rendering network's own density predictions. Together they handle cases that any single strategy would miss, such as scenes with multiple surfaces at varying depths where Gaussians might fixate on only the closest one.
This coarse-to-fine hierarchical sampling strategy originates from the original NeRF paper (Mildenhall et al., 2020) but was not present in SceneRF. We adapted it to work alongside SceneRF's existing Gaussian sampling pipeline.
Impact: Drove the largest individual improvement on primary depth metrics (Abs Rel, Sq Rel) and was the only change that also improved novel view synthesis (LPIPS). Combined with RFF, the two changes together achieve the best results across all evaluation categories.
We briefly experimented with multi-head self-attention in the U-Net bottleneck to capture long-range spatial dependencies. In our runs, it did not improve the metrics shown in the results table, so we removed it from the final code path. We still document the experiment in our report as a negative result for transparency.
We extended the entire training and evaluation pipeline to support the TUM RGB-D dataset, including:
- A conversion tool (
convert_tum_to_bf/tum_to_bf.py) that transforms TUM RGB-D data into BundleFusion format (pose conversion, depth scaling, intrinsics mapping) --datasetflags across all training, evaluation, and reconstruction scripts
The figure below contains results on two indoor datasets: BundleFusion and TUM RGB-D.
For BundleFusion, we can compare directly against both the original SceneRF paper and our Scaled Down baseline, since those numbers are available in the same setting.
For TUM RGB-D, the results are additional experiments enabled by our dataset-extension work. There, the comparison is between our own variants only, since the original SceneRF paper and our reduced-compute baseline were not reported in the same way for that dataset.
BundleFusion takeaways (direct comparison to original SceneRF + our scaled-down baseline; bold = best):
| Configuration | Abs Rel ↓ (Depth) | LPIPS ↓ (View Synth.) | IoU ↑ (Recon.) |
|---|---|---|---|
| Original (paper) | 0.1766 | 0.323 | 20.16 |
| Our Baseline (Scaled Down) | 0.1961 | 0.337 | 17.72 |
| Ours (RFF + Hier. Samp.) | 0.1582 | 0.327 | 19.06 |
For TUM RGB-D (lower block in the figure), the most meaningful comparison is among our own ablations. There, Hierarchical Sampling v1 gives the strongest depth metrics, RFF + Hierarchical Sampling gives the best LPIPS / SSIM for novel view synthesis, and the Scaled Down variant retains the best IoU / recall in scene reconstruction.
Input Image
│
▼
┌─────────────────────────────┐
│ EfficientNet-B7 Encoder │
│ (Pretrained) │
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────┐
│ Spherical U-Net Decoder │ Multi-scale features projected
│ (Spherical Mapping) │ onto spherical coordinate space
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────┐
│ Ray Sampling │
│ ├─ Uniform sampling │
│ ├─ Gaussian (SOM-guided) │
│ └─ Hierarchical (NEW) ◄───┼── Coarse-to-fine refinement
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────┐
│ RFF Positional Encoding ◄──┼── Random Fourier Features (NEW)
│ + View Direction │
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────┐
│ ResNet-FC MLP │ Density + Color prediction
│ (3 blocks, 512 hidden) │ per sampled 3D point
└──────────┬──────────────────┘
│
▼
┌─────────────────────────────┐
│ Volume Rendering │ Alpha compositing along rays
│ → Depth + RGB + TSDF │ → 3D Scene Reconstruction
└─────────────────────────────┘
SceneRF/
├── scenerf/
│ ├── models/
│ │ ├── scenerf_bf.py # Main model (modified: RFF + hierarchical sampling)
│ │ ├── pe_rff.py # NEW: Random Fourier Features encoding
│ │ ├── pe.py # Original positional encoding (replaced)
│ │ ├── unet2d_sphere.py # Spherical U-Net backbone used in the final model
│ │ ├── resnetfc.py # ResNet-FC MLP backbone
│ │ ├── ray_som_kl.py # Self-Organizing Map for Gaussian ray sampling
│ │ └── spherical_mapping.py # Spherical coordinate projection
│ ├── scripts/
│ │ ├── train_bundlefusion.py # Training entry point (modified: --dataset, --n_pts_hier)
│ │ ├── train_kitti.py # KITTI training
│ │ ├── evaluation/ # All eval scripts (modified: --dataset for TUM support)
│ │ └── reconstruction/ # Depth-to-TSDF, novel depth generation (modified)
│ ├── data/
│ │ ├── bundlefusion/ # BundleFusion + TUM RGB-D dataloaders (modified)
│ │ └── semantic_kitti/ # KITTI dataloaders
│ └── loss/ # Depth metrics, self-supervised losses
├── convert_tum_to_bf/ # NEW: TUM RGB-D → BundleFusion converter
├── train_eval_bash_scripts/ # NEW: Ready-to-run training & evaluation scripts
├── assets/
│ ├── BetterSceneRF_Report.pdf # Technical report with full methodology & ablations
│ └── outputResults.png # Evaluation result figures
└── teaser/ # README figures and animations
Using Conda:
conda create -y -n scenerf python=3.7
conda activate scenerf
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
conda install -c bioconda tbb=2020.2
pip install torchmetrics==0.6.0
pip install -e ./Using Docker:
docker build -t scene-rf .
docker run -it --gpus all scene-rfexport BF_ROOT=/path/to/bundlefusion
export BF_LOG=/path/to/logs
python scenerf/scripts/train_bundlefusion.py \
--bs=1 --n_gpus=1 \
--n_rays=1024 --lr=2e-5 \
--enable_log=True \
--dataset=bf \
--root=$BF_ROOT --logdir=$BF_LOG \
--n_gaussians=2 --n_pts_per_gaussian=4 \
--n_pts_uni=8 --n_pts_hier=8 \
--max_epochs=30Set --n_pts_hier=0 to disable hierarchical sampling. Set --dataset=tum_rgbd to train on TUM RGB-D.
See train_eval_bash_scripts/ for ready-to-run training and evaluation configurations.
python convert_tum_to_bf/tum_to_bf.py \
--source_dir=/path/to/tum_scenes \
--dest_dir=/path/to/output| Dataset | Type | Scenes | Download |
|---|---|---|---|
| BundleFusion | Indoor RGB-D | 8 scenes | Stanford Graphics |
| TUM RGB-D | Indoor RGB-D | Multiple sequences | TUM CVG |
| KITTI Odometry | Outdoor LiDAR+RGB | 22 sequences | KITTI |
| SemanticKITTI | Outdoor voxel labels | 22 sequences | SemanticKITTI |
This work builds on SceneRF. If you use this code, please cite the original paper:
@InProceedings{cao2023scenerf,
author = {Cao, Anh-Quan and de Charette, Raoul},
title = {SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields},
booktitle = {ICCV},
year = {2023},
}All credit for the original SceneRF architecture goes to Anh-Quan Cao and Raoul de Charette at Inria, Paris. The original work was partly funded by the French project SIGHT (ANR-20-CE23-0016) and conducted in the SAMBA collaborative project.
Built at Technical University of Munich | Machine Learning for 3D Geometry (IN2392)
