NimbleD

NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training

European Conference on Computer Vision (ECCV) 2024 CV4Metaverse Workshop - Oral

Results on KITTI Eigen Split with Median Alignment

Method	Params (M)	AbsRel	SqRel	RMSE	RMSElog	δ < 1.25¹	δ < 1.25²	δ < 1.25³
Monodepth2-R18	14.8	0.115	0.903	4.863	0.193	0.877	0.959	0.981
+ NimbleD (ours)	14.8	0.100	0.739	4.440	0.175	0.898	0.967	0.985
Monodepth2-R50	34.6	0.110	0.831	4.642	0.187	0.883	0.962	0.982
+ NimbleD (ours)	34.6	0.097	0.721	4.377	0.172	0.904	0.968	0.985
SwiftDepth-S	3.6	0.110	0.830	4.700	0.187	0.882	0.962	0.982
+ NimbleD (ours)	3.6	0.098	0.733	4.401	0.174	0.901	0.968	0.985
SwiftDepth	6.4	0.107	0.790	4.643	0.182	0.888	0.963	0.983
+ NimbleD (ours)	6.4	0.096	0.697	4.333	0.171	0.905	0.969	0.986
LiteMono-small	2.5	0.110	0.802	4.671	0.186	0.879	0.961	0.982
+ NimbleD (ours)	2.5	0.099	0.709	4.370	0.172	0.898	0.967	0.986
LiteMono	3.1	0.107	0.765	4.561	0.183	0.886	0.963	0.983
+ NimbleD (ours)	3.1	0.096	0.684	4.304	0.171	0.903	0.969	0.986
LiteMono-8M	8.8	0.101	0.729	4.454	0.178	0.897	0.965	0.983
+ NimbleD (ours)	8.8	0.092	0.646	4.194	0.165	0.910	0.970	0.986

Setup

Experiments were conducted on Windows 11, Python 3.9.19, CUDA 12.5, PyTorch 1.13.1.

The main dependencies are listed in requirements.txt.

Datasets

KITTI

Refer to Monodepth2 for KITTI dataset preparation.

Image Format: PNG
Dataset Size: ~161 GB

Generate Pseudo-labels

python generate_kitti_pseudo_labels.py --data_dir KITTI_DATA_PATH

Pseudo-labels Size: ~13 GB

YouTube

The selected YouTube videos were accessed and obtained under a CC-BY license at a resolution of 854x480.

Proof of access under the CC-BY license can be found here: CC-BY.

Please verify the current license and comply with YouTube’s terms of service before using or downloading videos.

Save each video in the following directory structure, with file names numbered in the order in which the videos appear in the corresponding URL file:

datasets/
└── youtube/
    └── videos/
        ├── driving/
        │   ├── D_0001.mp4
        │   ├── D_0002.mp4
        │   ├── ...
        │   └── D_0035.mp4
        ├── hiking/
        │   ├── H_0001.mp4
        │   ├── H_0002.mp4
        │   ├── ...
        │   └── H_0035.mp4
        └── city_walking/
            ├── CW_0001.mp4
            ├── CW_0002.mp4
            ├── ...
            └── CW_0035.mp4
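
Before extracting frames, it can help to confirm that the layout above is complete. The following is a small sketch, not part of the repository: `expected_videos` and `missing_videos` are hypothetical helper names, and the count of 35 videos per category with contiguous numbering is an assumption read off the tree above.

```python
from pathlib import Path

# Category folder -> file prefix, as shown in the tree above.
CATEGORIES = {"driving": "D", "hiking": "H", "city_walking": "CW"}
# Assumption: the tree lists files 0001 through 0035 in each category.
VIDEOS_PER_CATEGORY = 35


def expected_videos(root):
    """Yield every video path the layout expects under `root`."""
    for folder, prefix in CATEGORIES.items():
        for i in range(1, VIDEOS_PER_CATEGORY + 1):
            yield Path(root) / folder / f"{prefix}_{i:04d}.mp4"


def missing_videos(root="datasets/youtube/videos"):
    """Return the expected video paths that do not exist on disk."""
    return [p for p in expected_videos(root) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_videos()
    total = len(CATEGORIES) * VIDEOS_PER_CATEGORY
    print(f"{len(missing)} of {total} expected videos are missing")
    for path in missing[:10]:  # show only the first few
        print("  ", path)
```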

Videos Size: ~31 GB

Extract Frames

python ./datasets/youtube/extract_frames.py

Frames Size: ~556 GB

Generate Pseudo-labels

python generate_youtube_pseudo_labels.py --data_dir YOUTUBE_DATA_PATH

Pseudo-labels Size: ~364 GB
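
The dataset sizes quoted in this README add up to more than a terabyte, so a quick free-space check before starting may save time. This sketch only sums the approximate figures listed above; `total_required_gb` and `free_gb` are illustrative names, and the check ignores any space needed for weights or logs.

```python
import shutil

# Approximate sizes in GB, copied from the figures in this README.
SIZES_GB = {
    "KITTI images": 161,
    "KITTI pseudo-labels": 13,
    "YouTube videos": 31,
    "YouTube frames": 556,
    "YouTube pseudo-labels": 364,
}


def total_required_gb(sizes=SIZES_GB):
    """Sum the listed dataset sizes in GB."""
    return sum(sizes.values())


def free_gb(path="."):
    """Free space in GB on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    need, have = total_required_gb(), free_gb(".")
    print(f"required: ~{need} GB, free: {have:.0f} GB")
    if have < need:
        print("Not enough free space for all datasets on this drive.")
```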

Training

MODEL_NAME:

md2_r18
md2_r50
swiftdepth_s
swiftdepth
litemono_s
litemono
litemono_8m

Large-scale Video Pre-Training

python pretrain_youtube.py --project_name PROJECT_NAME --model_name MODEL_NAME --data_dir YOUTUBE_DATA_PATH --learn_k

Fine-tune on KITTI

python finetune_kitti.py --project_name PROJECT_NAME --model_name MODEL_NAME --data_dir KITTI_DATA_PATH --pretrained_weights PRETRAIN_WEIGHTS_PATH --learn_k
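
The two stages run in sequence: pre-train on YouTube, then fine-tune on KITTI from the resulting weights. The sketch below only assembles the two commands shown above for a chosen model. `training_commands` is a hypothetical helper, and because this README does not state where pre-training writes its weights, the caller has to supply that path.

```python
import shlex

MODEL_NAMES = [
    "md2_r18", "md2_r50", "swiftdepth_s", "swiftdepth",
    "litemono_s", "litemono", "litemono_8m",
]


def training_commands(model_name, project_name, youtube_dir, kitti_dir,
                      pretrained_weights):
    """Return [pretrain_cmd, finetune_cmd] as argument lists."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown MODEL_NAME: {model_name!r}")
    common = ["--project_name", project_name, "--model_name", model_name]
    pretrain = ["python", "pretrain_youtube.py", *common,
                "--data_dir", youtube_dir, "--learn_k"]
    finetune = ["python", "finetune_kitti.py", *common,
                "--data_dir", kitti_dir,
                "--pretrained_weights", pretrained_weights, "--learn_k"]
    return [pretrain, finetune]


if __name__ == "__main__":
    # All argument values here are placeholders, as in the commands above.
    for cmd in training_commands("litemono", "PROJECT_NAME",
                                 "YOUTUBE_DATA_PATH", "KITTI_DATA_PATH",
                                 "PRETRAIN_WEIGHTS_PATH"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the stages one after the other.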

Weights

Evaluation

MODEL_NAME:

md2_r18
md2_r50
swiftdepth_s
swiftdepth
litemono_s
litemono
litemono_8m

Evaluate on KITTI Eigen split with median alignment

python eval_kitti.py --data_dir KITTI_DATA_PATH --weights_dir WEIGHTS_PATH --model_name MODEL_NAME --eval_split eigen --align median

Evaluate on KITTI Eigen-Benchmark split with lsqr alignment

python eval_kitti.py --data_dir KITTI_DATA_PATH --weights_dir WEIGHTS_PATH --model_name MODEL_NAME --eval_split eigen_benchmark --align lsqr
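
The two commands differ in how predictions are aligned to ground truth before the metrics are computed. Median alignment fixes a single scale factor. The sketch below assumes that `lsqr` refers to a least-squares fit of both a scale and a shift; this README does not confirm that, nor whether the repository fits in depth or disparity space. `least_squares_align` is an illustrative name and operates on whatever values it is given.

```python
from statistics import mean


def least_squares_align(pred, gt):
    """Fit gt ~ scale * pred + shift by ordinary least squares.

    pred, gt: equal-length sequences over valid pixels.
    Returns (aligned_pred, scale, shift).
    """
    mean_p, mean_g = mean(pred), mean(gt)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("constant predictions: scale is undetermined")
    cov_pg = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    scale = cov_pg / var_p
    shift = mean_g - scale * mean_p
    return [scale * p + shift for p in pred], scale, shift


if __name__ == "__main__":
    pred = [1.0, 2.0, 3.0]
    gt = [5.0, 7.0, 9.0]  # equals 2 * pred + 3
    aligned, scale, shift = least_squares_align(pred, gt)
    print(aligned, scale, shift)
```

Because the fit has two free parameters instead of one, it can also absorb an offset in the predictions that a median-based scale cannot.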

Evaluate on NYUv2

Note: The evaluation set I used in the paper contains a slightly different set of images from the one used in other papers. Unfortunately, I no longer remember how I generated these images or where I downloaded them.

To address this, I have uploaded the evaluation set I used in the paper here. Since this dataset is intended solely for a head-to-head comparison of improvements over baselines tested on the same data, the comparison remains fair. If you perform evaluations on the standard test set, you will observe only minor differences.

python eval_nyuv2.py --data_dir NYUv2_DATA_PATH --weights_dir WEIGHTS_PATH --model_name MODEL_NAME --align median

Evaluate on Make3D

python eval_make3d.py --data_dir MAKE3D_DATA_PATH --weights_dir WEIGHTS_PATH --model_name MODEL_NAME

Acknowledgement

The code is inspired by and builds upon the following works: Monodepth2, Lite-Mono, SwiftDepth, KBR, DepthAnything.

Thank you to the authors for their valuable contributions.

Attribution

The YouTube videos were accessed under a CC-BY license at the time of collection. The list of video URLs is available here.

The following creators are acknowledged for their content:

Kizzume
- YouTube Channel
- Channel ID: UCPJJsmyvEFizmsVKznk_pjw
Evan Explores
- YouTube Channel
- Channel ID: UCqsCOd3o-7vDXFmYwP00fjg
Travel | Relax | Listen
- YouTube Channel
- Channel ID: UCwR7sfacuPghWn-KvVwxOeg
POPtravel
- Daniel Sczepansky
- YouTube Channel
- Channel ID: UClODDXeUIz1-FaKyN8dsNrA
- Website: www.poptravel.org

Their work is highly appreciated.

Citation

@InProceedings{10.1007/978-3-031-92387-6_18,
author="Luginov, Albert
and Shahzad, Muhammad",
editor="Del Bue, Alessio
and Canton, Cristian
and Pont-Tuset, Jordi
and Tommasi, Tatiana",
title="NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-Scale Video Pre-training",
booktitle="Computer Vision -- ECCV 2024 Workshops",
year="2025",
publisher="Springer Nature Switzerland",
address="Cham",
pages="235--251",
abstract="We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low latency inference. The source code, model weights, and attributions are available at https://github.com/xapaxca/nimbled.",
isbn="978-3-031-92387-6"
}
