This repository contains the code and data for our series of work on zero-shot Vision-and-Language Navigation (VLN) using global spatial scene priors. We are the first to close the loop from pre-exploration to physically grounded 3D scene reconstructions (i.e., point clouds) for VLN agents, and we investigate how pre-explored 3D scene representations can serve as a robust reasoning basis in multiple ways.
## SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [arXiv]
We propose a zero-shot VLN setting that allows agents to pre-explore the environment and construct a Spatial Scene Graph (SSG) capturing global spatial structure and semantics. Building on the SSG, SpatialNav integrates an agent-centric spatial map, a compass-aligned visual representation, and remote object localization for efficient navigation. SpatialNav significantly outperforms existing zero-shot agents and narrows the gap with state-of-the-art learning-based methods.
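To make the SSG idea concrete, here is a minimal sketch of such a graph and of deriving an agent-centric view from it. All names (`SSGNode`, `SpatialSceneGraph`, `agent_centric_map`) are hypothetical illustrations, not the repository's actual API; the real SSG stores richer semantics than this toy version.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SSGNode:
    """A pre-explored viewpoint: a global 2D position plus observed object labels."""
    node_id: str
    position: tuple                                 # (x, y) in the global frame, metres
    objects: list = field(default_factory=list)     # semantic labels seen at this viewpoint

@dataclass
class SpatialSceneGraph:
    nodes: dict = field(default_factory=dict)       # node_id -> SSGNode
    edges: dict = field(default_factory=dict)       # node_id -> set of navigable neighbours

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def connect(self, a, b):
        """Record a bidirectional navigable connection between two viewpoints."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def agent_centric_map(self, agent_id, agent_heading_deg):
        """Re-express neighbouring viewpoints relative to the agent, with
        compass-aligned bearings (0 degrees = the agent's current heading)."""
        ax, ay = self.nodes[agent_id].position
        view = {}
        for nid in self.edges[agent_id]:
            px, py = self.nodes[nid].position
            dx, dy = px - ax, py - ay
            dist = math.hypot(dx, dy)
            bearing = (math.degrees(math.atan2(dy, dx)) - agent_heading_deg) % 360
            view[nid] = {"distance": round(dist, 2), "bearing_deg": round(bearing, 1)}
        return view
```

For example, an agent at the origin facing along the x-axis would see a neighbour at (3, 4) as 5 m away at a bearing of about 53 degrees.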
## SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation [arXiv]
Building on SpatialNav, SpatialAnt addresses the reality gap when deploying pre-exploration-based agents on real robots. We introduce a physical grounding strategy to recover metric scale from monocular RGB-based reconstructed scene point clouds. We further design a visual anticipation mechanism that renders future observations from noisy point clouds for counterfactual reasoning. SpatialAnt achieves state-of-the-art zero-shot performance in both simulation and real-world deployment on the Hello Robot.
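Monocular reconstructions are only defined up to an unknown scale, so some physical reference is needed to recover metres. As one hedged illustration of the idea (not necessarily the strategy the paper uses), a fixed, known camera mounting height above the floor can anchor the scale: estimate the camera-to-floor distance in the reconstruction's arbitrary units and rescale. The function name and the quantile-based floor estimate below are assumptions for this sketch.

```python
import numpy as np

def recover_metric_scale(points, camera_height_m, floor_quantile=0.02):
    """Rescale an up-to-scale monocular point cloud (N x 3) to metres.

    Assumes the camera sits at the origin with +z up, and that the camera's
    height above the floor is known (plausible for a fixed mount on a mobile
    robot). The floor level is approximated as a low quantile of the z
    coordinates, which is robust to a few stray points below the floor.
    """
    z = points[:, 2]
    floor_z = np.quantile(z, floor_quantile)   # approximate floor level, arbitrary units
    est_height = -floor_z                      # camera is at z = 0, floor is below it
    scale = camera_height_m / est_height       # metres per arbitrary unit
    return points * scale
```

After rescaling, distances and heights in the point cloud are metric, which is what a real robot's controller needs for collision-free motion.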
- The best and second-best results within each group are denoted in **bold** and <u>underlined</u>, respectively.
| Methods | Pre-Exp | R2R TL(↓) | R2R NE(↓) | R2R OSR(↑) | R2R SR(↑) | R2R SPL(↑) | REVERIE OSR(↑) | REVERIE SR(↑) | REVERIE SPL(↑) |
|---|---|---|---|---|---|---|---|---|---|
| **Supervised Learning** | | | | | | | | | |
| NavCoT | -- | 9.95 | 6.36 | 48 | 40 | 37 | 14.2 | 9.2 | 7.2 |
| PREVALENT | -- | 10.19 | 4.71 | -- | 58 | 53 | -- | -- | -- |
| VLN-BERT | -- | 12.01 | 3.93 | 69 | 63 | 57 | 27.7 | 25.5 | 21.1 |
| HAMT | -- | 11.46 | 2.29 | 73 | 66 | 61 | 36.8 | 33.0 | 30.2 |
| DUET | -- | 13.94 | 3.31 | 81 | 72 | 60 | 51.1 | 47.0 | 33.7 |
| DUET+ScaleVLN | -- | 14.09 | 2.09 | 88 | 81 | 70 | 63.9 | 57.0 | 41.8 |
| **Zero-Shot** | | | | | | | | | |
| NavGPT | ✕ | 11.45 | 6.46 | 42 | 34 | 29 | 28.3 | 19.2 | 14.6 |
| MapGPT | ✕ | -- | 5.63 | 57.6 | 43.7 | 34.8 | 36.8 | 31.6 | 20.3 |
| MC-GPT | ✕ | -- | 5.42 | 68.8 | 32.1 | -- | 30.3 | 19.4 | 9.7 |
| SpatialGPT | ✕ | -- | 5.56 | 70.8 | 48.4 | 36.1 | -- | -- | -- |
| SpatialNav (Ours) | ✓ | 13.8 | 4.54 | 68.2 | 57.7 | 47.8 | 58.1 | 49.6 | 34.6 |
- The best supervised results are highlighted in bold, while the best zero-shot results are underlined.
- "Pre-Exp" denotes whether the zero-shot agent adopts the pre-exploration based navigation settings.
| # | Methods | Pre-Exp | R2R-CE NE(↓) | R2R-CE OSR(↑) | R2R-CE SR(↑) | R2R-CE SPL(↑) | R2R-CE nDTW(↑) | RxR-CE NE(↓) | RxR-CE SR(↑) | RxR-CE SPL(↑) | RxR-CE nDTW(↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Supervised Learning** | | | | | | | | | | | |
| 1 | NavFoM | -- | 4.61 | 72.1 | 61.7 | 55.3 | -- | 4.74 | 64.4 | 56.2 | 65.8 |
| 2 | Efficient-VLN | -- | 4.18 | 73.7 | 64.2 | 55.9 | -- | 3.88 | 67.0 | 54.3 | 68.4 |
| **Zero-Shot** | | | | | | | | | | | |
| 3 | Open-Nav | ✕ | 6.70 | 23.0 | 19.0 | 16.1 | 45.8 | -- | -- | -- | -- |
| 4 | Smartway | ✕ | 7.01 | 51.0 | 29.0 | 22.5 | -- | -- | -- | -- | -- |
| 5 | STRIDER | ✕ | 6.91 | 39.0 | 35.0 | 30.3 | 51.8 | 11.19 | 21.2 | 9.6 | 30.1 |
| 6 | VLN-Zero | ✓ | 5.97 | 51.6 | 42.4 | 26.3 | -- | 9.13 | 30.8 | 19.0 | -- |
| 7 | SpatialNav (Ours) | ✓ | 5.15 | 66.0 | 64.0 | 51.1 | 65.4 | 7.64 | 32.4 | 24.6 | 55.0 |
| 8 | SpatialAnt (Ours) | ✓ | 4.42 | 76.0 | 66.0 | 54.4 | 69.5 | 5.28 | 50.8 | 35.6 | 65.4 |
If you find our work useful, please consider citing:
```bibtex
@article{zhang2026spatialnav,
  title={SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation},
  author={Zhang, Jiwen and Li, Zejun and Wang, Siyuan and Shi, Xiangyu and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2601.06806},
  year={2026}
}

@article{zhang2026spatialant,
  title={SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation},
  author={Zhang, Jiwen and Shi, Xiangyu and Wang, Siyuan and Li, Zerui and Wei, Zhongyu and Wu, Qi},
  year={2026}
}
```
The website code is borrowed from the Nerfies website and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.