Spatial-X: Zero-Shot Vision-and-Language Navigation with Spatial Scene Priors

This repository contains the code and data for our series of work on zero-shot Vision-and-Language Navigation (VLN) using global spatial scene priors. We are the first to close the loop from pre-exploration to physically grounded 3D scene reconstructions (i.e., point clouds) for VLN agents, and we investigate how pre-explored 3D scene representations can serve as a robust reasoning basis in multiple ways.

Our Series of Works

SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [arXiv]

We propose a zero-shot VLN setting in which agents may pre-explore the environment and construct a Spatial Scene Graph (SSG) that captures global spatial structure and semantics. On top of the SSG, SpatialNav integrates an agent-centric spatial map, a compass-aligned visual representation, and remote object localization for efficient navigation. SpatialNav significantly outperforms existing zero-shot agents and narrows the gap with state-of-the-art learning-based methods.
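As a rough illustration only (the actual SSG format is defined in the paper and code), the sketch below shows the kind of structure such a graph could take: nodes carrying 3D positions and semantic labels, navigability edges between viewpoints, and a remote-object query that returns the closest node matching a label. All class and method names here are invented for the example.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """A node in a hypothetical Spatial Scene Graph: a viewpoint or an object."""
    name: str
    position: tuple  # (x, y, z) in the scene frame
    semantics: str   # e.g. an object category or room label

@dataclass
class SpatialSceneGraph:
    nodes: dict = field(default_factory=dict)   # name -> SceneNode
    edges: dict = field(default_factory=dict)   # name -> set of neighbor names

    def add_node(self, node: SceneNode) -> None:
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, set())

    def connect(self, a: str, b: str) -> None:
        """Add an undirected navigability edge between two nodes."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def locate(self, semantics: str, near: str):
        """Remote object localization: nearest node matching a semantic label."""
        origin = self.nodes[near].position
        matches = [n for n in self.nodes.values() if n.semantics == semantics]
        return min(matches, key=lambda n: math.dist(n.position, origin),
                   default=None)
```

A query like `graph.locate("chair", near="current_viewpoint")` would then give the agent a global answer ("the nearest chair is at node X") without any local search, which is the kind of reasoning basis the SSG is meant to provide.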

SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation [arXiv]

Building on SpatialNav, SpatialAnt addresses the reality gap that arises when deploying pre-exploration-based agents on real robots. We introduce a physical grounding strategy that recovers metric scale for scene point clouds reconstructed from monocular RGB. We further design a visual anticipation mechanism that renders future observations from noisy point clouds to support counterfactual reasoning. SpatialAnt achieves state-of-the-art zero-shot performance both in simulation and in real-world deployment on a Hello Robot platform.
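To make the scale-ambiguity issue concrete: a monocular reconstruction is only defined up to an unknown global scale. One textbook way to recover metric scale, sketched below, assumes the camera's true height above the floor is known (e.g. from the robot's kinematics), estimates the camera-to-floor distance in the reconstruction, and rescales by the ratio. This is an illustrative simplification, not necessarily the grounding strategy implemented in SpatialAnt; `recover_metric_scale` is a hypothetical helper.

```python
def recover_metric_scale(points, camera_height_m):
    """Rescale an up-to-scale monocular point cloud to metric units.

    Assumptions (for illustration only):
      - `points` is a list of (x, y, z) tuples in the camera frame,
        with z pointing up and the camera at the origin;
      - the lowest reconstructed points lie on the floor;
      - `camera_height_m` is the camera's true height above the floor.
    """
    # Estimate the floor as the lowest z among the reconstructed points.
    floor_z = min(p[2] for p in points)
    estimated_height = -floor_z  # camera sits at z = 0
    scale = camera_height_m / estimated_height
    return [(x * scale, y * scale, z * scale) for (x, y, z) in points]
```

A real pipeline would fit the ground plane robustly (e.g. RANSAC over low points) rather than taking a single minimum, but the principle is the same: one known physical length pins down the metric scale of the whole cloud.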

Performance

Results in Discrete Environments

  • The best and the second best results within each group are denoted by bold and underline.
| Methods | Pre-Exp | R2R TL(↓) | R2R NE(↓) | R2R OSR(↑) | R2R SR(↑) | R2R SPL(↑) | REVERIE OSR(↑) | REVERIE SR(↑) | REVERIE SPL(↑) |
|---|---|---|---|---|---|---|---|---|---|
| *Supervised Learning:* | | | | | | | | | |
| NavCoT | -- | 9.95 | 6.36 | 48 | 40 | 37 | 14.2 | 9.2 | 7.2 |
| PREVALENT | -- | 10.19 | 4.71 | -- | 58 | 53 | -- | -- | -- |
| VLN-BERT | -- | 12.01 | 3.93 | 69 | 63 | 57 | 27.7 | 25.5 | 21.1 |
| HAMT | -- | 11.46 | 2.29 | 73 | 66 | 61 | 36.8 | 33.0 | 30.2 |
| DUET | -- | 13.94 | 3.31 | 81 | 72 | 60 | 51.1 | 47.0 | 33.7 |
| DUET+ScaleVLN | -- | 14.09 | 2.09 | 88 | 81 | 70 | 63.9 | 57.0 | 41.8 |
| *Zero-Shot:* | | | | | | | | | |
| NavGPT | | 11.45 | 6.46 | 42 | 34 | 29 | 28.3 | 19.2 | 14.6 |
| MapGPT | | -- | 5.63 | 57.6 | 43.7 | 34.8 | 36.8 | 31.6 | 20.3 |
| MC-GPT | | -- | 5.42 | 68.8 | 32.1 | -- | 30.3 | 19.4 | 9.7 |
| SpatialGPT | | -- | 5.56 | 70.8 | 48.4 | 36.1 | -- | -- | -- |
| SpatialNav (Ours) | | 13.8 | 4.54 | 68.2 | 57.7 | 47.8 | 58.1 | 49.6 | 34.6 |

Results in Continuous Environments

  • The best supervised results are highlighted in bold, while the best zero-shot results are underlined.
  • "Pre-Exp" denotes whether the zero-shot agent adopts the pre-exploration based navigation settings.
| # | Methods | Pre-Exp | R2R-CE NE(↓) | R2R-CE OSR(↑) | R2R-CE SR(↑) | R2R-CE SPL(↑) | R2R-CE nDTW(↑) | RxR-CE NE(↓) | RxR-CE SR(↑) | RxR-CE SPL(↑) | RxR-CE nDTW(↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | *Supervised Learning:* | | | | | | | | | | |
| 1 | NavFoM | -- | 4.61 | 72.1 | 61.7 | 55.3 | -- | 4.74 | 64.4 | 56.2 | 65.8 |
| 2 | Efficient-VLN | -- | 4.18 | 73.7 | 64.2 | 55.9 | -- | 3.88 | 67.0 | 54.3 | 68.4 |
| | *Zero-Shot:* | | | | | | | | | | |
| 3 | Open-Nav | | 6.70 | 23.0 | 19.0 | 16.1 | 45.8 | -- | -- | -- | -- |
| 4 | Smartway | | 7.01 | 51.0 | 29.0 | 22.5 | -- | -- | -- | -- | -- |
| 5 | STRIDER | | 6.91 | 39.0 | 35.0 | 30.3 | 51.8 | 11.19 | 21.2 | 9.6 | 30.1 |
| 6 | VLN-Zero | | 5.97 | 51.6 | 42.4 | 26.3 | -- | 9.13 | 30.8 | 19.0 | -- |
| 7 | SpatialNav (Ours) | | 5.15 | 66.0 | 64.0 | 51.1 | 65.4 | 7.64 | 32.4 | 24.6 | 55.0 |
| 8 | SpatialAnt (Ours) | | 4.42 | 76.0 | 66.0 | 54.4 | 69.5 | 5.28 | 50.8 | 35.6 | 65.4 |

Citation

If you find our work useful, please consider citing:

@article{zhang2026spatialnav,
  title={SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation},
  author={Zhang, Jiwen and Li, Zejun and Wang, Siyuan and Shi, Xiangyu and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2601.06806},
  year={2026}
}

@article{zhang2026spatialant,
  title={SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation},
  author={Zhang, Jiwen and Shi, Xiangyu and Wang, Siyuan and Li, Zerui and Wei, Zhongyu and Wu, Qi},
  year={2026}
}

Website License


The website code is adapted from the Nerfies website and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
