This repository contains the code and data for our series of work on zero-shot Vision-and-Language Navigation (VLN) using global spatial scene priors. We are the first to close the loop from pre-exploration to physically grounded 3D scene reconstructions (i.e., point clouds) for VLN agents, and we investigate how pre-explored 3D scene representations can serve as a robust reasoning basis in multiple ways.
## SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation [arXiv]
We propose a zero-shot VLN setting that allows agents to pre-explore the environment and construct a Spatial Scene Graph (SSG) capturing global spatial structure and semantics. Building on the SSG, SpatialNav integrates an agent-centric spatial map, a compass-aligned visual representation, and remote object localization for efficient navigation. SpatialNav significantly outperforms existing zero-shot agents and narrows the gap with state-of-the-art learning-based methods.
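To make the SSG idea concrete, here is a minimal sketch of such a graph and of deriving an agent-centric view from it. All names (`SSGNode`, `SpatialSceneGraph`, `agent_centric_map`) are hypothetical illustrations, not the repository's actual API; the real SSG stores richer semantics than this toy version.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SSGNode:
    """A pre-explored viewpoint: a global 2D position plus observed object labels."""
    node_id: str
    position: tuple                                 # (x, y) in the global frame, metres
    objects: list = field(default_factory=list)     # semantic labels seen at this viewpoint

@dataclass
class SpatialSceneGraph:
    nodes: dict = field(default_factory=dict)       # node_id -> SSGNode
    edges: dict = field(default_factory=dict)       # node_id -> set of navigable neighbours

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def connect(self, a, b):
        """Record a bidirectional navigable connection between two viewpoints."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def agent_centric_map(self, agent_id, agent_heading_deg):
        """Re-express neighbouring viewpoints relative to the agent, with
        compass-aligned bearings (0 degrees = the agent's current heading)."""
        ax, ay = self.nodes[agent_id].position
        view = {}
        for nid in self.edges[agent_id]:
            px, py = self.nodes[nid].position
            dx, dy = px - ax, py - ay
            dist = math.hypot(dx, dy)
            bearing = (math.degrees(math.atan2(dy, dx)) - agent_heading_deg) % 360
            view[nid] = {"distance": round(dist, 2), "bearing_deg": round(bearing, 1)}
        return view
```

For example, an agent at the origin facing along the x-axis would see a neighbour at (3, 4) as 5 m away at a bearing of about 53 degrees.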
## SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation [arXiv]
Building on SpatialNav, SpatialAnt addresses the reality gap when deploying pre-exploration-based agents on real robots. We introduce a physical grounding strategy to recover metric scale from monocular RGB-based reconstructed scene point clouds. We further design a visual anticipation mechanism that renders future observations from noisy point clouds for counterfactual reasoning. SpatialAnt achieves state-of-the-art zero-shot performance in both simulation and real-world deployment on the Hello Robot.
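Monocular reconstructions are only defined up to an unknown scale, so some physical reference is needed to recover metres. As one hedged illustration of the idea (not necessarily the strategy the paper uses), a fixed, known camera mounting height above the floor can anchor the scale: estimate the camera-to-floor distance in the reconstruction's arbitrary units and rescale. The function name and the quantile-based floor estimate below are assumptions for this sketch.

```python
import numpy as np

def recover_metric_scale(points, camera_height_m, floor_quantile=0.02):
    """Rescale an up-to-scale monocular point cloud (N x 3) to metres.

    Assumes the camera sits at the origin with +z up, and that the camera's
    height above the floor is known (plausible for a fixed mount on a mobile
    robot). The floor level is approximated as a low quantile of the z
    coordinates, which is robust to a few stray points below the floor.
    """
    z = points[:, 2]
    floor_z = np.quantile(z, floor_quantile)   # approximate floor level, arbitrary units
    est_height = -floor_z                      # camera is at z = 0, floor is below it
    scale = camera_height_m / est_height       # metres per arbitrary unit
    return points * scale
```

After rescaling, distances and heights in the point cloud are metric, which is what a real robot's controller needs for collision-free motion.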
- The best and second-best results within each group are denoted in **bold** and <u>underlined</u>, respectively.
| Methods | Pre-Exp | R2R TL(↓) | R2R NE(↓) | R2R OSR(↑) | R2R SR(↑) | R2R SPL(↑) | REVERIE OSR(↑) | REVERIE SR(↑) | REVERIE SPL(↑) |
|---|---|---|---|---|---|---|---|---|---|
| **Supervised Learning** | | | | | | | | | |
| NavCoT | -- | 9.95 | 6.36 | 48 | 40 | 37 | 14.2 | 9.2 | 7.2 |
| PREVALENT | -- | 10.19 | 4.71 | -- | 58 | 53 | -- | -- | -- |
| VLN-BERT | -- | 12.01 | 3.93 | 69 | 63 | 57 | 27.7 | 25.5 | 21.1 |
| HAMT | -- | 11.46 | 2.29 | 73 | 66 | 61 | 36.8 | 33.0 | 30.2 |
| DUET | -- | 13.94 | 3.31 | 81 | 72 | 60 | 51.1 | 47.0 | 33.7 |
| DUET+ScaleVLN | -- | 14.09 | 2.09 | 88 | 81 | 70 | 63.9 | 57.0 | 41.8 |
| **Zero-Shot** | | | | | | | | | |
| NavGPT | ✕ | 11.45 | 6.46 | 42 | 34 | 29 | 28.3 | 19.2 | 14.6 |
| MapGPT | ✕ | -- | 5.63 | 57.6 | 43.7 | 34.8 | 36.8 | 31.6 | 20.3 |
| MC-GPT | ✕ | -- | 5.42 | 68.8 | 32.1 | -- | 30.3 | 19.4 | 9.7 |
| SpatialGPT | ✕ | -- | 5.56 | 70.8 | 48.4 | 36.1 | -- | -- | -- |
| SpatialNav (Ours) | ✓ | 13.8 | 4.54 | 68.2 | 57.7 | 47.8 | 58.1 | 49.6 | 34.6 |
- The best supervised results are highlighted in bold, while the best zero-shot results are underlined.
- "Pre-Exp" denotes whether the zero-shot agent adopts the pre-exploration based navigation settings.
| # | Methods | Pre-Exp | R2R-CE NE(↓) | R2R-CE OSR(↑) | R2R-CE SR(↑) | R2R-CE SPL(↑) | R2R-CE nDTW(↑) | RxR-CE NE(↓) | RxR-CE SR(↑) | RxR-CE SPL(↑) | RxR-CE nDTW(↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Supervised Learning** | | | | | | | | | | | |
| 1 | NavFoM | -- | 4.61 | 72.1 | 61.7 | 55.3 | -- | 4.74 | 64.4 | 56.2 | 65.8 |
| 2 | Efficient-VLN | -- | 4.18 | 73.7 | 64.2 | 55.9 | -- | 3.88 | 67.0 | 54.3 | 68.4 |
| **Zero-Shot** | | | | | | | | | | | |
| 3 | Open-Nav | ✕ | 6.70 | 23.0 | 19.0 | 16.1 | 45.8 | -- | -- | -- | -- |
| 4 | Smartway | ✕ | 7.01 | 51.0 | 29.0 | 22.5 | -- | -- | -- | -- | -- |
| 5 | STRIDER | ✕ | 6.91 | 39.0 | 35.0 | 30.3 | 51.8 | 11.19 | 21.2 | 9.6 | 30.1 |
| 6 | VLN-Zero | ✓ | 5.97 | 51.6 | 42.4 | 26.3 | -- | 9.13 | 30.8 | 19.0 | -- |
| 7 | SpatialNav (Ours) | ✓ | 5.15 | 66.0 | 64.0 | 51.1 | 65.4 | 7.64 | 32.4 | 24.6 | 55.0 |
| 8 | SpatialAnt (Ours) | ✓ | 4.42 | 76.0 | 66.0 | 54.4 | 69.5 | 5.28 | 50.8 | 35.6 | 65.4 |
If you find our work useful, please consider citing:
```bibtex
@article{zhang2026spatialnav,
  title={SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation},
  author={Zhang, Jiwen and Li, Zejun and Wang, Siyuan and Shi, Xiangyu and Wei, Zhongyu and Wu, Qi},
  journal={arXiv preprint arXiv:2601.06806},
  year={2026}
}

@article{zhang2026spatialant,
  title={SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation},
  author={Zhang, Jiwen and Shi, Xiangyu and Wang, Siyuan and Li, Zerui and Wei, Zhongyu and Wu, Qi},
  year={2026}
}
```
The website code is borrowed from the Nerfies website and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.