[WIP] VLN Benchmark: H1 Navigation in Matterport 3D with NaVILA VLM #446

Open: kanghui0204 wants to merge 1 commit into main from feature/vln-benchmark

Summary

Add Vision-Language Navigation (VLN) benchmark support to IsaacLab Arena, enabling H1 humanoid navigation in Matterport 3D indoor scenes using the NaVILA VLM.

This is a draft PR for early review and feedback. The core pipeline is functional and tested, but some areas may need refinement before merging.

Architecture

Two-level hierarchical policy:

  • High-level: NaVILA VLM generates velocity commands from RGB image history
  • Low-level: RSL-RL locomotion policy converts velocity commands to joint actions

Communication between Isaac Sim (client) and NaVILA (server) uses Arena's ZeroMQ remote-policy framework (merged in #394).
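The two-level structure can be sketched as a slow planner feeding a fast controller. This is a minimal illustration, not the PR's `VlnVlmLocomotionPolicy`: the class name `HierarchicalPolicy`, the `replan_every` parameter, and the `(forward, lateral, yaw)` command layout are all assumptions, and the two callables merely stand in for the NaVILA VLM and the RSL-RL locomotion policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed command layout: (forward m/s, lateral m/s, yaw rad/s).
VelocityCommand = Tuple[float, float, float]


@dataclass
class HierarchicalPolicy:
    """Hypothetical two-level policy: slow high-level planner, fast low-level controller."""

    # Stand-in for the VLM: image history -> velocity command.
    high_level: Callable[[Sequence[object]], VelocityCommand]
    # Stand-in for the locomotion policy: (proprioception, command) -> joint actions.
    low_level: Callable[[Sequence[float], VelocityCommand], List[float]]
    # Assumption: the high level is queried once every N control steps.
    replan_every: int = 10

    def __post_init__(self) -> None:
        self._step = 0
        self._command: VelocityCommand = (0.0, 0.0, 0.0)

    def act(self, images: Sequence[object], proprio: Sequence[float]) -> List[float]:
        # Refresh the velocity command only on replanning steps.
        if self._step % self.replan_every == 0:
            self._command = self.high_level(images)
        self._step += 1
        # The low level runs every step and tracks the most recent command.
        return self.low_level(proprio, self._command)
```

In the PR the high-level call crosses the ZeroMQ boundary to the NaVILA server, which is one reason to query it less often than the locomotion policy; the actual replanning schedule is not stated in this description.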

What's Included

| Component | Location | Description |
| --- | --- | --- |
| H1 Embodiment | isaaclab_arena/embodiments/h1/ | Standard H1 + VLN extension (cameras, observations) |
| VLN Task | isaaclab_arena/tasks/vln_r2r_matterport_task.py | R2R episode management, scene filtering, termination |
| VLN Metrics | isaaclab_arena/metrics/vln_metrics.py | SPL, Success, PathLength, DistanceToGoal |
| Matterport Background | isaaclab_arena/assets/matterport_background.py | Scene loading + lighting + ground plane |
| Client Policy | isaaclab_arena/policy/vln/ | VlnVlmLocomotionPolicy (VLM + RSL-RL composite) |
| NaVILA Server | isaaclab_arena_navila/ | NaVilaServerPolicy (LLaVA-based VLM inference) |
| Environment | isaaclab_arena_environments/vln_environment.py | h1_vln_matterport environment registration |
| Docker | docker/Dockerfile.vln_server, docker/run_vln_server.sh | VLM server container |
| Pretrained LL model | isaaclab_arena/policy/vln/pretrained/ | H1 locomotion checkpoint (4.7 MB) |

Key Design Decisions

  • Full image history + uniform sampling: The VLM receives 8 frames sampled uniformly from the entire episode history (not just the last 8). Seeing the whole trajectory lets the VLM judge whether the instruction has been completed and output "stop".
  • Scene filtering: Episodes are automatically filtered to those that belong to the loaded USD scene. --episode_start/end index into the filtered set, not the full dataset.
  • XY metrics: Distance calculations use the horizontal (XY) plane only, because the robot's pelvis height (~0.9 m) differs from the dataset waypoint height (~0.17 m, floor level).
  • Modular server: NaVILA server is in a separate package (isaaclab_arena_navila/), following the isaaclab_arena_gr00t pattern. Other VLMs can be added without changing client code.
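The "full image history + uniform sampling" decision can be illustrated with a small helper. This is a sketch under assumptions, not the PR's code: the function name `sample_uniform_frames` is hypothetical, and both the handling of short histories (return everything, no padding) and the choice to always include the first and latest frames are guesses about reasonable behavior.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_uniform_frames(history: Sequence[T], num_frames: int = 8) -> List[T]:
    """Pick `num_frames` items spread evenly over the whole history.

    Assumptions (not confirmed by the PR): short histories are returned
    unpadded, and the first and most recent frames are always kept.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    n = len(history)
    if n <= num_frames:
        return list(history)
    if num_frames == 1:
        return [history[-1]]
    # Evenly spaced indices from 0 (first frame) to n - 1 (latest frame).
    indices = [round(i * (n - 1) / (num_frames - 1)) for i in range(num_frames)]
    return [history[i] for i in indices]
```

For a 100-frame history this selects indices 0, 14, 28, 42, 57, 71, 85 and 99, so early, middle and recent views all reach the VLM, whereas a last-8 window would only cover frames 92 to 99.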

Test Results

| Episodes | Success | SPL | Avg Distance-to-Goal |
| --- | --- | --- | --- |
| 10 (zsNo4HB9uLZ) | 0.40 | 0.36 | 6.17 m |
| 3 (zsNo4HB9uLZ) | 0.67 | 0.59 | 5.41 m |
| 1 (best case) | 1.00 | 0.77 | 0.77 m |

VLM correctly outputs "stop" when task is complete (e.g., "I think I should stop because I have finished the instruction.").
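For readers unfamiliar with the reported columns, the sketch below computes per-episode Success, SPL, PathLength and DistanceToGoal using XY-only distances. It follows the standard SPL definition (success weighted by shortest-path length over the larger of shortest and actual path length). The function names and the 3.0 m success radius are assumptions; the PR description does not state the threshold used in `vln_metrics.py`.

```python
import math
from typing import Dict, Sequence

Point = Sequence[float]  # (x, y) or (x, y, z); only x and y are used


def xy_distance(a: Point, b: Point) -> float:
    """Horizontal distance; any z component is ignored."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def path_length_xy(path: Sequence[Point]) -> float:
    """Sum of horizontal distances between consecutive positions."""
    return sum(xy_distance(p, q) for p, q in zip(path, path[1:]))


def episode_metrics(
    path: Sequence[Point],
    goal: Point,
    shortest_path_length: float,
    success_radius: float = 3.0,  # assumed threshold, not taken from the PR
) -> Dict[str, float]:
    """Per-episode VLN metrics for a non-empty robot path."""
    distance_to_goal = xy_distance(path[-1], goal)
    success = 1.0 if distance_to_goal <= success_radius else 0.0
    length = path_length_xy(path)
    denom = max(length, shortest_path_length)
    # SPL: success weighted by how close the path is to the shortest one.
    spl = success * shortest_path_length / denom if denom > 0 else 0.0
    return {
        "success": success,
        "spl": spl,
        "path_length": length,
        "distance_to_goal": distance_to_goal,
    }
```

Because z is dropped, a robot pelvis at 0.9 m standing directly over a floor-level waypoint at 0.17 m yields a distance of zero rather than 0.73 m.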

Known Limitations

  • num_envs must be 1 (multi-env VLM instruction tracking not yet implemented)
  • Scene switching requires process restart
  • Uses an invisible ground plane for collision instead of the Matterport mesh (GPU physics limitation)
  • Pretrained checkpoint included for early testing; will be removed before final merge

How to Test

See isaaclab_arena_navila/README.md for full setup instructions (English + Chinese).

Server
bash docker/run_vln_server.sh -m /path/to/navila-model --port 5555

Client (inside Isaac Sim container)

/isaac-sim/python.sh -u -m isaaclab_arena.evaluation.policy_runner \
--enable_cameras --num_envs 1 \
--policy_type isaaclab_arena.policy.vln.vln_vlm_locomotion_policy.VlnVlmLocomotionPolicy \
--remote_host localhost --remote_port 5555 \
--ll_checkpoint_path isaaclab_arena/policy/vln/pretrained/h1_navila_locomotion.pt \
--ll_agent_cfg isaaclab_arena/policy/vln/pretrained/h1_navila_agent.yaml \
--num_episodes 5 \
h1_vln_matterport \
--usd_path /datasets/VLN-CE-Isaac/matterport_usd/zsNo4HB9uLZ/zsNo4HB9uLZ.usd \
--r2r_dataset_path /datasets/VLN-CE-Isaac/vln_ce_isaac_v1.json.gz
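The command above loads one USD scene but points at the full R2R dataset, so the scene filtering described under Key Design Decisions decides which episodes actually run. A minimal sketch of that idea follows. It is not the PR's implementation: the `scene_id` key, the helper names, and matching on the file stem of each path are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def scene_name_from_path(path: str) -> str:
    """Return the file stem, e.g. '.../zsNo4HB9uLZ/zsNo4HB9uLZ.usd' -> 'zsNo4HB9uLZ'."""
    return PurePosixPath(path).stem


def filter_episodes(
    episodes: List[Dict[str, object]],
    usd_path: str,
    episode_start: int = 0,
    episode_end: Optional[int] = None,
) -> List[Dict[str, object]]:
    """Keep episodes for the loaded scene, then slice within the filtered set."""
    scene = scene_name_from_path(usd_path)
    # Assumed episode layout: each episode carries a path-like "scene_id" string.
    matching = [
        ep for ep in episodes if scene_name_from_path(str(ep["scene_id"])) == scene
    ]
    # Start/end are applied after filtering, so they index the filtered list.
    return matching[episode_start:episode_end]
```

With this ordering, a start index of 0 always refers to the first episode of the loaded scene, regardless of where that episode sits in the full dataset file.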

Checklist

  • End-to-end pipeline verified (VLM inference → velocity → locomotion → navigation)
  • VLM correctly outputs "stop" for task completion
  • Standard VLN metrics (SPL, Success, PathLength, DTG)
  • Docker server build and launch scripts
  • Documentation (README with English + Chinese)
  • Pretrained H1 locomotion checkpoint included
  • Multi-env support
  • Matterport mesh collision
  • CI integration
  • Performance benchmarking across all 11 scenes

Two-level hierarchical policy for Vision-Language Navigation:
  - High-level: NaVILA VLM generates velocity commands from RGB images
  - Low-level: RSL-RL locomotion policy converts to joint actions

Code organization follows Arena patterns:
  - isaaclab_arena/embodiments/h1/       Standard H1 + VLN extension
  - isaaclab_arena/tasks/                VlnR2rMatterportTask
  - isaaclab_arena/metrics/              SPL, Success, PathLength, DTG (XY)
  - isaaclab_arena/assets/               MatterportBackground with lighting
  - isaaclab_arena/policy/vln/           VlnVlmLocomotionPolicy (client)
  - isaaclab_arena_navila/               NaVilaServerPolicy (server)
  - isaaclab_arena_environments/         h1_vln_matterport environment

Key features:
  - Auto scene-episode matching via scene_filter
  - Full image history + uniform sampling for VLM stop detection
  - Configurable head + follow cameras
  - Docker VLM server (docker/Dockerfile.vln_server)
  - VLN-CE R2R dataset support (11 Matterport scenes, 1077 episodes)

Verified: success=1.0, SPL=0.77 on zsNo4HB9uLZ scene