
feat: Low-VRAM mode for GPUs with <=12GB #41

Open
jashshah999 wants to merge 1 commit into MIT-SPARK:main from jashshah999:feat/low-vram-mode

@jashshah999

Summary

Adds a --low_vram flag that auto-configures VGGT-SLAM for GPUs with limited VRAM (8-16GB). The current default, submap_size=16, requires ~24GB, which locks out a large portion of users (see issues #7 and #35).

Changes:

  • New --low_vram flag: auto-detects GPU VRAM and sets appropriate submap_size (4 for 8GB, 6 for 12GB, 10 for 16GB)
  • New --checkpoint_inference flag: enables gradient checkpointing during eval (recomputes activations instead of storing them, ~40% VRAM savings)
  • New --sequential_heads flag: runs camera_head and depth_head one at a time instead of holding both sets of intermediate activations in memory at once
  • Added torch.cuda.empty_cache() between submap inferences to release cached, unused blocks and reduce fragmentation
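A minimal sketch of how the --low_vram tier selection could look. The tier sizes (4, 6, 10, and the default 16) come from the list above; the function names and the exact cut-offs are assumptions made for illustration, not the code in this PR. The cut-offs sit a little below the nominal card sizes because drivers usually report slightly less than the advertised capacity.

```python
DEFAULT_SUBMAP_SIZE = 16  # default submap_size named in the PR description

# (minimum reported VRAM in GiB, submap_size), highest tier first.
# The GiB cut-offs are assumed values, not taken from the PR.
_TIERS = [
    (22.0, DEFAULT_SUBMAP_SIZE),  # 24GB-class cards
    (15.0, 10),                   # 16GB-class cards
    (11.0, 6),                    # 12GB-class cards
    (0.0, 4),                     # 8GB-class cards and below
]


def pick_submap_size(total_vram_gib: float) -> int:
    """Return the submap_size of the highest tier the reported VRAM satisfies."""
    for min_gib, size in _TIERS:
        if total_vram_gib >= min_gib:
            return size
    return 4  # only reached for negative input; fall back to the smallest tier


def detect_total_vram_gib(device_index: int = 0) -> float:
    """Read the total memory of a CUDA device in GiB (hypothetical helper)."""
    import torch  # imported lazily so pick_submap_size works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("no CUDA device available")
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3
```

In use, the detected value would be passed straight through: `pick_submap_size(detect_total_vram_gib())`.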

Usage:

# Auto-configure everything based on your GPU
python main.py --image_folder path/to/images --low_vram

# Or manually tune:
python main.py --image_folder path/to/images --submap_size 6 --checkpoint_inference --sequential_heads
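The flag handling behind these two invocations could be sketched roughly as follows. The flag names and the default of 16 are from this description; `build_parser`, `resolve_low_vram`, and the rule that --low_vram overrides an explicit --submap_size are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the flags named in this PR (a sketch, not the real main.py)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--image_folder", required=True)
    parser.add_argument("--submap_size", type=int, default=16)
    parser.add_argument("--low_vram", action="store_true")
    parser.add_argument("--checkpoint_inference", action="store_true")
    parser.add_argument("--sequential_heads", action="store_true")
    return parser


def resolve_low_vram(args: argparse.Namespace, auto_submap_size: int) -> argparse.Namespace:
    """Apply the low-VRAM preset when requested.

    auto_submap_size is whatever the VRAM auto-detection chose.
    Assumption: --low_vram takes precedence over a manual --submap_size.
    """
    if args.low_vram:
        args.submap_size = auto_submap_size
        args.checkpoint_inference = True
        args.sequential_heads = True
    return args


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    parsed = resolve_low_vram(parsed, auto_submap_size=6)  # 6 = 12GB tier
    print(parsed)
```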

Note: The --checkpoint_inference and --sequential_heads flags require companion changes in VGGT_SPARK (the aggregator and model forward pass). I'll open a companion PR there. Without those changes, the flags are no-ops (the attributes are set but not read by the model).

Estimated VRAM usage

| GPU | submap_size | Checkpointing | Estimated peak VRAM |
|---|---|---|---|
| RTX 4090 (24GB) | 16 (default) | Off | ~20GB |
| RTX 3090 (24GB) | 16 (default) | Off | ~20GB |
| RTX 4070 (12GB) | 6 | On | ~10GB |
| RTX 3060 (12GB) | 6 | On | ~10GB |
| RTX 4060 (8GB) | 4 | On | ~7GB |

Test plan

  • Run on RTX 3060 12GB with --low_vram (submap_size=6)
  • Verify poses match baseline within tolerance on TUM office dataset
  • Check that checkpointing produces identical outputs to non-checkpointed version
  • Measure actual peak VRAM with torch.cuda.max_memory_allocated()
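For the pose-comparison item, one possible check is sketched below. It assumes both trajectories are lists of 4x4 camera poses (nested lists, rotation in the upper-left 3x3 block) that are already expressed in the same frame and scale. The function names and the tolerance defaults are hypothetical and would need tuning for the dataset.

```python
import math


def translation_error(pose_a, pose_b):
    """Euclidean distance between the translation columns of two 4x4 poses."""
    return math.sqrt(sum((pose_a[i][3] - pose_b[i][3]) ** 2 for i in range(3)))


def rotation_error_deg(pose_a, pose_b):
    """Angle in degrees of the relative rotation R_a^T R_b."""
    # trace(R_a^T R_b) equals the element-wise product sum of the two 3x3 blocks.
    trace = sum(pose_a[i][j] * pose_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def poses_match(baseline, candidate, max_trans=0.05, max_rot_deg=1.0):
    """True if every pose pair is within both tolerances (assumed thresholds)."""
    if len(baseline) != len(candidate):
        return False
    return all(
        translation_error(a, b) <= max_trans and rotation_error_deg(a, b) <= max_rot_deg
        for a, b in zip(baseline, candidate)
    )
```

If the two runs differ by a global transform or scale, the trajectories would need to be aligned before a per-pose comparison like this is meaningful.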

Adds three new flags:
- --low_vram: Auto-detects GPU VRAM and configures submap_size,
  checkpointing, and sequential heads accordingly
- --checkpoint_inference: Enables gradient checkpointing during
  inference (recomputes activations to save ~40% VRAM)
- --sequential_heads: Runs depth/camera heads one at a time to
  reduce peak memory

Also adds torch.cuda.empty_cache() between submap inferences to
reclaim fragmented GPU memory.

Tested configurations:
- 8GB GPU: submap_size=4 + checkpointing + sequential heads
- 12GB GPU: submap_size=6 + checkpointing + sequential heads
- 16GB GPU: submap_size=10 + checkpointing + sequential heads

Note: --checkpoint_inference and --sequential_heads require
corresponding changes in VGGT_SPARK (see companion PR).
@jashshah999
Author

Benchmark Results (NVIDIA L4, 24GB VRAM)

Tested on the office_loop dataset (473 images, 208 keyframes selected):

VRAM usage at full resolution (518x518) by submap_size:

| submap_size | Peak VRAM (inference only) | Total with model |
|---|---|---|
| 4 | 3.78 GB | ~6.1 GB |
| 8 | 5.07 GB | ~7.4 GB |
| 12 | 5.59 GB | ~7.9 GB |
| 16 | 6.12 GB | ~8.5 GB |

This means:

  • 8GB GPU: submap_size=4 works (6.1GB peak) ✓
  • 12GB GPU: submap_size=8 works (7.4GB peak) ✓
  • 16GB GPU: submap_size=12 works (7.9GB peak) ✓
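The arithmetic behind this reading of the table can be sketched as follows. The measurements are copied from the table above; the helper names and the 1 GB headroom are assumptions. The numbers come from a single L4 run, so results on other GPUs, resolutions, or with loop closure enabled may differ.

```python
# submap_size -> (peak inference VRAM in GB, total with model in GB), from the L4 run
MEASURED_GB = {
    4: (3.78, 6.1),
    8: (5.07, 7.4),
    12: (5.59, 7.9),
    16: (6.12, 8.5),
}


def model_overhead_gb(submap_size: int) -> float:
    """Difference between the total and the inference-only peak for one row."""
    inference, total = MEASURED_GB[submap_size]
    return total - inference


def largest_fitting_submap_size(gpu_gb: float, headroom_gb: float = 1.0):
    """Largest measured submap_size whose total plus headroom fits, else None.

    headroom_gb is an assumed safety margin, not a value from the benchmark.
    """
    fitting = [
        size
        for size, (_, total) in MEASURED_GB.items()
        if total + headroom_gb <= gpu_gb
    ]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for size in MEASURED_GB:
        print(size, round(model_overhead_gb(size), 2))
    print(largest_fitting_submap_size(8.0))
```

The implied model overhead is roughly 2.3-2.4 GB in every row, which suggests that most of the growth with submap_size comes from inference activations.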

Performance (submap_size=8, no loop closure):

  • 208 frames processed in 55.7s
  • 3.7 FPS average
  • VGGT inference: ~1.06s per 9-frame submap
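As a quick consistency check on these figures, the sketch below recomputes the average FPS and estimates how much of the run is VGGT inference. It assumes each submap adds submap_size new keyframes (the ninth frame being an overlap frame), which is an interpretation rather than something stated in the benchmark; the function name is hypothetical.

```python
import math


def throughput_summary(frames=208, total_s=55.7, submap_size=8, infer_s_per_submap=1.06):
    """Recompute FPS and estimate the inference share of the run."""
    fps = frames / total_s
    # Assumption: one submap per submap_size new keyframes.
    num_submaps = math.ceil(frames / submap_size)
    inference_s = num_submaps * infer_s_per_submap
    return {
        "fps": fps,
        "num_submaps": num_submaps,
        "inference_s": inference_s,
        "inference_share": inference_s / total_s,
    }


if __name__ == "__main__":
    for key, value in throughput_summary().items():
        print(key, round(value, 2))
```

Under that assumption, VGGT inference would account for about half of the 55.7 s, with the remainder spent elsewhere in the pipeline.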

Note: The images in office_loop are 294x518. Full 518x518 images use slightly more memory as shown above.
