Skip to content

Fix RuntimeError and implement memory-efficient sliding window global attention#1

Draft
Copilot wants to merge 15 commits into
mainfrom
copilot/fix-471122bc-fc04-44ba-8469-74e9ccd27f31
Draft

Fix RuntimeError and implement memory-efficient sliding window global attention#1
Copilot wants to merge 15 commits into
mainfrom
copilot/fix-471122bc-fc04-44ba-8469-74e9ccd27f31

Conversation

Copilot AI commented Sep 18, 2025

Copy link
Copy Markdown

This PR fixes a critical RuntimeError in the VGGT model and implements a memory-efficient sliding window approach for global attention processing, reducing memory usage by up to 68% for long video sequences.

Problem

The original implementation had a critical bug in the slice_expand_and_flatten function that caused tensor shape mismatches during concatenation:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 30 for tensor number 2 in the list.

Additionally, the global attention mechanism had quadratic memory complexity O(S² × P²), making it impractical for processing long video sequences.

Solution

1. Fixed Token Expansion Bug

The slice_expand_and_flatten function was incorrectly implemented and couldn't properly expand camera and register tokens from shape (1, 2, X, C) to (B×S, X, C). The fix ensures:

  • Frame 0 uses tokens from index 0 (first-frame-specific tokens)
  • Frames 1 to S-1 use tokens from index 1 (remaining-frames tokens)

2. Implemented Sliding Window Global Attention

Replaced the memory-intensive full global attention with a sliding window approach where each frame attends to:

  • First frame (for global context)
  • Local neighborhood of ±15 frames (configurable via neighborhood_size)

This reduces memory complexity from O(S² × P²) to O(S × neighborhood_size × P²).

Key Changes

  • vggt/models/aggregator.py: Fixed slice_expand_and_flatten and updated _process_global_attention with sliding window logic
  • vggt/layers/attention.py: Enhanced cross-attention support with proper RoPE handling
  • demo_gradio.py: Added QKV weight conversion function for backward compatibility with pretrained models

Memory Efficiency Results

Sequence Length Memory Savings Full Attention Sliding Window
50 frames 36% 763 MB 488 MB
100 frames 68% ~3 GB ~1 GB

Backward Compatibility

The implementation maintains full backward compatibility:

  • Existing pretrained models work unchanged via automatic QKV weight conversion
  • No API changes - drop-in replacement
  • Preserves gradient checkpointing and training functionality

Testing

Added comprehensive test suite validating:

  • Token expansion functionality
  • Cross-attention mechanisms
  • Sliding window attention logic
  • Memory efficiency calculations
  • Weight conversion compatibility

All tests pass for sequences ranging from 2 to 100 frames, confirming the implementation works correctly across different scales.

Created from VS Code via the GitHub Pull Request extension.


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI and others added 3 commits September 18, 2025 12:55
…global attention

Co-authored-by: stdcall0 <26570964+stdcall0@users.noreply.github.com>
…y validation

Co-authored-by: stdcall0 <26570964+stdcall0@users.noreply.github.com>
Copilot AI changed the title [WIP] Traceback (most recent call last): File "/home/featurize/work/.local/lib/python3.10/site-packages/gradio/queueing.py", line 625, in process_events response = await route_utils.call_process_api( File "/home/featurize/work/.local/lib/python3.1... Fix RuntimeError and implement memory-efficient sliding window global attention Sep 18, 2025
Copilot AI requested a review from stdcall0 September 18, 2025 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants