I am encountering a `RuntimeError: CUDA driver error: out of memory` when processing longer videos with lingbot-map, even when using `--mode windowed`.
After investigating demo.py, it appears that the current data-loading implementation is not truly streaming. Although the model supports windowed inference for long sequences, the `load_images` function (line 199+) reads the entire video file into memory and then immediately moves the full tensor to the GPU (`images = images.to(device)` at line 374).
This causes an OOM error on my GPU (RTX 4080, 16GB VRAM) before the actual windowed inference even starts.
Expected Behavior:
The demo should support lazy loading of video frames. Instead of loading the entire video into the VRAM/RAM at startup, it should load only the necessary frames/batches required by the current windowed inference step.
Actual Behavior:
The script attempts to allocate memory for the entire video at once:

```python
images = images.to(device)  # OOM here
```
Environment:
- GPU: RTX 4080 (16 GB VRAM)
- OS: Ubuntu via WSL2
- Mode: windowed
- Error: `RuntimeError: CUDA driver error: out of memory`
Suggested Solution:
It would be highly beneficial to refactor the data loader to support lazy loading. Instead of creating one large `images` tensor, the loader could return a generator or a custom `Dataset` object that yields frames, or small batches of frames, as needed by the windowed inference logic.
This would enable lingbot-map to process videos of arbitrary length without hitting VRAM limits.
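As a rough illustration of the suggested generator approach (a minimal sketch, not a patch against demo.py — `decode_frame` is a hypothetical stand-in for whatever per-frame decoding the real loader performs, e.g. via OpenCV or decord):

```python
from typing import Callable, Iterator, List

def windowed_frame_loader(
    num_frames: int,
    window_size: int,
    decode_frame: Callable[[int], object],
) -> Iterator[List[object]]:
    """Lazily yield one window of decoded frames at a time, instead of
    materializing the whole video as a single tensor up front."""
    for start in range(0, num_frames, window_size):
        end = min(start + window_size, num_frames)
        # Only the frames belonging to the current window are decoded
        # and held in memory; earlier windows can be garbage-collected.
        yield [decode_frame(i) for i in range(start, end)]
```

Each yielded window could then be stacked and moved to the device independently (e.g. `torch.stack(window).to(device)` inside the inference loop), so peak VRAM usage is bounded by the window size rather than the video length.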
issue creation helped by gemini :)