[Bug/Feature Request] OOM Error with long videos: 'load_images' preloads entire video into VRAM #44

@strhwste

Description

I am encountering a `RuntimeError: CUDA driver error: out of memory` when processing longer videos with lingbot-map, even when using `--mode windowed`.

After investigating `demo.py`, it appears that the current data loading implementation is not truly streaming. Even though the model supports windowed inference for long sequences, the `load_images` function (around line 199) reads the entire video file into memory and then immediately moves the full tensor to the GPU (`images = images.to(device)` at line 374).

This causes an OOM error on my GPU (RTX 4080, 16GB VRAM) before the actual windowed inference even starts.

Expected Behavior:
The demo should support lazy loading of video frames. Instead of loading the entire video into VRAM/RAM at startup, it should load only the frames/batches required by the current windowed inference step.

Actual Behavior:
The script attempts to allocate memory for the entire video at once:

```python
images = images.to(device)  # OOM here
```

Environment:
- GPU: RTX 4080 (16 GB VRAM)
- OS: Ubuntu via WSL2
- Mode: windowed
- Error: `RuntimeError: CUDA driver error: out of memory`

Suggested Solution:
It would be highly beneficial to refactor the data loader to support lazy loading. Instead of creating one large `images` tensor, the loader could return a generator or a custom `Dataset` object that yields frames or small batches of frames as needed by the windowed inference logic.

This would enable lingbot-map to process videos of arbitrary length without hitting VRAM limits.
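To make the suggestion concrete, here is a minimal sketch of such a generator. All names here (`iter_frame_windows`, `decode_frames`, `run_windowed_inference`) are hypothetical and not part of the current `demo.py`; the point is only that at most one window of frames is ever materialized at a time:

```python
import numpy as np

def iter_frame_windows(frames, window_size):
    """Lazily group an iterable of frames into fixed-size windows.

    Only one window's worth of frames is held in memory at a time,
    so arbitrarily long videos never need to be loaded all at once.
    """
    window = []
    for frame in frames:
        window.append(frame)
        if len(window) == window_size:
            yield np.stack(window)
            window = []
    if window:  # final, possibly shorter, window
        yield np.stack(window)

# The inference loop would then move each window to the GPU separately,
# replacing the single whole-video transfer, e.g.:
#     for window in iter_frame_windows(decode_frames(video_path), 32):
#         images = torch.from_numpy(window).to(device)  # one window of VRAM
#         run_windowed_inference(images)
```

The `frames` iterable could be backed by any streaming decoder (e.g. `cv2.VideoCapture` or decord), so decoding and GPU transfer both stay bounded by the window size rather than the video length.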

issue creation helped by gemini :)
