
Fix CUDA OOM on pipeline switching #403

Open

emranemran wants to merge 6 commits into backend-fal-v6 from cuda-oom

Conversation


@emranemran emranemran commented Feb 5, 2026

Summary

Fixes CUDA out-of-memory errors when switching between pipelines (e.g. longlive → krea).

Root Cause

When _unload_pipeline_by_id_unsafe() removes a pipeline from the manager's _pipelines dict, active WebRTC sessions still hold references to the old pipeline through:

WebRTCManager.sessions → Session → VideoProcessingTrack → FrameProcessor
  → PipelineProcessor.pipeline → old pipeline object (still on GPU)

Because PipelineProcessor stores a direct reference (self.pipeline = pipeline), the old pipeline and all its GPU tensors survive gc.collect(). Additionally, pinned memory buffers in FrameProcessor were never released during pipeline switching.

Changes

  • Add cleanup() to Pipeline base class — calls gc.collect() and torch.cuda.empty_cache() to free GPU memory. Does NOT clear component/state dicts, to avoid race conditions with in-flight frame processing (a simplified sketch follows this list).
  • Call cleanup() during pipeline unload — in _unload_pipeline_by_id_unsafe(), invoke cleanup() before removing references.
  • Add GPU memory logging — log free GPU memory before/after unload and after load, wrapped in try/except so logging failures cannot disrupt pipeline switching.
  • Clear pinned buffer cache on frame processor stop — release DMA transfer buffers that were never freed during pipeline switching.
  • Release pipeline reference on processor stop — set self.pipeline = None to break the reference chain and let GC reclaim GPU memory.
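
For illustration, a minimal sketch of the two ends of this change (the class layout and method placement below are simplified; the real code lives on the Pipeline base class and PipelineProcessor):

    import gc
    import torch

    class Pipeline:
        def cleanup(self) -> None:
            # Best-effort GPU cleanup. Intentionally does NOT clear the
            # component/state dicts, to avoid racing with in-flight frames.
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

    class PipelineProcessor:
        def __init__(self, pipeline: Pipeline):
            self.pipeline = pipeline  # direct ref that previously kept old pipelines alive

        def stop(self) -> None:
            # Break the reference chain so gc.collect() can actually reclaim GPU memory.
            self.pipeline = None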

Test plan

  • Deploy to fal.ai staging and verify GPU memory logs appear
  • Switch pipelines (longlive → krea) without crashes or KeyError: 'vae'
  • Confirm GPU memory is reclaimed between pipeline switches
  • uv run pytest passes
  • uv run daydream-scope starts without errors

🤖 Generated with Claude Code

emranemran force-pushed the cuda-oom branch 2 times, most recently from 5e0c0b2 to 8449455 on February 5, 2026 at 20:25
- Add cleanup() to Pipeline base class (gc.collect + empty_cache)
- Call cleanup on pipeline unload to free GPU memory
- Log GPU memory before/after unload and after load
- Clear pinned buffer cache on frame processor stop
- Release pipeline reference on pipeline processor stop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: emranemran <emran.mah@gmail.com>

emranemran commented Feb 6, 2026

This seems to be working. Tried switching from longlive (default settings) to krea (+ vace + v2v + RIFE postprocessor):


    2026-02-06 08:15:00,865 - scope.server.pipeline_manager - INFO - GPU memory free after load: 23.73 GiB
    2026-02-06 08:14:58,524 - scope.server.pipeline_manager - INFO - GPU memory free after load: 23.79 GiB
    2026-02-06 08:12:14,335 - scope.server.pipeline_manager - INFO - GPU memory free after unload: 64.74 GiB
    2026-02-06 08:12:13,528 - scope.server.pipeline_manager - INFO - GPU memory free before unload: 64.76 GiB
    2026-02-06 08:11:19,612 - scope.server.pipeline_manager - INFO - GPU memory free after load: 67.41 GiB
    2026-02-06 08:11:16,302 - scope.server.pipeline_manager - INFO - GPU memory free after load: 67.41 GiB

EDIT: oops, hang on, I read these logs incorrectly. It seems it's not freeing much memory at all:

  - Longlive used ~12.6 GiB on load (80 - 67.41)
  - Before → after unload: 64.76 → 64.74 GiB — only 0.02 GiB freed
  - Krea then loaded on top, using ~41 GiB (64.74 → 23.79)

UPDATE: if I switch from longlive -> krea (+ vace + t2v) w/o RIFE, then I see the following, so the issue could be that the RIFE post-processor doesn't unload properly.

    2026-02-06 09:13:04,519 - scope.server.pipeline_manager - INFO - GPU memory free after unload: 78.37 GiB
    2026-02-06 09:13:03,747 - scope.server.pipeline_manager - INFO - GPU memory free before unload: 64.73 GiB
    2026-02-06 09:12:20,520 - scope.server.pipeline_manager - INFO - GPU memory free after load: 67.20 GiB
    2026-02-06 09:12:18,902 - scope.server.pipeline_manager - INFO - GPU memory free after load: 67.20 GiB
    2026-02-06 09:12:18,901 - scope.server.pipeline_manager - INFO - GPU memory free after unload: 67.20 GiB
    2026-02-06 09:12:16,948 - scope.server.pipeline_manager - INFO - GPU memory free before unload: 26.34 GiB
    2026-02-06 09:08:32,936 - scope.server.pipeline_manager - INFO - GPU memory free after load: 26.34 GiB
    2026-02-06 09:08:22,872 - scope.server.pipeline_manager - INFO - GPU memory free after load: 26.65 GiB
    2026-02-06 09:02:04,712 - scope.server.pipeline_manager - INFO - GPU memory free after unload: 64.66 GiB
    2026-02-06 09:02:04,291 - scope.server.pipeline_manager - INFO - GPU memory free before unload: 64.74 GiB
    2026-02-06 09:01:22,428 - scope.server.pipeline_manager - INFO - GPU memory free after load: 67.41 GiB
    2026-02-06 09:01:20,619 - scope.server.pipeline_manager - INFO - GPU memory free after load: 67.41 GiB

emranemran and others added 2 commits February 6, 2026 00:47
Add unload callback mechanism so PipelineManager can notify
FrameProcessors to release pipeline references before calling
cleanup(). This allows gc.collect() + empty_cache() to actually
free GPU memory during pipeline switches, not just on session end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: emranemran <emran.mah@gmail.com>
Signed-off-by: Varshith B <varshith15@gmail.com>

varshith15 commented Feb 6, 2026

Just deleting the pipeline and calling gc.collect() is not enough; the references to objects that hold GPU data need to be deleted too. For the internal pipelines this should work fine (I've added components and state to be deleted, just to be extra careful), but for plugins it's an issue.

So plugins should also, in their cleanup method, delete their GPU object references.

From my testing locally: longlive (18 GB) -> flashvsr (14 GB) -> streamdiffusion v2 (16 GB, where it's supposed to be taking 14 GB).

cc: @yondonfu


yondonfu commented Feb 6, 2026

@varshith15 I think from a DX POV it would be a pain to require a plugin pipeline to implement a cleanup function, so I'm curious if that can be avoided...

Just deleting the pipeline and calling gc.collect() is not enough; the references to objects that hold GPU data need to be deleted too

Why? If the pipeline is the only thing that contains references to the underlying data structures that are actually consuming GPU mem, shouldn't it be the case that, as long as there is no remaining ref to the pipeline, a gc.collect() and a CUDA cache clear would wipe any GPU mem that was consumed by the pipeline and its underlying data structures?

EDIT: OK, to answer my own question, I think the issue is that a) CUDA tensors can "escape" the pipeline and, if not freed, keep refs around, b) hooks/closures can hold on to refs of vars, c) there are a variety of other ways, it seems...


yondonfu commented Feb 6, 2026

It seems like the only guaranteed clean way to make sure all GPU mem is freed when unloading a pipeline, regardless of how it is implemented, is to isolate it in a subprocess. @leszko looked into this previously, but we tabled it because we didn't want to take on the complexity (particularly inter-process comms for an already latency-sensitive code path). I would treat that path as a separate thing to consider and discuss.

This leaves the question: what are the practical steps that can be taken to minimize the chance of this type of OOM during pipeline switching in the short term?

A few ideas from a chat with Claude, forming a superset of what is already done in this PR and @varshith15's suggestions:

Step 1: Stop active processors and break ref chains BEFORE gc.collect

 The single highest-impact change. Currently _unload_pipeline_by_id_unsafe does:
 1. del self._pipelines[pipeline_id] — drops manager's ref
 2. gc.collect() — but processors still hold refs, so nothing is freed

 Reorder to:
 1. Notify all active FrameProcessors that use this pipeline_id to stop their PipelineProcessors
 2. Each PipelineProcessor.stop() should set self.pipeline = None and drain queues (freeing GPU tensors)
 3. FrameProcessor should clear _pinned_buffer_cache
 4. THEN del self._pipelines[pipeline_id]
 5. THEN gc.collect() + torch.cuda.empty_cache()

 This is similar to PR #403's callback approach, but must be synchronous/blocking — the unload must wait until all processors have actually released their refs.

 Files: pipeline_manager.py, frame_processor.py, pipeline_processor.py
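
 As a rough sketch of this ordering (the helper names _processors_for and clear_pinned_buffers below are hypothetical, not existing APIs):

    import gc
    import torch

    class PipelineManager:
        def _unload_pipeline_by_id_unsafe(self, pipeline_id: str) -> None:
            # Steps 1-3: synchronously stop every processor using this pipeline so all
            # external refs (pipeline object, queued tensors, pinned buffers) are released.
            for processor in self._processors_for(pipeline_id):   # hypothetical helper
                processor.stop()                  # processor drops its pipeline ref and drains queues
                processor.clear_pinned_buffers()  # hypothetical; clears _pinned_buffer_cache

            # Step 4: only now drop the manager's own reference.
            del self._pipelines[pipeline_id]

            # Step 5: with no remaining refs, these calls can actually free VRAM.
            gc.collect()
            torch.cuda.empty_cache()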

 Step 2: Explicit teardown on Pipeline base class (generic, no plugin work needed)

 Add a teardown() method to Pipeline base class in interface.py that:
 - Iterates self.__dict__ and deletes any torch.nn.Module instances found (catches RIFE's self.rife_interpolator, standard self.components, etc.)
 - Clears self.components, self.state, self.blocks if they exist
 - This is generic — handles any pipeline without requiring custom cleanup
 - Called by pipeline_manager after all external refs are broken (Step 1) but before gc.collect

 This addresses the RIFE-specific problem without requiring each pipeline to implement its own cleanup. The base class just introspects __dict__ for GPU-holding objects.

 Files: interface.py, pipeline_manager.py
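
 A best-effort sketch of such a teardown (exact attribute handling in interface.py may differ):

    import torch

    class Pipeline:
        def teardown(self) -> None:
            # Drop any attribute that is a torch.nn.Module so its parameters lose their last ref here.
            for name, value in list(self.__dict__.items()):
                if isinstance(value, torch.nn.Module):
                    setattr(self, name, None)
            # Clear the well-known containers if they exist.
            for attr in ("components", "state", "blocks"):
                container = getattr(self, attr, None)
                if hasattr(container, "clear"):
                    container.clear()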

 Step 3: Properly drain GPU tensors from queues

 PipelineProcessor.stop() currently drains queues with get_nowait() but the tensors go out of scope one by one. For deterministic cleanup:
 - Drain all tensors into a list, then del the list explicitly
 - Alternatively, just ensure self.pipeline = None is set in stop() (as PR #403 proposes)

 The queue draining in stop() already exists (lines 171-185 of pipeline_processor.py) — just ensure that after draining, there's no lingering ref.

 Files: pipeline_processor.py
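
 For example, a deterministic drain could look roughly like this (the queue attribute name is assumed):

    import queue

    class PipelineProcessor:
        def stop(self) -> None:
            # Drain everything into one list, then drop the whole batch at once so no
            # local variable keeps the last GPU tensor alive.
            drained = []
            try:
                while True:
                    drained.append(self._output_queue.get_nowait())  # assumed queue attribute
            except queue.Empty:
                pass
            del drained
            self.pipeline = None  # as this PR proposes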

 Step 4: Clear pinned buffer cache on stop

 FrameProcessor._pinned_buffer_cache holds pinned CUDA memory buffers that are never released during pipeline switching. Add to FrameProcessor.stop():

 with self._pinned_buffer_lock:
     self._pinned_buffer_cache.clear()

 This is small but pinned memory can be non-trivial.

I think what feels reasonable is for there to be a generic cleanup fn that does a best-effort cleanup for all pipelines by introspecting __dict__, and it just leaves the possibility of edge cases for now.

Signed-off-by: Varshith B <varshith15@gmail.com>

varshith15 commented Feb 6, 2026

I think what feels reasonable is for there to be a generic cleanup fn that does a best-effort cleanup for all pipelines by introspecting __dict__, and it just leaves the possibility of edge cases for now.

I think we definitely need to go a couple of levels deep, at least 2. For instance with RIFE, the RIFEInterpolator isn't a torch module, but the model inside it is.

Added a BFS-style crawler.
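
For reference, that kind of crawler could look roughly like this (a sketch, not the exact code added in this PR):

    from collections import deque
    import torch

    def release_gpu_refs(root, max_depth: int = 2) -> None:
        # Breadth-first walk over object attributes, nulling out anything that holds
        # GPU data (nn.Module or Tensor), so a wrapper like RIFEInterpolator that is
        # not itself a module still has its inner model released.
        seen = set()
        pending = deque([(root, 0)])
        while pending:
            obj, depth = pending.popleft()
            if id(obj) in seen or depth > max_depth or not hasattr(obj, "__dict__"):
                continue
            seen.add(id(obj))
            for name, value in list(vars(obj).items()):
                if isinstance(value, (torch.nn.Module, torch.Tensor)):
                    setattr(obj, name, None)
                else:
                    pending.append((value, depth + 1))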


yondonfu commented Feb 6, 2026

I think we definitely need to go a couple of levels deep, at least 2. For instance with RIFE, the RIFEInterpolator isn't a torch module, but the model inside it is.

But if you remove all refs in __dict__, then even if RIFEInterpolator is nested in some field called self.my_field, shouldn't a subsequent cache clear + collect handle that, because if self.my_field is gone then RIFEInterpolator has no refs?

Signed-off-by: Varshith B <varshith15@gmail.com>
@varshith15

But if you remove all refs in __dict__, then even if RIFEInterpolator is nested in some field called self.my_field, shouldn't a subsequent cache clear + collect handle that, because if self.my_field is gone then RIFEInterpolator has no refs?

Yeah, this makes sense and is also a lot cleaner; updated it. I was thinking of only deleting tensors and tensor modules, but yeah, GC should be able to collect if the ref count goes to 0. This should get most of the refs.


yondonfu commented Feb 6, 2026

On the backend-fal-v6 branch I can't seem to repro the issue of memory not being freed with the following combos:

video-depth-anything + longlive + rife -> passthrough
reward-forcing -> passthrough
memflow -> passthrough

In all cases, when the switch to passthrough is completed I see the VRAM drop down to baseline levels.

I did not get a chance to try with Krea though. So, I wonder if it is Krea specific.

If testing on an H100, one thing to try that comes to mind is changing this line to compile=False, as this is one of the bigger differences relative to other pipelines that is specific to H100s. Perhaps the torch.compile happening under the hood is causing some caching that prevents GPU mem from being freed...
