Single-GPU CPU offloading via reasoner/generator split #11

Open
rahul-steiger-nv wants to merge 4 commits into NVIDIA:main from rahul-steiger-nv:main

@rahul-steiger-nv

Summary

Opt-in, single-GPU CPU offloading for in-tree inference so Cosmos3-Nano fits on
smaller GPUs. The reasoner tower runs once as a prefill that caches per-layer K/V;
the denoise loop then runs generator-only with the reasoner offloaded to CPU.
Default-off — the joint path is unchanged.

Enable with --offload-stages reasoner generator vae.
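
As a rough illustration of how a multi-value flag like this is commonly parsed, here is a minimal `argparse` sketch. Only the flag name and the three stage names come from the description above; the parser itself, `build_parser`, and the `STAGES` tuple are assumptions for illustration and may not match how the PR wires the option.

```python
import argparse

# The three group names appear in the PR description; everything else is hypothetical.
STAGES = ("reasoner", "generator", "vae")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="offload flag sketch")
    parser.add_argument(
        "--offload-stages",
        nargs="+",          # one or more stage names after the flag
        choices=STAGES,     # reject unknown group names early
        default=[],         # default-off: nothing is offloaded unless requested
        help="module groups to keep on CPU and stage onto the GPU on demand",
    )
    return parser


args = build_parser().parse_args(["--offload-stages", "reasoner", "generator", "vae"])
print(args.offload_stages)  # ['reasoner', 'generator', 'vae']
```

With no flag given, `offload_stages` is an empty list, which corresponds to the unchanged joint path.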

How it works

  • Split: MoTDecoderLayer.forward gains und_only (prefill) and gen_only
    (denoise) branches; ReasonerMemoryState caches the understanding K/V during
    prefill and replays it during generator-only denoise. Empty sub-sequences skip
    their projections entirely, so offloaded weights are never invoked.
  • Staging: OffloadPipeline packs each group (reasoner/generator/vae) into
    pinned CPU memory and stages one at a time into a single reusable GPU arena —
    reasoner around prefill, generator around denoise, VAE around decode.
  • Two-phase load: offloaded modules stay on meta through init and materialize
    directly on CPU (never touch the GPU); freed device memory is returned before the
    arena is allocated.
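
The staging idea in the list above can be sketched with a toy model: each group is packed into one contiguous host buffer, and a single arena sized for the largest group is overwritten as each phase begins. This is not the PR's `OffloadPipeline`. `ArenaStager` and its methods are hypothetical, and plain `bytes`/`bytearray` stand in for pinned CPU memory and the GPU arena.

```python
from typing import Dict, Optional


class ArenaStager:
    """Toy model of one reusable arena shared by several packed groups."""

    def __init__(self, groups: Dict[str, bytes]) -> None:
        # Stand-in for weights packed into pinned CPU memory, one blob per group.
        self.groups = dict(groups)
        # The arena is allocated once, sized for the largest group.
        self.arena = bytearray(max(len(blob) for blob in self.groups.values()))
        self.resident: Optional[str] = None

    def stage(self, name: str) -> int:
        """Copy group `name` into the arena; return the number of bytes in use."""
        blob = self.groups[name]
        if self.resident != name:
            # Overwrite in place; the previous group's bytes are simply replaced.
            self.arena[: len(blob)] = blob
            self.resident = name
        return len(blob)


stager = ArenaStager({"reasoner": b"R" * 8, "generator": b"G" * 6, "vae": b"V" * 3})

order = []
for phase, group in [("prefill", "reasoner"), ("denoise", "generator"), ("decode", "vae")]:
    used = stager.stage(group)
    order.append((phase, stager.resident, used))

print(order)
```

Because only one group is resident at a time, peak arena size in this sketch is set by the largest group, not by the sum of all groups.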

Constraints

Single-GPU only and incompatible with CUDA graphs (both guarded). Requires two_way
attention with video_temporal_causal=False.
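
A guard of the kind mentioned above might look like the following sketch. The config class, its field names (`world_size`, `use_cuda_graphs`), and the error messages are all assumptions for illustration; the PR's actual checks are not shown in this thread. The attention-mode constraint is left out because the description does not say whether it is guarded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InferenceConfig:
    """Hypothetical config; field names are illustrative only."""
    offload_stages: List[str] = field(default_factory=list)
    world_size: int = 1
    use_cuda_graphs: bool = False


def validate_offload(cfg: InferenceConfig) -> None:
    """Raise early if offloading is requested in an unsupported setup."""
    if not cfg.offload_stages:
        return  # default-off: nothing to validate
    if cfg.world_size != 1:
        raise ValueError("CPU offloading supports a single GPU only")
    if cfg.use_cuda_graphs:
        raise ValueError("CPU offloading is incompatible with CUDA graphs")


validate_offload(InferenceConfig(offload_stages=["reasoner"]))  # passes silently
```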

Testing

Validated on Cosmos3-Nano, RTX 5090 (32 GB): 720p up to 33 frames and
480p up to 389 frames.

Future work

Layerwise offloading could push resolution/frames higher, at the cost of more
staging overhead and runtime — left as a follow-up.

rahul-steiger-nv marked this pull request as ready for review on June 2, 2026, 18:54

self.language_model.init_weights(buffer_device=buffer_device)

def offload_module_groups(self) -> dict[str, list[torch.nn.Module]]:
Collaborator

Are these changes for model/vfm already in internal version?

Author

No, not yet afaik.

@lfengad
Collaborator

lfengad commented Jun 3, 2026

Some conflicts might be solved. THX!

Signed-off-by: Rahul Steiger <rsteiger@nvidia.com>
@rahul-steiger-nv
Author

> Some conflicts might be solved. THX!

I fixed the conflicts.
