Single-GPU CPU offloading via reasoner/generator split by rahul-steiger-nv · Pull Request #11 · NVIDIA/cosmos-framework

rahul-steiger-nv · 2026-06-02T18:38:24Z

Summary

Opt-in, single-GPU CPU offloading for in-tree inference so Cosmos3-Nano fits on
smaller GPUs. The reasoner tower runs once as a prefill that caches per-layer K/V;
the denoise loop then runs generator-only with the reasoner offloaded to CPU.
Default-off — the joint path is unchanged.

Enable with --offload-stages reasoner generator vae.
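
As a rough illustration of how a flag like this could be parsed, here is a minimal argparse sketch. Only the flag name and the three stage names come from the description above; `build_parser`, `OFFLOAD_STAGES`, and the choice of `nargs="+"` with an empty default are assumptions for illustration, not the PR's actual CLI code.

```python
import argparse

# Stage names are taken from the PR description; everything else is illustrative.
OFFLOAD_STAGES = ("reasoner", "generator", "vae")


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing an opt-in --offload-stages flag (hypothetical)."""
    parser = argparse.ArgumentParser(description="toy inference CLI")
    parser.add_argument(
        "--offload-stages",
        nargs="+",                # one or more stage names after the flag
        choices=OFFLOAD_STAGES,   # reject unknown stage names
        default=[],               # default-off: nothing is offloaded unless requested
        help="module groups to keep on CPU and stage onto the GPU on demand",
    )
    return parser


args = build_parser().parse_args(["--offload-stages", "reasoner", "generator", "vae"])
```

With no flag given, `offload_stages` stays empty, which matches the default-off behaviour described in the summary.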

How it works

- Split: MoTDecoderLayer.forward gains und_only (prefill) and gen_only
  (denoise) branches. ReasonerMemoryState caches the understanding K/V during
  prefill and replays it during generator-only denoise. Empty sub-sequences skip
  their projections entirely, so offloaded weights are never invoked.
- Staging: OffloadPipeline packs each group (reasoner/generator/vae) into
  pinned CPU memory and stages one group at a time into a single reusable GPU
  arena: reasoner around prefill, generator around denoise, VAE around decode.
- Two-phase load: offloaded modules stay on the meta device through init and
  materialize directly on CPU (they never touch the GPU); freed device memory is
  returned before the arena is allocated.
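
The staging idea above can be sketched without any GPU at all. The following is a device-free toy model, not the PR's OffloadPipeline: a `bytes` blob stands in for a group packed in pinned CPU memory, a `bytearray` stands in for the reusable GPU arena, and the class name `ToyOffloadPipeline` and its `stage` method are hypothetical. It only shows the invariant that one group occupies the arena at a time.

```python
from contextlib import contextmanager
from typing import Optional


class ToyOffloadPipeline:
    """Device-free toy model of staging one weight group at a time (hypothetical)."""

    def __init__(self, groups: dict) -> None:
        self.packed = dict(groups)  # "host" copies, one packed blob per group
        # Size the arena for the largest group so every group fits in turn.
        self.arena = bytearray(max(len(blob) for blob in groups.values()))
        self.resident: Optional[str] = None  # group currently in the arena

    @contextmanager
    def stage(self, name: str):
        """Copy one group into the shared arena for the duration of the block."""
        if self.resident is not None:
            raise RuntimeError(f"{self.resident!r} is still staged")
        blob = self.packed[name]
        self.arena[: len(blob)] = blob  # stands in for the host-to-device copy
        self.resident = name
        try:
            yield memoryview(self.arena)[: len(blob)]
        finally:
            self.resident = None  # the arena is free for the next group


pipeline = ToyOffloadPipeline(
    {"reasoner": b"R" * 8, "generator": b"G" * 6, "vae": b"V" * 2}
)
seen = []
with pipeline.stage("reasoner") as weights:   # around prefill
    seen.append(bytes(weights))
with pipeline.stage("generator") as weights:  # around denoise
    seen.append(bytes(weights))
with pipeline.stage("vae") as weights:        # around decode
    seen.append(bytes(weights))
```

In this toy the arena is sized for the largest group, so it is allocated once and reused across prefill, denoise, and decode. How the real arena is sized is not stated in the description.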

Constraints

Single-GPU only and incompatible with CUDA graphs (both guarded); requires
two_way attention with video_temporal_causal=False.
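
A guard for these constraints might look roughly like the sketch below. The config class, its field names, and the error messages are all hypothetical, and reading the attention constraint as "two_way with video_temporal_causal=False is required" is an assumption; the PR's actual checks may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInferenceConfig:
    """Hypothetical config; field names are illustrative, not the PR's."""

    offload_stages: list = field(default_factory=list)
    world_size: int = 1
    use_cuda_graphs: bool = False
    attention_mode: str = "two_way"
    video_temporal_causal: bool = False


def check_offload_constraints(cfg: ToyInferenceConfig) -> None:
    """Raise ValueError if offloading is requested in an unsupported setup."""
    if not cfg.offload_stages:
        return  # offloading is opt-in; the joint path has no extra constraints
    if cfg.world_size != 1:
        raise ValueError("CPU offloading is single-GPU only")
    if cfg.use_cuda_graphs:
        raise ValueError("CPU offloading is incompatible with CUDA graphs")
    if cfg.attention_mode != "two_way" or cfg.video_temporal_causal:
        raise ValueError(
            "CPU offloading expects two_way attention with "
            "video_temporal_causal=False"
        )
```

Checking these conditions up front, before any weights are loaded, lets an unsupported combination fail quickly rather than partway through staging.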

Testing

Validated on Cosmos3-Nano, RTX 5090 (32 GB): 720p up to 33 frames and
480p up to 389 frames.

Future work

Layerwise offloading could push resolution/frames higher, at the cost of more
staging overhead and runtime — left as a follow-up.

lfengad · 2026-06-03T06:30:10Z


        self.language_model.init_weights(buffer_device=buffer_device)

+    def offload_module_groups(self) -> dict[str, list[torch.nn.Module]]:


Are these changes for model/vfm already in internal version?

No, not yet afaik.

lfengad · 2026-06-03T15:09:11Z

Some conflicts might be solved. THX!

Signed-off-by: Rahul Steiger <rsteiger@nvidia.com>

rahul-steiger-nv · 2026-06-03T15:46:46Z

> Some conflicts might be solved. THX!

I fixed the conflicts.

rahul-steiger-nv closed this Jun 2, 2026

rahul-steiger-nv reopened this Jun 2, 2026

rahul-steiger-nv marked this pull request as ready for review June 2, 2026 18:54

lfengad reviewed Jun 3, 2026

rahul-steiger-nv added 3 commits June 3, 2026 08:18

added initial offloading implementation

194e69c

Signed-off-by: Rahul Steiger <rsteiger@nvidia.com>

materialize reasoner and generator on cpu instead of gpu

2bfe944

Signed-off-by: Rahul Steiger <rsteiger@nvidia.com>

fixed pre-commit

c92f423

Signed-off-by: Rahul Steiger <rsteiger@nvidia.com>

rahul-steiger-nv force-pushed the main branch from 7279cf1 to c92f423 June 3, 2026 15:45

Merge branch 'main' into main

dd71a63

rahul-steiger-nv wants to merge 4 commits into NVIDIA:main from rahul-steiger-nv:main
