Single-GPU CPU offloading via reasoner/generator split #11
Open
rahul-steiger-nv wants to merge 4 commits into
lfengad reviewed Jun 3, 2026
    self.language_model.init_weights(buffer_device=buffer_device)

    def offload_module_groups(self) -> dict[str, list[torch.nn.Module]]:
Collaborator
Are these changes for model/vfm already in internal version?
Collaborator
Some conflicts might be solved. THX!
Signed-off-by: Rahul Steiger <rsteiger@nvidia.com>
Author
I fixed the conflicts.
Summary
Opt-in, single-GPU CPU offloading for in-tree inference so Cosmos3-Nano fits on
smaller GPUs. The reasoner tower runs once as a prefill that caches per-layer K/V;
the denoise loop then runs generator-only with the reasoner offloaded to CPU.
Default-off — the joint path is unchanged.
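The two-phase flow described above can be sketched in plain Python. This is a toy illustration only, not the PR's code: `ReasonerCache`, `toy_project`, and `run_inference` are hypothetical names, and the "projections" are simple arithmetic standing in for real attention layers. It shows the reasoner running once to fill a per-layer K/V cache, after which every denoise step reads the cache instead of re-running the reasoner.

```python
from dataclasses import dataclass, field

Vector = list[float]


@dataclass
class ReasonerCache:
    """Hypothetical stand-in for a per-layer K/V cache filled during prefill."""
    kv: dict[int, tuple[Vector, Vector]] = field(default_factory=dict)

    def store(self, layer: int, k: Vector, v: Vector) -> None:
        self.kv[layer] = (k, v)

    def replay(self, layer: int) -> tuple[Vector, Vector]:
        return self.kv[layer]


def toy_project(tokens: Vector, layer: int) -> tuple[Vector, Vector]:
    """Toy K/V projection; a real model would apply learned weights here."""
    keys = [t * (layer + 1) for t in tokens]
    values = [t + layer for t in tokens]
    return keys, values


def run_inference(und_tokens: Vector, gen_tokens: Vector,
                  num_layers: int, num_steps: int) -> dict:
    cache = ReasonerCache()
    stats = {"reasoner_passes": 0, "generator_passes": 0, "keys_seen": []}

    # Prefill: one reasoner pass that caches K/V for every layer.
    for layer in range(num_layers):
        cache.store(layer, *toy_project(und_tokens, layer))
    stats["reasoner_passes"] += 1
    # At this point the reasoner weights are no longer needed on the GPU.

    # Denoise loop: generator-only, reading cached reasoner K/V per layer.
    for _ in range(num_steps):
        for layer in range(num_layers):
            und_k, und_v = cache.replay(layer)
            gen_k, gen_v = toy_project(gen_tokens, layer)
            keys, values = und_k + gen_k, und_v + gen_v
            stats["keys_seen"].append(len(keys))
        stats["generator_passes"] += 1
    return stats
```

Whatever the number of denoise steps, the reasoner pass count stays at one, which is what makes it safe to offload the reasoner for the whole loop.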
Enable with `--offload-stages reasoner generator vae`.
How it works
- `MoTDecoderLayer.forward` gains `und_only` (prefill) / `gen_only` (denoise) branches; `ReasonerMemoryState` caches understanding K/V in prefill and replays it during generator-only denoise. Empty sub-sequences true-skip their projections, so offloaded weights are never invoked.
- `OffloadPipeline` packs each group (reasoner/generator/vae) into pinned CPU memory and stages one at a time into a single reusable GPU arena: reasoner around prefill, generator around denoise, VAE around decode.
- Offloaded groups stay on the `meta` device through init and materialize directly on CPU (never touch the GPU); freed device memory is returned before the arena is allocated.
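The one-group-at-a-time staging can be modeled without a GPU. The sketch below is an assumption-laden toy, not the PR's `OffloadPipeline`: `OffloadArena` is a hypothetical class, `bytes` objects stand in for pinned CPU buffers, and a single `bytearray` stands in for the reusable GPU arena. It illustrates two properties of the design: the arena only needs to be as large as the largest group, and only one group is resident at any time.

```python
from contextlib import contextmanager


class OffloadArena:
    """Toy model of staging weight groups into one reusable device arena."""

    def __init__(self, groups: dict[str, bytes]):
        # Host copies stand in for pinned CPU memory.
        self.host = dict(groups)
        # One arena sized for the largest group, reused by every stage.
        self.arena = bytearray(max(len(buf) for buf in groups.values()))
        self.resident = None
        self.log = []

    @contextmanager
    def stage(self, name: str):
        if self.resident is not None:
            raise RuntimeError(f"{self.resident!r} is still staged")
        data = self.host[name]
        # Stands in for a host-to-device copy into the arena.
        self.arena[:len(data)] = data
        self.resident = name
        self.log.append(name)
        try:
            # Hand out a view of the staged weights for this stage only.
            yield memoryview(self.arena)[:len(data)]
        finally:
            # Release the arena so the next group can be staged.
            self.resident = None
```

A caller would wrap each phase in its own `with pipeline.stage(...)` block, for example `"reasoner"` around prefill, `"generator"` around the denoise loop, and `"vae"` around decode.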
Constraints
Single-GPU only, incompatible with CUDA graphs (both guarded); requires `two_way` attention with
`video_temporal_causal=False`.
Testing
Validated on Cosmos3-Nano, RTX 5090 (32 GB): 720p up to 33 frames and
480p up to 389 frames.
Future work
Layerwise offloading could push resolution/frames higher, at the cost of more
staging overhead and runtime — left as a follow-up.
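The trade-off can be made concrete with a small back-of-the-envelope model. This is a sketch under simplifying assumptions that are not from the PR: `staging_cost` is a hypothetical helper, layer sizes are made-up illustration values, and layerwise staging is assumed to re-copy every layer on every denoise step with no overlap or double buffering.

```python
def staging_cost(layer_bytes: list[int], num_steps: int,
                 layerwise: bool) -> tuple[int, int]:
    """Return (arena_bytes, bytes_transferred) for the denoise loop."""
    total = sum(layer_bytes)
    if layerwise:
        # Arena holds one layer at a time, but every layer is re-staged
        # on every denoise step.
        return max(layer_bytes), total * num_steps
    # Group-level: the whole generator is staged once and stays resident.
    return total, total
```

Under these assumptions, layerwise staging shrinks the arena from the sum of the layer sizes to the largest single layer, while transfer volume grows linearly with the number of denoise steps.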