Table of Contents
- Repository Layout
- The cosmos_framework/ Package
  - cosmos_framework/algorithm/
  - cosmos_framework/callbacks/
  - cosmos_framework/checkpoint/
  - cosmos_framework/communicator/
  - cosmos_framework/configs/
  - cosmos_framework/controller/
  - cosmos_framework/data/
  - cosmos_framework/evaluation/
  - cosmos_framework/inference/
  - cosmos_framework/launcher/
  - cosmos_framework/model/
  - cosmos_framework/tools/
  - cosmos_framework/trainer/
  - cosmos_framework/utils/
  - cosmos_framework/workers/
- Supporting Directories
- Where to Add New Code
Cosmos/
├── cosmos_framework/ # Main package (training infra + inference subpackage)
│ ├── inference/ # Inference subpackage (model, args, defaults, Ray serving, common helpers, SFT experiment configs)
│ ├── ... # Training-infra subpackages: data, model, trainer, callbacks, checkpoint, …
│ └── scripts/ # CLI entry-point scripts: train.py, _train.py, inference.py, eval.py, export_model.py, …
├── packages/ # Backend shim packages: diffusers-cosmos3, transformers-cosmos3, vllm-cosmos3
├── docs/ # User documentation (you are here)
├── docker/ # Dockerfiles for reproducible environments
├── examples/ # Runnable training / fine-tuning / inference examples
├── tests/ # Unit and integration tests
├── tools/ # Standalone CLI utilities (e.g. checkpoint conversion)
├── pyproject.toml # uv-managed dependency manifest
├── uv.lock # Pinned dependency graph (do not edit by hand)
└── .python-version # Python version pin (used by uv)
cosmos_framework/ is the single Python package. Training infrastructure (data, model, trainer, callbacks, checkpoint, utils, …) lives in top-level subpackages; inference (Diffusers / Transformers / vLLM-friendly inference core, Ray serving, per-modality defaults, training-side experiment YAMLs) lives under cosmos_framework/inference/. The library-style backend shims that load Cosmos3 checkpoints into upstream ecosystems live under packages/{diffusers,transformers,vllm}-cosmos3/.
The cosmos_framework/ package is organized around the workflow of a large-scale, distributed training run — particularly post-training and reinforcement-learning regimes — with each subpackage owning one concern.
cosmos_framework/
├── algorithm/ # Loss functions, reward models, RL algorithms [planned]
│ ├── loss/
│ ├── reward/
│ └── rl/
├── callbacks/ # Lifecycle hooks (logging, profiling, eval triggers, checkpoint cadence)
├── checkpoint/ # Saving, loading, conversion (DCP ↔ HF safetensors)
├── communicator/ # Inter-process / inter-worker communication primitives [planned]
├── configs/ # Pydantic-validated TOML schema + LazyConfig experiment SKUs
│ ├── base/ # config.py, experiment/{action,sft,posttrain_video}/, vlm/, defaults/
│ └── toml_config/ # sft_config.py (pydantic schema), toml_config_helper.py
├── controller/ # Top-level orchestration of multi-worker training jobs [planned]
├── data/ # Dataset loading, batching, augmentation, sharding
├── evaluation/ # Eval harness for trained checkpoints [planned]
├── inference/ # Inference engine + modality defaults + Ray serving + shared helpers
├── launcher/ # Job launching (Slurm, torchrun, k8s) [planned]
├── model/ # Model definitions and parallelism wrappers
├── tools/ # In-package CLI utilities (flops/, visualize/)
├── trainer/ # Training loop, optimizer step, gradient accumulation
├── utils/ # Shared low-level utilities (logging, config, distributed helpers, vfm/, vlm/)
└── workers/ # Specialized roles in a distributed RL job [planned]
    ├── reference/ # Reference / frozen-policy worker (KL anchor)
    ├── reward/ # Reward-model worker
    ├── rollout/ # On-policy rollout generation worker
    └── simulations/ # Simulator-driven environment worker
Note: some subpackages above don't exist yet. Entries tagged [planned] (algorithm/, communicator/, controller/, evaluation/, launcher/, workers/) describe the intended home for functionality that has not yet landed in this release. The directories are intentionally absent on disk; they will be created when their respective feature areas (RL training, multi-worker orchestration, the distributed evaluation harness, job launchers, RL worker roles) come online. The "Where to Add New Code" table at the bottom of this document still applies: when you build one of these features, create the matching subpackage and put the code there. Subpackages without the tag (callbacks/, checkpoint/, configs/, data/, inference/, model/, tools/, trainer/, utils/, plus scripts/, covered separately) exist today.
Planned — not yet present in this release; the subpackage will be created when RL / loss work lands.
Algorithmic primitives that are independent of the model and trainer.
- loss/: supervised and distillation losses (cross-entropy, flow-matching, KL, etc.).
- reward/: reward functions and learned reward heads.
- rl/: RL update rules (PPO, GRPO, DPO-family) that consume losses and rewards.
Add new objectives here, not inside the trainer.
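As an illustration of a primitive that depends on neither the model nor the trainer, here is a minimal sketch of group-relative advantage normalization in the style used by GRPO. The function name and the `eps` parameter are assumptions made for this example; nothing like it exists in this release yet, since the subpackage is still planned.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group of rollouts sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The function
    only sees plain floats, so it can be unit-tested without a model or trainer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also be a valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, two of which were rewarded.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the function takes rewards in and returns advantages out, an RL update rule can consume it without importing anything from the training loop.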
Pluggable lifecycle hooks invoked by the trainer at well-defined points (step begin/end, epoch boundary, eval, save, exception). Use callbacks for cross-cutting concerns such as Weights & Biases (wandb) logging, gradient clipping, MoE stability monitoring, dataloader-state checkpointing, and learning-rate logging.
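The hook pattern described above can be sketched as follows. The class names (`Callback`, `LossHistory`, `CallbackList`) and the hook names are hypothetical and chosen for this example; the actual base class in cosmos_framework/callbacks/ may expose a different interface.

```python
from typing import Iterable, List, Tuple


class Callback:
    """Hypothetical base class: every hook is a no-op, so subclasses override only what they need."""

    def on_step_begin(self, step: int) -> None:
        pass

    def on_step_end(self, step: int, loss: float) -> None:
        pass

    def on_save(self, step: int) -> None:
        pass


class LossHistory(Callback):
    """Records (step, loss) pairs; stands in for a real logging callback."""

    def __init__(self) -> None:
        self.losses: List[Tuple[int, float]] = []

    def on_step_end(self, step: int, loss: float) -> None:
        self.losses.append((step, loss))


class CallbackList:
    """Dispatches one named hook to every registered callback, in order."""

    def __init__(self, callbacks: Iterable[Callback]) -> None:
        self.callbacks = list(callbacks)

    def fire(self, hook: str, *args, **kwargs) -> None:
        for callback in self.callbacks:
            getattr(callback, hook)(*args, **kwargs)


# A toy "training loop" that only fires hooks.
history = LossHistory()
dispatcher = CallbackList([history])
for step, loss in enumerate([1.0, 0.5]):
    dispatcher.fire("on_step_begin", step)
    dispatcher.fire("on_step_end", step, loss)
```

Keeping the dispatcher separate from the callbacks means the loop never needs to know which concerns are attached to it.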
All checkpoint I/O lives here:
- DCP (PyTorch Distributed Checkpoint) save/load
- HuggingFace safetensors import/export
- Schema migration and resume-from-step logic
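To make "resume-from-step" concrete, the sketch below finds the newest checkpoint directory under a root. The `step_<N>` directory naming is an assumption made for this example, not the layout the checkpoint subpackage actually writes.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical naming scheme: one directory per saved step, e.g. "step_200".
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_checkpoint(root: Path) -> Optional[Tuple[int, Path]]:
    """Return (step, path) for the highest-numbered checkpoint, or None if there is none."""
    if not root.is_dir():
        return None
    best: Optional[Tuple[int, Path]] = None
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if match is None or not child.is_dir():
            continue  # ignore unrelated files and directories
        step = int(match.group(1))  # compare numerically, not lexicographically
        if best is None or step > best[0]:
            best = (step, child)
    return best


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("step_9", "step_200", "step_10", "notes"):
        (root / name).mkdir()
    found = latest_checkpoint(root)
    found_step = found[0] if found else None
    found_name = found[1].name if found else None
```

Parsing the step as an integer matters: a plain string sort would rank `step_9` above `step_200`.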
Planned — not yet present in this release; the subpackage will be created when multi-worker comms land.
Communication primitives between processes — point-to-point send/recv, broadcast helpers, and any RPC-style channels used between the controller and workers. Keep raw torch.distributed / NCCL calls out of business logic; route them through this layer.
Configuration system for training runs:
- configs/toml_config/: the user-facing pydantic schema for the structured TOML interface consumed by cosmos_framework.scripts.train --sft-toml=…. sft_config.py defines SFTExperimentConfig (with extra="forbid" on every sub-model). toml_config_helper.py handles VFM↔VLM path remaps and OmegaConf env interpolation.
- configs/base/: internal LazyConfig-based experiment SKUs. base/experiment/sft/*.py registers Cosmos3 SFT experiments (e.g. vision_sft_nano.py, vision_sft_super.py). base/vlm/ is the VLM-side analogue with its own config.py, experiment/, defaults/, and freeze_config.py.
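The practical effect of forbidding extra keys is that a typo in a recipe fails loudly instead of being silently ignored. The sketch below imitates that behavior with standard-library dataclasses so it runs without pydantic. The classes `ExperimentSection` and `OptimizerSection` and all of their fields are hypothetical stand-ins, not the real SFTExperimentConfig, and unlike pydantic this sketch does not validate value types.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class OptimizerSection:  # hypothetical sub-model
    lr: float = 1e-4
    weight_decay: float = 0.0


@dataclass
class ExperimentSection:  # hypothetical stand-in, not the real SFTExperimentConfig
    name: str = "demo"
    max_steps: int = 1000
    optimizer: OptimizerSection = field(default_factory=OptimizerSection)


def build_strict(cls, raw: dict, path: str = "<root>"):
    """Build dataclass `cls` from `raw`, rejecting keys the schema does not declare."""
    declared = {f.name: f for f in fields(cls)}
    unknown = sorted(set(raw) - set(declared))
    if unknown:
        raise ValueError(f"unknown key(s) {unknown} under {path}")
    kwargs = {}
    for key, value in raw.items():
        field_type = declared[key].type
        # Recurse into nested sections so strictness applies at every level.
        if is_dataclass(field_type) and isinstance(value, dict):
            value = build_strict(field_type, value, path=f"{path}.{key}")
        kwargs[key] = value
    return cls(**kwargs)


# This dict stands in for a parsed TOML recipe.
recipe = {"max_steps": 10, "optimizer": {"lr": 0.01}}
cfg = build_strict(ExperimentSection, recipe)
```

With this check in place, a recipe that spells the key `learning_rate` instead of `lr` raises a `ValueError` naming the offending key and its section.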
See docs/sft_config.md for the full field-by-field TOML reference.
Planned — not yet present in this release; the subpackage will be created when multi-worker orchestration lands.
The orchestrator for a multi-worker job. The controller drives the training loop, hands batches to rollout/reward workers, collects gradients, and decides when to checkpoint or evaluate. Think "head node logic" — there is one controller per job.
Datasets, samplers, collators, augmentations, and data-side parallelism (e.g. sequence packing, multi-aspect batching). New dataset formats and new augmentations both live here.
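Sequence packing, mentioned above, can be illustrated with a first-fit-decreasing bin packer. This is a generic sketch of the technique, not the packing strategy the data subpackage implements; the function name and signature are assumptions.

```python
from typing import List, Sequence


def pack_sequences(lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Group sequence indices into packs whose total length is at most `max_len`.

    Uses first-fit decreasing: visit sequences from longest to shortest and
    place each one into the first pack that still has room.
    """
    packs: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each pack
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} has length {n} > max_len {max_len}")
        for p, capacity in enumerate(free):
            if n <= capacity:
                packs[p].append(idx)
                free[p] -= n
                break
        else:
            # No existing pack had room, so open a new one.
            packs.append([idx])
            free.append(max_len - n)
    return packs


lengths = [5, 3, 4, 2, 2]
packs = pack_sequences(lengths, max_len=8)
```

Each inner list holds the indices of sequences that would be concatenated into one training sample, which reduces the padding that per-sequence batching would otherwise need.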
Planned — not yet present in this release; the subpackage will be created when the offline eval harness lands.
Evaluation harnesses run against trained checkpoints — metrics, dataset-driven eval loops, and reporting. Distinct from inference/: evaluation is offline and metric-oriented.
The full inference subpackage:
- args.py: sampling/setup args (SamplingArgs, SamplingOverrides, OmniSetupArgs, OmniSetupOverrides, OmniSampleOverrides), plus the modality-defaults loader and the _RESOLUTION_SHIFT_DEFAULTS table.
- model.py, inference.py: model + inference engine entry points used by cosmos_framework/scripts/inference.py.
- common/: shared helpers for args, init, config, checkpoints (used by both training and inference scripts).
- defaults/<mode>/sample_args.json: per-modality default sample arguments (text2image, text2video, image2video, image2image, video2video, forward_dynamics, inverse_dynamics, policy, reasoner) plus prompt_upsampler.txt and video_captioner.txt system prompts.
- ray/: Ray Serve / Submit / Gradio entry points (cosmos_framework.inference.ray.serve, cosmos_framework.inference.ray.submit, cosmos_framework.inference.ray.gradio) and their YAML configs under ray/configs/.
- configs/{checkpoint,model}/: per-checkpoint and per-model inference configs.
- Modality entry points: vision.py, action.py, sound.py, transfer.py, interactive.py, prompt_upsampling.py, dataset.py.
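The relationship between per-modality defaults and user overrides can be sketched as a simple merge. The function `load_sample_args` and the field names `num_steps` and `guidance` are hypothetical; the real loader in args.py and the real contents of sample_args.json may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def load_sample_args(defaults_root: Path, mode: str,
                     overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Read <defaults_root>/<mode>/sample_args.json and apply non-None overrides on top."""
    path = defaults_root / mode / "sample_args.json"
    args = json.loads(path.read_text())
    for key, value in (overrides or {}).items():
        if value is not None:  # None means "keep the modality default"
            args[key] = value
    return args


# Demo with a throwaway defaults tree; the field names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "text2video").mkdir()
    (root / "text2video" / "sample_args.json").write_text(
        json.dumps({"num_steps": 35, "guidance": 7.0})
    )
    merged = load_sample_args(root, "text2video", {"num_steps": 20, "guidance": None})
```

Treating `None` as "not set" lets a CLI pass every optional flag through unchanged while only the flags the user actually supplied replace a default.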
Training-side experiment SKUs live separately at cosmos_framework/configs/base/experiment/sft/*.py (see cosmos_framework/configs/) — not under inference/.
Library-style backend shims that adapt Cosmos3 checkpoints to the Diffusers / Transformers / vLLM ecosystems live separately under packages/{diffusers,transformers,vllm}-cosmos3/.
Planned — not yet present in this release; the subpackage will be created when launcher back-ends land.
Job launching back-ends: Slurm, torchrun, and Kubernetes adapters. Selects the launch path based on the environment and forwards process rank/world-size to the controller.
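Selecting a launch path from the environment might look like the sketch below. It relies on the `RANK` and `WORLD_SIZE` variables that torchrun exports and the `SLURM_PROCID` and `SLURM_NTASKS` variables that Slurm sets for each task; the `RankInfo` type and `detect_launch_env` function are hypothetical, and Kubernetes detection is left out because it depends on how the job is deployed.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RankInfo(NamedTuple):
    backend: str
    rank: int
    world_size: int


def detect_launch_env(env: Optional[Mapping[str, str]] = None) -> RankInfo:
    """Infer the launcher and this process's rank/world size from environment variables."""
    if env is None:
        env = os.environ
    # torchrun exports RANK and WORLD_SIZE for every worker it spawns.
    # It is checked first because torchrun is often started inside a Slurm job.
    if "RANK" in env and "WORLD_SIZE" in env:
        return RankInfo("torchrun", int(env["RANK"]), int(env["WORLD_SIZE"]))
    # Slurm sets SLURM_PROCID and SLURM_NTASKS for each task of a job step.
    if "SLURM_PROCID" in env and "SLURM_NTASKS" in env:
        return RankInfo("slurm", int(env["SLURM_PROCID"]), int(env["SLURM_NTASKS"]))
    # Neither launcher detected: assume a single local process.
    return RankInfo("single", 0, 1)


info = detect_launch_env({"SLURM_PROCID": "1", "SLURM_NTASKS": "4"})
```

Passing the environment in as a mapping keeps the function testable without mutating `os.environ`.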
Model architectures and the parallelism wrappers around them (FSDP, tensor parallel, context parallel, pipeline parallel). The trainer is model-agnostic; everything the trainer touches goes through this layer.
CLI entry points surfaced from the package (as opposed to standalone scripts in the top-level tools/). Use this for utilities that need to import cosmos_framework.* internals.
The training loop itself — gradient accumulation, optimizer step, scheduler step, mixed-precision policy, and the dispatcher that fires callbacks. Stays narrow on purpose: model details live in model/, algorithm details in algorithm/.
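Gradient accumulation, one of the loop's responsibilities, can be shown on a one-parameter toy problem in pure Python. This is a sketch of the general technique rather than the trainer's code; the model (`y = w * x`), the function names, and the learning rate are all assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def grad_mse(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_one_pass(data: List[Sample], micro_batch_size: int, accum_steps: int,
                   lr: float, w: float = 0.0) -> Tuple[float, int]:
    """Run one pass over `data`, stepping the optimizer every `accum_steps` micro-batches."""
    accumulated = 0.0
    optimizer_steps = 0
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    for i, micro_batch in enumerate(micro_batches, start=1):
        # Scale each micro-batch gradient so the sum matches the large-batch mean.
        accumulated += grad_mse(w, micro_batch) / accum_steps
        if i % accum_steps == 0:
            w -= lr * accumulated  # plain SGD step
            accumulated = 0.0
            optimizer_steps += 1
    return w, optimizer_steps


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_accum, steps_accum = train_one_pass(data, micro_batch_size=2, accum_steps=2, lr=0.01)
w_full, steps_full = train_one_pass(data, micro_batch_size=4, accum_steps=1, lr=0.01)
```

With equally sized micro-batches, two accumulated micro-batches of two samples produce the same single update as one batch of four, which is the property that lets a run trade memory for wall-clock time.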
Shared low-level helpers (logging, config loading, distributed setup, profiling). Keep this folder thin — anything substantial should grow into its own subpackage.
Planned — not yet present in this release; the subpackage will be created when RL worker roles land.
Specialized worker roles for distributed RL jobs. Each worker is a long-running process the controller talks to:
- reference/: frozen reference policy (for KL anchoring in PPO/GRPO/DPO).
- reward/: reward-model worker; computes scalar rewards for rollouts.
- rollout/: on-policy generation worker; samples trajectories from the current policy.
- simulations/: simulator-backed environment worker (used when reward comes from a sim rather than a learned model).
Add new worker types as sibling subpackages — each owns its own startup, message loop, and shutdown.
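The startup, message loop, and shutdown that each worker owns can be sketched with a thread and two queues. This is only an illustration of the lifecycle: real workers would be separate long-running processes, and the class `ToyRewardWorker`, its length-based "reward", and the sentinel-based shutdown are all assumptions made for the example.

```python
import queue
import threading

_SHUTDOWN = object()  # sentinel message that ends the loop


class ToyRewardWorker:
    """Hypothetical worker: a thread stands in for a long-running process."""

    def __init__(self) -> None:
        self.inbox: queue.Queue = queue.Queue()
        self.outbox: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        """Startup: begin serving messages."""
        self._thread.start()

    def _loop(self) -> None:
        """Message loop: handle requests until the shutdown sentinel arrives."""
        while True:
            message = self.inbox.get()
            if message is _SHUTDOWN:
                break
            self.outbox.put(self.handle(message))

    def handle(self, rollout: str) -> float:
        # Placeholder reward: the length of the rollout text.
        return float(len(rollout))

    def shutdown(self) -> None:
        """Shutdown: ask the loop to exit, then wait for it."""
        self.inbox.put(_SHUTDOWN)
        self._thread.join()

    def is_running(self) -> bool:
        return self._thread.is_alive()


worker = ToyRewardWorker()
worker.start()
worker.inbox.put("abc")                 # the controller sends a rollout
reward = worker.outbox.get(timeout=5)   # and receives a scalar reward
worker.shutdown()
```

A new worker type following this shape would only need to override `handle`, while the controller keeps talking to the same inbox and outbox.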
- tests/: pytest tests, mirroring the cosmos_framework/ package layout.
- examples/: runnable end-to-end examples; see examples/README.md.
- docker/: Dockerfiles and image build helpers; see docker/README.md.
- cosmos_framework/scripts/: CLI entry-point scripts (train.py, inference.py, eval.py, export_model.py, …); invoke as python -m cosmos_framework.scripts.<name>. Primary training entry point: cosmos_framework.scripts.train, driven by a structured, pydantic-validated TOML interface (--sft-toml=<recipe-toml>); recipe TOMLs live under examples/toml/sft_config/ and the schema is defined in cosmos_framework/configs/toml_config/sft_config.py; see examples/README.md and docs/training.md.
- packages/: library-style backend shims: packages/{diffusers,transformers,vllm}-cosmos3/, each installable independently.
| You want to add… | Put it in… |
|---|---|
| A new loss function | cosmos_framework/algorithm/loss/ |
| A new RL update rule | cosmos_framework/algorithm/rl/ |
| A new reward function or head | cosmos_framework/algorithm/reward/ |
| A new model architecture | cosmos_framework/model/ |
| A new dataset format / augmentation | cosmos_framework/data/ |
| A new training callback | cosmos_framework/callbacks/ |
| A new checkpoint format or converter | cosmos_framework/checkpoint/ |
| A new launcher back-end (Slurm flavor, k8s) | cosmos_framework/launcher/ |
| A new RL worker role | cosmos_framework/workers/<new_role>/ |
| A new evaluation suite | cosmos_framework/evaluation/ |
| A new runnable example | examples/ |
| A new standalone CLI tool | tools/ (repo root) if it does not import cosmos_framework internals; otherwise cosmos_framework/tools/ |
| A new test | tests/ mirroring the package path |