Skill:
.agents/skills/cosmos3-setup/SKILL.md
- NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer — Hopper (H100) or Blackwell (B200) recommended for full training throughput
- NVIDIA driver compatible with your CUDA version
- NVIDIA CUDA >=12.8
- Linux x86-64/aarch64
- glibc >=2.35 (e.g. Ubuntu >=22.04)
- Python >=3.10
- Multi-node training additionally requires a working NCCL setup (IB/RoCE recommended) and a shared filesystem visible to all ranks for checkpoint I/O
- Free disk: ~150 GiB recommended for a first-run inference or training workflow (Hugging Face cache ~90 GiB, uv cache ~20 GiB, run outputs ~30 GiB). See FAQ → Expected disk footprint for the breakdown and how to relocate caches.
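The version minimums above can be checked before installing anything. The following is a minimal preflight sketch, not part of the repository. The `version_ge` and `check` helpers are hypothetical names, and the script assumes GNU `sort -V` and that `nvidia-smi` prints the driver's highest supported CUDA version in its header.

```shell
#!/usr/bin/env sh
# Preflight sketch (hypothetical helpers, not part of the repository).

# version_ge A B: succeeds when version A >= version B (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check NAME FOUND MINIMUM: print a one-line verdict for one requirement.
check() {
    if [ -n "$2" ] && version_ge "$2" "$3"; then
        echo "ok:   $1 $2 (>= $3)"
    else
        echo "FAIL: $1 ${2:-not found} (need >= $3)"
    fi
}

# Detected versions; each stays empty if the tool is missing.
glibc_ver=$(ldd --version 2>/dev/null | head -n 1 | grep -oE '[0-9]+\.[0-9]+$' || true)
python_ver=$(python3 --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+' | head -n 1 || true)
cuda_ver=$(nvidia-smi 2>/dev/null | grep -oE 'CUDA Version: [0-9]+\.[0-9]+' | grep -oE '[0-9]+\.[0-9]+' || true)

check glibc "$glibc_ver" 2.35
check python "$python_ver" 3.10
check "driver CUDA" "$cuda_ver" 12.8
```

A `FAIL` line only means the detected value is missing or below the listed minimum. It does not check GPU architecture, NCCL, or disk space.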
Recommended Base Image
For CUDA 13 builds, the NVIDIA NGC PyTorch container is the recommended starting point — it bundles PyTorch + CUDA 13 + cuDNN + NCCL tuned for NVIDIA hardware, plus Apex, TransformerEngine, and Megatron utilities that training infra users commonly need.
FROM nvcr.io/nvidia/pytorch:25.09-py3
For CUDA 12.8 builds, pin to an earlier NGC tag (e.g. nvcr.io/nvidia/pytorch:25.06-py3) that still ships CUDA 12.
If you encounter issues, see Troubleshooting.
Clone the repository:
git clone git@github.com:NVIDIA/cosmos-framework.git
cd cosmos-framework
The two supported install paths are the recommended base image and the Docker container. For other paths (standalone venv, custom torch/cuda) see Advanced.
Quickstart: From the Recommended Base Image
If you started from the recommended base image (nvcr.io/nvidia/pytorch:25.09-py3), the following commands set up the full environment in one go. Run them from the root of this repository (i.e. inside the cosmos-framework/ directory you just cloned):
apt-get update
apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# CUDA 13.0 (recommended); for CUDA 12.8 use `--group=cu128-train`
uv sync --all-extras --group=cu130-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
Docker Container
Please make sure you have access to Docker on your machine and the NVIDIA Container Toolkit is installed.
Build the container:
image_tag=$(docker build -q .)
Run the container:
docker run -it --runtime=nvidia --ipc=host --rm \
-v .:/workspace -v /workspace/.venv \
-v /root/.cache:/root/.cache \
-e HF_TOKEN="$HF_TOKEN" \
$image_tag
For multi-node training, also bind-mount your shared dataset and checkpoint directories so all ranks see the same filesystem.
Optional arguments:
- --ipc=host: Use the host system's shared memory, since parallel torchrun consumes a large amount of shared memory. If not allowed by security policy, increase --shm-size (documentation).
- -v /root/.cache:/root/.cache: Mount the host cache to avoid re-downloading cache entries.
If you get docker: Error response from daemon: unknown or invalid runtime name: nvidia, you need to configure docker:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
See docker/README.md for additional images and build options.
Advanced
Use these paths only when neither the recommended base image nor the Docker container is viable for your environment.
Virtual Environment
Install system dependencies:
sudo apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
Install the package using one of the following methods:
UV Sync: fully reproducible environment
Choose the dependency group that matches your CUDA toolkit (see CUDA Variants):
# CUDA 13.0 (recommended)
uv sync --all-extras --group=cu130-train
# Or, for CUDA 12.8:
# uv sync --all-extras --group=cu128-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
UV Pip: virtual environment
# Create virtual environment (skip if using an existing environment)
uv venv --clear && source .venv/bin/activate && export LD_LIBRARY_PATH=
uv pip install -r pyproject.toml --all-extras --group=cu130-train
uv pip install -e .
UV Pip: system environment
uv pip install --system --break-system-packages -r pyproject.toml --all-extras --group=cu130-train
Custom torch/cuda versions
cuda_name=cu130
torch_name=torch210
# 1. Create and activate the virtual environment
uv venv --clear && source .venv/bin/activate
# 2. Install the desired torch/cuda versions
uv pip install "torch==2.10.0" "torchvision" --torch-backend=$cuda_name
# 3. Install the package with desired extras
uv pip install -r pyproject.toml --all-extras --group=cu130-train
# 4. Install one of the following attention backends:
# * Blackwell
uv pip install "natten==0.21.6.dev6+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/natten
# * Hopper
uv pip install "flash-attn-3-nv==1.0.3+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn-3-nv
# * Ada/Ampere
uv pip install "flash-attn==2.7.4.post1+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn
If there is no attention backend wheel for your torch/cuda versions, you can build one using cosmos-dependencies.
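The choice of attention backend in step 4 follows the GPU generation. As a rough sketch, the mapping can be keyed on CUDA compute capability. The helper names `attention_backend` and `torch_tag` are hypothetical. The capability thresholds (10.x and above for Blackwell, 9.x for Hopper, 8.x for Ada/Ampere) are assumptions based on NVIDIA's published compute capabilities, and the `torch210` naming is inferred from the single example in this section.

```shell
#!/usr/bin/env sh
# Sketch only: hypothetical helpers, not part of the repository.

# attention_backend CC: map a compute capability such as "9.0" to a backend package.
attention_backend() {
    major=${1%%.*}
    case "$major" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$major" -ge 10 ]; then
        echo "natten"            # Blackwell (assumed: compute capability 10.x+)
    elif [ "$major" -eq 9 ]; then
        echo "flash-attn-3-nv"   # Hopper
    elif [ "$major" -eq 8 ]; then
        echo "flash-attn"        # Ada/Ampere
    else
        echo "unsupported"; return 1
    fi
}

# torch_tag VERSION: "2.10.0" -> "torch210" (major and minor digits concatenated).
torch_tag() {
    echo "$1" | awk -F. '{ print "torch" $1 $2 }'
}

# Query the first GPU; compute_cap stays empty when nvidia-smi is unavailable.
compute_cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1 || true)
echo "backend: $(attention_backend "$compute_cap" || true)"
echo "torch tag for 2.10.0: $(torch_tag 2.10.0)"
```

The printed backend is only a hint for which of the three install lines to run. Whether a wheel exists for a given torch/cuda pair still has to be checked against the wheel index.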
Optional package extras:
- train: Training infrastructure (FSDP, parallelism, checkpointing, datasets)
- eval: Evaluation harnesses for trained checkpoints
This repository is training-focused, so the *-train dependency groups are the supported install path. Inference-only groups exist for evaluating trained checkpoints in-tree but are not required for training.
| CUDA Version | Training (recommended) | Notes |
|---|---|---|
| CUDA 13.0 (recommended) | --group=cu130-train | NVIDIA Driver |
| CUDA 12.8 | --group=cu128-train | NVIDIA Driver |
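To make the table concrete, here is a small sketch that picks a dependency group from a CUDA version string. `cuda_group` is a hypothetical helper. It assumes the version passed in is the one the driver reports (for example in the `nvidia-smi` header) and treats it as an upper bound on what the installed wheels can use.

```shell
#!/usr/bin/env sh
# Sketch only: `cuda_group` is a hypothetical helper, not part of the repository.

# cuda_group VERSION: print the uv dependency group for a driver CUDA version.
cuda_group() {
    case "$1" in
        *.*) major=${1%%.*}; rest=${1#*.}; minor=${rest%%.*} ;;
        *)   major=$1; minor=0 ;;
    esac
    # Reject anything that is not plain digits on both sides of the dot.
    case "$major.$minor" in
        *[!0-9.]*|.*|*.) echo "unknown CUDA version: '$1'" >&2; return 1 ;;
    esac
    if [ "$major" -ge 13 ]; then
        echo "cu130-train"
    elif [ "$major" -eq 12 ] && [ "$minor" -ge 8 ]; then
        echo "cu128-train"
    else
        echo "CUDA $1 is older than 12.8" >&2; return 1
    fi
}

# Example: print the matching sync command for a driver that reports CUDA 12.8.
echo "uv sync --all-extras --group=$(cuda_group 12.8)"
```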
Export the following before downloading checkpoints or launching training. See environment_variables.md for the full reference.
| Variable | Purpose |
|---|---|
| HF_TOKEN | Hugging Face access token for gated model/dataset downloads. Alternative to uvx hf auth login. |
| HF_HOME | Cache directory for Hugging Face models and datasets. Recommend ≥ 1 TB free. |
| IMAGINAIRE_OUTPUT_ROOT | Output root for training DCP checkpoints and logs. Recommend ≥ 1 TB free. |
| UV_CACHE_DIR | Cache directory for uv-managed dependencies. |
| LD_LIBRARY_PATH= | Clear (set to empty) after sourcing the venv to avoid host library bleed-through into PyTorch imports. |
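As an illustration of the table above, a shell profile snippet might look like the following. `COSMOS_DATA_ROOT` and the directory names under it are hypothetical; point them at whichever volume has the free space.

```shell
#!/usr/bin/env sh
# Sketch only: COSMOS_DATA_ROOT and the subdirectory names are hypothetical.
COSMOS_DATA_ROOT="${COSMOS_DATA_ROOT:-$HOME/cosmos-data}"

export HF_HOME="$COSMOS_DATA_ROOT/huggingface"          # Hugging Face cache
export IMAGINAIRE_OUTPUT_ROOT="$COSMOS_DATA_ROOT/runs"  # checkpoints and logs
export UV_CACHE_DIR="$COSMOS_DATA_ROOT/uv-cache"        # uv dependency cache

# HF_TOKEN is deliberately not set here: keep it out of files you might commit.
# Show what was exported so the paths can be reviewed.
env | grep -E '^(HF_HOME|IMAGINAIRE_OUTPUT_ROOT|UV_CACHE_DIR)='
```

Keeping all three paths under one root makes it easy to move the whole footprint to a larger disk later by changing a single variable.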
Training in this repo typically starts from a pretrained base checkpoint that you fine-tune or post-train. The recommended source is the Hugging Face Hub.
- Get a Hugging Face Access Token with Read permission.
- Authenticate using either mechanism (they are equivalent; pick one, and do not set both with different tokens):
  - HF_TOKEN environment variable: preferred for Docker and non-interactive shells. Export it once and any huggingface_hub call (CLI or library) picks it up.
  - uvx hf auth login: preferred for local interactive use. Writes the token to ~/.cache/huggingface/token, persisted across sessions (and across Docker runs if you bind-mount /root/.cache).
- Accept the license for any gated model you intend to use (e.g. the NVIDIA Open Model License Agreement where applicable).
- Test access:
uvx hf@latest download --repo-type model nvidia/Cosmos-Guardrail1 \
  --revision d6d4bfa899a71454a700907664f3e88f503950cf --include "README.md"
If you encounter issues:
- Check that you don't have conflicting environment variables, e.g. an HF_TOKEN set to a different token than the one cached by hf auth login: printenv | grep HF_.
- Check that your token has sufficient permissions.
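The first check above can be scripted. This sketch compares `HF_TOKEN` with the token file written by `hf auth login` and reports whether they disagree, without printing either token. `hf_token_status` is a hypothetical helper, and the assumption that the token file moves to `$HF_HOME/token` when `HF_HOME` is set comes from `huggingface_hub` defaults rather than from this document.

```shell
#!/usr/bin/env sh
# Sketch only: `hf_token_status` is a hypothetical helper.

# hf_token_status ENV_TOKEN TOKEN_FILE: classify the two credential sources.
hf_token_status() {
    if [ -n "$1" ] && [ -f "$2" ]; then
        if [ "$1" = "$(cat "$2")" ]; then echo "match"; else echo "conflict"; fi
    elif [ -n "$1" ]; then
        echo "env-only"
    elif [ -f "$2" ]; then
        echo "file-only"
    else
        echo "none"
    fi
}

token_file="${HF_HOME:-$HOME/.cache/huggingface}/token"
echo "HF token sources: $(hf_token_status "${HF_TOKEN:-}" "$token_file")"
```

A result of `conflict` is the case described above: unset one of the two sources so that only a single token is in play.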
Checkpoints are downloaded on demand during training and evaluation. To change the cache location, set HF_HOME. See training.md for DCP conversion and Hugging Face safetensors export.
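Because checkpoints land in the cache on first use, it can help to confirm free space before the first run. The following is a minimal sketch, assuming a POSIX `df`. `free_gib` is a hypothetical helper, and the 150 GiB threshold is the first-run figure from the requirements list.

```shell
#!/usr/bin/env sh
# Sketch only: `free_gib` is a hypothetical helper.

# free_gib DIR: print whole GiB available on the filesystem holding DIR.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Walk up to the nearest existing parent if the cache has not been created yet.
cache_dir="${HF_HOME:-$HOME/.cache/huggingface}"
while [ ! -d "$cache_dir" ] && [ "$cache_dir" != "/" ]; do
    cache_dir=$(dirname "$cache_dir")
done

avail=$(free_gib "$cache_dir")
avail=${avail:-0}
if [ "$avail" -lt 150 ]; then
    echo "warning: only ${avail} GiB free under $cache_dir (about 150 GiB suggested)"
else
    echo "ok: ${avail} GiB free under $cache_dir"
fi
```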
Errors:
ImportError: cannot import name '_functionalization' from 'torch._C'
Clear the library path in your current shell:
export LD_LIBRARY_PATH=
This applies to the current session only. To persist, add the line to your Dockerfile or ~/.bashrc.
If this doesn't fix the issue, try reinstalling venv.
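To persist the `LD_LIBRARY_PATH` fix without appending a duplicate line on every run, one option is an idempotent append. This is a sketch: `append_once` is a hypothetical helper, and the demonstration writes to a scratch file rather than to `~/.bashrc`.

```shell
#!/usr/bin/env sh
# Sketch only: `append_once` is a hypothetical helper.

# append_once LINE FILE: add LINE to FILE unless an identical line is already there.
append_once() {
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Demonstrate on a scratch file; substitute "$HOME/.bashrc" to apply it for real.
rc_file=$(mktemp)
append_once 'export LD_LIBRARY_PATH=' "$rc_file"
append_once 'export LD_LIBRARY_PATH=' "$rc_file"   # second call is a no-op
cat "$rc_file"
```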
Errors:
ModuleNotFoundError: No module named <module_name>
Reinstall venv:
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate && export LD_LIBRARY_PATH=
If this doesn't fix the issue, try reinstalling uv.
Errors:
fatal error: Python.h: No such file or directory
Reinstall uv and venv:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install --reinstall
rm -rf .venv
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate && export LD_LIBRARY_PATH=
Errors:
OSError: <lib_name>: cannot open shared object file: No such file or directory
Ensure you have CUDA installed. The CUDA major version on the system must match the one in the virtual environment.
sudo apt-get install -y --no-install-recommends cuda-toolkit-<cuda_major_version>
Alternatively, use the Docker container.