108 changes: 108 additions & 0 deletions README.md
@@ -25,6 +25,7 @@
- [Quickstart](#quickstart)
- [Generator with Diffusers](#generator-with-diffusers)
- [Generator with vLLM-Omni](#generator-with-vllm-omni)
- [Generator with SGLang](#generator-with-sglang)
- [Reasoner with Transformers](#reasoner-with-transformers)
- [Reasoner with vLLM](#reasoner-with-vllm)
- [Troubleshooting](#troubleshooting)
@@ -413,6 +414,113 @@ References:

</details>

#### Generator with SGLang

<details>
<summary>Expand SGLang generator setup, endpoints, and request reference</summary>

Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. Cosmos 3 also includes video-with-sound and action/policy models; this SGLang section focuses on the currently supported text-to-image, text-to-video, and image-to-video generator serving paths.

Supported checkpoints:

| Model | Status | Notes |
| --- | --- | --- |
| `nvidia/Cosmos3-Nano` | Supported | Text-to-image, text-to-video, image-to-video |
| `nvidia/Cosmos3-Super` | Supported | Use multiple GPUs for the 64B checkpoint |
| `nvidia/Cosmos3-Super-Text2Image` | Supported | Text-to-image specialized checkpoint |
| `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint |
| `nvidia/Cosmos3-Nano-Policy-DROID` | Not supported yet | Action/policy checkpoint |

> **Review comment (Collaborator), on the `nvidia/Cosmos3-Nano` row:** Probably good to specify we support other modalities such as sound and action.
>
> **Reply (Author):** Updated the wording to mention that Cosmos 3 includes video-with-sound and action/policy models, while keeping this SGLang section scoped to the currently supported T2I/T2V/I2V generator serving paths.

Install SGLang from the main branch with diffusion extras:

```shell
git clone --branch main https://github.com/sgl-project/sglang.git
cd sglang
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e "python[diffusion]"
pip install "cosmos-guardrail==0.3.1"
```

> **Review comment, on `pip install -e "python[diffusion]"`:** Can we make a tag or stable release of the sglang repo and pin it here? This command will always download top-of-tree SGLang, which is not what we want as part of the README.
>
> **Reply (Author, @mickqian, Jun 2, 2026):** Good point. I added an optional checkout step plus a version note. The default keeps tracking upstream SGLang to pick up ongoing Cosmos 3 fixes and performance improvements, while production or reproducible deployments should pin a release tag or known-good commit before install.
>
> **Review comment (Collaborator):** Probably best to support uv or venv.
>
> **Reply (Author):** Added a venv setup before the editable SGLang install.
>
> atharvajoshi10 marked this conversation as resolved.

> **Version note:** Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
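
For reproducible deployments, the clone can be pinned to a known-good ref before running the install steps. The following is a minimal sketch, not part of the README under review: `checkout_sglang_ref` is a hypothetical helper, and the ref you pass is your own choice of release tag or commit (none is recommended here).

```shell
# Sketch: check out a pinned ref inside an existing SGLang clone.
# checkout_sglang_ref is a hypothetical helper, not an SGLang command.
checkout_sglang_ref() {
  repo_dir="$1"   # path to the cloned repository
  ref="$2"        # release tag or commit hash chosen by the operator

  # Detach HEAD at the requested ref so later pulls do not move it.
  git -C "$repo_dir" checkout -q --detach "$ref" || return 1

  # Print the resolved commit so it can be recorded alongside the deployment.
  git -C "$repo_dir" rev-parse HEAD
}

# Usage (run from the directory that contains the clone):
#   checkout_sglang_ref ./sglang "<tag-or-commit>"
```

Recording the printed commit hash makes it possible to rebuild the same environment later.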

Start a Nano server:

```shell
sglang serve --model-path nvidia/Cosmos3-Nano
```

For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs:

```shell
sglang serve \
  --model-path nvidia/Cosmos3-Super-Image2Video \
  --num-gpus 4
```

> **Review comment:** If I'm not mistaken,
>
>     sglang serve \
>       --model-path nvidia/Cosmos3-Super-Image2Video \
>       --num-gpus 4
>
> is equivalent to CFG parallelism plus Ulysses degree 2, i.e.
>
>     sglang serve \
>       --model-path nvidia/Cosmos3-Super-Image2Video \
>       --num-gpus 4 --enable-cfg-parallel --ulysses-degree 2
>
> That is indeed the preferred way to serve multi-GPU inference, but only if the model fits on a single GPU (>80GB). It is the best setup for performance, but it does not reduce memory requirements.
>
> A safer option would be to use FSDP as the example for the Cosmos3-Super checkpoint, since that setup does reduce the memory requirement by sharding the weights across GPUs, i.e.:
>
>     sglang serve \
>       --model-path nvidia/Cosmos3-Super-Image2Video \
>       --num-gpus 4 --use-fsdp-inference
>
> **Reply (Author):** If we are looking for memory-friendly setups, yes, we could do better; either FSDP or offloading would do.

Vision endpoints:

| Mode | Endpoint | Notes |
| --- | --- | --- |
| Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 |
| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
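
The image-to-video row can be exercised with a multipart request that attaches the conditioning image. This is a minimal sketch under stated assumptions: the server listens on `http://localhost:30000` as in this section's other examples, the response is JSON with an `id` field as in the text-to-video flow, and `submit_i2v` plus the file name `robot.png` are hypothetical.

```shell
# Sketch: submit an image-to-video job and print its job ID.
# submit_i2v is a hypothetical helper; only the endpoint and the
# input_reference field name come from the endpoint table.
submit_i2v() {
  image_path="$1"                          # local conditioning image
  prompt="$2"                              # text prompt for the video
  base_url="${3:-http://localhost:30000}"  # assumed server address

  # --form-string sends the prompt literally; -F with "@" uploads the file.
  curl -sS -X POST "${base_url}/v1/videos" \
    --form-string "prompt=${prompt}" \
    -F "input_reference=@${image_path}" \
    | jq -r .id
}

# Usage (requires a running server and a real image file):
#   job_id=$(submit_i2v ./robot.png "The robot lifts the blue box onto a shelf.")
```

The returned ID can then be polled and downloaded the same way as a text-to-video job.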

Text-to-video example:

```shell
# Submit an async video generation job and capture its ID.
job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=4.0" \
  --form-string "flow_shift=10.0" \
  --form-string "seed=42" \
  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
  | jq -r .id)

# Poll until the job completes. Cosmos 3 video generation can take several minutes.
status=""
until [ "$status" = "completed" ]; do
  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status)
  [ "$status" = "failed" ] && exit 1
  sleep 5
done

# Download the completed MP4.
curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
  -o cosmos3_t2v_output.mp4
```

> **Review comment (Collaborator), on lines +474 to +497:** Can we add comments here to improve readability?
>
> **Reply (Author):** Added comments for the submit, poll, and download steps in the video example.
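
The polling loop in the text-to-video example runs until the job reports `completed` or `failed`, so it would wait indefinitely if a job stalled in some other state. A bounded variant might look like the sketch below. `fetch_video_status` and `wait_for_video` are hypothetical helpers, `BASE_URL` is an assumed variable, and the default of 120 attempts at 5 seconds each is an arbitrary choice rather than a documented limit.

```shell
# Hypothetical helper: print the current status string of a video job.
fetch_video_status() {
  curl -sS "${BASE_URL:-http://localhost:30000}/v1/videos/$1" | jq -r .status
}

# Wait for a job to finish.
# Returns 0 on "completed", 1 on "failed", 2 if the attempt budget runs out.
wait_for_video() {
  job_id="$1"
  max_attempts="${2:-120}"   # assumed default: 120 attempts
  delay="${3:-5}"            # assumed default: 5 seconds between attempts
  attempt=0

  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(fetch_video_status "$job_id")
    case "$status" in
      completed) return 0 ;;
      failed) echo "job ${job_id} failed" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay"
  done

  echo "timed out waiting for job ${job_id}" >&2
  return 2
}

# Usage:
#   wait_for_video "$job_id" && echo "ready to download"
```

Distinct return codes let a calling script tell a failed generation apart from a timeout.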

Text-to-image example:

```shell
curl -sS -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A warehouse robot folds a blue cloth on a clean workbench.",
    "size": "1280x720",
    "n": 1,
    "num_inference_steps": 35,
    "guidance_scale": 6.0,
    "flow_shift": 10.0,
    "seed": 0,
    "extra_args": {
      "use_resolution_template": false,
      "guardrails": true
    }
  }'
```
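
Because the image endpoint returns base64 by default, the response needs a decode step before it is viewable. The sketch below assumes the response follows the OpenAI Images layout, with the payload under `data[0].b64_json`; that field path is an assumption and is not confirmed by this section. `save_first_image` and the file names are hypothetical.

```shell
# Sketch: decode the first image in a saved JSON response to a file.
# Assumes the OpenAI Images response shape: {"data":[{"b64_json":"..."}]}.
save_first_image() {
  response_json="$1"   # file holding the JSON response body
  output_path="$2"     # where to write the decoded image bytes

  # jq -e exits non-zero if the field is missing or null.
  b64=$(jq -er '.data[0].b64_json' "$response_json") || return 1
  printf '%s' "$b64" | base64 --decode > "$output_path"
}

# Usage: redirect the curl output from the request above to response.json, then
#   save_first_image response.json cosmos3_t2i_output.png
```

The `.png` extension in the usage line is a guess; check the actual image format before relying on it.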

SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models.

</details>

#### Reasoner with Transformers
Coming soon!
