From df000367811fd5d6202fc182dc6f0818e5f070b2 Mon Sep 17 00:00:00 2001
From: Mick
Date: Mon, 1 Jun 2026 17:54:14 +0800
Subject: [PATCH 1/8] Add SGLang Cosmos3 serving docs

---
 README.md | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/README.md b/README.md
index ed3571b..7cc0083 100644
--- a/README.md
+++ b/README.md
@@ -25,6 +25,7 @@
 - [Quickstart](#quickstart)
   - [Generator with Diffusers](#generator-with-diffusers)
   - [Generator with vLLM-Omni](#generator-with-vllm-omni)
+  - [Generator with SGLang](#generator-with-sglang)
   - [Reasoner with Transformers](#reasoner-with-transformers)
   - [Reasoner with vLLM](#reasoner-with-vllm)
 - [Troubleshooting](#troubleshooting)
@@ -413,6 +414,104 @@ References:
+#### Generator with SGLang
+
+
+Expand SGLang generator setup, endpoints, and request reference
+
+Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. SGLang currently supports text-to-image, text-to-video, and image-to-video for the generator checkpoints; video-to-video, video-with-sound, and action modes are planned separately.
+
+Supported checkpoints:
+
+| Model | Status | Notes |
+| --- | --- | --- |
+| `nvidia/Cosmos3-Nano` | Supported | Text-to-image, text-to-video, image-to-video |
+| `nvidia/Cosmos3-Super` | Supported | Use multiple GPUs for the 64B checkpoint |
+| `nvidia/Cosmos3-Super-Text2Image` | Supported | Text-to-image specialized checkpoint |
+| `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint |
+| `nvidia/Cosmos3-Nano-Policy-DROID` | Not supported yet | Action/policy checkpoint |
+
+Install SGLang with diffusion extras:
+
+```shell
+pip install -e "python[diffusion]"
+pip install "cosmos-guardrail==0.3.1"
+```
+
+Start a Nano server:
+
+```shell
+sglang serve \
+  --model-type diffusion \
+  --model-path nvidia/Cosmos3-Nano \
+  --num-gpus 1 \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --output-path /tmp/cosmos3-sglang
+```
+
+For `Cosmos3-Super`, use multiple GPUs:
+
+```shell
+sglang serve \
+  --model-type diffusion \
+  --model-path nvidia/Cosmos3-Super \
+  --num-gpus 4 \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --output-path /tmp/cosmos3-sglang
+```
+
+Vision endpoints:
+
+| Mode | Endpoint | Notes |
+| --- | --- | --- |
+| Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 |
+| Text to video | `POST /v1/videos/sync` or `POST /v1/videos` | `/sync` blocks and returns MP4 bytes with `Accept: video/mp4` |
+| Image to video | `POST /v1/videos/sync` or `POST /v1/videos` | Upload the conditioning image with `input_reference` |
+
+Text-to-video example:
+
+```shell
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
+  --form-string "negative_prompt=blurry, distorted, low quality" \
+  --form-string "size=1280x720" \
+  --form-string "num_frames=81" \
+  --form-string "fps=24" \
+  --form-string "num_inference_steps=35" \
+  --form-string "guidance_scale=4.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=42" \
+  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+  -o cosmos3_t2v_output.mp4
+```
+
+Text-to-image example:
+
+```shell
+curl -sS -X POST http://localhost:8000/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "A warehouse robot folds a blue cloth on a clean workbench.",
+    "size": "1280x720",
+    "n": 1,
+    "num_inference_steps": 35,
+    "guidance_scale": 6.0,
+    "flow_shift": 10.0,
+    "seed": 0,
+    "extra_args": {
+      "use_resolution_template": false,
+      "guardrails": true
+    }
+  }'
+```
+
+SGLang accepts the same Cosmos 3 request knobs used by the vLLM-Omni examples: `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models.
+
+
+
 #### Reasoner with Transformers
 
 Coming soon!
From 52ff056b5b236b86b4c17cb7383f21f64ede54c6 Mon Sep 17 00:00:00 2001
From: Mick
Date: Mon, 1 Jun 2026 20:04:00 +0800
Subject: [PATCH 2/8] Use async SGLang video API in Cosmos3 docs

---
 README.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 7cc0083..c9fd0bf 100644
--- a/README.md
+++ b/README.md
@@ -467,14 +467,13 @@ Vision endpoints:
 | Mode | Endpoint | Notes |
 | --- | --- | --- |
 | Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 |
-| Text to video | `POST /v1/videos/sync` or `POST /v1/videos` | `/sync` blocks and returns MP4 bytes with `Accept: video/mp4` |
-| Image to video | `POST /v1/videos/sync` or `POST /v1/videos` | Upload the conditioning image with `input_reference` |
+| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
+| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
 
 Text-to-video example:
 
 ```shell
-curl -sS -X POST http://localhost:8000/v1/videos/sync \
-  -H "Accept: video/mp4" \
+job_id=$(curl -sS -X POST http://localhost:8000/v1/videos \
   --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
   --form-string "negative_prompt=blurry, distorted, low quality" \
   --form-string "size=1280x720" \
@@ -485,6 +484,17 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   --form-string "flow_shift=10.0" \
   --form-string "seed=42" \
   --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+  | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
+
+while true; do
+  status=$(curl -sS "http://localhost:8000/v1/videos/${job_id}" \
+    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
+  [ "$status" = "completed" ] && break
+  [ "$status" = "failed" ] && exit 1
+  sleep 1
+done
+
+curl -sS -L "http://localhost:8000/v1/videos/${job_id}/content" \
   -o cosmos3_t2v_output.mp4
 ```
 
From 44e38e7128a53b760a36fe7da1cd6d72bf2a096a Mon Sep 17 00:00:00 2001
From: Mick
Date: Tue, 2 Jun 2026 00:58:52 +0800
Subject: [PATCH 3/8] Update SGLang install snippet

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index c9fd0bf..7a722fa 100644
--- a/README.md
+++ b/README.md
@@ -434,6 +434,8 @@ Supported checkpoints:
 Install SGLang with diffusion extras:
 
 ```shell
+git clone https://github.com/sgl-project/sglang.git
+cd sglang
 pip install -e "python[diffusion]"
 pip install "cosmos-guardrail==0.3.1"
 ```
From 76c41280c998d60dac4bf0d18255b264b9f85529 Mon Sep 17 00:00:00 2001
From: Mick
Date: Tue, 2 Jun 2026 01:04:27 +0800
Subject: [PATCH 4/8] Remove vLLM reference from SGLang docs

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7a722fa..bd17041 100644
--- a/README.md
+++ b/README.md
@@ -520,7 +520,7 @@ curl -sS -X POST http://localhost:8000/v1/images/generations \
   }'
 ```
 
-SGLang accepts the same Cosmos 3 request knobs used by the vLLM-Omni examples: `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models.
+SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models.
 
From 54bf1407363f23798d06ae1124987b88dd8b00ff Mon Sep 17 00:00:00 2001
From: Mick
Date: Tue, 2 Jun 2026 01:10:56 +0800
Subject: [PATCH 5/8] Simplify SGLang Cosmos3 serve examples

---
 README.md | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index bd17041..5fef260 100644
--- a/README.md
+++ b/README.md
@@ -443,25 +443,15 @@ pip install "cosmos-guardrail==0.3.1"
 Start a Nano server:
 
 ```shell
-sglang serve \
-  --model-type diffusion \
-  --model-path nvidia/Cosmos3-Nano \
-  --num-gpus 1 \
-  --host 0.0.0.0 \
-  --port 8000 \
-  --output-path /tmp/cosmos3-sglang
+sglang serve --model-path nvidia/Cosmos3-Nano
 ```
 
-For `Cosmos3-Super`, use multiple GPUs:
+For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs:
 
 ```shell
 sglang serve \
-  --model-type diffusion \
-  --model-path nvidia/Cosmos3-Super \
-  --num-gpus 4 \
-  --host 0.0.0.0 \
-  --port 8000 \
-  --output-path /tmp/cosmos3-sglang
+  --model-path nvidia/Cosmos3-Super-Image2Video \
+  --num-gpus 4
 ```
 
 Vision endpoints:
@@ -475,7 +465,7 @@ Vision endpoints:
 Text-to-video example:
 
 ```shell
-job_id=$(curl -sS -X POST http://localhost:8000/v1/videos \
+job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
   --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
   --form-string "negative_prompt=blurry, distorted, low quality" \
   --form-string "size=1280x720" \
@@ -489,21 +479,21 @@ job_id=$(curl -sS -X POST http://localhost:8000/v1/videos \
   | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
 
 while true; do
-  status=$(curl -sS "http://localhost:8000/v1/videos/${job_id}" \
+  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
     | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
   [ "$status" = "completed" ] && break
   [ "$status" = "failed" ] && exit 1
   sleep 1
 done
 
-curl -sS -L "http://localhost:8000/v1/videos/${job_id}/content" \
+curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
   -o cosmos3_t2v_output.mp4
 ```
 
 Text-to-image example:
 
 ```shell
-curl -sS -X POST http://localhost:8000/v1/images/generations \
+curl -sS -X POST http://localhost:30000/v1/images/generations \
   -H "Content-Type: application/json" \
   -d '{
     "prompt": "A warehouse robot folds a blue cloth on a clean workbench.",
From 499f6295d35975a13c7171a8643332500d758110 Mon Sep 17 00:00:00 2001
From: Mick
Date: Tue, 2 Jun 2026 20:47:41 +0800
Subject: [PATCH 6/8] Clarify SGLang install pinning guidance

---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 5fef260..a23e1de 100644
--- a/README.md
+++ b/README.md
@@ -436,10 +436,14 @@ Install SGLang with diffusion extras:
 ```shell
 git clone https://github.com/sgl-project/sglang.git
 cd sglang
+# Optional: pin a release tag or known-good commit for reproducible deployments.
+# git checkout
 pip install -e "python[diffusion]"
 pip install "cosmos-guardrail==0.3.1"
 ```
 
+> **Version note:** Cosmos 3 support in SGLang Diffusion is actively improving. The command above installs the latest upstream SGLang so users get current Cosmos 3 fixes and performance work. For production deployments or reproducible benchmarks, pin an SGLang release tag or a known-good commit before running `pip install`.
+
 Start a Nano server:
 
 ```shell
From 03e80fe464b0e0b8d930a88ac1457a019e29443b Mon Sep 17 00:00:00 2001
From: Mick
Date: Wed, 3 Jun 2026 08:52:47 +0800
Subject: [PATCH 7/8] Update SGLang Cosmos3 README comments

---
 README.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index a23e1de..e1723e5 100644
--- a/README.md
+++ b/README.md
@@ -419,7 +419,7 @@ References:
 Expand SGLang generator setup, endpoints, and request reference
 
-Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. SGLang currently supports text-to-image, text-to-video, and image-to-video for the generator checkpoints; video-to-video, video-with-sound, and action modes are planned separately.
+Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. Cosmos 3 also includes video-with-sound and action/policy models; this SGLang section focuses on the currently supported text-to-image, text-to-video, and image-to-video generator serving paths.
 
 Supported checkpoints:
 
@@ -438,6 +438,9 @@ git clone https://github.com/sgl-project/sglang.git
 cd sglang
 # Optional: pin a release tag or known-good commit for reproducible deployments.
 # git checkout
+python -m venv .venv
+source .venv/bin/activate
+python -m pip install --upgrade pip
 pip install -e "python[diffusion]"
 pip install "cosmos-guardrail==0.3.1"
 ```
@@ -472,6 +475,7 @@ Vision endpoints:
 Text-to-video example:
 
 ```shell
+# Submit an async video generation job and capture its ID.
 job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
   --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
   --form-string "negative_prompt=blurry, distorted, low quality" \
@@ -484,16 +488,17 @@ job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
   --form-string "flow_shift=10.0" \
   --form-string "seed=42" \
   --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
-  | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
+  | jq -r .id)
 
-while true; do
-  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
-    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
-  [ "$status" = "completed" ] && break
+# Poll until the job completes. Cosmos 3 video generation can take several minutes.
+status=""
+until [ "$status" = "completed" ]; do
+  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status)
   [ "$status" = "failed" ] && exit 1
-  sleep 1
+  sleep 5
 done
 
+# Download the completed MP4.
 curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
   -o cosmos3_t2v_output.mp4
 ```
From 53049aeae8830171c0ba37b3dc3ebbd9f157a28c Mon Sep 17 00:00:00 2001
From: Mick
Date: Thu, 4 Jun 2026 08:19:54 +0800
Subject: [PATCH 8/8] Use SGLang main branch for Cosmos3 install

---
 README.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index e1723e5..1f89dc8 100644
--- a/README.md
+++ b/README.md
@@ -431,13 +431,11 @@ Supported checkpoints:
 | `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint |
 | `nvidia/Cosmos3-Nano-Policy-DROID` | Not supported yet | Action/policy checkpoint |
 
-Install SGLang with diffusion extras:
+Install SGLang from the main branch with diffusion extras:
 
 ```shell
-git clone https://github.com/sgl-project/sglang.git
+git clone --branch main https://github.com/sgl-project/sglang.git
 cd sglang
-# Optional: pin a release tag or known-good commit for reproducible deployments.
-# git checkout
 python -m venv .venv
 source .venv/bin/activate
 python -m pip install --upgrade pip
@@ -445,7 +443,7 @@ pip install -e "python[diffusion]"
 pip install "cosmos-guardrail==0.3.1"
 ```
 
-> **Version note:** Cosmos 3 support in SGLang Diffusion is actively improving. The command above installs the latest upstream SGLang so users get current Cosmos 3 fixes and performance work. For production deployments or reproducible benchmarks, pin an SGLang release tag or a known-good commit before running `pip install`.
+> **Version note:** Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
 
 Start a Nano server:
 
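
Note on the polling loop in PATCH 7/8: the shell loop polls `GET /v1/videos/{id}` until the status is `completed`, exits on `failed`, and has no upper bound on waiting. The same flow can be sketched in Python with a timeout. This is a minimal sketch, not part of the patch series: the helper names `wait_for_video` and `fetch_status_http`, the 30-minute default timeout, and the injectable `sleep`/`clock` parameters are assumptions made for illustration. Only the endpoint path, the `status` field, and the `completed`/`failed` values come from the README text above; any other status is treated as "still running".

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status_http(base_url: str, job_id: str) -> str:
    """Read the job status from GET /v1/videos/{id} (path taken from the README)."""
    with urllib.request.urlopen(f"{base_url}/v1/videos/{job_id}") as resp:
        return json.load(resp)["status"]


def wait_for_video(
    fetch_status: Callable[[], str],
    *,
    interval_s: float = 5.0,      # matches the `sleep 5` in the shell loop
    timeout_s: float = 1800.0,    # hypothetical upper bound, not from the README
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reports 'completed'; raise on 'failed' or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("video generation job failed")
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        # Any other status is assumed to mean the job is still running.
        sleep(interval_s)
```

With a server on the default port used in the later patches, a call might look like `wait_for_video(lambda: fetch_status_http("http://localhost:30000", job_id))`, followed by downloading `/v1/videos/{id}/content`. Unlike the shell loop, this returns immediately once the job completes instead of sleeping one extra interval.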