Add SGLang Cosmos3 serving docs #174
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,6 +25,7 @@ | |
| - [Quickstart](#quickstart) | ||
| - [Generator with Diffusers](#generator-with-diffusers) | ||
| - [Generator with vLLM-Omni](#generator-with-vllm-omni) | ||
| - [Generator with SGLang](#generator-with-sglang) | ||
| - [Reasoner with Transformers](#reasoner-with-transformers) | ||
| - [Reasoner with vLLM](#reasoner-with-vllm) | ||
| - [Troubleshooting](#troubleshooting) | ||
|
|
@@ -413,6 +414,113 @@ References: | |
|
|
||
| </details> | ||
|
|
||
| #### Generator with SGLang | ||
|
|
||
| <details> | ||
| <summary>Expand SGLang generator setup, endpoints, and request reference</summary> | ||
|
|
||
| Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. Cosmos 3 also includes video-with-sound and action/policy models; this SGLang section focuses on the currently supported text-to-image, text-to-video, and image-to-video generator serving paths. | ||
|
|
||
| Supported checkpoints: | ||
|
|
||
| | Model | Status | Notes | | ||
| | --- | --- | --- | | ||
| | `nvidia/Cosmos3-Nano` | Supported | Text-to-image, text-to-video, image-to-video | | ||
| | `nvidia/Cosmos3-Super` | Supported | Use multiple GPUs for the 64B checkpoint | | ||
| | `nvidia/Cosmos3-Super-Text2Image` | Supported | Text-to-image specialized checkpoint | | ||
| | `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint | | ||
| | `nvidia/Cosmos3-Nano-Policy-DROID` | Not supported yet | Action/policy checkpoint | | ||
|
|
||
| Install SGLang from the main branch with diffusion extras: | ||
|
|
||
| ```shell | ||
| git clone --branch main https://github.com/sgl-project/sglang.git | ||
| cd sglang | ||
| python -m venv .venv | ||
| source .venv/bin/activate | ||
| python -m pip install --upgrade pip | ||
| pip install -e "python[diffusion]" | ||
|
Can we cut a tag or stable release of the sglang repo and pin it here?
Author
Good point. I added an optional checkout step plus a version note. The default keeps tracking upstream SGLang to pick up ongoing Cosmos 3 fixes and performance improvements, while production or reproducible deployments should pin a release tag or known-good commit before installing.
Collaborator
Probably best to support uv or venv.
Author
Added a venv setup before the editable SGLang install.
atharvajoshi10 marked this conversation as resolved.
|
||
| pip install "cosmos-guardrail==0.3.1" | ||
| ``` | ||
|
|
||
| > **Version note:** Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there. | ||
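The review thread above asks for a pinned SGLang version for reproducible deployments. A minimal sketch of that pinning step, assuming an existing clone: `pin_sglang_checkout` is a hypothetical helper name, and the ref you pass is whatever release tag or commit you have validated yourself (this PR does not name one).

```shell
# Hypothetical helper (not part of SGLang or this PR): check out a pinned
# tag or commit in an existing clone, then print the resolved commit hash.
pin_sglang_checkout() {
  repo_dir="$1"   # path to the sglang clone
  ref="$2"        # release tag or known-good commit (you choose this)
  git -C "$repo_dir" checkout --quiet --detach "$ref" || return 1
  git -C "$repo_dir" rev-parse HEAD
}
```

Run it after `git clone` and before the editable install, for example `pin_sglang_checkout sglang "$SGLANG_REF"` with `SGLANG_REF` set to your chosen ref. Recording the printed hash alongside the deployment makes the install reproducible later.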
|
|
||
| Start a Nano server: | ||
|
|
||
| ```shell | ||
| sglang serve --model-path nvidia/Cosmos3-Nano | ||
| ``` | ||
|
|
||
| For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs: | ||
|
|
||
| ```shell | ||
| sglang serve \ | ||
| --model-path nvidia/Cosmos3-Super-Image2Video \ | ||
| --num-gpus 4 | ||
|
If I'm not mistaken,
    sglang serve \
      --model-path nvidia/Cosmos3-Super-Image2Video \
      --num-gpus 4
is equivalent to CFG parallel plus Ulysses degree 2, i.e.
    sglang serve \
      --model-path nvidia/Cosmos3-Super-Image2Video \
      --num-gpus 4 --enable-cfg-parallel --ulysses-degree 2
which is indeed the preferred way to serve multi-GPU inference, but only if the model fits into a single GPU (>80GB). This is only the best setup for performance; it doesn't reduce memory requirements. A safer option would be to use FSDP in the Cosmos3-Super example, since that setup actually reduces memory requirements by sharding the weights across GPUs, i.e.:
    sglang serve \
      --model-path nvidia/Cosmos3-Super-Image2Video \
      --num-gpus 4 --use-fsdp-inference
Author
If we are looking for memory-friendly setups, yes, we could do better; either FSDP or offloading would do. |
||
| ``` | ||
|
|
||
| Vision endpoints: | ||
|
|
||
| | Mode | Endpoint | Notes | | ||
| | --- | --- | --- | | ||
| | Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 | | ||
| | Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` | | ||
| | Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` | | ||
|
|
||
| Text-to-video example: | ||
|
|
||
| ```shell | ||
| # Submit an async video generation job and capture its ID. | ||
| job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \ | ||
| --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \ | ||
| --form-string "negative_prompt=blurry, distorted, low quality" \ | ||
| --form-string "size=1280x720" \ | ||
| --form-string "num_frames=81" \ | ||
| --form-string "fps=24" \ | ||
| --form-string "num_inference_steps=35" \ | ||
| --form-string "guidance_scale=4.0" \ | ||
| --form-string "flow_shift=10.0" \ | ||
| --form-string "seed=42" \ | ||
| --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \ | ||
| | jq -r .id) | ||
|
|
||
| # Poll until the job completes. Cosmos 3 video generation can take several minutes. | ||
| status="" | ||
| until [ "$status" = "completed" ]; do | ||
| status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status) | ||
| [ "$status" = "failed" ] && exit 1 | ||
| sleep 5 | ||
| done | ||
|
|
||
| # Download the completed MP4. | ||
| curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \ | ||
| -o cosmos3_t2v_output.mp4 | ||
|
Comment on lines +474 to +497
Collaborator
Can we add comments here to improve readability?
Author
Added comments for the submit, poll, and download steps in the video example. |
||
| ``` | ||
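The endpoint table lists image-to-video on the same `POST /v1/videos` route, with the conditioning image uploaded as `input_reference`, but the diff has no example for it. A minimal sketch follows, assuming the server accepts `input_reference` as a multipart file upload. `submit_i2v_job` is a hypothetical helper name, and the host and port are copied from the text-to-video example.

```shell
# Hypothetical helper (not from the PR): submit an image-to-video job and
# print its job ID. Assumes the server accepts the conditioning image as a
# multipart file upload in the `input_reference` field.
submit_i2v_job() {
  image_path="$1"                          # local conditioning image
  prompt="$2"                              # text prompt for the video
  base_url="${3:-http://localhost:30000}"  # same host/port as the example above
  curl -sS -X POST "${base_url}/v1/videos" \
    --form-string "prompt=${prompt}" \
    -F "input_reference=@${image_path}" \
    | jq -r .id
}
```

The printed ID can be polled with `GET /v1/videos/{id}` and downloaded from `/content` exactly as in the text-to-video example. Other form fields from that example (`size`, `num_frames`, `seed`, and so on) would presumably be passed the same way, though this sketch leaves them at server defaults.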
|
|
||
| Text-to-image example: | ||
|
|
||
| ```shell | ||
| curl -sS -X POST http://localhost:30000/v1/images/generations \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "prompt": "A warehouse robot folds a blue cloth on a clean workbench.", | ||
| "size": "1280x720", | ||
| "n": 1, | ||
| "num_inference_steps": 35, | ||
| "guidance_scale": 6.0, | ||
| "flow_shift": 10.0, | ||
| "seed": 0, | ||
| "extra_args": { | ||
| "use_resolution_template": false, | ||
| "guardrails": true | ||
| } | ||
| }' | ||
| ``` | ||
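Because the image endpoint returns base64 by default, the response still has to be decoded before it can be viewed. A minimal sketch, assuming the response follows the OpenAI-style shape `{"data":[{"b64_json":"..."}]}`. That shape is not confirmed by this diff, and `save_first_image` is a hypothetical helper name.

```shell
# Hypothetical helper (not from the PR): read an images response from stdin
# and write the first image to a file. Assumes the OpenAI-style response
# shape {"data":[{"b64_json":"..."}]}; inspect the raw JSON if this fails.
save_first_image() {
  out_path="$1"   # destination file; pick an extension matching the returned format
  # `--decode` is the long option on GNU coreutils and macOS base64.
  jq -r '.data[0].b64_json' | base64 --decode > "$out_path"
}
```

To use it, pipe the `curl` command from the text-to-image example into `save_first_image <output-file>`.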
|
|
||
| SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models. | ||
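As a sketch of the guardrail opt-out described above, the environment variable can be set inline on the launch command. The function wrapper and its name are illustrative only; the variable name and the `sglang serve` invocation are taken from this section.

```shell
# Sketch: start the Nano server with the guardrail models skipped.
# Setting the variable inline limits it to this one server process.
start_nano_without_guardrails() {
  SGLANG_DISABLE_COSMOS3_GUARDRAILS=1 \
    sglang serve --model-path nvidia/Cosmos3-Nano
}
```

Setting the variable inline rather than with `export` keeps the opt-out from leaking into other servers started from the same shell.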
|
|
||
| </details> | ||
|
|
||
| #### Reasoner with Transformers | ||
| Coming soon! | ||
|
|
||
|
|
||
Probably good to specify we support other modalities such as sound and action.
Updated the wording to mention that Cosmos 3 includes video-with-sound and action/policy models, while keeping this SGLang section scoped to the currently supported T2I/T2V/I2V generator serving paths.