
[NV] Add DSV4-pro GB300 vLLM recipes#1238

Merged
cquil11 merged 15 commits into main from dsv4-gb300-vllm
May 1, 2026

Conversation

@hjjq
Collaborator

@hjjq hjjq commented Apr 30, 2026

Summary

Adds DeepSeek-V4-Pro FP4 disaggregated Dynamo vLLM benchmark recipes for GB300 at the 8k/1k sequence-length sweep, and updates both GB300 launchers (NV and CW) to support the new dynamo-vllm framework for DSV4.

What Changed

New config: dsv4-fp4-gb300-dynamo-vllm (nvidia-master.yaml)

Six pareto points covering the 8k/1k ISL/OSL sweep:

| Recipe | Topology | Concurrency | Prefill | Decode |
| --- | --- | --- | --- | --- |
| 1p6d-dep4-tp4 | 1P + 6D (28 GPUs) | 192 | DEP=4 | TP=4 |
| 1p17d-tep4-tp4 | 1P + 17D (72 GPUs) | 18 | TEP=4 | TP=4 |
| 4p1d-dep4-dep8-24-c4096 | 4P + 1D (24 GPUs) | 4096 | DEP=4 | DEP=8 |
| 5p1d-dep4-dep8-28-c4096 | 5P + 1D (28 GPUs) | 4096 | DEP=4 | DEP=8 |
| 6p1d-dep4-dep8-32-c4096 | 6P + 1D (32 GPUs) | 4096 | DEP=4 | DEP=8 |
| 7p2d-dep4-dep16 | 7P + 2D (60 GPUs) | 3072 | DEP=4 | DEP=16 |

All recipes use vllm/vllm-openai:v0.20.0-ubuntu2404, deep_gemm_mega_moe MoE backend, and NATS/etcd disaggregated orchestration.
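The GPU totals quoted in the topology table follow directly from the worker counts and per-worker sizes. A minimal sketch (the helper is illustrative, not part of the repo):

```python
# Hypothetical helper: derive total GPU count for a disaggregated recipe.
# Worker counts and per-worker GPU sizes are taken from the table above;
# the function itself is a sketch, not repository code.

def total_gpus(prefill_workers, gpus_per_prefill, decode_workers, gpus_per_decode):
    """Total GPUs = prefill side + decode side."""
    return prefill_workers * gpus_per_prefill + decode_workers * gpus_per_decode

# 1p6d-dep4-tp4: 1 prefill x 4 GPUs + 6 decode x 4 GPUs
assert total_gpus(1, 4, 6, 4) == 28
# 7p2d-dep4-dep16: 7 prefill x 4 GPUs + 2 decode x 16 GPUs
assert total_gpus(7, 4, 2, 16) == 60
```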

Launcher updates

runners/launch_gb300-nv.sh

  • Added dsv4/fp4 model gate → model path /scratch/models/DeepSeek-V4-Pro.
  • Added dynamo-vllm + dsv4 branch that clones NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes and overlays the vLLM DSV4 recipes.
  • Isolated srt-slurm clone and venv dirs per run (keyed by RUN_KEY) to avoid collisions when multiple jobs share the same runner.
  • Set set -exo pipefail for stricter error handling.

runners/launch_gb300-cw.sh

  • Refactored the top-level gate from FRAMEWORK == dynamo-sglang to a MODEL_PREFIX + PRECISION outer gate with a FRAMEWORK inner gate, so CW now accepts both dynamo-sglang and dynamo-vllm for dsv4/fp4.
  • dynamo-sglang: keeps the existing fzyzcjy/srt-slurm fork pin and SGLang recipe overlay.
  • dynamo-vllm: checks out NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes and overlays the vLLM DSV4 recipes.
  • Moved repo/ref/recipe-path variables into the framework branches and generalized the clone/overlay block.
  • Added dynamo-vllm container entry in srtslurm.yaml.
  • CW still uses its cluster-local model path /scratch/models/dsv4/.
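The refactored CW gate (outer gate on model prefix + precision, inner gate on framework) can be modeled roughly as follows. This is a Python sketch of the control flow, not the actual bash launcher; the `existing-pin` ref for the sglang fork is an assumption, since the PR only says the existing pin is kept.

```python
# Illustrative model of the refactored launch_gb300-cw.sh gate:
# outer gate on (model prefix, precision), inner gate on framework.
# The fzyzcjy ref name is assumed; the real launcher keeps its existing pin.

def select_srt_slurm(model_prefix, precision, framework):
    """Return the (repo, ref) to clone for a given model/precision/framework."""
    if (model_prefix, precision) != ("dsv4", "fp4"):
        raise ValueError("unsupported model/precision combination")
    if framework == "dynamo-sglang":
        return ("fzyzcjy/srt-slurm", "existing-pin")  # ref name assumed
    if framework == "dynamo-vllm":
        return ("NVIDIA/srt-slurm", "aflowers/gb200-dsv4-recipes")
    raise ValueError("unsupported framework")
```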

Sweep config tooling (generate_sweep_configs.py)

  • Added _runner_values_for_filter() helper for --runner-node-filter support in test-config sweeps.
  • generate_test_config_sweep() now accepts runner_data and expands runner entries per the filter, enabling targeted single-node dispatch.
  • New tests for runner node filter expansion and no-match skipping.

Other

  • perf-changelog.yaml: added entry for dsv4-fp4-gb300-dynamo-vllm.
  • Runner tag is runner: gb300 (broad) so GitHub can schedule onto either NV or CW GB300 runners.

Validation

Local:

  • bash -n runners/launch_gb300-cw.sh
  • bash -n runners/launch_gb300-nv.sh
  • python3 -m pytest utils/matrix_logic -q → 151 passed

Workflow runs:

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread on .github/configs/nvidia-master.yaml
@hjjq hjjq changed the title [NV] Add DSV4-pro GB300 vLLM recipes [NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes Apr 30, 2026
@alec-flowers alec-flowers changed the title [NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes [NV] Add DSV4-pro GB300 vLLM recipes May 1, 2026
@alec-flowers
Collaborator

@Klaud-Cold Can you update the PR description?

@functionstackx
Contributor

@claude Can you update the PR description?

@Klaud-Cold
Collaborator

Klaud-Cold commented May 1, 2026

Claude finished @functionstackx's task in 1m 15s.


Updated PR Description

  • Gather context on current PR changes
  • Draft updated PR description
  • Update the PR description

Updated the PR description with:

  • Summary of the new dsv4-fp4-gb300-dynamo-vllm config and six srt-slurm recipe files
  • Pareto point table covering all six topologies (1p6d through 7p2d) with concurrency, prefill, and decode details
  • Launcher updates for both launch_gb300-nv.sh (new dsv4/fp4 model gate, per-run isolated srt-slurm/venv dirs, stricter error handling) and launch_gb300-cw.sh (refactored gate to support both dynamo-sglang and dynamo-vllm)
  • Sweep config tooling changes (_runner_values_for_filter helper, --runner-node-filter support in test-config)
  • Validation section with local checks and workflow run links

@cquil11
Collaborator

cquil11 commented May 1, 2026

@claude can you cross-check the master YAML diff against the recipe diff to make sure all params such as tp, ep, image, etc. match?

@Klaud-Cold
Collaborator

Klaud-Cold commented May 1, 2026

Claude finished @cquil11's task in 2m 52s.


Cross-checking master YAML vs recipe YAML parameters

  • Read the nvidia-master.yaml diff for the new dsv4-fp4-gb300-dynamo-vllm config
  • Read all 6 recipe YAML files
  • Cross-check each recipe's parameters against master YAML
  • Report findings

Mapping Convention

The master YAML uses abstract fields that map to vLLM config as follows:

  • tp = total GPUs per worker
  • ep = expert parallelism size
  • dp-attn: true → DEP mode: tensor-parallel-size = tp/ep, data-parallel-size = ep, enable-expert-parallel: true
  • dp-attn: false + ep > 1 → TEP mode: tensor-parallel-size = tp, enable-expert-parallel: true (no data-parallel-size)
  • ep: 1 → plain TP: tensor-parallel-size = tp, no EP flags
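The mapping convention can be expressed as a small function. Flag names follow vLLM's CLI convention (`--tensor-parallel-size` etc.); the helper itself is a sketch of the convention, not repository code.

```python
# Sketch of the master-YAML -> vLLM flag mapping convention:
# tp = total GPUs per worker, ep = expert-parallel size, dp_attn toggles DEP.

def vllm_flags(tp, ep, dp_attn):
    if dp_attn and ep > 1:  # DEP mode: split tp across data-parallel replicas
        return {"tensor-parallel-size": tp // ep,
                "data-parallel-size": ep,
                "enable-expert-parallel": True}
    if ep > 1:              # TEP mode: full TP with expert parallelism on
        return {"tensor-parallel-size": tp,
                "enable-expert-parallel": True}
    return {"tensor-parallel-size": tp}  # plain TP, no EP flags

# DEP=4 prefill (tp:4, ep:4, dp-attn:true) -> TP=1, DP=4, EP on
assert vllm_flags(4, 4, True) == {"tensor-parallel-size": 1,
                                  "data-parallel-size": 4,
                                  "enable-expert-parallel": True}
# Plain TP=4 decode (tp:4, ep:1)
assert vllm_flags(4, 1, False) == {"tensor-parallel-size": 4}
```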

Global Fields (all 6 recipes)

| Field | Master YAML | All Recipes | Match? |
| --- | --- | --- | --- |
| Image | vllm/vllm-openai:v0.20.0-ubuntu2404 | vllm/vllm-openai:v0.20.0-ubuntu2404 | ✅ |
| Model | deepseek-ai/DeepSeek-V4-Pro | path: deepseek-v4-pro / served-model-name: deepseek-ai/DeepSeek-V4-Pro | ✅ |
| Precision | fp4 | fp4 | ✅ |
| ISL/OSL | isl: 8192, osl: 1024 | isl: 8192, osl: 1024 | ✅ |

Recipe 1: disagg-gb300-1p6d-dep4-tp4.yaml (conc=192)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 192 | concurrencies: "192" | ✅ |
| Prefill workers | num-worker: 1 | prefill_workers: 1 | ✅ |
| Decode workers | num-worker: 6 | decode_workers: 6 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true (1×4 = 4 GPUs) | ✅ |
| Decode tp/ep/dp-attn | tp:4, ep:1, dp-attn:false → plain TP=4 | TP=4, no DP, no EP | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 4 | gpus_per_decode: 4 | ✅ |

⚠️ Issue: Decode block (lines 101-102) sets enable-ep-weight-filter: true and all2all-backend: "flashinfer_nvlink_one_sided" despite master having ep: 1 (no EP). These EP-only flags are inconsistent with a plain TP=4 decode. (Also flagged in prior review)


Recipe 2: disagg-gb300-1p17d-tep4-tp4.yaml (conc=18)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 18 | concurrencies: "18" | ✅ |
| Prefill workers | num-worker: 1 | prefill_workers: 1 | ✅ |
| Decode workers | num-worker: 17 | decode_workers: 17 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:false → TEP=4 | TP=4, EP=true, no DP | ✅ |
| Decode tp/ep/dp-attn | tp:4, ep:1, dp-attn:false → plain TP=4 | TP=4, no DP, no EP | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 4 | gpus_per_decode: 4 | ✅ |

⚠️ Issue: Same as 1p6d — decode block (lines 93-94) has enable-ep-weight-filter: true and all2all-backend: "flashinfer_nvlink_one_sided" without EP enabled.


Recipe 3: disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml (conc=4096)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 4 | prefill_workers: 4 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true (1×4 = 4 GPUs) | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true (1×8 = 8 GPUs) | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | 24 (from name) | 4×4 + 1×8 = 24 | ✅ |

⚠️ Issue: Decode block missing max-num-batched-tokens: 512 (all sibling DEP-decode recipes set it). (Also flagged in prior review)


Recipe 4: disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml (conc=4096)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 5 | prefill_workers: 5 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | 28 (from name) | 5×4 + 1×8 = 28 | ✅ |

⚠️ Issue: Same as 4p1d — decode block missing max-num-batched-tokens: 512.


Recipe 5: disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml (conc=4096)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 6 | prefill_workers: 6 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | 32 (from name) | 6×4 + 1×8 = 32 | ✅ |

⚠️ Issue: Same as 4p1d/5p1d — decode block missing max-num-batched-tokens: 512.


Recipe 6: disagg-gb300-7p2d-dep4-dep16.yaml (conc=3072)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 3072 | concurrencies: "3072" | ✅ |
| Prefill workers | num-worker: 7 | prefill_workers: 7 | ✅ |
| Decode workers | num-worker: 2 | decode_workers: 2 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:16, ep:16, dp-attn:true → DEP=16 | TP=1, DP=16, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 16 | gpus_per_decode: 16 | ✅ |
| Total GPUs | 60 | 7×4 + 2×16 = 60 | ✅ |

✅ No issues — this recipe is fully consistent.


Summary

All core params (image, model, precision, tp, ep, dp-attn, worker counts, GPU counts, concurrencies, ISL/OSL) match between the master YAML and all 6 recipes. No mismatches found on the primary fields.

Two consistency issues (also flagged in the prior automated review) remain worth noting:

| Issue | Affected Recipes | Severity |
| --- | --- | --- |
| Decode has EP-only flags (enable-ep-weight-filter, all2all-backend) but master says ep: 1 (no EP) | 1p6d, 1p17d | 🟡 Medium: likely a no-op, but inconsistent with all sibling recipes |
| Decode missing max-num-batched-tokens: 512 (all siblings set it) | 4p1d, 5p1d, 6p1d | 🟡 Low: max-num-seqs: 512 caps decode batches in practice, but deviates from the pattern |
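The cross-check itself boils down to comparing parsed fields between the master entry and each recipe. A minimal sketch, with simplified field names that are assumptions rather than the repository's exact schema:

```python
# Sketch of the master-vs-recipe parameter cross-check.
# Field names ("image", "precision", ...) are simplified assumptions.

def cross_check(master, recipe, fields=("image", "precision", "isl", "osl", "concurrency")):
    """Return (field, master_value, recipe_value) tuples for any mismatches."""
    return [(f, master.get(f), recipe.get(f))
            for f in fields
            if master.get(f) != recipe.get(f)]

master = {"image": "vllm/vllm-openai:v0.20.0-ubuntu2404", "precision": "fp4",
          "isl": 8192, "osl": 1024, "concurrency": 192}
recipe = dict(master)
assert cross_check(master, recipe) == []  # all primary fields match
```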

cquil11 and others added 2 commits April 30, 2026 23:56
Mirror the launch_gb300-nv.sh pattern: compute nodes are aarch64, the
runner pod is x86_64, so dispatch `enroot import` to a compute node
via srun rather than relying on a manually-staged sqsh. flock on the
shared VAST lock file serializes parallel cw_0/1/2/3 jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 1, 2026

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25202898401

Pin dsv4-fp4-gb300-dynamo-vllm to gb300-nv runners

The cw runners hit a srtctl version mismatch on the dynamo-vllm
srt-slurm pin (aflowers/gb200-dsv4-recipes rejects the
default_bash_preamble field, dropping the model_paths block). Route
this config to the nv runners until the cw srtctl pin is bumped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 merged commit 5959abc into main May 1, 2026
3 checks passed
@cquil11 cquil11 deleted the dsv4-gb300-vllm branch May 1, 2026 05:14
@github-actions
Contributor

github-actions Bot commented May 1, 2026

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25202974835

@github-actions
Contributor

github-actions Bot commented May 1, 2026

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25203293262

xiaohuguo2023 pushed a commit to xiaohuguo2023/InferenceX that referenced this pull request May 6, 2026
* Add recipes

* fix benchmark, fix srt-slurm branch

* update runner

* chore: resolve dsv4 gb300 changelog merge markers

* fix: use gb300 local dsv4 model path

* ci: support runner filtering for test configs

* fix: isolate gb300 srt setup state

* fix: remove unsupported gb300 recipe metadata

* clean up

* fix: support gb300 cw vllm launcher

* gb300-cw: import squash files via srun under flock

Mirror the launch_gb300-nv.sh pattern: compute nodes are aarch64, the
runner pod is x86_64, so dispatch `enroot import` to a compute node
via srun rather than relying on a manually-staged sqsh. flock on the
shared VAST lock file serializes parallel cw_0/1/2/3 jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin dsv4-fp4-gb300-dynamo-vllm to gb300-nv runners

The cw runners hit a srtctl version mismatch on the dynamo-vllm
srt-slurm pin (aflowers/gb200-dsv4-recipes rejects the
default_bash_preamble field, dropping the model_paths block). Route
this config to the nv runners until the cw srtctl pin is bumped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cameron Quilici <cjquilici@gmail.com>
Co-authored-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 participants