
[NV] Add DSV4-pro GB300 vLLM recipes#1238

Merged
cquil11 merged 15 commits into main from dsv4-gb300-vllm
May 1, 2026

Conversation

@hjjq
Collaborator

@hjjq hjjq commented Apr 30, 2026

Summary

Adds DeepSeek-V4-Pro FP4 disaggregated Dynamo vLLM benchmark recipes for GB300 at the 8k/1k sequence-length sweep, and updates both GB300 launchers (NV and CW) to support the new dynamo-vllm framework for DSV4.

What Changed

New config: dsv4-fp4-gb300-dynamo-vllm (nvidia-master.yaml)

Six pareto points covering the 8k/1k ISL/OSL sweep:

| Recipe | Topology | Concurrency | Prefill | Decode |
| --- | --- | --- | --- | --- |
| 1p6d-dep4-tp4 | 1P + 6D (28 GPUs) | 192 | DEP=4 | TP=4 |
| 1p17d-tep4-tp4 | 1P + 17D (72 GPUs) | 18 | TEP=4 | TP=4 |
| 4p1d-dep4-dep8-24-c4096 | 4P + 1D (24 GPUs) | 4096 | DEP=4 | DEP=8 |
| 5p1d-dep4-dep8-28-c4096 | 5P + 1D (28 GPUs) | 4096 | DEP=4 | DEP=8 |
| 6p1d-dep4-dep8-32-c4096 | 6P + 1D (32 GPUs) | 4096 | DEP=4 | DEP=8 |
| 7p2d-dep4-dep16 | 7P + 2D (60 GPUs) | 3072 | DEP=4 | DEP=16 |

All recipes use vllm/vllm-openai:v0.20.0-ubuntu2404, deep_gemm_mega_moe MoE backend, and NATS/etcd disaggregated orchestration.
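The GPU totals quoted in the topology table follow directly from the worker counts and per-worker sizes. A minimal sketch (the helper is illustrative, not part of the repo):

```python
# Hypothetical helper: derive total GPU count for a disaggregated recipe.
# Worker counts and per-worker GPU sizes are taken from the table above;
# the function itself is a sketch, not repository code.

def total_gpus(prefill_workers, gpus_per_prefill, decode_workers, gpus_per_decode):
    """Total GPUs = prefill side + decode side."""
    return prefill_workers * gpus_per_prefill + decode_workers * gpus_per_decode

# 1p6d-dep4-tp4: 1 prefill x 4 GPUs + 6 decode x 4 GPUs
assert total_gpus(1, 4, 6, 4) == 28
# 7p2d-dep4-dep16: 7 prefill x 4 GPUs + 2 decode x 16 GPUs
assert total_gpus(7, 4, 2, 16) == 60
```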

Launcher updates

runners/launch_gb300-nv.sh

  • Added dsv4/fp4 model gate → model path /scratch/models/DeepSeek-V4-Pro.
  • Added dynamo-vllm + dsv4 branch that clones NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes and overlays the vLLM DSV4 recipes.
  • Isolated srt-slurm clone and venv dirs per run (keyed by RUN_KEY) to avoid collisions when multiple jobs share the same runner.
  • Set set -exo pipefail for stricter error handling.

runners/launch_gb300-cw.sh

  • Refactored the top-level gate from FRAMEWORK == dynamo-sglang to a MODEL_PREFIX + PRECISION outer gate with a FRAMEWORK inner gate, so CW now accepts both dynamo-sglang and dynamo-vllm for dsv4/fp4.
  • dynamo-sglang: keeps the existing fzyzcjy/srt-slurm fork pin and SGLang recipe overlay.
  • dynamo-vllm: checks out NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes and overlays the vLLM DSV4 recipes.
  • Moved repo/ref/recipe-path variables into the framework branches and generalized the clone/overlay block.
  • Added dynamo-vllm container entry in srtslurm.yaml.
  • CW still uses its cluster-local model path /scratch/models/dsv4/.
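The refactored CW gate (outer gate on model prefix + precision, inner gate on framework) can be modeled roughly as follows. This is a Python sketch of the control flow, not the actual bash launcher; the `existing-pin` ref for the sglang fork is an assumption, since the PR only says the existing pin is kept.

```python
# Illustrative model of the refactored launch_gb300-cw.sh gate:
# outer gate on (model prefix, precision), inner gate on framework.
# The fzyzcjy ref name is assumed; the real launcher keeps its existing pin.

def select_srt_slurm(model_prefix, precision, framework):
    """Return the (repo, ref) to clone for a given model/precision/framework."""
    if (model_prefix, precision) != ("dsv4", "fp4"):
        raise ValueError("unsupported model/precision combination")
    if framework == "dynamo-sglang":
        return ("fzyzcjy/srt-slurm", "existing-pin")  # ref name assumed
    if framework == "dynamo-vllm":
        return ("NVIDIA/srt-slurm", "aflowers/gb200-dsv4-recipes")
    raise ValueError("unsupported framework")
```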

Sweep config tooling (generate_sweep_configs.py)

  • Added _runner_values_for_filter() helper for --runner-node-filter support in test-config sweeps.
  • generate_test_config_sweep() now accepts runner_data and expands runner entries per the filter, enabling targeted single-node dispatch.
  • New tests for runner node filter expansion and no-match skipping.

Other

  • perf-changelog.yaml: added entry for dsv4-fp4-gb300-dynamo-vllm.
  • Runner tag is runner: gb300 (broad) so GitHub can schedule onto either NV or CW GB300 runners.

Validation

Local:

  • bash -n runners/launch_gb300-cw.sh
  • bash -n runners/launch_gb300-nv.sh
  • python3 -m pytest utils/matrix_logic -q → 151 passed

Workflow runs:

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread on .github/configs/nvidia-master.yaml
@hjjq hjjq changed the title [NV] Add DSV4-pro GB300 vLLM recipes [NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes Apr 30, 2026
@alec-flowers alec-flowers changed the title [NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes [NV] Add DSV4-pro GB300 vLLM recipes May 1, 2026
@alec-flowers
Collaborator

@Klaud-Cold Can you update the PR description?

@functionstackx
Contributor

@claude Can you update the PR description?

@Klaud-Cold
Collaborator

Klaud-Cold commented May 1, 2026

Claude finished @functionstackx's task in 1m 15s.


Updated PR Description

  • Gather context on current PR changes
  • Draft updated PR description
  • Update the PR description

Updated the PR description with:

  • Summary of the new dsv4-fp4-gb300-dynamo-vllm config and six srt-slurm recipe files
  • Pareto point table covering all six topologies (1p6d through 7p2d) with concurrency, prefill, and decode details
  • Launcher updates for both launch_gb300-nv.sh (new dsv4/fp4 model gate, per-run isolated srt-slurm/venv dirs, stricter error handling) and launch_gb300-cw.sh (refactored gate to support both dynamo-sglang and dynamo-vllm)
  • Sweep config tooling changes (_runner_values_for_filter helper, --runner-node-filter support in test-config)
  • Validation section with local checks and workflow run links

@cquil11
Collaborator

cquil11 commented May 1, 2026

@claude can you cross-check the master YAML diff against the recipe diff to make sure all params such as tp, ep, image, etc. match?

@Klaud-Cold
Collaborator

Klaud-Cold commented May 1, 2026

Claude finished @cquil11's task in 2m 52s.


Cross-checking master YAML vs recipe YAML parameters

  • Read the nvidia-master.yaml diff for the new dsv4-fp4-gb300-dynamo-vllm config
  • Read all 6 recipe YAML files
  • Cross-check each recipe's parameters against master YAML
  • Report findings

Mapping Convention

The master YAML uses abstract fields that map to vLLM config as follows:

  • tp = total GPUs per worker
  • ep = expert parallelism size
  • dp-attn: true → DEP mode: tensor-parallel-size = tp/ep, data-parallel-size = ep, enable-expert-parallel: true
  • dp-attn: false + ep > 1 → TEP mode: tensor-parallel-size = tp, enable-expert-parallel: true (no data-parallel-size)
  • ep: 1 → plain TP: tensor-parallel-size = tp, no EP flags
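The mapping convention can be expressed as a small function. Flag names follow vLLM's CLI convention (`--tensor-parallel-size` etc.); the helper itself is a sketch of the convention, not repository code.

```python
# Sketch of the master-YAML -> vLLM flag mapping convention:
# tp = total GPUs per worker, ep = expert-parallel size, dp_attn toggles DEP.

def vllm_flags(tp, ep, dp_attn):
    if dp_attn and ep > 1:  # DEP mode: split tp across data-parallel replicas
        return {"tensor-parallel-size": tp // ep,
                "data-parallel-size": ep,
                "enable-expert-parallel": True}
    if ep > 1:              # TEP mode: full TP with expert parallelism on
        return {"tensor-parallel-size": tp,
                "enable-expert-parallel": True}
    return {"tensor-parallel-size": tp}  # plain TP, no EP flags

# DEP=4 prefill (tp:4, ep:4, dp-attn:true) -> TP=1, DP=4, EP on
assert vllm_flags(4, 4, True) == {"tensor-parallel-size": 1,
                                  "data-parallel-size": 4,
                                  "enable-expert-parallel": True}
# Plain TP=4 decode (tp:4, ep:1)
assert vllm_flags(4, 1, False) == {"tensor-parallel-size": 4}
```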

Global Fields (all 6 recipes)

| Field | Master YAML | All Recipes | Match? |
| --- | --- | --- | --- |
| Image | vllm/vllm-openai:v0.20.0-ubuntu2404 | vllm/vllm-openai:v0.20.0-ubuntu2404 | ✅ |
| Model | deepseek-ai/DeepSeek-V4-Pro | path: deepseek-v4-pro / served-model-name: deepseek-ai/DeepSeek-V4-Pro | ✅ |
| Precision | fp4 | fp4 | ✅ |
| ISL/OSL | isl: 8192, osl: 1024 | isl: 8192, osl: 1024 | ✅ |

Recipe 1: disagg-gb300-1p6d-dep4-tp4.yaml (conc=192)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 192 | concurrencies: "192" | ✅ |
| Prefill workers | num-worker: 1 | prefill_workers: 1 | ✅ |
| Decode workers | num-worker: 6 | decode_workers: 6 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true (1×4 = 4 GPUs) | ✅ |
| Decode tp/ep/dp-attn | tp:4, ep:1, dp-attn:false → plain TP=4 | TP=4, no DP, no EP | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 4 | gpus_per_decode: 4 | ✅ |

⚠️ Issue: Decode block (lines 101-102) sets enable-ep-weight-filter: true and all2all-backend: "flashinfer_nvlink_one_sided" despite master having ep: 1 (no EP). These EP-only flags are inconsistent with a plain TP=4 decode. (Also flagged in prior review)


Recipe 2: disagg-gb300-1p17d-tep4-tp4.yaml (conc=18)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 18 | concurrencies: "18" | ✅ |
| Prefill workers | num-worker: 1 | prefill_workers: 1 | ✅ |
| Decode workers | num-worker: 17 | decode_workers: 17 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:false → TEP=4 | TP=4, EP=true, no DP | ✅ |
| Decode tp/ep/dp-attn | tp:4, ep:1, dp-attn:false → plain TP=4 | TP=4, no DP, no EP | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 4 | gpus_per_decode: 4 | ✅ |

⚠️ Issue: Same as 1p6d — decode block (lines 93-94) has enable-ep-weight-filter: true and all2all-backend: "flashinfer_nvlink_one_sided" without EP enabled.


Recipe 3: disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml (conc=4096)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 4 | prefill_workers: 4 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true (1×4 = 4 GPUs) | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true (1×8 = 8 GPUs) | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | 24 (from name) | 4×4 + 1×8 = 24 | ✅ |

⚠️ Issue: Decode block missing max-num-batched-tokens: 512 (all sibling DEP-decode recipes set it). (Also flagged in prior review)


Recipe 4: disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml (conc=4096)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 5 | prefill_workers: 5 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | 28 (from name) | 5×4 + 1×8 = 28 | ✅ |

⚠️ Issue: Same as 4p1d — decode block missing max-num-batched-tokens: 512.


Recipe 5: disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml (conc=4096)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 6 | prefill_workers: 6 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | 32 (from name) | 6×4 + 1×8 = 32 | ✅ |

⚠️ Issue: Same as 4p1d/5p1d — decode block missing max-num-batched-tokens: 512.


Recipe 6: disagg-gb300-7p2d-dep4-dep16.yaml (conc=3072)

| Field | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| Concurrency | 3072 | concurrencies: "3072" | ✅ |
| Prefill workers | num-worker: 7 | prefill_workers: 7 | ✅ |
| Decode workers | num-worker: 2 | decode_workers: 2 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:16, ep:16, dp-attn:true → DEP=16 | TP=1, DP=16, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 16 | gpus_per_decode: 16 | ✅ |
| Total GPUs | 60 | 7×4 + 2×16 = 60 | ✅ |

✅ No issues — this recipe is fully consistent.


Summary

All core params (image, model, precision, tp, ep, dp-attn, worker counts, GPU counts, concurrencies, ISL/OSL) match between the master YAML and all 6 recipes. No mismatches found on the primary fields.

Two consistency issues (also flagged in the prior automated review) remain worth noting:

| Issue | Affected Recipes | Severity |
| --- | --- | --- |
| Decode has EP-only flags (enable-ep-weight-filter, all2all-backend) but master says ep: 1 (no EP) | 1p6d, 1p17d | 🟡 Medium: likely a no-op, but inconsistent with all sibling recipes |
| Decode missing max-num-batched-tokens: 512 (all siblings set it) | 4p1d, 5p1d, 6p1d | 🟡 Low: max-num-seqs: 512 caps decode batches in practice, but deviates from the pattern |
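The cross-check itself boils down to comparing parsed fields between the master entry and each recipe. A minimal sketch, with simplified field names that are assumptions rather than the repository's exact schema:

```python
# Sketch of the master-vs-recipe parameter cross-check.
# Field names ("image", "precision", ...) are simplified assumptions.

def cross_check(master, recipe, fields=("image", "precision", "isl", "osl", "concurrency")):
    """Return (field, master_value, recipe_value) tuples for any mismatches."""
    return [(f, master.get(f), recipe.get(f))
            for f in fields
            if master.get(f) != recipe.get(f)]

master = {"image": "vllm/vllm-openai:v0.20.0-ubuntu2404", "precision": "fp4",
          "isl": 8192, "osl": 1024, "concurrency": 192}
recipe = dict(master)
assert cross_check(master, recipe) == []  # all primary fields match
```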

cquil11 and others added 2 commits April 30, 2026 23:56
Mirror the launch_gb300-nv.sh pattern: compute nodes are aarch64, the
runner pod is x86_64, so dispatch `enroot import` to a compute node
via srun rather than relying on a manually-staged sqsh. flock on the
shared VAST lock file serializes parallel cw_0/1/2/3 jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 1, 2026

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25202898401

Pin dsv4-fp4-gb300-dynamo-vllm to gb300-nv runners

The cw runners hit a srtctl version mismatch on the dynamo-vllm
srt-slurm pin (aflowers/gb200-dsv4-recipes rejects the
default_bash_preamble field, dropping the model_paths block). Route
this config to the nv runners until the cw srtctl pin is bumped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 merged commit 5959abc into main May 1, 2026
3 checks passed
@cquil11 cquil11 deleted the dsv4-gb300-vllm branch May 1, 2026 05:14
@github-actions
Contributor

github-actions Bot commented May 1, 2026

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25202974835

@github-actions
Contributor

github-actions Bot commented May 1, 2026

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25203293262

xiaohuguo2023 pushed a commit to xiaohuguo2023/InferenceX that referenced this pull request May 6, 2026
* Add recipes

* fix benchmark, fix srt-slurm branch

* update runner

* chore: resolve dsv4 gb300 changelog merge markers

* fix: use gb300 local dsv4 model path

* ci: support runner filtering for test configs

* fix: isolate gb300 srt setup state

* fix: remove unsupported gb300 recipe metadata

* clean up

* fix: support gb300 cw vllm launcher

* gb300-cw: import squash files via srun under flock

Mirror the launch_gb300-nv.sh pattern: compute nodes are aarch64, the
runner pod is x86_64, so dispatch `enroot import` to a compute node
via srun rather than relying on a manually-staged sqsh. flock on the
shared VAST lock file serializes parallel cw_0/1/2/3 jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin dsv4-fp4-gb300-dynamo-vllm to gb300-nv runners

The cw runners hit a srtctl version mismatch on the dynamo-vllm
srt-slurm pin (aflowers/gb200-dsv4-recipes rejects the
default_bash_preamble field, dropping the model_paths block). Route
this config to the nv runners until the cw srtctl pin is bumped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cameron Quilici <cjquilici@gmail.com>
Co-authored-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 participants