Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
@claude add PR Description
Claude finished @ankursingh-nv's task in 39s: Adding PR Description
PR description has been added with a summary of all changes, including the new GLM-5 FP8 B200 SGLang benchmark config, script details (TP8 and DEP8 modes), and a key configuration table.
LGTM: straightforward benchmark config addition following established patterns; the inline nits are worth addressing but non-blocking.
Overview
This PR adds a GLM-5 FP8 SGLang benchmark configuration for B200 GPUs across three files: a new config block in nvidia-master.yaml, a new benchmark shell script glm5_fp8_b200.sh, and a changelog entry in perf-changelog.yaml. The script supports both TP8 (low latency) and DEP8 (high throughput via dp-attention) modes with NSA attention backend.
Security risks
None. This is a benchmark configuration and launch script with no auth, user input handling, or external API interaction beyond downloading a model from HuggingFace.
Level of scrutiny
Low scrutiny appropriate. This is a mechanical addition of a new benchmark config following well-established patterns from dozens of similar scripts in the repo (e.g., glm5_fp8_mi355x.sh, qwen3.5_bf16_b200.sh, dsr1_fp8_b200.sh). The script structure, server flags, and benchmark invocation all match existing conventions.
Other factors
All three reported bugs are nits: (1) DP_ATTENTION missing from check_env_vars is a consistency issue; the CI framework always provides this variable, so it has no runtime impact. (2) The <PR_LINK> placeholder is a common pre-merge oversight seen in other entries (several existing entries have /pull/XXX placeholders). (3) The missing --tool-call-parser/--reasoning-parser flags only matter if RUN_EVAL=true is used, which is not the primary benchmark path. The PR is small, self-contained, and follows the repo's established patterns closely.
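For context on nit (1), here is a minimal sketch of what a `check_env_vars` guard could look like in the benchmark script. The function name comes from the review comment, but the exact variable list and error format are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical env-var guard like the check_env_vars the review mentions.
# The variable list below is an assumption, not the repo's actual list.
check_env_vars() {
    missing=0
    for var in MODEL PORT TP DP_ATTENTION; do
        # Indirect expansion: is the variable named in $var set and non-empty?
        eval "val=\"\${$var:-}\""
        if [ -z "$val" ]; then
            echo "Missing required env var: $var" >&2
            missing=1
        fi
    done
    return $missing
}
```

Adding DP_ATTENTION to such a list would make the script self-documenting about its inputs, even though CI always supplies the variable.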
```yaml
description:
  - "Add GLM-5 FP8 SGLang benchmark for B200"
  - "Supports TP8 (low latency) and DEP8 (high throughput) modes with NSA attention backend"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/915
```
🟡 The pr-link field at line 970 has the raw placeholder `<PR_LINK>` instead of the actual PR URL https://github.com/SemiAnalysisAI/InferenceX/pull/915. Please replace it before merging.
What the bug is
The new perf-changelog entry for glm5-fp8-b200-sglang has pr-link: <PR_LINK> which is a literal placeholder string, not a valid URL. Every changelog entry is expected to have a proper GitHub PR link so that changes can be traced back to their source.
How it manifests
Any tooling or scripts that parse perf-changelog.yaml and expect the pr-link field to be a valid URL will break or produce incorrect results when encountering <PR_LINK>. It also looks like an HTML/XML tag, which could cause issues with certain YAML consumers or markdown renderers.
Why existing patterns don't prevent it
The file does have a few other entries with placeholder pr-links using the /pull/XXX pattern (e.g., the glm5-fp8-mi355x-sglang and minimaxm2.5-fp8-h200-vllm entries), but those at least follow the URL format https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. The <PR_LINK> placeholder is a different and more obviously incomplete format that isn't even a URL.
Step-by-step proof
- Look at the diff for `perf-changelog.yaml`: the last new entry adds `pr-link: <PR_LINK>`
- The PR number is 915 (visible from the PR metadata)
- Therefore the correct value should be `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/915`
- `<PR_LINK>` is not a valid URL and breaks the consistency of the changelog
Fix
Replace `<PR_LINK>` with `https://github.com/SemiAnalysisAI/InferenceX/pull/915` on line 970.
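As a pre-merge guard, something like the following could catch leftover placeholders automatically. This is a sketch: the file path, function name, and the exact placeholder patterns (both the `<PR_LINK>` form and the `/pull/XXX` form mentioned above) are assumptions, not existing repo tooling:

```shell
#!/bin/sh
# Fail if the changelog still contains placeholder pr-link values.
# Matches both the literal <PR_LINK> token and the /pull/XXX placeholder form.
check_changelog() {
    file="$1"
    if grep -nE 'pr-link: *(<PR_LINK>|https://github\.com/.*/pull/XXX)' "$file"; then
        echo "Found placeholder pr-link entries in $file; replace before merging." >&2
        return 1
    fi
    return 0
}
```

Run as `check_changelog perf-changelog.yaml` in CI; a non-zero exit blocks the merge while printing the offending line numbers via `grep -n`.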
```shell
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP \
    $DP_FLAGS \
    --kv-cache-dtype fp8_e4m3 --quantization fp8 \
    --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
    --moe-runner-backend flashinfer_trtllm \
```
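The `$DP_FLAGS` variable in the launch command is where the TP8/DEP8 split would be wired in. A hedged sketch of the mode switch follows; the helper name is an assumption, and the repo's actual logic may differ, though `--enable-dp-attention` and `--dp-size` are real `sglang.launch_server` flags:

```shell
#!/bin/sh
# Hypothetical helper deriving $DP_FLAGS from the CI-provided DP_ATTENTION
# variable discussed in the review. Helper name is illustrative.
build_dp_flags() {
    if [ "${DP_ATTENTION:-false}" = "true" ]; then
        # DEP8: high throughput via data-parallel attention across 8 ranks
        echo "--enable-dp-attention --dp-size=8"
    else
        # TP8: low latency, tensor parallelism only (no extra flags)
        echo ""
    fi
}
```

With `DP_ATTENTION=true`, `DP_FLAGS=$(build_dp_flags)` expands to the dp-attention flags; otherwise it is empty and the launch falls back to plain TP8.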
Can you add this to the SGLang cookbook?
Of course; I'll let the team know once we finalize the config.
Summary
Add GLM-5 (FP8) benchmark configuration and script for B200 GPUs using SGLang.
Changes
- `.github/configs/nvidia-master.yaml`: Add `glm5-fp8-b200-sglang` config entry using the `lmsysorg/sglang:glm5-blackwell` image and `zai-org/GLM-5-FP8` model. Sweeps across three sequence-length configs (1k/1k, 1k/8k, 8k/1k) with TP8 (conc 4–32) and DEP8 with dp-attention (conc 32–128).
- `benchmarks/single_node/glm5_fp8_b200.sh`: New benchmark script supporting two modes: TP8 (low latency) and DEP8 (high throughput via dp-attention), both with the NSA attention backend.
- `perf-changelog.yaml`: Add changelog entry for the new config.

Key Configuration Details

| Setting | Value |
| --- | --- |
| Model | `zai-org/GLM-5-FP8` |
| Image | `lmsysorg/sglang:glm5-blackwell` |