
[NV] GLM5 fp8 b200 SGL #915

Open
ankursingh-nv wants to merge 5 commits into main from nv/glm5-fp8-b200-sgl

Conversation

ankursingh-nv (Collaborator) commented Mar 17, 2026

Summary

Add GLM-5 (FP8) benchmark configuration and script for B200 GPUs using SGLang.

Changes

  • .github/configs/nvidia-master.yaml: Add glm5-fp8-b200-sglang config entry using the lmsysorg/sglang:glm5-blackwell image and zai-org/GLM-5-FP8 model. Sweeps across three sequence length configs (1k/1k, 1k/8k, 8k/1k) with TP8 (conc 4–32) and DEP8 with dp-attention (conc 32–128).

  • benchmarks/single_node/glm5_fp8_b200.sh: New benchmark script supporting two modes:

    • TP8 (low latency): Standard tensor-parallel serving
    • DEP8 (high throughput): Data-parallel attention with expert parallelism
    • Uses NSA attention backend with TRT-LLM decode, FlashInfer+TRT-LLM MoE runner, BF16 KV cache, FP8 quantization, and FlashInfer allreduce fusion.
  • perf-changelog.yaml: Add changelog entry for the new config.
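
The config entry described above can be sketched as follows. This is a hypothetical illustration only: the actual schema of nvidia-master.yaml is not shown in this thread, so every field name below is an assumption; the image, model, sequence lengths, and concurrency ranges come from the summary.

```yaml
# Hypothetical sketch; field names are assumptions, values are from the PR summary.
glm5-fp8-b200-sglang:
  image: lmsysorg/sglang:glm5-blackwell
  model: zai-org/GLM-5-FP8
  seq-len-configs: [1k/1k, 1k/8k, 8k/1k]
  modes:
    - name: tp8             # low latency, standard tensor parallelism
      concurrency: [4, 8, 16, 32]
    - name: dep8            # high throughput, dp-attention
      concurrency: [32, 64, 128]
```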

Key Configuration Details

| Parameter         | Value                         |
| ----------------- | ----------------------------- |
| Model             | zai-org/GLM-5-FP8             |
| Image             | lmsysorg/sglang:glm5-blackwell |
| Framework         | SGLang                        |
| GPU               | B200                          |
| Precision         | FP8                           |
| Attention Backend | NSA (TRT-LLM decode)          |
| MoE Runner        | FlashInfer + TRT-LLM          |
| KV Cache          | BF16                          |

github-actions (Contributor) commented:

Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a documentation PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

1 similar comment from @github-actions.

ankursingh-nv (Collaborator, Author) commented:

@claude add PR Description

Klaud-Cold (Collaborator) commented Mar 17, 2026

Claude finished @ankursingh-nv's task in 39s.


Adding PR Description

  • Gather context from changed files
  • Write and update PR description

PR description has been added with a summary of all changes, including the new GLM-5 FP8 B200 SGLang benchmark config, script details (TP8 and DEP8 modes), and a key configuration table.

claude bot (Contributor) left a comment:

LGTM — straightforward benchmark config addition following established patterns; inline nits are worth addressing but non-blocking.


Overview

This PR adds a GLM-5 FP8 SGLang benchmark configuration for B200 GPUs across three files: a new config block in nvidia-master.yaml, a new benchmark shell script glm5_fp8_b200.sh, and a changelog entry in perf-changelog.yaml. The script supports both TP8 (low latency) and DEP8 (high throughput via dp-attention) modes with NSA attention backend.

Security risks

None. This is a benchmark configuration and launch script with no auth, user input handling, or external API interaction beyond downloading a model from HuggingFace.

Level of scrutiny

Low scrutiny appropriate. This is a mechanical addition of a new benchmark config following well-established patterns from dozens of similar scripts in the repo (e.g., glm5_fp8_mi355x.sh, qwen3.5_bf16_b200.sh, dsr1_fp8_b200.sh). The script structure, server flags, and benchmark invocation all match existing conventions.

Other factors

All three reported bugs are nits:

  1. DP_ATTENTION missing from check_env_vars is a consistency issue; the CI framework always provides this variable, so it has no runtime impact.
  2. The <PR_LINK> placeholder is a common pre-merge oversight seen in other entries (several existing entries have /pull/XXX placeholders).
  3. The missing --tool-call-parser/--reasoning-parser flags only matter if RUN_EVAL=true is used, which is not the primary benchmark path.

The PR is small, self-contained, and follows the repo's established patterns closely.

description:
- "Add GLM-5 FP8 SGLang benchmark for B200"
- "Supports TP8 (low latency) and DEP8 (high throughput) modes with NSA attention backend"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/915
Contributor comment:

🟡 The pr-link field at line 970 has the raw placeholder <PR_LINK> instead of the actual PR URL https://github.com/SemiAnalysisAI/InferenceX/pull/915. Please replace it before merging.


What the bug is

The new perf-changelog entry for glm5-fp8-b200-sglang has pr-link: <PR_LINK> which is a literal placeholder string, not a valid URL. Every changelog entry is expected to have a proper GitHub PR link so that changes can be traced back to their source.

How it manifests

Any tooling or scripts that parse perf-changelog.yaml and expect the pr-link field to be a valid URL will break or produce incorrect results when encountering <PR_LINK>. It also looks like an HTML/XML tag, which could cause issues with certain YAML consumers or markdown renderers.

Why existing patterns don't prevent it

The file does have a few other entries with placeholder pr-links using the /pull/XXX pattern (e.g., the glm5-fp8-mi355x-sglang and minimaxm2.5-fp8-h200-vllm entries), but those at least follow the URL format https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. The <PR_LINK> placeholder is a different and more obviously incomplete format that isn't even a URL.

Step-by-step proof

  1. Look at the diff for perf-changelog.yaml — the last new entry adds pr-link: <PR_LINK>
  2. The PR number is 915 (visible from the PR metadata)
  3. Therefore the correct value should be pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/915
  4. <PR_LINK> is not a valid URL and breaks the consistency of the changelog

Fix

Replace <PR_LINK> with https://github.com/SemiAnalysisAI/InferenceX/pull/915 on line 970.

Comment on lines +46 to +52
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP \
$DP_FLAGS \
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
--nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
--moe-runner-backend flashinfer_trtllm \
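
The launch command above interpolates $DP_FLAGS. A minimal sketch of how the two modes might populate it, under stated assumptions: the SGLang flags --enable-dp-attention and --dp-size exist, but the exact flag set the script uses in DEP8 mode (for example, additional expert-parallel flags) is not shown in this excerpt, and the helper name is invented here.

```shell
#!/usr/bin/env bash
# Hypothetical helper mapping the benchmark mode to extra server flags.
dp_flags_for_mode() {
  case "$1" in
    tp8)  echo "" ;;                                  # plain tensor parallelism
    dep8) echo "--enable-dp-attention --dp-size=8" ;; # dp-attention (assumed flag set)
    *)    echo "Unknown mode: $1" >&2; return 1 ;;
  esac
}

# Default to the low-latency TP8 mode when MODE is unset.
DP_FLAGS=$(dp_flags_for_mode "${MODE:-tp8}")
```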
Contributor comment:
Can you add this to the SGLang cookbook?

Author (Collaborator) replied:
Of course, I'll let the team know once we finalize the config.

