
[NV] GLM5 fp8 b200 SGL #915

Open
ankursingh-nv wants to merge 5 commits into main from nv/glm5-fp8-b200-sgl

Conversation

ankursingh-nv (Collaborator) commented Mar 17, 2026

Summary

Add GLM-5 (FP8) benchmark configuration and script for B200 GPUs using SGLang.

Changes

  • .github/configs/nvidia-master.yaml: Add glm5-fp8-b200-sglang config entry using the lmsysorg/sglang:glm5-blackwell image and zai-org/GLM-5-FP8 model. Sweeps across three sequence length configs (1k/1k, 1k/8k, 8k/1k) with TP8 (conc 4–32) and DEP8 with dp-attention (conc 32–128).

  • benchmarks/single_node/glm5_fp8_b200.sh: New benchmark script supporting two modes:

    • TP8 (low latency): Standard tensor-parallel serving
    • DEP8 (high throughput): Data-parallel attention with expert parallelism
    • Uses NSA attention backend with TRT-LLM decode, FlashInfer+TRT-LLM MoE runner, BF16 KV cache, FP8 quantization, and FlashInfer allreduce fusion.
  • perf-changelog.yaml: Add changelog entry for the new config.
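
The config entry described above can be sketched as follows. This is a hypothetical illustration only: the actual schema of nvidia-master.yaml is not shown in this thread, so every field name below is an assumption; the image, model, sequence lengths, and concurrency ranges come from the summary.

```yaml
# Hypothetical sketch; field names are assumptions, values are from the PR summary.
glm5-fp8-b200-sglang:
  image: lmsysorg/sglang:glm5-blackwell
  model: zai-org/GLM-5-FP8
  seq-len-configs: [1k/1k, 1k/8k, 8k/1k]
  modes:
    - name: tp8             # low latency, standard tensor parallelism
      concurrency: [4, 8, 16, 32]
    - name: dep8            # high throughput, dp-attention
      concurrency: [32, 64, 128]
```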

Key Configuration Details

| Parameter         | Value                         |
| ----------------- | ----------------------------- |
| Model             | zai-org/GLM-5-FP8             |
| Image             | lmsysorg/sglang:glm5-blackwell |
| Framework         | SGLang                        |
| GPU               | B200                          |
| Precision         | FP8                           |
| Attention Backend | NSA (TRT-LLM decode)          |
| MoE Runner        | FlashInfer + TRT-LLM          |
| KV Cache          | BF16                          |

github-actions (Contributor) commented:

Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a documentation PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

1 similar comment from @github-actions.

ankursingh-nv (Collaborator, Author) commented:

@claude add PR Description

Klaud-Cold (Collaborator) commented Mar 17, 2026

Claude finished @ankursingh-nv's task in 39s.


Adding PR Description

  • Gather context from changed files
  • Write and update PR description

PR description has been added with a summary of all changes, including the new GLM-5 FP8 B200 SGLang benchmark config, script details (TP8 and DEP8 modes), and a key configuration table.

claude bot (Contributor) left a comment:

LGTM — straightforward benchmark config addition following established patterns; inline nits are worth addressing but non-blocking.


Overview

This PR adds a GLM-5 FP8 SGLang benchmark configuration for B200 GPUs across three files: a new config block in nvidia-master.yaml, a new benchmark shell script glm5_fp8_b200.sh, and a changelog entry in perf-changelog.yaml. The script supports both TP8 (low latency) and DEP8 (high throughput via dp-attention) modes with NSA attention backend.

Security risks

None. This is a benchmark configuration and launch script with no auth, user input handling, or external API interaction beyond downloading a model from HuggingFace.

Level of scrutiny

Low scrutiny appropriate. This is a mechanical addition of a new benchmark config following well-established patterns from dozens of similar scripts in the repo (e.g., glm5_fp8_mi355x.sh, qwen3.5_bf16_b200.sh, dsr1_fp8_b200.sh). The script structure, server flags, and benchmark invocation all match existing conventions.

Other factors

All three reported bugs are nits:

  1. DP_ATTENTION missing from check_env_vars is a consistency issue; the CI framework always provides this variable, so it has no runtime impact.
  2. The <PR_LINK> placeholder is a common pre-merge oversight seen in other entries (several existing entries have /pull/XXX placeholders).
  3. The missing --tool-call-parser/--reasoning-parser flags only matter if RUN_EVAL=true is used, which is not the primary benchmark path.

The PR is small, self-contained, and follows the repo's established patterns closely.

description:
- "Add GLM-5 FP8 SGLang benchmark for B200"
- "Supports TP8 (low latency) and DEP8 (high throughput) modes with NSA attention backend"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/915
Contributor comment:

🟡 The pr-link field at line 970 has the raw placeholder <PR_LINK> instead of the actual PR URL https://github.com/SemiAnalysisAI/InferenceX/pull/915. Please replace it before merging.


What the bug is

The new perf-changelog entry for glm5-fp8-b200-sglang has pr-link: <PR_LINK> which is a literal placeholder string, not a valid URL. Every changelog entry is expected to have a proper GitHub PR link so that changes can be traced back to their source.

How it manifests

Any tooling or scripts that parse perf-changelog.yaml and expect the pr-link field to be a valid URL will break or produce incorrect results when encountering <PR_LINK>. It also looks like an HTML/XML tag, which could cause issues with certain YAML consumers or markdown renderers.

Why existing patterns don't prevent it

The file does have a few other entries with placeholder pr-links using the /pull/XXX pattern (e.g., the glm5-fp8-mi355x-sglang and minimaxm2.5-fp8-h200-vllm entries), but those at least follow the URL format https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. The <PR_LINK> placeholder is a different and more obviously incomplete format that isn't even a URL.

Step-by-step proof

  1. Look at the diff for perf-changelog.yaml — the last new entry adds pr-link: <PR_LINK>
  2. The PR number is 915 (visible from the PR metadata)
  3. Therefore the correct value should be pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/915
  4. <PR_LINK> is not a valid URL and breaks the consistency of the changelog

Fix

Replace <PR_LINK> with https://github.com/SemiAnalysisAI/InferenceX/pull/915 on line 970.

Comment on lines +46 to +52
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP \
$DP_FLAGS \
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
--nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
--moe-runner-backend flashinfer_trtllm \
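
The launch command above interpolates $DP_FLAGS. A minimal sketch of how the two modes might populate it, under stated assumptions: the SGLang flags --enable-dp-attention and --dp-size exist, but the exact flag set the script uses in DEP8 mode (for example, additional expert-parallel flags) is not shown in this excerpt, and the helper name is invented here.

```shell
#!/usr/bin/env bash
# Hypothetical helper mapping the benchmark mode to extra server flags.
dp_flags_for_mode() {
  case "$1" in
    tp8)  echo "" ;;                                  # plain tensor parallelism
    dep8) echo "--enable-dp-attention --dp-size=8" ;; # dp-attention (assumed flag set)
    *)    echo "Unknown mode: $1" >&2; return 1 ;;
  esac
}

# Default to the low-latency TP8 mode when MODE is unset.
DP_FLAGS=$(dp_flags_for_mode "${MODE:-tp8}")
```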
Contributor comment:
Can you add this to the SGLang cookbook?

Author (Collaborator) replied:
Of course, I'll let the team know once we finalize the config.

