
[NV] Add GLM-5 h200 sglang#914

Merged
ankursingh-nv merged 6 commits into main from nv/h200-GLM5
Mar 19, 2026

Conversation

@hshrivastava-droid
Collaborator

No description provided.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

Comment on lines +55 to +73
python3 utils/bench_serving/benchmark_serving.py \
--backend openai-chat \
--base-url "$SGLANG_URL" \
--endpoint /v1/chat/completions \
--model glm-5-fp8 \
$TOKENIZER_ARGS \
--dataset-name random \
--num-prompts "$num_prompts" \
--random-input-len "$ISL" \
--random-output-len "$OSL" \
--random-range-ratio "${RANDOM_RANGE_RATIO:-0.8}" \
--ignore-eos \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency "$CONC" \
--save-result \
--result-dir /workspace \
--result-filename "$RESULT_FILENAME.json"

# After throughput, run evaluation only if RUN_EVAL is true
Contributor


🔴 This script calls python3 utils/bench_serving/benchmark_serving.py directly instead of using the run_benchmark_serving helper that every other benchmark script (50+) uses, including the sibling glm5_fp8_mi355x.sh. This causes multiple compounding issues: --backend openai-chat instead of vllm, --model glm-5-fp8 (not a valid HF ID) will crash tokenizer loading for the random dataset, num_prompts uses CONC*5 instead of the standard CONC*10, and --num-warmups is missing so cold-start latency will skew results. Switching to run_benchmark_serving (as the MI355X script does) would fix all of these at once.

Extended reasoning...

What the bug is

The new glm5_fp8_h200.sh benchmark script (lines 57-73) directly invokes python3 utils/bench_serving/benchmark_serving.py instead of using the run_benchmark_serving helper function from benchmark_lib.sh. This is the only script out of 50+ benchmark scripts in the repo that does this. The companion script glm5_fp8_mi355x.sh for the same model on MI355X hardware correctly uses the helper.

Why this matters — multiple compounding issues

Bypassing the helper causes several problems at once:

  1. Wrong backend (line 58): Uses --backend openai-chat with --endpoint /v1/chat/completions. Every other SGLang benchmark uses --backend vllm (which hits /v1/completions). This measures different API behavior (chat template processing overhead), making H200 results non-comparable with MI355X results for the same model.

  2. Tokenizer crash (line 60): Passes --model glm-5-fp8 (the served model name) with --dataset-name random. The random dataset generator needs a tokenizer, and falls back to args.model when --tokenizer is not provided (benchmark_serving.py line 642: tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model). Since glm-5-fp8 is not a valid HuggingFace model ID (the real one is zai-org/GLM-5-FP8), get_tokenizer("glm-5-fp8") will fail. The script has a BENCH_TOKENIZER env var workaround (lines 47-50), but this variable is never set in the YAML config or anywhere in the launch chain. The MI355X script avoids this by passing --model "$MODEL" (the real HF ID) via run_benchmark_serving.

  3. Half the standard prompts (line 55): num_prompts=$((CONC * 5)) uses a multiplier of 5 instead of the standard 10 used by every other script (including glm5_fp8_mi355x.sh line 61: --num-prompts "$((CONC * 10))"). This produces less statistically reliable results.

  4. No warmup (missing flag): The helper provides --num-warmups "$((2 * max_concurrency))" (benchmark_lib.sh line 332), but the direct call omits this entirely. The default in benchmark_serving.py is 0 (line 1069), so cold-start latency (CUDA graph compilation, kernel warmup) will be included in measurements.

  5. Missing --request-rate inf: The helper sets this (line 328) but the direct call omits it, potentially throttling the benchmark.

  6. No server health monitoring: The helper monitors the server process during benchmarking (lines 350-373) and reports if it dies mid-run. The direct call has no such protection.

Step-by-step proof of the tokenizer crash

  1. YAML config sets model: zai-org/GLM-5-FP8 → passed as $MODEL env var
  2. Script starts server with --served-model-name glm-5-fp8 (line 40)
  3. Script calls benchmark_serving.py --model glm-5-fp8 (line 60) — uses served name, not $MODEL
  4. BENCH_TOKENIZER is not set anywhere in config, so TOKENIZER_ARGS is empty (line 48)
  5. benchmark_serving.py resolves the tokenizer: tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model → "glm-5-fp8"
  6. get_tokenizer("glm-5-fp8") calls AutoTokenizer.from_pretrained("glm-5-fp8") → fails because glm-5-fp8 is not a valid HF model ID
  7. The benchmark crashes before any requests are sent
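
The BENCH_TOKENIZER escape hatch the review refers to could be wired up roughly like this (a sketch; `BENCH_TOKENIZER` and `TOKENIZER_ARGS` are the names the review attributes to the script, and the surrounding logic is illustrative, not the actual patch):

```shell
# Sketch of the BENCH_TOKENIZER workaround: when the served model
# name is not a valid HF ID, an explicit tokenizer must be passed,
# otherwise benchmark_serving.py falls back to --model and crashes
# during tokenizer load.
resolve_tokenizer_args() {
    if [ -n "${BENCH_TOKENIZER:-}" ]; then
        # Pin the tokenizer to a real HF ID, e.g. zai-org/GLM-5-FP8
        echo "--tokenizer ${BENCH_TOKENIZER}"
    else
        # Empty: benchmark_serving.py will fall back to --model
        echo ""
    fi
}
```

Usage would be `TOKENIZER_ARGS=$(resolve_tokenizer_args)` before the benchmark invocation; but since nothing in the launch chain sets BENCH_TOKENIZER, the fallback path is what actually runs.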

How to fix

Replace lines 45-73 with the same pattern used in glm5_fp8_mi355x.sh:

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

This fixes all six issues simultaneously by leveraging the standardized helper function.

Contributor


wow, claude review is pretty good @kedarpotdar-nv

Comment on lines +31 to +39
--host 0.0.0.0 \
--port "$PORT" \
--tp-size "$TP" \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.85 \
--served-model-name glm-5-fp8 \
> "$SERVER_LOG" 2>&1 &

Contributor


🟡 Two issues with the SGLang server launch: (1) --trust-remote-code is missing — the MI355X GLM-5 script includes it, and virtually every other non-trivial model script in this repo does too. (2) --served-model-name glm-5-fp8 will cause run_eval to fail when RUN_EVAL=true, because run_lm_eval resolves MODEL_NAME to $MODEL (i.e. zai-org/GLM-5-FP8), but the server only accepts glm-5-fp8. Either remove --served-model-name or add export MODEL_NAME=glm-5-fp8 before calling run_eval.

Extended reasoning...

Missing --trust-remote-code

The new glm5_fp8_h200.sh launches the SGLang server (lines 29-38) without --trust-remote-code. The sibling script glm5_fp8_mi355x.sh for the same GLM-5 model explicitly includes --trust-remote-code at line 41. A grep across benchmarks/single_node/ shows that virtually every model script (dsr1, qwen3.5, kimik2.5, minimaxm2.5, etc.) includes this flag. GLM-5 (zai-org/GLM-5-FP8) is a custom model from Zhipu AI that likely ships custom modeling code requiring trust-remote-code to load properly.

While the custom Docker image lmsysorg/sglang:glm5-hopper may have GLM-5 support baked in (potentially making the flag unnecessary at runtime), the MI355X script uses a similarly model-specific image (rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219) and still includes the flag. Adding --trust-remote-code is harmless and ensures the script works even if the image changes or the model code is loaded dynamically.

--served-model-name breaks eval

The server is launched with --served-model-name glm-5-fp8 (line 37), which means SGLang will only accept API requests where model=glm-5-fp8. When RUN_EVAL=true (lines 74-76), the script calls run_eval, which invokes run_lm_eval in benchmark_lib.sh. At line 607, MODEL_NAME is resolved as:

MODEL_NAME=${MODEL_NAME:-$MODEL}

Since MODEL_NAME is never exported in glm5_fp8_h200.sh, it falls back to $MODEL, which is zai-org/GLM-5-FP8 (set from the YAML config). Then at line 617, lm_eval sends requests with model=zai-org/GLM-5-FP8 in the --model_args. The server rejects these because it only recognizes the served model name glm-5-fp8.
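
The `${MODEL_NAME:-$MODEL}` fallback can be demonstrated standalone with this PR's values (plain bash default-expansion; variable contents taken from the config described above):

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# otherwise to the default. Reproducing the resolution at
# benchmark_lib.sh:607 with this PR's values:
MODEL="zai-org/GLM-5-FP8"

unset MODEL_NAME
echo "${MODEL_NAME:-$MODEL}"   # zai-org/GLM-5-FP8 -- the mismatching HF ID

MODEL_NAME="glm-5-fp8"
echo "${MODEL_NAME:-$MODEL}"   # glm-5-fp8 -- matches --served-model-name
```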

Step-by-step proof:

  1. YAML config sets model: zai-org/GLM-5-FP8, which becomes the $MODEL env var
  2. Server starts with --served-model-name glm-5-fp8 → only accepts model=glm-5-fp8 in API calls
  3. RUN_EVAL=true triggers run_eval at line 74
  4. run_lm_eval at benchmark_lib.sh:607 does MODEL_NAME=${MODEL_NAME:-$MODEL} → resolves to zai-org/GLM-5-FP8
  5. Line 617: --model_args "model=${MODEL_NAME},..." sends model=zai-org/GLM-5-FP8 to the server
  6. Server rejects the request — model name mismatch

The only other script using --served-model-name is qwen3.5_bf16_b200.sh, which sets it to Qwen/Qwen3.5-397B-A17B — matching the actual HF model ID, so there is no mismatch there. The MI355X GLM-5 script avoids this entirely by not using --served-model-name.

Fix: Add --trust-remote-code to the server launch args. For the eval issue, either remove --served-model-name glm-5-fp8 from the server launch, or add export MODEL_NAME=glm-5-fp8 before the run_eval call at line 74.
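
The second fix option could look like this in the script (a sketch under the names used above; run_eval and RUN_EVAL come from the repo's benchmark_lib.sh conventions as described in this review, not from a verified patch):

```shell
# Export the served alias so run_lm_eval's ${MODEL_NAME:-$MODEL}
# fallback picks it up instead of the HF ID in $MODEL.
export MODEL_NAME=glm-5-fp8

# Guarded eval call: only runs when RUN_EVAL=true.
if [ "${RUN_EVAL:-false}" = "true" ]; then
    run_eval
fi
```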

Collaborator


@claude evals were successful with current flow

Collaborator

@Klaud-Cold Klaud-Cold Mar 17, 2026


Claude finished @kedarpotdar-nv's task in 56s


Eval feedback analysis

  • Read current benchmark script and configs
  • Verify eval flow and respond to feedback

Thanks for confirming, @kedarpotdar-nv! Looking at the current state of the branch, I can see the script has already been updated to address the concerns from the initial review:

  • run_benchmark_serving — now used (commit 4de829f)
  • --trust-remote-code — added to server launch (line 38)
  • export MODEL_NAME=glm-5-fp8 — set before run_eval (commit 8d7009e, line 62), so eval correctly targets the served model name
  • PR link — updated to pull/914 (commit 767a6e2)

The eval flow looks correct: since MODEL_NAME is explicitly exported as glm-5-fp8 before calling run_eval, run_lm_eval in benchmark_lib.sh will use that instead of falling back to $MODEL (zai-org/GLM-5-FP8), which matches the --served-model-name. This is consistent with your report that evals pass.

One minor nit: perf-changelog.yaml line 971 still references "openai-chat backend" in the description, but the script now uses --backend vllm via run_benchmark_serving. Not a blocker, but worth updating for accuracy.

@hshrivastava-droid
Collaborator Author

hshrivastava-droid commented Mar 17, 2026

Resolved above comments, configs are taken from https://github.com/zai-org/GLM-5
@Klaud-Cold please feel free to review

Collaborator

@kedarpotdar-nv kedarpotdar-nv left a comment


lgtm thanks @hshrivastava-droid

@kedarpotdar-nv kedarpotdar-nv changed the title from [WIP] Add GLM-5 h200 sglang to [NV] Add GLM-5 h200 sglang Mar 18, 2026
@ankursingh-nv ankursingh-nv merged commit ebb48d4 into main Mar 19, 2026
18 of 20 checks passed
@ankursingh-nv ankursingh-nv deleted the nv/h200-GLM5 branch March 19, 2026 06:26