
[NV] Add GLM-5 h200 sglang#914

Merged
ankursingh-nv merged 6 commits into main from nv/h200-GLM5
Mar 19, 2026

Conversation

@hshrivastava-droid
Collaborator

No description provided.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

Comment on lines +55 to +73
python3 utils/bench_serving/benchmark_serving.py \
--backend openai-chat \
--base-url "$SGLANG_URL" \
--endpoint /v1/chat/completions \
--model glm-5-fp8 \
$TOKENIZER_ARGS \
--dataset-name random \
--num-prompts "$num_prompts" \
--random-input-len "$ISL" \
--random-output-len "$OSL" \
--random-range-ratio "${RANDOM_RANGE_RATIO:-0.8}" \
--ignore-eos \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency "$CONC" \
--save-result \
--result-dir /workspace \
--result-filename "$RESULT_FILENAME.json"

# After throughput, run evaluation only if RUN_EVAL is true
Contributor


🔴 This script calls python3 utils/bench_serving/benchmark_serving.py directly instead of using the run_benchmark_serving helper that every other benchmark script (50+) uses, including the sibling glm5_fp8_mi355x.sh. This causes multiple compounding issues: --backend openai-chat instead of vllm, --model glm-5-fp8 (not a valid HF ID) will crash tokenizer loading for the random dataset, num_prompts uses CONC*5 instead of the standard CONC*10, and --num-warmups is missing so cold-start latency will skew results. Switching to run_benchmark_serving (as the MI355X script does) would fix all of these at once.

Extended reasoning...

What the bug is

The new glm5_fp8_h200.sh benchmark script (lines 57-73) directly invokes python3 utils/bench_serving/benchmark_serving.py instead of using the run_benchmark_serving helper function from benchmark_lib.sh. This is the only script out of 50+ benchmark scripts in the repo that does this. The companion script glm5_fp8_mi355x.sh for the same model on MI355X hardware correctly uses the helper.

Why this matters — multiple compounding issues

Bypassing the helper causes several problems at once:

  1. Wrong backend (line 58): Uses --backend openai-chat with --endpoint /v1/chat/completions. Every other SGLang benchmark uses --backend vllm (which hits /v1/completions). This measures different API behavior (chat template processing overhead), making H200 results non-comparable with MI355X results for the same model.

  2. Tokenizer crash (line 60): Passes --model glm-5-fp8 (the served model name) with --dataset-name random. The random dataset generator needs a tokenizer, and falls back to args.model when --tokenizer is not provided (benchmark_serving.py line 642: tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model). Since glm-5-fp8 is not a valid HuggingFace model ID (the real one is zai-org/GLM-5-FP8), get_tokenizer("glm-5-fp8") will fail. The script has a BENCH_TOKENIZER env var workaround (lines 47-50), but this variable is never set in the YAML config or anywhere in the launch chain. The MI355X script avoids this by passing --model "$MODEL" (the real HF ID) via run_benchmark_serving.

  3. Half the standard prompts (line 55): num_prompts=$((CONC * 5)) uses a multiplier of 5 instead of the standard 10 used by every other script (including glm5_fp8_mi355x.sh line 61: --num-prompts "$((CONC * 10))"). This produces less statistically reliable results.

  4. No warmup (missing flag): The helper provides --num-warmups "$((2 * max_concurrency))" (benchmark_lib.sh line 332), but the direct call omits this entirely. The default in benchmark_serving.py is 0 (line 1069), so cold-start latency (CUDA graph compilation, kernel warmup) will be included in measurements.

  5. Missing --request-rate inf: The helper sets this (line 328) but the direct call omits it, potentially throttling the benchmark.

  6. No server health monitoring: The helper monitors the server process during benchmarking (lines 350-373) and reports if it dies mid-run. The direct call has no such protection.

Step-by-step proof of the tokenizer crash

  1. YAML config sets model: zai-org/GLM-5-FP8 → passed as $MODEL env var
  2. Script starts server with --served-model-name glm-5-fp8 (line 40)
  3. Script calls benchmark_serving.py --model glm-5-fp8 (line 60) — uses served name, not $MODEL
  4. BENCH_TOKENIZER is not set anywhere in config, so TOKENIZER_ARGS is empty (line 48)
  5. benchmark_serving.py resolves the tokenizer: tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model → "glm-5-fp8"
  6. get_tokenizer("glm-5-fp8") calls AutoTokenizer.from_pretrained("glm-5-fp8") → fails because glm-5-fp8 is not a valid HF model ID
  7. The benchmark crashes before any requests are sent
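
The BENCH_TOKENIZER escape hatch the review refers to could be wired up roughly like this (a sketch; `BENCH_TOKENIZER` and `TOKENIZER_ARGS` are the names the review attributes to the script, and the surrounding logic is illustrative, not the actual patch):

```shell
# Sketch of the BENCH_TOKENIZER workaround: when the served model
# name is not a valid HF ID, an explicit tokenizer must be passed,
# otherwise benchmark_serving.py falls back to --model and crashes
# during tokenizer load.
resolve_tokenizer_args() {
    if [ -n "${BENCH_TOKENIZER:-}" ]; then
        # Pin the tokenizer to a real HF ID, e.g. zai-org/GLM-5-FP8
        echo "--tokenizer ${BENCH_TOKENIZER}"
    else
        # Empty: benchmark_serving.py will fall back to --model
        echo ""
    fi
}
```

Usage would be `TOKENIZER_ARGS=$(resolve_tokenizer_args)` before the benchmark invocation; but since nothing in the launch chain sets BENCH_TOKENIZER, the fallback path is what actually runs.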

How to fix

Replace lines 45-73 with the same pattern used in glm5_fp8_mi355x.sh:

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

This fixes all six issues simultaneously by leveraging the standardized helper function.

Contributor


wow, claude review is pretty good @kedarpotdar-nv

Comment on lines +31 to +39
--host 0.0.0.0 \
--port "$PORT" \
--tp-size "$TP" \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.85 \
--served-model-name glm-5-fp8 \
> "$SERVER_LOG" 2>&1 &

Contributor


🟡 Two issues with the SGLang server launch: (1) --trust-remote-code is missing — the MI355X GLM-5 script includes it, and virtually every other non-trivial model script in this repo does too. (2) --served-model-name glm-5-fp8 will cause run_eval to fail when RUN_EVAL=true, because run_lm_eval resolves MODEL_NAME to $MODEL (i.e. zai-org/GLM-5-FP8), but the server only accepts glm-5-fp8. Either remove --served-model-name or add export MODEL_NAME=glm-5-fp8 before calling run_eval.

Extended reasoning...

Missing --trust-remote-code

The new glm5_fp8_h200.sh launches the SGLang server (lines 29-38) without --trust-remote-code. The sibling script glm5_fp8_mi355x.sh for the same GLM-5 model explicitly includes --trust-remote-code at line 41. A grep across benchmarks/single_node/ shows that virtually every model script (dsr1, qwen3.5, kimik2.5, minimaxm2.5, etc.) includes this flag. GLM-5 (zai-org/GLM-5-FP8) is a custom model from Zhipu AI that likely ships custom modeling code requiring trust-remote-code to load properly.

While the custom Docker image lmsysorg/sglang:glm5-hopper may have GLM-5 support baked in (potentially making the flag unnecessary at runtime), the MI355X script uses a similarly model-specific image (rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219) and still includes the flag. Adding --trust-remote-code is harmless and ensures the script works even if the image changes or the model code is loaded dynamically.

--served-model-name breaks eval

The server is launched with --served-model-name glm-5-fp8 (line 37), which means SGLang will only accept API requests where model=glm-5-fp8. When RUN_EVAL=true (lines 74-76), the script calls run_eval, which invokes run_lm_eval in benchmark_lib.sh. At line 607, MODEL_NAME is resolved as:

MODEL_NAME=${MODEL_NAME:-$MODEL}

Since MODEL_NAME is never exported in glm5_fp8_h200.sh, it falls back to $MODEL, which is zai-org/GLM-5-FP8 (set from the YAML config). Then at line 617, lm_eval sends requests with model=zai-org/GLM-5-FP8 in the --model_args. The server rejects these because it only recognizes the served model name glm-5-fp8.
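
The `${MODEL_NAME:-$MODEL}` fallback can be demonstrated standalone with this PR's values (plain bash default-expansion; variable contents taken from the config described above):

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# otherwise to the default. Reproducing the resolution at
# benchmark_lib.sh:607 with this PR's values:
MODEL="zai-org/GLM-5-FP8"

unset MODEL_NAME
echo "${MODEL_NAME:-$MODEL}"   # zai-org/GLM-5-FP8 -- the mismatching HF ID

MODEL_NAME="glm-5-fp8"
echo "${MODEL_NAME:-$MODEL}"   # glm-5-fp8 -- matches --served-model-name
```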

Step-by-step proof:

  1. YAML config sets model: zai-org/GLM-5-FP8, which becomes the $MODEL env var
  2. Server starts with --served-model-name glm-5-fp8 → only accepts model=glm-5-fp8 in API calls
  3. RUN_EVAL=true triggers run_eval at line 74
  4. run_lm_eval at benchmark_lib.sh:607 does MODEL_NAME=${MODEL_NAME:-$MODEL} → resolves to zai-org/GLM-5-FP8
  5. Line 617: --model_args "model=${MODEL_NAME},..." sends model=zai-org/GLM-5-FP8 to the server
  6. Server rejects the request — model name mismatch

The only other script using --served-model-name is qwen3.5_bf16_b200.sh, which sets it to Qwen/Qwen3.5-397B-A17B — matching the actual HF model ID, so there is no mismatch there. The MI355X GLM-5 script avoids this entirely by not using --served-model-name.

Fix: Add --trust-remote-code to the server launch args. For the eval issue, either remove --served-model-name glm-5-fp8 from the server launch, or add export MODEL_NAME=glm-5-fp8 before the run_eval call at line 74.
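
The second fix option could look like this in the script (a sketch under the names used above; run_eval and RUN_EVAL come from the repo's benchmark_lib.sh conventions as described in this review, not from a verified patch):

```shell
# Export the served alias so run_lm_eval's ${MODEL_NAME:-$MODEL}
# fallback picks it up instead of the HF ID in $MODEL.
export MODEL_NAME=glm-5-fp8

# Guarded eval call: only runs when RUN_EVAL=true.
if [ "${RUN_EVAL:-false}" = "true" ]; then
    run_eval
fi
```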

Collaborator


@claude evals were successful with current flow

Collaborator

@Klaud-Cold Klaud-Cold Mar 17, 2026


Claude finished @kedarpotdar-nv's task in 56s


Eval feedback analysis

  • Read current benchmark script and configs
  • Verify eval flow and respond to feedback

Thanks for confirming, @kedarpotdar-nv! Looking at the current state of the branch, I can see the script has already been updated to address the concerns from the initial review:

  • run_benchmark_serving — now used (commit 4de829f)
  • --trust-remote-code — added to server launch (line 38)
  • export MODEL_NAME=glm-5-fp8 — set before run_eval (commit 8d7009e, line 62), so eval correctly targets the served model name
  • PR link — updated to pull/914 (commit 767a6e2)

The eval flow looks correct: since MODEL_NAME is explicitly exported as glm-5-fp8 before calling run_eval, run_lm_eval in benchmark_lib.sh will use that instead of falling back to $MODEL (zai-org/GLM-5-FP8), which matches the --served-model-name. This is consistent with your report that evals pass.

One minor nit: perf-changelog.yaml line 971 still references "openai-chat backend" in the description, but the script now uses --backend vllm via run_benchmark_serving. Not a blocker, but worth updating for accuracy.

@hshrivastava-droid
Collaborator Author

hshrivastava-droid commented Mar 17, 2026

Resolved above comments, configs are taken from https://github.com/zai-org/GLM-5
@Klaud-Cold please feel free to review

Collaborator

@kedarpotdar-nv kedarpotdar-nv left a comment


lgtm thanks @hshrivastava-droid

@kedarpotdar-nv kedarpotdar-nv changed the title from [WIP] Add GLM-5 h200 sglang to [NV] Add GLM-5 h200 sglang Mar 18, 2026
@ankursingh-nv ankursingh-nv merged commit ebb48d4 into main Mar 19, 2026
18 of 20 checks passed
@ankursingh-nv ankursingh-nv deleted the nv/h200-GLM5 branch March 19, 2026 06:26