Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```shell
python3 utils/bench_serving/benchmark_serving.py \
    --backend openai-chat \
    --base-url "$SGLANG_URL" \
    --endpoint /v1/chat/completions \
    --model glm-5-fp8 \
    $TOKENIZER_ARGS \
    --dataset-name random \
    --num-prompts "$num_prompts" \
    --random-input-len "$ISL" \
    --random-output-len "$OSL" \
    --random-range-ratio "${RANDOM_RANGE_RATIO:-0.8}" \
    --ignore-eos \
    --percentile-metrics ttft,tpot,itl,e2el \
    --max-concurrency "$CONC" \
    --save-result \
    --result-dir /workspace \
    --result-filename "$RESULT_FILENAME.json"
```
```shell
# After throughput, run evaluation only if RUN_EVAL is true
```
🔴 This script calls python3 utils/bench_serving/benchmark_serving.py directly instead of using the run_benchmark_serving helper that every other benchmark script (50+) uses, including the sibling glm5_fp8_mi355x.sh. This causes multiple compounding issues: --backend openai-chat instead of vllm, --model glm-5-fp8 (not a valid HF ID) will crash tokenizer loading for the random dataset, num_prompts uses CONC*5 instead of the standard CONC*10, and --num-warmups is missing so cold-start latency will skew results. Switching to run_benchmark_serving (as the MI355X script does) would fix all of these at once.
Extended reasoning...
What the bug is
The new glm5_fp8_h200.sh benchmark script (lines 57-73) directly invokes python3 utils/bench_serving/benchmark_serving.py instead of using the run_benchmark_serving helper function from benchmark_lib.sh. This is the only script out of 50+ benchmark scripts in the repo that does this. The companion script glm5_fp8_mi355x.sh for the same model on MI355X hardware correctly uses the helper.
Why this matters — multiple compounding issues
Bypassing the helper causes several problems at once:
- **Wrong backend (line 58):** Uses `--backend openai-chat` with `--endpoint /v1/chat/completions`. Every other SGLang benchmark uses `--backend vllm` (which hits `/v1/completions`). This measures different API behavior (chat template processing overhead), making H200 results non-comparable with MI355X results for the same model.
- **Tokenizer crash (line 60):** Passes `--model glm-5-fp8` (the served model name) with `--dataset-name random`. The random dataset generator needs a tokenizer and falls back to `args.model` when `--tokenizer` is not provided (`benchmark_serving.py` line 642: `tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model`). Since `glm-5-fp8` is not a valid HuggingFace model ID (the real one is `zai-org/GLM-5-FP8`), `get_tokenizer("glm-5-fp8")` will fail. The script has a `BENCH_TOKENIZER` env var workaround (lines 47-50), but this variable is never set in the YAML config or anywhere in the launch chain. The MI355X script avoids this by passing `--model "$MODEL"` (the real HF ID) via `run_benchmark_serving`.
- **Half the standard prompts (line 55):** `num_prompts=$((CONC * 5))` uses a multiplier of 5 instead of the standard 10 used by every other script (including `glm5_fp8_mi355x.sh` line 61: `--num-prompts "$((CONC * 10))"`). This produces less statistically reliable results.
- **No warmup (missing flag):** The helper provides `--num-warmups "$((2 * max_concurrency))"` (`benchmark_lib.sh` line 332), but the direct call omits this entirely. The default in `benchmark_serving.py` is 0 (line 1069), so cold-start latency (CUDA graph compilation, kernel warmup) will be included in measurements.
- **Missing `--request-rate inf`:** The helper sets this (line 328), but the direct call omits it, potentially throttling the benchmark.
- **No server health monitoring:** The helper monitors the server process during benchmarking (lines 350-373) and reports if it dies mid-run. The direct call has no such protection.
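The prompt-count gap in the third item is easy to sanity-check with shell arithmetic (the `CONC` value below is illustrative, not taken from the config):

```shell
# Illustrative concurrency; real runs take CONC from the YAML config.
CONC=64
echo "direct call:    $((CONC * 5)) prompts"   # 320 — what the script sends
echo "helper default: $((CONC * 10)) prompts"  # 640 — the repo-standard count
```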
Step-by-step proof of the tokenizer crash
1. YAML config sets `model: zai-org/GLM-5-FP8` → passed as `$MODEL` env var
2. Script starts server with `--served-model-name glm-5-fp8` (line 40)
3. Script calls `benchmark_serving.py --model glm-5-fp8` (line 60) — uses the served name, not `$MODEL`
4. `BENCH_TOKENIZER` is not set anywhere in config, so `TOKENIZER_ARGS` is empty (line 48)
5. `benchmark_serving.py` resolves the tokenizer: `tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model` → `"glm-5-fp8"`
6. `get_tokenizer("glm-5-fp8")` calls `AutoTokenizer.from_pretrained("glm-5-fp8")` → fails because `glm-5-fp8` is not a valid HF model ID
7. The benchmark crashes before any requests are sent
How to fix
Replace lines 45-73 with the same pattern used in glm5_fp8_mi355x.sh:
```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```

This fixes all six issues simultaneously by leveraging the standardized helper function.
wow, claude review is pretty good @kedarpotdar-nv
```shell
    --host 0.0.0.0 \
    --port "$PORT" \
    --tp-size "$TP" \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --served-model-name glm-5-fp8 \
    > "$SERVER_LOG" 2>&1 &
```
🟡 Two issues with the SGLang server launch: (1) --trust-remote-code is missing — the MI355X GLM-5 script includes it, and virtually every other non-trivial model script in this repo does too. (2) --served-model-name glm-5-fp8 will cause run_eval to fail when RUN_EVAL=true, because run_lm_eval resolves MODEL_NAME to $MODEL (i.e. zai-org/GLM-5-FP8), but the server only accepts glm-5-fp8. Either remove --served-model-name or add export MODEL_NAME=glm-5-fp8 before calling run_eval.
Extended reasoning...
Missing --trust-remote-code
The new glm5_fp8_h200.sh launches the SGLang server (lines 29-38) without --trust-remote-code. The sibling script glm5_fp8_mi355x.sh for the same GLM-5 model explicitly includes --trust-remote-code at line 41. A grep across benchmarks/single_node/ shows that virtually every model script (dsr1, qwen3.5, kimik2.5, minimaxm2.5, etc.) includes this flag. GLM-5 (zai-org/GLM-5-FP8) is a custom model from Zhipu AI that likely ships custom modeling code requiring trust-remote-code to load properly.
While the custom Docker image lmsysorg/sglang:glm5-hopper may have GLM-5 support baked in (potentially making the flag unnecessary at runtime), the MI355X script uses a similarly model-specific image (rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219) and still includes the flag. Adding --trust-remote-code is harmless and ensures the script works even if the image changes or the model code is loaded dynamically.
--served-model-name breaks eval
The server is launched with --served-model-name glm-5-fp8 (line 37), which means SGLang will only accept API requests where model=glm-5-fp8. When RUN_EVAL=true (lines 74-76), the script calls run_eval, which invokes run_lm_eval in benchmark_lib.sh. At line 607, MODEL_NAME is resolved as:
```shell
MODEL_NAME=${MODEL_NAME:-$MODEL}
```

Since `MODEL_NAME` is never exported in `glm5_fp8_h200.sh`, it falls back to `$MODEL`, which is `zai-org/GLM-5-FP8` (set from the YAML config). Then at line 617, `lm_eval` sends requests with `model=zai-org/GLM-5-FP8` in the `--model_args`. The server rejects these because it only recognizes the served model name `glm-5-fp8`.
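The `${MODEL_NAME:-$MODEL}` line uses the standard POSIX default-value expansion; a minimal demo of why the fallback fires only when the variable is unset:

```shell
# Demonstrates the ${VAR:-default} fallback used by run_lm_eval.
MODEL="zai-org/GLM-5-FP8"

unset MODEL_NAME
echo "${MODEL_NAME:-$MODEL}"   # prints zai-org/GLM-5-FP8 (fallback fires)

MODEL_NAME="glm-5-fp8"
echo "${MODEL_NAME:-$MODEL}"   # prints glm-5-fp8 (explicit value wins)
```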
Step-by-step proof:
1. YAML config sets `model: zai-org/GLM-5-FP8`, which becomes the `$MODEL` env var
2. Server starts with `--served-model-name glm-5-fp8` → only accepts `model=glm-5-fp8` in API calls
3. `RUN_EVAL=true` triggers `run_eval` at line 74
4. `run_lm_eval` at benchmark_lib.sh:607 does `MODEL_NAME=${MODEL_NAME:-$MODEL}` → resolves to `zai-org/GLM-5-FP8`
5. Line 617: `--model_args "model=${MODEL_NAME},..."` sends `model=zai-org/GLM-5-FP8` to the server
6. Server rejects the request — model name mismatch
The only other script using --served-model-name is qwen3.5_bf16_b200.sh, which sets it to Qwen/Qwen3.5-397B-A17B — matching the actual HF model ID, so there is no mismatch there. The MI355X GLM-5 script avoids this entirely by not using --served-model-name.
Fix: Add --trust-remote-code to the server launch args. For the eval issue, either remove --served-model-name glm-5-fp8 from the server launch, or add export MODEL_NAME=glm-5-fp8 before the run_eval call at line 74.
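The second fix option can be sketched as follows; `run_eval` comes from the repo's benchmark library and is stubbed here purely for illustration:

```shell
# Stub standing in for the real run_eval from benchmark_lib.sh.
run_eval() {
    echo "eval against model=${MODEL_NAME}"
}

# Export the served model name so run_lm_eval's $MODEL fallback never fires.
export MODEL_NAME=glm-5-fp8

if [ "${RUN_EVAL:-false}" = "true" ]; then
    run_eval
fi
```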
@claude evals were successful with current flow
Claude finished @kedarpotdar-nv's task in 56s
Eval feedback analysis
- Read current benchmark script and configs
- Verify eval flow and respond to feedback
Thanks for confirming, @kedarpotdar-nv! Looking at the current state of the branch, I can see the script has already been updated to address the concerns from the initial review:
- `run_benchmark_serving` — now used (commit `4de829f`)
- `--trust-remote-code` — added to server launch (line 38)
- `export MODEL_NAME=glm-5-fp8` — set before `run_eval` (commit `8d7009e`, line 62), so eval correctly targets the served model name
- PR link — updated to `pull/914` (commit `767a6e2`)
The eval flow looks correct: since MODEL_NAME is explicitly exported as glm-5-fp8 before calling run_eval, run_lm_eval in benchmark_lib.sh will use that instead of falling back to $MODEL (zai-org/GLM-5-FP8), which matches the --served-model-name. This is consistent with your report that evals pass.
One minor nit: perf-changelog.yaml line 971 still references "openai-chat backend" in the description, but the script now uses --backend vllm via run_benchmark_serving. Not a blocker, but worth updating for accuracy.
Resolved the above comments; configs are taken from https://github.com/zai-org/GLM-5
kedarpotdar-nv
left a comment
lgtm thanks @hshrivastava-droid
No description provided.