Separate eval-only workflow and change to 8k1k #911
Oseltamivir wants to merge 37 commits into main from
Conversation
- Switch eval selection from 1k8k to 8k1k in `mark_eval_entries()`
- After the throughput benchmark, kill the server and restart it with the model's native max context length (`max_position_embeddings`) for eval
- Replace hardcoded `gen_max_tokens=16384` and `max_tokens=8192` with the native max model length
- Add `_start_eval_server()` supporting sglang, vllm, trtllm, and atom
- Add `EVAL_SERVER_EXTRA_ARGS` to benchmark scripts for correctness flags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
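The 8k1k selection flip in the commit above can be sketched in a few lines. This is a hedged illustration only: `mark_eval_entries`, the `isl`/`osl` field names, and the exact token values are assumptions for the sketch, not the repo's actual code.

```python
def mark_eval_entries(entries):
    """Hypothetical sketch: flag the 8k-input / 1k-output sweep configs for eval."""
    for e in entries:
        # previously the 1k8k (isl=1024, osl=8192) configs were selected; flip to 8k1k
        e["run-eval"] = e.get("isl") == 8192 and e.get("osl") == 1024
    return entries

marked = mark_eval_entries([{"isl": 1024, "osl": 8192}, {"isl": 8192, "osl": 1024}])
```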
max_tokens in gen_kwargs must be less than max_length to leave room for the input prompt. Without this cap, the server rejects requests where input_tokens + max_tokens > max_model_len. Uses max_length - 4096 as the generation cap (min 8192). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use 70% of max_length for max_tokens generation cap, leaving 30% for the prompt. Echo the budget for visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
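The 70/30 budget split described above is simple to express. A minimal sketch, assuming only what the commit states (the function name is ours, not the repo's):

```python
def gen_token_budget(max_length):
    """Reserve 70% of the model context for generation, leaving 30% for the prompt."""
    budget = int(max_length * 0.7)
    # echo the budget for visibility, as the commit describes
    print(f"max_tokens budget: {budget} (max_length={max_length})")
    return budget
```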
- Add --no-skip-eval-only flag to wait_for_server_ready so _start_eval_server properly waits for the eval server - Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump eval concurrency to 64 via `EVAL_CONCURRENT_REQUESTS`, independent of benchmark `CONC`
- Cap eval `max_gen_tokens` to 8192 to avoid KV cache issues
- Read `num_fewshot` from the eval YAML instead of a CLI override
- Add 'trt' as an alias for trtllm in the eval server case
- Reduce seq len configs for OOM prevention
- Change the eval sweep to top-of-curve and middle configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t error The ATOM container has a broken torchvision that causes circular import errors when lm_eval loads. Since we use local-chat-completions (API-based), torchvision is not needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…val call Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files and runs gsm8k + gpqa_diamond consecutively in one invocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…jection max_gen_tokens was hardcoded to 16384 but servers with smaller EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 SGLang scripts now pass --context-length with compute_eval_context_length when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly. Reverts the max_gen_tokens cap since the server should have sufficient context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… file Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t to 1800s 5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents single-request timeouts on slow models like Kimi K2.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   benchmarks/single_node/dsr1_fp4_b200.sh
#   benchmarks/single_node/dsr1_fp4_mi355x.sh
#   benchmarks/single_node/dsr1_fp8_b200.sh
#   benchmarks/single_node/dsr1_fp8_b200_mtp.sh
#   benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh
#   benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh
#   benchmarks/single_node/dsr1_fp8_mi300x.sh
#   benchmarks/single_node/dsr1_fp8_mi355x.sh
#   benchmarks/single_node/glm5_fp8_mi355x.sh
#   benchmarks/single_node/gptoss_fp4_mi300x.sh
#   benchmarks/single_node/gptoss_fp4_mi325x.sh
#   benchmarks/single_node/gptoss_fp4_mi355x.sh
#   benchmarks/single_node/kimik2.5_fp4_b200.sh
#   benchmarks/single_node/kimik2.5_int4_b200.sh
#   benchmarks/single_node/kimik2.5_int4_mi325x.sh
#   benchmarks/single_node/kimik2.5_int4_mi355x.sh
#   benchmarks/single_node/minimaxm2.5_fp8_b200.sh
#   benchmarks/single_node/minimaxm2.5_fp8_h200.sh
#   benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
#   benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
#   benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
#   benchmarks/single_node/qwen3.5_bf16_mi300x.sh
#   benchmarks/single_node/qwen3.5_bf16_mi325x.sh
#   benchmarks/single_node/qwen3.5_bf16_mi355x.sh
#   benchmarks/single_node/qwen3.5_fp8_mi300x.sh
#   benchmarks/single_node/qwen3.5_fp8_mi325x.sh
#   benchmarks/single_node/qwen3.5_fp8_mi355x.sh
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
thanks for the PR @Oseltamivir! great that we now run evals in parallel instead of sequentially
mostly LGTM
just had a couple of questions
- does this PR change the evals to use the industry standard num of shots for gsm8k?
- previously when evals hard crashed, the script still exits with success exit code, is this fixed now?
lgtm, can u get @cquil11 to review too?
@claude review plz
Claude finished @functionstackx's task in 3m 42s

PR Review: Separate eval-only workflow and change to 8k1k
Overall: Looks good. This is a well-structured PR that cleanly separates eval jobs from benchmark jobs, fixes the GSM8k prompt, and adds eval support across all 49 benchmark scripts. The architecture is sound. I verified that the issues from the previous automated review have all been addressed in the current revision:
Remaining issues:

1. Median concurrency selection for 2-element lists — `median_conc = conc_values[len(conc_values) // 2]`. For groups with exactly 2 distinct concurrency values
2. The function both
3. Eval failure in combined mode (non-EVAL_ONLY) is silent — When
4.
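Issue 1 can be reproduced in isolation. A minimal sketch of the indexing under discussion (whether the repo sorts the values first is an assumption made here):

```python
def median_conc(conc_values):
    # the selection under discussion: index len // 2 into the (sorted) values
    vals = sorted(conc_values)
    return vals[len(vals) // 2]

# For a 2-element group, len // 2 == 1, so the larger value is chosen --
# the "median" collapses to the upper element rather than a middle value.
```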
Looks good
All 6 issues from my previous review have been addressed — nice work. Deferring to human review given the scope (64 files, new eval-only workflow mode, CI/CD workflow changes).
Extended reasoning...
Overview
This PR separates eval jobs from benchmark jobs into a dedicated workflow stage, changes eval sequence length from 1k8k to 8k1k, fixes the GSM8k prompt template, adds eval support to 49 single-node benchmark scripts, and updates matrix generation to emit a separate evals matrix. It touches 64 files across GitHub Actions workflows, shell benchmark scripts, Python matrix generation code, and documentation.
Bug Resolution
All 6 issues I previously identified have been addressed in the current code:
- The critical EVAL_ONLY Docker forwarding bug is fixed (all 4 runner scripts now pass `-e EVAL_ONLY`)
- The 3 scripts with stale `--concurrent-requests $CONC` have been cleaned up
- The subshell export issue is fixed with an explicit `export EVAL_MAX_MODEL_LEN` after each subshell call
- Documentation in EVALS.md and AGENTS.md now accurately reflects the new selection policy
- The median concurrency collapse for 2-value groups is explicitly documented as intended behavior
Security Risks
No direct security risks. The changes are internal CI/CD pipeline modifications. The subprocess.run calls in process_changelog.py use list-based argument passing (no shell injection). No auth, crypto, or permissions code is touched.
Level of Scrutiny
This PR warrants human review due to: (1) the sheer scope of 64 files, (2) structural changes to GitHub Actions workflows that define the CI/CD pipeline, (3) a new eval-only execution mode that changes how benchmark infrastructure operates, and (4) non-trivial Python logic changes in mark_eval_entries() and process_changelog.py. While the 49 benchmark script changes are mechanical and follow a clear pattern, the workflow and library changes require careful human verification.
Other Factors
- A maintainer has already LGTM'd the PR
- A full test run has been linked showing the separate eval jobs working
- The bulk of file changes (49 benchmark scripts) follow two clear patterns (Pattern A for SGLang, Pattern B for vLLM/TRT)
- The code is well-structured with proper validation via Pydantic models
```sh
    *) echo "Unknown framework '${framework}'"; eval_rc=1 ;;
esac

if [ "$eval_rc" -ne 0 ]; then
  echo "ERROR: run_eval failed with exit code $eval_rc" >&2
  if [ "${EVAL_ONLY}" = "true" ]; then
    echo "Eval-only mode: exiting with failure" >&2
    exit "$eval_rc"
  fi
```
🟡 exit "$eval_rc" on line 787 terminates the entire script when eval-only mode fails, skipping append_lm_eval_summary() which every benchmark script calls on the next line. This means partial eval artifacts (results*.json, sample*.jsonl, meta_env.json) remain stranded in /tmp/eval_out-XXXXXX/ and are never moved to the workspace where the upload step looks for them. Fix: replace exit with return so the calling script can still collect artifacts before failing.
Extended reasoning...
What the bug is
In run_eval() (benchmark_lib.sh lines 783-788), when eval fails and EVAL_ONLY=true, the function calls exit "$eval_rc" which terminates the entire bash process immediately. This is new code added by this PR. In contrast, when EVAL_ONLY is not true, the function uses return $eval_rc, allowing the caller to continue.
The calling pattern across all benchmark scripts
Every single benchmark script (all ~49 of them) follows this pattern:
```sh
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary   # <-- never reached on eval-only failure
fi
stop_gpu_monitor           # <-- also never reached
```

`append_lm_eval_summary()` is responsible for: (1) writing `meta_env.json` with job metadata, (2) moving eval artifacts (`results*.json`, `sample*.jsonl`) from the temp output directory (`/tmp/eval_out-XXXXXX/`) to the current working directory (workspace). Without it running, any partial eval results remain stranded in the temp directory.
Step-by-step proof
1. An eval-only job runs with `EVAL_ONLY=true`, `RUN_EVAL=true`
2. The benchmark script calls `run_eval --framework lm-eval --port "$PORT"`
3. `run_lm_eval()` runs lm-eval, which produces partial results in `/tmp/eval_out-XXXXXX/` before failing (e.g., timeout on some tasks, model errors on certain prompts)
4. `run_lm_eval` returns non-zero, captured as `eval_rc`
5. Line 785: `EVAL_ONLY` is `"true"`, so line 787 executes `exit "$eval_rc"`
6. The entire script terminates immediately
7. `append_lm_eval_summary` on the next line in the benchmark script never executes — artifacts stay in `/tmp/eval_out-XXXXXX/`
8. `stop_gpu_monitor` also never executes — the background GPU monitoring process is orphaned
9. The workflow "Upload eval results" step (line 212) now correctly has an `always()` guard so it does run, but it looks for `results*.json` in the workspace (PWD), not in `/tmp/eval_out-XXXXXX/`
10. Result: no eval artifacts are uploaded, even though partial results may exist
Impact
When an eval-only run fails, partial eval artifacts that could help debug the failure are lost. The meta_env.json (job metadata) is never created, and any partial results*.json or sample*.jsonl files remain in the temp directory until the container/job exits. Since eval-only mode is the primary validation path for evals in this PR, losing debugging artifacts on failure is significant for operational debugging.
Fix
Replace exit "$eval_rc" with return "$eval_rc" on line 787. The calling scripts will then continue to append_lm_eval_summary (which collects whatever artifacts exist) and stop_gpu_monitor before the script exits naturally. The workflow-level result file check (if ! ls results*.json) at benchmark-tmpl.yml line 159 will still detect the failure and exit non-zero, so the job still fails as intended — but now with artifacts preserved.
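The `return` vs. `exit` difference is easy to demonstrate in isolation. The snippet below drives two tiny bash programs through `subprocess`; it is a behavioral sketch only, not the repo's benchmark_lib.sh:

```python
import subprocess

def run(snippet):
    """Run a bash snippet and return (stdout, exit code)."""
    p = subprocess.run(["bash", "-c", snippet], capture_output=True, text=True)
    return p.stdout.strip(), p.returncode

# `exit` inside the function kills the whole script: the cleanup echo never runs.
out_exit, rc_exit = run("f() { exit 3; }; f; echo collect-artifacts")

# `return` hands control back to the caller, which can collect artifacts first
# and still propagate the failing code.
out_ret, rc_ret = run("f() { return 3; }; f; rc=$?; echo collect-artifacts; exit $rc")
```

In both cases the process exits 3, but only the `return` variant reaches the cleanup line — which is exactly why the fix preserves artifacts while still failing the job.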
```sh
  --temperature) temperature="$2"; shift 2 ;;
  --top-p) top_p="$2"; shift 2 ;;
  --concurrent-requests) concurrent_requests="$2"; shift 2 ;;
  *) echo "Unknown parameter: $1"; return 1 ;;
```
🔴 glm5_fp8_h200.sh (added by PR #914, already merged) still passes --concurrent-requests $CONC to run_eval on line 63. This PR removed the --concurrent-requests handler from run_lm_eval, so the catch-all *) at benchmark_lib.sh:646 will print "Unknown parameter: --concurrent-requests" and return 1, failing ALL eval runs for GLM-5 on H200. The script also lacks the EVAL_ONLY block that was added to all other benchmark scripts. Fix: update glm5_fp8_h200.sh to remove --concurrent-requests $CONC from the run_eval call and add the EVAL_ONLY context expansion block.
Extended reasoning...
What the bug is
PR #914 (commit ebb48d4, already merged into main) added benchmarks/single_node/glm5_fp8_h200.sh for GLM-5 FP8 benchmarks on H200 GPUs. On line 63, this script calls:
```sh
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
```

This PR (#911) removes the `--concurrent-requests` handler from `run_lm_eval()` in benchmark_lib.sh. The previous code had `--concurrent-requests) shift 2; continue ;;` to silently consume the argument, but that handler no longer exists. The catch-all `*)` case at line 646 now does `echo "Unknown parameter: $1"; return 1 ;;`.
The specific code path that triggers it
1. `glm5_fp8_h200.sh` line 63 calls `run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC`
2. `run_eval()` (benchmark_lib.sh:761) parses `--framework` and puts `--port`, `"$PORT"`, `--concurrent-requests`, and `$CONC` into the `forwarded[]` array
3. `run_eval()` calls `run_lm_eval "${forwarded[@]}"` at line 779
4. `run_lm_eval()` processes `--port` successfully, then encounters `--concurrent-requests`
5. The catch-all `*)` at line 646 fires: it prints "Unknown parameter: --concurrent-requests" and returns 1
6. `run_eval` captures the failure (`eval_rc=$?` = 1)
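The failure mode reduces to a strict option loop rejecting an unhandled flag. A simplified Python model of the bash `case` parsing (the flag set here is illustrative, not the full list):

```python
def run_lm_eval(args):
    """Simplified model of the option loop: known flags consume one value,
    anything else hits the catch-all and fails (hypothetical flag set)."""
    known = {"--port", "--temperature", "--top-p"}  # no --concurrent-requests handler
    i = 0
    while i < len(args):
        if args[i] in known:
            i += 2  # skip the flag and its value
        else:
            print(f"Unknown parameter: {args[i]}")
            return 1
    return 0
```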
Why existing code doesn’t prevent it
This script was added by a separate PR (#914) that merged after the initial review of PR #911. The initial review (comment #3) correctly identified 3 other scripts (kimik2.5_fp4_b200.sh, kimik2.5_int4_mi325x.sh, minimaxm2.5_fp8_b200.sh) that still had --concurrent-requests, and those were subsequently fixed. However, glm5_fp8_h200.sh arrived via a different PR and was missed.
I verified via the diff that the sibling script glm5_fp8_mi355x.sh WAS updated by this PR to remove --concurrent-requests and add the EVAL_ONLY block, confirming that glm5_fp8_h200.sh was simply overlooked.
Step-by-step proof
1. Confirmed `glm5_fp8_h200.sh` line 63: `run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC`
2. Confirmed `run_eval()` at benchmark_lib.sh:768 forwards unrecognized args: `*) forwarded+=("$1"); shift ;;`
3. Confirmed `run_lm_eval()` at benchmark_lib.sh:646 has no `--concurrent-requests` handler: `*) echo "Unknown parameter: $1"; return 1 ;;`
4. Result: eval fails immediately with "Unknown parameter: --concurrent-requests", return code 1
Impact
This breaks ALL eval runs for the GLM-5 FP8 H200 configuration. Additionally, the script lacks the EVAL_ONLY block (for computing expanded eval context length and starting the server with it) that was added to all ~48 other benchmark scripts in this PR, meaning eval-only mode also cannot work correctly for this config.
How to fix
Update benchmarks/single_node/glm5_fp8_h200.sh to:
1. Remove `--concurrent-requests $CONC` from the `run_eval` call on line 63
2. Add the `EVAL_ONLY` block (with `EVAL_CONTEXT_ARGS` and `export EVAL_MAX_MODEL_LEN`) matching the pattern in the sibling `glm5_fp8_mi355x.sh`
3. Append `$EVAL_CONTEXT_ARGS` to the sglang launch command
```sh
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))")
MULTI=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' in x]))")
EVALS=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and x.get('run-eval', False)]))")
echo "single-node-config=$SINGLE" >> $GITHUB_OUTPUT
echo "multi-node-config=$MULTI" >> $GITHUB_OUTPUT
echo "eval-config=$EVALS" >> $GITHUB_OUTPUT
```
🟡 When --evals-only is passed to e2e-tests.yml, every config runs twice: once as a full benchmark via test-sweep-single-node (which hardcodes run-eval: false) and once as the intended eval-only run via test-sweep-evals. The SINGLE filter on line 57 has no exclusion for eval-only entries, so it captures the same configs as the EVALS filter, wasting CI GPU time on redundant full benchmark runs. Fix by excluding entries with run-eval: True from the SINGLE filter (e.g., [x for x in d if "prefill" not in x and not x.get("run-eval", False)]).
Extended reasoning...
Bug Analysis
The e2e-tests.yml workflow computes three matrix filters from the output of generate_sweep_configs.py:
- SINGLE (line 57): `[x for x in d if "prefill" not in x]` — all non-prefill entries
- MULTI (line 58): `[x for x in d if "prefill" in x]` — only prefill/decode disaggregated entries
- EVALS (line 59): `[x for x in d if "prefill" not in x and x.get("run-eval", False)]` — non-prefill entries with `run-eval: True`
These feed into three downstream jobs: test-sweep-single-node (uses SINGLE), test-sweep-multi-node (uses MULTI), and test-sweep-evals (uses EVALS).
The Problem
When --evals-only is passed, generate_sweep_configs.py (lines 935-937) filters its output to only include entries where run-eval is True:
```python
if args.evals_only:
    matrix_values = [e for e in matrix_values if e.get(Fields.RUN_EVAL.value, False)]
```

This means every entry in the output has `run-eval: True`. Since the SINGLE filter only checks for `"prefill" not in x` without excluding run-eval entries, SINGLE and EVALS end up containing the exact same set of configs.
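The overlap is easy to see with the two comprehensions side by side; the config names below are hypothetical placeholders:

```python
# Output of generate_sweep_configs.py under --evals-only: every entry is
# eval-marked, none are prefill/decode disaggregated (names hypothetical).
d = [
    {"name": "dsr1_fp8_h200", "run-eval": True},
    {"name": "gptoss_fp4_mi300x", "run-eval": True},
]
single = [x for x in d if "prefill" not in x]
evals = [x for x in d if "prefill" not in x and x.get("run-eval", False)]
# single == evals: the same configs land in both matrices
```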
Step-by-Step Proof
1. User triggers `e2e-tests.yml` with a command including `--evals-only`
2. `generate_sweep_configs.py` outputs, say, 3 configs — all with `run-eval: True`, none with `prefill`
3. SINGLE filter: all 3 pass (the no-prefill check is satisfied; no run-eval exclusion exists) → `[config1, config2, config3]`
4. EVALS filter: all 3 pass (no prefill, all have run-eval: True) → `[config1, config2, config3]`
5. `test-sweep-single-node` runs all 3 as full benchmarks (hardcoded `run-eval: false` on line 129, no `eval-only` input)
6. `test-sweep-evals` runs all 3 as eval-only (passes `run-eval: true` and `eval-only: true` on lines 159-160)
7. Each config runs twice — the full benchmark run is completely wasted
Impact
This wastes CI GPU resources by running unnecessary full benchmarks when only evals were requested. The eval results themselves are still correct since test-sweep-evals runs properly, but the redundant test-sweep-single-node jobs consume GPU time for no useful output.
Fix
The simplest fix is to exclude run-eval entries from the SINGLE filter so they only appear in EVALS:
```sh
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
```

Note: this would also affect non-evals-only runs by removing eval-marked entries from the benchmark job. If that is undesirable, an alternative is to have generate_sweep_configs.py output a top-level flag indicating evals-only mode, and conditionally skip the SINGLE matrix entirely.
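With the added `not x.get("run-eval", False)` exclusion, the two matrices become disjoint. A quick check (entry names hypothetical):

```python
# Mixed sweep output: a benchmark-only config, an eval-marked config, and a
# prefill/decode disaggregated config (names hypothetical).
d = [
    {"name": "bench-only"},
    {"name": "bench-plus-eval", "run-eval": True},
    {"name": "disagg", "prefill": {}},
]
single = [x for x in d if "prefill" not in x and not x.get("run-eval", False)]
evals = [x for x in d if "prefill" not in x and x.get("run-eval", False)]
# single and evals now partition the non-prefill entries with no overlap
```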
Summary
- Adds a separate eval workflow stage, so evals can run independently with `evals-only: true` in perf-changelog
- Adds an `eval-only` input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
- Changes the GSM8k answer prompt from `#### <answer>` to `#### [number]` to prevent models (e.g. Kimi-K2.5) from outputting the literal `<answer>` tag, which caused strict-match to return `[invalid]`
- Updates `process_changelog.py` and `generate_sweep_configs.py` to emit a separate `evals` matrix for eval-only jobs

Separate jobs on a normal run tested with:
Working full run: