Separate eval-only workflow and change to 8k1k #911

Open
Oseltamivir wants to merge 37 commits into main from eval-8k1k-server-restart

Conversation

@Oseltamivir
Collaborator

@Oseltamivir Oseltamivir commented Mar 15, 2026

Summary

  • Decouple eval jobs from benchmark jobs into a dedicated eval / workflow stage, so evals can run independently with evals-only: true in perf-changelog
  • Add eval-only input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
  • Fix GSM8k prompt template: change #### <answer> to #### [number] to prevent models (e.g. Kimi-K2.5) from outputting the literal <answer> tag, which caused strict-match to return [invalid]
  • Add eval support to all 49 single-node benchmark scripts (server restart with native max context, eval context override, request timeout increase)
  • Update process_changelog.py and generate_sweep_configs.py to emit a separate evals matrix for eval-only jobs
  • Change eval sequence lengths from 1k8k to 8k1k
  • Include dp-attn in the grouping keys (previously missing, so dp-attn variants were merged together)
  • Instead of picking max concurrency at the highest/lowest TP, pick all TPs at the highest concurrency plus all TPs at the median concurrency
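The selection policy in the last bullet can be sketched as follows (hypothetical helper and field names, not the actual generate_sweep_configs.py code):

```python
# Hypothetical sketch of the eval-entry selection policy described above:
# for each group, take every TP at the highest concurrency plus every TP
# at the median concurrency. Field names are illustrative only.
def pick_eval_entries(entries):
    """entries: list of dicts with 'tp' and 'conc' keys for one group."""
    conc_values = sorted({e["conc"] for e in entries})
    highest = conc_values[-1]
    median = conc_values[len(conc_values) // 2]
    targets = {highest, median}
    return [e for e in entries if e["conc"] in targets]

entries = [
    {"tp": 1, "conc": 4}, {"tp": 1, "conc": 16}, {"tp": 1, "conc": 64},
    {"tp": 2, "conc": 4}, {"tp": 2, "conc": 16}, {"tp": 2, "conc": 64},
]
picked = pick_eval_entries(entries)
# All TPs at conc=64 (highest) and conc=16 (the median of three values)
```

For groups with only two concurrency levels this collapses to a single level, a corner case discussed in the review below.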

Separate jobs tested on a normal run with:

- config-keys:
    - gptoss-fp4-h100-vllm
    - dsr1-fp8-mi325x-sglang

Working full run:

Oseltamivir and others added 22 commits March 8, 2026 08:31
- Switch eval selection from 1k8k to 8k1k in mark_eval_entries()
- After throughput benchmark, kill server and restart with model's
  native max context length (max_position_embeddings) for eval
- Replace hardcoded gen_max_tokens=16384 and max_tokens=8192 with
  the native max model length
- Add _start_eval_server() supporting sglang, vllm, trtllm, and atom
- Add EVAL_SERVER_EXTRA_ARGS to benchmark scripts for correctness flags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
max_tokens in gen_kwargs must be less than max_length to leave room
for the input prompt. Without this cap, the server rejects requests
where input_tokens + max_tokens > max_model_len.

Uses max_length - 4096 as the generation cap (min 8192).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use 70% of max_length for max_tokens generation cap, leaving 30%
for the prompt. Echo the budget for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
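Illustrative Python for the two generation-cap policies described in these commit messages (the actual implementation is shell; function names here are made up):

```python
def gen_cap_v1(max_length: int) -> int:
    # First iteration: leave a fixed 4096-token budget for the prompt,
    # but never cap generation below 8192 tokens.
    return max(max_length - 4096, 8192)

def gen_cap_v2(max_length: int) -> int:
    # Second iteration: 70% of the context window for generation,
    # leaving 30% for the input prompt.
    return int(max_length * 0.7)

# e.g. for a 16384-token context window:
# v1 -> 12288, v2 -> 11468
```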
- Add --no-skip-eval-only flag to wait_for_server_ready so
  _start_eval_server properly waits for the eval server
- Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump eval concurrency to 64 via EVAL_CONCURRENT_REQUESTS, independent of benchmark CONC
- Cap eval max_gen_tokens to 8192 to avoid KV cache issues
- Read num_fewshot from eval YAML instead of CLI override
- Add 'trt' as alias for trtllm in eval server case
- Reduce seq len configs for OOM prevention
- Change eval sweep to top of curve and middle configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t error

The ATOM container has a broken torchvision that causes circular import
errors when lm_eval loads. Since we use local-chat-completions (API-based),
torchvision is not needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…val call

Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files
and runs gsm8k + gpqa_diamond consecutively in one invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…jection

max_gen_tokens was hardcoded to 16384 but servers with smaller
EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 SGLang scripts now pass --context-length with compute_eval_context_length
when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly.
Reverts the max_gen_tokens cap since the server should have sufficient context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… file

Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only
run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t to 1800s

5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents
single-request timeouts on slow models like Kimi K2.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	benchmarks/single_node/dsr1_fp4_b200.sh
#	benchmarks/single_node/dsr1_fp4_mi355x.sh
#	benchmarks/single_node/dsr1_fp8_b200.sh
#	benchmarks/single_node/dsr1_fp8_b200_mtp.sh
#	benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh
#	benchmarks/single_node/dsr1_fp8_mi300x.sh
#	benchmarks/single_node/dsr1_fp8_mi355x.sh
#	benchmarks/single_node/glm5_fp8_mi355x.sh
#	benchmarks/single_node/gptoss_fp4_mi300x.sh
#	benchmarks/single_node/gptoss_fp4_mi325x.sh
#	benchmarks/single_node/gptoss_fp4_mi355x.sh
#	benchmarks/single_node/kimik2.5_fp4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_b200.sh
#	benchmarks/single_node/kimik2.5_int4_mi325x.sh
#	benchmarks/single_node/kimik2.5_int4_mi355x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_b200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_h200.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
#	benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi300x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi325x.sh
#	benchmarks/single_node/qwen3.5_bf16_mi355x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi300x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi325x.sh
#	benchmarks/single_node/qwen3.5_fp8_mi355x.sh
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

Contributor

@functionstackx functionstackx left a comment

thanks for the PR @Oseltamivir! Great that we now run evals in parallel instead of sequentially.

mostly LGTM

just had a couple of questions

  1. Does this PR change the evals to use the industry-standard number of shots for GSM8k?
  2. Previously, when evals hard-crashed, the script still exited with a success exit code; is this fixed now?

@Oseltamivir Oseltamivir requested a review from a team March 15, 2026 22:08
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 600582e to 826035c on March 15, 2026 22:08
@Oseltamivir
Collaborator Author

  1. Changed to 5-shot
  2. Yes; also added the suggested check that fails when the result is <85%
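A minimal sketch of what such an accuracy gate can look like in shell (hypothetical variable names and a hardcoded score; the real check parses results*.json in the workflow):

```shell
# Hypothetical sketch of an 85% accuracy gate over lm-eval output.
# In the real check, `score` would be parsed from results*.json (e.g. with jq);
# it is hardcoded here for illustration.
score=0.91
threshold=0.85
if awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s >= t) }'; then
    echo "PASS: accuracy $score >= $threshold"
else
    echo "FAIL: accuracy $score < $threshold" >&2
    exit 1
fi
```

awk is used for the comparison because plain `[` cannot compare floating-point values.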

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 812fe72 to fa65d40 on March 16, 2026 01:07
@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from 1d77739 to 4978aed on March 16, 2026 02:01
@functionstackx functionstackx self-requested a review March 16, 2026 03:14
@functionstackx
Contributor

LGTM, can you get @cquil11 to review too?

@functionstackx
Contributor

@claude review plz

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 16, 2026

Claude finished @functionstackx's task in 3m 42s


PR Review: Separate eval-only workflow and change to 8k1k

  • Read key changed files and gather context
  • Review workflow changes (benchmark-tmpl, e2e-tests, run-sweep)
  • Review benchmark_lib.sh changes
  • Review benchmark scripts (all 49 single-node)
  • Review matrix logic (generate_sweep_configs, process_changelog)
  • Review runner launch scripts
  • Review eval config and docs
  • Post final review

Overall: Looks good. This is a well-structured PR that cleanly separates eval jobs from benchmark jobs, fixes the GSM8k prompt, and adds eval support across all 49 benchmark scripts. The architecture is sound — the EVAL_ONLY / RUN_EVAL separation is clean, process_changelog.py properly generates separate benchmark and eval matrices, and the 85% accuracy threshold in eval-only mode is a good addition.

I verified that the issues from the previous automated review have all been addressed in the current revision:

  • EVAL_ONLY is correctly forwarded in all 4 Docker launch scripts (-e EVAL_ONLY)
  • --concurrent-requests $CONC has been removed from all benchmark scripts (0 matches)
  • Pattern B scripts (vLLM/TRT) correctly export EVAL_MAX_MODEL_LEN after the subshell call
  • The SINGLE filter in e2e-tests.yml no longer excludes eval-marked entries

Remaining issues

1. Median concurrency selection for 2-element lists (generate_sweep_configs.py:76)

median_conc = conc_values[len(conc_values) // 2]

For groups with exactly 2 distinct concurrency values [a, b], len // 2 = 1, so median_conc = b = highest. This means target_concs = {b, b} = {b} — only one concurrency level gets eval coverage. The AGENTS.md documents this ("When only 2 concurrency levels exist, this collapses to just the highest"), so it's acknowledged. But if the intent is two distinct levels for diversity, use (len(conc_values) - 1) // 2 to select the lower-middle element.
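A quick check of both indexing choices on a 2-element list:

```python
conc_values = [16, 64]  # two distinct concurrency levels

# Current code: upper-middle index collapses onto the highest value.
upper_mid = conc_values[len(conc_values) // 2]        # -> 64, same as highest

# Suggested alternative: lower-middle index keeps two distinct levels.
lower_mid = conc_values[(len(conc_values) - 1) // 2]  # -> 16

print({upper_mid, conc_values[-1]})  # one target conc
print({lower_mid, conc_values[-1]})  # two target concs
```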


2. compute_eval_context_length export is redundant (benchmark_lib.sh:616)

The function both exports EVAL_MAX_MODEL_LEN and echoes the value. Since every caller either (a) captures the echo in a subshell (losing the export) and then manually exports, or (b) calls via run_eval() which redirects to /dev/null, the export inside the function never actually takes effect in the parent shell. Consider removing the export from the function to avoid confusion about where the variable is actually set.
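The subshell behavior described above can be reproduced in isolation (toy function; not the real compute_eval_context_length):

```shell
# Toy demo: an export made inside $(...) command substitution runs in a
# subshell and never reaches the parent shell.
compute_value() {
    export DEMO_VAR=42   # takes effect only in the subshell
    echo 42              # the echo is what the caller actually gets
}
captured=$(compute_value)
echo "captured=$captured"           # captured=42
echo "DEMO_VAR=${DEMO_VAR:-unset}"  # DEMO_VAR=unset -- the export was lost
export DEMO_VAR="$captured"         # callers must re-export explicitly
```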

3. Eval failure in combined mode (non-EVAL_ONLY) is silent (benchmark_lib.sh:764-769)

When RUN_EVAL=true (combined mode) and the eval fails, run_eval() logs an error but does not exit with failure — the benchmark job still succeeds. This means eval regressions in combined mode can go unnoticed. The 85% threshold check only runs in the eval-only workflow path (benchmark-tmpl.yml:162-181), not in the combined RUN_EVAL=true path. This is probably intentional (don't fail expensive benchmark jobs over eval issues), but worth being explicit about.

4. EVAL_MAX_MODEL_LEN used as max_length in lm_eval (benchmark_lib.sh:624,661)

gen_max_tokens (from EVAL_MAX_MODEL_LEN, defaulting to 16384) is passed as max_length to lm_eval, which controls the maximum total input+output context window. For eval-only mode with expanded context (e.g., 47080 tokens for DeepSeek), this means lm_eval is told it can use up to 47080 tokens of context — but GSM8K prompts are short, so this is fine in practice. Just noting that if longer eval tasks are added, the max_gen_tokens=16384 hard cap (line 651) will limit per-generation output independently.

Looks good

  • GSM8k prompt fix (#### [number] instead of #### <answer>) with 5-shot — good fix for the strict-match [invalid] issue
  • evals-only: true in perf-changelog.yaml — clean separation for the production workflow
  • 85% accuracy threshold in eval-only mode with clear pass/fail logging
  • run_benchmark_serving correctly skips in eval-only mode
  • All scripts consistently handle eval context expansion with the correct pattern for their framework (SGLang: EVAL_CONTEXT_ARGS, vLLM/TRT: MAX_MODEL_LEN override + export)
  • process_changelog.py generates completely separate benchmark and eval arrays — no cross-contamination


Contributor

@claude claude bot left a comment

All 6 issues from my previous review have been addressed — nice work. Deferring to human review given the scope (64 files, new eval-only workflow mode, CI/CD workflow changes).


Overview

This PR separates eval jobs from benchmark jobs into a dedicated workflow stage, changes eval sequence length from 1k8k to 8k1k, fixes the GSM8k prompt template, adds eval support to 49 single-node benchmark scripts, and updates matrix generation to emit a separate evals matrix. It touches 64 files across GitHub Actions workflows, shell benchmark scripts, Python matrix generation code, and documentation.

Bug Resolution

All 6 issues I previously identified have been addressed in the current code:

  • The critical EVAL_ONLY Docker forwarding bug is fixed (all 4 runner scripts now pass -e EVAL_ONLY)
  • The 3 scripts with stale --concurrent-requests $CONC have been cleaned up
  • The subshell export issue is fixed with explicit export EVAL_MAX_MODEL_LEN after each subshell call
  • Documentation in EVALS.md and AGENTS.md now accurately reflects the new selection policy
  • The median concurrency collapse for 2-value groups is explicitly documented as intended behavior

Security Risks

No direct security risks. The changes are internal CI/CD pipeline modifications. The subprocess.run calls in process_changelog.py use list-based argument passing (no shell injection). No auth, crypto, or permissions code is touched.

Level of Scrutiny

This PR warrants human review due to: (1) the sheer scope of 64 files, (2) structural changes to GitHub Actions workflows that define the CI/CD pipeline, (3) a new eval-only execution mode that changes how benchmark infrastructure operates, and (4) non-trivial Python logic changes in mark_eval_entries() and process_changelog.py. While the 49 benchmark script changes are mechanical and follow a clear pattern, the workflow and library changes require careful human verification.

Other Factors

  • A maintainer has already LGTM'd the PR
  • A full test run has been linked showing the separate eval jobs working
  • The bulk of file changes (49 benchmark scripts) follow two clear patterns (Pattern A for SGLang, Pattern B for vLLM/TRT)
  • The code is well-structured with proper validation via Pydantic models

Comment on lines +780 to +788
        *) echo "Unknown framework '${framework}'"; eval_rc=1 ;;
    esac

    if [ "$eval_rc" -ne 0 ]; then
        echo "ERROR: run_eval failed with exit code $eval_rc" >&2
        if [ "${EVAL_ONLY}" = "true" ]; then
            echo "Eval-only mode: exiting with failure" >&2
            exit "$eval_rc"
        fi
Contributor

🟡 exit "$eval_rc" on line 787 terminates the entire script when eval-only mode fails, skipping append_lm_eval_summary() which every benchmark script calls on the next line. This means partial eval artifacts (results*.json, sample*.jsonl, meta_env.json) remain stranded in /tmp/eval_out-XXXXXX/ and are never moved to the workspace where the upload step looks for them. Fix: replace exit with return so the calling script can still collect artifacts before failing.


What the bug is

In run_eval() (benchmark_lib.sh lines 783-788), when eval fails and EVAL_ONLY=true, the function calls exit "$eval_rc" which terminates the entire bash process immediately. This is new code added by this PR. In contrast, when EVAL_ONLY is not true, the function uses return $eval_rc, allowing the caller to continue.

The calling pattern across all benchmark scripts

Every single benchmark script (all ~49 of them) follows this pattern:

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary    # <-- never reached on eval-only failure
fi
stop_gpu_monitor              # <-- also never reached

append_lm_eval_summary() is responsible for: (1) writing meta_env.json with job metadata, (2) moving eval artifacts (results*.json, sample*.jsonl) from the temp output directory (/tmp/eval_out-XXXXXX/) to the current working directory (workspace). Without it running, any partial eval results remain stranded in the temp directory.

Step-by-step proof

  1. An eval-only job runs with EVAL_ONLY=true, RUN_EVAL=true
  2. The benchmark script calls run_eval --framework lm-eval --port "$PORT"
  3. run_lm_eval() runs lm-eval, which produces partial results in /tmp/eval_out-XXXXXX/ before failing (e.g., timeout on some tasks, model errors on certain prompts)
  4. run_lm_eval returns non-zero, captured as eval_rc
  5. Line 785: EVAL_ONLY is "true", so line 787 executes: exit "$eval_rc"
  6. The entire script terminates immediately
  7. append_lm_eval_summary on the next line in the benchmark script never executes — artifacts stay in /tmp/eval_out-XXXXXX/
  8. stop_gpu_monitor also never executes — background GPU monitoring process is orphaned
  9. The workflow "Upload eval results" step (line 212) now correctly has always() guard so it does run, but it looks for results*.json in the workspace (PWD), not in /tmp/eval_out-XXXXXX/
  10. Result: no eval artifacts are uploaded, even though partial results may exist

Impact

When an eval-only run fails, partial eval artifacts that could help debug the failure are lost. The meta_env.json (job metadata) is never created, and any partial results*.json or sample*.jsonl files remain in the temp directory until the container/job exits. Since eval-only mode is the primary validation path for evals in this PR, losing debugging artifacts on failure is significant for operational debugging.

Fix

Replace exit "$eval_rc" with return "$eval_rc" on line 787. The calling scripts will then continue to append_lm_eval_summary (which collects whatever artifacts exist) and stop_gpu_monitor before the script exits naturally. The workflow-level result file check (if ! ls results*.json) at benchmark-tmpl.yml line 159 will still detect the failure and exit non-zero, so the job still fails as intended — but now with artifacts preserved.
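A toy harness illustrating the difference (illustrative names only; the real functions live in benchmark_lib.sh):

```shell
# Toy demo of why `return` lets cleanup run while `exit` skips it.
run_eval_toy() {
    local eval_rc=1      # simulate an eval failure
    return "$eval_rc"    # with `exit "$eval_rc"` the lines below never run
}
cleanup_ran=false
run_eval_toy || rc=$?
cleanup_ran=true         # stands in for append_lm_eval_summary / stop_gpu_monitor
echo "rc=$rc cleanup_ran=$cleanup_ran"
```

With `exit` in place of `return`, the script would terminate inside the function call and `cleanup_ran` would never be set.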

--temperature) temperature="$2"; shift 2 ;;
--top-p) top_p="$2"; shift 2 ;;
--concurrent-requests) concurrent_requests="$2"; shift 2 ;;
*) echo "Unknown parameter: $1"; return 1 ;;
Contributor

🔴 glm5_fp8_h200.sh (added by PR #914, already merged) still passes --concurrent-requests $CONC to run_eval on line 63. This PR removed the --concurrent-requests handler from run_lm_eval, so the catch-all *) at benchmark_lib.sh:646 will print "Unknown parameter: --concurrent-requests" and return 1, failing ALL eval runs for GLM-5 on H200. The script also lacks the EVAL_ONLY block that was added to all other benchmark scripts. Fix: update glm5_fp8_h200.sh to remove --concurrent-requests $CONC from the run_eval call and add the EVAL_ONLY context expansion block.


What the bug is

PR #914 (commit ebb48d4, already merged into main) added benchmarks/single_node/glm5_fp8_h200.sh for GLM-5 FP8 benchmarks on H200 GPUs. On line 63, this script calls:

run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC

This PR (#911) removes the --concurrent-requests handler from run_lm_eval() in benchmark_lib.sh. The previous code had --concurrent-requests) shift 2; continue ;; to silently consume the argument, but that handler no longer exists. The catch-all *) case at line 646 now does echo "Unknown parameter: $1"; return 1 ;;.

The specific code path that triggers it

  1. glm5_fp8_h200.sh line 63 calls run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
  2. run_eval() (benchmark_lib.sh:761) parses --framework and puts --port, "$PORT", --concurrent-requests, and $CONC into the forwarded[] array
  3. run_eval() calls run_lm_eval "${forwarded[@]}" at line 779
  4. run_lm_eval() processes --port successfully, then encounters --concurrent-requests
  5. The catch-all *) at line 646 fires: prints "Unknown parameter: --concurrent-requests" and returns 1
  6. run_eval captures the failure (eval_rc=$? = 1)

Why existing code doesn’t prevent it

This script was added by a separate PR (#914) that merged after the initial review of PR #911. The initial review (comment #3) correctly identified 3 other scripts (kimik2.5_fp4_b200.sh, kimik2.5_int4_mi325x.sh, minimaxm2.5_fp8_b200.sh) that still had --concurrent-requests, and those were subsequently fixed. However, glm5_fp8_h200.sh arrived via a different PR and was missed.

I verified via the diff that the sibling script glm5_fp8_mi355x.sh WAS updated by this PR to remove --concurrent-requests and add the EVAL_ONLY block, confirming that glm5_fp8_h200.sh was simply overlooked.

Step-by-step proof

  1. Confirmed glm5_fp8_h200.sh line 63: run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
  2. Confirmed run_eval() at benchmark_lib.sh:768 forwards unrecognized args: *) forwarded+=("$1"); shift ;;
  3. Confirmed run_lm_eval() at benchmark_lib.sh:646 has no --concurrent-requests handler: *) echo "Unknown parameter: $1"; return 1 ;;
  4. Result: eval fails immediately with "Unknown parameter: --concurrent-requests", return code 1

Impact

This breaks ALL eval runs for the GLM-5 FP8 H200 configuration. Additionally, the script lacks the EVAL_ONLY block (for computing expanded eval context length and starting the server with it) that was added to all ~48 other benchmark scripts in this PR, meaning eval-only mode also cannot work correctly for this config.

How to fix

Update benchmarks/single_node/glm5_fp8_h200.sh to:

  1. Remove --concurrent-requests $CONC from the run_eval call on line 63
  2. Add the EVAL_ONLY block (with EVAL_CONTEXT_ARGS and export EVAL_MAX_MODEL_LEN) matching the pattern in the sibling glm5_fp8_mi355x.sh
  3. Append $EVAL_CONTEXT_ARGS to the sglang launch command

@Oseltamivir Oseltamivir force-pushed the eval-8k1k-server-restart branch from cfe5ca1 to 766a742 on March 20, 2026 04:25
Comment on lines 57 to +62
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))")
MULTI=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' in x]))")
EVALS=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and x.get('run-eval', False)]))")
echo "single-node-config=$SINGLE" >> $GITHUB_OUTPUT
echo "multi-node-config=$MULTI" >> $GITHUB_OUTPUT
echo "eval-config=$EVALS" >> $GITHUB_OUTPUT
Contributor

🟡 When --evals-only is passed to e2e-tests.yml, every config runs twice: once as a full benchmark via test-sweep-single-node (which hardcodes run-eval: false) and once as the intended eval-only run via test-sweep-evals. The SINGLE filter on line 57 has no exclusion for eval-only entries, so it captures the same configs as the EVALS filter, wasting CI GPU time on redundant full benchmark runs. Fix by excluding entries with run-eval: True from the SINGLE filter (e.g., [x for x in d if "prefill" not in x and not x.get("run-eval", False)]).


Bug Analysis

The e2e-tests.yml workflow computes three matrix filters from the output of generate_sweep_configs.py:

  • SINGLE (line 57): [x for x in d if "prefill" not in x] — all non-prefill entries
  • MULTI (line 58): [x for x in d if "prefill" in x] — only prefill/decode disaggregated entries
  • EVALS (line 59): [x for x in d if "prefill" not in x and x.get("run-eval", False)] — non-prefill entries with run-eval: True

These feed into three downstream jobs: test-sweep-single-node (uses SINGLE), test-sweep-multi-node (uses MULTI), and test-sweep-evals (uses EVALS).

The Problem

When --evals-only is passed, generate_sweep_configs.py (lines 935-937) filters its output to only include entries where run-eval is True:

if args.evals_only:
    matrix_values = [e for e in matrix_values if e.get(Fields.RUN_EVAL.value, False)]

This means every entry in the output has run-eval: True. Since the SINGLE filter only checks for "prefill" not in x without excluding run-eval entries, SINGLE and EVALS end up containing the exact same set of configs.

Step-by-Step Proof

  1. User triggers e2e-tests.yml with a command including --evals-only
  2. generate_sweep_configs.py outputs, say, 3 configs — all with run-eval: True, none with prefill
  3. SINGLE filter: all 3 pass (no prefill check is satisfied, no run-eval exclusion exists) → [config1, config2, config3]
  4. EVALS filter: all 3 pass (no prefill, all have run-eval: True) → [config1, config2, config3]
  5. test-sweep-single-node runs all 3 as full benchmarks (hardcoded run-eval: false on line 129, no eval-only input)
  6. test-sweep-evals runs all 3 as eval-only (passes run-eval: true and eval-only: true on lines 159-160)
  7. Each config runs twice — the full benchmark run is completely wasted

Impact

This wastes CI GPU resources by running unnecessary full benchmarks when only evals were requested. The eval results themselves are still correct since test-sweep-evals runs properly, but the redundant test-sweep-single-node jobs consume GPU time for no useful output.

Fix

The simplest fix is to exclude run-eval entries from the SINGLE filter so they only appear in EVALS:

SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")

Note: this would also affect non-evals-only runs by removing eval-marked entries from the benchmark job. If that is undesirable, an alternative is to have generate_sweep_configs.py output a top-level flag indicating evals-only mode, and conditionally skip the SINGLE matrix entirely.
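A quick standalone check of the overlap and the proposed fix (the sample configs are made up):

```python
# In --evals-only mode every emitted entry has run-eval: True (per
# generate_sweep_configs.py), so the SINGLE and EVALS filters select
# the same configs unless SINGLE excludes eval-marked entries.
configs = [
    {"config-key": "gptoss-fp4-h100-vllm", "run-eval": True},
    {"config-key": "dsr1-fp8-mi325x-sglang", "run-eval": True},
]
single_old = [x for x in configs if "prefill" not in x]
evals = [x for x in configs if "prefill" not in x and x.get("run-eval", False)]
single_fixed = [x for x in configs
                if "prefill" not in x and not x.get("run-eval", False)]

assert single_old == evals   # every config would run twice
assert single_fixed == []    # fixed filter leaves only the eval jobs
```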
