Separate eval-only workflow and change to 8k1k #911
Oseltamivir wants to merge 37 commits into main from
Conversation
- Switch eval selection from 1k8k to 8k1k in `mark_eval_entries()`
- After the throughput benchmark, kill the server and restart it with the model's native max context length (`max_position_embeddings`) for eval
- Replace hardcoded `gen_max_tokens=16384` and `max_tokens=8192` with the native max model length
- Add `_start_eval_server()` supporting sglang, vllm, trtllm, and atom
- Add `EVAL_SERVER_EXTRA_ARGS` to benchmark scripts for correctness flags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
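The 8k1k selection flip in the commit above can be sketched in a few lines. This is a hedged illustration only: `mark_eval_entries`, the `isl`/`osl` field names, and the exact token values are assumptions for the sketch, not the repo's actual code.

```python
def mark_eval_entries(entries):
    """Hypothetical sketch: flag the 8k-input / 1k-output sweep configs for eval."""
    for e in entries:
        # previously the 1k8k (isl=1024, osl=8192) configs were selected; flip to 8k1k
        e["run-eval"] = e.get("isl") == 8192 and e.get("osl") == 1024
    return entries

marked = mark_eval_entries([{"isl": 1024, "osl": 8192}, {"isl": 8192, "osl": 1024}])
```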
max_tokens in gen_kwargs must be less than max_length to leave room for the input prompt. Without this cap, the server rejects requests where input_tokens + max_tokens > max_model_len. Uses max_length - 4096 as the generation cap (min 8192). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use 70% of max_length for max_tokens generation cap, leaving 30% for the prompt. Echo the budget for visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
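The 70/30 budget split described above is simple to express. A minimal sketch, assuming only what the commit states (the function name is ours, not the repo's):

```python
def gen_token_budget(max_length):
    """Reserve 70% of the model context for generation, leaving 30% for the prompt."""
    budget = int(max_length * 0.7)
    # echo the budget for visibility, as the commit describes
    print(f"max_tokens budget: {budget} (max_length={max_length})")
    return budget
```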
- Add --no-skip-eval-only flag to wait_for_server_ready so _start_eval_server properly waits for the eval server - Fix pip3 install in dsr1_fp8_h200.sh with --break-system-packages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump eval concurrency to 64 via `EVAL_CONCURRENT_REQUESTS`, independent of benchmark `CONC`
- Cap eval `max_gen_tokens` to 8192 to avoid KV cache issues
- Read `num_fewshot` from the eval YAML instead of a CLI override
- Add 'trt' as an alias for trtllm in the eval server case
- Reduce seq len configs for OOM prevention
- Change the eval sweep to top-of-curve and middle configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t error The ATOM container has a broken torchvision that causes circular import errors when lm_eval loads. Since we use local-chat-completions (API-based), torchvision is not needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…val call Pass utils/evals/ directory to --tasks so lm-eval globs all *.yaml files and runs gsm8k + gpqa_diamond consecutively in one invocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…jection max_gen_tokens was hardcoded to 16384 but servers with smaller EVAL_MAX_MODEL_LEN (e.g. 9416) reject requests exceeding their limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15 SGLang scripts now pass --context-length with compute_eval_context_length when EVAL_ONLY=true. 3 vLLM scripts override MAX_MODEL_LEN similarly. Reverts the max_gen_tokens cap since the server should have sufficient context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… file Was defaulting to utils/evals/gsm8k.yaml which caused lm-eval to only run gsm8k. Directory path lets lm-eval glob all *.yaml files (gsm8k + gpqa). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t to 1800s 5x avoids TRT OOM on H200 (47K vs 94K context). 1800s timeout prevents single-request timeouts on slow models like Kimi K2.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   benchmarks/single_node/dsr1_fp4_b200.sh
#   benchmarks/single_node/dsr1_fp4_mi355x.sh
#   benchmarks/single_node/dsr1_fp8_b200.sh
#   benchmarks/single_node/dsr1_fp8_b200_mtp.sh
#   benchmarks/single_node/dsr1_fp8_b200_trt_mtp.sh
#   benchmarks/single_node/dsr1_fp8_h200_trt_mtp.sh
#   benchmarks/single_node/dsr1_fp8_mi300x.sh
#   benchmarks/single_node/dsr1_fp8_mi355x.sh
#   benchmarks/single_node/glm5_fp8_mi355x.sh
#   benchmarks/single_node/gptoss_fp4_mi300x.sh
#   benchmarks/single_node/gptoss_fp4_mi325x.sh
#   benchmarks/single_node/gptoss_fp4_mi355x.sh
#   benchmarks/single_node/kimik2.5_fp4_b200.sh
#   benchmarks/single_node/kimik2.5_int4_b200.sh
#   benchmarks/single_node/kimik2.5_int4_mi325x.sh
#   benchmarks/single_node/kimik2.5_int4_mi355x.sh
#   benchmarks/single_node/minimaxm2.5_fp8_b200.sh
#   benchmarks/single_node/minimaxm2.5_fp8_h200.sh
#   benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
#   benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
#   benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
#   benchmarks/single_node/qwen3.5_bf16_mi300x.sh
#   benchmarks/single_node/qwen3.5_bf16_mi325x.sh
#   benchmarks/single_node/qwen3.5_bf16_mi355x.sh
#   benchmarks/single_node/qwen3.5_fp8_mi300x.sh
#   benchmarks/single_node/qwen3.5_fp8_mi325x.sh
#   benchmarks/single_node/qwen3.5_fp8_mi355x.sh
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
thanks for the PR @Oseltamivir! great that we now run evals in parallel instead of sequentially
mostly LGTM
just had a couple of questions
- does this PR change the evals to use the industry standard num of shots for gsm8k?
- previously when evals hard crashed, the script still exits with success exit code, is this fixed now?
lgtm, can u get @cquil11 to review too?
@claude review plz
Claude finished @functionstackx's task in 3m 42s

PR Review: Separate eval-only workflow and change to 8k1k
Overall: Looks good. This is a well-structured PR that cleanly separates eval jobs from benchmark jobs, fixes the GSM8k prompt, and adds eval support across all 49 benchmark scripts. The architecture is sound. I verified that the issues from the previous automated review have all been addressed in the current revision:
Remaining issues:

1. Median concurrency selection for 2-element lists — `median_conc = conc_values[len(conc_values) // 2]`. For groups with exactly 2 distinct concurrency values
2. The function both
3. Eval failure in combined mode (non-EVAL_ONLY) is silent — When
4.
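Issue 1 can be reproduced in isolation. A minimal sketch of the indexing under discussion (whether the repo sorts the values first is an assumption made here):

```python
def median_conc(conc_values):
    # the selection under discussion: index len // 2 into the (sorted) values
    vals = sorted(conc_values)
    return vals[len(vals) // 2]

# For a 2-element group, len // 2 == 1, so the larger value is chosen --
# the "median" collapses to the upper element rather than a middle value.
```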
Looks good
All 6 issues from my previous review have been addressed — nice work. Deferring to human review given the scope (64 files, new eval-only workflow mode, CI/CD workflow changes).
Extended reasoning...
Overview
This PR separates eval jobs from benchmark jobs into a dedicated workflow stage, changes eval sequence length from 1k8k to 8k1k, fixes the GSM8k prompt template, adds eval support to 49 single-node benchmark scripts, and updates matrix generation to emit a separate evals matrix. It touches 64 files across GitHub Actions workflows, shell benchmark scripts, Python matrix generation code, and documentation.
Bug Resolution
All 6 issues I previously identified have been addressed in the current code:
- The critical EVAL_ONLY Docker forwarding bug is fixed (all 4 runner scripts now pass `-e EVAL_ONLY`)
- The 3 scripts with stale `--concurrent-requests $CONC` have been cleaned up
- The subshell export issue is fixed with an explicit `export EVAL_MAX_MODEL_LEN` after each subshell call
- Documentation in EVALS.md and AGENTS.md now accurately reflects the new selection policy
- The median concurrency collapse for 2-value groups is explicitly documented as intended behavior
Security Risks
No direct security risks. The changes are internal CI/CD pipeline modifications. The subprocess.run calls in process_changelog.py use list-based argument passing (no shell injection). No auth, crypto, or permissions code is touched.
Level of Scrutiny
This PR warrants human review due to: (1) the sheer scope of 64 files, (2) structural changes to GitHub Actions workflows that define the CI/CD pipeline, (3) a new eval-only execution mode that changes how benchmark infrastructure operates, and (4) non-trivial Python logic changes in mark_eval_entries() and process_changelog.py. While the 49 benchmark script changes are mechanical and follow a clear pattern, the workflow and library changes require careful human verification.
Other Factors
- A maintainer has already LGTM'd the PR
- A full test run has been linked showing the separate eval jobs working
- The bulk of file changes (49 benchmark scripts) follow two clear patterns (Pattern A for SGLang, Pattern B for vLLM/TRT)
- The code is well-structured with proper validation via Pydantic models
```sh
    *) echo "Unknown framework '${framework}'"; eval_rc=1 ;;
esac

if [ "$eval_rc" -ne 0 ]; then
  echo "ERROR: run_eval failed with exit code $eval_rc" >&2
  if [ "${EVAL_ONLY}" = "true" ]; then
    echo "Eval-only mode: exiting with failure" >&2
    exit "$eval_rc"
  fi
```
🟡 exit "$eval_rc" on line 787 terminates the entire script when eval-only mode fails, skipping append_lm_eval_summary() which every benchmark script calls on the next line. This means partial eval artifacts (results*.json, sample*.jsonl, meta_env.json) remain stranded in /tmp/eval_out-XXXXXX/ and are never moved to the workspace where the upload step looks for them. Fix: replace exit with return so the calling script can still collect artifacts before failing.
Extended reasoning...
What the bug is
In run_eval() (benchmark_lib.sh lines 783-788), when eval fails and EVAL_ONLY=true, the function calls exit "$eval_rc" which terminates the entire bash process immediately. This is new code added by this PR. In contrast, when EVAL_ONLY is not true, the function uses return $eval_rc, allowing the caller to continue.
The calling pattern across all benchmark scripts
Every single benchmark script (all ~49 of them) follows this pattern:
```sh
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary   # <-- never reached on eval-only failure
fi
stop_gpu_monitor           # <-- also never reached
```

`append_lm_eval_summary()` is responsible for: (1) writing `meta_env.json` with job metadata, (2) moving eval artifacts (`results*.json`, `sample*.jsonl`) from the temp output directory (`/tmp/eval_out-XXXXXX/`) to the current working directory (workspace). Without it running, any partial eval results remain stranded in the temp directory.
Step-by-step proof
1. An eval-only job runs with `EVAL_ONLY=true`, `RUN_EVAL=true`
2. The benchmark script calls `run_eval --framework lm-eval --port "$PORT"`
3. `run_lm_eval()` runs lm-eval, which produces partial results in `/tmp/eval_out-XXXXXX/` before failing (e.g., timeout on some tasks, model errors on certain prompts)
4. `run_lm_eval` returns non-zero, captured as `eval_rc`
5. Line 785: `EVAL_ONLY` is `"true"`, so line 787 executes `exit "$eval_rc"`
6. The entire script terminates immediately
7. `append_lm_eval_summary` on the next line in the benchmark script never executes — artifacts stay in `/tmp/eval_out-XXXXXX/`
8. `stop_gpu_monitor` also never executes — the background GPU monitoring process is orphaned
9. The workflow "Upload eval results" step (line 212) now correctly has an `always()` guard so it does run, but it looks for `results*.json` in the workspace (PWD), not in `/tmp/eval_out-XXXXXX/`
10. Result: no eval artifacts are uploaded, even though partial results may exist
Impact
When an eval-only run fails, partial eval artifacts that could help debug the failure are lost. The meta_env.json (job metadata) is never created, and any partial results*.json or sample*.jsonl files remain in the temp directory until the container/job exits. Since eval-only mode is the primary validation path for evals in this PR, losing debugging artifacts on failure is significant for operational debugging.
Fix
Replace exit "$eval_rc" with return "$eval_rc" on line 787. The calling scripts will then continue to append_lm_eval_summary (which collects whatever artifacts exist) and stop_gpu_monitor before the script exits naturally. The workflow-level result file check (if ! ls results*.json) at benchmark-tmpl.yml line 159 will still detect the failure and exit non-zero, so the job still fails as intended — but now with artifacts preserved.
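The `return` vs. `exit` difference is easy to demonstrate in isolation. The snippet below drives two tiny bash programs through `subprocess`; it is a behavioral sketch only, not the repo's benchmark_lib.sh:

```python
import subprocess

def run(snippet):
    """Run a bash snippet and return (stdout, exit code)."""
    p = subprocess.run(["bash", "-c", snippet], capture_output=True, text=True)
    return p.stdout.strip(), p.returncode

# `exit` inside the function kills the whole script: the cleanup echo never runs.
out_exit, rc_exit = run("f() { exit 3; }; f; echo collect-artifacts")

# `return` hands control back to the caller, which can collect artifacts first
# and still propagate the failing code.
out_ret, rc_ret = run("f() { return 3; }; f; rc=$?; echo collect-artifacts; exit $rc")
```

In both cases the process exits 3, but only the `return` variant reaches the cleanup line — which is exactly why the fix preserves artifacts while still failing the job.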
```sh
  --temperature) temperature="$2"; shift 2 ;;
  --top-p) top_p="$2"; shift 2 ;;
  --concurrent-requests) concurrent_requests="$2"; shift 2 ;;
  *) echo "Unknown parameter: $1"; return 1 ;;
```
🔴 glm5_fp8_h200.sh (added by PR #914, already merged) still passes --concurrent-requests $CONC to run_eval on line 63. This PR removed the --concurrent-requests handler from run_lm_eval, so the catch-all *) at benchmark_lib.sh:646 will print "Unknown parameter: --concurrent-requests" and return 1, failing ALL eval runs for GLM-5 on H200. The script also lacks the EVAL_ONLY block that was added to all other benchmark scripts. Fix: update glm5_fp8_h200.sh to remove --concurrent-requests $CONC from the run_eval call and add the EVAL_ONLY context expansion block.
Extended reasoning...
What the bug is
PR #914 (commit ebb48d4, already merged into main) added benchmarks/single_node/glm5_fp8_h200.sh for GLM-5 FP8 benchmarks on H200 GPUs. On line 63, this script calls:
```sh
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
```

This PR (#911) removes the `--concurrent-requests` handler from `run_lm_eval()` in benchmark_lib.sh. The previous code had `--concurrent-requests) shift 2; continue ;;` to silently consume the argument, but that handler no longer exists. The catch-all `*)` case at line 646 now does `echo "Unknown parameter: $1"; return 1 ;;`.
The specific code path that triggers it
1. `glm5_fp8_h200.sh` line 63 calls `run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC`
2. `run_eval()` (benchmark_lib.sh:761) parses `--framework` and puts `--port`, `"$PORT"`, `--concurrent-requests`, and `$CONC` into the `forwarded[]` array
3. `run_eval()` calls `run_lm_eval "${forwarded[@]}"` at line 779
4. `run_lm_eval()` processes `--port` successfully, then encounters `--concurrent-requests`
5. The catch-all `*)` at line 646 fires: it prints "Unknown parameter: --concurrent-requests" and returns 1
6. `run_eval` captures the failure (`eval_rc=$?` = 1)
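The failure mode reduces to a strict option loop rejecting an unhandled flag. A simplified Python model of the bash `case` parsing (the flag set here is illustrative, not the full list):

```python
def run_lm_eval(args):
    """Simplified model of the option loop: known flags consume one value,
    anything else hits the catch-all and fails (hypothetical flag set)."""
    known = {"--port", "--temperature", "--top-p"}  # no --concurrent-requests handler
    i = 0
    while i < len(args):
        if args[i] in known:
            i += 2  # skip the flag and its value
        else:
            print(f"Unknown parameter: {args[i]}")
            return 1
    return 0
```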
Why existing code doesn’t prevent it
This script was added by a separate PR (#914) that merged after the initial review of PR #911. The initial review (comment #3) correctly identified 3 other scripts (kimik2.5_fp4_b200.sh, kimik2.5_int4_mi325x.sh, minimaxm2.5_fp8_b200.sh) that still had --concurrent-requests, and those were subsequently fixed. However, glm5_fp8_h200.sh arrived via a different PR and was missed.
I verified via the diff that the sibling script glm5_fp8_mi355x.sh WAS updated by this PR to remove --concurrent-requests and add the EVAL_ONLY block, confirming that glm5_fp8_h200.sh was simply overlooked.
Step-by-step proof
1. Confirmed `glm5_fp8_h200.sh` line 63: `run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC`
2. Confirmed `run_eval()` at benchmark_lib.sh:768 forwards unrecognized args: `*) forwarded+=("$1"); shift ;;`
3. Confirmed `run_lm_eval()` at benchmark_lib.sh:646 has no `--concurrent-requests` handler: `*) echo "Unknown parameter: $1"; return 1 ;;`
4. Result: eval fails immediately with "Unknown parameter: --concurrent-requests", return code 1
Impact
This breaks ALL eval runs for the GLM-5 FP8 H200 configuration. Additionally, the script lacks the EVAL_ONLY block (for computing expanded eval context length and starting the server with it) that was added to all ~48 other benchmark scripts in this PR, meaning eval-only mode also cannot work correctly for this config.
How to fix
Update benchmarks/single_node/glm5_fp8_h200.sh to:
1. Remove `--concurrent-requests $CONC` from the `run_eval` call on line 63
2. Add the `EVAL_ONLY` block (with `EVAL_CONTEXT_ARGS` and `export EVAL_MAX_MODEL_LEN`) matching the pattern in the sibling `glm5_fp8_mi355x.sh`
3. Append `$EVAL_CONTEXT_ARGS` to the sglang launch command
```sh
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))")
MULTI=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' in x]))")
EVALS=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and x.get('run-eval', False)]))")
echo "single-node-config=$SINGLE" >> $GITHUB_OUTPUT
echo "multi-node-config=$MULTI" >> $GITHUB_OUTPUT
echo "eval-config=$EVALS" >> $GITHUB_OUTPUT
```
🟡 When --evals-only is passed to e2e-tests.yml, every config runs twice: once as a full benchmark via test-sweep-single-node (which hardcodes run-eval: false) and once as the intended eval-only run via test-sweep-evals. The SINGLE filter on line 57 has no exclusion for eval-only entries, so it captures the same configs as the EVALS filter, wasting CI GPU time on redundant full benchmark runs. Fix by excluding entries with run-eval: True from the SINGLE filter (e.g., [x for x in d if "prefill" not in x and not x.get("run-eval", False)]).
Extended reasoning...
Bug Analysis
The e2e-tests.yml workflow computes three matrix filters from the output of generate_sweep_configs.py:
- SINGLE (line 57): `[x for x in d if "prefill" not in x]` — all non-prefill entries
- MULTI (line 58): `[x for x in d if "prefill" in x]` — only prefill/decode disaggregated entries
- EVALS (line 59): `[x for x in d if "prefill" not in x and x.get("run-eval", False)]` — non-prefill entries with `run-eval: True`
These feed into three downstream jobs: test-sweep-single-node (uses SINGLE), test-sweep-multi-node (uses MULTI), and test-sweep-evals (uses EVALS).
The Problem
When --evals-only is passed, generate_sweep_configs.py (lines 935-937) filters its output to only include entries where run-eval is True:
```python
if args.evals_only:
    matrix_values = [e for e in matrix_values if e.get(Fields.RUN_EVAL.value, False)]
```

This means every entry in the output has `run-eval: True`. Since the SINGLE filter only checks for `"prefill" not in x` without excluding run-eval entries, SINGLE and EVALS end up containing the exact same set of configs.
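The overlap is easy to see with the two comprehensions side by side; the config names below are hypothetical placeholders:

```python
# Output of generate_sweep_configs.py under --evals-only: every entry is
# eval-marked, none are prefill/decode disaggregated (names hypothetical).
d = [
    {"name": "dsr1_fp8_h200", "run-eval": True},
    {"name": "gptoss_fp4_mi300x", "run-eval": True},
]
single = [x for x in d if "prefill" not in x]
evals = [x for x in d if "prefill" not in x and x.get("run-eval", False)]
# single == evals: the same configs land in both matrices
```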
Step-by-Step Proof
1. User triggers `e2e-tests.yml` with a command including `--evals-only`
2. `generate_sweep_configs.py` outputs, say, 3 configs — all with `run-eval: True`, none with `prefill`
3. SINGLE filter: all 3 pass (the no-prefill check is satisfied; no run-eval exclusion exists) → `[config1, config2, config3]`
4. EVALS filter: all 3 pass (no prefill, all have run-eval: True) → `[config1, config2, config3]`
5. `test-sweep-single-node` runs all 3 as full benchmarks (hardcoded `run-eval: false` on line 129, no `eval-only` input)
6. `test-sweep-evals` runs all 3 as eval-only (passes `run-eval: true` and `eval-only: true` on lines 159-160)
7. Each config runs twice — the full benchmark run is completely wasted
Impact
This wastes CI GPU resources by running unnecessary full benchmarks when only evals were requested. The eval results themselves are still correct since test-sweep-evals runs properly, but the redundant test-sweep-single-node jobs consume GPU time for no useful output.
Fix
The simplest fix is to exclude run-eval entries from the SINGLE filter so they only appear in EVALS:
```sh
SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
```

Note: this would also affect non-evals-only runs by removing eval-marked entries from the benchmark job. If that is undesirable, an alternative is to have generate_sweep_configs.py output a top-level flag indicating evals-only mode, and conditionally skip the SINGLE matrix entirely.
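With the added `not x.get("run-eval", False)` exclusion, the two matrices become disjoint. A quick check (entry names hypothetical):

```python
# Mixed sweep output: a benchmark-only config, an eval-marked config, and a
# prefill/decode disaggregated config (names hypothetical).
d = [
    {"name": "bench-only"},
    {"name": "bench-plus-eval", "run-eval": True},
    {"name": "disagg", "prefill": {}},
]
single = [x for x in d if "prefill" not in x and not x.get("run-eval", False)]
evals = [x for x in d if "prefill" not in x and x.get("run-eval", False)]
# single and evals now partition the non-prefill entries with no overlap
```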
Summary
- Adds a separate eval workflow stage, so evals can run independently with `evals-only: true` in perf-changelog
- Adds an `eval-only` input to benchmark-tmpl.yml that skips throughput benchmarks and result file checks, running only lm_eval
- Changes the GSM8k answer prompt from `#### <answer>` to `#### [number]` to prevent models (e.g. Kimi-K2.5) from outputting the literal `<answer>` tag, which caused strict-match to return `[invalid]`
- Updates `process_changelog.py` and `generate_sweep_configs.py` to emit a separate `evals` matrix for eval-only jobs

Separate jobs on a normal run tested with:
Working full run: