
Add lm-eval benchmark runner for evals #41

Open

Oseltamivir wants to merge 1 commit into NVIDIA:main from Oseltamivir:lm-eval-main

Conversation

@Oseltamivir
Contributor

Summary

Add InferenceX multi-node eval support through an lm-eval benchmark runner and an eval-only orchestration path. This lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

Copied from ishandhanani/srt-slurm#245

How

  • Add an lm-eval benchmark runner that sources InferenceX's benchmarks/benchmark_lib.sh from a mounted /infmax-workspace.
  • Mount INFMAX_WORKSPACE into the container as /infmax-workspace when provided.
  • Add EVAL_ONLY=true handling in do_sweep.py so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch lm-eval directly.
  • Keep RUN_EVAL=true behavior as a post-benchmark eval path for normal throughput jobs.
  • Pass model/framework/topology metadata into the eval container, including served MODEL_NAME, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
  • Map srt-slurm PREFILL_DP_ATTN / DECODE_DP_ATTN env vars to the InferenceX PREFILL_DP_ATTENTION / DECODE_DP_ATTENTION names expected by append_lm_eval_summary (see the sketch after this list).
  • Copy eval outputs (meta_env.json, results*.json, sample*.jsonl) into /logs/eval_results/ for launcher-side artifact pickup.
  • Preserve partial eval artifacts on lm-eval failure while still returning the original eval failure code.
  • Document the InferenceX lm-eval integration in docs/accuracy.md.
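
A minimal Python sketch of the two runner-side steps above, the env-var aliasing and the artifact copy. Only the env var names, the file globs, and the /logs/eval_results/ path come from this PR; the function names and the eval_dir parameter are hypothetical.

```python
import glob
import os
import shutil

# srt-slurm name -> InferenceX name expected by append_lm_eval_summary
ENV_ALIASES = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}

def export_infmax_env() -> None:
    """Mirror srt-slurm topology env vars under the InferenceX names."""
    for src, dst in ENV_ALIASES.items():
        value = os.environ.get(src)
        if value is not None:
            os.environ[dst] = value

def collect_eval_artifacts(eval_dir: str, out_dir: str = "/logs/eval_results") -> None:
    """Copy lm-eval outputs to where the launcher picks up artifacts.

    Called on success and on failure, so partial artifacts survive an
    lm-eval crash while the original failure code is still returned.
    """
    os.makedirs(out_dir, exist_ok=True)
    for pattern in ("meta_env.json", "results*.json", "sample*.jsonl"):
        for path in glob.glob(os.path.join(eval_dir, pattern)):
            shutil.copy2(path, out_dir)
```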

What

For EVAL_ONLY=true:

  • srt-slurm still starts the normal deployment topology.
  • The throughput benchmark runner is skipped.
  • wait_for_model() verifies the configured prefill/decode or aggregated worker counts.
  • lm-eval runs against the OpenAI-compatible endpoint.
  • Eval failure is fatal.
  • A low score also leads to failure (see the sketch after the next list).

For RUN_EVAL=true without EVAL_ONLY=true:

  • The normal benchmark runs first.
  • lm-eval runs as a post-step if throughput succeeds.
  • Eval failure is non-fatal to the benchmark result.
  • A low score still leads to failure.
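
A hedged sketch of how the two modes could branch in do_sweep.py. The stub functions and the LOW_SCORE_RC convention are assumptions for illustration; only the env var semantics and the fatal/non-fatal rules above come from this PR.

```python
import logging
import os

log = logging.getLogger(__name__)

LOW_SCORE_RC = 2  # assumed convention: runner signals a below-threshold score

def wait_for_model() -> None:           # stub: full model health check
    ...

def run_throughput_benchmark() -> int:  # stub: normal benchmark stage
    return 0

def run_lm_eval() -> int:               # stub: lm-eval benchmark runner
    return 0

def _truthy(name: str) -> bool:
    return os.environ.get(name, "").lower() == "true"

def run_sweep_entry() -> int:
    wait_for_model()  # verifies prefill/decode or aggregated worker counts

    if _truthy("EVAL_ONLY"):
        return run_lm_eval()  # any eval failure, including a low score, is fatal

    rc = run_throughput_benchmark()
    if rc != 0 or not _truthy("RUN_EVAL"):
        return rc  # eval runs only as a post-step after a successful benchmark

    eval_rc = run_lm_eval()
    if eval_rc == LOW_SCORE_RC:
        return eval_rc  # a below-threshold score still fails the job
    if eval_rc != 0:
        # other lm-eval failures are non-fatal to the benchmark result
        log.warning("lm-eval failed with rc=%d; keeping benchmark result", eval_rc)
    return 0
```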

Validation run

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

SemiAnalysisAI/InferenceX#1000

Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 22, 2026
One-entry sweep config used by the manual workflow_dispatch that
validates NVIDIA/srt-slurm#41 on GB200. Mirrors the cheapest entry
from dsr1-fp8-gb200-dynamo-trt (8k1k stp, eval-conc=63, 1P/3D) so
the end-to-end eval path is exercised without running the full
gb200 sweep.

Not referenced by any automated workflow; picked up only when passed
explicitly via --config-files.
@Oseltamivir changed the title from "Add lm-eval benchmark runner for InferenceX evals" to "Add lm-eval benchmark runner for evals" on Apr 23, 2026

Integrate EleutherAI lm-evaluation-harness as a standalone benchmark
runner. The default path runs the lm_eval CLI directly against the
OpenAI-compatible endpoint (installing via pip if needed). An external
eval harness can optionally take over via LM_EVAL_WORKSPACE mount or
LM_EVAL_LIB env var.

- New lm-eval runner registered in benchmark registry
- _run_post_eval() in do_sweep.py handles EVAL_ONLY and RUN_EVAL modes
- LM_EVAL_WORKSPACE env var mounts host workspace at /lm-eval-workspace
- Topology/precision env vars passed through for metadata recording
- Documentation and comprehensive tests
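
For reference, a sketch of what the default CLI path amounts to. The `local-completions` model type and the flags shown are standard lm-evaluation-harness options; the function names, the gsm8k task, and the endpoint shape are illustrative assumptions.

```python
import shutil
import subprocess
import sys

def ensure_lm_eval() -> None:
    """Install the harness via pip only if the lm_eval CLI is missing."""
    if shutil.which("lm_eval") is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "lm-eval"])

def run_lm_eval_cli(model_name: str, base_url: str, concurrency: int) -> int:
    """Run lm_eval against an OpenAI-compatible completions endpoint."""
    ensure_lm_eval()
    cmd = [
        "lm_eval",
        "--model", "local-completions",
        "--model_args",
        f"model={model_name},base_url={base_url},num_concurrent={concurrency}",
        "--tasks", "gsm8k",                 # illustrative task choice
        "--output_path", "/logs/eval_results",
        "--log_samples",                    # emits sample*.jsonl alongside results*.json
    ]
    return subprocess.call(cmd)

# Usage (endpoint URL is an assumption about the deployment):
# run_lm_eval_cli("deepseek-r1", "http://frontend:8000/v1/completions", 63)
```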
@Oseltamivir
Contributor Author

Oseltamivir commented Apr 23, 2026

@xinli-sw

Hi, I made the changes requested in #12; the runner is more general now and no longer InferenceX-specific. Still tested with InferenceX: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24812985409
