
Add lm-eval benchmark runner for evals #41

Open

Oseltamivir wants to merge 1 commit into NVIDIA:main from Oseltamivir:lm-eval-main

Conversation

@Oseltamivir
Contributor

Summary

Add InferenceX multi-node eval support through an lm-eval benchmark runner and an eval-only orchestration path. This lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

Copied from ishandhanani/srt-slurm#245

How

  • Add an lm-eval benchmark runner that sources InferenceX's benchmarks/benchmark_lib.sh from a mounted /infmax-workspace.
  • Mount INFMAX_WORKSPACE into the container as /infmax-workspace when provided.
  • Add EVAL_ONLY=true handling in do_sweep.py so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch lm-eval directly.
  • Keep RUN_EVAL=true behavior as a post-benchmark eval path for normal throughput jobs.
  • Pass model/framework/topology metadata into the eval container, including served MODEL_NAME, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
  • Map srt-slurm PREFILL_DP_ATTN / DECODE_DP_ATTN env vars to the InferenceX PREFILL_DP_ATTENTION / DECODE_DP_ATTENTION names expected by append_lm_eval_summary (see the sketch after this list).
  • Copy eval outputs (meta_env.json, results*.json, sample*.jsonl) into /logs/eval_results/ for launcher-side artifact pickup.
  • Preserve partial eval artifacts on lm-eval failure while still returning the original eval failure code.
  • Document the InferenceX lm-eval integration in docs/accuracy.md.
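
A minimal Python sketch of the two runner-side steps above, the env-var aliasing and the artifact copy. Only the env var names, the file globs, and the /logs/eval_results/ path come from this PR; the function names and the eval_dir parameter are hypothetical.

```python
import glob
import os
import shutil

# srt-slurm name -> InferenceX name expected by append_lm_eval_summary
ENV_ALIASES = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}

def export_infmax_env() -> None:
    """Mirror srt-slurm topology env vars under the InferenceX names."""
    for src, dst in ENV_ALIASES.items():
        value = os.environ.get(src)
        if value is not None:
            os.environ[dst] = value

def collect_eval_artifacts(eval_dir: str, out_dir: str = "/logs/eval_results") -> None:
    """Copy lm-eval outputs to where the launcher picks up artifacts.

    Called on success and on failure, so partial artifacts survive an
    lm-eval crash while the original failure code is still returned.
    """
    os.makedirs(out_dir, exist_ok=True)
    for pattern in ("meta_env.json", "results*.json", "sample*.jsonl"):
        for path in glob.glob(os.path.join(eval_dir, pattern)):
            shutil.copy2(path, out_dir)
```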

What

For EVAL_ONLY=true:

  • srt-slurm still starts the normal deployment topology.
  • The throughput benchmark runner is skipped.
  • wait_for_model() verifies the configured prefill/decode or aggregated worker counts.
  • lm-eval runs against the OpenAI-compatible endpoint.
  • Eval failure is fatal.
  • A low score also leads to failure (see the sketch after the next list).

For RUN_EVAL=true without EVAL_ONLY=true:

  • The normal benchmark runs first.
  • lm-eval runs as a post-step if throughput succeeds.
  • Eval failure is non-fatal to the benchmark result.
  • A low score still leads to failure.
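
A hedged sketch of how the two modes could branch in do_sweep.py. The stub functions and the LOW_SCORE_RC convention are assumptions for illustration; only the env var semantics and the fatal/non-fatal rules above come from this PR.

```python
import logging
import os

log = logging.getLogger(__name__)

LOW_SCORE_RC = 2  # assumed convention: runner signals a below-threshold score

def wait_for_model() -> None:           # stub: full model health check
    ...

def run_throughput_benchmark() -> int:  # stub: normal benchmark stage
    return 0

def run_lm_eval() -> int:               # stub: lm-eval benchmark runner
    return 0

def _truthy(name: str) -> bool:
    return os.environ.get(name, "").lower() == "true"

def run_sweep_entry() -> int:
    wait_for_model()  # verifies prefill/decode or aggregated worker counts

    if _truthy("EVAL_ONLY"):
        return run_lm_eval()  # any eval failure, including a low score, is fatal

    rc = run_throughput_benchmark()
    if rc != 0 or not _truthy("RUN_EVAL"):
        return rc  # eval runs only as a post-step after a successful benchmark

    eval_rc = run_lm_eval()
    if eval_rc == LOW_SCORE_RC:
        return eval_rc  # a below-threshold score still fails the job
    if eval_rc != 0:
        # other lm-eval failures are non-fatal to the benchmark result
        log.warning("lm-eval failed with rc=%d; keeping benchmark result", eval_rc)
    return 0
```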

Validation run

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

SemiAnalysisAI/InferenceX#1000

Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 22, 2026
One-entry sweep config used by the manual workflow_dispatch that
validates NVIDIA/srt-slurm#41 on GB200. Mirrors the cheapest entry
from dsr1-fp8-gb200-dynamo-trt (8k1k stp, eval-conc=63, 1P/3D) so
the end-to-end eval path is exercised without running the full
gb200 sweep.

Not referenced by any automated workflow; picked up only when passed
explicitly via --config-files.
@Oseltamivir changed the title from "Add lm-eval benchmark runner for InferenceX evals" to "Add lm-eval benchmark runner for evals" on Apr 23, 2026

Integrate EleutherAI lm-evaluation-harness as a standalone benchmark
runner. The default path runs the lm_eval CLI directly against the
OpenAI-compatible endpoint (installing via pip if needed). An external
eval harness can optionally take over via LM_EVAL_WORKSPACE mount or
LM_EVAL_LIB env var.

- New lm-eval runner registered in benchmark registry
- _run_post_eval() in do_sweep.py handles EVAL_ONLY and RUN_EVAL modes
- LM_EVAL_WORKSPACE env var mounts host workspace at /lm-eval-workspace
- Topology/precision env vars passed through for metadata recording
- Documentation and comprehensive tests
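
For reference, a sketch of what the default CLI path amounts to. The `local-completions` model type and the flags shown are standard lm-evaluation-harness options; the function names, the gsm8k task, and the endpoint shape are illustrative assumptions.

```python
import shutil
import subprocess
import sys

def ensure_lm_eval() -> None:
    """Install the harness via pip only if the lm_eval CLI is missing."""
    if shutil.which("lm_eval") is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "lm-eval"])

def run_lm_eval_cli(model_name: str, base_url: str, concurrency: int) -> int:
    """Run lm_eval against an OpenAI-compatible completions endpoint."""
    ensure_lm_eval()
    cmd = [
        "lm_eval",
        "--model", "local-completions",
        "--model_args",
        f"model={model_name},base_url={base_url},num_concurrent={concurrency}",
        "--tasks", "gsm8k",                 # illustrative task choice
        "--output_path", "/logs/eval_results",
        "--log_samples",                    # emits sample*.jsonl alongside results*.json
    ]
    return subprocess.call(cmd)

# Usage (endpoint URL is an assumption about the deployment):
# run_lm_eval_cli("deepseek-r1", "http://frontend:8000/v1/completions", 63)
```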
@Oseltamivir
Contributor Author

Oseltamivir commented Apr 23, 2026

@xinli-sw

Hi, I made the changes requested in #12; the runner is more general now and no longer InferenceX-specific. Still tested with InferenceX: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24812985409
