Cherry-pick lm-eval benchmark runner from sa-submission-q2-2026#122
Merged
ishandhanani merged 1 commit into main on Apr 30, 2026
Conversation
* Add lm-eval benchmark runner for InferenceX evals: adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX `benchmark_lib.sh` harness.
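As a rough illustration of the shape of such a post-benchmark step, here is a minimal Python sketch that shells out to the `lm_eval` CLI against an already-served OpenAI-compatible endpoint. The function name and signature are assumptions, not the actual srtctl code, and the real runner goes through the InferenceX `benchmark_lib.sh` harness rather than invoking `lm_eval` directly.

```python
# Illustrative sketch only: run_lm_eval and its signature are hypothetical,
# not the srtctl API. The real runner drives the InferenceX benchmark_lib.sh
# harness; this shows the equivalent direct lm-eval invocation.
import subprocess


def run_lm_eval(base_url: str, model: str, tasks: str, output_dir: str) -> int:
    """Run lm-eval against an OpenAI-compatible completions endpoint."""
    cmd = [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", f"model={model},base_url={base_url}",
        "--tasks", tasks,
        "--output_path", output_dir,
    ]
    # Return the exit code so the caller can record eval failures
    # without aborting the rest of the sweep.
    return subprocess.run(cmd, check=False).returncode
```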
Summary
Cherry-picks f61dbba (PR #12, "Add lm-eval benchmark runner for InferenceX evals", originally on sa-submission-q2-2026) onto main, so the lm-eval benchmark runner and the `RUN_EVAL`/`EVAL_ONLY` post-eval flow in `do_sweep.py` are available outside the SA submission branch. `recipes/dsv4-agg-disagg` periodically merges from main but does not have the lm-eval logic, which SA CI needs to produce both speed and accuracy reports against the dsv4 recipes. Forward-porting to main lets dsv4 (and other recipe branches) pick it up on their next merge.
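For readers unfamiliar with the flow: a minimal sketch of the env-var gating described above. Only the `RUN_EVAL` and `EVAL_ONLY` names come from this PR; the helper and callable names are assumptions, not the actual `do_sweep.py` code.

```python
# Sketch of the RUN_EVAL/EVAL_ONLY flow; only those two env var names come
# from the PR. The helper and callable names here are assumptions.
import os


def _truthy(name: str) -> bool:
    return os.environ.get(name, "").strip().lower() in {"1", "true", "yes"}


def run_sweep_with_post_eval(run_benchmark, run_post_eval) -> None:
    # EVAL_ONLY skips the speed sweep entirely; RUN_EVAL appends an
    # lm-eval accuracy pass after the normal benchmark completes.
    if not _truthy("EVAL_ONLY"):
        run_benchmark()
    if _truthy("RUN_EVAL") or _truthy("EVAL_ONLY"):
        run_post_eval()
```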
Conflict resolution

The cherry-pick had conflicts in 5 files. All were independent additions on both sides and were resolved by keeping both:
* `docs/accuracy.md`: kept the AIME mention from main and the `lm-eval` section from the PR
* `src/srtctl/benchmarks/__init__.py`: kept the `custom`/`trace_replay` registrations from main alongside `lm_eval` (see the sketch after this list)
* `src/srtctl/cli/do_sweep.py`: kept the HF model pre-cache helpers (main) alongside `_run_post_eval` (PR)
* `tests/test_benchmarks.py`: kept `TestTraceReplayRunner` + `TestCustomDatasetLoader` (main) alongside `TestLMEvalRunner` + `TestRunPostEval` + `TestSweepRunEvalIntegration` (PR)
* `tests/test_configs.py`: kept `TestHuggingFaceModelSupport` (main) alongside `TestInfmaxWorkspaceMount` (PR)
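For context, a plausible shape for the resolved `src/srtctl/benchmarks/__init__.py`, keeping both sides' registrations. The module layout, class names, and registry dict are guesses inferred from the test names above, not the actual file contents.

```python
# Hypothetical reconstruction of the merged registrations; module layout,
# class names, and the registry dict are assumptions inferred from test names.
from .custom import CustomDatasetRunner        # from main
from .trace_replay import TraceReplayRunner    # from main
from .lm_eval import LMEvalRunner              # from the cherry-picked PR

BENCHMARK_RUNNERS = {
    "custom": CustomDatasetRunner,
    "trace_replay": TraceReplayRunner,
    "lm_eval": LMEvalRunner,
}
```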
Test plan

* `uv run ruff check src/srtctl/`: passes
* `uv run pytest tests/test_benchmarks.py -v`: 56/56 pass (incl. all `TestLMEvalRunner`, `TestRunPostEval`, `TestSweepRunEvalIntegration`)
* `uv run pytest tests/`: 657 passed, 2 skipped
* `RUN_EVAL=true` on a cluster

🤖 Generated with Claude Code