
Cherry-pick lm-eval benchmark runner from sa-submission-q2-2026 #122

Merged
ishandhanani merged 1 commit into main from ishan/cherry-pick-lm-eval on Apr 30, 2026
Conversation

@ishandhanani
Collaborator

Summary

  • Cherry-picks f61dbba (PR #12, "Add lm-eval benchmark runner for InferenceX evals", originally on sa-submission-q2-2026) onto main, so the lm-eval benchmark runner and the RUN_EVAL / EVAL_ONLY post-eval flow in do_sweep.py are available outside the SA submission branch (a rough sketch of that flow follows this list).
  • Motivation: recipes/dsv4-agg-disagg periodically merges from main but does not have the lm-eval logic, which is needed for SA CI to produce both speed and accuracy reports against the dsv4 recipes. Forward-porting to main lets dsv4 (and other recipe branches) pick it up on their next merge.
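
For context, a minimal sketch of how a RUN_EVAL / EVAL_ONLY gate around a post-eval step could look. The variable names and _run_post_eval come from this PR; everything else below (flag parsing, the surrounding sweep function, the speed-benchmark placeholder) is illustrative and is not the actual do_sweep.py code.

```python
import os


def _env_flag(name: str) -> bool:
    """Interpret an environment variable as a boolean flag."""
    return os.environ.get(name, "").strip().lower() in {"true", "1", "yes"}


def run_speed_benchmarks(recipe: str) -> None:
    # Placeholder for the existing speed-benchmark sweep (hypothetical name).
    print(f"running speed benchmarks for {recipe}")


def _run_post_eval(recipe: str) -> None:
    # Placeholder for the lm-eval accuracy step added by this PR.
    print(f"running lm-eval post-eval for {recipe}")


def run_sweep(recipe: str) -> None:
    run_eval = _env_flag("RUN_EVAL")
    eval_only = _env_flag("EVAL_ONLY")

    if not eval_only:
        run_speed_benchmarks(recipe)

    if run_eval or eval_only:
        # Accuracy evaluation runs after (or instead of) the speed sweep.
        _run_post_eval(recipe)


if __name__ == "__main__":
    run_sweep("dsv4-agg-disagg")
```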

Conflict resolution

The cherry-pick had conflicts in 5 files. All were independent additions on both sides and were resolved by keeping both:

  • docs/accuracy.md — kept AIME mention from main and lm-eval section from PR
  • src/srtctl/benchmarks/__init__.py — kept the custom/trace_replay registrations from main alongside the new lm_eval registration (see the sketch after this list)
  • src/srtctl/cli/do_sweep.py — kept HF model pre-cache helpers (main) alongside _run_post_eval (PR)
  • tests/test_benchmarks.py — kept TestTraceReplayRunner + TestCustomDatasetLoader (main) alongside TestLMEvalRunner + TestRunPostEval + TestSweepRunEvalIntegration (PR)
  • tests/test_configs.py — kept TestHuggingFaceModelSupport (main) alongside TestInfmaxWorkspaceMount (PR)
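
For the benchmarks/__init__.py conflict specifically, the resolved file roughly keeps both sets of registrations side by side. The sketch below is illustrative only: the submodule and runner class names are assumptions inferred from the benchmark names mentioned in this PR, not the actual srtctl source.

```python
# src/srtctl/benchmarks/__init__.py -- illustrative shape of the resolved merge.
# Submodule and class names are assumptions, not the real srtctl layout.

# Registrations that were already on main.
from .custom import CustomDatasetRunner
from .trace_replay import TraceReplayRunner

# Registration added by the cherry-picked lm-eval PR.
from .lm_eval import LMEvalRunner

__all__ = [
    "CustomDatasetRunner",
    "TraceReplayRunner",
    "LMEvalRunner",
]
```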

Test plan

  • uv run ruff check src/srtctl/ — passes
  • uv run pytest tests/test_benchmarks.py -v — 56/56 pass (incl. all TestLMEvalRunner, TestRunPostEval, TestSweepRunEvalIntegration)
  • uv run pytest tests/ — 657 passed, 2 skipped
  • Smoke-run a recipe with RUN_EVAL=true on a cluster

🤖 Generated with Claude Code

* Add lm-eval benchmark runner for InferenceX evals

Adds support for running lm-eval accuracy evaluations as a post-benchmark
step, leveraging the InferenceX benchmark_lib.sh harness.
ishandhanani merged commit 7e15601 into main on Apr 30, 2026
6 checks passed
