feat(evaluation): offline evaluation module with uv run evaluate CLI #280

Open

ShuxinLin wants to merge 9 commits into main from feat/evaluation-module

Conversation

@ShuxinLin (Collaborator) commented Apr 27, 2026

Summary

Adds src/evaluation/ — an offline scorer for saved agent trajectories. Follows MLflow's evaluator/scorer split per reviewer feedback: an Evaluator dispatches to one of three scorer families.

| Family | Registered name | Status |
| --- | --- | --- |
| LLM-As-Judge | llm_judge | Wired; 6-criterion rubric mirroring src/tmp/evaluation_agent/result_evaluation_prompt.py |
| Code-Based | exact_string_match, numeric_match | Skeleton (NotImplementedError), not auto-registered |
| Semantic-Score | semantic_similarity | Skeleton, not auto-registered |
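For orientation, the dispatch underneath this table has roughly the following shape. This is a sketch only: the registry helpers and the Evaluator method shown here are illustrative, not the exact identifiers in src/evaluation/.

```python
from typing import Callable

ScorerFn = Callable[[dict, dict], dict]  # (scenario, trajectory) -> result payload

SCORERS: dict[str, ScorerFn] = {}  # only llm_judge is auto-registered on this branch

def register_scorer(name: str):
    """Make a scorer addressable by the name scenarios use."""
    def wrap(fn: ScorerFn) -> ScorerFn:
        SCORERS[name] = fn
        return fn
    return wrap

class Evaluator:
    def __init__(self, default_scoring_method: str):
        self.default = default_scoring_method

    def resolve(self, scenario: dict) -> ScorerFn:
        # scenario.scoring_method wins; --scorer-default is the fallback
        return SCORERS[scenario.get("scoring_method") or self.default]
```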

Interface

uv run evaluate \
  --trajectories <dir> --scenarios <file(s)> \
  --scorer-default llm_judge --judge-model <litellm_proxy/…>

Writes reports/<run_id>.json per trajectory and reports/_aggregate.json for the rollup.

Scenario field scoring_method selects the scorer per-scenario; --scorer-default is the fallback.
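For concreteness, a minimal scenario entry could look like the following. scenario_id and scoring_method are field names from this PR; the prompt key is illustrative (the prompt text itself is the groundtruth/101 case used in the test plan).

```json
{
  "scenario_id": "101",
  "prompt": "List all failure modes of asset Chiller.",
  "scoring_method": "llm_judge"
}
```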

Docs

docs/evaluation.md — schema, CLI, output layout, custom-scorer pattern, groundtruth loop. Pointer from INSTRUCTIONS.md.

Test plan

  • uv run pytest src/evaluation/ — 40 passed
  • uv run pytest src/ -k "not integration" — 310 passed
  • End-to-end against groundtruth/101.json with llm_judge — 6/6 rubric pass, reports written

Closes #279

Implement src/evaluation/ — consumes saved agent trajectories
({run_id}.json under AGENT_TRAJECTORY_DIR) and scenario files, joins
them on scenario_id, runs a registered grader per scenario, and emits
a JSON report combining grading results with operational metrics
(tokens, duration p50/p95, tool calls, optional cost estimate).

The shape follows SWE-bench / HELM / τ-bench conventions: agent run
→ evaluate → report.json, with offline re-grading from saved
trajectories as a first-class workflow.
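As a sketch, the core offline pass is the loop below, written in this commit's grader vocabulary (later renamed to scorer); the function and key names are assumptions, not the exact code.

```python
import json
from pathlib import Path

def evaluate(traj_dir: Path, scenarios: dict[str, dict],
             graders: dict, default: str) -> list[dict]:
    results = []
    for path in sorted(traj_dir.glob("*.json")):    # one {run_id}.json per run
        traj = json.loads(path.read_text())
        scenario = scenarios[traj["scenario_id"]]   # join on scenario_id
        grader = graders[scenario.get("grading_method", default)]
        results.append(grader(scenario, traj))
    return results
```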

Includes:
- Pydantic models (Scenario, PersistedTrajectory, GradeResult,
  OpsMetrics, EvalReport); see the sketch after this list
- Loader for trajectory dirs and JSON/JSONL scenario files
- Grader registry with two deterministic graders
  (exact_string_match, numeric_match) and a pluggable LLM judge
  bound to LLMBackend (six-criterion rubric)
- Per-task ops metric extraction (handles both SDK Trajectory and
  plan-execute list[StepResult] shapes) plus aggregate rollups
- Report writer with terminal summary and JSON output
- evaluate script registered in [project.scripts]
- 39 unit tests covering models, loader, graders, metrics, report,
  and end-to-end runner — all passing alongside existing 270 tests
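A rough sketch of two of those models. Apart from the fields this PR names explicitly (scenario_id, grading_method, and the passed/score pair on the grade result), the field set is assumed.

```python
from pydantic import BaseModel

class Scenario(BaseModel):
    scenario_id: str
    grading_method: str | None = None  # per-scenario grader override

class GradeResult(BaseModel):
    scenario_id: str
    grader: str
    passed: bool
    score: float
```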

Closes #279

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@DhavalRepo18 (Collaborator) commented Apr 28, 2026

https://mlflow.org/docs/latest/genai/concepts/scorers/ Please use these concepts and prefer to use Scorer:

  • Evaluator has multiple Scorer
    • LLM-As-Judge
    • Semantic-Score
    • Code-Based

ShuxinLin added 6 commits May 13, 2026 11:54
# Conflicts:
#	pyproject.toml

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Reviewer feedback (PR #280): align with MLflow's evaluator/scorer split.

- Rename src/evaluation/graders/ -> src/evaluation/scorers/ and organise
  by family: code_based (exact/numeric), llm_judge (LLM-As-Judge),
  semantic (new).
- Rename GradeResult -> ScorerResult with field `scorer` (Scenario
  input field `grading_method` unchanged — input contract preserved).
- Add `Evaluator` class as the orchestration entry point; functional
  `evaluate()` now delegates to it.
- Add Semantic-Score scorer using difflib.SequenceMatcher (stdlib only,
  no extra deps); threshold overridable via scenario.similarity_threshold.
- CLI: add --scorer-default (keeps --grader-default as alias).

Tests: 7 new (4 semantic + 2 evaluator + 1 registry); 46 total in
src/evaluation/, full suite 316 passed.
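A minimal sketch of that scorer, assuming a default threshold (the actual default lives in the module; scenario.similarity_threshold overrides it):

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD = 0.8  # assumed default, not the module's actual value

def semantic_similarity(expected: str, actual: str,
                        threshold: float | None = None) -> tuple[float, bool]:
    score = SequenceMatcher(None, expected, actual).ratio()
    cutoff = DEFAULT_THRESHOLD if threshold is None else threshold
    return score, score >= cutoff
```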

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Code-Based scorers (exact_string_match, numeric_match) are now
skeleton stubs raising NotImplementedError; the family slot in the
Evaluator/Scorer taxonomy is preserved but implementations are
deferred. Registry no longer auto-registers them on import.

Tests for the removed behavior are deleted; a new TestCodeBasedSkeletons
asserts NotImplementedError. The test_runner and test_evaluator override
tests are re-pointed at semantic_similarity. 41 evaluation tests pass.
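Concretely, the skeleton-plus-test pattern looks like this (illustrative):

```python
import pytest

def exact_string_match(scenario, trajectory):
    raise NotImplementedError("Code-Based scorers are deferred on this branch")

def test_exact_string_match_is_skeleton():
    with pytest.raises(NotImplementedError):
        exact_string_match({}, {})
```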

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…warg

- LLM-As-Judge prompt now mirrors src/tmp/evaluation_agent/result_evaluation_prompt.py:
  full 6-criterion rubric text, split Agent's Thinking vs Final Response,
  output schema uses `suggestions` (back-compat: `reason` still accepted),
  parser strips "(END OF RESPONSE)" sentinel.
- CLI: LiteLLMBackend takes `model_id=`, not `model=`. Fixes:
    TypeError: LiteLLMBackend.__init__() got an unexpected keyword argument 'model'
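The parsing change amounts to roughly this (a sketch; the helper name and exact cleanup steps are assumptions):

```python
import json

SENTINEL = "(END OF RESPONSE)"

def parse_judge_output(raw: str) -> dict:
    cleaned = raw.replace(SENTINEL, "").strip()
    data = json.loads(cleaned)
    if "suggestions" not in data and "reason" in data:  # back-compat alias
        data["suggestions"] = data.pop("reason")
    return data
```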

Verified end-to-end: claude-agent on groundtruth/101 ("List all failure
modes of asset Chiller.") → uv run evaluate with --scorer-default
llm_judge → 6/6 criteria pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
- CLI: --output FILE replaced with --reports-dir DIR (default reports/).
  Writes one JSON per result (named by trajectory run_id, which is a
  UUID) plus _aggregate.json for the rollup.
- ScenarioResult now carries run_id (propagated from PersistedTrajectory).
- New report.write_reports_dir(); falls back to scenario-<id>.json for
  legacy trajectories with no run_id.
- 2 new tests; 43 evaluation tests pass.
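Sketch of the naming rule (write_reports_dir is the name from this commit; the body is illustrative):

```python
import json
from pathlib import Path

def write_reports_dir(results: list[dict], aggregate: dict, reports_dir: Path) -> None:
    reports_dir.mkdir(parents=True, exist_ok=True)
    for r in results:
        stem = r.get("run_id") or f"scenario-{r['scenario_id']}"  # legacy fallback
        (reports_dir / f"{stem}.json").write_text(json.dumps(r, indent=2))
    (reports_dir / "_aggregate.json").write_text(json.dumps(aggregate, indent=2))
```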

Verified: uv run evaluate against groundtruth/101.json wrote
reports/112c1b56-...json + reports/_aggregate.json.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
- Strip src/evaluation/scorers/semantic.py to a NotImplementedError
  skeleton; no longer auto-registered. Code-Based + Semantic-Score
  families now both ship as slot-only placeholders; LLM-As-Judge is
  the only working scorer in this branch.
- Tests: TestSemanticSimilarity collapsed to a NotImplementedError
  assertion; runner/evaluator override tests pivot to local stub
  scorers (no skeleton dependency).
- INSTRUCTIONS.md: new Evaluation section linking to the full doc.
- docs/evaluation.md: scenario/trajectory schema, CLI reference, report
  layout, scorer families table, custom-scorer plug-in pattern, loop
  over groundtruth/*.json.

40 evaluation tests pass; full suite 310 passed.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@DhavalRepo18 (Collaborator) left a comment

We want to use Evaluator and Scorer (universal terminology).

ShuxinLin added 2 commits May 13, 2026 15:54
Field and identifier renames so the module speaks a single vocabulary:

- Scenario.grading_method      -> Scenario.scoring_method
- ScenarioResult.grade         -> ScenarioResult.score (typed ScorerResult)
- runner.default_grading_method -> runner.default_scoring_method
- CLI: --grader-default legacy alias removed (only --scorer-default)
- report.totals["graded"]      -> report.totals["scored"]
- Docstrings/comments/docs: "grading"/"graded"/"grader" -> "scoring"/"scored"/"scorer"

Tests, INSTRUCTIONS.md, docs/evaluation.md updated.

Note: the inner numeric ScorerResult.score is unchanged; access the
numeric value as result.score.score and the boolean as result.score.passed.
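In other words (attribute names from this PR; the surrounding loop is illustrative):

```python
for result in results:              # each result is a ScenarioResult
    numeric = result.score.score    # inner numeric score (unchanged)
    passed = result.score.passed    # boolean pass/fail
```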

40 evaluation tests pass; full suite 310 passed; end-to-end against
groundtruth/101.json still emits 6/6 rubric pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…dd aggregate JSON shape

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>