feat(evaluation): offline evaluation module with uv run evaluate CLI #280

Open

ShuxinLin wants to merge 9 commits into main from feat/evaluation-module

Conversation

@ShuxinLin (Collaborator) commented Apr 27, 2026

Summary

Adds src/evaluation/ — an offline scorer for saved agent trajectories. Follows MLflow's evaluator/scorer split per reviewer feedback: an Evaluator dispatches to one of three scorer families.

| Family | Registered name | Status |
| --- | --- | --- |
| LLM-As-Judge | llm_judge | Wired; 6-criterion rubric mirroring src/tmp/evaluation_agent/result_evaluation_prompt.py |
| Code-Based | exact_string_match, numeric_match | Skeleton (NotImplementedError), not auto-registered |
| Semantic-Score | semantic_similarity | Skeleton, not auto-registered |
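For orientation, the dispatch underneath this table has roughly the following shape. This is a sketch only: the registry helpers and the Evaluator method shown here are illustrative, not the exact identifiers in src/evaluation/.

```python
from typing import Callable

ScorerFn = Callable[[dict, dict], dict]  # (scenario, trajectory) -> result payload

SCORERS: dict[str, ScorerFn] = {}  # only llm_judge is auto-registered on this branch

def register_scorer(name: str):
    """Make a scorer addressable by the name scenarios use."""
    def wrap(fn: ScorerFn) -> ScorerFn:
        SCORERS[name] = fn
        return fn
    return wrap

class Evaluator:
    def __init__(self, default_scoring_method: str):
        self.default = default_scoring_method

    def resolve(self, scenario: dict) -> ScorerFn:
        # scenario.scoring_method wins; --scorer-default is the fallback
        return SCORERS[scenario.get("scoring_method") or self.default]
```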

Interface

uv run evaluate \
  --trajectories <dir> --scenarios <file(s)> \
  --scorer-default llm_judge --judge-model <litellm_proxy/…>

Writes reports/<run_id>.json per trajectory and reports/_aggregate.json for the rollup.

Scenario field scoring_method selects the scorer per-scenario; --scorer-default is the fallback.
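For concreteness, a minimal scenario entry could look like the following. scenario_id and scoring_method are field names from this PR; the prompt key is illustrative (the prompt text itself is the groundtruth/101 case used in the test plan).

```json
{
  "scenario_id": "101",
  "prompt": "List all failure modes of asset Chiller.",
  "scoring_method": "llm_judge"
}
```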

Docs

docs/evaluation.md — schema, CLI, output layout, custom-scorer pattern, groundtruth loop. Pointer from INSTRUCTIONS.md.

Test plan

  • uv run pytest src/evaluation/ — 40 passed
  • uv run pytest src/ -k "not integration" — 310 passed
  • End-to-end against groundtruth/101.json with llm_judge — 6/6 rubric pass, reports written

Closes #279

Implement src/evaluation/ — consumes saved agent trajectories
({run_id}.json under AGENT_TRAJECTORY_DIR) and scenario files, joins
them on scenario_id, runs a registered grader per scenario, and emits
a JSON report combining grading results with operational metrics
(tokens, duration p50/p95, tool calls, optional cost estimate).

The shape follows SWE-bench / HELM / τ-bench conventions: agent run
→ evaluate → report.json, with offline re-grading from saved
trajectories as a first-class workflow.
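As a sketch, the core offline pass is the loop below, written in this commit's grader vocabulary (later renamed to scorer); the function and key names are assumptions, not the exact code.

```python
import json
from pathlib import Path

def evaluate(traj_dir: Path, scenarios: dict[str, dict],
             graders: dict, default: str) -> list[dict]:
    results = []
    for path in sorted(traj_dir.glob("*.json")):    # one {run_id}.json per run
        traj = json.loads(path.read_text())
        scenario = scenarios[traj["scenario_id"]]   # join on scenario_id
        grader = graders[scenario.get("grading_method", default)]
        results.append(grader(scenario, traj))
    return results
```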

Includes:
- Pydantic models (Scenario, PersistedTrajectory, GradeResult,
  OpsMetrics, EvalReport); see the sketch after this list
- Loader for trajectory dirs and JSON/JSONL scenario files
- Grader registry with two deterministic graders
  (exact_string_match, numeric_match) and a pluggable LLM judge
  bound to LLMBackend (six-criterion rubric)
- Per-task ops metric extraction (handles both SDK Trajectory and
  plan-execute list[StepResult] shapes) plus aggregate rollups
- Report writer with terminal summary and JSON output
- evaluate script registered in [project.scripts]
- 39 unit tests covering models, loader, graders, metrics, report,
  and end-to-end runner — all passing alongside existing 270 tests
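A rough sketch of two of those models. Apart from the fields this PR names explicitly (scenario_id, grading_method, and the passed/score pair on the grade result), the field set is assumed.

```python
from pydantic import BaseModel

class Scenario(BaseModel):
    scenario_id: str
    grading_method: str | None = None  # per-scenario grader override

class GradeResult(BaseModel):
    scenario_id: str
    grader: str
    passed: bool
    score: float
```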

Closes #279

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@DhavalRepo18 (Collaborator) commented Apr 28, 2026

https://mlflow.org/docs/latest/genai/concepts/scorers/ Please use these concepts and prefer to use Scorer:

  • Evaluator has multiple Scorer
    • LLM-As-Judge
    • Semantic-Score
    • Code-Based

ShuxinLin added 6 commits May 13, 2026 11:54
# Conflicts:
#	pyproject.toml

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Reviewer feedback (PR #280): align with MLflow's evaluator/scorer split.

- Rename src/evaluation/graders/ -> src/evaluation/scorers/ and organise
  by family: code_based (exact/numeric), llm_judge (LLM-As-Judge),
  semantic (new).
- Rename GradeResult -> ScorerResult with field `scorer` (Scenario
  input field `grading_method` unchanged — input contract preserved).
- Add `Evaluator` class as the orchestration entry point; functional
  `evaluate()` now delegates to it.
- Add Semantic-Score scorer using difflib.SequenceMatcher (stdlib only,
  no extra deps); threshold overridable via scenario.similarity_threshold.
- CLI: add --scorer-default (keeps --grader-default as alias).

Tests: 7 new (4 semantic + 2 evaluator + 1 registry); 46 total in
src/evaluation/, full suite 316 passed.
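A minimal sketch of that scorer, assuming a default threshold (the actual default lives in the module; scenario.similarity_threshold overrides it):

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD = 0.8  # assumed default, not the module's actual value

def semantic_similarity(expected: str, actual: str,
                        threshold: float | None = None) -> tuple[float, bool]:
    score = SequenceMatcher(None, expected, actual).ratio()
    cutoff = DEFAULT_THRESHOLD if threshold is None else threshold
    return score, score >= cutoff
```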

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Code-Based scorers (exact_string_match, numeric_match) are now
skeleton stubs raising NotImplementedError; the family slot in the
Evaluator/Scorer taxonomy is preserved but implementations are
deferred. Registry no longer auto-registers them on import.

Tests for the removed behavior are deleted; a new TestCodeBasedSkeletons
asserts NotImplementedError. The test_runner and test_evaluator override
tests are re-pointed at semantic_similarity. 41 evaluation tests pass.
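Concretely, the skeleton-plus-test pattern looks like this (illustrative):

```python
import pytest

def exact_string_match(scenario, trajectory):
    raise NotImplementedError("Code-Based scorers are deferred on this branch")

def test_exact_string_match_is_skeleton():
    with pytest.raises(NotImplementedError):
        exact_string_match({}, {})
```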

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…warg

- LLM-As-Judge prompt now mirrors src/tmp/evaluation_agent/result_evaluation_prompt.py:
  full 6-criterion rubric text, split Agent's Thinking vs Final Response,
  output schema uses `suggestions` (back-compat: `reason` still accepted),
  parser strips "(END OF RESPONSE)" sentinel.
- CLI: LiteLLMBackend takes `model_id=`, not `model=`. Fixes:
    TypeError: LiteLLMBackend.__init__() got an unexpected keyword argument 'model'
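The parsing change amounts to roughly this (a sketch; the helper name and exact cleanup steps are assumptions):

```python
import json

SENTINEL = "(END OF RESPONSE)"

def parse_judge_output(raw: str) -> dict:
    cleaned = raw.replace(SENTINEL, "").strip()
    data = json.loads(cleaned)
    if "suggestions" not in data and "reason" in data:  # back-compat alias
        data["suggestions"] = data.pop("reason")
    return data
```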

Verified end-to-end: claude-agent on groundtruth/101 ("List all failure
modes of asset Chiller.") → uv run evaluate with --scorer-default
llm_judge → 6/6 criteria pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
- CLI: --output FILE replaced with --reports-dir DIR (default reports/).
  Writes one JSON per result (named by trajectory run_id, which is a
  UUID) plus _aggregate.json for the rollup.
- ScenarioResult now carries run_id (propagated from PersistedTrajectory).
- New report.write_reports_dir(); falls back to scenario-<id>.json for
  legacy trajectories with no run_id.
- 2 new tests; 43 evaluation tests pass.
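Sketch of the naming rule (write_reports_dir is the name from this commit; the body is illustrative):

```python
import json
from pathlib import Path

def write_reports_dir(results: list[dict], aggregate: dict, reports_dir: Path) -> None:
    reports_dir.mkdir(parents=True, exist_ok=True)
    for r in results:
        stem = r.get("run_id") or f"scenario-{r['scenario_id']}"  # legacy fallback
        (reports_dir / f"{stem}.json").write_text(json.dumps(r, indent=2))
    (reports_dir / "_aggregate.json").write_text(json.dumps(aggregate, indent=2))
```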

Verified: uv run evaluate against groundtruth/101.json wrote
reports/112c1b56-...json + reports/_aggregate.json.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
- Strip src/evaluation/scorers/semantic.py to a NotImplementedError
  skeleton; no longer auto-registered. Code-Based + Semantic-Score
  families now both ship as slot-only placeholders; LLM-As-Judge is
  the only working scorer in this branch.
- Tests: TestSemanticSimilarity collapsed to a NotImplementedError
  assertion; runner/evaluator override tests pivot to local stub
  scorers (no skeleton dependency).
- INSTRUCTIONS.md: new Evaluation section linking to the full doc.
- docs/evaluation.md: scenario/trajectory schema, CLI reference, report
  layout, scorer families table, custom-scorer plug-in pattern, loop
  over groundtruth/*.json.

40 evaluation tests pass; full suite 310 passed.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@DhavalRepo18 (Collaborator) left a comment

We want to use Evaluator and Scorer (universal terminology).

ShuxinLin added 2 commits May 13, 2026 15:54
Field and identifier renames so the module speaks a single vocabulary:

- Scenario.grading_method      -> Scenario.scoring_method
- ScenarioResult.grade         -> ScenarioResult.score (typed ScorerResult)
- runner.default_grading_method -> runner.default_scoring_method
- CLI: --grader-default legacy alias removed (only --scorer-default)
- report.totals["graded"]      -> report.totals["scored"]
- Docstrings/comments/docs: "grading"/"graded"/"grader" -> "scoring"/"scored"/"scorer"

Tests, INSTRUCTIONS.md, docs/evaluation.md updated.

Note: the inner numeric ScorerResult.score is unchanged; access the
numeric value as result.score.score and the boolean as result.score.passed.
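In other words (attribute names from this PR; the surrounding loop is illustrative):

```python
for result in results:              # each result is a ScenarioResult
    numeric = result.score.score    # inner numeric score (unchanged)
    passed = result.score.passed    # boolean pass/fail
```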

40 evaluation tests pass; full suite 310 passed; end-to-end against
groundtruth/101.json still emits 6/6 rubric pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…dd aggregate JSON shape

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>