
feat: add evaluation module (src/evaluation/) with uv run evaluate CLI #279

@ShuxinLin

Summary

Implement src/evaluation/ — a standalone module that consumes saved
agent trajectories and emits per-run JSON reports combining scoring
results and operational metrics. Exposes uv run evaluate as a new CLI
entry point.

Motivation

src/evaluation/ exists but is empty. Today we have all the inputs
(trajectories under AGENT_TRAJECTORY_DIR, scenarios under
src/scenarios/ and groundtruth/, graders in
aobench/scenario-server/grading/) but no glue for offline batch
evaluation or a standard report format.

Design

Follows the three-stage pattern used by SWE-bench, HELM, τ-bench:
agent run → evaluate (NEW) → reports. Re-grading from saved
trajectories is first-class.

Vocabulary follows MLflow's evaluator/scorer split (per reviewer
suggestion): an Evaluator orchestrates one or more Scorers. Scorers
fall into three families — Code-Based, LLM-As-Judge, Semantic-Score.

Layout

src/evaluation/
├── cli.py            # uv run evaluate
├── evaluator.py      # Evaluator — orchestration
├── runner.py         # functional wrapper
├── models.py         # Scenario, PersistedTrajectory, ScorerResult, EvalReport (pydantic)
├── loader.py         # join trajectories ↔ scenarios on scenario_id
├── scorers/
│   ├── code_based.py   # exact_string_match, numeric_match
│   ├── llm_judge.py    # 6-criterion rubric
│   └── semantic.py     # semantic_similarity
├── metrics.py        # ops rollups (tokens, latency, tool calls, cost)
└── report.py
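
To make the shapes concrete, the models in models.py might look
roughly like this. Only fields named elsewhere in this issue are
grounded; expected_answer is a hypothetical groundtruth field.

from pydantic import BaseModel


class Scenario(BaseModel):
    scenario_id: str
    question: str
    grading_method: str | None = None   # when set, overrides --scorer-default
    expected_answer: str | None = None  # hypothetical groundtruth field


class PersistedTrajectory(BaseModel):
    scenario_id: str
    run_id: str
    answer: str
    trajectory_text: str


class ScorerResult(BaseModel):
    scorer: str
    passed: bool
    score: float
    rationale: str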

Scorer contract

Each scorer is a pure callable (scenario, answer, trajectory_text)
-> ScorerResult, looked up in a registry keyed by grading_method. A
scenario's grading_method overrides the CLI --scorer-default.
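
A sketch of that contract, reusing the model sketch above. The
decorator/registry shape is illustrative, and the exact_string_match
body is a guess at the eventual implementation (#280 ships it as a
skeleton):

from typing import Callable

Scorer = Callable[[Scenario, str, str], ScorerResult]
SCORERS: dict[str, Scorer] = {}


def register(name: str) -> Callable[[Scorer], Scorer]:
    # illustrative registry wiring, keyed by grading_method
    def wrap(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return wrap


@register("exact_string_match")
def exact_string_match(scenario: Scenario, answer: str, trajectory_text: str) -> ScorerResult:
    passed = answer.strip() == (scenario.expected_answer or "").strip()
    return ScorerResult(scorer="exact_string_match", passed=passed,
                        score=1.0 if passed else 0.0,
                        rationale="exact comparison against groundtruth")


def pick_scorer(scenario: Scenario, scorer_default: str) -> Scorer:
    # a scenario's grading_method beats the CLI --scorer-default
    return SCORERS[scenario.grading_method or scorer_default]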

Output layout

reports/
├── <run_id>.json       # one ScenarioResult per trajectory
└── _aggregate.json     # EvalReport: totals, by_scenario_type, ops rollup

Per-run file fields: scenario_id, run_id, runner, model, question,
answer, grade (scorer name + passed + score + rationale + rubric
details), ops (turns, tool calls, tokens, duration, est_cost).
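
An illustrative per-run record with those fields; values are made up
and the exact field spellings are guesses:

{
  "scenario_id": "101",
  "run_id": "<run_id>",
  "runner": "...",
  "model": "litellm_proxy/aws/claude-opus-4-6",
  "question": "...",
  "answer": "...",
  "grade": {
    "scorer": "llm_judge",
    "passed": true,
    "score": 0.83,
    "rationale": "...",
    "rubric": {"...": "..."}
  },
  "ops": {
    "turns": 7,
    "tool_calls": 12,
    "tokens": 48213,
    "duration": 41.2,
    "est_cost": 0.31
  }
}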

CLI

uv run evaluate \
  --trajectories traces/trajectories \
  --scenarios groundtruth/101.json \
  --scorer-default llm_judge \
  --judge-model litellm_proxy/aws/claude-opus-4-6
# defaults to --reports-dir reports/
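
The entry point is wired through [project.scripts] in pyproject.toml;
the module path shown here is an assumption:

[project.scripts]
evaluate = "evaluation.cli:main"  # assumed module path for src/evaluation/cli.py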

Status (delivered in #280)

  • Pydantic models (Scenario, PersistedTrajectory, ScorerResult, OpsMetrics, EvalReport)
  • Loader: join trajectories ↔ scenarios on scenario_id
  • Evaluator class (MLflow-style orchestration)
  • LLM-As-Judge scorer (llm_judge) — six-criterion rubric, prompt mirrors src/tmp/evaluation_agent/result_evaluation_prompt.py
  • Ops metric rollups (tokens, duration p50/p95, tool calls, optional cost); a rollup sketch follows this list
  • Per-run JSON writer (reports/<run_id>.json) + aggregate (_aggregate.json)
  • Terminal summary table
  • evaluate script in pyproject.toml [project.scripts]
  • 40 unit tests
  • Docs: full reference in docs/evaluation.md, pointer from INSTRUCTIONS.md
  • Code-Based scorers (exact_string_match, numeric_match) — skeleton only (NotImplementedError, not auto-registered); to be filled in when needed
  • Semantic-Score scorer (semantic_similarity) — skeleton only; to be filled in when needed
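
For the duration rollup, a minimal stdlib-only sketch (assumes
per-trajectory durations have already been collected; the real
implementation in metrics.py may differ):

import statistics


def duration_rollup(durations_s: list[float]) -> dict[str, float]:
    # statistics.quantiles(n=20) returns 19 cut points at 5% steps:
    # index 9 is the 50th percentile, index 18 the 95th
    cuts = statistics.quantiles(durations_s, n=20)
    return {"p50": cuts[9], "p95": cuts[18], "total": sum(durations_s)}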

Follow-ups (not in #280)

  • Implement Code-Based scorers (exact_string_match, numeric_match) — fill in the skeletons + register
  • Implement Semantic-Score scorer (semantic_similarity) — pick an approach (embedding cosine, BLEU, difflib ratio, sentence-transformers) + register; a difflib sketch follows this list
  • Cost lookup table for all WatsonX / LiteLLM models
  • HTML report
  • pass^k / reliability metrics (τ-bench style — needs k-trial runs)
  • Consolidate aobench/scenario-server/grading/ into src/evaluation/scorers/
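
Of the listed approaches for semantic_similarity, a difflib ratio is
the cheapest; a sketch of that variant only, with an arbitrary
threshold and the same hypothetical expected_answer field as above:

import difflib


def semantic_similarity(scenario, answer: str, trajectory_text: str,
                        threshold: float = 0.8) -> ScorerResult:
    # threshold is arbitrary; embedding cosine or sentence-transformers
    # could replace this body without changing the Scorer contract
    expected = scenario.expected_answer or ""  # hypothetical field
    score = difflib.SequenceMatcher(None, answer.lower(), expected.lower()).ratio()
    return ScorerResult(scorer="semantic_similarity", passed=score >= threshold,
                        score=score, rationale=f"difflib ratio {score:.2f}")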

References

Conventions surveyed: SWE-bench, τ-bench, AgentBench, WebArena,
GAIA, AppWorld, HELM, OpenAI evals.
MLflow Scorer concept: https://mlflow.org/docs/latest/genai/concepts/scorers/

Labels: enhancement (New feature or request)