Summary
Implement src/evaluation/ — a standalone module that consumes saved agent trajectories and emits per-run JSON reports combining scoring results and operational metrics. Exposes uv run evaluate as a new CLI entry point.
Motivation
src/evaluation/ exists but is empty. Today we have all the inputs (trajectories under AGENT_TRAJECTORY_DIR, scenarios under src/scenarios/ and groundtruth/, graders in aobench/scenario-server/grading/) but no glue for offline batch evaluation or a standard report format.
Design
Follows the three-stage pattern used by SWE-bench, HELM, and τ-bench: agent run → evaluate (NEW) → reports. Re-grading from saved trajectories is first-class.
Vocabulary follows MLflow's evaluator/scorer split (per reviewer suggestion): an Evaluator orchestrates one or more Scorers. Scorers fall into three families — Code-Based, LLM-As-Judge, Semantic-Score.
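To make the split concrete, here is a minimal sketch of the orchestration. Evaluator, Scorer, and ScorerResult are names from this issue; every field and method body below is illustrative only, not the shipped code.

```python
# Sketch of the evaluator/scorer split described above: the Evaluator fans
# one saved run out to its configured Scorers and collects their results.
# The ScorerResult fields here are assumptions, not the final schema.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ScorerResult:
    scorer: str
    passed: bool
    score: float
    rationale: str


# A Scorer is a pure callable: (scenario, answer, trajectory_text) -> ScorerResult.
Scorer = Callable[[dict, str, str], ScorerResult]


class Evaluator:
    """Runs one or more Scorers over a single saved run."""

    def __init__(self, scorers: Sequence[Scorer]):
        self.scorers = list(scorers)

    def evaluate(self, scenario: dict, answer: str, trajectory_text: str) -> list[ScorerResult]:
        return [score(scenario, answer, trajectory_text) for score in self.scorers]
```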
Layout
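The layout itself was not captured here; as one plausible arrangement, inferred from the Status section below (every file name in this tree is an assumption):

```text
src/evaluation/
  models.py       # Scenario, PersistedTrajectory, ScorerResult, OpsMetrics, EvalReport
  evaluator.py    # Evaluator orchestration
  scorers/
    llm_judge.py      # LLM-as-judge scorer
    code_based.py     # exact_string_match, numeric_match (skeletons)
    semantic.py       # semantic_similarity (skeleton)
  reports.py      # per-run + aggregate JSON writers
  cli.py          # uv run evaluate entry point
```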
Scorer contract
Pure callable (scenario, answer, trajectory_text) -> ScorerResult. Registry keyed by grading_method. Scenario grading_method overrides the CLI --scorer-default.
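A sketch of what that registry could look like, reusing the ScorerResult shape from the Design sketch above. SCORER_REGISTRY, register_scorer, and resolve_scorer are hypothetical names.

```python
from typing import Callable

# A scorer is a pure callable (scenario, answer, trajectory_text) -> ScorerResult.
ScorerFn = Callable[[dict, str, str], "ScorerResult"]

# Registry keyed by grading_method (names are hypothetical throughout).
SCORER_REGISTRY: dict[str, ScorerFn] = {}


def register_scorer(grading_method: str):
    """Decorator that files a scorer under its grading_method key."""
    def _register(fn: ScorerFn) -> ScorerFn:
        SCORER_REGISTRY[grading_method] = fn
        return fn
    return _register


def resolve_scorer(scenario: dict, scorer_default: str) -> ScorerFn:
    # The scenario's grading_method, if set, overrides the CLI --scorer-default.
    method = scenario.get("grading_method") or scorer_default
    return SCORER_REGISTRY[method]
```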
Output layout
Per-run file fields: scenario_id, run_id, runner, model, question, answer, grade (scorer name + passed + score + rationale), ops (turns, tool calls, tokens, duration, est_cost).
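As an illustration only, a per-run report with those fields might look like the following; every value is invented, and only the field names come from the list above.

```python
# Example per-run report payload; all values below are made up.
example_report = {
    "scenario_id": "101",
    "run_id": "2025-01-01T12-00-00_101",
    "runner": "agent",
    "model": "claude-opus-4-6",
    "question": "...",
    "answer": "...",
    "grade": {
        "scorer": "llm_judge",
        "passed": True,
        "score": 0.83,
        "rationale": "...",
    },
    "ops": {
        "turns": 12,
        "tool_calls": 34,
        "tokens": 45210,
        "duration": 187.4,
        "est_cost": 0.92,
    },
}
```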
CLI
uv run evaluate \
  --trajectories traces/trajectories \
  --scenarios groundtruth/101.json \
  --scorer-default llm_judge \
  --judge-model litellm_proxy/aws/claude-opus-4-6
# defaults to --reports-dir reports/
Status (delivered in #280)
- Data models (Scenario, PersistedTrajectory, ScorerResult, OpsMetrics, EvalReport)
- Trajectory loading keyed by scenario_id
- Evaluator class (MLflow-style orchestration)
- LLM-as-judge scorer (llm_judge) — six-criterion rubric, prompt mirrors src/tmp/evaluation_agent/result_evaluation_prompt.py
- Per-run reports (reports/<run_id>.json) + aggregate (_aggregate.json)
- evaluate script in pyproject.toml [project.scripts]
- Docs: docs/evaluation.md, pointer from INSTRUCTIONS.md
- Code-based scorers (exact_string_match, numeric_match) — skeleton only (NotImplementedError, not auto-registered); to be filled in when needed
- Semantic scorer (semantic_similarity) — skeleton only; to be filled in when needed
Follow-ups (not in #280)
- Code-based scorers (exact_string_match, numeric_match) — fill in the skeletons + register
- Semantic scorer (semantic_similarity) — pick an approach (embedding cosine, BLEU, difflib ratio, sentence-transformers) + register; see the sketch below
- Migrate the existing graders in aobench/scenario-server/grading/ into src/evaluation/scorers/
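Of the approaches listed for the semantic scorer, a difflib ratio is the cheapest to try first. A rough sketch, reusing the hypothetical ScorerResult and register_scorer names from the earlier sketches; the expected_answer field and the 0.8 threshold are assumptions.

```python
import difflib


@register_scorer("semantic_similarity")
def semantic_similarity(scenario: dict, answer: str, trajectory_text: str) -> "ScorerResult":
    """Compare the answer to the scenario's expected answer with a difflib ratio."""
    expected = scenario.get("expected_answer", "")  # field name is an assumption
    ratio = difflib.SequenceMatcher(
        None, expected.strip().lower(), answer.strip().lower()
    ).ratio()
    return ScorerResult(
        scorer="semantic_similarity",
        passed=ratio >= 0.8,  # arbitrary cutoff, not from the issue
        score=ratio,
        rationale=f"difflib ratio {ratio:.2f} against the expected answer",
    )
```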
References
Conventions surveyed: SWE-bench, τ-bench, AgentBench, WebArena, GAIA, AppWorld, HELM, OpenAI evals.
MLflow Scorer concept: https://mlflow.org/docs/latest/genai/concepts/scorers/