
feat: add evaluation module (src/evaluation/) with uv run evaluate CLI #279

@ShuxinLin

Summary

Implement src/evaluation/ — a standalone module that consumes saved
agent trajectories and emits per-run JSON reports combining scoring
results and operational metrics. Exposes uv run evaluate as a new CLI
entry point.

Motivation

src/evaluation/ exists but is empty. Today we have all the inputs
(trajectories under AGENT_TRAJECTORY_DIR, scenarios under
src/scenarios/ and groundtruth/, graders in
aobench/scenario-server/grading/) but no glue for offline batch
evaluation or a standard report format.

Design

Follows the three-stage pattern used by SWE-bench, HELM, τ-bench:
agent run → evaluate (NEW) → reports. Re-grading from saved
trajectories is first-class.

Vocabulary follows MLflow's evaluator/scorer split (per reviewer
suggestion): an Evaluator orchestrates one or more Scorers. Scorers
fall into three families — Code-Based, LLM-As-Judge, Semantic-Score.

Layout

src/evaluation/
├── cli.py            # uv run evaluate
├── evaluator.py      # Evaluator — orchestration
├── runner.py         # functional wrapper
├── models.py         # Scenario, PersistedTrajectory, ScorerResult, EvalReport (pydantic)
├── loader.py         # join trajectories ↔ scenarios on scenario_id
├── scorers/
│   ├── code_based.py   # exact_string_match, numeric_match
│   ├── llm_judge.py    # 6-criterion rubric
│   └── semantic.py     # semantic_similarity
├── metrics.py        # ops rollups (tokens, latency, tool calls, cost)
└── report.py
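
To make the shapes concrete, the models in models.py might look
roughly like this. Only fields named elsewhere in this issue are
grounded; expected_answer is a hypothetical groundtruth field.

from pydantic import BaseModel


class Scenario(BaseModel):
    scenario_id: str
    question: str
    grading_method: str | None = None   # when set, overrides --scorer-default
    expected_answer: str | None = None  # hypothetical groundtruth field


class PersistedTrajectory(BaseModel):
    scenario_id: str
    run_id: str
    answer: str
    trajectory_text: str


class ScorerResult(BaseModel):
    scorer: str
    passed: bool
    score: float
    rationale: str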

Scorer contract

Each scorer is a pure callable (scenario, answer, trajectory_text)
-> ScorerResult, looked up in a registry keyed by grading_method. A
scenario's grading_method overrides the CLI --scorer-default.
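
A sketch of that contract, reusing the model sketch above. The
decorator/registry shape is illustrative, and the exact_string_match
body is a guess at the eventual implementation (#280 ships it as a
skeleton):

from typing import Callable

Scorer = Callable[[Scenario, str, str], ScorerResult]
SCORERS: dict[str, Scorer] = {}


def register(name: str) -> Callable[[Scorer], Scorer]:
    # illustrative registry wiring, keyed by grading_method
    def wrap(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return wrap


@register("exact_string_match")
def exact_string_match(scenario: Scenario, answer: str, trajectory_text: str) -> ScorerResult:
    passed = answer.strip() == (scenario.expected_answer or "").strip()
    return ScorerResult(scorer="exact_string_match", passed=passed,
                        score=1.0 if passed else 0.0,
                        rationale="exact comparison against groundtruth")


def pick_scorer(scenario: Scenario, scorer_default: str) -> Scorer:
    # a scenario's grading_method beats the CLI --scorer-default
    return SCORERS[scenario.grading_method or scorer_default]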

Output layout

reports/
├── <run_id>.json       # one ScenarioResult per trajectory
└── _aggregate.json     # EvalReport: totals, by_scenario_type, ops rollup

Per-run file fields: scenario_id, run_id, runner, model, question,
answer, grade (scorer name + passed + score + rationale + rubric
details), ops (turns, tool calls, tokens, duration, est_cost).
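
An illustrative per-run record with those fields; values are made up
and the exact field spellings are guesses:

{
  "scenario_id": "101",
  "run_id": "<run_id>",
  "runner": "...",
  "model": "litellm_proxy/aws/claude-opus-4-6",
  "question": "...",
  "answer": "...",
  "grade": {
    "scorer": "llm_judge",
    "passed": true,
    "score": 0.83,
    "rationale": "...",
    "rubric": {"...": "..."}
  },
  "ops": {
    "turns": 7,
    "tool_calls": 12,
    "tokens": 48213,
    "duration": 41.2,
    "est_cost": 0.31
  }
}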

CLI

uv run evaluate \
  --trajectories traces/trajectories \
  --scenarios groundtruth/101.json \
  --scorer-default llm_judge \
  --judge-model litellm_proxy/aws/claude-opus-4-6
# defaults to --reports-dir reports/
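
The entry point is wired through [project.scripts] in pyproject.toml;
the module path shown here is an assumption:

[project.scripts]
evaluate = "evaluation.cli:main"  # assumed module path for src/evaluation/cli.py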

Status (delivered in #280)

  • Pydantic models (Scenario, PersistedTrajectory, ScorerResult, OpsMetrics, EvalReport)
  • Loader: join trajectories ↔ scenarios on scenario_id
  • Evaluator class (MLflow-style orchestration)
  • LLM-As-Judge scorer (llm_judge) — six-criterion rubric, prompt mirrors src/tmp/evaluation_agent/result_evaluation_prompt.py
  • Ops metric rollups (tokens, duration p50/p95, tool calls, optional cost); a rollup sketch follows this list
  • Per-run JSON writer (reports/<run_id>.json) + aggregate (_aggregate.json)
  • Terminal summary table
  • evaluate script in pyproject.toml [project.scripts]
  • 40 unit tests
  • Docs: full reference in docs/evaluation.md, pointer from INSTRUCTIONS.md
  • Code-Based scorers (exact_string_match, numeric_match) — skeleton only (NotImplementedError, not auto-registered); to be filled in when needed
  • Semantic-Score scorer (semantic_similarity) — skeleton only; to be filled in when needed
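
For the duration rollup, a minimal stdlib-only sketch (assumes
per-trajectory durations have already been collected; the real
implementation in metrics.py may differ):

import statistics


def duration_rollup(durations_s: list[float]) -> dict[str, float]:
    # statistics.quantiles(n=20) returns 19 cut points at 5% steps:
    # index 9 is the 50th percentile, index 18 the 95th
    cuts = statistics.quantiles(durations_s, n=20)
    return {"p50": cuts[9], "p95": cuts[18], "total": sum(durations_s)}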

Follow-ups (not in #280)

  • Implement Code-Based scorers (exact_string_match, numeric_match) — fill in the skeletons + register
  • Implement Semantic-Score scorer (semantic_similarity) — pick an approach (embedding cosine, BLEU, difflib ratio, sentence-transformers) + register; a difflib sketch follows this list
  • Cost lookup table for all WatsonX / LiteLLM models
  • HTML report
  • pass^k / reliability metrics (τ-bench style — needs k-trial runs)
  • Consolidate aobench/scenario-server/grading/ into src/evaluation/scorers/
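
Of the listed approaches for semantic_similarity, a difflib ratio is
the cheapest; a sketch of that variant only, with an arbitrary
threshold and the same hypothetical expected_answer field as above:

import difflib


def semantic_similarity(scenario, answer: str, trajectory_text: str,
                        threshold: float = 0.8) -> ScorerResult:
    # threshold is arbitrary; embedding cosine or sentence-transformers
    # could replace this body without changing the Scorer contract
    expected = scenario.expected_answer or ""  # hypothetical field
    score = difflib.SequenceMatcher(None, answer.lower(), expected.lower()).ratio()
    return ScorerResult(scorer="semantic_similarity", passed=score >= threshold,
                        score=score, rationale=f"difflib ratio {score:.2f}")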

References

Conventions surveyed: SWE-bench, τ-bench, AgentBench, WebArena,
GAIA, AppWorld, HELM, OpenAI evals.
MLflow Scorer concept: https://mlflow.org/docs/latest/genai/concepts/scorers/

Labels: enhancement (New feature or request)