feat(eval): offline fixture-based agent regression suite (M2.3)#13

Merged: wusijian007 merged 1 commit into main from feat/m2.3-eval-suite on May 15, 2026
@wusijian007

Owner

This is the final v2-plan core item: a deterministic, offline regression gate that drives the real query loop with scripted FakeModel runs and pins pass/turns/token/cost per task.

A1 — FakeModel scripted usage:
FakeModelStep's assistant_message gains an optional usage: ModelUsage. When present the stream event carries it; when absent behavior is byte-for-byte unchanged. This is what makes offline token/cost metrics deterministic instead of always-zero, and is reusable by any test that wants to assert token accounting.
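
As a rough illustration of the A1 change, the sketch below shows one way an optional scripted usage could flow from a step into a stream event. The type shapes, the field names (`inputTokens`, `outputTokens`, `cacheReadTokens`), and the helper `scriptToEvents` are assumptions made for this example and are not taken from the actual FakeModel code.

```typescript
// Hypothetical shapes for illustration; the real ModelUsage and
// FakeModelStep definitions in the repo may differ.
interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

interface FakeModelStep {
  type: "assistant_message";
  text: string;
  usage?: ModelUsage; // optional: omit it to keep the old behavior
}

interface StreamEvent {
  type: "assistant_message";
  text: string;
  usage?: ModelUsage;
}

// Hypothetical helper: turns a script into the events a fake model would emit.
function scriptToEvents(steps: FakeModelStep[]): StreamEvent[] {
  return steps.map((step) => {
    const event: StreamEvent = { type: "assistant_message", text: step.text };
    // Set the key only when usage was scripted, so events without
    // usage keep exactly the shape they had before the extension.
    if (step.usage !== undefined) event.usage = step.usage;
    return event;
  });
}

const exampleEvents = scriptToEvents([
  { type: "assistant_message", text: "reading files", usage: { inputTokens: 1200, outputTokens: 40 } },
  { type: "assistant_message", text: "done" },
]);
```

Leaving the `usage` key off entirely (rather than setting it to `undefined`) is one way to keep the no-usage path indistinguishable from the previous output.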

eval.ts (mirrors week12/week18 structure):

  • 5 inline EvalTasks exercising distinct surfaces: read-only analysis (Glob+Read, plan), read-before-write safe edit (bypassPermissions), read-only Bash whitelist (pwd), plan-mode permission enforcement (Write must be denied, no file leaked), and an explore sub-agent through the same loop.
  • Each task: FakeModel script with scripted usage, validate() that inspects the transcript/fixture, pass = terminal completed + zero notes.
  • Per-task metrics: turns (assistant_message count), summed in/out/cache tokens, cost via a fixed EVAL_REFERENCE_RATES constant so the cost column is reproducible regardless of env.
  • Writes per-task transcript JSON + REPORT.md under .myagent/evals/runs/<runId>/. formatEvalReport prints an [eval] <status> summary with a totals line.
  • CLI: myagent eval run (exit 0 iff all pass). Help text + router wired next to week18.
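
The per-task metrics and pass rule described in the list above could be computed roughly as follows. This is a minimal sketch under stated assumptions: the transcript entry shape, the helper names `computeMetrics` and `taskPassed`, and the values in `ILLUSTRATIVE_RATES` are all hypothetical. In particular, the rates are placeholders and are not the project's EVAL_REFERENCE_RATES.

```typescript
// Hypothetical shapes; the real transcript and usage types may differ.
interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
}

interface TranscriptEntry {
  type: string; // e.g. "assistant_message", "tool_result"
  usage?: ModelUsage;
}

interface Rates {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
  cacheReadPerMTok: number; // USD per million cache-read tokens
}

// Placeholder values for this example only (NOT EVAL_REFERENCE_RATES).
const ILLUSTRATIVE_RATES: Rates = { inputPerMTok: 2, outputPerMTok: 10, cacheReadPerMTok: 1 };

function computeMetrics(transcript: TranscriptEntry[], rates: Rates) {
  let turns = 0;
  let inputTokens = 0;
  let outputTokens = 0;
  let cacheReadTokens = 0;
  for (const entry of transcript) {
    if (entry.type !== "assistant_message") continue;
    turns += 1; // one turn per assistant message
    if (entry.usage) {
      inputTokens += entry.usage.inputTokens;
      outputTokens += entry.usage.outputTokens;
      cacheReadTokens += entry.usage.cacheReadTokens;
    }
  }
  // Fixed rates keep the cost column independent of environment config.
  const costUsd =
    (inputTokens * rates.inputPerMTok +
      outputTokens * rates.outputPerMTok +
      cacheReadTokens * rates.cacheReadPerMTok) / 1e6;
  return { turns, inputTokens, outputTokens, cacheReadTokens, costUsd };
}

// Pass rule from the description: terminal state completed and zero notes.
function taskPassed(terminalState: string, notes: string[]): boolean {
  return terminalState === "completed" && notes.length === 0;
}

const sampleMetrics = computeMetrics(
  [
    { type: "assistant_message", usage: { inputTokens: 1000, outputTokens: 50, cacheReadTokens: 0 } },
    { type: "tool_result" },
    { type: "assistant_message", usage: { inputTokens: 500, outputTokens: 25, cacheReadTokens: 200 } },
  ],
  ILLUSTRATIVE_RATES,
);
```

With the placeholder rates, the sample transcript yields 2 turns, 1500 input tokens, 75 output tokens, and 200 cache-read tokens.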

B1 — CI gate:
packages/cli/test/eval.test.ts runs the whole suite into a temp dir, hard-asserts status=passed, pins the deterministic metric fingerprint (read-only-analysis exact tokens/cost, suite totals turns=11 in=8400 out=485), and checks that the permission task really denied the plan-mode Write. A behavior regression in the agent loop turns this test red in the 3-OS CI. Separately, fake-model-usage.test.ts covers the A1 extension (scripted usage present, absent, and per-turn).
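
To make the idea of a pinned fingerprint concrete, here is a small sketch of comparing suite totals against the pinned values quoted above (turns=11, in=8400, out=485). The `SuiteTotals` shape and the `fingerprintMismatches` helper are hypothetical and stand in for whatever assertions eval.test.ts actually uses.

```typescript
// Hypothetical totals shape for illustration only.
interface SuiteTotals {
  turns: number;
  inputTokens: number;
  outputTokens: number;
}

// Pinned suite totals as quoted in this PR description.
const PINNED_TOTALS: SuiteTotals = { turns: 11, inputTokens: 8400, outputTokens: 485 };

// Returns one human-readable line per field that drifted from the pin.
function fingerprintMismatches(actual: SuiteTotals, pinned: SuiteTotals): string[] {
  const keys: (keyof SuiteTotals)[] = ["turns", "inputTokens", "outputTokens"];
  return keys
    .filter((key) => actual[key] !== pinned[key])
    .map((key) => `${key}: expected ${pinned[key]}, got ${actual[key]}`);
}

const stableRun = fingerprintMismatches(
  { turns: 11, inputTokens: 8400, outputTokens: 485 },
  PINNED_TOTALS,
);
const driftedRun = fingerprintMismatches(
  { turns: 12, inputTokens: 8400, outputTokens: 485 },
  PINNED_TOTALS,
);
```

A gate built this way fails on any drift, including an extra turn that leaves the task outcome unchanged, which is the kind of quiet behavior change a pass/fail check alone would miss.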

Catalog rows added; CLAUDE.md documents the suite + gate.

Local: 187 tests, 3/3 green. myagent eval run passes 5/5 with a stable $0.0528 total cost fingerprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@wusijian007 wusijian007 merged commit a04135b into main May 15, 2026
3 checks passed
@wusijian007 wusijian007 deleted the feat/m2.3-eval-suite branch May 15, 2026 06:56