feat(eval): offline fixture-based agent regression suite (M2.3) by wusijian007 · Pull Request #13 · wusijian007/mini-claude-code

wusijian007 · 2026-05-15T06:54:52Z

The final v2-plan core item. A deterministic, offline regression gate that drives the real query loop with scripted FakeModel runs and pins pass/turns/token/cost per task.

A1 — FakeModel scripted usage:
`FakeModelStep`'s `assistant_message` gains an optional `usage: ModelUsage`. When present the stream event carries it; when absent behavior is byte-for-byte unchanged. This is what makes offline token/cost metrics deterministic instead of always-zero, and is reusable by any test that wants to assert token accounting.
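
A minimal sketch of the optional-usage behavior described above. Only the names `FakeModelStep`, `assistant_message`, and `ModelUsage` come from this PR; the field names, the `StreamEvent` shape, and the `toStreamEvent` helper are assumptions made for illustration and may not match the real code.

```typescript
// Hypothetical shapes: the field names below are assumed, not taken from the repo.
interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
}

interface FakeModelStep {
  type: "assistant_message";
  text: string;
  usage?: ModelUsage; // optional: omitted by scripts that do not care about tokens
}

interface StreamEvent {
  type: "assistant_message";
  text: string;
  usage?: ModelUsage;
}

// Turn one scripted step into a stream event. The usage key is added only
// when the script supplies it, so steps without usage serialize as before.
function toStreamEvent(step: FakeModelStep): StreamEvent {
  const event: StreamEvent = { type: step.type, text: step.text };
  if (step.usage !== undefined) {
    event.usage = step.usage;
  }
  return event;
}

const withUsage = toStreamEvent({
  type: "assistant_message",
  text: "done",
  usage: { inputTokens: 1200, outputTokens: 40 },
});
const withoutUsage = toStreamEvent({ type: "assistant_message", text: "done" });
```

Adding the key conditionally, rather than always emitting `usage: undefined`, is one way to keep the no-usage path identical for any consumer that checks for the key's presence.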

eval.ts (mirrors week12/week18 structure):

- 5 inline EvalTasks exercising distinct surfaces: read-only analysis (Glob+Read, plan), read-before-write safe edit (bypassPermissions), read-only Bash whitelist (pwd), plan-mode permission enforcement (Write must be denied, no file leaked), and an explore sub-agent through the same loop.
- Each task: FakeModel script with scripted usage, `validate()` that inspects the transcript/fixture, pass = terminal completed + zero notes.
- Per-task metrics: turns (assistant_message count), summed in/out/cache tokens, cost via a fixed `EVAL_REFERENCE_RATES` constant so the cost column is reproducible regardless of env.
- Writes per-task transcript JSON + REPORT.md under `.myagent/evals/runs/<runId>/`. `formatEvalReport` prints an `[eval] <status>` summary with a totals line.
- CLI: `myagent eval run` (exit 0 iff all pass). Help text + router wired next to week18.
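
The per-task metrics, pass rule, and exit-code rule listed above could be sketched roughly as follows. This is an illustration under stated assumptions: `EVAL_REFERENCE_RATES` is named in this PR but its values are not shown, so the rates here are placeholders, and `TranscriptEntry`, `computeMetrics`, `taskPassed`, and `exitCodeFor` are hypothetical names.

```typescript
// Hypothetical transcript and usage shapes for illustration only.
interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
  cacheTokens: number;
}

interface TranscriptEntry {
  type: "assistant_message" | "tool_result";
  usage?: ModelUsage;
}

interface TaskMetrics {
  turns: number;
  inputTokens: number;
  outputTokens: number;
  cacheTokens: number;
  costUsd: number;
}

// USD per million tokens. Placeholder values, NOT the rates used by the PR.
const EVAL_REFERENCE_RATES = { input: 3, output: 15, cache: 0.3 } as const;

// turns = number of assistant messages; tokens are summed over those messages.
function computeMetrics(transcript: TranscriptEntry[]): TaskMetrics {
  const m: TaskMetrics = { turns: 0, inputTokens: 0, outputTokens: 0, cacheTokens: 0, costUsd: 0 };
  for (const entry of transcript) {
    if (entry.type !== "assistant_message") continue;
    m.turns += 1;
    if (entry.usage) {
      m.inputTokens += entry.usage.inputTokens;
      m.outputTokens += entry.usage.outputTokens;
      m.cacheTokens += entry.usage.cacheTokens;
    }
  }
  // Cost depends only on the fixed constant, never on environment configuration.
  m.costUsd =
    (m.inputTokens * EVAL_REFERENCE_RATES.input +
      m.outputTokens * EVAL_REFERENCE_RATES.output +
      m.cacheTokens * EVAL_REFERENCE_RATES.cache) /
    1_000_000;
  return m;
}

// pass = terminal state "completed" and validate() produced zero notes.
function taskPassed(terminal: string, notes: string[]): boolean {
  return terminal === "completed" && notes.length === 0;
}

// Exit 0 iff every task passed.
function exitCodeFor(passed: boolean[]): number {
  return passed.every(Boolean) ? 0 : 1;
}

const metrics = computeMetrics([
  { type: "assistant_message", usage: { inputTokens: 1000, outputTokens: 50, cacheTokens: 0 } },
  { type: "tool_result" },
  { type: "assistant_message", usage: { inputTokens: 2000, outputTokens: 100, cacheTokens: 0 } },
]);
```

Because the rates are a compile-time constant rather than something read from the environment, the same scripted usage always yields the same cost figure, which is what allows it to be pinned in a test.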

B1 — CI gate:
`packages/cli/test/eval.test.ts` runs the whole suite into a temp dir, hard-asserts status=passed, pins the deterministic metric fingerprint (read-only-analysis exact tokens/cost, suite totals turns=11 in=8400 out=485), and checks the permission task really denied the plan-mode Write. A behavior regression in the agent loop flips this red in the 3-OS CI. In addition, `fake-model-usage.test.ts` covers the A1 extension (scripted usage present / absent / per-turn).
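
The fingerprint pinning described above might look something like this sketch. The pinned totals (turns=11, in=8400, out=485) are the ones stated in this PR; the `SuiteResult` shape and the `gateFailures` helper are hypothetical, and the real test presumably uses the project's test runner rather than a hand-rolled check.

```typescript
// Hypothetical result shape; the real suite's return type is not shown in the PR.
interface SuiteTotals {
  turns: number;
  inputTokens: number;
  outputTokens: number;
}

interface SuiteResult {
  status: "passed" | "failed";
  totals: SuiteTotals;
}

// Suite totals quoted in the PR description.
const PINNED_TOTALS: SuiteTotals = { turns: 11, inputTokens: 8400, outputTokens: 485 };

// Collect every deviation from the pinned fingerprint so a failure
// message names the exact metric that drifted.
function gateFailures(result: SuiteResult, pinned: SuiteTotals): string[] {
  const failures: string[] = [];
  if (result.status !== "passed") {
    failures.push(`status=${result.status}`);
  }
  for (const key of Object.keys(pinned) as (keyof SuiteTotals)[]) {
    if (result.totals[key] !== pinned[key]) {
      failures.push(`${key}: expected ${pinned[key]}, got ${result.totals[key]}`);
    }
  }
  return failures;
}

// A healthy run matches the fingerprint; a loop regression that adds a turn does not.
const healthy: SuiteResult = { status: "passed", totals: { ...PINNED_TOTALS } };
const regressed: SuiteResult = { status: "passed", totals: { ...PINNED_TOTALS, turns: 12 } };

const healthyFailures = gateFailures(healthy, PINNED_TOTALS);
const regressedFailures = gateFailures(regressed, PINNED_TOTALS);
```

Pinning exact totals means even a regression that still "passes" every task, such as an extra model turn, shows up as a red check.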

Catalog rows added; CLAUDE.md documents the suite + gate.

Local: 187 tests, 3/3 green. `myagent eval run` passes 5/5 with a stable $0.0528 total cost fingerprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

wusijian007 merged commit a04135b into main May 15, 2026
3 checks passed

wusijian007 deleted the feat/m2.3-eval-suite branch May 15, 2026 06:56
