feat(eval): offline fixture-based agent regression suite (M2.3) #13
Merged
The final v2-plan core item. A deterministic, offline regression gate that drives the real query loop with scripted FakeModel runs and pins pass/turns/token/cost per task.

A1 (FakeModel scripted usage):
`FakeModelStep`'s `assistant_message` gains an optional `usage: ModelUsage`. When present, the stream event carries it; when absent, behavior is byte-for-byte unchanged. This is what makes offline token/cost metrics deterministic instead of always zero, and it is reusable by any test that wants to assert token accounting.

eval.ts (mirrors week12/week18 structure):
- 5 inline EvalTasks exercising distinct surfaces: read-only analysis (Glob+Read, plan), read-before-write safe edit (bypassPermissions), read-only Bash whitelist (pwd), plan-mode permission enforcement (Write must be denied, no file leaked), and an explore sub-agent through the same loop.
- Each task: FakeModel script with scripted usage, `validate()` that inspects the transcript/fixture, pass = terminal completed + zero notes.
- Per-task metrics: turns (assistant_message count), summed in/out/cache tokens, cost via a fixed `EVAL_REFERENCE_RATES` constant so the cost column is reproducible regardless of env.
- Writes per-task transcript JSON + REPORT.md under `.myagent/evals/runs/<runId>/`. `formatEvalReport` prints an `[eval] <status>` summary with a totals line.
- CLI: `myagent eval run` (exit 0 iff all pass). Help text + router wired next to week18.

B1 (CI gate):
packages/cli/test/eval.test.ts runs the whole suite into a temp dir, hard-asserts status=passed, pins the deterministic metric fingerprint (read-only-analysis exact tokens/cost, suite totals turns=11 in=8400 out=485), and checks the permission task really denied the plan-mode Write. A behavior regression in the agent loop flips this red in the 3-OS CI. In addition, fake-model-usage.test.ts covers the A1 extension (scripted usage present / absent / per-turn).

Catalog rows added; CLAUDE.md documents the suite + gate.

Local: 187 tests, 3/3 green.
`myagent eval run` passes 5/5 with a stable $0.0528 total cost fingerprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
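For reviewers who want a concrete picture of A1 and the per-task metrics, here is a minimal sketch, not the code in this PR. The field names on `ModelUsage`, the `toStreamEvent` and `computeMetrics` helpers, and the `SKETCH_RATES` values are all assumptions made for illustration; the real `EVAL_REFERENCE_RATES` constant and event shapes may differ.

```typescript
// Hypothetical shapes: the real ModelUsage / FakeModelStep types may differ.
interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

interface FakeModelStep {
  assistant_message: { text: string; usage?: ModelUsage };
}

interface AssistantMessageEvent {
  type: "assistant_message";
  text: string;
  usage?: ModelUsage;
}

// Build the stream event for one scripted step. The `usage` key is added only
// when the step scripts it, so steps without usage yield the same object as before.
function toStreamEvent(step: FakeModelStep): AssistantMessageEvent {
  const { text, usage } = step.assistant_message;
  return usage === undefined
    ? { type: "assistant_message", text }
    : { type: "assistant_message", text, usage };
}

// Illustrative rates in USD per million tokens. These are NOT the values of
// EVAL_REFERENCE_RATES; they only show why a fixed constant keeps cost reproducible.
const SKETCH_RATES = { input: 3, output: 15, cacheRead: 1 } as const;

interface TaskMetrics {
  turns: number;
  inputTokens: number;
  outputTokens: number;
  cacheTokens: number;
  costUsd: number;
}

// Turns = number of assistant_message events; tokens are summed across turns,
// treating missing usage as zero; cost is derived from the fixed rates only.
function computeMetrics(events: AssistantMessageEvent[]): TaskMetrics {
  let inputTokens = 0;
  let outputTokens = 0;
  let cacheTokens = 0;
  for (const event of events) {
    inputTokens += event.usage?.inputTokens ?? 0;
    outputTokens += event.usage?.outputTokens ?? 0;
    cacheTokens += event.usage?.cacheReadTokens ?? 0;
  }
  const costUsd =
    (inputTokens * SKETCH_RATES.input +
      outputTokens * SKETCH_RATES.output +
      cacheTokens * SKETCH_RATES.cacheRead) /
    1_000_000;
  return { turns: events.length, inputTokens, outputTokens, cacheTokens, costUsd };
}
```

Because the rates are a compile-time constant rather than something read from the environment, the same script always produces the same cost figure, which is what lets a test pin it.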
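The gate logic described under B1 (exit 0 only if every task passes, plus a pinned totals fingerprint) can be sketched roughly as follows. This is an assumption-laden illustration rather than the contents of eval.test.ts: `TaskResult`, `sumTotals`, `suiteExitCode`, and `fingerprintDiffs` are hypothetical names, and only the pinned totals (turns=11, in=8400, out=485) come from the description above.

```typescript
// Hypothetical result shape for one eval task.
interface TaskResult {
  id: string;
  passed: boolean;
  turns: number;
  inputTokens: number;
  outputTokens: number;
}

interface SuiteTotals {
  turns: number;
  inputTokens: number;
  outputTokens: number;
}

// Suite totals quoted in the PR description.
const PINNED_TOTALS: SuiteTotals = { turns: 11, inputTokens: 8400, outputTokens: 485 };

// Sum the per-task metrics into suite totals.
function sumTotals(tasks: TaskResult[]): SuiteTotals {
  return tasks.reduce<SuiteTotals>(
    (acc, t) => ({
      turns: acc.turns + t.turns,
      inputTokens: acc.inputTokens + t.inputTokens,
      outputTokens: acc.outputTokens + t.outputTokens,
    }),
    { turns: 0, inputTokens: 0, outputTokens: 0 },
  );
}

// Exit 0 iff all tasks pass. An empty suite is treated as a failure here
// (an assumption) so that a vacuous run cannot turn the gate green.
function suiteExitCode(tasks: TaskResult[]): 0 | 1 {
  return tasks.length > 0 && tasks.every((t) => t.passed) ? 0 : 1;
}

// Return one human-readable line per metric that drifted from the pinned value.
function fingerprintDiffs(actual: SuiteTotals, pinned: SuiteTotals): string[] {
  const keys: (keyof SuiteTotals)[] = ["turns", "inputTokens", "outputTokens"];
  return keys
    .filter((k) => actual[k] !== pinned[k])
    .map((k) => `${k}: expected ${pinned[k]}, got ${actual[k]}`);
}
```

A test in this style would assert `suiteExitCode(results) === 0` and that `fingerprintDiffs(sumTotals(results), PINNED_TOTALS)` is empty, so any change in loop behavior that alters turn or token counts shows up as a named diff instead of a silent drift.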