feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner (#523)
Phase 8 Task 8.4 — Tier 4 S1 brutal-review item. Ships the
release-engineering measurement tool that produces the v1.0 ship
gate's headline number: the Llama 3.1 8B success-rate delta on a
standardized curate task set, raw `tools.*` vs the current harness.
v1.0 ship gate: delta >= 30 percentage points on the reference
fixture or the release is blocked.
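The gate arithmetic is small enough to state directly. A minimal sketch of the delta and exit-code logic, assuming simplified shapes (`SHIP_GATE_DELTA` is named later in this PR; the `KpiReport` interface here is illustrative, not the runner's actual type):

```typescript
// Illustrative sketch of the ship-gate math; field names are assumptions.
const SHIP_GATE_DELTA = 0.30; // 30 percentage points, inclusive floor

interface KpiReport {
  harnessRate: number; // overall success rate, harness arm
  rawRate: number;     // overall success rate, raw tools.* arm
  delta: number;       // harnessRate - rawRate
}

function exitCodeForReport(report: KpiReport): 0 | 1 {
  // A delta of exactly 0.30 passes (inclusive floor); anything below blocks the release.
  return report.delta >= SHIP_GATE_DELTA ? 0 : 1;
}
```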
## Layout
All release-engineering tooling for this feature lives together
under `scripts/autoharness-kpi/`:
```
scripts/autoharness-kpi/
  runner.ts              — main script (parseArgs + runArm + computeKpiReport
                           + exitCodeForReport + main — all pure, unit-testable)
  fixture-tasks.json     — 20 curate tasks hand-crafted against this codebase
                           (stable across releases)
  expected-baseline.json — placeholder; populated at v1.0 ship with the
                           real-LLM measurement
test/unit/scripts/autoharness-kpi/
  runner.test.ts         — 19 unit tests
package.json
  "kpi:harness"          — runs via `tsx scripts/autoharness-kpi/runner.ts`
```
Chose the `scripts/<feature>/` namespace over a top-level `kpi/`
directory: KPI tooling is a subsystem concern (AutoHarness V2 only),
not a first-class project concern. Big-tech repos (Kubernetes /
Rust / CockroachDB / VSCode) consistently keep per-feature
release tooling under an umbrella like `scripts/`, `hack/`, or
`tools/` — never at repo root.
## Scope — scaffolding + stub LLM only
`--llm stub` is fully operational with deterministic canned arm
rates (delta = 50pp, well above the 30pp gate) for script-logic
validation. `--llm real` currently throws a clear "not yet
implemented" error — wiring the real Llama 3.1 8B path into the
agent LLM service lands in a follow-up once Phase 7 finalizes the
command surface.
The fixture + math + CLI + exit-code logic all ship now (zero
conflict risk with Phat's in-flight Phase 6 work) so the
follow-up real-LLM PR is ~50 LOC of wiring.
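A deterministic stub client along these lines is enough for script-logic validation; this is a hypothetical sketch (the real `makeStubLlmClient` signature and canned outcomes in `runner.ts` may differ):

```typescript
type Arm = 'raw' | 'harness';

interface StubLlmClient {
  // Returns whether the stub "model" solves the given task on the given arm.
  succeeds(taskId: string, arm: Arm): boolean;
}

// Hypothetical canned outcomes keyed by task ID. Unknown IDs fall back to
// both-arms-succeed, matching the fallback behavior the tests describe.
function makeStubLlmClient(
  canned: Record<string, Record<Arm, boolean>>,
): StubLlmClient {
  return {
    succeeds(taskId, arm) {
      const entry = canned[taskId];
      return entry ? entry[arm] : true; // unknown-id fallback
    },
  };
}
```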
## Fixture design
20 curate tasks covering concrete byterover-cli subjects:
exported classes under harness/, HarnessConfigSchema fields,
IHarnessStore methods, Zod schema enumeration, project-type
detection rules, mode thresholds, Mode C cap values, feedback
weights, etc. Each task has {id, taskDescription,
expectedBehavior}. Hand-crafted so a reviewer can eyeball
realism. Aims at the "weak model needs hand-holding" band —
tasks too easy (any model succeeds) or too hard (no model
succeeds) wouldn't discriminate between arms.
## Tests (19)
- loadFixture: shipped fixture parses; malformed/empty rejected
- makeStubLlmClient: deterministic per-(task, arm); unknown-id
fallback to both-arms-succeed
- runArm: aggregates correctly; overall = flat mean across runs
- computeKpiReport: delta math; metadata carried through
- exitCodeForReport: 0 at exactly 0.30 (inclusive floor), 1 at
0.29, 1 on negative delta
- parseArgs: defaults; flag parsing; validates --llm + --runs
- main (end-to-end with stub): shipped fixture runs; --output
writes valid JSON; --llm real surfaces the clear
"not yet implemented" error
## Not merging yet
Per team process, this PR opens for review; Danh decides separately
whether to merge to proj/autoharness-v2 once review feedback is in.
Claude finished @danhdoan's task in 2m 32s · PR Review — AutoHarness V2 KPI Harness

Overall Assessment: 3.5 / 5. The scaffolding is well-designed: pure functions, clean exports for testability, correct delta math. Seven issues were flagged inline; two are blocking against project standards, the rest are suggestions.

Issues by severity

Blocking (project-standards violations)

Suggestions

Fixture realism note: the 20 tasks are anchored to concrete AutoHarness V2 symbols (schema fields, interface methods, mode thresholds, template names, etc.), which is the right approach. Two callouts:

What looks good
Phase 7 Task 7.5 — Tier 1 Q1 brutal-review item. Ships the reusable `HarnessBaselineRunner` that powers the future `brv harness baseline` CLI command: replays the last N stored scenarios against two arms (pass-through template = raw, current version = harness), reports per-scenario outcomes + overall rates + delta.

Complementary to the reference KPI harness (`scripts/autoharness-kpi/`, PR #523):

- KPI harness: fixed task set + fixed model → release-notes headline number.
- Baseline runner: user's own scenarios + current harness → per-user "is it working for me?" signal.

The oclif command file (`src/oclif/commands/harness/baseline.ts`) is deliberately NOT shipped in this PR. The command needs the daemon-transport wiring for harness subcommands that Task 7.1 (status/inspect) establishes — none of Phase 7's CLI surface is merged yet, so there's no transport pattern to follow. Shipping just the runner class now (pure, testable, DI-ready) lets the oclif wrapper land in a small follow-up after 7.1 defines the pattern. This matches the "reusable logic now, wiring later" pattern used by the KPI harness scaffolding PR (#523).

- Pure orchestration; caller injects the `HarnessToolsFactory` (production wires this to `SandboxService.buildHarnessTools({dryRun: true})` for write-blocked eval runs, matching Phase 6.1).
- Raw arm fabricates a synthetic version from the stored template (`getTemplate(commandType, currentVersion.projectType).code`) — no need to persist it; the module builder accepts any code string.
- v1.0 scope narrowing: only `commandType === 'curate'` is supported (query / chat throw `UNSUPPORTED_COMMAND_TYPE`) until query templates ship in the Phase 4.3 follow-up.
- `count` bounded [1, 50]; scenarios < 3 fails with a clear "run curate N more times first" message so users know how to unblock themselves.
- Each scenario runs once per arm (single-run side-by-side display). Contrast Phase 6.1 Evaluator, which does 10 runs per scenario for statistical significance — different semantics, different surface, not shared on purpose.

Tests (9): COUNT_OUT_OF_RANGE (0, 51), UNSUPPORTED_COMMAND_TYPE (query, chat), INSUFFICIENT_SCENARIOS (2 scenarios), NO_CURRENT_VERSION (empty store), happy-path delta = +100%, no-op delta = 0, count-caps-slice, harness-throws-captures-stderr. All 278 harness unit tests pass. Build + lint + typecheck clean.
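The guard rails described above could be sketched roughly as follows (the error codes are taken from the test list; the function shape and messages are assumptions, not the actual `HarnessBaselineRunner` API):

```typescript
type BaselineErrorCode =
  | 'COUNT_OUT_OF_RANGE'
  | 'UNSUPPORTED_COMMAND_TYPE'
  | 'INSUFFICIENT_SCENARIOS'
  | 'NO_CURRENT_VERSION';

class BaselineError extends Error {
  constructor(public code: BaselineErrorCode, message: string) {
    super(message);
  }
}

// Hypothetical pre-flight validation for a baseline run.
function validateBaselineRun(opts: {
  count: number;
  commandType: 'curate' | 'query' | 'chat';
  storedScenarios: number;
  hasCurrentVersion: boolean;
}): void {
  if (opts.count < 1 || opts.count > 50)
    throw new BaselineError('COUNT_OUT_OF_RANGE', 'count must be in [1, 50]');
  if (opts.commandType !== 'curate')
    throw new BaselineError(
      'UNSUPPORTED_COMMAND_TYPE',
      `${opts.commandType} is not supported in v1.0`,
    );
  if (!opts.hasCurrentVersion)
    throw new BaselineError('NO_CURRENT_VERSION', 'no harness version stored yet');
  if (opts.storedScenarios < 3)
    throw new BaselineError(
      'INSUFFICIENT_SCENARIOS',
      `run curate ${3 - opts.storedScenarios} more times first`,
    );
}
```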
Claude finished @RyanNg1403's task in 2m 6s · PR Review — AutoHarness V2 KPI Harness (re-review after branch sync)

Overall Assessment: 3.5 / 5. This re-review was triggered by the merge-sync commit. Both blocking issues from the previous review remain open; the overall assessment and score are unchanged. A brief re-audit confirmed no regressions were introduced and no new issues were found.

Still-Open Blocking Issues

1. Current code:

```typescript
commandType: (parsed.commandType as Fixture['commandType']) ?? 'curate',
fixtureVersion: (parsed.fixtureVersion as string) ?? 'unversioned',
targetModel: (parsed.targetModel as string) ?? 'unknown',
```

Suggested validated replacement:

```typescript
const VALID_COMMAND_TYPES = ['chat', 'curate', 'query'] as const

const rawCt = parsed.commandType
const commandType: Fixture['commandType'] =
  typeof rawCt === 'string' &&
  (VALID_COMMAND_TYPES as readonly string[]).includes(rawCt)
    ? (rawCt as Fixture['commandType'])
    : 'curate'

const rawVersion = parsed.fixtureVersion
const fixtureVersion = typeof rawVersion === 'string' ? rawVersion : 'unversioned'

const rawTarget = parsed.targetModel
const targetModel = typeof rawTarget === 'string' ? rawTarget : 'unknown'
```

2. Missing … Options (pick one): …

Still-Open Suggestions

3. No … — a malformed fixture file will surface a raw …

4. Silent … — if the two arms ever diverge on task IDs, the delta silently uses 0% as the raw baseline. A task-ID mismatch is a bug and should throw.

5. …

6. …

7. Trailing whitespace at runner.ts:368 (three spaces before …)

What Still Looks Good
Claude finished @RyanNg1403's task in 2m 2s · PR Review — AutoHarness V2 KPI Harness (re-review after 9c366d3)

Overall Assessment: 4.5 / 5. All seven issues from the previous two reviews have been addressed. Both blocking violations are gone, all five suggestions are implemented, and the test suite grew from 19 → 28 tests. The only remaining item is the cosmetic trailing whitespace, which is the only thing keeping this below a 5.

Previous Issues — Status

Detail: What Was Fixed

Issue 1 — …

Issue 2 — …

Issue 3 — …

Issue 4 — task-ID mismatch now throws:

```typescript
throw new Error(`task ID mismatch: harness arm has '${h.taskId}' but raw arm does not`)
```

Test at …

Issues 5 & 6 — …

Remaining Item

Nitpick — trailing whitespace (…)

What Looks Good
## Summary

`scripts/autoharness-kpi/`. Scaffolding + 20-task fixture + stub-LLM runner are fully operational; `--llm real` wiring lands in a follow-up. `brv harness baseline` (Task 7.5, per-user KPI) is a distinct command — not affected.

## Type of change

## Scope (select all touched areas)

## Linked issues

`brv harness baseline` (per-user KPI — same math, different consumer)

## Root cause (bug fixes only, otherwise write N/A)

N/A
## Layout decision
Chose the `scripts/<feature>/` namespace over a top-level `kpi/` directory after thinking through it from a big-tech senior-engineer lens: KPI tooling is a subsystem concern (AutoHarness V2 only), not a first-class project concern. Top-level promotion would over-scope it. Kubernetes uses `hack/`, Rust uses `src/tools/`, CockroachDB uses `pkg/cmd/`, VSCode co-locates eval with each extension — none promote single-subsystem eval to repo root.

## Test plan
`test/unit/scripts/autoharness-kpi/runner.test.ts` (19 tests)

- `loadFixture`: shipped fixture parses cleanly; malformed / empty rejected
- `makeStubLlmClient`: deterministic per-(task, arm); unknown-id → both-arms-succeed fallback
- `runArm`: aggregates runs-per-task; overall = flat mean
- `computeKpiReport`: delta math = harness − raw; metadata flows through
- `exitCodeForReport`: 0 at delta = 0.30 (inclusive floor), 1 at 0.29, 1 on negative delta
- `parseArgs`: defaults + all flags + validates `--llm` / `--runs`
- `main` end-to-end with stub: shipped fixture runs → delta = 0.50 → exit 0; `--output` writes valid JSON; `--llm real` throws clear not-implemented error

## User-visible changes
None. Release-engineering tool only.
## Evidence

## Checklist

## Risks and mitigations
Risk: Scaffolding-only — `--llm real` is NOT implemented in this PR. The real-LLM path (~50 LOC wiring the agent LLM stack with Llama 3.1 8B) lands in a follow-up. Without it, the v1.0 ship gate can't run. Mitigation: … `--llm real` … so no one thinks they can run the ship gate yet.

Risk: Fixture realism. The 20 curate tasks are hand-crafted against this codebase. Too-easy tasks (any model succeeds) or too-hard tasks (no model succeeds) dilute the delta signal.

Risk: 30pp threshold lock-in. If the first real-LLM run produces delta < 30pp, we either rework the refinement prompts or lower the bar. The bar is defined in `v1-design-decisions.md` §2.7; moving it needs a design review. Mitigation: `SHIP_GATE_DELTA` is a named const at the top of `runner.ts` — a one-line change if the bar moves. A test pins the current value so drift shows up in diff.

Risk: `expected-baseline.json` is a placeholder. Until the real-LLM run lands, the file has `null` fields (`measuredAt: null` etc.); the `$comment` field explains when and how to populate it. Regressions on future ships compare against this file.

Risk: Holding merge for manual review. PR explicitly flags this at the top — no auto-merge expected.
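For context, the placeholder might look roughly like this; only `measuredAt: null` and the `$comment` field are mentioned in this PR, and the remaining fields are assumptions:

```typescript
// Hypothetical shape of the placeholder expected-baseline.json.
// Only `measuredAt: null` and `$comment` are named in the PR text;
// the rate/delta fields below are illustrative assumptions.
const placeholderBaseline = {
  $comment:
    'Populate at v1.0 ship with the real Llama 3.1 8B measurement; ' +
    'future ships regress against this file.',
  measuredAt: null as string | null,
  rawRate: null as number | null,
  harnessRate: null as number | null,
  delta: null as number | null,
};
```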