feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner (#523)
Phase 8 Task 8.4 — Tier 4 S1 brutal-review item. Ships the
release-engineering measurement tool that produces the v1.0 ship
gate's headline number: the Llama 3.1 8B success-rate delta on a
standardized curate task set, raw `tools.*` vs the current harness.
v1.0 ship gate: delta >= 30 percentage points on the reference
fixture or the release is blocked.
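The gate arithmetic is small enough to state directly. A minimal sketch of the delta and exit-code logic, assuming simplified shapes (`SHIP_GATE_DELTA` is named later in this PR; the `KpiReport` interface here is illustrative, not the runner's actual type):

```typescript
// Illustrative sketch of the ship-gate math; field names are assumptions.
const SHIP_GATE_DELTA = 0.30; // 30 percentage points, inclusive floor

interface KpiReport {
  harnessRate: number; // overall success rate, harness arm
  rawRate: number;     // overall success rate, raw tools.* arm
  delta: number;       // harnessRate - rawRate
}

function exitCodeForReport(report: KpiReport): 0 | 1 {
  // A delta of exactly 0.30 passes (inclusive floor); anything below blocks the release.
  return report.delta >= SHIP_GATE_DELTA ? 0 : 1;
}
```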
## Layout
All release-engineering tooling for this feature lives together
under `scripts/autoharness-kpi/`:
```
scripts/autoharness-kpi/
  runner.ts              — main script (parseArgs + runArm + computeKpiReport
                           + exitCodeForReport + main — all pure, unit-testable)
  fixture-tasks.json     — 20 curate tasks hand-crafted against this codebase
                           (stable across releases)
  expected-baseline.json — placeholder; populated at v1.0 ship with the
                           real-LLM measurement
test/unit/scripts/autoharness-kpi/
  runner.test.ts         — 19 unit tests
package.json
  "kpi:harness"          — runs via `tsx scripts/autoharness-kpi/runner.ts`
```
Chose the `scripts/<feature>/` namespace over a top-level `kpi/`
directory: KPI tooling is a subsystem concern (AutoHarness V2 only),
not a first-class project concern. Big-tech repos (Kubernetes /
Rust / CockroachDB / VSCode) consistently keep per-feature
release tooling under an umbrella like `scripts/`, `hack/`, or
`tools/` — never at repo root.
## Scope — scaffolding + stub LLM only
`--llm stub` is fully operational with deterministic canned arm
rates (delta = 50pp, well above the 30pp gate) for script-logic
validation. `--llm real` currently throws a clear "not yet
implemented" error — wiring the real Llama 3.1 8B path into the
agent LLM service lands in a follow-up once Phase 7 finalizes the
command surface.
The fixture + math + CLI + exit-code logic all ship now (zero
conflict risk with Phat's in-flight Phase 6 work) so the
follow-up real-LLM PR is ~50 LOC of wiring.
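A deterministic stub client along these lines is enough for script-logic validation; this is a hypothetical sketch (the real `makeStubLlmClient` signature and canned outcomes in `runner.ts` may differ):

```typescript
type Arm = 'raw' | 'harness';

interface StubLlmClient {
  // Returns whether the stub "model" solves the given task on the given arm.
  succeeds(taskId: string, arm: Arm): boolean;
}

// Hypothetical canned outcomes keyed by task ID. Unknown IDs fall back to
// both-arms-succeed, matching the fallback behavior the tests describe.
function makeStubLlmClient(
  canned: Record<string, Record<Arm, boolean>>,
): StubLlmClient {
  return {
    succeeds(taskId, arm) {
      const entry = canned[taskId];
      return entry ? entry[arm] : true; // unknown-id fallback
    },
  };
}
```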
## Fixture design
20 curate tasks covering concrete byterover-cli subjects:
exported classes under harness/, HarnessConfigSchema fields,
IHarnessStore methods, Zod schema enumeration, project-type
detection rules, mode thresholds, Mode C cap values, feedback
weights, etc. Each task has {id, taskDescription,
expectedBehavior}. Hand-crafted so a reviewer can eyeball
realism. Aims at the "weak model needs hand-holding" band —
tasks too easy (any model succeeds) or too hard (no model
succeeds) wouldn't discriminate between arms.
## Tests (19)
- loadFixture: shipped fixture parses; malformed/empty rejected
- makeStubLlmClient: deterministic per-(task, arm); unknown-id
fallback to both-arms-succeed
- runArm: aggregates correctly; overall = flat mean across runs
- computeKpiReport: delta math; metadata carried through
- exitCodeForReport: 0 at exactly 0.30 (inclusive floor), 1 at
0.29, 1 on negative delta
- parseArgs: defaults; flag parsing; validates --llm + --runs
- main (end-to-end with stub): shipped fixture runs; --output
writes valid JSON; --llm real surfaces the clear
"not yet implemented" error
## Not merging yet
Per team process, this PR opens for review; Danh decides separately
whether to merge to proj/autoharness-v2 once review feedback is in.
Claude finished @danhdoan's task in 2m 32s · PR Review — AutoHarness V2 KPI Harness

Overall Assessment: 3.5 / 5. The scaffolding is well-designed: pure functions, clean exports for testability, correct delta math. Seven issues were flagged inline; two are blocking against project standards, the rest are suggestions.

Issues by severity

Blocking (project-standards violations)

Suggestions

Fixture realism note: the 20 tasks are anchored to concrete AutoHarness V2 symbols (schema fields, interface methods, mode thresholds, template names, etc.), which is the right approach. Two callouts:

What looks good
Phase 7 Task 7.5 — Tier 1 Q1 brutal-review item. Ships the reusable `HarnessBaselineRunner` that powers the future `brv harness baseline` CLI command: replays the last N stored scenarios against two arms (pass-through template = raw, current version = harness), reports per-scenario outcomes + overall rates + delta.

Complementary to the reference KPI harness (`scripts/autoharness-kpi/`, PR #523):

- KPI harness: fixed task set + fixed model → release-notes headline number.
- Baseline runner: user's own scenarios + current harness → per-user "is it working for me?" signal.

The oclif command file (`src/oclif/commands/harness/baseline.ts`) is deliberately NOT shipped in this PR. The command needs the daemon-transport wiring for harness subcommands that Task 7.1 (status/inspect) establishes — none of Phase 7's CLI surface is merged yet, so there's no transport pattern to follow. Shipping just the runner class now (pure, testable, DI-ready) lets the oclif wrapper land in a small follow-up after 7.1 defines the pattern. This matches the "reusable logic now, wiring later" pattern used by the KPI harness scaffolding PR (#523).

- Pure orchestration; caller injects the `HarnessToolsFactory` (production wires this to `SandboxService.buildHarnessTools({dryRun: true})` for write-blocked eval runs, matching Phase 6.1).
- Raw arm fabricates a synthetic version from the stored template (`getTemplate(commandType, currentVersion.projectType).code`) — no need to persist it; the module builder accepts any code string.
- v1.0 scope narrowing: only `commandType === 'curate'` is supported (query / chat throw `UNSUPPORTED_COMMAND_TYPE`) until query templates ship in the Phase 4.3 follow-up.
- `count` bounded [1, 50]; scenarios < 3 fails with a clear "run curate N more times first" message so users know how to unblock themselves.
- Each scenario runs once per arm (single-run side-by-side display). Contrast Phase 6.1 Evaluator, which does 10 runs per scenario for statistical significance — different semantics, different surface, not shared on purpose.

Tests (9): COUNT_OUT_OF_RANGE (0, 51), UNSUPPORTED_COMMAND_TYPE (query, chat), INSUFFICIENT_SCENARIOS (2 scenarios), NO_CURRENT_VERSION (empty store), happy-path delta = +100%, no-op delta = 0, count-caps-slice, harness-throws-captures-stderr. All 278 harness unit tests pass. Build + lint + typecheck clean.
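The guard rails described above could be sketched roughly as follows (the error codes are taken from the test list; the function shape and messages are assumptions, not the actual `HarnessBaselineRunner` API):

```typescript
type BaselineErrorCode =
  | 'COUNT_OUT_OF_RANGE'
  | 'UNSUPPORTED_COMMAND_TYPE'
  | 'INSUFFICIENT_SCENARIOS'
  | 'NO_CURRENT_VERSION';

class BaselineError extends Error {
  constructor(public code: BaselineErrorCode, message: string) {
    super(message);
  }
}

// Hypothetical pre-flight validation for a baseline run.
function validateBaselineRun(opts: {
  count: number;
  commandType: 'curate' | 'query' | 'chat';
  storedScenarios: number;
  hasCurrentVersion: boolean;
}): void {
  if (opts.count < 1 || opts.count > 50)
    throw new BaselineError('COUNT_OUT_OF_RANGE', 'count must be in [1, 50]');
  if (opts.commandType !== 'curate')
    throw new BaselineError(
      'UNSUPPORTED_COMMAND_TYPE',
      `${opts.commandType} is not supported in v1.0`,
    );
  if (!opts.hasCurrentVersion)
    throw new BaselineError('NO_CURRENT_VERSION', 'no harness version stored yet');
  if (opts.storedScenarios < 3)
    throw new BaselineError(
      'INSUFFICIENT_SCENARIOS',
      `run curate ${3 - opts.storedScenarios} more times first`,
    );
}
```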
Claude finished @RyanNg1403's task in 2m 6s · PR Review — AutoHarness V2 KPI Harness (re-review after branch sync)

Overall Assessment: 3.5 / 5. This re-review was triggered by the merge-sync commit. Both blocking issues from the previous review remain open; the overall assessment and score are unchanged. A brief re-audit confirmed no regressions were introduced and no new issues were found.

Still-Open Blocking Issues

1. Current code:

```typescript
commandType: (parsed.commandType as Fixture['commandType']) ?? 'curate',
fixtureVersion: (parsed.fixtureVersion as string) ?? 'unversioned',
targetModel: (parsed.targetModel as string) ?? 'unknown',
```

Suggested validated replacement:

```typescript
const VALID_COMMAND_TYPES = ['chat', 'curate', 'query'] as const

const rawCt = parsed.commandType
const commandType: Fixture['commandType'] =
  typeof rawCt === 'string' &&
  (VALID_COMMAND_TYPES as readonly string[]).includes(rawCt)
    ? (rawCt as Fixture['commandType'])
    : 'curate'

const rawVersion = parsed.fixtureVersion
const fixtureVersion = typeof rawVersion === 'string' ? rawVersion : 'unversioned'

const rawTarget = parsed.targetModel
const targetModel = typeof rawTarget === 'string' ? rawTarget : 'unknown'
```

2. Missing … Options (pick one): …

Still-Open Suggestions

3. No … — a malformed fixture file will surface a raw …

4. Silent … — if the two arms ever diverge on task IDs, the delta silently uses 0% as the raw baseline. A task-ID mismatch is a bug and should throw.

5. …

6. …

7. Trailing whitespace at runner.ts:368 (three spaces before …)

What Still Looks Good
Claude finished @RyanNg1403's task in 2m 2s · PR Review — AutoHarness V2 KPI Harness (re-review after 9c366d3)

Overall Assessment: 4.5 / 5. All seven issues from the previous two reviews have been addressed. Both blocking violations are gone, all five suggestions are implemented, and the test suite grew from 19 → 28 tests. The only remaining item is the cosmetic trailing whitespace, which is the only thing keeping this below a 5.

Previous Issues — Status

Detail: What Was Fixed

Issue 1 — …

Issue 2 — …

Issue 3 — …

Issue 4 — task-ID mismatch now throws:

```typescript
throw new Error(`task ID mismatch: harness arm has '${h.taskId}' but raw arm does not`)
```

Test at …

Issues 5 & 6 — …

Remaining Item

Nitpick — trailing whitespace (…)

What Looks Good
## Summary

`scripts/autoharness-kpi/`. Scaffolding + 20-task fixture + stub-LLM runner are fully operational; `--llm real` wiring lands in a follow-up. `brv harness baseline` (Task 7.5, per-user KPI) is a distinct command — not affected.

## Type of change

## Scope (select all touched areas)

## Linked issues

`brv harness baseline` (per-user KPI — same math, different consumer)

## Root cause (bug fixes only, otherwise write N/A)

N/A
## Layout decision
Chose the `scripts/<feature>/` namespace over a top-level `kpi/` directory after thinking through it from a big-tech senior-engineer lens: KPI tooling is a subsystem concern (AutoHarness V2 only), not a first-class project concern. Top-level promotion would over-scope it. Kubernetes uses `hack/`, Rust uses `src/tools/`, CockroachDB uses `pkg/cmd/`, VSCode co-locates eval with each extension — none promote single-subsystem eval to repo root.

## Test plan
`test/unit/scripts/autoharness-kpi/runner.test.ts` (19 tests)

- `loadFixture`: shipped fixture parses cleanly; malformed / empty rejected
- `makeStubLlmClient`: deterministic per-(task, arm); unknown-id → both-arms-succeed fallback
- `runArm`: aggregates runs-per-task; overall = flat mean
- `computeKpiReport`: delta math = harness − raw; metadata flows through
- `exitCodeForReport`: 0 at delta = 0.30 (inclusive floor), 1 at 0.29, 1 on negative delta
- `parseArgs`: defaults + all flags + validates `--llm` / `--runs`
- `main` end-to-end with stub: shipped fixture runs → delta = 0.50 → exit 0; `--output` writes valid JSON; `--llm real` throws clear not-implemented error

## User-visible changes
None. Release-engineering tool only.
## Evidence

## Checklist

## Risks and mitigations
Risk: Scaffolding-only — `--llm real` is NOT implemented in this PR. The real-LLM path (~50 LOC wiring the agent LLM stack with Llama 3.1 8B) lands in a follow-up. Without it, the v1.0 ship gate can't run. Mitigation: … `--llm real` … so no one thinks they can run the ship gate yet.

Risk: Fixture realism. The 20 curate tasks are hand-crafted against this codebase. Too-easy tasks (any model succeeds) or too-hard tasks (no model succeeds) dilute the delta signal.

Risk: 30pp threshold lock-in. If the first real-LLM run produces delta < 30pp, we either rework the refinement prompts or lower the bar. The bar is defined in `v1-design-decisions.md` §2.7; moving it needs a design review. Mitigation: `SHIP_GATE_DELTA` is a named const at the top of `runner.ts` — a one-line change if the bar moves. A test pins the current value so drift shows up in diff.

Risk: `expected-baseline.json` is a placeholder. Until the real-LLM run lands, the file has `null` fields (`measuredAt: null` etc.); the `$comment` field explains when and how to populate it. Regressions on future ships compare against this file.

Risk: Holding merge for manual review. PR explicitly flags this at the top — no auto-merge expected.
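For context, the placeholder might look roughly like this; only `measuredAt: null` and the `$comment` field are mentioned in this PR, and the remaining fields are assumptions:

```typescript
// Hypothetical shape of the placeholder expected-baseline.json.
// Only `measuredAt: null` and `$comment` are named in the PR text;
// the rate/delta fields below are illustrative assumptions.
const placeholderBaseline = {
  $comment:
    'Populate at v1.0 ship with the real Llama 3.1 8B measurement; ' +
    'future ships regress against this file.',
  measuredAt: null as string | null,
  rawRate: null as number | null,
  harnessRate: null as number | null,
  delta: null as number | null,
};
```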