feat: [ENG-2325] AutoHarness V2 baseline runner (dual-arm replay)#526
feat: [ENG-2325] AutoHarness V2 baseline runner (dual-arm replay)#526danhdoan merged 3 commits intoproj/autoharness-v2from
Conversation
Phase 7 Task 7.5 — Tier 1 Q1 brutal-review item. Ships the reusable `HarnessBaselineRunner` that powers the future `brv harness baseline` CLI command: replays the last N stored scenarios against two arms (pass-through template = raw, current version = harness), reports per-scenario outcomes + overall rates + delta. Complementary to the reference KPI harness (`scripts/autoharness-kpi/`, PR #523): - KPI harness: fixed task set + fixed model → release-notes headline number. - Baseline runner: user's own scenarios + current harness → per-user "is it working for me?" signal. The oclif command file (`src/oclif/commands/harness/baseline.ts`) is deliberately NOT shipped in this PR. The command needs the daemon-transport wiring for harness subcommands that Task 7.1 (status/inspect) establishes — none of Phase 7's CLI surface is merged yet, so there's no transport pattern to follow. Shipping just the runner class now (pure, testable, DI-ready) lets the oclif wrapper land in a small follow-up after 7.1 defines the pattern. This matches the "reusable logic now, wiring later" pattern used by the KPI harness scaffolding PR (#523). - Pure orchestration; caller injects the `HarnessToolsFactory` (production wires this to `SandboxService.buildHarnessTools({dryRun: true})` for write-blocked eval runs, matching Phase 6.1). - Raw arm fabricates a synthetic version from the stored template (`getTemplate(commandType, currentVersion.projectType).code`) — no need to persist it; the module builder accepts any code string. - `v1.0 scope narrowing`: only `commandType === 'curate'` is supported (query / chat throw `UNSUPPORTED_COMMAND_TYPE`) until query templates ship in the Phase 4.3 follow-up. - `count` bounded [1, 50]; scenarios < 3 fails with a clear "run curate N more times first" message so users know how to unblock themselves. - Each scenario runs once per arm (single-run side-by-side display). Contrast Phase 6.1 Evaluator which does 10 runs per scenario for statistical significance — different semantics, different surface, not shared on purpose. Tests (9): COUNT_OUT_OF_RANGE (0, 51), UNSUPPORTED_COMMAND_TYPE (query, chat), INSUFFICIENT_SCENARIOS (2 scenarios), NO_CURRENT_VERSION (empty store), happy-path delta = +100%, no-op delta = 0, count-caps-slice, harness-throws-captures- stderr. All 278 harness unit tests pass. Build + lint + typecheck clean.
Proj/autoharness-v2 picked up a new IHarnessStore.deleteOutcome method between branch-point and PR CI. Rebase picked up the interface change; this commit updates the makeStoreStub helper to satisfy it.
|
Claude finished @danhdoan's task in 2m 21s —— View job PR Review —
|
- Fix INSUFFICIENT_SCENARIOS to guard on store coverage, not the sliced window (caller passing a too-low --count now surfaces as a valid run, not a misleading "run curate N more times" error). Add test 4b. - Hoist HarnessModule to top-level import type block; drop inline dynamic type import. - Replace `as SupportedCommandType` cast with an isSupportedBaselineCommand type predicate so TypeScript narrows at the guard. - Swap ternary fn-lookup for an explicit fnMap — prevents silent fall-through if chat is ever added to the supported set. - Document the raw-first call-order coupling in makeTwoArmToolsFactory.
|
Claude finished @danhdoan's task in 1m 50s —— View job PR Review —
|

Summary
brv harness baselineCLI command; before the CLI, the underlying dual-arm replay logic needs to exist as reusable code.HarnessBaselineRunner— pure orchestration class that takes the last N stored scenarios, replays them through two arms (pass-through template = raw; current stored version = harness), returns per-scenario outcomes + overall rates +delta = harness - raw.src/oclif/commands/harness/baseline.tsis deferred to a follow-up PR — it needs the daemon-transport wiring pattern that Task 7.1 (status/inspect) establishes, which isn't merged yet. Shipping the runner now (ready for injection once the CLI layer exists) matches the "reusable logic now, wiring later" pattern from PR feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner #523 (KPI scaffolding).Type of change
Scope (select all touched areas)
Linked issues
src/oclif/commands/harness/baseline.ts— lands after Task 7.1 defines the CLI↔daemon transport patternRoot cause (bug fixes only, otherwise write
N/A)N/A
Test plan
test/unit/agent/harness/harness-baseline-runner.test.ts(9 tests)COUNT_OUT_OF_RANGE: count=0 → error; count=51 → errorUNSUPPORTED_COMMAND_TYPE: commandType=query/chat → error (v1.0 curate-only per Phase 4.3 scope narrowing)INSUFFICIENT_SCENARIOS: 2 scenarios in store → error withrequired: 3andfound: 2in detailsNO_CURRENT_VERSION: store has no version for the pair → errorcount=5→ only first 5 run, scenario IDs in orderharnessSuccess=false,harnessStderrpopulated; raw arm unaffectedUser-visible changes
None yet. User-facing impact lands when the oclif wrapper ships in the follow-up.
Evidence
Checklist
Risks and mitigations
Risk: Runner-only scope — no user-facing command yet. A reviewer could reasonably ask "so users can't actually run baseline yet?" Correct.
Risk: v1.0
UNSUPPORTED_COMMAND_TYPEfor query/chat. Users runningbrv harness baseline --commandType querywill see an error.SUPPORTED_BASELINE_COMMANDSonce query templates land. Set is aReadonlySet<string>so any command name works — no need to touch the type union.Risk:
HarnessEvaluator.executeSingleRunlogic is mirrored here, not shared. Slight duplication.Risk: Raw arm fabricates a synthetic version from the template. The synthetic version isn't persisted but does go through
moduleBuilder.build().${currentVersion.id}:raw) to avoid any collision with real version ids if a log ever surfaces the raw build's metadata.Risk:
dryRunenforcement depends on caller wiring. The runner doesn't enforce it directly; it expects theHarnessToolsFactoryto return write-blocked tools.SandboxService.buildHarnessTools({dryRun: true}). Documented in the file header and the factory parameter's JSDoc.