feat: [ENG-2325] AutoHarness V2 baseline runner (dual-arm replay) by danhdoan · Pull Request #526 · campfirein/byterover-cli

danhdoan · 2026-04-22T09:35:21Z

Summary

Problem: The v1.0 KPI story has two halves — a reference number for release notes (shipped in feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner #523) and a per-user "is it working for me?" number. The latter needs a runtime brv harness baseline CLI command; before the CLI, the underlying dual-arm replay logic needs to exist as reusable code.
Why it matters: Tier 1 Q1 brutal-review item. Without per-user KPI, users have only our release-notes number (measured on our own repo) — useless for "does it help on MY codebase?" Shipping the runner now means the oclif wrapper is a small follow-up once Task 7.1 defines the daemon-transport pattern for harness commands.
What changed: Ships HarnessBaselineRunner — pure orchestration class that takes the last N stored scenarios, replays them through two arms (pass-through template = raw; current stored version = harness), returns per-scenario outcomes + overall rates + delta = harness - raw.
What did NOT change (scope boundary): No oclif command file. The CLI wrapper src/oclif/commands/harness/baseline.ts is deferred to a follow-up PR — it needs the daemon-transport wiring pattern that Task 7.1 (status/inspect) establishes, which isn't merged yet. Shipping the runner now (ready for injection once the CLI layer exists) matches the "reusable logic now, wiring later" pattern from PR feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner #523 (KPI scaffolding).

Type of change

New feature

Scope (select all touched areas)

Agent / Tools

Linked issues

Closes ENG-2325
Related: ENG-2332 (KPI harness — reference baseline for release notes, complementary)
Blocked follow-up: src/oclif/commands/harness/baseline.ts — lands after Task 7.1 defines the CLI↔daemon transport pattern

Root cause (bug fixes only, otherwise write `N/A`)

N/A

Test plan

Coverage added:
- Unit test
Test file: test/unit/agent/harness/harness-baseline-runner.test.ts (9 tests)
Key scenarios:
- COUNT_OUT_OF_RANGE: count=0 → error; count=51 → error
- UNSUPPORTED_COMMAND_TYPE: commandType=query/chat → error (v1.0 curate-only per Phase 4.3 scope narrowing)
- INSUFFICIENT_SCENARIOS: 2 scenarios in store → error with required: 3 and found: 2 in details
- NO_CURRENT_VERSION: store has no version for the pair → error
- Happy path: raw arm fails every scenario, harness arm succeeds → rawRate=0, harnessRate=1, delta=+1.0, per-scenario details populated with stderr
- No-op path: both arms pass → delta=0
- Slice semantics: 10 scenarios in store, count=5 → only first 5 run, scenario IDs in order
- Failure capture: harness throws → harnessSuccess=false, harnessStderr populated; raw arm unaffected

User-visible changes

None yet. User-facing impact lands when the oclif wrapper ships in the follow-up.

Evidence

Failing test/log before + passing after

$ npx mocha --forbid-only 'test/unit/agent/harness/harness-baseline-runner.test.ts'
  HarnessBaselineRunner
    ✔ 1. throws COUNT_OUT_OF_RANGE when count is 0
    ✔ 2. throws COUNT_OUT_OF_RANGE when count exceeds BASELINE_MAX_COUNT
    ✔ 3. throws UNSUPPORTED_COMMAND_TYPE for query / chat (v1.0 curate-only)
    ✔ 4. throws INSUFFICIENT_SCENARIOS when < 3 scenarios exist
    ✔ 5. throws NO_CURRENT_VERSION when the pair has no stored version
    ✔ 6. happy path: raw fails every scenario, harness succeeds → delta = 100%
    ✔ 7. mixed path: both arms pass → delta = 0
    ✔ 8. count caps the scenarios slice
    ✔ 9. harness run that throws is captured as failure with stderr

  9 passing (13ms)

$ npx mocha --forbid-only 'test/unit/agent/harness/**/*.test.ts'
  278 passing (25s)

Checklist

Tests added or updated and passing
Lint passes
Type check passes
Build succeeds
Commits follow Conventional Commits format
Documentation updated — N/A (internal; in-file docstrings + handoff contract)
No breaking changes
Branch is up to date with `proj/autoharness-v2` (post main-sync)

Risks and mitigations

Risk: Runner-only scope — no user-facing command yet. A reviewer could reasonably ask "so users can't actually run baseline yet?" Correct.
- Mitigation: The follow-up oclif command is a ~50-100 LOC wrapper once Task 7.1 defines the CLI↔daemon pattern. Shipping the runner now unblocks that follow-up and lets it focus purely on wiring, not logic. Precedent: the KPI harness PR (feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner #523) took the same approach.
Risk: v1.0 UNSUPPORTED_COMMAND_TYPE for query/chat. Users running brv harness baseline --commandType query will see an error.
- Mitigation: Error message names the constraint explicitly and points at the Phase 4.3 follow-up. Widening support is a single-line change in SUPPORTED_BASELINE_COMMANDS once query templates land. Set is a ReadonlySet<string> so any command name works — no need to touch the type union.
Risk: HarnessEvaluator.executeSingleRun logic is mirrored here, not shared. Slight duplication.
- Mitigation: Documented as intentional — evaluator does 10-run statistical mean; baseline does single-run side-by-side. Different semantics, different assertion surface. Extracting a shared helper would couple two different concerns and make both harder to reason about. If a third consumer emerges, revisit.
Risk: Raw arm fabricates a synthetic version from the template. The synthetic version isn't persisted but does go through moduleBuilder.build().
- Mitigation: The module builder accepts any code string by design (it's how the evaluator works). The synthetic version uses a derived id (${currentVersion.id}:raw) to avoid any collision with real version ids if a log ever surfaces the raw build's metadata.
Risk: dryRun enforcement depends on caller wiring. The runner doesn't enforce it directly; it expects the HarnessToolsFactory to return write-blocked tools.
- Mitigation: Matches the Phase 6.1 Evaluator pattern exactly. Production will wire both to SandboxService.buildHarnessTools({dryRun: true}). Documented in the file header and the factory parameter's JSDoc.

github-actions · 2026-04-22T09:35:36Z

PR Review — `HarnessBaselineRunner` (dual-arm replay)

Gather context and understand the request
Read src/agent/infra/harness/harness-baseline-runner.ts
Read test/unit/agent/harness/harness-baseline-runner.test.ts
Check related harness files for context
Analyze and post review

View job run

github-actions · 2026-04-22T09:38:51Z

PR Review — `HarnessBaselineRunner` (dual-arm replay)

Gather context and understand the request
Read src/agent/infra/harness/harness-baseline-runner.ts
Read test/unit/agent/harness/harness-baseline-runner.test.ts
Check related harness files for context
Analyze and post review

View job run

github-actions · 2026-04-22T09:39:55Z

PR Review — `HarnessBaselineRunner` (dual-arm replay)

Gather context and understand the request
Read src/agent/infra/harness/harness-baseline-runner.ts
Read test/unit/agent/harness/harness-baseline-runner.test.ts
Check related harness files for context
Analyze and post review

View job run

Phase 7 Task 7.5 — Tier 1 Q1 brutal-review item. Ships the reusable `HarnessBaselineRunner` that powers the future `brv harness baseline` CLI command: replays the last N stored scenarios against two arms (pass-through template = raw, current version = harness), reports per-scenario outcomes + overall rates + delta. Complementary to the reference KPI harness (`scripts/autoharness-kpi/`, PR #523): - KPI harness: fixed task set + fixed model → release-notes headline number. - Baseline runner: user's own scenarios + current harness → per-user "is it working for me?" signal. The oclif command file (`src/oclif/commands/harness/baseline.ts`) is deliberately NOT shipped in this PR. The command needs the daemon-transport wiring for harness subcommands that Task 7.1 (status/inspect) establishes — none of Phase 7's CLI surface is merged yet, so there's no transport pattern to follow. Shipping just the runner class now (pure, testable, DI-ready) lets the oclif wrapper land in a small follow-up after 7.1 defines the pattern. This matches the "reusable logic now, wiring later" pattern used by the KPI harness scaffolding PR (#523). - Pure orchestration; caller injects the `HarnessToolsFactory` (production wires this to `SandboxService.buildHarnessTools({dryRun: true})` for write-blocked eval runs, matching Phase 6.1). - Raw arm fabricates a synthetic version from the stored template (`getTemplate(commandType, currentVersion.projectType).code`) — no need to persist it; the module builder accepts any code string. - `v1.0 scope narrowing`: only `commandType === 'curate'` is supported (query / chat throw `UNSUPPORTED_COMMAND_TYPE`) until query templates ship in the Phase 4.3 follow-up. - `count` bounded [1, 50]; scenarios < 3 fails with a clear "run curate N more times first" message so users know how to unblock themselves. - Each scenario runs once per arm (single-run side-by-side display). Contrast Phase 6.1 Evaluator which does 10 runs per scenario for statistical significance — different semantics, different surface, not shared on purpose. Tests (9): COUNT_OUT_OF_RANGE (0, 51), UNSUPPORTED_COMMAND_TYPE (query, chat), INSUFFICIENT_SCENARIOS (2 scenarios), NO_CURRENT_VERSION (empty store), happy-path delta = +100%, no-op delta = 0, count-caps-slice, harness-throws-captures- stderr. All 278 harness unit tests pass. Build + lint + typecheck clean.

Proj/autoharness-v2 picked up a new IHarnessStore.deleteOutcome method between branch-point and PR CI. Rebase picked up the interface change; this commit updates the makeStoreStub helper to satisfy it.

github-actions · 2026-04-22T09:41:33Z

Claude finished @danhdoan's task in 2m 21s —— View job

PR Review — `HarnessBaselineRunner` (dual-arm replay)

Gather context and understand the request
Read src/agent/infra/harness/harness-baseline-runner.ts
Read test/unit/agent/harness/harness-baseline-runner.test.ts
Check related harness files for context
Analyze and post review

Overall

Score: 4 / 5

Solid implementation that fits naturally into the existing harness patterns (HarnessEvaluator, HarnessModuleBuilder, HarnessToolsFactory). The dual-arm approach is clean, the error taxonomy is consistent with the rest of the harness module, and the test coverage is comprehensive. A few things worth addressing before merge, one of which is a real logic bug.

Issues

`suggestion` Slice-before-check produces a misleading error message — `harness-baseline-runner.ts:119–127`

const scenarios = allScenarios.slice(0, count)  // ← applied first

if (scenarios.length < BASELINE_MIN_SCENARIOS) { // ← checks post-slice length
  throw new HarnessBaselineRunnerError(
    `not enough scenarios — baseline needs at least ${BASELINE_MIN_SCENARIOS}, found ${scenarios.length}. Run curate ${BASELINE_MIN_SCENARIOS - scenarios.length} more time(s) first.`,
    'INSUFFICIENT_SCENARIOS',
    {found: scenarios.length, required: BASELINE_MIN_SCENARIOS},
  )
}

If the user passes --count 2 but the store has 10 scenarios, scenarios.length is 2 (not 10), and the error tells them to "run curate 2 more time(s) first" — which is wrong: they have plenty of data, they just asked for too few. The fix is to check allScenarios.length before slicing so the guard is about store coverage, not the requested window:

if (allScenarios.length < BASELINE_MIN_SCENARIOS) {
  throw new HarnessBaselineRunnerError(
    `not enough scenarios — baseline needs at least ${BASELINE_MIN_SCENARIOS}, found ${allScenarios.length}. Run curate ${BASELINE_MIN_SCENARIOS - allScenarios.length} more time(s) first.`,
    'INSUFFICIENT_SCENARIOS',
    {found: allScenarios.length, required: BASELINE_MIN_SCENARIOS},
  )
}
const scenarios = allScenarios.slice(0, count)

Also missing a test for this case (count=2, store has 10 → should succeed). Fix this →

`nitpick` Inline dynamic import type — `harness-baseline-runner.ts:199`

private async runSingleScenario(
  module: import('../../core/domain/harness/types.js').HarnessModule,

HarnessModule is already imported at the top of the file (import type {HarnessContext, ...} at line 24 doesn't include it, but it could). The inline import() form works but is inconsistent with all the other type imports in this file. Add HarnessModule to the top-level import type block.

`nitpick` `as SupportedCommandType` cast — `harness-baseline-runner.ts:139`

const rawCode = getTemplate(
  commandType as SupportedCommandType,
  currentVersion.projectType,
).code

CLAUDE.md calls out "Avoid as Type assertions". The guard at line 110 (SUPPORTED_BASELINE_COMMANDS.has(commandType)) has already validated this, but TypeScript doesn't narrow a ReadonlySet<string>.has() check. A simple helper narrows without a cast:

function isSupportedCommandType(cmd: string): cmd is SupportedCommandType {
  return SUPPORTED_BASELINE_COMMANDS.has(cmd)
}

Alternatively, change SUPPORTED_BASELINE_COMMANDS from ReadonlySet<string> to ReadonlySet<SupportedCommandType> — the narrowing still requires a predicate but at least the source of truth is typed.

`nitpick` `runSingleScenario` silently maps `chat` to `module.curate` — `harness-baseline-runner.ts:214`

const fn = scenario.commandType === 'query' ? module.query : module.curate

If chat is ever added to SUPPORTED_BASELINE_COMMANDS, scenarios with commandType === 'chat' fall through to module.curate, which is silently wrong. A lookup table ({curate: module.curate, query: module.query}) is explicit and extensible.

`nitpick` Test arm-discrimination is tightly coupled to implementation order — `harness-baseline-runner.test.ts:107–113`

let call = 0
return () => {
  call++
  const isRawArm = call % 2 === 1 // 1st, 3rd, 5th... are raw

This relies on runSingleScenario being called in raw→harness order per scenario. If the implementation ever reorders the arms (e.g. to run harness first and subtract), the test would swap arm assignments and silently invert the assertions. The factory could instead accept a named arm identifier injected via the ctx.env or a separate parameter, making the distinction explicit. Low severity since the test is checking internal behavior, but worth noting for future maintainers.

What's good

HarnessBaselineRunnerError with typed code — consistent with the rest of the harness error taxonomy and makes the follow-up oclif wrapper easy to map to exit codes.
Serial loop with explicit eslint-disable — the decision is documented and the tradeoff is valid (deterministic log ordering beats trivial latency savings for a user-facing CLI).
${currentVersion.id}:raw id collision avoidance — good defensive choice.
BASELINE_MIN_SCENARIOS = 3 constant surfaced and used in test assertions rather than magic numbers.
Test 9 (thrown harness arm) correctly captures harnessStderr without affecting the raw arm — this asymmetry matters for the delta calculation.
dryRun delegation documented at the file header and factory JSDoc — matches the evaluator pattern cleanly.

- Fix INSUFFICIENT_SCENARIOS to guard on store coverage, not the sliced window (caller passing a too-low --count now surfaces as a valid run, not a misleading "run curate N more times" error). Add test 4b. - Hoist HarnessModule to top-level import type block; drop inline dynamic type import. - Replace `as SupportedCommandType` cast with an isSupportedBaselineCommand type predicate so TypeScript narrows at the guard. - Swap ternary fn-lookup for an explicit fnMap — prevents silent fall-through if chat is ever added to the supported set. - Document the raw-first call-order coupling in makeTwoArmToolsFactory.

github-actions · 2026-04-22T09:48:05Z

Claude finished @danhdoan's task in 1m 50s —— View job

PR Review — `HarnessBaselineRunner` (dual-arm replay)

Gather context and understand the request
Read updated src/agent/infra/harness/harness-baseline-runner.ts
Read updated test/unit/agent/harness/harness-baseline-runner.test.ts
Check related harness files for context
Verify previous review feedback was addressed
Post updated review

Overall

Score: 5 / 5

All five issues from the previous review have been addressed cleanly. The one real logic bug (slice-before-check) is fixed, and every nitpick was resolved with minimal, targeted changes. The new test 4b correctly pins the corrected semantics. This is ready to merge.

Previous feedback — all addressed

✅ `suggestion` Slice-before-check bug — fixed

allScenarios.length is now checked before the slice (lines 126–132), and the slice moved below. The comment on line 124–125 documents the intent precisely. New test 4b (line 219–239) locks in the corrected behaviour: 10 stored + count=2 no longer throws.

✅ `nitpick` Inline dynamic import type — fixed

HarnessModule is now in the top-level import type block at line 24. runSingleScenario signature at line 202 uses it directly.

✅ `nitpick` `as SupportedCommandType` cast — fixed

isSupportedBaselineCommand type predicate (lines 84–86) narrows commandType before getTemplate is called at line 145. No cast required downstream.

✅ `nitpick` `chat` fallthrough in `runSingleScenario` — fixed

Lookup table at lines 217–224 is explicit and fails closed: unknown command types return {success: false, stderr: ...} rather than silently routing to module.curate.

✅ `nitpick` Test arm-discrimination coupling — acknowledged

IMPLEMENTATION COUPLING: block (lines 93–103) in the test file names the dependency on arm call order and explains why a simpler approach is appropriate here. The note also flags what would need updating if the order is ever reversed.

Remaining minor observation

`nitpick` Two `as` casts remain in `runSingleScenario` — `harness-baseline-runner.ts:210–211`

commandType: scenario.commandType as 'chat' | 'curate' | 'query',
projectType: scenario.projectType as ProjectType,

ValidatedEvaluationScenario.commandType is typed as string in the Zod-derived type (the schema uses z.string().min(1)), while HarnessContext.env.commandType requires the literal union. The cast is technically necessary given the current schema typing, and these were present in the original commit, so not a blocker. A longer-term fix would be to narrow ValidatedEvaluationScenario.commandType to the same literal union at the schema level — but that's a cross-file change and out of scope for this PR. Fix later →

What's good (unchanged)

isSupportedBaselineCommand predicate is a clean, reusable pattern — if a second call site ever needs the guard it's already extracted.
The lookup table in runSingleScenario leaves an obvious extension point: adding chat is a one-liner.
Test 4b asserts caught === undefined (not just absence of throw), which is a strong assertion style.
Serial loop + eslint-disable comments remain correct and well-documented.

danhdoan requested review from RyanNg1403, bao-byterover, cuongdo-byterover, leehpham and ngduyanhece as code owners April 22, 2026 09:35

danhdoan force-pushed the feat/ENG-2325 branch from 53c02f8 to 409e5bd Compare April 22, 2026 09:39

danhdoan added 2 commits April 22, 2026 16:40

test: [ENG-2325] add deleteOutcome stub after rebase on proj

8c5c0d1

Proj/autoharness-v2 picked up a new IHarnessStore.deleteOutcome method between branch-point and PR CI. Rebase picked up the interface change; this commit updates the makeStoreStub helper to satisfy it.

danhdoan force-pushed the feat/ENG-2325 branch from 409e5bd to 8c5c0d1 Compare April 22, 2026 09:41