Skip to content

feat: [ENG-2325] AutoHarness V2 baseline runner (dual-arm replay)#526

Merged
danhdoan merged 3 commits intoproj/autoharness-v2from
feat/ENG-2325
Apr 22, 2026
Merged

feat: [ENG-2325] AutoHarness V2 baseline runner (dual-arm replay)#526
danhdoan merged 3 commits intoproj/autoharness-v2from
feat/ENG-2325

Conversation

@danhdoan
Copy link
Copy Markdown
Collaborator

Summary

  • Problem: The v1.0 KPI story has two halves — a reference number for release notes (shipped in feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner #523) and a per-user "is it working for me?" number. The latter needs a runtime brv harness baseline CLI command; before the CLI, the underlying dual-arm replay logic needs to exist as reusable code.
  • Why it matters: Tier 1 Q1 brutal-review item. Without per-user KPI, users have only our release-notes number (measured on our own repo) — useless for "does it help on MY codebase?" Shipping the runner now means the oclif wrapper is a small follow-up once Task 7.1 defines the daemon-transport pattern for harness commands.
  • What changed: Ships HarnessBaselineRunner — pure orchestration class that takes the last N stored scenarios, replays them through two arms (pass-through template = raw; current stored version = harness), returns per-scenario outcomes + overall rates + delta = harness - raw.
  • What did NOT change (scope boundary): No oclif command file. The CLI wrapper src/oclif/commands/harness/baseline.ts is deferred to a follow-up PR — it needs the daemon-transport wiring pattern that Task 7.1 (status/inspect) establishes, which isn't merged yet. Shipping the runner now (ready for injection once the CLI layer exists) matches the "reusable logic now, wiring later" pattern from PR feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner #523 (KPI scaffolding).

Type of change

  • New feature

Scope (select all touched areas)

  • Agent / Tools

Linked issues

  • Closes ENG-2325
  • Related: ENG-2332 (KPI harness — reference baseline for release notes, complementary)
  • Blocked follow-up: src/oclif/commands/harness/baseline.ts — lands after Task 7.1 defines the CLI↔daemon transport pattern

Root cause (bug fixes only, otherwise write N/A)

N/A

Test plan

  • Coverage added:
    • Unit test
  • Test file: test/unit/agent/harness/harness-baseline-runner.test.ts (9 tests)
  • Key scenarios:
    • COUNT_OUT_OF_RANGE: count=0 → error; count=51 → error
    • UNSUPPORTED_COMMAND_TYPE: commandType=query/chat → error (v1.0 curate-only per Phase 4.3 scope narrowing)
    • INSUFFICIENT_SCENARIOS: 2 scenarios in store → error with required: 3 and found: 2 in details
    • NO_CURRENT_VERSION: store has no version for the pair → error
    • Happy path: raw arm fails every scenario, harness arm succeeds → rawRate=0, harnessRate=1, delta=+1.0, per-scenario details populated with stderr
    • No-op path: both arms pass → delta=0
    • Slice semantics: 10 scenarios in store, count=5 → only first 5 run, scenario IDs in order
    • Failure capture: harness throws → harnessSuccess=false, harnessStderr populated; raw arm unaffected

User-visible changes

None yet. User-facing impact lands when the oclif wrapper ships in the follow-up.

Evidence

  • Failing test/log before + passing after
$ npx mocha --forbid-only 'test/unit/agent/harness/harness-baseline-runner.test.ts'
  HarnessBaselineRunner
    ✔ 1. throws COUNT_OUT_OF_RANGE when count is 0
    ✔ 2. throws COUNT_OUT_OF_RANGE when count exceeds BASELINE_MAX_COUNT
    ✔ 3. throws UNSUPPORTED_COMMAND_TYPE for query / chat (v1.0 curate-only)
    ✔ 4. throws INSUFFICIENT_SCENARIOS when < 3 scenarios exist
    ✔ 5. throws NO_CURRENT_VERSION when the pair has no stored version
    ✔ 6. happy path: raw fails every scenario, harness succeeds → delta = 100%
    ✔ 7. mixed path: both arms pass → delta = 0
    ✔ 8. count caps the scenarios slice
    ✔ 9. harness run that throws is captured as failure with stderr

  9 passing (13ms)

$ npx mocha --forbid-only 'test/unit/agent/harness/**/*.test.ts'
  278 passing (25s)

Checklist

  • Tests added or updated and passing
  • Lint passes
  • Type check passes
  • Build succeeds
  • Commits follow Conventional Commits format
  • Documentation updated — N/A (internal; in-file docstrings + handoff contract)
  • No breaking changes
  • Branch is up to date with `proj/autoharness-v2` (post main-sync)

Risks and mitigations

  • Risk: Runner-only scope — no user-facing command yet. A reviewer could reasonably ask "so users can't actually run baseline yet?" Correct.

  • Risk: v1.0 UNSUPPORTED_COMMAND_TYPE for query/chat. Users running brv harness baseline --commandType query will see an error.

    • Mitigation: Error message names the constraint explicitly and points at the Phase 4.3 follow-up. Widening support is a single-line change in SUPPORTED_BASELINE_COMMANDS once query templates land. Set is a ReadonlySet<string> so any command name works — no need to touch the type union.
  • Risk: HarnessEvaluator.executeSingleRun logic is mirrored here, not shared. Slight duplication.

    • Mitigation: Documented as intentional — evaluator does 10-run statistical mean; baseline does single-run side-by-side. Different semantics, different assertion surface. Extracting a shared helper would couple two different concerns and make both harder to reason about. If a third consumer emerges, revisit.
  • Risk: Raw arm fabricates a synthetic version from the template. The synthetic version isn't persisted but does go through moduleBuilder.build().

    • Mitigation: The module builder accepts any code string by design (it's how the evaluator works). The synthetic version uses a derived id (${currentVersion.id}:raw) to avoid any collision with real version ids if a log ever surfaces the raw build's metadata.
  • Risk: dryRun enforcement depends on caller wiring. The runner doesn't enforce it directly; it expects the HarnessToolsFactory to return write-blocked tools.

    • Mitigation: Matches the Phase 6.1 Evaluator pattern exactly. Production will wire both to SandboxService.buildHarnessTools({dryRun: true}). Documented in the file header and the factory parameter's JSDoc.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

PR Review — HarnessBaselineRunner (dual-arm replay)

  • Gather context and understand the request
  • Read src/agent/infra/harness/harness-baseline-runner.ts
  • Read test/unit/agent/harness/harness-baseline-runner.test.ts
  • Check related harness files for context
  • Analyze and post review

View job run

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

PR Review — HarnessBaselineRunner (dual-arm replay)

  • Gather context and understand the request
  • Read src/agent/infra/harness/harness-baseline-runner.ts
  • Read test/unit/agent/harness/harness-baseline-runner.test.ts
  • Check related harness files for context
  • Analyze and post review

View job run

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

PR Review — HarnessBaselineRunner (dual-arm replay)

  • Gather context and understand the request
  • Read src/agent/infra/harness/harness-baseline-runner.ts
  • Read test/unit/agent/harness/harness-baseline-runner.test.ts
  • Check related harness files for context
  • Analyze and post review

View job run

Phase 7 Task 7.5 — Tier 1 Q1 brutal-review item. Ships the reusable
`HarnessBaselineRunner` that powers the future `brv harness baseline`
CLI command: replays the last N stored scenarios against two arms
(pass-through template = raw, current version = harness), reports
per-scenario outcomes + overall rates + delta.

Complementary to the reference KPI harness
(`scripts/autoharness-kpi/`, PR #523):
  - KPI harness: fixed task set + fixed model → release-notes
    headline number.
  - Baseline runner: user's own scenarios + current harness →
    per-user "is it working for me?" signal.

The oclif command file (`src/oclif/commands/harness/baseline.ts`)
is deliberately NOT shipped in this PR. The command needs the
daemon-transport wiring for harness subcommands that Task 7.1
(status/inspect) establishes — none of Phase 7's CLI surface is
merged yet, so there's no transport pattern to follow. Shipping
just the runner class now (pure, testable, DI-ready) lets the
oclif wrapper land in a small follow-up after 7.1 defines the
pattern.

This matches the "reusable logic now, wiring later" pattern used
by the KPI harness scaffolding PR (#523).

- Pure orchestration; caller injects the `HarnessToolsFactory`
  (production wires this to `SandboxService.buildHarnessTools({dryRun: true})`
  for write-blocked eval runs, matching Phase 6.1).
- Raw arm fabricates a synthetic version from the stored template
  (`getTemplate(commandType, currentVersion.projectType).code`) —
  no need to persist it; the module builder accepts any code
  string.
- `v1.0 scope narrowing`: only `commandType === 'curate'` is
  supported (query / chat throw `UNSUPPORTED_COMMAND_TYPE`) until
  query templates ship in the Phase 4.3 follow-up.
- `count` bounded [1, 50]; scenarios < 3 fails with a clear
  "run curate N more times first" message so users know how to
  unblock themselves.
- Each scenario runs once per arm (single-run side-by-side
  display). Contrast Phase 6.1 Evaluator which does 10 runs per
  scenario for statistical significance — different semantics,
  different surface, not shared on purpose.

Tests (9): COUNT_OUT_OF_RANGE (0, 51), UNSUPPORTED_COMMAND_TYPE
(query, chat), INSUFFICIENT_SCENARIOS (2 scenarios),
NO_CURRENT_VERSION (empty store), happy-path delta = +100%,
no-op delta = 0, count-caps-slice, harness-throws-captures-
stderr.

All 278 harness unit tests pass. Build + lint + typecheck clean.
Proj/autoharness-v2 picked up a new IHarnessStore.deleteOutcome method
between branch-point and PR CI. Rebase picked up the interface change;
this commit updates the makeStoreStub helper to satisfy it.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

Claude finished @danhdoan's task in 2m 21s —— View job


PR Review — HarnessBaselineRunner (dual-arm replay)

  • Gather context and understand the request
  • Read src/agent/infra/harness/harness-baseline-runner.ts
  • Read test/unit/agent/harness/harness-baseline-runner.test.ts
  • Check related harness files for context
  • Analyze and post review

Overall

Score: 4 / 5

Solid implementation that fits naturally into the existing harness patterns (HarnessEvaluator, HarnessModuleBuilder, HarnessToolsFactory). The dual-arm approach is clean, the error taxonomy is consistent with the rest of the harness module, and the test coverage is comprehensive. A few things worth addressing before merge, one of which is a real logic bug.


Issues

suggestion Slice-before-check produces a misleading error message — harness-baseline-runner.ts:119–127

const scenarios = allScenarios.slice(0, count)  // ← applied first

if (scenarios.length < BASELINE_MIN_SCENARIOS) { // ← checks post-slice length
  throw new HarnessBaselineRunnerError(
    `not enough scenarios — baseline needs at least ${BASELINE_MIN_SCENARIOS}, found ${scenarios.length}. Run curate ${BASELINE_MIN_SCENARIOS - scenarios.length} more time(s) first.`,
    'INSUFFICIENT_SCENARIOS',
    {found: scenarios.length, required: BASELINE_MIN_SCENARIOS},
  )
}

If the user passes --count 2 but the store has 10 scenarios, scenarios.length is 2 (not 10), and the error tells them to "run curate 2 more time(s) first" — which is wrong: they have plenty of data, they just asked for too few. The fix is to check allScenarios.length before slicing so the guard is about store coverage, not the requested window:

if (allScenarios.length < BASELINE_MIN_SCENARIOS) {
  throw new HarnessBaselineRunnerError(
    `not enough scenarios — baseline needs at least ${BASELINE_MIN_SCENARIOS}, found ${allScenarios.length}. Run curate ${BASELINE_MIN_SCENARIOS - allScenarios.length} more time(s) first.`,
    'INSUFFICIENT_SCENARIOS',
    {found: allScenarios.length, required: BASELINE_MIN_SCENARIOS},
  )
}
const scenarios = allScenarios.slice(0, count)

Also missing a test for this case (count=2, store has 10 → should succeed). Fix this →


nitpick Inline dynamic import type — harness-baseline-runner.ts:199

private async runSingleScenario(
  module: import('../../core/domain/harness/types.js').HarnessModule,

HarnessModule is already imported at the top of the file (import type {HarnessContext, ...} at line 24 doesn't include it, but it could). The inline import() form works but is inconsistent with all the other type imports in this file. Add HarnessModule to the top-level import type block.


nitpick as SupportedCommandType cast — harness-baseline-runner.ts:139

const rawCode = getTemplate(
  commandType as SupportedCommandType,
  currentVersion.projectType,
).code

CLAUDE.md calls out "Avoid as Type assertions". The guard at line 110 (SUPPORTED_BASELINE_COMMANDS.has(commandType)) has already validated this, but TypeScript doesn't narrow a ReadonlySet<string>.has() check. A simple helper narrows without a cast:

function isSupportedCommandType(cmd: string): cmd is SupportedCommandType {
  return SUPPORTED_BASELINE_COMMANDS.has(cmd)
}

Alternatively, change SUPPORTED_BASELINE_COMMANDS from ReadonlySet<string> to ReadonlySet<SupportedCommandType> — the narrowing still requires a predicate but at least the source of truth is typed.


nitpick runSingleScenario silently maps chat to module.curateharness-baseline-runner.ts:214

const fn = scenario.commandType === 'query' ? module.query : module.curate

If chat is ever added to SUPPORTED_BASELINE_COMMANDS, scenarios with commandType === 'chat' fall through to module.curate, which is silently wrong. A lookup table ({curate: module.curate, query: module.query}) is explicit and extensible.


nitpick Test arm-discrimination is tightly coupled to implementation order — harness-baseline-runner.test.ts:107–113

let call = 0
return () => {
  call++
  const isRawArm = call % 2 === 1 // 1st, 3rd, 5th... are raw

This relies on runSingleScenario being called in raw→harness order per scenario. If the implementation ever reorders the arms (e.g. to run harness first and subtract), the test would swap arm assignments and silently invert the assertions. The factory could instead accept a named arm identifier injected via the ctx.env or a separate parameter, making the distinction explicit. Low severity since the test is checking internal behavior, but worth noting for future maintainers.


What's good

  • HarnessBaselineRunnerError with typed code — consistent with the rest of the harness error taxonomy and makes the follow-up oclif wrapper easy to map to exit codes.
  • Serial loop with explicit eslint-disable — the decision is documented and the tradeoff is valid (deterministic log ordering beats trivial latency savings for a user-facing CLI).
  • ${currentVersion.id}:raw id collision avoidance — good defensive choice.
  • BASELINE_MIN_SCENARIOS = 3 constant surfaced and used in test assertions rather than magic numbers.
  • Test 9 (thrown harness arm) correctly captures harnessStderr without affecting the raw arm — this asymmetry matters for the delta calculation.
  • dryRun delegation documented at the file header and factory JSDoc — matches the evaluator pattern cleanly.

Comment thread src/agent/infra/harness/harness-baseline-runner.ts Outdated
Comment thread src/agent/infra/harness/harness-baseline-runner.ts Outdated
Comment thread src/agent/infra/harness/harness-baseline-runner.ts Outdated
Comment thread src/agent/infra/harness/harness-baseline-runner.ts Outdated
Comment thread test/unit/agent/harness/harness-baseline-runner.test.ts
- Fix INSUFFICIENT_SCENARIOS to guard on store coverage, not the sliced
  window (caller passing a too-low --count now surfaces as a valid run,
  not a misleading "run curate N more times" error). Add test 4b.
- Hoist HarnessModule to top-level import type block; drop inline
  dynamic type import.
- Replace `as SupportedCommandType` cast with an isSupportedBaselineCommand
  type predicate so TypeScript narrows at the guard.
- Swap ternary fn-lookup for an explicit fnMap — prevents silent
  fall-through if chat is ever added to the supported set.
- Document the raw-first call-order coupling in makeTwoArmToolsFactory.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

Claude finished @danhdoan's task in 1m 50s —— View job


PR Review — HarnessBaselineRunner (dual-arm replay)

  • Gather context and understand the request
  • Read updated src/agent/infra/harness/harness-baseline-runner.ts
  • Read updated test/unit/agent/harness/harness-baseline-runner.test.ts
  • Check related harness files for context
  • Verify previous review feedback was addressed
  • Post updated review

Overall

Score: 5 / 5

All five issues from the previous review have been addressed cleanly. The one real logic bug (slice-before-check) is fixed, and every nitpick was resolved with minimal, targeted changes. The new test 4b correctly pins the corrected semantics. This is ready to merge.


Previous feedback — all addressed

suggestion Slice-before-check bug — fixed

allScenarios.length is now checked before the slice (lines 126–132), and the slice moved below. The comment on line 124–125 documents the intent precisely. New test 4b (line 219–239) locks in the corrected behaviour: 10 stored + count=2 no longer throws.

nitpick Inline dynamic import type — fixed

HarnessModule is now in the top-level import type block at line 24. runSingleScenario signature at line 202 uses it directly.

nitpick as SupportedCommandType cast — fixed

isSupportedBaselineCommand type predicate (lines 84–86) narrows commandType before getTemplate is called at line 145. No cast required downstream.

nitpick chat fallthrough in runSingleScenariofixed

Lookup table at lines 217–224 is explicit and fails closed: unknown command types return {success: false, stderr: ...} rather than silently routing to module.curate.

nitpick Test arm-discrimination coupling — acknowledged

IMPLEMENTATION COUPLING: block (lines 93–103) in the test file names the dependency on arm call order and explains why a simpler approach is appropriate here. The note also flags what would need updating if the order is ever reversed.


Remaining minor observation

nitpick Two as casts remain in runSingleScenarioharness-baseline-runner.ts:210–211

commandType: scenario.commandType as 'chat' | 'curate' | 'query',
projectType: scenario.projectType as ProjectType,

ValidatedEvaluationScenario.commandType is typed as string in the Zod-derived type (the schema uses z.string().min(1)), while HarnessContext.env.commandType requires the literal union. The cast is technically necessary given the current schema typing, and these were present in the original commit, so not a blocker. A longer-term fix would be to narrow ValidatedEvaluationScenario.commandType to the same literal union at the schema level — but that's a cross-file change and out of scope for this PR. Fix later →


What's good (unchanged)

  • isSupportedBaselineCommand predicate is a clean, reusable pattern — if a second call site ever needs the guard it's already extracted.
  • The lookup table in runSingleScenario leaves an obvious extension point: adding chat is a one-liner.
  • Test 4b asserts caught === undefined (not just absence of throw), which is a strong assertion style.
  • Serial loop + eslint-disable comments remain correct and well-documented.

@danhdoan danhdoan merged commit 6b00edd into proj/autoharness-v2 Apr 22, 2026
6 checks passed
@danhdoan danhdoan deleted the feat/ENG-2325 branch April 22, 2026 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants