
feat: [ENG-2332] AutoHarness V2 KPI harness — scaffolding + fixture + stub-LLM runner#523

Merged
RyanNg1403 merged 3 commits into proj/autoharness-v2 from feat/ENG-2332
Apr 23, 2026
Conversation


@danhdoan danhdoan commented Apr 22, 2026

⚠️ Awaiting Danh's review — do NOT merge now.
Opened for discussion; merge decision is held pending manual review of the fixture realism and the script-shape / layout.

Summary

  • Problem: AutoHarness V2 is a 4000+ LOC feature built on a specific thesis — "a learned pass-through + refinement layer measurably helps weak LLMs succeed at curate tasks." Without a KPI gate, we'd ship that thesis unverified.
  • Why it matters: Tier 4 S1 brutal-review item + the v1.0 ship-gate. Produces the release-notes headline number (Llama 3.1 8B success-rate delta, harness vs raw). Future ships get regression detection on the same gate.
  • What changed: Ships the reference KPI tooling under scripts/autoharness-kpi/. Scaffolding + 20-task fixture + stub-LLM runner are fully operational; --llm real wiring lands in a follow-up.
  • What did NOT change (scope boundary): No source code touched. No real LLM calls (stub path only). brv harness baseline (Task 7.5, per-user KPI) is a distinct command — not affected.

Type of change

  • New feature (release-engineering tooling)

Scope (select all touched areas)

  • CI/CD / Infra

Linked issues

  • Closes ENG-2332
  • Related: ENG-2xxx Task 7.5 brv harness baseline (per-user KPI — same math, different consumer)
  • Ship gate: Public v1.0 blocks on a real-LLM run of this script meeting the 30pp delta target

Root cause (bug fixes only, otherwise write N/A)

N/A

Layout decision

scripts/autoharness-kpi/
  runner.ts                # ~390 LOC, pure functions + CLI entry
  fixture-tasks.json       # 20 hand-crafted curate tasks
  expected-baseline.json   # placeholder; populated at ship time
test/unit/scripts/autoharness-kpi/
  runner.test.ts           # 19 unit tests
package.json
  "kpi:harness": "tsx scripts/autoharness-kpi/runner.ts"

Chose the scripts/<feature>/ namespace over a top-level kpi/ directory: KPI tooling is a subsystem concern (AutoHarness V2 only), not a first-class project concern, and top-level promotion would over-scope it. Precedent from large repos supports this — Kubernetes uses hack/, Rust uses src/tools/, CockroachDB uses pkg/cmd/, and VSCode co-locates eval tooling with each extension; none promote single-subsystem eval to repo root.

Test plan

  • Coverage added:
    • Unit test
  • Test file: test/unit/scripts/autoharness-kpi/runner.test.ts (19 tests)
  • Key scenarios:
    • loadFixture: shipped fixture parses cleanly; malformed / empty rejected
    • makeStubLlmClient: deterministic per-(task, arm); unknown-id → both-arms-succeed fallback
    • runArm: aggregates runs-per-task; overall = flat mean
    • computeKpiReport: delta math = harness − raw; metadata flows through
    • exitCodeForReport: 0 at delta = 0.30 (inclusive floor), 1 at 0.29, 1 on negative delta
    • parseArgs: defaults + all flags + validates --llm / --runs
    • main end-to-end with stub: shipped fixture runs → delta = 0.50 → exit 0; --output writes valid JSON; --llm real throws clear not-implemented error
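The scenarios above pin down the gate math: overall = flat mean of per-task rates, delta = harness − raw, and an inclusive exit-code floor at 0.30. A minimal TypeScript sketch of that math, using illustrative names (`ArmResult`, `overallRate`, `computeDelta`, `exitCodeForDelta`) rather than the PR's exact exports:

```typescript
// Illustrative sketch of the KPI gate math; names are assumptions,
// not the actual runner.ts API.
const SHIP_GATE_DELTA = 0.30; // gate value stated in the PR description

interface ArmResult {
  taskId: string;
  successRate: number; // 0..1, averaged over runs for this task
}

// overall = flat mean of per-task success rates
function overallRate(results: ArmResult[]): number {
  if (results.length === 0) throw new Error("no task results");
  const sum = results.reduce((acc, r) => acc + r.successRate, 0);
  return sum / results.length;
}

// delta = harness − raw
function computeDelta(harness: ArmResult[], raw: ArmResult[]): number {
  return overallRate(harness) - overallRate(raw);
}

// 0 when the gate is met (inclusive at exactly 0.30), 1 otherwise
function exitCodeForDelta(delta: number): number {
  return delta >= SHIP_GATE_DELTA ? 0 : 1;
}
```

This matches the tested boundary behavior: exit 0 at delta = 0.30, exit 1 at 0.29 and on any negative delta.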

User-visible changes

None. Release-engineering tool only.

Evidence

  • Failing test/log before + passing after
$ npx mocha --forbid-only 'test/unit/scripts/autoharness-kpi/runner.test.ts'
  19 passing (9ms)

$ npm run kpi:harness -- --runs 1
task                         raw     harness
──────────────────────────────────────────────────
t01-list-exports                0%       100%
...
t20-phase-ship-gates          100%       100%
──────────────────────────────────────────────────
overall:                      50.0%    100.0%
delta:                       +50.0pp
✓ ship gate met (>= 30pp)

Checklist

  • Tests added or updated and passing
  • Lint passes
  • Type check passes
  • Build succeeds
  • Commits follow Conventional Commits format
  • Documentation updated — N/A (internal tool; docstrings in-file)
  • No breaking changes
  • Branch is up to date with `proj/autoharness-v2`

Risks and mitigations

  • Risk: Scaffolding-only — --llm real is NOT implemented in this PR. The real-LLM path (~50 LOC wiring the agent LLM stack with Llama 3.1 8B) lands in a follow-up. Without it, the v1.0 ship gate can't run.

    • Mitigation: The stub path validates the script's math + CLI + exit-code logic. The real-LLM wiring is scoped as a follow-up because it's cleaner to land after Phase 7 command surface stabilizes. Clear error message on --llm real so no one thinks they can run the ship gate yet.
  • Risk: Fixture realism. The 20 curate tasks are hand-crafted against this codebase. Too-easy tasks (any model succeeds) or too-hard tasks (no model succeeds) dilute the delta signal.

    • Mitigation: Tasks deliberately target concrete symbols/rules in the AutoHarness V2 code (HarnessConfigSchema fields, IHarnessStore methods, mode thresholds, Mode C cap values, etc.). Reviewer should eyeball the fixture for realism and flag tasks that seem off.
  • Risk: 30pp threshold lock-in. If the first real-LLM run produces delta < 30pp, we either rework the refinement prompts or lower the bar. The bar is defined in v1-design-decisions.md §2.7; moving it needs a design-review.

    • Mitigation: SHIP_GATE_DELTA is a named const at the top of runner.ts, one-line change if the bar moves. Test pins the current value so drift shows up in diff.
  • Risk: expected-baseline.json is a placeholder. Until the real-LLM run lands, the file has null fields.

    • Mitigation: The placeholder is explicit (measuredAt: null etc.) + the $comment field explains when and how to populate. Regressions on future ships compare against this file.
  • Risk: Holding merge for manual review. PR explicitly flags this at the top — no auto-merge expected.

    • Mitigation: Clear reviewer gate at the top of the description. As far as I'm aware, no labels or automation in this repo auto-merge PRs.
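For context, a placeholder baseline of the kind described above might look like the following. This is an illustrative sketch only — the field names beyond measuredAt and $comment are assumptions, not the actual contents of expected-baseline.json:

```json
{
  "$comment": "Placeholder. Populate from the first real-LLM run at v1.0 ship time; future ships compare against these values.",
  "measuredAt": null,
  "rawOverall": null,
  "harnessOverall": null,
  "deltaPp": null
}
```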

… stub-LLM runner

Phase 8 Task 8.4 — Tier 4 S1 brutal-review item. Ships the
release-engineering measurement tool that produces the v1.0 ship
gate's headline number: the Llama 3.1 8B success-rate delta on a
standardized curate task set, raw `tools.*` vs the current harness.

v1.0 ship gate: delta >= 30 percentage points on the reference
fixture or the release is blocked.

## Layout

All release-engineering tooling for this feature lives together
under `scripts/autoharness-kpi/`:

  scripts/autoharness-kpi/
    runner.ts              — main script (parseArgs + runArm +
                             computeKpiReport + exitCodeForReport
                             + main — all pure, unit-testable)
    fixture-tasks.json     — 20 curate tasks hand-crafted against
                             this codebase (stable across releases)
    expected-baseline.json — placeholder; populated at v1.0 ship
                             with the real-LLM measurement
  test/unit/scripts/autoharness-kpi/
    runner.test.ts         — 19 unit tests
  package.json
    "kpi:harness" — runs via `tsx scripts/autoharness-kpi/runner.ts`

Chose the `scripts/<feature>/` namespace over a top-level `kpi/`
directory: KPI tooling is a subsystem concern (AutoHarness V2 only),
not a first-class project concern. Big-tech repos (Kubernetes /
Rust / CockroachDB / VSCode) consistently keep per-feature
release tooling under an umbrella like `scripts/`, `hack/`, or
`tools/` — never at repo root.

## Scope — scaffolding + stub LLM only

`--llm stub` is fully operational with deterministic canned arm
rates (delta = 50pp, well above the 30pp gate) for script-logic
validation. `--llm real` currently throws a clear "not yet
implemented" error — wiring the real Llama 3.1 8B path into the
agent LLM service lands in a follow-up once Phase 7 finalizes the
command surface.

The fixture + math + CLI + exit-code logic all ship now (zero
conflict risk with Phat's in-flight Phase 6 work) so the
follow-up real-LLM PR is ~50 LOC of wiring.
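The deterministic stub path described above can be sketched as follows. The table contents and function names here are illustrative assumptions — the real makeStubLlmClient in runner.ts may be shaped differently — but the behavior (fixed per-(task, arm) outcomes, unknown-id fallback to both-arms-succeed) matches what the tests pin down:

```typescript
// Illustrative sketch of a deterministic stub LLM client.
// STUB_RATES entries here are hypothetical examples.
type Arm = "raw" | "harness";

// Canned per-task outcome for each arm; unknown IDs fall back to
// both-arms-succeed so new fixture tasks don't break the stub path.
const STUB_RATES: Record<string, { raw: boolean; harness: boolean }> = {
  "t01-list-exports": { raw: false, harness: true },
  "t20-phase-ship-gates": { raw: true, harness: true },
};

function stubSucceeds(taskId: string, arm: Arm): boolean {
  const entry = STUB_RATES[taskId];
  if (entry === undefined) return true; // unknown-id fallback
  return entry[arm];
}
```

Splitting the fixture so the harness arm succeeds everywhere and the raw arm succeeds on half the tasks yields the canned 50pp delta used for CI validation.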

## Fixture design

20 curate tasks covering concrete byterover-cli subjects:
exported classes under harness/, HarnessConfigSchema fields,
IHarnessStore methods, Zod schema enumeration, project-type
detection rules, mode thresholds, Mode C cap values, feedback
weights, etc. Each task has {id, taskDescription,
expectedBehavior}. Hand-crafted so a reviewer can eyeball
realism. Aims at the "weak model needs hand-holding" band —
tasks too easy (any model succeeds) or too hard (no model
succeeds) wouldn't discriminate between arms.
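The task shape named above ({id, taskDescription, expectedBehavior}) can be written down as a TypeScript type. The example task below is hypothetical — it shows the intended "weak model needs hand-holding" register, not an actual fixture entry:

```typescript
// Shape of one fixture task, per the fields listed in the PR description.
interface FixtureTask {
  id: string;
  taskDescription: string;
  expectedBehavior: string;
}

// Hypothetical example in the style of the real fixture.
const example: FixtureTask = {
  id: "t03-config-schema-fields",
  taskDescription: "List the fields of HarnessConfigSchema and their types.",
  expectedBehavior: "Names every schema field accurately; invents none.",
};
```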

## Tests (19)

- loadFixture: shipped fixture parses; malformed/empty rejected
- makeStubLlmClient: deterministic per-(task, arm); unknown-id
  fallback to both-arms-succeed
- runArm: aggregates correctly; overall = flat mean across runs
- computeKpiReport: delta math; metadata carried through
- exitCodeForReport: 0 at exactly 0.30 (inclusive floor), 1 at
  0.29, 1 on negative delta
- parseArgs: defaults; flag parsing; validates --llm + --runs
- main (end-to-end with stub): shipped fixture runs; --output
  writes valid JSON; --llm real surfaces the clear
  "not yet implemented" error

## Not merging yet

Per team process, this PR opens for review; Danh decides separately
whether to merge to proj/autoharness-v2 once review feedback is in.

github-actions Bot commented Apr 22, 2026

Claude finished @danhdoan's task in 2m 32s


PR Review — AutoHarness V2 KPI Harness

  • Gather context and understand request
  • Read all changed files
  • Review runner.ts for correctness, security, and code quality
  • Review fixture-tasks.json for fixture realism
  • Review runner.test.ts for test quality and coverage
  • Review expected-baseline.json and package.json
  • Post inline comments on specific issues
  • Post overall summary with score

Overall Assessment: 3.5 / 5

The scaffolding is well-designed. Pure functions, clean exports for testability, correct delta math, proper import.meta.url guard for direct invocation, named SHIP_GATE_DELTA constant, and a clear error on the unimplemented --llm real path. The 19 tests cover the main paths. For internal release tooling this is solid groundwork.

Seven issues were flagged inline; two are blocking against project standards, the rest are suggestions.


Issues by severity

Blocking (project-standards violations)

  1. Multiple as Type casts in loadFixture (runner.ts:149) — CLAUDE.md prohibits as Type assertions and requires type guards. commandType is the most dangerous: a bad value in the fixture file ("commandType": "invalid") would silently flow through as an illegal union member. fixtureVersion and targetModel also skip the typeof x === 'string' guard.

  2. Missing $schema file (fixture-tasks.json:4) — the fixture references ./fixture-tasks.schema.json but no such file exists. Tooling (VS Code, ajv-cli) silently ignores a missing schema, so the reference buys nothing and adds confusion. Either ship the schema or remove the $schema key.

Suggestions

  1. No try/catch around JSON.parse (runner.ts:126) — a malformed fixture file will produce a raw SyntaxError with a confusing stack trace instead of fixture file is not valid JSON: <path>.

  2. Silent ?? 0 in computeKpiReport (runner.ts:224) — if the two arms ever diverge on task IDs, the delta silently uses 0% as the raw baseline instead of throwing. A task-ID mismatch is a bug and should be loud.

  3. renderReport is exported but untested (runner.test.ts:314) — the user-facing table and the gate line are the most visible output of the script. Format regressions won't be caught.

  4. parseArgs missing edge-case tests (runner.test.ts:266) — --runs -1, --fixture / --output with no argument, etc.

  5. Trailing whitespace on line 368 (runner.ts:368) — minor; lint should catch it.


Fixture realism note

The 20 tasks are anchored to concrete AutoHarness V2 symbols (schema fields, interface methods, mode thresholds, template names, etc.) which is the right approach. Two callouts:

  • t08-oclif-commands has an intentionally weak expected behavior ("empty result is acceptable") — this is fine given phase staging, but it means t08 will contribute noise-free signal only after Phase 7 lands. Make sure the comment in STUB_RATES notes this rather than implying t08 is deliberately easy.
  • The "both arms succeed" split (t11–t20) is synthetic; the real-LLM run may reveal a very different distribution. The current stub design is correct for CI validation, but the 0.50 stub delta should not be confused with the expected real-world delta.

What looks good

  • Math is correct — flat mean, delta = harness − raw, exit-code boundary inclusive at 0.30 ✓
  • makeStubLlmClient is deterministic and well-commented ✓
  • runArm uses no-await-in-loop disable with the right justification ✓
  • main properly separates loadFixture / runArm / computeKpiReport / renderReport
  • All pure functions are exported and unit-tested independently ✓
  • expected-baseline.json placeholder is explicit with $comment
  • invokedDirectly guard prevents side effects on import in tests ✓

danhdoan added a commit that referenced this pull request Apr 22, 2026
Phase 7 Task 7.5 — Tier 1 Q1 brutal-review item. Ships the reusable
`HarnessBaselineRunner` that powers the future `brv harness baseline`
CLI command: replays the last N stored scenarios against two arms
(pass-through template = raw, current version = harness), reports
per-scenario outcomes + overall rates + delta.

Complementary to the reference KPI harness
(`scripts/autoharness-kpi/`, PR #523):
  - KPI harness: fixed task set + fixed model → release-notes
    headline number.
  - Baseline runner: user's own scenarios + current harness →
    per-user "is it working for me?" signal.

The oclif command file (`src/oclif/commands/harness/baseline.ts`)
is deliberately NOT shipped in this PR. The command needs the
daemon-transport wiring for harness subcommands that Task 7.1
(status/inspect) establishes — none of Phase 7's CLI surface is
merged yet, so there's no transport pattern to follow. Shipping
just the runner class now (pure, testable, DI-ready) lets the
oclif wrapper land in a small follow-up after 7.1 defines the
pattern.

This matches the "reusable logic now, wiring later" pattern used
by the KPI harness scaffolding PR (#523).

- Pure orchestration; caller injects the `HarnessToolsFactory`
  (production wires this to `SandboxService.buildHarnessTools({dryRun: true})`
  for write-blocked eval runs, matching Phase 6.1).
- Raw arm fabricates a synthetic version from the stored template
  (`getTemplate(commandType, currentVersion.projectType).code`) —
  no need to persist it; the module builder accepts any code
  string.
- `v1.0 scope narrowing`: only `commandType === 'curate'` is
  supported (query / chat throw `UNSUPPORTED_COMMAND_TYPE`) until
  query templates ship in the Phase 4.3 follow-up.
- `count` bounded [1, 50]; scenarios < 3 fails with a clear
  "run curate N more times first" message so users know how to
  unblock themselves.
- Each scenario runs once per arm (single-run side-by-side
  display). Contrast Phase 6.1 Evaluator which does 10 runs per
  scenario for statistical significance — different semantics,
  different surface, not shared on purpose.
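The input guards in the bullets above (count bounded [1, 50]; fewer than 3 stored scenarios fails with an actionable message) can be sketched like this. The error names mirror the commit message; the function itself is an illustrative assumption, not the actual HarnessBaselineRunner surface:

```typescript
// Illustrative sketch of the baseline runner's input validation.
const MIN_COUNT = 1;
const MAX_COUNT = 50;
const MIN_SCENARIOS = 3;

function validateBaselineInput(count: number, storedScenarios: number): void {
  // count bounded [1, 50]
  if (count < MIN_COUNT || count > MAX_COUNT) {
    throw new Error(
      `COUNT_OUT_OF_RANGE: count must be in [${MIN_COUNT}, ${MAX_COUNT}]`,
    );
  }
  // < 3 scenarios fails with a message telling the user how to unblock
  if (storedScenarios < MIN_SCENARIOS) {
    const needed = MIN_SCENARIOS - storedScenarios;
    throw new Error(
      `INSUFFICIENT_SCENARIOS: run curate ${needed} more time(s) first`,
    );
  }
}
```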

Tests (9): COUNT_OUT_OF_RANGE (0, 51), UNSUPPORTED_COMMAND_TYPE
(query, chat), INSUFFICIENT_SCENARIOS (2 scenarios),
NO_CURRENT_VERSION (empty store), happy-path delta = +100%,
no-op delta = 0, count-caps-slice, harness-throws-captures-
stderr.

All 278 harness unit tests pass. Build + lint + typecheck clean.

github-actions Bot commented Apr 23, 2026

Claude finished @RyanNg1403's task in 2m 6s


PR Review — AutoHarness V2 KPI Harness (re-review after branch sync)

  • Gather context and understand request
  • Read all changed files and check diffs
  • Check if previous blocking issues were resolved
  • Review for new issues
  • Post updated summary with score

Overall Assessment: 3.5 / 5

This re-review was triggered by the merge-sync commit (3ccd213) that brought proj/autoharness-v2 up to date. No source changes were made to the KPI harness files — the five files (runner.ts, fixture-tasks.json, runner.test.ts, expected-baseline.json, package.json) are identical to those reviewed on 2026-04-22.

Both blocking issues from the previous review remain open. The overall assessment and score are unchanged. A brief re-audit confirmed no regressions were introduced and no new issues were found.


Still-Open Blocking Issues

1. as Type casts in loadFixture — CLAUDE.md violation (runner.ts:149–153)

commandType, fixtureVersion, and targetModel are all cast without a runtime check:

commandType: (parsed.commandType as Fixture['commandType']) ?? 'curate',
fixtureVersion: (parsed.fixtureVersion as string) ?? 'unversioned',
targetModel: (parsed.targetModel as string) ?? 'unknown',

commandType is the dangerous one — a fixture file with "commandType": "query" silently flows through even if the enum later gains constraints; and an entirely unknown value is silently cast rather than rejected or defaulted. Replace with runtime checks:

const VALID_COMMAND_TYPES = ['chat', 'curate', 'query'] as const
const rawCt = parsed.commandType
const commandType: Fixture['commandType'] =
  typeof rawCt === 'string' &&
  (VALID_COMMAND_TYPES as readonly string[]).includes(rawCt)
    ? (rawCt as Fixture['commandType'])
    : 'curate'

const rawVersion = parsed.fixtureVersion
const fixtureVersion = typeof rawVersion === 'string' ? rawVersion : 'unversioned'
const rawTarget = parsed.targetModel
const targetModel = typeof rawTarget === 'string' ? rawTarget : 'unknown'



2. Missing $schema file (fixture-tasks.json:3)

"$schema": "./fixture-tasks.schema.json" references a file that does not exist in the repository — ls scripts/autoharness-kpi/ shows only expected-baseline.json, fixture-tasks.json, and runner.ts. Because tooling silently ignores the missing reference, the fixture has no structural protection.

Options (pick one):

  • Ship scripts/autoharness-kpi/fixture-tasks.schema.json describing the Fixture shape
  • Remove the $schema key



Still-Open Suggestions

3. No try/catch around JSON.parse (runner.ts:126)

A malformed fixture file will surface a raw SyntaxError with a confusing stack trace. Wrapping in try/catch gives a clear diagnostic: fixture file is not valid JSON: <path>.

4. Silent ?? 0 in computeKpiReport (runner.ts:224)

If the two arms ever diverge on task IDs the delta silently uses 0% as the raw baseline. A task-ID mismatch is a bug and should throw.

5. renderReport exported but untested (runner.test.ts)

renderReport is not imported in the test file. Format regressions (column misalignment, missing gate line, wrong percentage rounding) would only surface at manual run time.

6. parseArgs edge-case tests missing (runner.test.ts:266)

--runs -1, --fixture / --output with no following argument are not tested.

7. Trailing whitespace at line 368 (runner.ts:368)

Three spaces before console.log(renderReport(report)). Lint should catch this.


What Still Looks Good

  • Math is correct — flat mean, delta = harness − raw, inclusive floor at 0.30 ✓
  • makeStubLlmClient is deterministic ✓
  • runArm sequential no-await-in-loop pattern is correct ✓
  • main separates concerns cleanly ✓
  • All pure functions exported and independently testable ✓
  • invokedDirectly guard prevents side-effects on import ✓
  • Explicit --llm real not-implemented error ✓


github-actions Bot commented Apr 23, 2026

Claude finished @RyanNg1403's task in 2m 2s


PR Review — AutoHarness V2 KPI Harness (re-review after 9c366d3)

  • Gather context and understand request
  • Read all changed files and check current state
  • Check if previous blocking issues were resolved
  • Review for new issues
  • Post updated summary

Overall Assessment: 4.5 / 5

All seven issues from the previous two reviews have been addressed. Both blocking violations are gone, all five suggestions are implemented, and the test suite grew from 19 → 28 tests. The only remaining item is the cosmetic trailing whitespace, which is the only thing keeping this below a 5.


Previous Issues — Status

  #  Severity    Issue                                Status
  1  Blocking    as Type casts in loadFixture         ✅ Fixed
  2  Blocking    Missing $schema file                 ✅ Fixed
  3  Suggestion  No try/catch around JSON.parse       ✅ Fixed
  4  Suggestion  Silent ?? 0 in computeKpiReport      ✅ Fixed
  5  Suggestion  renderReport exported but untested   ✅ Fixed
  6  Suggestion  parseArgs missing edge-case tests    ✅ Fixed
  7  Nitpick     Trailing whitespace at line 368      ⚠️ Still present (line 394)

Detail: What Was Fixed

Issue 1 — as Type casts (runner.ts:134–173): The three unsafe casts have been replaced correctly.

  • commandType now uses a Set<string>-backed isValidCommandType() type guard
  • fixtureVersion and targetModel use typeof === 'string' checks
  • The remaining as Record<string, unknown> casts on lines 126 and 153 are acceptable — they follow explicit typeof !== 'object' || value === null guards, which is the standard TypeScript narrowing pattern, not an unsafe bypass.

Issue 2 — $schema (fixture-tasks.json): The dangling "$schema": "./fixture-tasks.schema.json" reference has been removed. ✓

Issue 3 — JSON.parse error handling (runner.ts:142–147): Now wrapped in try/catch with a clear diagnostic:

fixture file is not valid JSON: <path> — <SyntaxError message>

Issue 4 — task-ID mismatch (runner.ts:242–245): Replaced the silent ?? 0 fallback with an explicit throw:

throw new Error(`task ID mismatch: harness arm has '${h.taskId}' but raw arm does not`)

Test at runner.test.ts:182 verifies this path. ✓

Issues 5 & 6 — renderReport and parseArgs coverage (runner.test.ts:291–348): Two renderReport smoke tests added (gate-met and shortfall cases). parseArgs edge cases added: --runs -1, --runs 0, --fixture with no arg, --output with no arg, --runs with no arg.


Remaining Item

Nitpick — trailing whitespace (runner.ts:394): Line 394 is a blank line containing three spaces before console.log(renderReport(report)). This is unchanged from the previous review. Minor cosmetic; lint should have flagged it.



What Looks Good

  • All blocking CLAUDE.md violations resolved ✓
  • isValidCommandType() type guard with Set<string> is clean and idiomatic ✓
  • Task-ID mismatch now throws with a clear diagnostic ✓
  • renderReport covered: format regression on gate line and shortfall line both caught ✓
  • parseArgs now covers negative, zero, and missing-argument edge cases ✓
  • computeKpiReport test for task-ID mismatch explicitly validates the new throw ✓
  • Test count 19 → 28; all paths exercised ✓
  • as unknown on JSON.parse (line 144) is the correct safe widening from any

@RyanNg1403 RyanNg1403 merged commit 33c2ce3 into proj/autoharness-v2 Apr 23, 2026
6 checks passed
@RyanNg1403 RyanNg1403 deleted the feat/ENG-2332 branch April 23, 2026 10:01