feat(harness): rewrite-quality A/B (single vs ouroboros multi-pass) (5.4.0) #500

Merged
devswha merged 2 commits into main from bot/rewrite-quality-harness on Jun 15, 2026

devswha (Owner) commented Jun 15, 2026

Summary

Builds the robust measurement harness to answer pipeline questions with data (per the plan: build the harness first, then measure). Bumps to 5.4.0 (minor — opt-in contributor tool; no CLI/schema/detection-behavior change).

scripts/rewrite-ab.mjs (npm run quality:rewrite-ab) compares two rewrite configurations on the same live-quality fixtures:

  • produces a rewrite per config, model-grades both (before/after AI score, MPS, fidelity via the existing scoreText/scoreMPS/scoreFidelity),
  • measures word-level edit churn,
  • picks a per-fixture winner (lowest after-AI-score among configs meeting the MPS/fidelity floors; ties broken on churn),
  • reports per-config aggregates + head-to-head wins.
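The churn measurement and winner rule listed above can be sketched as follows. This is a minimal illustration, not the code in scripts/rewrite-ab.mjs: the result shape (`afterScore`, `mps`, `fidelity`, `churn`), the floor values, the normalization of churn, and the choice that lower churn wins a tie are all assumptions made for the example.

```javascript
// Hypothetical sketch; names and data shapes are assumptions, not the
// actual scripts/rewrite-ab.mjs implementation.

// Word-level edit churn: Levenshtein distance over whitespace-separated
// tokens, normalized by the longer text's word count (0 = identical).
function editChurn(before, after) {
  const a = before.split(/\s+/).filter(Boolean);
  const b = after.split(/\s+/).filter(Boolean);
  if (a.length === 0 && b.length === 0) return 0;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length] / Math.max(a.length, b.length);
}

// Per-fixture winner: lowest after-AI score among results that meet both
// floors; ties go to the result with less churn (assumed direction).
// Returns null when no config meets the floors.
function pickWinner(results, floors) {
  const eligible = results.filter(
    (r) => r.mps >= floors.mps && r.fidelity >= floors.fidelity
  );
  if (eligible.length === 0) return null;
  eligible.sort((x, y) => x.afterScore - y.afterScore || x.churn - y.churn);
  return eligible[0].config;
}
```

Filtering on the floors before sorting means a config with a very low after-AI score still cannot win if it damaged meaning or fidelity, which is the point of having floors at all.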

Why this shape

The earlier review flagged the multi-agent / --strict stack as over-engineered because it was never measured. The CLI multi-pass already exists as --ouroboros (detect → rewrite → score → rollback with MPS/fidelity floors), so the default A/B is single vs ouroboros — no redundant new --strict CLI mode. This makes "does a multi-pass pipeline rewrite better?" measurable, so the multi-agent surface can be kept or cut with evidence in the next (measurement) phase.

--strict itself lives only in SKILL.md (agent skill) and is not CLI-reachable; --ouroboros is its CLI-measurable proxy.
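For readers unfamiliar with the multi-pass shape, the detect → rewrite → score → rollback loop mentioned above might look roughly like this. It is a sketch under stated assumptions: `detect`, `rewrite`, and `score` are injected stand-ins (not patina's API), the loop is synchronous for brevity where a real model-backed version would be async, and "rollback" is modeled simply as discarding a pass that falls below either floor.

```javascript
// Hypothetical sketch of a detect -> rewrite -> score -> rollback loop.
// detect, rewrite, score, floors, and maxPasses are illustrative
// parameters, not the actual --ouroboros implementation.
function multiPassRewrite(text, { detect, rewrite, score, floors, maxPasses = 3 }) {
  let current = text;
  for (let pass = 0; pass < maxPasses; pass++) {
    const findings = detect(current);
    if (findings.length === 0) break; // nothing left to fix

    const candidate = rewrite(current, findings);
    // Assumption: quality is scored against the original input text.
    const { mps, fidelity } = score(text, candidate);

    // Rollback: keep the last accepted text if this pass breaks a floor.
    if (mps < floors.mps || fidelity < floors.fidelity) break;
    current = candidate;
  }
  return current;
}
```

A single-pass configuration is the same loop with `maxPasses = 1` and no floor check, which is what makes the two configurations comparable on the same fixtures.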

Verification

  • npm test: 797 pass / 0 fail (6 new rewrite-ab unit tests covering editChurn, pickWinner, compare/aggregate, and error handling; all use injected producers, no live model)
  • npm run lint — syntax OK (161 files), cspell 0 issues
  • npm run release:check — OK for 5.4.0
  • npm run check:no-private-assets — OK
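The "injected producers" idea behind the unit tests can be illustrated with a small sketch: the comparison core takes the rewrite and grading functions as arguments, so tests can pass deterministic fakes instead of calling a model. The function names, row shape, and error-handling policy below are assumptions for illustration, not the harness's actual code.

```javascript
// Hypothetical sketch; producers and grade are injected stand-ins for the
// live rewrite and model-grading calls.
function compareFixture(fixture, producers, grade) {
  return Object.entries(producers).map(([config, produce]) => {
    try {
      const rewritten = produce(fixture.text);
      return { config, ok: true, ...grade(fixture.text, rewritten) };
    } catch (err) {
      // Assumed policy: a failing producer is recorded, not fatal to the run.
      return { config, ok: false, error: String(err.message || err) };
    }
  });
}

// Per-config aggregate: count and mean after-score over successful rows.
function aggregate(rowsPerFixture) {
  const totals = {};
  for (const row of rowsPerFixture.flat()) {
    if (!row.ok) continue;
    if (!totals[row.config]) totals[row.config] = { n: 0, sum: 0 };
    totals[row.config].n += 1;
    totals[row.config].sum += row.afterScore;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([config, t]) => [
      config,
      { n: t.n, meanAfterScore: t.sum / t.n },
    ])
  );
}
```

In a test, `producers` could be plain string functions and `grade` a fixed scorer, so the comparison and aggregation logic is exercised without any network or model dependency.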

LLM-backed and opt-in (--live / PATINA_LIVE), like quality:live; not in mandatory CI. Documented in docs/HARNESS.md and tests/quality/README.md.

Next phase (separate): run the A/B against a live backend to measure single vs ouroboros, then decide from the data whether to keep or cut the multi-agent surface.

devswha added 2 commits June 15, 2026 20:05
Adds scripts/rewrite-ab.mjs (npm run quality:rewrite-ab): for each live-quality
fixture it produces a rewrite per config, model-grades both (before/after AI,
MPS, fidelity via the existing scoreText/scoreMPS/scoreFidelity), measures
word-level edit churn, and picks a per-fixture winner (lowest after-AI among
configs meeting the MPS/fidelity floors, ties broken on churn) + per-config
aggregates and head-to-head wins.

Default comparison is single (one-shot) vs ouroboros (the existing CLI
multi-pass), so "does a multi-pass/multi-agent pipeline rewrite better?" is
answerable with data instead of intuition — no redundant new --strict CLI mode.
LLM-backed/opt-in like quality:live; the comparison/aggregation core is
unit-tested with injected producers. Documented in HARNESS.md + quality README.

Minor bump for the opt-in rewrite-quality A/B harness. No CLI/schema/pattern/
detection-behavior change; all four languages byte-identical. Syncs version
surfaces + CHANGELOG.

vercel bot (Contributor) commented Jun 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| patina | Ready | Preview, Comment | Jun 15, 2026 11:06am |

devswha merged commit b75dbf0 into main on Jun 15, 2026 (8 checks passed)
devswha deleted the bot/rewrite-quality-harness branch on June 15, 2026 11:13