feat(harness): rewrite-quality A/B (single vs ouroboros multi-pass) (5.4.0) #500
Merged
Adds scripts/rewrite-ab.mjs (npm run quality:rewrite-ab): for each live-quality fixture it produces a rewrite per config, model-grades both (before/after AI, MPS, fidelity via the existing scoreText/scoreMPS/scoreFidelity), measures word-level edit churn, and picks a per-fixture winner (lowest after-AI among configs meeting the MPS/fidelity floors, ties broken on churn) + per-config aggregates and head-to-head wins. Default comparison is single (one-shot) vs ouroboros (the existing CLI multi-pass), so "does a multi-pass/multi-agent pipeline rewrite better?" is answerable with data instead of intuition — no redundant new --strict CLI mode. LLM-backed/opt-in like quality:live; the comparison/aggregation core is unit-tested with injected producers. Documented in HARNESS.md + quality README.
Minor bump for the opt-in rewrite-quality A/B harness. No CLI/schema/pattern/detection-behavior change; all four languages byte-identical. Syncs version surfaces + CHANGELOG.
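The churn measure and winner rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/rewrite-ab.mjs: the function names echo the unit tests listed in the PR, but the signatures, the result fields (afterAI, mps, fidelity, churn), the floor parameters, and the choice of normalized word-level Levenshtein distance for churn are all assumptions.

```javascript
// Hypothetical sketch; signatures and field names are assumptions,
// not the actual implementation in scripts/rewrite-ab.mjs.

// Word-level edit churn: Levenshtein distance over whitespace-split tokens,
// normalized by the longer token count (0 = identical, 1 = fully rewritten).
function editChurn(before, after) {
  const a = before.trim().split(/\s+/).filter(Boolean);
  const b = after.trim().split(/\s+/).filter(Boolean);
  if (a.length === 0 && b.length === 0) return 0;
  // prev[j] holds the distance between the first i-1 words of a and first j words of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length] / Math.max(a.length, b.length);
}

// Lowest after-AI score among configs that meet both floors wins;
// ties on after-AI go to the lower churn. Returns null if no config qualifies.
function pickWinner(results, { mpsFloor, fidelityFloor }) {
  const eligible = results.filter(
    (r) => r.mps >= mpsFloor && r.fidelity >= fidelityFloor
  );
  if (eligible.length === 0) return null;
  eligible.sort((x, y) => x.afterAI - y.afterAI || x.churn - y.churn);
  return eligible[0].config;
}
```

Filtering on the floors before ranking means a config that drives the after-AI score down by damaging meaning or fidelity can never win a fixture, which is presumably the point of having floors at all.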
Summary
Builds the robust measurement harness to answer pipeline questions with data (per the plan: build the harness first, then measure). Bumps to 5.4.0 (minor — opt-in contributor tool; no CLI/schema/detection-behavior change).
scripts/rewrite-ab.mjs (npm run quality:rewrite-ab) compares two rewrite configurations on the same live-quality fixtures:
- produces a rewrite per config,
- model-grades both (before/after AI, MPS, fidelity via the existing scoreText/scoreMPS/scoreFidelity),
- measures word-level edit churn,
- picks a per-fixture winner (lowest after-AI among configs meeting the MPS/fidelity floors, ties broken on churn),
- reports per-config aggregates and head-to-head wins.
Why this shape
The earlier review flagged the multi-agent / --strict stack as over-engineered because it was never measured. The CLI multi-pass already exists as --ouroboros (detect → rewrite → score → rollback with MPS/fidelity floors), so the default A/B is single vs ouroboros, with no redundant new --strict CLI mode. This makes "does a multi-pass pipeline rewrite better?" measurable, so the multi-agent surface can be kept or cut with evidence in the next (measurement) phase. --strict itself lives only in SKILL.md (agent skill) and is not CLI-reachable; --ouroboros is its CLI-measurable proxy.
Verification
- npm test: 797 pass / 0 fail (6 new rewrite-ab unit tests: editChurn, pickWinner, compare/aggregate, error handling; all with injected producers, no live model)
- npm run lint: syntax OK (161 files), cspell 0 issues
- npm run release:check: OK for 5.4.0
- npm run check:no-private-assets: OK
LLM-backed and opt-in (--live / PATINA_LIVE), like quality:live; not in mandatory CI. Documented in docs/HARNESS.md and tests/quality/README.md.
Next phase (separate): run the A/B with a backend to measure single vs ouroboros, then decide keep/cut the multi-agent surface from data.
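To illustrate how a comparison core can be unit-tested with injected producers and no live model, here is a small sketch. Everything in it is hypothetical: compareConfigs, the stub producers, the toy grader, and the result shape are illustrative names chosen for this example and are not taken from the harness itself.

```javascript
// Hypothetical sketch; function and field names are assumptions, not the harness API.
// `producers` maps a config name to a rewrite function, `grade` scores a rewrite,
// and `pick` chooses a winner. All three are injected, so tests can pass stubs.
async function compareConfigs(fixtures, producers, grade, pick) {
  const names = Object.keys(producers);
  const wins = Object.fromEntries(names.map((name) => [name, 0]));
  const errors = [];

  for (const fixture of fixtures) {
    const scored = [];
    for (const name of names) {
      try {
        const rewrite = await producers[name](fixture.text);
        const scores = await grade(fixture.text, rewrite);
        scored.push({ config: name, ...scores });
      } catch (err) {
        // A failing producer or grader is recorded, not fatal to the run.
        errors.push({ fixture: fixture.id, config: name, message: String(err) });
      }
    }
    const winner = pick(scored);
    if (winner !== null) wins[winner] += 1;
  }
  return { wins, errors };
}

// Stubs standing in for the LLM-backed producers and graders.
const stubProducers = {
  single: async (text) => text.replace("very ", ""),
  ouroboros: async (text) => {
    if (text.includes("boom")) throw new Error("backend unavailable");
    return text.replace("very ", "").replace("really ", "");
  },
};
// Toy grader: a shorter rewrite stands in for a lower after-AI score.
const stubGrade = async (_before, after) => ({ afterAI: after.length });
// Toy winner rule: lowest after-AI, or null when nothing was scored.
const lowestAfterAI = (scored) =>
  scored.length === 0
    ? null
    : scored.reduce((best, r) => (r.afterAI < best.afterAI ? r : best)).config;
```

Because the producers are plain functions passed in as arguments, a test can simulate a backend failure (as the ouroboros stub does on "boom") and check that the run records the error and still tallies the remaining configs.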