feat(v1.7.0): artifact-grounded evaluation + quantum-leap cycles #71
Merged
The F1-racing dogfood graded A every cycle (Loop 0.995, futile 0%) while producing a dismal game: a Goodhart collapse. The loop measured process (did a PR/commit advance?) and was blind to the artifact (is it good?), and RICE + "one focused feature" could only ever take small steps.

- Artifact axis: quality_multiplier = process_factor x artifact_factor (score_outcome.py); artifact_score is a first-class outcomes.json field.
- Step 5-G Artifact Critique: an independent, evidence-grounded critic scores the real artifact against a human-authored, integrity-checked rubric.
- Honest scorecard: Artifact Quality headline KPI, evidence-weighted goal, artifact-only Goodhart Guard (graduated C/D/F cap + warning). F1 re-grades A (0.995) -> D (0.567).
- Quantum-leap cycles (2-G plateau + 3-K gate): plateau below bar forces an overhaul of the weakest dimension (RICE bypassed, bigger budget, artifact-improvement gate not unit-test). Demonstrated leap lifted the game's visual_fidelity 0.22 -> 0.41.
- Leap safety: pre-PR artifact gate + checkpoint baseline, min_dimension_delta revert, thrashing -> HALT, per-leap cost cap. Hardened by a 5-agent adversarial red-team.
- New scripts/rubric_score.py (pure); tests/verify.py 58 passing.

Diagnosis + design: .claude/ooda-evolution-v1.7.0.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
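The artifact axis in the commit message above multiplies a process factor by an artifact factor. The following is a minimal sketch of that idea, not the code in score_outcome.py: the function names, the clamping to [0, 1], and the example artifact value of 0.5 are all assumptions made for illustration. It shows why a near-perfect process score can no longer carry a weak artifact.

```python
def clamp01(value: float) -> float:
    """Clamp a factor into [0, 1]. The clamping is an assumption of this sketch."""
    return max(0.0, min(1.0, value))


def quality_multiplier(process_factor: float, artifact_factor: float) -> float:
    """Combine both axes multiplicatively, as described in the commit message.

    Because the factors are multiplied, a low artifact factor drags the
    result down no matter how high the process factor is.
    """
    return clamp01(process_factor) * clamp01(artifact_factor)


if __name__ == "__main__":
    # Hypothetical inputs: 0.995 is the process-only loop score quoted in
    # the PR; 0.5 is an invented artifact factor, not a measured value.
    print(round(quality_multiplier(0.995, 0.5), 4))
```

A multiplicative combination was presumably chosen over an average so that neither axis can compensate for the other; an artifact factor of zero yields zero regardless of process.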
Dogfood finding from the F1 leap campaign: after the visual leap, the lowest raw dimension was fun_challenge (0.38), but the dimension whose fix most raises the headline artifact_quality (and what a player actually perceives as "crap") was visual_fidelity (gap 0.24 x weight 0.28 = 0.067 > fun's 0.27 x 0.20 = 0.054). Leaps now target the largest weight x (bar - score), tie-breaking to the lower raw score.

- rubric_score.py: aggregate()/detect_plateau() expose `leap_target` (_weighted_gap_target); pure, deterministic.
- evolve 2-G/3-K: targeted_dimension = leap_target (falls back to weakest).
- verify.py: new check (visual outranks lower-scoring fun); 59 passing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
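The targeting rule in the commit message above (largest weight x (bar - score), ties going to the lower raw score) can be sketched as follows. This is an illustration, not the `_weighted_gap_target` implementation in rubric_score.py: the `Dimension` class, the `pick_leap_target` name, the rounding used to compare gaps, and the final name-based tie-break are assumptions. The bar of 0.65 is inferred from the quoted gaps (0.41 + 0.24 and 0.38 + 0.27), not stated in the PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension (hypothetical shape for this sketch)."""
    name: str
    score: float   # current raw score in [0, 1]
    weight: float  # contribution to the headline artifact quality


def weighted_gap(dim: Dimension, bar: float) -> float:
    """Headline gain available if this dimension were raised to the bar."""
    return dim.weight * max(0.0, bar - dim.score)


def pick_leap_target(dims: Sequence[Dimension], bar: float) -> Optional[str]:
    """Return the dimension with the largest weighted gap below the bar.

    Ties go to the lower raw score, then to the name so that the result
    stays deterministic. Returns None when nothing is below the bar.
    """
    below = [d for d in dims if d.score < bar]
    if not below:
        return None
    # Rounding keeps float noise from deciding what should be a tie.
    best = min(below, key=lambda d: (-round(weighted_gap(d, bar), 9), d.score, d.name))
    return best.name


if __name__ == "__main__":
    dims = [
        Dimension("visual_fidelity", score=0.41, weight=0.28),
        Dimension("fun_challenge", score=0.38, weight=0.20),
    ]
    # The lower raw score (fun_challenge) loses to the larger weighted gap.
    print(pick_leap_target(dims, bar=0.65))
```

With the figures quoted in the commit message, visual_fidelity has a weighted gap of about 0.067 against 0.054 for fun_challenge, so it is selected even though its raw score is higher.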
Why
The F1-racing dogfood graded A every cycle (Loop 0.995, futile 0%, mission-hit 100%) while producing a dismal game: a glowing cyan Z-fighting track, no recognizable car, and a polished HUD bolted over a broken core. The metrics were perfect and the artifact proved them a lie. This is a Goodhart collapse.
Root cause: the loop measured process (did a PR/commit advance?) and was blind to the artifact (is the thing good?), and its action selection (RICE + "one focused feature") could only ever take small steps.
Full diagnosis + design: .claude/ooda-evolution-v1.7.0.md. Evidence: .claude/evidence/f1-BEFORE.png (cyan blob) vs f1-AFTER-leap.png (real circuit view).

What changed
node smoke.mjs gated quality
quality_multiplier = process × artifact

Proof (with teeth)
Safety / hardening
Rubric is loop-read-only + hash-checked; pre-PR artifact gate + checkpoint baseline; min_dimension_delta revert; thrashing → HALT; per-leap cost cap. Design hardened by a 5-agent adversarial red-team (gaming-resistance / autonomous-safety / implementability).

Tests
tests/verify.py 58/58 PASS: 8 new artifact-axis checks + the previously-unregistered scorecard suite wired back in. New scripts/rubric_score.py is pure (offline-testable). plugin 1.6.1→1.7.0, config schema 1.2.0→1.3.0.

🤖 Generated with Claude Code