feat(v1.7.0): artifact-grounded evaluation + quantum-leap cycles by mataeil · Pull Request #71 · mataeil/OODA-loop

mataeil · 2026-06-17T23:23:12Z

Why

The F1-racing dogfood graded A every cycle (Loop 0.995, futile 0%, mission-hit 100%) while producing a dismal game: a glowing cyan Z-fighting track, no recognizable car, and a polished HUD bolted over a broken core. The metrics were perfect, yet the artifact proved them a lie: a Goodhart collapse.

Root cause: the loop measured process (did a PR/commit advance?) and was blind to the artifact (is the thing good?), and its action selection (RICE + "one focused feature") could only ever take small steps.

Full diagnosis + design: .claude/ooda-evolution-v1.7.0.md. Evidence: .claude/evidence/f1-BEFORE.png (cyan blob) vs f1-AFTER-leap.png (real circuit view).

What changed

| Defect | Fix |
| --- | --- |
| D2/D4: score ignored the artifact; only `node smoke.mjs` gated quality | Step 5-G Artifact Critique (independent critic scores the real artifact vs a human-authored rubric) + `quality_multiplier = process × artifact` |
| D1: grade pinned to A by a self-authored checklist | Honest scorecard: ★Artifact Quality headline, evidence-weighted goal, artifact-only Goodhart Guard (graduated C/D/F cap + warning) |
| D3: RICE structurally forbids overhauls; no quantum leaps | Leap cycles (2-G plateau + 3-K gate): plateau below bar forces a weakest-dimension overhaul (RICE bypassed, bigger budget, artifact-improvement gate) |
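
The multiplicative form is what makes a perfect process score unable to mask a poor artifact. A minimal sketch, assuming both factors are normalized to [0, 1]; the function name and the range check are illustrative and not taken from `score_outcome.py`:

```python
def quality_multiplier(process_factor: float, artifact_factor: float) -> float:
    """Combine the process axis and the artifact axis multiplicatively.

    Assumption: both factors are already normalized to [0, 1].
    """
    for name, value in (("process_factor", process_factor),
                        ("artifact_factor", artifact_factor)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # A product (rather than an average) means a weak artifact drags the
    # whole multiplier down no matter how clean the process was.
    return process_factor * artifact_factor
```

With a product, a cycle that ships PRs flawlessly but produces a zero-quality artifact gets a multiplier of 0, whereas an average would still award it half credit.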

Proof (with teeth)

F1 re-grades A (0.995) → D (0.567), Goodhart warning fires, Mission-hit 100% → 0%.
A real leap on the game lifted visual_fidelity 0.22 → 0.41 (independent re-critique; Δ+0.19 ≫ 0.05 gate).
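
The artifact-improvement gate above reduces to a delta check on the targeted dimension. A hedged sketch using the quoted numbers; `leap_passes_gate` and its `gate` parameter are hypothetical names, and only the 0.05 threshold comes from this PR:

```python
def leap_passes_gate(before: float, after: float, gate: float = 0.05) -> bool:
    """Return True when the re-critiqued score improved by at least `gate`.

    `before` and `after` are the targeted dimension's scores from the
    critique run before and after the leap (assumed to be in [0, 1]).
    """
    return (after - before) >= gate


# The visual_fidelity leap reported above: 0.22 -> 0.41.
print(leap_passes_gate(0.22, 0.41))
```

A change that only nudges the score, say 0.22 to 0.24, would fail the same check.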

Safety / hardening

Rubric is loop-read-only + hash-checked; pre-PR artifact gate + checkpoint baseline; min_dimension_delta revert; thrashing → HALT; per-leap cost cap. Design hardened by a 5-agent adversarial red-team (gaming-resistance / autonomous-safety / implementability).
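
One way a hash check on the rubric could work is to pin a digest when a human authors the rubric and refuse to score if the file no longer matches. This is a sketch under assumptions: the choice of SHA-256, the function names, and where the pinned digest is stored are not specified in this PR.

```python
import hashlib
from pathlib import Path


def rubric_digest(path: Path) -> str:
    """SHA-256 hex digest of the rubric file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def rubric_unchanged(path: Path, pinned_digest: str) -> bool:
    """True if the rubric still matches the digest pinned by a human.

    Assumption: `pinned_digest` lives somewhere the loop cannot write to;
    otherwise the loop could edit the rubric and re-pin it.
    """
    return rubric_digest(path) == pinned_digest
```

The check only has value if the pinned digest is outside the loop's write scope, which is why it pairs with the rubric being loop-read-only.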

Tests

tests/verify.py 58/58 PASS — 8 new artifact-axis checks + the previously-unregistered scorecard suite wired back in. New scripts/rubric_score.py is pure (offline-testable). plugin 1.6.1→1.7.0, config schema 1.2.0→1.3.0.

🤖 Generated with Claude Code

The F1-racing dogfood graded A every cycle (Loop 0.995, futile 0%) while producing a dismal game: a Goodhart collapse. The loop measured process (did a PR/commit advance?) and was blind to the artifact (is it good?), and RICE + "one focused feature" could only ever take small steps.

- Artifact axis: quality_multiplier = process_factor x artifact_factor (score_outcome.py); artifact_score is a first-class outcomes.json field.
- Step 5-G Artifact Critique: an independent, evidence-grounded critic scores the real artifact against a human-authored, integrity-checked rubric.
- Honest scorecard: Artifact Quality headline KPI, evidence-weighted goal, artifact-only Goodhart Guard (graduated C/D/F cap + warning). F1 re-grades A (0.995) -> D (0.567).
- Quantum-leap cycles (2-G plateau + 3-K gate): plateau below bar forces an overhaul of the weakest dimension (RICE bypassed, bigger budget, artifact-improvement gate, not unit-test). Demonstrated leap lifted the game's visual_fidelity 0.22 -> 0.41.
- Leap safety: pre-PR artifact gate + checkpoint baseline, min_dimension_delta revert, thrashing -> HALT, per-leap cost cap. Hardened by a 5-agent adversarial red-team.
- New scripts/rubric_score.py (pure); tests/verify.py 58 passing.

Diagnosis + design: .claude/ooda-evolution-v1.7.0.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
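
The graduated Goodhart Guard can be pictured as a cap on the process-derived grade, driven only by artifact quality. In the sketch below every threshold in `CAP_BANDS` is hypothetical (the PR does not state the real bands), as are the function name and the rule that the warning fires only when the cap actually lowers the grade:

```python
from typing import Tuple

GRADES = ["A", "B", "C", "D", "F"]  # ordered best to worst

# HYPOTHETICAL bands: (artifact-quality ceiling, best grade allowed below it).
# Listed from most severe to least severe so the first match wins.
CAP_BANDS = [(0.30, "F"), (0.45, "D"), (0.60, "C")]


def apply_goodhart_guard(process_grade: str,
                         artifact_quality: float) -> Tuple[str, bool]:
    """Cap the grade by artifact quality; return (grade, warning_fired)."""
    for ceiling, cap in CAP_BANDS:
        if artifact_quality < ceiling:
            # Take whichever of the two grades is worse.
            capped = GRADES[max(GRADES.index(process_grade),
                                GRADES.index(cap))]
            return capped, capped != process_grade
    return process_grade, False
```

Because the cap reads only `artifact_quality`, improving process metrics alone cannot lift a capped grade.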

Dogfood finding from the F1 leap campaign: after the visual leap, the lowest raw dimension was fun_challenge (0.38), but the dimension whose fix most raises the headline artifact_quality (and what a player actually perceives as "crap") was visual_fidelity (gap 0.24 x weight 0.28 = 0.067 > fun's 0.27 x 0.20 = 0.054). Leaps now target the largest weight x (bar - score), tie-breaking to the lower raw score.

- rubric_score.py: aggregate()/detect_plateau() expose `leap_target` (_weighted_gap_target); pure, deterministic.
- evolve 2-G/3-K: targeted_dimension = leap_target (falls back to weakest).
- verify.py: new check (visual outranks lower-scoring fun) — 59 passing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
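
The targeting rule in this commit can be sketched as a pure function. This is an illustration rather than the code in `rubric_score.py`: the signature is assumed, and the bar of 0.65 is inferred from the quoted gaps (0.41 + 0.24 and 0.38 + 0.27), not stated directly.

```python
from typing import Dict


def weighted_gap_target(scores: Dict[str, float],
                        weights: Dict[str, float],
                        bar: float) -> str:
    """Pick the dimension with the largest weight x (bar - score).

    Ties go to the lower raw score. If every dimension already meets the
    bar, all gaps are 0 and the lowest-scoring dimension is returned.
    """
    def sort_key(dim: str):
        gap = max(0.0, bar - scores[dim])
        # Negate the weighted gap so min() selects the largest one;
        # the raw score is the secondary key for tie-breaking.
        return (-weights[dim] * gap, scores[dim])

    return min(scores, key=sort_key)


scores = {"visual_fidelity": 0.41, "fun_challenge": 0.38}
weights = {"visual_fidelity": 0.28, "fun_challenge": 0.20}
print(weighted_gap_target(scores, weights, bar=0.65))  # visual_fidelity
```

fun_challenge has the lower raw score, but visual_fidelity wins because its heavier weight makes closing its gap worth more to the headline score.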

mataeil and others added 2 commits June 18, 2026 08:22

mataeil merged commit 7176243 into main Jun 19, 2026
2 checks passed

mataeil deleted the feat/v1.7.0-artifact-eval-leap branch June 19, 2026 05:34

mataeil merged 2 commits into main from feat/v1.7.0-artifact-eval-leap

mataeil commented Jun 17, 2026
