feat(v1.7.0): artifact-grounded evaluation + quantum-leap cycles #71

Merged
mataeil merged 2 commits into main from feat/v1.7.0-artifact-eval-leap
Jun 19, 2026
@mataeil mataeil commented Jun 17, 2026

Why

The F1-racing dogfood graded A every cycle (Loop 0.995, futile 0%, mission-hit 100%) while producing a dismal game: a glowing cyan, Z-fighting track, no recognizable car, and a polished HUD bolted onto a broken core. The metrics were perfect and the artifact proved them false, a Goodhart collapse.

Root cause: the loop measured process (did a PR/commit advance?) and was blind to the artifact (is the thing good?), and its action selection (RICE + "one focused feature") could only ever take small steps.

Full diagnosis + design: .claude/ooda-evolution-v1.7.0.md. Evidence: .claude/evidence/f1-BEFORE.png (cyan blob) vs f1-AFTER-leap.png (real circuit view).

What changed

| Defect | Fix |
|---|---|
| D2/D4: score ignored the artifact; only `node smoke.mjs` gated quality | Step 5-G Artifact Critique (independent critic scores the real artifact vs a human-authored rubric) + `quality_multiplier = process × artifact` |
| D1: grade pinned to A by a self-authored checklist | Honest scorecard: ★Artifact Quality headline, evidence-weighted goal, artifact-only Goodhart Guard (graduated C/D/F cap + warning) |
| D3: RICE structurally forbids overhauls; no quantum leaps | Leap cycles (2-G plateau + 3-K gate): plateau below bar forces a weakest-dimension overhaul (RICE bypassed, bigger budget, artifact-improvement gate) |
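
For reviewers who want the two scoring ideas in concrete form, here is a minimal sketch, not the code in score_outcome.py. The function names, the 0-to-1 input range, and the cap cutoffs (0.20 / 0.40 / 0.60) are assumptions made for illustration; only the product form `process × artifact` and the graduated C/D/F cap come from this PR.

```python
GRADES = "ABCDF"  # ordered best to worst

# Hypothetical cutoffs, chosen only for illustration (not the shipped values).
ARTIFACT_CAPS = [(0.20, "F"), (0.40, "D"), (0.60, "C")]


def quality_multiplier(process_factor: float, artifact_factor: float) -> float:
    """Multiply the two axes so a weak artifact drags the whole score down."""
    for name, value in (("process_factor", process_factor),
                        ("artifact_factor", artifact_factor)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return process_factor * artifact_factor


def capped_grade(process_grade: str, artifact_score: float) -> tuple[str, bool]:
    """Return (grade, warning). Only the artifact score decides the cap."""
    cap = next((g for cutoff, g in ARTIFACT_CAPS if artifact_score < cutoff), None)
    if cap is None:
        return process_grade, False
    final = max(process_grade, cap, key=GRADES.index)  # the worse of the two
    return final, final != process_grade  # warn only if the grade was lowered
```

With these illustrative cutoffs, `capped_grade("A", 0.35)` returns `("D", True)`: a perfect process grade cannot survive a poor artifact score. The product form has the same effect on the numeric score, since either factor near zero pulls the result toward zero.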

Proof (with teeth)

  • F1 re-grades A (0.995) → D (0.567), Goodhart warning fires, Mission-hit 100% → 0%.
  • A real leap on the game lifted visual_fidelity 0.22 → 0.41 (independent re-critique; Δ+0.19 ≫ 0.05 gate).
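
The gate arithmetic in the second bullet can be sketched as follows. This is an illustration under assumptions: `leap_gate` and its return values are hypothetical names, and only the 0.05 threshold and the 0.22 → 0.41 scores come from the bullets above.

```python
def leap_gate(before: float, after: float, min_delta: float = 0.05) -> str:
    """Keep a leap only if the re-critiqued artifact score rose by min_delta."""
    delta = after - before
    return "keep" if delta >= min_delta else "revert"


# visual_fidelity before and after the leap, as reported in this PR.
print(leap_gate(0.22, 0.41))  # keep
```

The decision depends only on the independently re-critiqued artifact score, not on unit tests or commit activity, which is the point of the artifact-improvement gate.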

Safety / hardening

Rubric is loop-read-only + hash-checked; pre-PR artifact gate + checkpoint baseline; min_dimension_delta revert; thrashing → HALT; per-leap cost cap. Design hardened by a 5-agent adversarial red-team (gaming-resistance / autonomous-safety / implementability).
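
One way a hash check on a read-only rubric could look, as a sketch only. SHA-256, the function names, and the idea of a digest pinned outside the loop's write scope are assumptions; the PR does not state which hash or storage location the plugin uses.

```python
import hashlib
import hmac
from pathlib import Path


def rubric_digest(data: bytes) -> str:
    """Hex SHA-256 of the rubric bytes (the hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def rubric_unchanged(data: bytes, pinned_digest: str) -> bool:
    """Compare against a digest recorded when a human last edited the rubric."""
    return hmac.compare_digest(rubric_digest(data), pinned_digest)


def check_rubric_file(path: Path, pinned_digest: str) -> None:
    """Refuse to score if the rubric on disk differs from the pinned version."""
    if not rubric_unchanged(path.read_bytes(), pinned_digest):
        raise RuntimeError(f"rubric {path} does not match its pinned digest")
```

Checking the digest before every critique means a loop that edits its own rubric fails loudly instead of quietly grading itself against easier criteria.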

Tests

tests/verify.py 58/58 PASS: 8 new artifact-axis checks, plus the previously unregistered scorecard suite wired back in. New scripts/rubric_score.py is pure (offline-testable). Plugin 1.6.1 → 1.7.0, config schema 1.2.0 → 1.3.0.

🤖 Generated with Claude Code

mataeil and others added 2 commits June 18, 2026 08:22
The F1-racing dogfood graded A every cycle (Loop 0.995, futile 0%) while
producing a dismal game — a Goodhart collapse. The loop measured process
(did a PR/commit advance?) and was blind to the artifact (is it good?), and
RICE + "one focused feature" could only ever take small steps.

- Artifact axis: quality_multiplier = process_factor x artifact_factor
  (score_outcome.py); artifact_score is a first-class outcomes.json field.
- Step 5-G Artifact Critique: an independent, evidence-grounded critic scores
  the real artifact against a human-authored, integrity-checked rubric.
- Honest scorecard: Artifact Quality headline KPI, evidence-weighted goal,
  artifact-only Goodhart Guard (graduated C/D/F cap + warning). F1 re-grades
  A (0.995) -> D (0.567).
- Quantum-leap cycles (2-G plateau + 3-K gate): plateau below bar forces an
  overhaul of the weakest dimension (RICE bypassed, bigger budget,
  artifact-improvement gate not unit-test). Demonstrated leap lifted the
  game's visual_fidelity 0.22 -> 0.41.
- Leap safety: pre-PR artifact gate + checkpoint baseline, min_dimension_delta
  revert, thrashing -> HALT, per-leap cost cap. Hardened by a 5-agent
  adversarial red-team.
- New scripts/rubric_score.py (pure); tests/verify.py 58 passing.

Diagnosis + design: .claude/ooda-evolution-v1.7.0.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Dogfood finding from the F1 leap campaign: after the visual leap, the lowest
raw dimension was fun_challenge (0.38) but the dimension whose fix most raises
the headline artifact_quality — and what a player actually perceives as "crap"
— was visual_fidelity (gap 0.24 x weight 0.28 = 0.067 > fun's 0.27 x 0.20 =
0.054). Leaps now target the largest weight x (bar - score), tie-breaking to the
lower raw score.
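
The selection rule above, worked through on the numbers in this message. This is a sketch, not the `_weighted_gap_target` implementation in rubric_score.py: the function name, the dict-based signature, and clamping negative gaps to zero are assumptions, and the bar of 0.65 is inferred from the stated gaps (0.41 + 0.24 and 0.38 + 0.27).

```python
def weighted_gap_target(scores: dict[str, float],
                        weights: dict[str, float],
                        bar: float) -> str:
    """Pick the dimension with the largest weight x (bar - score).

    Ties go to the lower raw score. Dimensions already at or above the
    bar contribute a gap of zero (an assumption of this sketch).
    """
    def sort_key(dim: str) -> tuple[float, float]:
        gap = max(0.0, bar - scores[dim])
        return (-weights[dim] * gap, scores[dim])

    return min(scores, key=sort_key)


scores = {"visual_fidelity": 0.41, "fun_challenge": 0.38}
weights = {"visual_fidelity": 0.28, "fun_challenge": 0.20}
print(weighted_gap_target(scores, weights, bar=0.65))  # visual_fidelity
```

fun_challenge has the lower raw score, but its weighted gap (about 0.054) is smaller than visual_fidelity's (about 0.067), so the leap targets visual_fidelity.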

- rubric_score.py: aggregate()/detect_plateau() expose `leap_target`
  (_weighted_gap_target); pure, deterministic.
- evolve 2-G/3-K: targeted_dimension = leap_target (falls back to weakest).
- verify.py: new check (visual outranks lower-scoring fun) — 59 passing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mataeil mataeil merged commit 7176243 into main Jun 19, 2026
2 checks passed
@mataeil mataeil deleted the feat/v1.7.0-artifact-eval-leap branch June 19, 2026 05:34