From a68193cc25bccfb7cd8060b0b3fc4b74fed77943 Mon Sep 17 00:00:00 2001
From: Taeil Ma <taeil.ma0915@gmail.com>
Date: Fri, 19 Jun 2026 14:55:34 +0900
Subject: [PATCH 1/2] =?UTF-8?q?feat(v1.8.0):=20drive=20quality=20to=20"goo?=
 =?UTF-8?q?d"=20=E2=80=94=20fix=20thrashing=20guard,=20per-dimension=20cap?=
 =?UTF-8?q?ture,=20dimension-lock,=20remainder?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The F1 probe stayed crap after 3 v1.7.0 leaps (artifact 0.394→0.447→0.472→0.522,
never reaching bar 0.65). A 13-agent adversarially-verified diagnosis found the
loop DETECTS a quality gap but isn't built to CLOSE it.

- 2-G thrashing-guard BUG FIX: counted a nonexistent `leap_delta` on
  weakest_dimension → fails was ALWAYS 0, HALT never fired. Now counts
  leap_attempts[].delta_score on leap_target (rubric_score.failed_leaps()).
- 5-G per-dimension capture_method: experiential axes (driving_feel +
  fun_challenge = 45% of weight, frozen across all 25 cycles) use a
  human-authored, hash-verified, protected gameplay_metrics harness; missing →
  null + skill_gap, never a faked/silent-screenshot score.
- 2-G dimension lock until bar: a successful leap below bar keeps the plateau on
  the SAME target (drive-to-bar, not detect-and-nudge+rotate). lock_target();
  config.leap.lock_until_bar; tolerance band + working max-attempts HALT.
- 5-G auto-queue remainder: a gate-passing leap still below bar queues a
  high-RICE remainder, triggered by the independent critic (not self-report).

Rejected: raising bar to 0.80 yet, inner refine loop, LLM-coverage gate,
multi_probe. verify.py 59 → 61. plugin 1.7→1.8, config schema 1.3→1.4.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .claude-plugin/plugin.json       |   2 +-
 .claude/ooda-evolution-v1.8.0.md | 100 ++++++++++++++++++++++
 CHANGELOG.md                     |  36 ++++++++
 config.example.json              |  19 ++++-
 scripts/rubric_score.py          |  37 ++++++++
 skills/evolve/SKILL.md           | 140 +++++++++++++++++++++++++------
 tests/verify.py                  |  36 ++++++++
 7 files changed, 338 insertions(+), 32 deletions(-)
 create mode 100644 .claude/ooda-evolution-v1.8.0.md

diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
index 4a3c5ac..da5041a 100644
--- a/.claude-plugin/plugin.json
+++ b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "ooda-loop",
   "displayName": "OODA-loop",
-  "version": "1.7.0",
+  "version": "1.8.0",
   "description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
   "author": {
     "name": "Taeil Ma",
diff --git a/.claude/ooda-evolution-v1.8.0.md b/.claude/ooda-evolution-v1.8.0.md
new file mode 100644
index 0000000..6e1d4ae
--- /dev/null
+++ b/.claude/ooda-evolution-v1.8.0.md
@@ -0,0 +1,100 @@
+# OODA-loop v1.8.0 — driving quality to "good", not "passable"
+
+**Date:** 2026-06-19
+**Purpose reminder:** the f1-racing game is a **probe**, not the product. Its job
+is to expose what OODA-loop still gets wrong. This release is about the loop.
+
+## The probe's verdict (why we believe the loop is still the bottleneck)
+
+v1.7.0 fixed *measurement* (artifact axis) and *anti-incrementalism* (leap
+cycles). It worked — the loop went from a flat, lying A to a climbing, honest D.
+But after **three** real leap cycles the owner's verdict is unchanged: *still too
+crappy.* The data says why:
+
+| | artifact_quality | note |
+|---|---|---|
+| 22 feature cycles (old loop) | **0.394** (flat) | Goodhart collapse |
+| + leap 1 (visual) | 0.447 | +0.053 |
+| + leap 2 (track) | 0.472 | +0.025 |
+| + leap 3 (car/cockpit) | 0.522 | +0.050 |
+| bar (shippable) | 0.65 | not reached |
+| "genuinely good" | ~0.80 | far off |
+
+Average **~+0.04 per leap**. At that rate "good" is ~7 leaps away. The loop
+*climbs* but it climbs **too slowly and stops too low**. Three concrete failures
+observed during the campaign point at the loop, not the game:
+
+1. **It settles for "+0.05 better."** A leap's success test is
+   `min_dimension_delta` (0.05) — so the moment a dimension nudges up, the loop
+   declares victory and moves the weighted-gap target elsewhere, abandoning a
+   dimension while it is still *bad* (visual went 0.22→0.41→0.59 across two
+   separate leaps with other work in between; it was never driven to the bar in
+   one sustained push).
+2. **It accepts partial implementation of its own plan.** Leap 3's design panel
+   specced car/cockpit **and** materials/lighting **and** environment; only
+   car/cockpit shipped. The dropped materials/lighting (shadows, tone-mapping,
+   textures) was the *single most-repeated* critic complaint at every leap — and
+   it silently vanished. The act step under-delivers with no completion check.
+3. **It can't see motion, feel, or fun.** The critic scores one still
+   screenshot, so `driving_feel` and `fun_challenge` are unmeasurable and
+   unimprovable — they sit at 0.38–0.51 and cap the weighted mean.
+
+## Root cause (from a 13-agent, adversarially-verified diagnosis)
+
+> The loop **detects** a quality gap and takes *a* step at it, but is not built to
+> **close** it. It (1) can't see ~45% of its own rubric — `driving_feel` + `fun_challenge`
+> were scored from a still screenshot and sat frozen across all 25 cycles; (2)
+> abandons a dimension after a +0.05 nudge and rotates targets instead of driving
+> one to the bar; (3) had a **silently broken thrashing guard** (read a nonexistent
+> `leap_delta` field → never fired → could thrash forever); and (4) accepts partial
+> implementation of its own leap plans, orphaning the rest. It optimizes to
+> "better," never to "good."
+
+The devil's-advocate agent confirmed the leap *routing* is sound — the binding
+constraint is **perception** (the critic can't measure feel/fun), not more leap
+machinery. That reframed the fix.
+
+## The v1.8.0 upgrades (ranked by leverage)
+
+1. **Fix the thrashing guard (prerequisite, real bug).** 2-G now counts
+   `leap_attempts[].delta_score` on the `leap_target` (was: nonexistent `leap_delta`
+   on `weakest_dimension`). The HALT safety valve actually fires now —
+   `rubric_score.failed_leaps()` is the deterministic ground truth.
+2. **Per-dimension `capture_method` (5-G).** Each rubric axis declares how its
+   evidence is captured; experiential axes use `gameplay_metrics` — a
+   HUMAN-AUTHORED, hash-verified, protected harness (same independence invariant
+   as the rubric hash). Missing/unverified harness → the axis scores `null`
+   (capture_failure) + a skill_gap, never a faked or screenshot-fallback score.
+   Unlocks the frozen 45% of rubric weight.
+3. **Dimension lock until bar (2-G).** After a successful leap whose target is
+   still below `bar − eps`, keep the plateau active on the SAME target so 3-K
+   leaps it again next cycle — drive-to-bar, not detect-and-nudge. A tolerance
+   band + the (now-working) max-attempts HALT prevent infinite lock.
+   `rubric_score.lock_target()` is the deterministic ground truth;
+   `config.leap.lock_until_bar` (default true) toggles it.
+4. **Auto-queue the remainder (5-G).** If a leap passes its delta gate but the
+   dimension is still below bar, the *independent critic's* score (not the
+   maker's self-report) queues a high-RICE remainder so dropped scope can't
+   vanish. Records `leap_dim_still_below_bar`.
+
+All four are deterministic where possible (`scripts/rubric_score.py`), config-
+driven, and gaming-resistant. `tests/verify.py` 59 → **61**.
+
+## What we deliberately did NOT change (rejected / devil's-advocate-validated)
+
+- **Don't raise the bar to 0.80 yet.** It only re-points the weighted-gap target
+  at an unmeasurable dimension and (pre-fix) would HALT the loop. Raise only after
+  a `gameplay_metrics` leap proves feel/fun are responsive. Stage 0.65→0.75→0.80.
+- **No inner multi-pass refine loop (yet).** The 3 completed leaps each landed
+  substantial single-pass deltas (+0.19/+0.27/+0.18); the cap was perception, not
+  pass-count. Adding passes on unmeasurable dimensions = cost with no signal.
+- **No LLM-component-coverage gate.** Gameable (touch one file per "component" →
+  100%). Change 4's critic-driven remainder is the robust substitute.
+- **No `multi_probe` still-sequence.** A burst of stills still can't tell
+  responsive steering from sluggish; `gameplay_metrics` is the right instrument.
+
+## Validation (leap 4 under the upgraded loop)
+
+<!-- run a leap with the new loop; show visual_fidelity driven toward the bar in a
+     sustained push (dimension lock) and the previously-dropped materials/lighting
+     actually shipped -->
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a6dcb8a..1deb187 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,42 @@ independently. Bump there signals migration work for downstream projects.
 
 ---
 
+## [v1.8.0] — 2026-06-19
+
+### Changed — drive quality to "good", not "passable" (config schema 1.4.0)
+
+The F1 probe stayed crap after **three** v1.7.0 leaps (artifact 0.394 → 0.447 →
+0.472 → 0.522, never reaching the 0.65 bar). A 13-agent adversarially-verified
+diagnosis (`.claude/ooda-evolution-v1.8.0.md`) found the loop **detects** a
+quality gap but isn't built to **close** it. Four fixes, ranked by leverage:
+
+- **Thrashing-guard bug fix (prerequisite).** evolve 2-G counted a nonexistent
+  `leap_delta` field on `weakest_dimension`, so the guard's `fails` count was
+  ALWAYS 0 and the HALT safety valve **never fired** — the loop could thrash a
+  dimension forever. Now counts `leap_attempts[].delta_score` on the actual
+  `leap_target` (`rubric_score.failed_leaps()`).
+- **Per-dimension `capture_method` (5-G).** The critic scored every axis from one
+  screenshot, so `driving_feel` + `fun_challenge` (**45% of the rubric's weight**)
+  were frozen across all 25 cycles. Each axis now declares its capture;
+  experiential axes use `gameplay_metrics` — a human-authored, hash-verified,
+  protected harness (same independence invariant as the rubric hash). Missing/
+  unverified → score `null` + skill_gap, never a faked or silent-screenshot score.
+- **Dimension lock until bar (2-G).** A successful leap that left its target below
+  `bar − eps` now keeps the plateau active on the SAME target (drive-to-bar, not
+  detect-and-nudge + rotate). `rubric_score.lock_target()`; toggle with
+  `config.leap.lock_until_bar`. Tolerance band + the now-working max-attempts HALT
+  prevent infinite lock.
+- **Auto-queue remainder (5-G).** A leap that passes its delta gate but leaves the
+  dimension below bar queues a high-RICE remainder, triggered by the *independent
+  critic's* score (not the maker's self-report), so dropped/partial scope can't be
+  silently orphaned (as leap 3's materials/lighting was).
+
+**Rejected** (kept the loop honest): raising the bar to 0.80 before feel/fun are
+measurable; an inner multi-pass refine loop; an LLM-component-coverage gate
+(gameable); `multi_probe` still-sequences. `tests/verify.py` 59 → **61**.
+
+---
+
 ## [v1.7.0] — 2026-06-17
 
 ### Added — artifact-grounded evaluation + quantum-leap cycles (config schema 1.3.0)
diff --git a/config.example.json b/config.example.json
index 61e0956..56df709 100644
--- a/config.example.json
+++ b/config.example.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": "1.3.0",
+  "schema_version": "1.4.0",
   "project": {
     "name": "my-app",
     "locale": "en",
@@ -270,16 +270,27 @@
     "plateau_window": 4,
     "plateau_eps": 0.05,
     "locked": true,
-    "dimensions": []
+    "__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.",
+    "dimensions": [],
+    "__example_dimension__": {
+      "name": "driving_feel",
+      "weight": 0.25,
+      "capture_method": "gameplay_metrics",
+      "gameplay_metrics_command": "node tools/feel_harness.mjs",
+      "gameplay_metrics_hash": "<sha256 of the harness file>",
+      "metrics_fields": ["input_lag_ms", "physics_response_ms", "completion_rate"],
+      "description": "Responsive steering, weight transfer, distinct braking — judge against: input_lag_ms<40, physics_response_ms stable, completion_rate>0.7."
+    }
   },
   "leap": {
-    "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend.",
+    "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge).",
     "max_lines": 1500,
     "min_dimension_delta": 0.05,
     "max_attempts_per_dimension": 2,
     "max_per_day": 2,
     "gap_weight": 30.0,
-    "cost_limit_usd": 0.5
+    "cost_limit_usd": 0.5,
+    "lock_until_bar": true
   },
   "goal_completion_idle": true
 }
diff --git a/scripts/rubric_score.py b/scripts/rubric_score.py
index a1fe391..f995652 100644
--- a/scripts/rubric_score.py
+++ b/scripts/rubric_score.py
@@ -214,6 +214,43 @@ def detect_plateau(outcomes: list, rubric: dict) -> dict:
     }
 
 
+def failed_leaps(outcomes: list, dimension: str, min_delta: float) -> int:
+    """v1.8.0 thrashing-guard count (deterministic ground truth for evolve 2-G).
+    Counts leap cycles whose `leap_attempts` on `dimension` failed to clear
+    `min_delta`. The v1.7.x engine read a nonexistent field `leap_delta` and
+    matched the raw weakest_dimension, so this was ALWAYS 0 and the guard never
+    fired — the loop could thrash forever. The real ledger field is
+    `leap_attempts[].delta_score`, keyed on the dimension actually leapt."""
+    n = 0
+    for e in outcomes:
+        if e.get("cycle_mode") != "leap":
+            continue
+        for a in (e.get("leap_attempts") or []):
+            if a.get("dimension") == dimension and a.get("delta_score", 1.0) < min_delta:
+                n += 1
+    return n
+
+
+def lock_target(outcomes: list, rubric: dict, leap_target: str | None) -> str | None:
+    """v1.8.0 dimension-lock: after a SUCCESSFUL leap whose target is still below
+    (bar − eps), return that target so evolve 2-G keeps the plateau active on it
+    (drive-to-bar) instead of coasting through feature cycles and re-rotating.
+    Returns None when nothing should be locked (no leap last, regressed, or the
+    target is at/near bar — the tolerance band stops critic variance from locking
+    the loop forever; the max_attempts HALT is the genuine-stuck backstop)."""
+    if not leap_target or not outcomes:
+        return None
+    last = outcomes[-1]
+    if last.get("cycle_mode") != "leap" or last.get("result_type") == "leap_regressed":
+        return None
+    bar = rubric.get("bar", DEFAULT_BAR)
+    eps = float(rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS))
+    score = (last.get("dimension_scores") or {}).get(leap_target)
+    if score is None:
+        return None
+    return leap_target if score < bar - eps else None
+
+
 def compute(project: Path) -> dict:
     ev = Path(project) / "agent" / "state" / "evolve"
     config = _load(Path(project) / "config.json", {}) or {}
diff --git a/skills/evolve/SKILL.md b/skills/evolve/SKILL.md
index 42017c8..e4097b6 100644
--- a/skills/evolve/SKILL.md
+++ b/skills/evolve/SKILL.md
@@ -727,18 +727,27 @@ for the implementation/build domain that declares config.domains[d].quality_rubr
   -- requires below-bar — slow upward drift that never reaches bar still counts as
   -- stuck, and "already good" (>= bar) is never a plateau.
   if p.plateau:
-    -- THRASHING GUARD: if a leap on this weakest_dimension has already failed to
-    -- clear config.leap.min_dimension_delta `config.leap.max_attempts_per_dimension`
-    -- times (default 2) — counted from outcomes[].leap_attempts — do NOT leap
-    -- again. Escalate to a halt for human help:
-    fails = count(e in outcomes where cycle_mode=="leap"
-                  AND weakest_dimension == p.weakest_dimension
-                  AND e.leap_delta < config.leap.min_dimension_delta)
+    -- THRASHING GUARD (v1.8.0 fix): if leaps on the TARGET dimension have already
+    -- failed to clear config.leap.min_dimension_delta
+    -- `config.leap.max_attempts_per_dimension` times (default 2), do NOT leap
+    -- again — escalate to a HALT for human help.
+    -- BUG FIX: the v1.7.x guard read a nonexistent field `e.leap_delta` and
+    -- matched `weakest_dimension`, so `fails` was ALWAYS 0 and the guard NEVER
+    -- fired — the loop could thrash forever on an unmeasurable dimension. The
+    -- real ledger field is `leap_attempts[].delta_score`, and the dimension that
+    -- was actually leapt is `leap_target` (the weighted-gap winner from 3-K),
+    -- which can differ from the raw `weakest_dimension`.
+    fails = count(e in outcomes where e.cycle_mode == "leap"
+                  for a in (e.leap_attempts or [])
+                  if a.dimension == p.leap_target
+                     AND a.delta_score < config.leap.min_dimension_delta)
     if fails >= config.leap.max_attempts_per_dimension:
-      record skill_gap { name: "leap_stuck_{p.weakest_dimension}", type:"quality_gap",
-        detail:"Leap on {dim} failed {fails}× without clearing min delta" }
-      Create HALT: "Leap on '{p.weakest_dimension}' failed {fails}× — human review
-                    needed; the loop cannot self-fix this dimension."
+      record skill_gap { name: "leap_stuck_{p.leap_target}", type:"quality_gap",
+        detail:"Leap on {p.leap_target} failed {fails}× without clearing min delta —
+                likely UNMEASURABLE by the current capture_method (see 5-G)." }
+      Create HALT: "Leap on '{p.leap_target}' failed {fails}× — human review needed:
+                    supply a richer capture_method/metrics harness for this
+                    dimension, or reweight the rubric. The loop cannot self-fix it."
       set plateau_leap_blocked = true
     else:
       set orient.plateau = { active:true, leap_target:p.leap_target,
@@ -746,12 +755,37 @@ for the implementation/build domain that declares config.domains[d].quality_rubr
                              artifact_score:p.latest, reason:p.reason }
       Print "[Orient] 📉 PLATEAU. Leap target '{p.leap_target}' ({p.reason}). Next cycle should LEAP, not add a feature."
   else:
-    set orient.plateau = { active:false }
+    -- DIMENSION LOCK until bar (v1.8.0): the v1.7.x loop cleared the plateau the
+    -- instant a leap produced any improvement (+min_delta), then coasted through
+    -- ~plateau_window−1 feature cycles before re-detecting — and re-targeted via a
+    -- recomputed weighted gap, so a partially-fixed dimension could LOSE its slot
+    -- before reaching bar. Result: the loop "detected and nudged" instead of
+    -- "detected and CLOSED" (the F1 probe: visual_fidelity took 0.22→0.41→0.59
+    -- across non-consecutive leaps and never reached bar). Fix: if the last cycle
+    -- was a SUCCESSFUL leap whose target is still below bar, keep the plateau
+    -- active on the SAME target so 3-K leaps it again next cycle — drive it to bar.
+    last = outcomes.entries[-1] if outcomes.entries else null
+    bar = config.domains[d].quality_rubric.bar
+    if (config.leap.lock_until_bar (default true)
+        AND last is not null AND last.cycle_mode == "leap"
+        AND last.result_type != "leap_regressed"
+        AND p.leap_target is not null
+        AND last.dimension_scores[p.leap_target] < bar - rubric.plateau_eps):
+        -- tolerance band: within plateau_eps of bar counts as cleared, so critic
+        -- variance near threshold can't lock the loop forever (max_attempts HALT
+        -- is the backstop if it genuinely can't clear).
+      set orient.plateau = { active:true, leap_target:p.leap_target,
+                             weakest_dimension:p.weakest_dimension,
+                             artifact_score:p.latest,
+                             reason:"'{p.leap_target}' improved to {last.dimension_scores[p.leap_target]} but still < bar {bar} — keep leaping" }
+      Print "[Orient] 🔒 LEAP succeeded but '{p.leap_target}' still below bar. Locking target — leap again, don't coast."
+    else:
+      set orient.plateau = { active:false }
 ```
 
-`orient.plateau` is consumed by Step 3-K. A plateau cleared by a successful leap
-(artifact_score rose, or weakest_dimension changed) resets automatically — it is
-derived from outcomes.json each cycle, not a persisted counter.
+`orient.plateau` is consumed by Step 3-K. With **lock_until_bar** the loop drives
+one dimension to its bar before moving on (constraint *exploitation*, not
+rotation); set `config.leap.lock_until_bar=false` to restore v1.7.x rotation.
 
 ---
 
@@ -1810,23 +1844,51 @@ if rubric is null OR this cycle produced no artifact (observe/futile/error/skip)
 -- the loop may not author its own grading standard"). On first run, record the
 -- hash. (config.*.quality_rubric SHOULD be listed in config.safety.protected_paths.)
 
--- CAPTURE the actual artifact via rubric.capture_method:
---   "screenshot"  → serve + Playwright screenshot to a temp PNG (web UIs)
---   "api_call"    → run rubric.capture_command, capture stdout/response
---   "benchmark"   → run rubric.capture_command, capture metrics
-capture = run rubric.capture_command (or the screenshot path)
+-- CAPTURE per-DIMENSION (v1.8.0): the v1.7.x critic took ONE screenshot for every
+-- axis, so experiential dimensions (driving_feel, fun_challenge — 45% of the F1
+-- rubric's weight) were unmeasurable from a still and FROZEN across every cycle —
+-- the single biggest reason quality stalled (the loop was optimizing 55% of its
+-- own rubric). Each dimension now declares how to capture the evidence it needs:
+--   "screenshot"       → serve + Playwright screenshot (visual axes; one shot reused for all)
+--   "api_call"         → run capture_command, capture stdout/response (library/API axes)
+--   "benchmark"        → run capture_command, capture metrics
+--   "gameplay_metrics" → run a HUMAN-AUTHORED harness that exercises the artifact
+--                        (e.g. drive 10s, log input_lag_ms / physics_response_ms /
+--                        completion_rate) so feel/fun become measurable.
+for dim in rubric.dimensions:
+  method = dim.capture_method or rubric.capture_method or "screenshot"
+  if method == "gameplay_metrics":
+    -- INDEPENDENCE GATE (same invariant as rubric_hash): the harness must be
+    -- human-authored and out of the loop's reach, or its score is worthless.
+    --   (1) dim.gameplay_metrics_command path MUST be in config.safety.protected_paths
+    --   (2) sha256(harness file) MUST equal dim.gameplay_metrics_hash
+    -- If either fails: dim_artifact[dim] = CAPTURE_FAILURE; record
+    --   skill_gap("gameplay_metrics_not_independent_{dim}"); do NOT fall back to a
+    --   screenshot silently (that would re-freeze the dimension). The dimension
+    --   scores null and the 2-G thrashing guard will eventually HALT for a human.
+    if not independent: dim_artifact[dim] = null (capture_failure); continue
+    dim_artifact[dim] = run_protected_harness(dim.gameplay_metrics_command)  -- metrics JSON
+  else if method == "screenshot":
+    dim_artifact[dim] = the shared screenshot (captured once)
+  else:
+    dim_artifact[dim] = run dim.capture_command (or rubric.capture_command)
+-- If a dimension's harness is MISSING entirely, score it null (capture_failure) +
+-- skill_gap("gameplay_metrics_harness_missing_{dim}") — degrade gracefully, never
+-- fake a score. The loop keeps improving the dimensions it CAN see.
 
 -- INDEPENDENT critique. Separate model context (config.eval.model or default).
--- Give it ONLY: the rubric axes+descriptions, the captured artifact (image/output),
+-- Give it ONLY: the rubric axes+descriptions, each dimension's captured evidence,
 -- and the mission. NOT the build cycle's own reasoning (that would let the maker
--- grade itself).
+-- grade itself). Dimensions whose capture failed are passed as null → scored null.
 verdict = critic(
   system: "You are an independent product critic. You did NOT build this. Score
-           the ACTUAL artifact against each rubric axis 0..1 from what you observe.
-           Cite concrete evidence per axis. Be harsh; default low when unsure.
-           A good HUD cannot compensate for a broken core. Output
-           {dimension_scores:{axis:score}, weakest_dimension, critique(<=30 words)}.",
-  input:  { mission, rubric.dimensions, artifact: capture }
+           each rubric axis 0..1 from ITS captured evidence (image, metrics JSON,
+           or api output). Cite concrete evidence per axis. Be harsh; default low
+           when unsure. A good HUD cannot compensate for a broken core. For a
+           metrics axis, judge against the axis description's targets, not vibes.
+           Score null only if the evidence is null. Output
+           {dimension_scores:{axis:score|null}, weakest_dimension, critique(<=30 words)}.",
+  input:  { mission, rubric.dimensions, evidence: dim_artifact }
 ) -> { dimension_scores, weakest_dimension, critique }
 
 -- AGGREGATE deterministically (scripts/rubric_score.py — pure, no model):
@@ -1857,6 +1919,30 @@ trigger time on the *pre-change* artifact to anchor `leap_baseline_scores`
 targeted dimension rose by at least `config.leap.min_dimension_delta`. If it did
 not, the leap is reverted (4-C2) and recorded `leap_regressed` (quality 0.0).
 
+**Remainder auto-queue (v1.8.0).** A leap can PASS the gate (delta ≥ min_delta)
+yet leave its dimension still below bar — and a multi-component plan can ship only
+part of itself (the F1 probe: leap 3 shipped car/cockpit but the designed
+materials/lighting/shadows — the #1 recurring critic complaint — silently
+vanished). After the post-leap critique, if the targeted dimension is still below
+bar, record it and queue the remainder so it is never orphaned:
+```
+if cycle_mode == "leap":
+  post = dimension_scores[orient.plateau.leap_target]
+  bar  = config.domains[d].quality_rubric.bar
+  outcomes.entries[-1].leap_dim_still_below_bar = (post < bar)   -- crisp 2-G signal
+  if post < bar:
+    action_queue.pending.append({
+      id: next, title: "quality remainder: '{leap_target}' at {post} still < bar {bar}",
+      source_domain: build_domain, rice_score: 90, status: "pending",
+      origin: "leap_remainder_auto" })
+    Print "[5-G] Remainder queued: '{leap_target}' needs +{bar−post} more."
+```
+The trigger is the *independent critic's* score, not the build cycle's
+self-reported "plan complete" — so the maker cannot suppress it (the
+LLM-component-coverage gate was rejected as gameable). With **dimension lock**
+(2-G) active this is usually redundant for consecutive leaps, but it is the
+backstop when the lock releases near bar or `lock_until_bar=false`.
+
 ---
 
 ## Step 6: State Update + Commit
diff --git a/tests/verify.py b/tests/verify.py
index ae33e59..666096c 100644
--- a/tests/verify.py
+++ b/tests/verify.py
@@ -683,6 +683,42 @@ def _mod(name, fn):
         f"weighted-gap target={target} (lowest raw was {lowest_raw})",
     )
 
+    # 9) v1.8.0 thrashing-guard bug fix: failed_leaps counts leap_attempts[].delta_score
+    # on the leap_target. The v1.7.x engine read a nonexistent `leap_delta` → always 0.
+    oc_fail = [
+        {"cycle_mode": "leap", "leap_attempts": [{"dimension": "fun_challenge", "delta_score": 0.01}]},
+        {"cycle_mode": "feature", "leap_attempts": []},
+        {"cycle_mode": "leap", "leap_attempts": [{"dimension": "fun_challenge", "delta_score": 0.02}]},
+        {"cycle_mode": "leap", "leap_attempts": [{"dimension": "visual_fidelity", "delta_score": 0.18}]},
+    ]
+    fails = R.failed_leaps(oc_fail, "fun_challenge", 0.05)
+    nofake = R.failed_leaps(oc_fail, "visual_fidelity", 0.05)  # 0.18 cleared → 0 fails
+    r.check(
+        "artifact-axis: thrashing guard counts real leap_attempts deltas (v1.8.0 bug fix)",
+        fails == 2 and nofake == 0,
+        f"fun_challenge fails={fails} (want 2), visual fails={nofake} (want 0)",
+    )
+
+    # 10) v1.8.0 dimension lock: stay on a below-bar target after a successful leap;
+    # release once at/near bar (tolerance band) so critic variance can't lock forever.
+    rubL = R.rubric_of({"quality_rubric": {"bar": 0.65, "plateau_eps": 0.05,
+        "dimensions": [{"name": "visual_fidelity", "weight": 1}]}})
+    below = [{"cycle_mode": "leap", "result_type": "pr_created",
+              "dimension_scores": {"visual_fidelity": 0.45}}]
+    nearbar = [{"cycle_mode": "leap", "result_type": "pr_created",
+                "dimension_scores": {"visual_fidelity": 0.62}}]   # within eps of 0.65 → release
+    regressed = [{"cycle_mode": "leap", "result_type": "leap_regressed",
+                  "dimension_scores": {"visual_fidelity": 0.45}}]
+    r.check(
+        "artifact-axis: dimension lock holds below-bar, releases near bar / on regress (v1.8.0)",
+        R.lock_target(below, rubL, "visual_fidelity") == "visual_fidelity"
+        and R.lock_target(nearbar, rubL, "visual_fidelity") is None
+        and R.lock_target(regressed, rubL, "visual_fidelity") is None,
+        f"below={R.lock_target(below, rubL, 'visual_fidelity')}, "
+        f"nearbar={R.lock_target(nearbar, rubL, 'visual_fidelity')}, "
+        f"regressed={R.lock_target(regressed, rubL, 'visual_fidelity')}",
+    )
+
 
 def main() -> int:
     r = Runner()

From 47c958ae333b732abfd0d80e58493a327072bd4a Mon Sep 17 00:00:00 2001
From: Taeil Ma <taeil.ma0915@gmail.com>
Date: Fri, 19 Jun 2026 15:05:28 +0900
Subject: [PATCH 2/2] =?UTF-8?q?docs(v1.8.0):=20fill=20validation=20?=
 =?UTF-8?q?=E2=80=94=20leap=204=20hit=20the=20screenshot=20ceiling;=20bott?=
 =?UTF-8?q?leneck=20moved=20to=20perception?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .claude/ooda-evolution-v1.8.0.md | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/.claude/ooda-evolution-v1.8.0.md b/.claude/ooda-evolution-v1.8.0.md
index 6e1d4ae..6a6fe10 100644
--- a/.claude/ooda-evolution-v1.8.0.md
+++ b/.claude/ooda-evolution-v1.8.0.md
@@ -95,6 +95,26 @@ driven, and gaming-resistant. `tests/verify.py` 59 → **61**.
 
 ## Validation (leap 4 under the upgraded loop)
 
-<!-- run a leap with the new loop; show visual_fidelity driven toward the bar in a
-     sustained push (dimension lock) and the previously-dropped materials/lighting
-     actually shipped -->
+Ran a real leap under v1.8.0, targeting `visual_fidelity` (dimension lock kept it
+as the target). It shipped exactly the materials/lighting that leap 3 dropped —
+ACES filmic tone mapping, soft directional shadows (cars/buildings/kerbs cast,
+road/ground receive), and baked contact shadows under every car.
+
+- Independent re-critique: **visual_fidelity 0.59 → 0.63** (Δ +0.04). Building
+  shadows and tonal range are now visible; objects are grounded.
+- Artifact trajectory across all leaps: **0.394 → 0.447 → 0.472 → 0.522 → 0.533**.
+
+**The decisive finding (the whole point — f1 is the probe):** +0.04 is *at the
+gate boundary*, and the critic still under-credits the change — because
+`visual_fidelity` has hit the **ceiling of what a still screenshot can measure
+(~0.63)**. The loop's next weighted-gap target is `fun_challenge` (0.38), which a
+screenshot *cannot* judge at all. So the upgraded loop's correct next move is
+exactly what v1.8.0 built: with the **fixed** thrashing guard, two failed leaps on
+`fun_challenge` → **HALT requesting a human-authored `gameplay_metrics` harness**,
+instead of thrashing forever (the v1.7.x bug) or faking a score.
+
+The bottleneck has *moved*: from "the loop can't close gaps" (v1.7.0) to "the loop
+can't **perceive** the experiential half of quality" (now). v1.8.0's per-dimension
+`capture_method` is the mechanism to close it — but it requires a human to author
+the measurement harness, because the loop is forbidden (by design) from grading
+its own standard. **That hand-off is the next experiment.**