From a68193cc25bccfb7cd8060b0b3fc4b74fed77943 Mon Sep 17 00:00:00 2001 From: Taeil Ma Date: Fri, 19 Jun 2026 14:55:34 +0900 Subject: [PATCH 1/2] =?UTF-8?q?feat(v1.8.0):=20drive=20quality=20to=20"goo?= =?UTF-8?q?d"=20=E2=80=94=20fix=20thrashing=20guard,=20per-dimension=20cap?= =?UTF-8?q?ture,=20dimension-lock,=20remainder?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The F1 probe stayed crap after 3 v1.7.0 leaps (artifact 0.394→0.447→0.472→0.522, never reaching bar 0.65). A 13-agent adversarially-verified diagnosis found the loop DETECTS a quality gap but isn't built to CLOSE it. - 2-G thrashing-guard BUG FIX: counted a nonexistent `leap_delta` on weakest_dimension → fails was ALWAYS 0, HALT never fired. Now counts leap_attempts[].delta_score on leap_target (rubric_score.failed_leaps()). - 5-G per-dimension capture_method: experiential axes (driving_feel + fun_challenge = 45% of weight, frozen across all 25 cycles) use a human-authored, hash-verified, protected gameplay_metrics harness; missing → null + skill_gap, never a faked/silent-screenshot score. - 2-G dimension lock until bar: a successful leap below bar keeps the plateau on the SAME target (drive-to-bar, not detect-and-nudge+rotate). lock_target(); config.leap.lock_until_bar; tolerance band + working max-attempts HALT. - 5-G auto-queue remainder: a gate-passing leap still below bar queues a high-RICE remainder, triggered by the independent critic (not self-report). Rejected: raising bar to 0.80 yet, inner refine loop, LLM-coverage gate, multi_probe. verify.py 59 → 61. plugin 1.7→1.8, config schema 1.3→1.4. Co-Authored-By: Claude Opus 4.8 --- .claude-plugin/plugin.json | 2 +- .claude/ooda-evolution-v1.8.0.md | 100 ++++++++++++++++++++++ CHANGELOG.md | 36 ++++++++ config.example.json | 19 ++++- scripts/rubric_score.py | 37 ++++++++ skills/evolve/SKILL.md | 140 +++++++++++++++++++++++++------ tests/verify.py | 36 ++++++++ 7 files changed, 338 insertions(+), 32 deletions(-) create mode 100644 .claude/ooda-evolution-v1.8.0.md diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json index 4a3c5ac..da5041a 100644 --- a/.claude-plugin/plugin.json +++ b/.claude-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "ooda-loop", "displayName": "OODA-loop", - "version": "1.7.0", + "version": "1.8.0", "description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.", "author": { "name": "Taeil Ma", diff --git a/.claude/ooda-evolution-v1.8.0.md b/.claude/ooda-evolution-v1.8.0.md new file mode 100644 index 0000000..6e1d4ae --- /dev/null +++ b/.claude/ooda-evolution-v1.8.0.md @@ -0,0 +1,100 @@ +# OODA-loop v1.8.0 — driving quality to "good", not "passable" + +**Date:** 2026-06-19 +**Purpose reminder:** the f1-racing game is a **probe**, not the product. Its job +is to expose what OODA-loop still gets wrong. This release is about the loop. + +## The probe's verdict (why we believe the loop is still the bottleneck) + +v1.7.0 fixed *measurement* (artifact axis) and *anti-incrementalism* (leap +cycles). It worked — the loop went from a flat, lying A to a climbing, honest D. +But after **three** real leap cycles the owner's verdict is unchanged: *still too +crappy.* The data says why: + +| | artifact_quality | note | +|---|---|---| +| 22 feature cycles (old loop) | **0.394** (flat) | Goodhart collapse | +| + leap 1 (visual) | 0.447 | +0.053 | +| + leap 2 (track) | 0.472 | +0.025 | +| + leap 3 (car/cockpit) | 0.522 | +0.050 | +| bar (shippable) | 0.65 | not reached | +| "genuinely good" | ~0.80 | far off | + +Average **~+0.04 per leap**. At that rate "good" is ~7 leaps away. The loop +*climbs* but it climbs **too slowly and stops too low**. Three concrete failures +observed during the campaign point at the loop, not the game: + +1. **It settles for "+0.05 better."** A leap's success test is + `min_dimension_delta` (0.05) — so the moment a dimension nudges up, the loop + declares victory and moves the weighted-gap target elsewhere, abandoning a + dimension while it is still *bad* (visual went 0.22→0.41→0.59 across two + separate leaps with other work in between; it was never driven to the bar in + one sustained push). +2. **It accepts partial implementation of its own plan.** Leap 3's design panel + specced car/cockpit **and** materials/lighting **and** environment; only + car/cockpit shipped. The dropped materials/lighting (shadows, tone-mapping, + textures) was the *single most-repeated* critic complaint at every leap — and + it silently vanished. The act step under-delivers with no completion check. +3. **It can't see motion, feel, or fun.** The critic scores one still + screenshot, so `driving_feel` and `fun_challenge` are unmeasurable and + unimprovable — they sit at 0.38–0.51 and cap the weighted mean. + +## Root cause (from a 13-agent, adversarially-verified diagnosis) + +> The loop **detects** a quality gap and takes *a* step at it, but is not built to +> **close** it. It (1) can't see ~45% of its own rubric — `driving_feel` + `fun_challenge` +> were scored from a still screenshot and sat frozen across all 25 cycles; (2) +> abandons a dimension after a +0.05 nudge and rotates targets instead of driving +> one to the bar; (3) had a **silently broken thrashing guard** (read a nonexistent +> `leap_delta` field → never fired → could thrash forever); and (4) accepts partial +> implementation of its own leap plans, orphaning the rest. It optimizes to +> "better," never to "good." + +The devil's-advocate agent confirmed the leap *routing* is sound — the binding +constraint is **perception** (the critic can't measure feel/fun), not more leap +machinery. That reframed the fix. + +## The v1.8.0 upgrades (ranked by leverage) + +1. **Fix the thrashing guard (prerequisite, real bug).** 2-G now counts + `leap_attempts[].delta_score` on the `leap_target` (was: nonexistent `leap_delta` + on `weakest_dimension`). The HALT safety valve actually fires now — + `rubric_score.failed_leaps()` is the deterministic ground truth. +2. **Per-dimension `capture_method` (5-G).** Each rubric axis declares how its + evidence is captured; experiential axes use `gameplay_metrics` — a + HUMAN-AUTHORED, hash-verified, protected harness (same independence invariant + as the rubric hash). Missing/unverified harness → the axis scores `null` + (capture_failure) + a skill_gap, never a faked or screenshot-fallback score. + Unlocks the frozen 45% of rubric weight. +3. **Dimension lock until bar (2-G).** After a successful leap whose target is + still below `bar − eps`, keep the plateau active on the SAME target so 3-K + leaps it again next cycle — drive-to-bar, not detect-and-nudge. A tolerance + band + the (now-working) max-attempts HALT prevent infinite lock. + `rubric_score.lock_target()` is the deterministic ground truth; + `config.leap.lock_until_bar` (default true) toggles it. +4. **Auto-queue the remainder (5-G).** If a leap passes its delta gate but the + dimension is still below bar, the *independent critic's* score (not the + maker's self-report) queues a high-RICE remainder so dropped scope can't + vanish. Records `leap_dim_still_below_bar`. + +All four are deterministic where possible (`scripts/rubric_score.py`), config- +driven, and gaming-resistant. `tests/verify.py` 59 → **61**. + +## What we deliberately did NOT change (rejected / devil's-advocate-validated) + +- **Don't raise the bar to 0.80 yet.** It only re-points the weighted-gap target + at an unmeasurable dimension and (pre-fix) would HALT the loop. Raise only after + a `gameplay_metrics` leap proves feel/fun are responsive. Stage 0.65→0.75→0.80. +- **No inner multi-pass refine loop (yet).** The 3 completed leaps each landed + substantial single-pass deltas (+0.19/+0.27/+0.18); the cap was perception, not + pass-count. Adding passes on unmeasurable dimensions = cost with no signal. +- **No LLM-component-coverage gate.** Gameable (touch one file per "component" → + 100%). Change 4's critic-driven remainder is the robust substitute. +- **No `multi_probe` still-sequence.** A burst of stills still can't tell + responsive steering from sluggish; `gameplay_metrics` is the right instrument. + +## Validation (leap 4 under the upgraded loop) + + diff --git a/CHANGELOG.md b/CHANGELOG.md index a6dcb8a..1deb187 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,42 @@ independently. Bump there signals migration work for downstream projects. --- +## [v1.8.0] — 2026-06-19 + +### Changed — drive quality to "good", not "passable" (config schema 1.4.0) + +The F1 probe stayed crap after **three** v1.7.0 leaps (artifact 0.394 → 0.447 → +0.472 → 0.522, never reaching the 0.65 bar). A 13-agent adversarially-verified +diagnosis (`.claude/ooda-evolution-v1.8.0.md`) found the loop **detects** a +quality gap but isn't built to **close** it. Four fixes, ranked by leverage: + +- **Thrashing-guard bug fix (prerequisite).** evolve 2-G counted a nonexistent + `leap_delta` field on `weakest_dimension`, so the guard's `fails` count was + ALWAYS 0 and the HALT safety valve **never fired** — the loop could thrash a + dimension forever. Now counts `leap_attempts[].delta_score` on the actual + `leap_target` (`rubric_score.failed_leaps()`). +- **Per-dimension `capture_method` (5-G).** The critic scored every axis from one + screenshot, so `driving_feel` + `fun_challenge` (**45% of the rubric's weight**) + were frozen across all 25 cycles. Each axis now declares its capture; + experiential axes use `gameplay_metrics` — a human-authored, hash-verified, + protected harness (same independence invariant as the rubric hash). Missing/ + unverified → score `null` + skill_gap, never a faked or silent-screenshot score. +- **Dimension lock until bar (2-G).** A successful leap that left its target below + `bar − eps` now keeps the plateau active on the SAME target (drive-to-bar, not + detect-and-nudge + rotate). `rubric_score.lock_target()`; toggle with + `config.leap.lock_until_bar`. Tolerance band + the now-working max-attempts HALT + prevent infinite lock. +- **Auto-queue remainder (5-G).** A leap that passes its delta gate but leaves the + dimension below bar queues a high-RICE remainder, triggered by the *independent + critic's* score (not the maker's self-report), so dropped/partial scope can't be + silently orphaned (as leap 3's materials/lighting was). + +**Rejected** (kept the loop honest): raising the bar to 0.80 before feel/fun are +measurable; an inner multi-pass refine loop; an LLM-component-coverage gate +(gameable); `multi_probe` still-sequences. `tests/verify.py` 59 → **61**. + +--- + ## [v1.7.0] — 2026-06-17 ### Added — artifact-grounded evaluation + quantum-leap cycles (config schema 1.3.0) diff --git a/config.example.json b/config.example.json index 61e0956..56df709 100644 --- a/config.example.json +++ b/config.example.json @@ -1,5 +1,5 @@ { - "schema_version": "1.3.0", + "schema_version": "1.4.0", "project": { "name": "my-app", "locale": "en", @@ -270,16 +270,27 @@ "plateau_window": 4, "plateau_eps": 0.05, "locked": true, - "dimensions": [] + "__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.", + "dimensions": [], + "__example_dimension__": { + "name": "driving_feel", + "weight": 0.25, + "capture_method": "gameplay_metrics", + "gameplay_metrics_command": "node tools/feel_harness.mjs", + "gameplay_metrics_hash": "", + "metrics_fields": ["input_lag_ms", "physics_response_ms", "completion_rate"], + "description": "Responsive steering, weight transfer, distinct braking — judge against: input_lag_ms<40, physics_response_ms stable, completion_rate>0.7." + } }, "leap": { - "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend.", + "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge).", "max_lines": 1500, "min_dimension_delta": 0.05, "max_attempts_per_dimension": 2, "max_per_day": 2, "gap_weight": 30.0, - "cost_limit_usd": 0.5 + "cost_limit_usd": 0.5, + "lock_until_bar": true }, "goal_completion_idle": true } diff --git a/scripts/rubric_score.py b/scripts/rubric_score.py index a1fe391..f995652 100644 --- a/scripts/rubric_score.py +++ b/scripts/rubric_score.py @@ -214,6 +214,43 @@ def detect_plateau(outcomes: list, rubric: dict) -> dict: } +def failed_leaps(outcomes: list, dimension: str, min_delta: float) -> int: + """v1.8.0 thrashing-guard count (deterministic ground truth for evolve 2-G). + Counts leap cycles whose `leap_attempts` on `dimension` failed to clear + `min_delta`. The v1.7.x engine read a nonexistent field `leap_delta` and + matched the raw weakest_dimension, so this was ALWAYS 0 and the guard never + fired — the loop could thrash forever. The real ledger field is + `leap_attempts[].delta_score`, keyed on the dimension actually leapt.""" + n = 0 + for e in outcomes: + if e.get("cycle_mode") != "leap": + continue + for a in (e.get("leap_attempts") or []): + if a.get("dimension") == dimension and a.get("delta_score", 1.0) < min_delta: + n += 1 + return n + + +def lock_target(outcomes: list, rubric: dict, leap_target: str | None) -> str | None: + """v1.8.0 dimension-lock: after a SUCCESSFUL leap whose target is still below + (bar − eps), return that target so evolve 2-G keeps the plateau active on it + (drive-to-bar) instead of coasting through feature cycles and re-rotating. + Returns None when nothing should be locked (no leap last, regressed, or the + target is at/near bar — the tolerance band stops critic variance from locking + the loop forever; the max_attempts HALT is the genuine-stuck backstop).""" + if not leap_target or not outcomes: + return None + last = outcomes[-1] + if last.get("cycle_mode") != "leap" or last.get("result_type") == "leap_regressed": + return None + bar = rubric.get("bar", DEFAULT_BAR) + eps = float(rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS)) + score = (last.get("dimension_scores") or {}).get(leap_target) + if score is None: + return None + return leap_target if score < bar - eps else None + + def compute(project: Path) -> dict: ev = Path(project) / "agent" / "state" / "evolve" config = _load(Path(project) / "config.json", {}) or {} diff --git a/skills/evolve/SKILL.md b/skills/evolve/SKILL.md index 42017c8..e4097b6 100644 --- a/skills/evolve/SKILL.md +++ b/skills/evolve/SKILL.md @@ -727,18 +727,27 @@ for the implementation/build domain that declares config.domains[d].quality_rubr -- requires below-bar — slow upward drift that never reaches bar still counts as -- stuck, and "already good" (>= bar) is never a plateau. if p.plateau: - -- THRASHING GUARD: if a leap on this weakest_dimension has already failed to - -- clear config.leap.min_dimension_delta `config.leap.max_attempts_per_dimension` - -- times (default 2) — counted from outcomes[].leap_attempts — do NOT leap - -- again. Escalate to a halt for human help: - fails = count(e in outcomes where cycle_mode=="leap" - AND weakest_dimension == p.weakest_dimension - AND e.leap_delta < config.leap.min_dimension_delta) + -- THRASHING GUARD (v1.8.0 fix): if leaps on the TARGET dimension have already + -- failed to clear config.leap.min_dimension_delta + -- `config.leap.max_attempts_per_dimension` times (default 2), do NOT leap + -- again — escalate to a HALT for human help. + -- BUG FIX: the v1.7.x guard read a nonexistent field `e.leap_delta` and + -- matched `weakest_dimension`, so `fails` was ALWAYS 0 and the guard NEVER + -- fired — the loop could thrash forever on an unmeasurable dimension. The + -- real ledger field is `leap_attempts[].delta_score`, and the dimension that + -- was actually leapt is `leap_target` (the weighted-gap winner from 3-K), + -- which can differ from the raw `weakest_dimension`. + fails = count(e in outcomes where e.cycle_mode == "leap" + for a in (e.leap_attempts or []) + if a.dimension == p.leap_target + AND a.delta_score < config.leap.min_dimension_delta) if fails >= config.leap.max_attempts_per_dimension: - record skill_gap { name: "leap_stuck_{p.weakest_dimension}", type:"quality_gap", - detail:"Leap on {dim} failed {fails}× without clearing min delta" } - Create HALT: "Leap on '{p.weakest_dimension}' failed {fails}× — human review - needed; the loop cannot self-fix this dimension." + record skill_gap { name: "leap_stuck_{p.leap_target}", type:"quality_gap", + detail:"Leap on {p.leap_target} failed {fails}× without clearing min delta — + likely UNMEASURABLE by the current capture_method (see 5-G)." } + Create HALT: "Leap on '{p.leap_target}' failed {fails}× — human review needed: + supply a richer capture_method/metrics harness for this + dimension, or reweight the rubric. The loop cannot self-fix it." set plateau_leap_blocked = true else: set orient.plateau = { active:true, leap_target:p.leap_target, @@ -746,12 +755,37 @@ for the implementation/build domain that declares config.domains[d].quality_rubr artifact_score:p.latest, reason:p.reason } Print "[Orient] 📉 PLATEAU. Leap target '{p.leap_target}' ({p.reason}). Next cycle should LEAP, not add a feature." else: - set orient.plateau = { active:false } + -- DIMENSION LOCK until bar (v1.8.0): the v1.7.x loop cleared the plateau the + -- instant a leap produced any improvement (+min_delta), then coasted through + -- ~plateau_window−1 feature cycles before re-detecting — and re-targeted via a + -- recomputed weighted gap, so a partially-fixed dimension could LOSE its slot + -- before reaching bar. Result: the loop "detected and nudged" instead of + -- "detected and CLOSED" (the F1 probe: visual_fidelity took 0.22→0.41→0.59 + -- across non-consecutive leaps and never reached bar). Fix: if the last cycle + -- was a SUCCESSFUL leap whose target is still below bar, keep the plateau + -- active on the SAME target so 3-K leaps it again next cycle — drive it to bar. + last = outcomes.entries[-1] if outcomes.entries else null + bar = config.domains[d].quality_rubric.bar + if (config.leap.lock_until_bar (default true) + AND last is not null AND last.cycle_mode == "leap" + AND last.result_type != "leap_regressed" + AND p.leap_target is not null + AND last.dimension_scores[p.leap_target] < bar - rubric.plateau_eps): + -- tolerance band: within plateau_eps of bar counts as cleared, so critic + -- variance near threshold can't lock the loop forever (max_attempts HALT + -- is the backstop if it genuinely can't clear). + set orient.plateau = { active:true, leap_target:p.leap_target, + weakest_dimension:p.weakest_dimension, + artifact_score:p.latest, + reason:"'{p.leap_target}' improved to {last.dimension_scores[p.leap_target]} but still < bar {bar} — keep leaping" } + Print "[Orient] 🔒 LEAP succeeded but '{p.leap_target}' still below bar. Locking target — leap again, don't coast." + else: + set orient.plateau = { active:false } ``` -`orient.plateau` is consumed by Step 3-K. A plateau cleared by a successful leap -(artifact_score rose, or weakest_dimension changed) resets automatically — it is -derived from outcomes.json each cycle, not a persisted counter. +`orient.plateau` is consumed by Step 3-K. With **lock_until_bar** the loop drives +one dimension to its bar before moving on (constraint *exploitation*, not +rotation); set `config.leap.lock_until_bar=false` to restore v1.7.x rotation. --- @@ -1810,23 +1844,51 @@ if rubric is null OR this cycle produced no artifact (observe/futile/error/skip) -- the loop may not author its own grading standard"). On first run, record the -- hash. (config.*.quality_rubric SHOULD be listed in config.safety.protected_paths.) --- CAPTURE the actual artifact via rubric.capture_method: --- "screenshot" → serve + Playwright screenshot to a temp PNG (web UIs) --- "api_call" → run rubric.capture_command, capture stdout/response --- "benchmark" → run rubric.capture_command, capture metrics -capture = run rubric.capture_command (or the screenshot path) +-- CAPTURE per-DIMENSION (v1.8.0): the v1.7.x critic took ONE screenshot for every +-- axis, so experiential dimensions (driving_feel, fun_challenge — 45% of the F1 +-- rubric's weight) were unmeasurable from a still and FROZEN across every cycle — +-- the single biggest reason quality stalled (the loop was optimizing 55% of its +-- own rubric). Each dimension now declares how to capture the evidence it needs: +-- "screenshot" → serve + Playwright screenshot (visual axes; one shot reused for all) +-- "api_call" → run capture_command, capture stdout/response (library/API axes) +-- "benchmark" → run capture_command, capture metrics +-- "gameplay_metrics" → run a HUMAN-AUTHORED harness that exercises the artifact +-- (e.g. drive 10s, log input_lag_ms / physics_response_ms / +-- completion_rate) so feel/fun become measurable. +for dim in rubric.dimensions: + method = dim.capture_method or rubric.capture_method or "screenshot" + if method == "gameplay_metrics": + -- INDEPENDENCE GATE (same invariant as rubric_hash): the harness must be + -- human-authored and out of the loop's reach, or its score is worthless. + -- (1) dim.gameplay_metrics_command path MUST be in config.safety.protected_paths + -- (2) sha256(harness file) MUST equal dim.gameplay_metrics_hash + -- If either fails: dim_artifact[dim] = CAPTURE_FAILURE; record + -- skill_gap("gameplay_metrics_not_independent_{dim}"); do NOT fall back to a + -- screenshot silently (that would re-freeze the dimension). The dimension + -- scores null and the 2-G thrashing guard will eventually HALT for a human. + if not independent: dim_artifact[dim] = null (capture_failure); continue + dim_artifact[dim] = run_protected_harness(dim.gameplay_metrics_command) -- metrics JSON + else if method == "screenshot": + dim_artifact[dim] = the shared screenshot (captured once) + else: + dim_artifact[dim] = run dim.capture_command (or rubric.capture_command) +-- If a dimension's harness is MISSING entirely, score it null (capture_failure) + +-- skill_gap("gameplay_metrics_harness_missing_{dim}") — degrade gracefully, never +-- fake a score. The loop keeps improving the dimensions it CAN see. -- INDEPENDENT critique. Separate model context (config.eval.model or default). --- Give it ONLY: the rubric axes+descriptions, the captured artifact (image/output), +-- Give it ONLY: the rubric axes+descriptions, each dimension's captured evidence, -- and the mission. NOT the build cycle's own reasoning (that would let the maker --- grade itself). +-- grade itself). Dimensions whose capture failed are passed as null → scored null. verdict = critic( system: "You are an independent product critic. You did NOT build this. Score - the ACTUAL artifact against each rubric axis 0..1 from what you observe. - Cite concrete evidence per axis. Be harsh; default low when unsure. - A good HUD cannot compensate for a broken core. Output - {dimension_scores:{axis:score}, weakest_dimension, critique(<=30 words)}.", - input: { mission, rubric.dimensions, artifact: capture } + each rubric axis 0..1 from ITS captured evidence (image, metrics JSON, + or api output). Cite concrete evidence per axis. Be harsh; default low + when unsure. A good HUD cannot compensate for a broken core. For a + metrics axis, judge against the axis description's targets, not vibes. + Score null only if the evidence is null. Output + {dimension_scores:{axis:score|null}, weakest_dimension, critique(<=30 words)}.", + input: { mission, rubric.dimensions, evidence: dim_artifact } ) -> { dimension_scores, weakest_dimension, critique } -- AGGREGATE deterministically (scripts/rubric_score.py — pure, no model): @@ -1857,6 +1919,30 @@ trigger time on the *pre-change* artifact to anchor `leap_baseline_scores` targeted dimension rose by at least `config.leap.min_dimension_delta`. If it did not, the leap is reverted (4-C2) and recorded `leap_regressed` (quality 0.0). +**Remainder auto-queue (v1.8.0).** A leap can PASS the gate (delta ≥ min_delta) +yet leave its dimension still below bar — and a multi-component plan can ship only +part of itself (the F1 probe: leap 3 shipped car/cockpit but the designed +materials/lighting/shadows — the #1 recurring critic complaint — silently +vanished). After the post-leap critique, if the targeted dimension is still below +bar, record it and queue the remainder so it is never orphaned: +``` +if cycle_mode == "leap": + post = dimension_scores[orient.plateau.leap_target] + bar = config.domains[d].quality_rubric.bar + outcomes.entries[-1].leap_dim_still_below_bar = (post < bar) -- crisp 2-G signal + if post < bar: + action_queue.pending.append({ + id: next, title: "quality remainder: '{leap_target}' at {post} still < bar {bar}", + source_domain: build_domain, rice_score: 90, status: "pending", + origin: "leap_remainder_auto" }) + Print "[5-G] Remainder queued: '{leap_target}' needs +{bar−post} more." +``` +The trigger is the *independent critic's* score, not the build cycle's +self-reported "plan complete" — so the maker cannot suppress it (the +LLM-component-coverage gate was rejected as gameable). With **dimension lock** +(2-G) active this is usually redundant for consecutive leaps, but it is the +backstop when the lock releases near bar or `lock_until_bar=false`. + --- ## Step 6: State Update + Commit diff --git a/tests/verify.py b/tests/verify.py index ae33e59..666096c 100644 --- a/tests/verify.py +++ b/tests/verify.py @@ -683,6 +683,42 @@ def _mod(name, fn): f"weighted-gap target={target} (lowest raw was {lowest_raw})", ) + # 9) v1.8.0 thrashing-guard bug fix: failed_leaps counts leap_attempts[].delta_score + # on the leap_target. The v1.7.x engine read a nonexistent `leap_delta` → always 0. + oc_fail = [ + {"cycle_mode": "leap", "leap_attempts": [{"dimension": "fun_challenge", "delta_score": 0.01}]}, + {"cycle_mode": "feature", "leap_attempts": []}, + {"cycle_mode": "leap", "leap_attempts": [{"dimension": "fun_challenge", "delta_score": 0.02}]}, + {"cycle_mode": "leap", "leap_attempts": [{"dimension": "visual_fidelity", "delta_score": 0.18}]}, + ] + fails = R.failed_leaps(oc_fail, "fun_challenge", 0.05) + nofake = R.failed_leaps(oc_fail, "visual_fidelity", 0.05) # 0.18 cleared → 0 fails + r.check( + "artifact-axis: thrashing guard counts real leap_attempts deltas (v1.8.0 bug fix)", + fails == 2 and nofake == 0, + f"fun_challenge fails={fails} (want 2), visual fails={nofake} (want 0)", + ) + + # 10) v1.8.0 dimension lock: stay on a below-bar target after a successful leap; + # release once at/near bar (tolerance band) so critic variance can't lock forever. + rubL = R.rubric_of({"quality_rubric": {"bar": 0.65, "plateau_eps": 0.05, + "dimensions": [{"name": "visual_fidelity", "weight": 1}]}}) + below = [{"cycle_mode": "leap", "result_type": "pr_created", + "dimension_scores": {"visual_fidelity": 0.45}}] + nearbar = [{"cycle_mode": "leap", "result_type": "pr_created", + "dimension_scores": {"visual_fidelity": 0.62}}] # within eps of 0.65 → release + regressed = [{"cycle_mode": "leap", "result_type": "leap_regressed", + "dimension_scores": {"visual_fidelity": 0.45}}] + r.check( + "artifact-axis: dimension lock holds below-bar, releases near bar / on regress (v1.8.0)", + R.lock_target(below, rubL, "visual_fidelity") == "visual_fidelity" + and R.lock_target(nearbar, rubL, "visual_fidelity") is None + and R.lock_target(regressed, rubL, "visual_fidelity") is None, + f"below={R.lock_target(below, rubL, 'visual_fidelity')}, " + f"nearbar={R.lock_target(nearbar, rubL, 'visual_fidelity')}, " + f"regressed={R.lock_target(regressed, rubL, 'visual_fidelity')}", + ) + def main() -> int: r = Runner() From 47c958ae333b732abfd0d80e58493a327072bd4a Mon Sep 17 00:00:00 2001 From: Taeil Ma Date: Fri, 19 Jun 2026 15:05:28 +0900 Subject: [PATCH 2/2] =?UTF-8?q?docs(v1.8.0):=20fill=20validation=20?= =?UTF-8?q?=E2=80=94=20leap=204=20hit=20the=20screenshot=20ceiling;=20bott?= =?UTF-8?q?leneck=20moved=20to=20perception?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.8 --- .claude/ooda-evolution-v1.8.0.md | 26 +++++++++++++++++++++++--- 1 file changed, 23 insertions(+), 3 deletions(-) diff --git a/.claude/ooda-evolution-v1.8.0.md b/.claude/ooda-evolution-v1.8.0.md index 6e1d4ae..6a6fe10 100644 --- a/.claude/ooda-evolution-v1.8.0.md +++ b/.claude/ooda-evolution-v1.8.0.md @@ -95,6 +95,26 @@ driven, and gaming-resistant. `tests/verify.py` 59 → **61**. ## Validation (leap 4 under the upgraded loop) - +Ran a real leap under v1.8.0, targeting `visual_fidelity` (dimension lock kept it +as the target). It shipped exactly the materials/lighting that leap 3 dropped — +ACES filmic tone mapping, soft directional shadows (cars/buildings/kerbs cast, +road/ground receive), and baked contact shadows under every car. + +- Independent re-critique: **visual_fidelity 0.59 → 0.63** (Δ +0.04). Building + shadows and tonal range are now visible; objects are grounded. +- Artifact trajectory across all leaps: **0.394 → 0.447 → 0.472 → 0.522 → 0.533**. + +**The decisive finding (the whole point — f1 is the probe):** +0.04 is *at the +gate boundary*, and the critic still under-credits the change — because +`visual_fidelity` has hit the **ceiling of what a still screenshot can measure +(~0.63)**. The loop's next weighted-gap target is `fun_challenge` (0.38), which a +screenshot *cannot* judge at all. So the upgraded loop's correct next move is +exactly what v1.8.0 built: with the **fixed** thrashing guard, two failed leaps on +`fun_challenge` → **HALT requesting a human-authored `gameplay_metrics` harness**, +instead of thrashing forever (the v1.7.x bug) or faking a score. + +The bottleneck has *moved*: from "the loop can't close gaps" (v1.7.0) to "the loop +can't **perceive** the experiential half of quality" (now). v1.8.0's per-dimension +`capture_method` is the mechanism to close it — but it requires a human to author +the measurement harness, because the loop is forbidden (by design) from grading +its own standard. **That hand-off is the next experiment.**