Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .claude-plugin/plugin.json
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"name": "ooda-loop",
"displayName": "OODA-loop",
"version": "1.7.0",
"version": "1.8.0",
"description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
"author": {
"name": "Taeil Ma",
Expand Down
120 changes: 120 additions & 0 deletions .claude/ooda-evolution-v1.8.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# OODA-loop v1.8.0 — driving quality to "good", not "passable"

**Date:** 2026-06-19
**Purpose reminder:** the f1-racing game is a **probe**, not the product. Its job
is to expose what OODA-loop still gets wrong. This release is about the loop.

## The probe's verdict (why we believe the loop is still the bottleneck)

v1.7.0 fixed *measurement* (artifact axis) and *anti-incrementalism* (leap
cycles). It worked — the loop went from a flat, lying A to a climbing, honest D.
But after **three** real leap cycles the owner's verdict is unchanged: *still too
crappy.* The data says why:

| | artifact_quality | note |
|---|---|---|
| 22 feature cycles (old loop) | **0.394** (flat) | Goodhart collapse |
| + leap 1 (visual) | 0.447 | +0.053 |
| + leap 2 (track) | 0.472 | +0.025 |
| + leap 3 (car/cockpit) | 0.522 | +0.050 |
| bar (shippable) | 0.65 | not reached |
| "genuinely good" | ~0.80 | far off |

Average **~+0.04 per leap**. At that rate "good" is ~7 leaps away. The loop
*climbs* but it climbs **too slowly and stops too low**. Three concrete failures
observed during the campaign point at the loop, not the game:

1. **It settles for "+0.05 better."** A leap's success test is
`min_dimension_delta` (0.05) — so the moment a dimension nudges up, the loop
declares victory and moves the weighted-gap target elsewhere, abandoning a
dimension while it is still *bad* (visual went 0.22→0.41→0.59 across two
separate leaps with other work in between; it was never driven to the bar in
one sustained push).
2. **It accepts partial implementation of its own plan.** Leap 3's design panel
specced car/cockpit **and** materials/lighting **and** environment; only
car/cockpit shipped. The dropped materials/lighting (shadows, tone-mapping,
textures) was the *single most-repeated* critic complaint at every leap — and
it silently vanished. The act step under-delivers with no completion check.
3. **It can't see motion, feel, or fun.** The critic scores one still
screenshot, so `driving_feel` and `fun_challenge` are unmeasurable and
unimprovable — they sit at 0.38–0.51 and cap the weighted mean.

## Root cause (from a 13-agent, adversarially-verified diagnosis)

> The loop **detects** a quality gap and takes *a* step at it, but is not built to
> **close** it. It (1) can't see ~45% of its own rubric — `driving_feel` + `fun_challenge`
> were scored from a still screenshot and sat frozen across all 25 cycles; (2)
> abandons a dimension after a +0.05 nudge and rotates targets instead of driving
> one to the bar; (3) had a **silently broken thrashing guard** (read a nonexistent
> `leap_delta` field → never fired → could thrash forever); and (4) accepts partial
> implementation of its own leap plans, orphaning the rest. It optimizes to
> "better," never to "good."

The devil's-advocate agent confirmed the leap *routing* is sound — the binding
constraint is **perception** (the critic can't measure feel/fun), not more leap
machinery. That reframed the fix.

## The v1.8.0 upgrades (ranked by leverage)

1. **Fix the thrashing guard (prerequisite, real bug).** 2-G now counts
`leap_attempts[].delta_score` on the `leap_target` (was: nonexistent `leap_delta`
on `weakest_dimension`). The HALT safety valve actually fires now —
`rubric_score.failed_leaps()` is the deterministic ground truth.
2. **Per-dimension `capture_method` (5-G).** Each rubric axis declares how its
evidence is captured; experiential axes use `gameplay_metrics` — a
HUMAN-AUTHORED, hash-verified, protected harness (same independence invariant
as the rubric hash). Missing/unverified harness → the axis scores `null`
(capture_failure) + a skill_gap, never a faked or screenshot-fallback score.
Unlocks the frozen 45% of rubric weight.
3. **Dimension lock until bar (2-G).** After a successful leap whose target is
still below `bar − eps`, keep the plateau active on the SAME target so 3-K
leaps it again next cycle — drive-to-bar, not detect-and-nudge. A tolerance
band + the (now-working) max-attempts HALT prevent infinite lock.
`rubric_score.lock_target()` is the deterministic ground truth;
`config.leap.lock_until_bar` (default true) toggles it.
4. **Auto-queue the remainder (5-G).** If a leap passes its delta gate but the
dimension is still below bar, the *independent critic's* score (not the
maker's self-report) queues a high-RICE remainder so dropped scope can't
vanish. Records `leap_dim_still_below_bar`.

All four are deterministic where possible (`scripts/rubric_score.py`), config-
driven, and gaming-resistant. `tests/verify.py` 59 → **61**.

## What we deliberately did NOT change (rejected / devil's-advocate-validated)

- **Don't raise the bar to 0.80 yet.** It only re-points the weighted-gap target
at an unmeasurable dimension and (pre-fix) would HALT the loop. Raise only after
a `gameplay_metrics` leap proves feel/fun are responsive. Stage 0.65→0.75→0.80.
- **No inner multi-pass refine loop (yet).** The 3 completed leaps each landed
substantial single-pass deltas (+0.19/+0.27/+0.18); the cap was perception, not
pass-count. Adding passes on unmeasurable dimensions = cost with no signal.
- **No LLM-component-coverage gate.** Gameable (touch one file per "component" →
100%). Change 4's critic-driven remainder is the robust substitute.
- **No `multi_probe` still-sequence.** A burst of stills still can't tell
responsive steering from sluggish; `gameplay_metrics` is the right instrument.

## Validation (leap 4 under the upgraded loop)

Ran a real leap under v1.8.0, targeting `visual_fidelity` (dimension lock kept it
as the target). It shipped exactly the materials/lighting that leap 3 dropped —
ACES filmic tone mapping, soft directional shadows (cars/buildings/kerbs cast,
road/ground receive), and baked contact shadows under every car.

- Independent re-critique: **visual_fidelity 0.59 → 0.63** (Δ +0.04). Building
shadows and tonal range are now visible; objects are grounded.
- Artifact trajectory across all leaps: **0.394 → 0.447 → 0.472 → 0.522 → 0.533**.

**The decisive finding (the whole point — f1 is the probe):** +0.04 is *at the
gate boundary*, and the critic still under-credits the change — because
`visual_fidelity` has hit the **ceiling of what a still screenshot can measure
(~0.63)**. The loop's next weighted-gap target is `fun_challenge` (0.38), which a
screenshot *cannot* judge at all. So the upgraded loop's correct next move is
exactly what v1.8.0 built: with the **fixed** thrashing guard, two failed leaps on
`fun_challenge` → **HALT requesting a human-authored `gameplay_metrics` harness**,
instead of thrashing forever (the v1.7.x bug) or faking a score.

The bottleneck has *moved*: from "the loop can't close gaps" (v1.7.0) to "the loop
can't **perceive** the experiential half of quality" (now). v1.8.0's per-dimension
`capture_method` is the mechanism to close it — but it requires a human to author
the measurement harness, because the loop is forbidden (by design) from grading
its own standard. **That hand-off is the next experiment.**
36 changes: 36 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,42 @@ independently. Bump there signals migration work for downstream projects.

---

## [v1.8.0] — 2026-06-19

### Changed — drive quality to "good", not "passable" (config schema 1.4.0)

The F1 probe stayed crap after **three** v1.7.0 leaps (artifact 0.394 → 0.447 →
0.472 → 0.522, never reaching the 0.65 bar). A 13-agent adversarially-verified
diagnosis (`.claude/ooda-evolution-v1.8.0.md`) found the loop **detects** a
quality gap but isn't built to **close** it. Four fixes, ranked by leverage:

- **Thrashing-guard bug fix (prerequisite).** evolve 2-G counted a nonexistent
`leap_delta` field on `weakest_dimension`, so the guard's `fails` count was
ALWAYS 0 and the HALT safety valve **never fired** — the loop could thrash a
dimension forever. Now counts `leap_attempts[].delta_score` on the actual
`leap_target` (`rubric_score.failed_leaps()`).
- **Per-dimension `capture_method` (5-G).** The critic scored every axis from one
screenshot, so `driving_feel` + `fun_challenge` (**45% of the rubric's weight**)
were frozen across all 25 cycles. Each axis now declares its capture;
experiential axes use `gameplay_metrics` — a human-authored, hash-verified,
protected harness (same independence invariant as the rubric hash). Missing/
unverified → score `null` + skill_gap, never a faked or silent-screenshot score.
- **Dimension lock until bar (2-G).** A successful leap that left its target below
`bar − eps` now keeps the plateau active on the SAME target (drive-to-bar, not
detect-and-nudge + rotate). `rubric_score.lock_target()`; toggle with
`config.leap.lock_until_bar`. Tolerance band + the now-working max-attempts HALT
prevent infinite lock.
- **Auto-queue remainder (5-G).** A leap that passes its delta gate but leaves the
dimension below bar queues a high-RICE remainder, triggered by the *independent
critic's* score (not the maker's self-report), so dropped/partial scope can't be
silently orphaned (as leap 3's materials/lighting was).

**Rejected** (kept the loop honest): raising the bar to 0.80 before feel/fun are
measurable; an inner multi-pass refine loop; an LLM-component-coverage gate
(gameable); `multi_probe` still-sequences. `tests/verify.py` 59 → **61**.

---

## [v1.7.0] — 2026-06-17

### Added — artifact-grounded evaluation + quantum-leap cycles (config schema 1.3.0)
Expand Down
19 changes: 15 additions & 4 deletions config.example.json
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
{
"schema_version": "1.3.0",
"schema_version": "1.4.0",
"project": {
"name": "my-app",
"locale": "en",
Expand Down Expand Up @@ -270,16 +270,27 @@
"plateau_window": 4,
"plateau_eps": 0.05,
"locked": true,
"dimensions": []
"__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.",
"dimensions": [],
"__example_dimension__": {
"name": "driving_feel",
"weight": 0.25,
"capture_method": "gameplay_metrics",
"gameplay_metrics_command": "node tools/feel_harness.mjs",
"gameplay_metrics_hash": "<sha256 of the harness file>",
"metrics_fields": ["input_lag_ms", "physics_response_ms", "completion_rate"],
"description": "Responsive steering, weight transfer, distinct braking — judge against: input_lag_ms<40, physics_response_ms stable, completion_rate>0.7."
}
},
"leap": {
"__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend.",
"__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge).",
"max_lines": 1500,
"min_dimension_delta": 0.05,
"max_attempts_per_dimension": 2,
"max_per_day": 2,
"gap_weight": 30.0,
"cost_limit_usd": 0.5
"cost_limit_usd": 0.5,
"lock_until_bar": true
},
"goal_completion_idle": true
}
37 changes: 37 additions & 0 deletions scripts/rubric_score.py
Original file line number Diff line number Diff line change
Expand Up @@ -214,6 +214,43 @@ def detect_plateau(outcomes: list, rubric: dict) -> dict:
}


def failed_leaps(outcomes: list, dimension: str, min_delta: float) -> int:
"""v1.8.0 thrashing-guard count (deterministic ground truth for evolve 2-G).
Counts leap cycles whose `leap_attempts` on `dimension` failed to clear
`min_delta`. The v1.7.x engine read a nonexistent field `leap_delta` and
matched the raw weakest_dimension, so this was ALWAYS 0 and the guard never
fired — the loop could thrash forever. The real ledger field is
`leap_attempts[].delta_score`, keyed on the dimension actually leapt."""
n = 0
for e in outcomes:
if e.get("cycle_mode") != "leap":
continue
for a in (e.get("leap_attempts") or []):
if a.get("dimension") == dimension and a.get("delta_score", 1.0) < min_delta:
n += 1
return n


def lock_target(outcomes: list, rubric: dict, leap_target: str | None) -> str | None:
"""v1.8.0 dimension-lock: after a SUCCESSFUL leap whose target is still below
(bar − eps), return that target so evolve 2-G keeps the plateau active on it
(drive-to-bar) instead of coasting through feature cycles and re-rotating.
Returns None when nothing should be locked (no leap last, regressed, or the
target is at/near bar — the tolerance band stops critic variance from locking
the loop forever; the max_attempts HALT is the genuine-stuck backstop)."""
if not leap_target or not outcomes:
return None
last = outcomes[-1]
if last.get("cycle_mode") != "leap" or last.get("result_type") == "leap_regressed":
return None
bar = rubric.get("bar", DEFAULT_BAR)
eps = float(rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS))
score = (last.get("dimension_scores") or {}).get(leap_target)
if score is None:
return None
return leap_target if score < bar - eps else None


def compute(project: Path) -> dict:
ev = Path(project) / "agent" / "state" / "evolve"
config = _load(Path(project) / "config.json", {}) or {}
Expand Down
Loading
Loading