From 9a6426334de4907c0136f4bfc993ef56e663782c Mon Sep 17 00:00:00 2001 From: Taeil Ma Date: Fri, 19 Jun 2026 20:06:40 +0900 Subject: [PATCH] docs(v1.8.1): validate gameplay_metrics end-to-end + harness-behaviour guidance MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The f1 probe exercised v1.8.0's per-dimension capture on the frozen experiential axes. Honest metric measurement first dropped artifact 0.533→0.490 (screenshot had over-scored feel/fun), then two unlocked leaps (fun_challenge 0.29→0.81, driving_feel 0.41→0.78) pushed artifact_quality past the bar (0.687) for the first time — an HONEST grade A. Guidance added to config.example.json: a gameplay_metrics harness must MEASURE BEHAVIOUR, not assert an implementation fact (a hardcoded flag can't credit a fix and would trigger a spurious thrashing-HALT). plugin 1.8.0→1.8.1. Co-Authored-By: Claude Opus 4.8 --- .claude-plugin/plugin.json | 2 +- CHANGELOG.md | 24 ++++++++++++++++++++++++ config.example.json | 2 +- 3 files changed, 26 insertions(+), 2 deletions(-) diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json index da5041a..6671b4c 100644 --- a/.claude-plugin/plugin.json +++ b/.claude-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "ooda-loop", "displayName": "OODA-loop", - "version": "1.8.0", + "version": "1.8.1", "description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.", "author": { "name": "Taeil Ma", diff --git a/CHANGELOG.md b/CHANGELOG.md index 1deb187..ccc7bda 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,30 @@ independently. Bump there signals migration work for downstream projects. --- +## [v1.8.1] — 2026-06-19 + +### Validated + guidance — the gameplay_metrics path works end-to-end + +The f1 probe exercised v1.8.0's per-dimension capture: it authored a +`gameplay_metrics` harness (drives the real pure physics headlessly) for the two +**frozen** experiential axes (`driving_feel`, `fun_challenge` = 45% of the rubric +a screenshot can't judge). + +- **Honest measurement first dropped artifact_quality 0.533 → 0.490** — the + screenshot had been *over*-scoring feel/fun (0.51/0.38 → measured 0.41/0.29). + Confirms the v1.8.0 thesis: measurement was the bottleneck *and* inflating. +- Two leaps the unlock enabled: `fun_challenge` 0.29 → **0.81** (distinct AI + racing lines + tamed DRS slingshot) and `driving_feel` 0.41 → **0.78** (steering + inertia + power oversteer + weight transfer). **artifact_quality crossed the bar + for the first time (0.687 ≥ 0.65) → an HONEST grade A** (the loop's original A + was a lie; this one is earned). +- **Guidance (config doc):** a `gameplay_metrics` harness must MEASURE BEHAVIOUR + (drive the artifact, read the numbers), never assert an implementation fact — the + probe's first harness hardcoded a feature flag and couldn't credit the fix, + which would trigger a spurious thrashing-HALT. Rewritten to measure behaviour. + +--- + ## [v1.8.0] — 2026-06-19 ### Changed — drive quality to "good", not "passable" (config schema 1.4.0) diff --git a/config.example.json b/config.example.json index 56df709..995f7b0 100644 --- a/config.example.json +++ b/config.example.json @@ -270,7 +270,7 @@ "plateau_window": 4, "plateau_eps": 0.05, "locked": true, - "__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.", + "__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality. v1.8.1 rule (validated by the f1 probe): the harness must MEASURE BEHAVIOUR (e.g. drive the real physics and read the resulting numbers), NOT assert an implementation fact — a hardcoded flag like {feature: false} cannot credit a real fix, so it would trigger a spurious thrashing-HALT. Drive the artifact and report what it actually does.", "dimensions": [], "__example_dimension__": { "name": "driving_feel",