Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .claude-plugin/plugin.json
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"name": "ooda-loop",
"displayName": "OODA-loop",
"version": "1.8.0",
"version": "1.8.1",
"description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
"author": {
"name": "Taeil Ma",
Expand Down
24 changes: 24 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,30 @@ independently. Bump there signals migration work for downstream projects.

---

## [v1.8.1] — 2026-06-19

### Validated + guidance — the gameplay_metrics path works end-to-end

The f1 probe exercised v1.8.0's per-dimension capture: it authored a
`gameplay_metrics` harness (drives the real pure physics headlessly) for the two
**frozen** experiential axes (`driving_feel`, `fun_challenge` = 45% of the rubric
a screenshot can't judge).

- **Honest measurement first dropped artifact_quality 0.533 → 0.490** — the
screenshot had been *over*-scoring feel/fun (0.51/0.38 → measured 0.41/0.29).
Confirms the v1.8.0 thesis: measurement was the bottleneck *and* inflating.
- Two leaps the unlock enabled: `fun_challenge` 0.29 → **0.81** (distinct AI
racing lines + tamed DRS slingshot) and `driving_feel` 0.41 → **0.78** (steering
inertia + power oversteer + weight transfer). **artifact_quality crossed the bar
for the first time (0.687 ≥ 0.65) → an HONEST grade A** (the loop's original A
was a lie; this one is earned).
- **Guidance (config doc):** a `gameplay_metrics` harness must MEASURE BEHAVIOUR
(drive the artifact, read the numbers), never assert an implementation fact — the
probe's first harness hardcoded a feature flag and couldn't credit the fix,
which would trigger a spurious thrashing-HALT. Rewritten to measure behaviour.

---

## [v1.8.0] — 2026-06-19

### Changed — drive quality to "good", not "passable" (config schema 1.4.0)
Expand Down
2 changes: 1 addition & 1 deletion config.example.json
Original file line number Diff line number Diff line change
Expand Up @@ -270,7 +270,7 @@
"plateau_window": 4,
"plateau_eps": 0.05,
"locked": true,
"__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.",
"__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality. v1.8.1 rule (validated by the f1 probe): the harness must MEASURE BEHAVIOUR (e.g. drive the real physics and read the resulting numbers), NOT assert an implementation fact — a hardcoded flag like {feature: false} cannot credit a real fix, so it would trigger a spurious thrashing-HALT. Drive the artifact and report what it actually does.",
"dimensions": [],
"__example_dimension__": {
"name": "driving_feel",
Expand Down
Loading