feat(v1.8.0): drive quality to good - fix thrashing guard, per-dimension capture, dimension-lock, remainder #72
Merged
…nsion capture, dimension-lock, remainder

The F1 probe stayed crap after 3 v1.7.0 leaps (artifact 0.394→0.447→0.472→0.522, never reaching bar 0.65). A 13-agent adversarially-verified diagnosis found the loop DETECTS a quality gap but isn't built to CLOSE it.

- 2-G thrashing-guard BUG FIX: counted a nonexistent `leap_delta` on weakest_dimension → fails was ALWAYS 0, HALT never fired. Now counts leap_attempts[].delta_score on leap_target (rubric_score.failed_leaps()).
- 5-G per-dimension capture_method: experiential axes (driving_feel + fun_challenge = 45% of weight, frozen across all 25 cycles) use a human-authored, hash-verified, protected gameplay_metrics harness; missing → null + skill_gap, never a faked/silent-screenshot score.
- 2-G dimension lock until bar: a successful leap below bar keeps the plateau on the SAME target (drive-to-bar, not detect-and-nudge+rotate). lock_target(); config.leap.lock_until_bar; tolerance band + working max-attempts HALT.
- 5-G auto-queue remainder: a gate-passing leap still below bar queues a high-RICE remainder, triggered by the independent critic (not self-report).

Rejected: raising bar to 0.80 yet, inner refine loop, LLM-coverage gate, multi_probe.

verify.py 59 → 61. plugin 1.7→1.8, config schema 1.3→1.4.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
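As an illustration of the "dimension lock until bar" item in this commit message, here is a minimal Python sketch of drive-to-bar target selection. Only the names `lock_until_bar`, the 0.65 bar, and the idea of a tolerance band plus a max-attempts HALT come from the commit message; the `LeapConfig` class, the `next_action` function, the default `tolerance` and `max_attempts` values, and the way the tolerance band is applied are all assumptions, not the plugin's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class LeapConfig:
    """Hypothetical stand-in for config.leap; field defaults are assumed."""
    bar: float = 0.65            # quality bar quoted in the commit message
    lock_until_bar: bool = True  # name taken from config.leap.lock_until_bar
    tolerance: float = 0.02      # assumed width of the tolerance band
    max_attempts: int = 3        # assumed failed-leap limit before HALT


def next_action(
    scores: Dict[str, float],
    locked: Optional[str],
    failed_leaps: int,
    cfg: LeapConfig,
) -> Tuple[str, Optional[str]]:
    """Return (action, target), where action is 'leap', 'halt', or 'done'."""
    threshold = cfg.bar - cfg.tolerance

    if locked is not None and cfg.lock_until_bar:
        if scores[locked] < threshold:
            # Still below bar: stay on the SAME target instead of rotating,
            # unless the failed-attempt budget is exhausted.
            if failed_leaps >= cfg.max_attempts:
                return ("halt", locked)
            return ("leap", locked)
        # Locked dimension reached the bar (within tolerance): release it.

    # No active lock: pick the weakest dimension that is still below bar.
    below = {dim: s for dim, s in scores.items() if s < threshold}
    if not below:
        return ("done", None)
    return ("leap", min(below, key=below.get))


cfg = LeapConfig()
scores = {"visual_fidelity": 0.59, "driving_feel": 0.40}

# A leap improved visual_fidelity but it is still below bar: the lock holds,
# even though driving_feel is the weaker dimension.
print(next_action(scores, "visual_fidelity", 0, cfg))
# Too many failed leaps on the locked target: HALT instead of thrashing.
print(next_action(scores, "visual_fidelity", 3, cfg))
```

Without the lock, the same scores would rotate the loop to `driving_feel`, which is the detect-and-nudge+rotate behaviour the commit message says it replaces.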
…ttleneck moved to perception

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The F1 probe stayed crap after three v1.7.0 leaps (artifact 0.394 → 0.447 → 0.472 → 0.522, never reaching bar 0.65). A 13-agent adversarially-verified diagnosis (`.claude/ooda-evolution-v1.8.0.md`) found the loop detects a quality gap but isn't built to close it.

Root cause

- `driving_feel` + `fun_challenge` were scored from a still screenshot (unmeasurable) and sat unchanged across all 25 cycles.
- The thrashing guard counted a nonexistent `leap_delta` field → `fails` always 0 → HALT never fired → could thrash forever.

The devil's-advocate agent confirmed the leap routing is fine: the binding constraint is perception, not more leap machinery.
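The guard bug described above can be sketched in a few lines of Python. The field names `weakest_dimension`, `leap_delta`, `leap_attempts`, `delta_score`, and `leap_target` come from this PR; the record layout, the per-attempt `target` key, the `min_gain` threshold, `max_attempts`, and both function bodies are assumptions for illustration and are not the real `rubric_score` implementation.

```python
from typing import Any, Dict

# Hypothetical cycle record; the real schema may differ.
record: Dict[str, Any] = {
    "weakest_dimension": {"name": "fun_challenge", "score": 0.40},
    "leap_target": "fun_challenge",
    "leap_attempts": [
        {"target": "fun_challenge", "delta_score": 0.00},
        {"target": "fun_challenge", "delta_score": 0.00},
        {"target": "visual_fidelity", "delta_score": 0.04},
        {"target": "fun_challenge", "delta_score": -0.01},
    ],
}


def failed_leaps_buggy(rec: Dict[str, Any], min_gain: float = 0.01) -> int:
    """Pre-fix behaviour (assumed shape): reads a field nothing ever writes."""
    delta = rec["weakest_dimension"].get("leap_delta")  # always None
    return 0 if delta is None else int(delta < min_gain)


def failed_leaps_fixed(rec: Dict[str, Any], min_gain: float = 0.01) -> int:
    """Post-fix behaviour (assumed shape): count low-gain attempts on the target."""
    target = rec["leap_target"]
    return sum(
        1
        for attempt in rec["leap_attempts"]
        if attempt["target"] == target and attempt["delta_score"] < min_gain
    )


def should_halt(fails: int, max_attempts: int = 3) -> bool:
    """HALT once the failed-leap count reaches the (assumed) limit."""
    return fails >= max_attempts


print(failed_leaps_buggy(record), should_halt(failed_leaps_buggy(record)))
print(failed_leaps_fixed(record), should_halt(failed_leaps_fixed(record)))
```

With the buggy counter the fail count stays at 0 no matter how many attempts stall, so the HALT condition can never become true; the fixed counter sees three stalled attempts on the target and trips the guard.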
The 4 fixes (ranked)
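1. Thrashing-guard bug fix (2-G): the guard now counts `leap_attempts[].delta_score` on `leap_target` (`rubric_score.failed_leaps()`).
2. Per-dimension `capture_method` (5-G): experiential axes use a human-authored, hash-verified, protected `gameplay_metrics` harness; missing → `null` + skill_gap, never faked.
3. Dimension lock until bar (2-G): a successful leap that is still below bar keeps the plateau on the same target (`rubric_score.lock_target()`, `config.leap.lock_until_bar`).
4. Auto-queue remainder (5-G): a gate-passing leap that is still below bar queues a high-RICE remainder, triggered by the independent critic rather than by self-report.

Rejected (devil's-advocate-validated): raising the bar to 0.80 at this stage; an inner refine loop; an LLM-component-coverage gate; `multi_probe`.

Validation (leap 4, separate game PR)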
Ran a materials/lighting leap under v1.8.0: visual_fidelity 0.59 → 0.63, shipping the previously-dropped shadows/tone-mapping. It hit the screenshot-critique ceiling (~0.63), which indicates the bottleneck has moved to perception. The loop's next target (`fun_challenge`) is unmeasurable by screenshot, so the fixed guard now HALTs and requests a human metrics harness instead of thrashing.

`tests/verify.py` 59 → 61. plugin 1.7→1.8, config schema 1.3→1.4.

🤖 Generated with Claude Code
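The "missing → `null` + skill_gap, never faked" rule for experiential axes can be sketched as follows. This is a hedged illustration only: the `DimensionScore` class, the `score_experiential` function, its parameters, and the choice to treat a hash mismatch the same as a missing harness are assumptions; the PR states only that the `gameplay_metrics` harness is human-authored, hash-verified, and protected.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DimensionScore:
    """Hypothetical result type: score is None when nothing could be measured."""
    name: str
    score: Optional[float]
    skill_gap: Optional[str] = None


def score_experiential(
    name: str,
    harness_source: Optional[bytes],
    expected_sha256: str,
    measure: Callable[[], float],
) -> DimensionScore:
    """Score an experiential axis only through a verified metrics harness."""
    if harness_source is None:
        # No harness: report the gap instead of falling back to a screenshot.
        return DimensionScore(name, None, "gameplay_metrics harness missing")
    digest = hashlib.sha256(harness_source).hexdigest()
    if digest != expected_sha256:
        # Assumption: a modified harness is treated like a missing one.
        return DimensionScore(name, None, "gameplay_metrics harness hash mismatch")
    return DimensionScore(name, measure())


harness = b"# human-authored metrics harness (placeholder content)\n"
pinned = hashlib.sha256(harness).hexdigest()

print(score_experiential("fun_challenge", None, pinned, lambda: 0.7))
print(score_experiential("fun_challenge", harness, pinned, lambda: 0.7))
```

Returning an explicit `None` score with a `skill_gap` reason gives the guard something concrete to HALT on, rather than a frozen screenshot-derived number that looks like a real measurement.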