Stage 1 rubric: drop domain-match, align to docs/SPEC.md scope #50
Merged
The first real Layer 3 visual smoke produced an iOS Stage 1 FAIL —
not because of a rename bug, but because Stage 1 captured the
substrate's onboarding screen ("Onboarding 1 / Landscape /
Onboarding description 1") and the rubric's domain-match criterion
("does this read as a clinic queue?") asks a question Stage 1 can't
answer until mobile-mcp navigation lands.
Per docs/SPEC.md, Stage 1's stated purpose is catching "egregious
rename failures" (substrate-leak), and Stage 2 — UI-driven
navigation past onboarding/login via mobile-mcp — is where
domain-semantic UI judging belongs. The current rubric overreaches
into Stage 2 territory at Stage 1.
Changes:
- DEFAULT_STAGE1_RUBRIC: drop domain-match. Keep no-substrate-leak
(Stage 1's actual job per SPEC.md) and renders-cleanly (broad
enough to catch crash/layout breakage on any screen including
onboarding placeholders).
- tests/smoke.test.ts: update the assertion to match new IDs.
- docs/SPEC.md: re-pin domain-semantic question to Stage 2,
explicitly call Stage 1 "substrate-leak detection and render
sanity." Resolves an internal inconsistency that pinned the
domain question at Stage 1 vs. its declared Stage-1 scope.
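
A minimal sketch of what the trimmed rubric and the updated smoke assertion might look like. The RubricCriterion shape and the question wording are assumptions for illustration; only the two criterion IDs and the constant name come from this PR, and the real definitions in the repository may differ.

```typescript
import * as assert from "node:assert";

// Hypothetical criterion shape; the real type may carry more fields.
interface RubricCriterion {
  id: string;
  question: string; // prompt text for the vision judge (illustrative wording)
}

// Stage 1 after this change: substrate-leak detection and render sanity only.
const DEFAULT_STAGE1_RUBRIC: RubricCriterion[] = [
  {
    id: "no-substrate-leak",
    question: "Is the screen free of leftover substrate/template naming?",
  },
  {
    id: "renders-cleanly",
    question: "Does the screen render without a crash or layout breakage?",
  },
];

// domain-match ("does this read as a clinic queue?") is deferred to Stage 2.
const stage1Ids = DEFAULT_STAGE1_RUBRIC.map((c) => c.id);

// Same kind of check the smoke test performs on the criterion IDs.
assert.deepEqual(stage1Ids, ["no-substrate-leak", "renders-cleanly"]);
```

Asserting on the ID list rather than the question text keeps the test stable when the prompt wording is tuned later.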
Out of scope: Stage 2 (mobile-mcp navigation past onboarding to
the actual domain UI). That's where domain-match belongs and is
the eventual full demo. Significant feature, separate PR(s).
Verification:
- npm run ci: 19/19 green.
- Real-mode visual run on the next attempt should show
Layer 3 2/2 pass on iOS (onboarding placeholder satisfies both
no-substrate-leak and renders-cleanly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dadachi added a commit that referenced this pull request on May 4, 2026
Third iteration of the same fundamental issue. After PR #50 dropped
domain-match from the Stage 1 rubric, the remaining renders-cleanly
criterion still leaked domain-semantic judgment via its phrasing
"placeholder text where real content should appear": the vision judge
interpreted "this is a welcome screen, not a queue dashboard" as
failing that clause. Same domain-match question in different clothes.

Verified live: against the post-substrate-fix iOS welcome screenshot
("Welcome to Vet Clinic Queue" with three sparkles), median-of-3
sampling oscillates between PASS and FAIL across runs. The judge's
FAIL rationales explicitly demand "a functional clinic queue
interface", which is Stage 2 territory per docs/SPEC.md, not Stage 1's
"did it render at launch."

Tightened to actual render-failure detection only:

  Does the screen render without an actual rendering failure, that
  is, no crash dialog, no broken-image-icon glyphs, no text
  overlapping other text, no content cut off the side of the screen?
  A welcome / launch / onboarding screen with decorative graphics
  counts as PASS as long as nothing is technically broken; do not
  judge whether the screen looks "finished" or shows the app's
  domain content.

Explicitly tells the judge: "decorative graphics with welcome text is
PASS." Removes the "where real content should appear" clause that
invited Stage-2 interpretation.

Verified post-fix: both iOS and Android welcome screenshots from the
substrate now PASS both criteria consistently. Sampling variance
should be much lower with unambiguous criterion wording.

Out of scope (Stage 2 territory):
- Domain-semantic UI judging (does the home screen look like a real
  clinic queue?). That's where mobile-mcp navigation past welcome
  lands.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
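
The median-of-3 sampling mentioned in the commit above can be sketched as follows. This is an illustrative reduction only, not the project's implementation: the Verdict type and both function names are hypothetical. With binary verdicts and an odd number of samples, the median is the majority vote.

```typescript
type Verdict = "PASS" | "FAIL";

// Median of an odd number of binary verdicts: whichever side holds the majority.
function medianVerdict(samples: Verdict[]): Verdict {
  if (samples.length === 0 || samples.length % 2 === 0) {
    throw new Error("need an odd, non-zero number of samples");
  }
  const passes = samples.filter((v) => v === "PASS").length;
  return passes > samples.length / 2 ? "PASS" : "FAIL";
}

// True only when every sample agrees; a split vote hints at ambiguous criterion wording.
function isUnanimous(samples: Verdict[]): boolean {
  return samples.every((v) => v === samples[0]);
}

// Two runs against the same screenshot can land on opposite sides when the
// judge is near the decision boundary.
const runA = medianVerdict(["PASS", "FAIL", "PASS"]);
const runB = medianVerdict(["FAIL", "PASS", "FAIL"]);
```

A median only smooths out occasional outliers. When individual samples split close to evenly, the aggregate still flips between runs, which is why the fix here targets the criterion wording rather than the sample count.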
Summary
The first real Layer 3 visual smoke produced an iOS Stage 1 FAIL, not because of a rename bug, but because Stage 1 captured the substrate's onboarding screen ("Onboarding 1 / Landscape / Onboarding description 1") and the rubric's domain-match criterion ("does this read as a clinic queue?") asks a question Stage 1 can't answer until mobile-mcp navigation lands in Stage 2.

What docs/SPEC.md says
Stage 1 is scoped to substrate-leak detection. Domain-semantic matching needs to reach the actual domain UI past onboarding/login, which is Stage 2's job. The current DEFAULT_STAGE1_RUBRIC (#40) overreached into Stage 2 territory by including domain-match at Stage 1.

Why not the alternatives
- […] domain-match, but premature.
- […] domain-match from Stage 1 ← chosen: aligns rubric to Stage 1's documented scope.

Changes
- src/validation/visual-judge.ts: drop domain-match from DEFAULT_STAGE1_RUBRIC. Keep no-substrate-leak (Stage 1's actual job per SPEC.md) and renders-cleanly (broad enough to catch crash/layout breakage on any screen including onboarding placeholders). Updated header comment explains the Stage-1-vs-Stage-2 split and points at where domain-match will live once mobile-mcp navigation lands.
- tests/smoke.test.ts: update assert.deepEqual to match new IDs ["no-substrate-leak", "renders-cleanly"].
- docs/SPEC.md: re-pin the domain-semantic question ("does this read as a [clinic queue]") to Stage 2, explicitly call Stage 1 "substrate-leak detection and render sanity." Resolves the internal inconsistency that pinned the domain question at Stage 1 vs. its own declared Stage-1 scope.

Test plan
- npm run ci: 19/19 green.
- ANDROID_SERIAL=emulator-5554 NATIVEAPPTEMPLATE_VISUAL=1 npm run dev -- "a walk-in clinic queue for small veterinary practices" […] no-substrate-leak and renders-cleanly). Layer 3 summary becomes Layer 3 2/2 pass.

Out of scope
[…] domain-match belongs; separate PR(s).

🤖 Generated with Claude Code