Stage 1 rubric: drop domain-match, align to docs/SPEC.md scope #50

Merged
dadachi merged 1 commit into main from stage1-rubric-align-to-spec on May 3, 2026
@dadachi dadachi commented May 3, 2026

Summary

The first real Layer 3 visual smoke produced an iOS Stage 1 FAIL — not because of a rename bug, but because Stage 1 captured the substrate's onboarding screen ("Onboarding 1 / Landscape / Onboarding description 1") and the rubric's domain-match criterion ("does this read as a clinic queue?") asks a question Stage 1 can't answer until mobile-mcp navigation lands in Stage 2.

What docs/SPEC.md says

Stage 1 (in-week minimum): screenshots of the launch / home screen only ... This alone catches egregious rename failures ("the app still says 'Shop' everywhere").

Stage 2 (in-week stretch): UI-driven navigation to 2–3 additional key screens — list view, detail view, form — using mobile-next/mobile-mcp.

Stage 1 is scoped to substrate-leak detection. Domain-semantic matching requires reaching the actual domain UI past onboarding/login, which is Stage 2's job. The current DEFAULT_STAGE1_RUBRIC (#40) overreached into Stage 2 territory by including domain-match at Stage 1.

Why not the alternatives

  • Substrate-side onboarding tweak — substrate's onboarding-as-tutorial is intentional design; not the agent's concern.
  • Agent-side mobile-mcp navigation — Stage 2 work, days of effort. The proper home for domain-match, but premature.
  • Drop domain-match from Stage 1 ← chosen — aligns rubric to Stage 1's documented scope.

Changes

  • src/validation/visual-judge.ts: drop domain-match from DEFAULT_STAGE1_RUBRIC. Keep no-substrate-leak (Stage 1's actual job per SPEC.md) and renders-cleanly (broad enough to catch crash/layout breakage on any screen including onboarding placeholders). Updated header comment explains the Stage-1-vs-Stage-2 split and points at where domain-match will live once mobile-mcp navigation lands.
  • tests/smoke.test.ts: update assert.deepEqual to match new IDs ["no-substrate-leak", "renders-cleanly"].
  • docs/SPEC.md: re-pin the domain-semantic question ("does this read as a [clinic queue]") to Stage 2 and explicitly call Stage 1 "substrate-leak detection and render sanity." This resolves the internal inconsistency in which the spec pinned the domain question at Stage 1, contrary to its own declared Stage 1 scope.
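
As a rough illustration of the rubric change listed above, the trimmed rubric might look like the following sketch. Only the name DEFAULT_STAGE1_RUBRIC and the two criterion IDs come from this PR; the `RubricCriterion` shape and the question wording are assumptions made for illustration and may not match src/validation/visual-judge.ts.

```typescript
// Hypothetical criterion shape; the real type in visual-judge.ts may differ.
interface RubricCriterion {
  id: string;
  question: string;
}

// Stage 1 keeps only substrate-leak detection and render sanity.
// domain-match is deliberately absent: it belongs to Stage 2, once
// navigation past onboarding/login exists.
// The question strings below are placeholders, not the repository's wording.
const DEFAULT_STAGE1_RUBRIC: readonly RubricCriterion[] = [
  {
    id: "no-substrate-leak",
    question: "Is the screen free of leftover substrate naming?",
  },
  {
    id: "renders-cleanly",
    question: "Does the screen render without a crash or layout breakage?",
  },
];

// The ID list that the smoke test's deep-equality assertion would compare against.
const stage1Ids: string[] = DEFAULT_STAGE1_RUBRIC.map((criterion) => criterion.id);
```

Keeping the IDs in a fixed order lets the smoke test pin the rubric's membership exactly, so reintroducing domain-match at Stage 1 would fail the test.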

Test plan

  • npm run ci — 19/19 green.
  • Re-run the visual smoke that triggered this:
    ANDROID_SERIAL=emulator-5554 NATIVEAPPTEMPLATE_VISUAL=1 npm run dev -- "a walk-in clinic queue for small veterinary practices"
    Expected: iOS Stage 1 now passes (the onboarding placeholder satisfies both no-substrate-leak and renders-cleanly). The Layer 3 summary becomes "Layer 3 2/2 pass".

Out of scope

  • Stage 2 (mobile-mcp navigation past onboarding to the real domain UI). That's where domain-match belongs — separate PR(s).
  • Substrate onboarding redesign — not the agent's concern.

🤖 Generated with Claude Code

The first real Layer 3 visual smoke produced an iOS Stage 1 FAIL —
not because of a rename bug, but because Stage 1 captured the
substrate's onboarding screen ("Onboarding 1 / Landscape /
Onboarding description 1") and the rubric's domain-match criterion
("does this read as a clinic queue?") asks a question Stage 1 can't
answer until mobile-mcp navigation lands.

Per docs/SPEC.md, Stage 1's stated purpose is catching "egregious
rename failures" (substrate-leak), and Stage 2 — UI-driven
navigation past onboarding/login via mobile-mcp — is where
domain-semantic UI judging belongs. The current rubric overreaches
into Stage 2 territory at Stage 1.

Changes:
  - DEFAULT_STAGE1_RUBRIC: drop domain-match. Keep no-substrate-leak
    (Stage 1's actual job per SPEC.md) and renders-cleanly (broad
    enough to catch crash/layout breakage on any screen including
    onboarding placeholders).
  - tests/smoke.test.ts: update the assertion to match new IDs.
  - docs/SPEC.md: re-pin domain-semantic question to Stage 2,
    explicitly call Stage 1 "substrate-leak detection and render
    sanity." Resolves an internal inconsistency that pinned the
    domain question at Stage 1 vs. its declared Stage-1 scope.

Out of scope: Stage 2 (mobile-mcp navigation past onboarding to
the actual domain UI). That's where domain-match belongs and is
the eventual full demo. Significant feature, separate PR(s).

Verification:
  - npm run ci: 19/19 green.
  - Real-mode visual run on the next attempt should show
    Layer 3 2/2 pass on iOS (onboarding placeholder satisfies both
    no-substrate-leak and renders-cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dadachi dadachi merged commit 186b0f1 into main May 3, 2026
1 check passed
@dadachi dadachi deleted the stage1-rubric-align-to-spec branch May 3, 2026 05:41
dadachi added a commit that referenced this pull request May 4, 2026
Third iteration of the same fundamental issue. After PR #50 dropped
domain-match from the Stage 1 rubric, the remaining renders-cleanly
criterion still leaked domain-semantic judgment via its phrasing
"placeholder text where real content should appear" — the vision
judge interpreted "this is a welcome screen, not a queue dashboard"
as failing that clause. Same domain-match question in different
clothes.

Verified live: against the post-substrate-fix iOS welcome screenshot
("Welcome to Vet Clinic Queue" with three sparkles), median-of-3
sampling oscillates between PASS and FAIL across runs. The judge's
FAIL rationales explicitly demand "a functional clinic queue
interface" — Stage 2 territory per docs/SPEC.md, not Stage 1's
"did it render at launch."
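
For readers unfamiliar with median-of-3 sampling, here is a minimal sketch of how three independent verdicts could be reduced to one. The `Verdict` type and the `medianVerdict` helper are hypothetical names introduced for illustration; the repository's actual aggregation code is not shown in this thread and may work differently.

```typescript
type Verdict = "PASS" | "FAIL";

// Reduce an odd number of judge samples to a single verdict by taking the
// median of their numeric scores (PASS = 1, FAIL = 0). For binary verdicts
// this is equivalent to a majority vote.
function medianVerdict(samples: readonly Verdict[]): Verdict {
  if (samples.length % 2 === 0) {
    throw new Error("an odd, non-zero number of samples is required");
  }
  const scores = samples
    .map((verdict) => (verdict === "PASS" ? 1 : 0))
    .sort((a, b) => a - b);
  return scores[Math.floor(scores.length / 2)] === 1 ? "PASS" : "FAIL";
}

// One flipped sample is enough to change the outcome on a borderline criterion.
const runA = medianVerdict(["PASS", "FAIL", "PASS"]); // "PASS"
const runB = medianVerdict(["FAIL", "PASS", "FAIL"]); // "FAIL"
```

When the judge is split close to evenly on an ambiguous criterion, the median itself can flip from run to run, which is consistent with the oscillation described above.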

Tightened to actual render-failure detection only:

  Does the screen render without an actual rendering failure —
  that is, no crash dialog, no broken-image-icon glyphs, no text
  overlapping other text, no content cut off the side of the
  screen? A welcome / launch / onboarding screen with decorative
  graphics counts as PASS as long as nothing is technically
  broken; do not judge whether the screen looks "finished" or
  shows the app's domain content.

Explicitly tells the judge: "decorative graphics with welcome text
is PASS." Removes the "where real content should appear" clause
that invited Stage-2 interpretation.
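
One way to keep the criterion limited to concrete render failures is to build its text from an explicit list of failure modes. The sketch below is an assumption-laden illustration: `RENDER_FAILURES` and `buildRendersCleanlyQuestion` are hypothetical names, and the repository may simply store the criterion as a single string.

```typescript
// Failure modes named in the tightened criterion text above.
const RENDER_FAILURES = [
  "crash dialog",
  "broken-image-icon glyphs",
  "text overlapping other text",
  "content cut off the side of the screen",
] as const;

// Hypothetical helper: assembles the renders-cleanly question from the
// failure list, then appends the explicit PASS guidance for welcome screens.
function buildRendersCleanlyQuestion(failures: readonly string[]): string {
  const list = failures.map((failure) => `no ${failure}`).join(", ");
  return (
    `Does the screen render without an actual rendering failure (${list})? ` +
    `A welcome / launch / onboarding screen with decorative graphics counts as PASS ` +
    `as long as nothing is technically broken; do not judge whether the screen ` +
    `looks "finished" or shows the app's domain content.`
  );
}

const rendersCleanlyQuestion = buildRendersCleanlyQuestion(RENDER_FAILURES);
```

Enumerating the failure modes keeps the wording auditable: anything not on the list, such as whether domain content is visible, is outside what the judge is asked to evaluate.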

Verified post-fix: both iOS and Android welcome screenshots from
the substrate now PASS both criteria consistently. Sampling variance
should be much lower with unambiguous criterion wording.

Out of scope (Stage 2 territory):
  - Domain-semantic UI judging (does the home screen look like
    a real clinic queue?). That's where mobile-mcp navigation
    past welcome lands.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>