Stage 1 rubric: drop domain-match, align to docs/SPEC.md scope by dadachi · Pull Request #50 · nativeapptemplate/nativeapptemplate-agent

dadachi · 2026-05-03T05:40:53Z

Summary

The first real Layer 3 visual smoke produced an iOS Stage 1 FAIL — not because of a rename bug, but because Stage 1 captured the substrate's onboarding screen ("Onboarding 1 / Landscape / Onboarding description 1") and the rubric's domain-match criterion ("does this read as a clinic queue?") asks a question Stage 1 can't answer until mobile-mcp navigation lands in Stage 2.

What `docs/SPEC.md` says

Stage 1 (in-week minimum): screenshots of the launch / home screen only ... This alone catches egregious rename failures ("the app still says 'Shop' everywhere").

Stage 2 (in-week stretch): UI-driven navigation to 2–3 additional key screens — list view, detail view, form — using mobile-next/mobile-mcp.

Stage 1 is scoped to substrate-leak detection. Domain-semantic matching needs to reach the actual domain UI past onboarding/login — Stage 2's job. The current DEFAULT_STAGE1_RUBRIC (#40) overreached into Stage 2 territory by including domain-match at Stage 1.

Why not the alternatives

Substrate-side onboarding tweak — substrate's onboarding-as-tutorial is intentional design; not the agent's concern.
Agent-side mobile-mcp navigation — Stage 2 work, days of effort. The proper home for domain-match, but premature.
Drop domain-match from Stage 1 ← chosen — aligns rubric to Stage 1's documented scope.

Changes

src/validation/visual-judge.ts: drop domain-match from DEFAULT_STAGE1_RUBRIC. Keep no-substrate-leak (Stage 1's actual job per SPEC.md) and renders-cleanly (broad enough to catch crash/layout breakage on any screen including onboarding placeholders). Updated header comment explains the Stage-1-vs-Stage-2 split and points at where domain-match will live once mobile-mcp navigation lands.
tests/smoke.test.ts: update assert.deepEqual to match new IDs ["no-substrate-leak", "renders-cleanly"].
docs/SPEC.md: re-pin the domain-semantic question ("does this read as a [clinic queue]") to Stage 2, explicitly call Stage 1 "substrate-leak detection and render sanity." Resolves the internal inconsistency that pinned the domain question at Stage 1 vs. its own declared Stage-1 scope.
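
The resulting rubric can be pictured roughly as follows. This is a minimal sketch, not the actual contents of src/validation/visual-judge.ts: the `RubricCriterion` shape, the `question` field, and the question wording are assumptions made for illustration. Only the constant name `DEFAULT_STAGE1_RUBRIC` and the two criterion IDs come from this PR.

```typescript
// Hypothetical criterion shape; the real type in src/validation/visual-judge.ts may differ.
interface RubricCriterion {
  id: string;
  question: string; // prompt text handed to the vision judge (wording below is illustrative)
}

// Stage 1 judges the launch screen only, so it checks substrate leaks and render sanity.
// domain-match is intentionally absent: it needs Stage 2's mobile-mcp navigation.
const DEFAULT_STAGE1_RUBRIC: readonly RubricCriterion[] = [
  {
    id: "no-substrate-leak",
    question: "Is the screen free of leftover substrate naming, such as the app still saying 'Shop'?",
  },
  {
    id: "renders-cleanly",
    question: "Does the screen render without a crash or layout breakage?",
  },
];

// The ID list that a smoke test can compare against.
const stage1Ids: string[] = DEFAULT_STAGE1_RUBRIC.map((criterion) => criterion.id);
console.log(stage1Ids);
```

Keeping the IDs in a fixed order lets a smoke test compare them against a literal array, which is what the updated assertion in tests/smoke.test.ts is described as doing.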

Test plan

npm run ci — 19/19 green.
Re-run the visual smoke that triggered this:
```
ANDROID_SERIAL=emulator-5554 NATIVEAPPTEMPLATE_VISUAL=1 npm run dev -- "a walk-in clinic queue for small veterinary practices"
```
Expected: iOS Stage 1 now passes (the onboarding placeholder satisfies both no-substrate-leak and renders-cleanly), and the Layer 3 summary reads "Layer 3 2/2 pass".

Out of scope

Stage 2 (mobile-mcp navigation past onboarding to the real domain UI). That's where domain-match belongs — separate PR(s).
Substrate onboarding redesign — not the agent's concern.

🤖 Generated with Claude Code

The first real Layer 3 visual smoke produced an iOS Stage 1 FAIL, not because of a rename bug, but because Stage 1 captured the substrate's onboarding screen ("Onboarding 1 / Landscape / Onboarding description 1") and the rubric's domain-match criterion ("does this read as a clinic queue?") asks a question Stage 1 can't answer until mobile-mcp navigation lands.

Per docs/SPEC.md, Stage 1's stated purpose is catching "egregious rename failures" (substrate-leak), and Stage 2 (UI-driven navigation past onboarding/login via mobile-mcp) is where domain-semantic UI judging belongs. The current rubric overreaches into Stage 2 territory at Stage 1.

Changes:
- DEFAULT_STAGE1_RUBRIC: drop domain-match. Keep no-substrate-leak (Stage 1's actual job per SPEC.md) and renders-cleanly (broad enough to catch crash/layout breakage on any screen including onboarding placeholders).
- tests/smoke.test.ts: update the assertion to match new IDs.
- docs/SPEC.md: re-pin domain-semantic question to Stage 2, explicitly call Stage 1 "substrate-leak detection and render sanity." Resolves an internal inconsistency that pinned the domain question at Stage 1 vs. its declared Stage-1 scope.

Out of scope: Stage 2 (mobile-mcp navigation past onboarding to the actual domain UI). That's where domain-match belongs and is the eventual full demo. Significant feature, separate PR(s).

Verification:
- npm run ci: 19/19 green.
- Real-mode visual run on the next attempt should show Layer 3 2/2 pass on iOS (onboarding placeholder satisfies both no-substrate-leak and renders-cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Third iteration of the same fundamental issue. After PR #50 dropped domain-match from the Stage 1 rubric, the remaining renders-cleanly criterion still leaked domain-semantic judgment via its phrasing "placeholder text where real content should appear": the vision judge interpreted "this is a welcome screen, not a queue dashboard" as failing that clause. Same domain-match question in different clothes.

Verified live: against the post-substrate-fix iOS welcome screenshot ("Welcome to Vet Clinic Queue" with three sparkles), median-of-3 sampling oscillates between PASS and FAIL across runs. The judge's FAIL rationales explicitly demand "a functional clinic queue interface", which is Stage 2 territory per docs/SPEC.md, not Stage 1's "did it render at launch."

Tightened to actual render-failure detection only:

  Does the screen render without an actual rendering failure, that is, no crash dialog, no broken-image-icon glyphs, no text overlapping other text, no content cut off the side of the screen? A welcome / launch / onboarding screen with decorative graphics counts as PASS as long as nothing is technically broken; do not judge whether the screen looks "finished" or shows the app's domain content.

Explicitly tells the judge: "decorative graphics with welcome text is PASS." Removes the "where real content should appear" clause that invited Stage-2 interpretation.

Verified post-fix: both iOS and Android welcome screenshots from the substrate now PASS both criteria consistently. Sampling variance should be much lower with unambiguous criterion wording.

Out of scope (Stage 2 territory):
- Domain-semantic UI judging (does the home screen look like a real clinic queue?). That's where mobile-mcp navigation past welcome lands.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
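
For reference, median-of-3 sampling over binary verdicts reduces to a majority vote, which is why a judge that flips on ambiguous wording can change the final result from run to run. The following is a minimal sketch, assuming each sample is a plain PASS/FAIL verdict; the `Verdict` type and the `medianOfThree` helper are hypothetical names and are not taken from the repository.

```typescript
type Verdict = "PASS" | "FAIL";

// With only two possible outcomes, the median of three samples is the majority verdict.
function medianOfThree(samples: readonly [Verdict, Verdict, Verdict]): Verdict {
  const passCount = samples.filter((verdict) => verdict === "PASS").length;
  return passCount >= 2 ? "PASS" : "FAIL";
}

// A single dissenting sample is outvoted, but two dissenting samples decide the result.
console.log(medianOfThree(["PASS", "FAIL", "PASS"])); // PASS
console.log(medianOfThree(["FAIL", "PASS", "FAIL"])); // FAIL
```

Under this reading, voting only absorbs one stray sample per run, so a criterion that the judge reads two ways about equally often will still oscillate; unambiguous wording addresses the cause rather than the symptom.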

dadachi merged commit 186b0f1 into main May 3, 2026
1 check passed

dadachi deleted the stage1-rubric-align-to-spec branch May 3, 2026 05:41

dadachi mentioned this pull request May 4, 2026

Stage 1 rubric: tighten renders-cleanly to actual render failures #51

Merged

3 tasks
