Stage 1 rubric: tighten renders-cleanly to actual render failures by dadachi · Pull Request #51 · nativeapptemplate/nativeapptemplate-agent

dadachi · 2026-05-04T02:56:51Z

Summary

Third iteration of the same fundamental issue. PR #50 dropped domain-match from the Stage 1 rubric, but the remaining renders-cleanly criterion still leaked domain-semantic judgment through the phrase "placeholder text where real content should appear": the vision judge read "this is a welcome screen, not a queue dashboard" as a violation of that clause. It is the same domain-match question in different clothes.

Verified live

Against the post-substrate-fix iOS welcome screenshot ("Welcome to Vet Clinic Queue" with three sparkles), median-of-3 sampling oscillates between PASS and FAIL across runs. Sample FAIL rationale:

"The home screen is sparse with only generic sparkle icons and a welcome title, lacking real queue content; the layout feels unfinished/placeholder-like rather than a functional clinic queue interface."

That's Stage 2 territory per docs/SPEC.md, not Stage 1's "did it render at launch."
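
For readers unfamiliar with the sampling step, here is a minimal sketch of median-of-3 aggregation over binary verdicts. The `Verdict` and `JudgeSample` types and the `majorityVerdict` name are hypothetical and not taken from this repo; for PASS/FAIL verdicts, the median of three samples is the same as the majority vote.

```typescript
// Hypothetical types: the repo's real judge interface is not shown in this PR.
type Verdict = "PASS" | "FAIL";

interface JudgeSample {
  verdict: Verdict;
  rationale: string;
}

// Majority vote over an odd number of samples. With binary verdicts and
// three samples this is equivalent to taking the median.
function majorityVerdict(samples: JudgeSample[]): Verdict {
  if (samples.length === 0 || samples.length % 2 === 0) {
    throw new Error("expected an odd, non-zero number of samples");
  }
  const passes = samples.filter((s) => s.verdict === "PASS").length;
  return passes * 2 > samples.length ? "PASS" : "FAIL";
}
```

Majority voting only smooths out occasional outliers. If the criterion wording leaves each sample close to a coin flip, the aggregated verdict can still change from one run to the next, which matches the oscillation described above.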

New criterion wording

Does the screen render without an actual rendering failure — that is, no
crash dialog, no broken-image-icon glyphs, no text overlapping other text,
no content cut off the side of the screen? A welcome / launch / onboarding
screen with decorative graphics counts as PASS as long as nothing is
technically broken; do not judge whether the screen looks "finished" or
shows the app's domain content.

The new wording explicitly tells the judge: "decorative graphics with welcome text is PASS." It also removes the "where real content should appear" clause that invited Stage-2 interpretation.
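
One way to picture the tightened criterion is as a question assembled from an explicit list of observable failure modes plus a fixed pass clause. The sketch below is illustrative only: the `Criterion` shape, `RENDER_FAILURES`, and `buildRendersCleanly` are assumptions, and the repo may simply store the wording as a literal string.

```typescript
// Hypothetical shape: the repo's actual rubric types are not shown in this PR.
interface Criterion {
  id: string;
  question: string;
}

// Concrete failure modes, copied from the new criterion wording.
const RENDER_FAILURES: string[] = [
  "crash dialog",
  "broken-image-icon glyphs",
  "text overlapping other text",
  "content cut off the side of the screen",
];

// Assemble the question from the enumerated failures and the explicit
// "welcome screens pass" clause, so nothing in it asks about domain content.
function buildRendersCleanly(failures: string[]): Criterion {
  const list = failures.map((f) => `no ${f}`).join(", ");
  return {
    id: "renders-cleanly",
    question:
      `Does the screen render without an actual rendering failure, that is, ${list}? ` +
      "A welcome / launch / onboarding screen with decorative graphics counts as PASS " +
      "as long as nothing is technically broken; do not judge whether the screen " +
      'looks "finished" or shows the app\'s domain content.',
  };
}
```

Keeping the failure modes in a list makes it easy to see that each one can be checked from the screenshot alone, without knowing what the app is for.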

Verified post-fix

Both iOS and Android welcome screenshots from the substrate now PASS both criteria consistently:

=== ios: PASS
  no-substrate-leak: PASS — only "Welcome to Vet Clinic Queue", ...
  renders-cleanly: PASS — no overlapping text, no broken image icons, no content cut off.
=== android: PASS
  no-substrate-leak: PASS — only "Welcome to Vet Clinic Queue", ...
  renders-cleanly: PASS — no overlapping text, no broken images, no content cut off.

Sampling variance should be much lower with unambiguous criterion wording.
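
The variance claim could be checked by repeating the judge call several times and measuring how often runs disagree. The helper below is a sketch under that assumption; `disagreementRate` is a hypothetical name, and this PR does not report such a measurement.

```typescript
type Verdict = "PASS" | "FAIL";

// Fraction of runs whose verdict differs from the most common verdict.
// 0 means every run agreed; 0.5 is the maximum for a binary verdict.
function disagreementRate(verdicts: Verdict[]): number {
  if (verdicts.length === 0) {
    throw new Error("need at least one verdict");
  }
  const passes = verdicts.filter((v) => v === "PASS").length;
  const minority = Math.min(passes, verdicts.length - passes);
  return minority / verdicts.length;
}
```

Comparing this rate for the old and new wording on the same screenshots would turn "should be much lower" into a number.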

Test plan

npm run ci — 19/19 green.
Direct runLayer3 calls against the captured screenshots (iOS + Android) → both PASS both criteria.
After merge: NATIVEAPPTEMPLATE_VISUAL=1 npm run dev -- "..." should report Layer 3 2/2 pass.
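
The expected "Layer 3 2/2 pass" line can be read as a count of platforms whose criteria all passed. That reading is an assumption (the source does not say whether 2/2 counts platforms or criteria), and the types and function names below are hypothetical, not the repo's `runLayer3` API.

```typescript
type Verdict = "PASS" | "FAIL";

interface CriterionResult {
  id: string; // e.g. "no-substrate-leak" or "renders-cleanly"
  verdict: Verdict;
}

interface PlatformResult {
  platform: "ios" | "android";
  criteria: CriterionResult[];
}

// A platform passes only if it has results and every criterion passed.
function platformVerdict(result: PlatformResult): Verdict {
  const allPass =
    result.criteria.length > 0 &&
    result.criteria.every((c) => c.verdict === "PASS");
  return allPass ? "PASS" : "FAIL";
}

// Assumed summary format, modelled on the "Layer 3 2/2 pass" line above.
function summarize(results: PlatformResult[]): string {
  const passed = results.filter((r) => platformVerdict(r) === "PASS").length;
  return `Layer 3 ${passed}/${results.length} pass`;
}
```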

Out of scope

Stage 2 mobile-mcp navigation past welcome to the actual domain UI. That's where domain-semantic judging belongs.

🤖 Generated with Claude Code

Third iteration of the same fundamental issue. After PR #50 dropped domain-match from the Stage 1 rubric, the remaining renders-cleanly criterion still leaked domain-semantic judgment via its phrasing "placeholder text where real content should appear" — the vision judge interpreted "this is a welcome screen, not a queue dashboard" as failing that clause. Same domain-match question in different clothes.

Verified live: against the post-substrate-fix iOS welcome screenshot ("Welcome to Vet Clinic Queue" with three sparkles), median-of-3 sampling oscillates between PASS and FAIL across runs. The judge's FAIL rationales explicitly demand "a functional clinic queue interface" — Stage 2 territory per docs/SPEC.md, not Stage 1's "did it render at launch."

Tightened to actual render-failure detection only:

Does the screen render without an actual rendering failure — that is, no crash dialog, no broken-image-icon glyphs, no text overlapping other text, no content cut off the side of the screen? A welcome / launch / onboarding screen with decorative graphics counts as PASS as long as nothing is technically broken; do not judge whether the screen looks "finished" or shows the app's domain content.

Explicitly tells the judge: "decorative graphics with welcome text is PASS." Removes the "where real content should appear" clause that invited Stage-2 interpretation.

Verified post-fix: both iOS and Android welcome screenshots from the substrate now PASS both criteria consistently. Sampling variance should be much lower with unambiguous criterion wording.

Out of scope (Stage 2 territory):
- Domain-semantic UI judging (does the home screen look like a real clinic queue?). That's where mobile-mcp navigation past welcome lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dadachi merged commit d1b6cb5 into main May 4, 2026
1 check passed

dadachi deleted the stage1-rubric-no-domain-bleed branch May 4, 2026 02:57

dadachi merged 1 commit into main from stage1-rubric-no-domain-bleed
