Stage 1 rubric: tighten renders-cleanly to actual render failures #51

Merged
dadachi merged 1 commit into main from stage1-rubric-no-domain-bleed
May 4, 2026

dadachi commented May 4, 2026

Summary

Third iteration of the same fundamental issue. PR #50 dropped domain-match from the Stage 1 rubric, but the remaining renders-cleanly criterion still leaked domain-semantic judgment via its phrasing "placeholder text where real content should appear" — the vision judge interpreted "this is a welcome screen, not a queue dashboard" as failing that clause. Same domain-match question in different clothes.

Verified live

Against the post-substrate-fix iOS welcome screenshot ("Welcome to Vet Clinic Queue" with three sparkles), median-of-3 sampling oscillates between PASS and FAIL across runs. Sample FAIL rationale:

"The home screen is sparse with only generic sparkle icons and a welcome title, lacking real queue content; the layout feels unfinished/placeholder-like rather than a functional clinic queue interface."

That's Stage 2 territory per docs/SPEC.md, not Stage 1's "did it render at launch."
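The oscillation described above can be illustrated with a minimal sketch of median-of-3 sampling. For an odd number of binary verdicts the median is the majority vote, so two runs over the same screenshot disagree whenever their individual samples split 2-1 in opposite directions. The `Verdict` type and `medianVerdict` function are hypothetical names for illustration, not taken from the repository.

```typescript
// Hypothetical types and names; none are taken from the repository.
type Verdict = "PASS" | "FAIL";

// With an odd number of binary verdicts, the median equals the majority vote.
function medianVerdict(samples: Verdict[]): Verdict {
  if (samples.length === 0 || samples.length % 2 === 0) {
    throw new Error("expected an odd, non-zero number of samples");
  }
  const passes = samples.filter((v) => v === "PASS").length;
  return passes > samples.length / 2 ? "PASS" : "FAIL";
}

// Same screenshot, two runs: an ambiguous criterion lets the split flip.
const runA = medianVerdict(["PASS", "FAIL", "PASS"]); // "PASS"
const runB = medianVerdict(["FAIL", "PASS", "FAIL"]); // "FAIL"
```

Median sampling only smooths out occasional outliers; when the judge is genuinely torn on a criterion, the aggregate verdict stays unstable, which is why the fix targets the wording rather than the sample count.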

New criterion wording

Does the screen render without an actual rendering failure — that is, no
crash dialog, no broken-image-icon glyphs, no text overlapping other text,
no content cut off the side of the screen? A welcome / launch / onboarding
screen with decorative graphics counts as PASS as long as nothing is
technically broken; do not judge whether the screen looks "finished" or
shows the app's domain content.

Explicitly tells the judge: "decorative graphics with welcome text is PASS." Removes the "where real content should appear" clause that invited Stage-2 interpretation.
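One way to keep the removed clause from creeping back is a small wording guard over criterion text. The sketch below is an assumption, not part of this PR: `findDomainBleed` and the phrase list are hypothetical, with the single phrase taken from the clause this PR removes.

```typescript
// Hypothetical guard; the phrase list is an assumption based on this PR's description.
const DOMAIN_BLEED_PHRASES: string[] = ["where real content should appear"];

// Returns every listed phrase that occurs in the criterion text (case-insensitive).
function findDomainBleed(question: string): string[] {
  const lower = question.toLowerCase();
  return DOMAIN_BLEED_PHRASES.filter((phrase) => lower.includes(phrase));
}

// Old clause quoted in the summary, and the second sentence of the new wording.
const oldClause = "placeholder text where real content should appear";
const newClause =
  "A welcome / launch / onboarding screen with decorative graphics counts as PASS " +
  "as long as nothing is technically broken; do not judge whether the screen looks " +
  '"finished" or shows the app\'s domain content.';

const oldHits = findDomainBleed(oldClause); // one matching phrase
const newHits = findDomainBleed(newClause); // no matching phrases
```

A check like this only catches known phrasings; it does not replace re-running the judge against the captured screenshots.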

Verified post-fix

Both iOS and Android welcome screenshots from the substrate now PASS both criteria consistently:

=== ios: PASS
  no-substrate-leak: PASS — only "Welcome to Vet Clinic Queue", ...
  renders-cleanly: PASS — no overlapping text, no broken image icons, no content cut off.
=== android: PASS
  no-substrate-leak: PASS — only "Welcome to Vet Clinic Queue", ...
  renders-cleanly: PASS — no overlapping text, no broken images, no content cut off.

With unambiguous criterion wording, sampling variance should be much lower than the PASS/FAIL oscillation seen before the fix.
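The per-platform lines in the output above are consistent with an all-criteria-must-pass rollup. The sketch below shows that aggregation under that assumption; `CriterionResult` and `platformVerdict` are hypothetical names, and the actual Layer 3 implementation may combine results differently.

```typescript
// Hypothetical result shape; only the criterion ids come from the output above.
type CriterionVerdict = "PASS" | "FAIL";

interface CriterionResult {
  id: string;
  verdict: CriterionVerdict;
}

// Assumption: a platform passes only if every criterion passes.
function platformVerdict(results: CriterionResult[]): CriterionVerdict {
  return results.every((r) => r.verdict === "PASS") ? "PASS" : "FAIL";
}

const ios: CriterionResult[] = [
  { id: "no-substrate-leak", verdict: "PASS" },
  { id: "renders-cleanly", verdict: "PASS" },
];

const iosVerdict = platformVerdict(ios); // "PASS"
const withFailure = platformVerdict([
  { id: "no-substrate-leak", verdict: "PASS" },
  { id: "renders-cleanly", verdict: "FAIL" },
]); // "FAIL"
```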

Test plan

  • npm run ci — 19/19 green.
  • Direct runLayer3 calls against the captured screenshots (iOS + Android) → both PASS both criteria.
  • After merge: NATIVEAPPTEMPLATE_VISUAL=1 npm run dev -- "..." should report Layer 3 2/2 pass.

Out of scope

  • Stage 2 mobile-mcp navigation past welcome to the actual domain UI. That's where domain-semantic judging belongs.

🤖 Generated with Claude Code

Third iteration of the same fundamental issue. After PR #50 dropped
domain-match from the Stage 1 rubric, the remaining renders-cleanly
criterion still leaked domain-semantic judgment via its phrasing
"placeholder text where real content should appear" — the vision
judge interpreted "this is a welcome screen, not a queue dashboard"
as failing that clause. Same domain-match question in different
clothes.

Verified live: against the post-substrate-fix iOS welcome screenshot
("Welcome to Vet Clinic Queue" with three sparkles), median-of-3
sampling oscillates between PASS and FAIL across runs. The judge's
FAIL rationales explicitly demand "a functional clinic queue
interface" — Stage 2 territory per docs/SPEC.md, not Stage 1's
"did it render at launch."

Tightened to actual render-failure detection only:

  Does the screen render without an actual rendering failure —
  that is, no crash dialog, no broken-image-icon glyphs, no text
  overlapping other text, no content cut off the side of the
  screen? A welcome / launch / onboarding screen with decorative
  graphics counts as PASS as long as nothing is technically
  broken; do not judge whether the screen looks "finished" or
  shows the app's domain content.

Explicitly tells the judge: "decorative graphics with welcome text
is PASS." Removes the "where real content should appear" clause
that invited Stage-2 interpretation.

Verified post-fix: both iOS and Android welcome screenshots from
the substrate now PASS both criteria consistently. Sampling variance
should be much lower with unambiguous criterion wording.

Out of scope (Stage 2 territory):
  - Domain-semantic UI judging (does the home screen look like
    a real clinic queue?). That's where mobile-mcp navigation
    past welcome lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dadachi merged commit d1b6cb5 into main on May 4, 2026
1 check passed
dadachi deleted the stage1-rubric-no-domain-bleed branch on May 4, 2026 02:57