Layer 3 Phase 5b: integrate runVisualJudge into runJudge (opt-in) by dadachi · Pull Request #41 · nativeapptemplate/nativeapptemplate-agent

dadachi · 2026-05-02T12:46:14Z

Summary

Extends JudgeInput with an optional visual field. When provided, runJudge calls runVisualJudge (#40) per configured platform, captures results into JudgeResult, and rolls them into the summary line.

Default behavior unchanged. dispatch.ts doesn't pass visual, so runJudge takes the existing skip path with a clearer trace ("visual config not provided; skipped") instead of the old "not yet wired" placeholder.

When visual is configured, the summary becomes:

Layer 1 3/3 pass · Layer 2 3/3 pass · Layer 3 2/2 pass
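As a rough illustration of how a summary line like the one above could be assembled, here is a minimal sketch. `LayerTally` and `formatSummary` are hypothetical names, not taken from the PR, and the "fail" wording for a layer that does not fully pass is an assumption.

```typescript
// Hypothetical tally for one judging layer; not a type from the PR.
type LayerTally = { label: string; passed: number; total: number };

// Joins per-layer tallies with " · ", matching the summary format quoted above.
// A layer is labelled "pass" only when every check in it passed ("fail" is assumed wording).
function formatSummary(tallies: LayerTally[]): string {
  return tallies
    .map((t) => `${t.label} ${t.passed}/${t.total} ${t.passed === t.total ? "pass" : "fail"}`)
    .join(" · ");
}

const summary = formatSummary([
  { label: "Layer 1", passed: 3, total: 3 },
  { label: "Layer 2", passed: 3, total: 3 },
  // Assumption: Layer 3 counts one entry per judged platform (iOS + Android).
  { label: "Layer 3", passed: 2, total: 2 },
]);
console.log(summary);
```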

New result shape

type JudgeResult = {
  overallPass: boolean;
  summary: string;
  visual?: {
    ios?: VisualJudgePlatformReport;
    android?: VisualJudgePlatformReport;
  };
};

type VisualJudgePlatformReport = {
  pass: boolean;
  screenshotPath?: string;
  scores?: { criterionId: string; pass: boolean; rationale: string }[];
  error?: string;
};

overallPass now requires layer1 + layer2 + visual all passing (when visual is configured).
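The rule above can be sketched as follows. This is an illustration under assumptions rather than the merged implementation: `computeOverallPass` is a hypothetical helper, and treating a configured-but-empty `visual` object as passing is a guess.

```typescript
type VisualJudgePlatformReport = {
  pass: boolean;
  screenshotPath?: string;
  scores?: { criterionId: string; pass: boolean; rationale: string }[];
  error?: string;
};

type VisualReports = { ios?: VisualJudgePlatformReport; android?: VisualJudgePlatformReport };

// Hypothetical helper: Layer 1 and Layer 2 must pass, and every judged platform must pass.
function computeOverallPass(
  layer1Pass: boolean,
  layer2Pass: boolean,
  visual?: VisualReports,
): boolean {
  if (!layer1Pass || !layer2Pass) return false;
  if (visual === undefined) return true; // visual not configured: existing behavior
  const reports = [visual.ios, visual.android].filter(
    (r): r is VisualJudgePlatformReport => r !== undefined,
  );
  // Assumption: with no platform judged, there is nothing to fail.
  return reports.every((r) => r.pass);
}
```

Only platforms that were actually judged contribute to the verdict, which lines up with the strict-optional shape of `JudgeResult.visual`.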

Defaults when visual is configured

Rubric: DEFAULT_STAGE1_RUBRIC (3 criteria) from #40 (Layer 3 Phase 5a: visual-judge orchestration + default Stage 1 rubric).
Screenshot dir: tmp/screenshots/<slug>/.
Spec text: domain.displayName; callers can override.
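A minimal sketch of how these defaults might be resolved. Everything here is illustrative: `VisualOptions`, `RubricCriterion`, and `resolveVisualDefaults` are hypothetical names, the rubric entries are placeholders (the real criteria live in #40), and the PR only states explicitly that the spec text can be overridden.

```typescript
// Hypothetical shapes; the real rubric type is defined in #40.
type RubricCriterion = { id: string; description: string };
type VisualOptions = { rubric?: RubricCriterion[]; screenshotDir?: string; spec?: string };

// Placeholder entries only: the PR says the default rubric has 3 criteria but does not list them.
const DEFAULT_STAGE1_RUBRIC: RubricCriterion[] = [
  { id: "criterion-1", description: "placeholder" },
  { id: "criterion-2", description: "placeholder" },
  { id: "criterion-3", description: "placeholder" },
];

// Fills in each default only when the caller left the option unset.
function resolveVisualDefaults(slug: string, displayName: string, opts: VisualOptions = {}) {
  return {
    rubric: opts.rubric ?? DEFAULT_STAGE1_RUBRIC,
    screenshotDir: opts.screenshotDir ?? `tmp/screenshots/${slug}/`,
    spec: opts.spec ?? displayName,
  };
}

const resolved = resolveVisualDefaults("vet-clinic-queue", "Vet Clinic Queue");
console.log(resolved.screenshotDir, resolved.spec);
```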

Out of scope

Wiring visual from dispatch.ts (driver script / plugin entry populates it).
Resolving artifactPath / bundleId / packageName from the slug (driver-side via Info.plist / AndroidManifest.xml read post-build).
Bridge from Layer 2 build mode → installAndLaunch.

These three are user-environment specific (where do builds land, what bundle ID does the substrate use) and best decided once we have a real end-to-end run to point at.

Test plan

npm run ci — 12/12 green. Existing dispatch e2e test unchanged (visual not passed → skip path).
New shape verified via TypeScript strict-optional compilation.
After merge: a driver (or temporary main script) populates visual from build artifacts, runs end-to-end, confirms Layer 3 2/2 pass lands in the summary against a real generated app.

🤖 Generated with Claude Code

Extends JudgeInput with an optional `visual` field. When provided, runJudge calls runVisualJudge per configured platform (#40), captures results into the JudgeResult, and rolls them into the summary line.

Default behavior unchanged: dispatch.ts doesn't pass `visual`, so runJudge takes the existing skip path with a clearer trace ("visual config not provided; skipped") instead of the old "not yet wired" placeholder.

When visual IS configured:

Layer 1 3/3 pass · Layer 2 3/3 pass · Layer 3 2/2 pass

Per-platform reports surface in JudgeResult.visual.{ios,android}:
- pass: boolean (overall median-of-3 verdict)
- screenshotPath: where the captured PNG lives
- scores: per-criterion verdict + rationale
- error: populated when launch / capture / judge failed before Layer 3 produced scores

Default rubric is DEFAULT_STAGE1_RUBRIC (3 criteria, from #40). Default screenshot dir is tmp/screenshots/<slug>/. Default spec text is domain.displayName; callers can override.

JudgeResult shape grows to:

{ overallPass, summary, visual?: { ios?, android? } }

overallPass requires layer1+layer2+visual all passing (when visual is configured). Strict-optional types: visual fields only present when the platform was judged.

Out of scope:
- Wiring `visual` from dispatch.ts (caller-side; driver script / plugin entry point would populate this).
- Resolving artifactPath / bundleId / packageName from the slug (done by the driver via Info.plist / AndroidManifest.xml read).
- Bridge from Layer 2 build mode → installAndLaunch.

Tests: 12/12 npm run ci green. Existing dispatch e2e test still passes unchanged (visual not passed → skip path). New shape verified via TypeScript strict-optional compilation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
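The commit message describes `pass` as a "median-of-3 verdict" without showing how it is computed. One plausible reading, sketched below purely as an assumption, is that three independent judge runs each yield a boolean and the median (equivalently, the majority) wins. `medianOfThree` is a hypothetical helper, not a function from #40 or #41.

```typescript
// Hypothetical helper: median of three boolean verdicts, treating true as 1 and false as 0.
// For three values, the median equals the majority vote.
function medianOfThree(verdicts: [boolean, boolean, boolean]): boolean {
  const sorted = verdicts.map((v) => (v ? 1 : 0)).sort((a, b) => a - b);
  return sorted[1] === 1; // middle element after sorting
}

console.log(medianOfThree([true, false, true]));
```

Taking the median of an odd number of runs means a single outlier verdict cannot flip the result.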

After Layer 2 build mode produces the app bundle / APK, callers need the artifact path + identifier to feed into runVisualJudge (#40, #41). Hardcoding paths or guessing identifiers from the slug is fragile: the substrate's iOS bundle ID is `com.<slugflat>.<pascal>App.ios${TEAM}` where ${TEAM} is the developer's signing team, resolved at build time. The right source of truth is the build outputs.

Adds two resolvers:

discoverIosArtifact(iosDir):
1. Find *.xcodeproj, scheme = filename minus extension
2. xcodebuild -showBuildSettings -json → BUILT_PRODUCTS_DIR + WRAPPER_NAME
3. plutil -extract CFBundleIdentifier raw on the built Info.plist
Returns {appPath, bundleId} | null

discoverAndroidArtifact(androidDir):
1. apkPath = app/build/outputs/apk/debug/app-debug.apk (predictable)
2. Parse `applicationId = "..."` from app/build.gradle.kts
Returns {apkPath, packageName} | null

Both return null gracefully when:
- Build hasn't happened (.app / .apk missing)
- Project layout doesn't match (missing .xcodeproj, missing build.gradle.kts, etc.)
- Tooling fails (xcodebuild / plutil exit non-zero, JSON parse fails)

Why post-build for iOS (vs. parsing project.pbxproj at the source): the substrate's PRODUCT_BUNDLE_IDENTIFIER is `com.<slugflat>.<pascal>App.ios${SAMPLE_CODE_DISAMBIGUATOR}` where SAMPLE_CODE_DISAMBIGUATOR=${DEVELOPMENT_TEAM}. Pre-build it's a template; post-build the .app's Info.plist has the resolved value (e.g. ".iosNNYDL5U3V3" with the user's team ID). Reading the resolved form makes installAndLaunch's bundle ID match what's actually installed.

Real-mode smoke: against existing out/vet-clinic-queue/, Android returned null (no apk built yet), iOS returned a fully-resolved {appPath, bundleId} from a prior xcodebuild run that lived in DerivedData. Both behaved as designed.

Tests: 14/14 npm run ci green.

Out of scope (Phase 5d):
- Wire discovery into a higher-level runner that does build → discover → runVisualJudge in one call.
- Plumb that runner into dispatch with a flag/env var to opt in to Stage 1 visual judging.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
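The parsing halves of the two resolvers can be sketched as pure functions. This is an assumption-laden illustration, not the code from #42: `parseApplicationId` and `appPathFromBuildSettings` are hypothetical names, the sketch skips the filesystem checks and the `xcodebuild` / `plutil` invocations, and it assumes the `-json` output is an array of entries that each carry a `buildSettings` object.

```typescript
// Extracts the value from `applicationId = "..."` in build.gradle.kts text, or null if absent.
function parseApplicationId(gradleKts: string): string | null {
  const match = /applicationId\s*=\s*"([^"]+)"/.exec(gradleKts);
  return match?.[1] ?? null;
}

// Combines BUILT_PRODUCTS_DIR and WRAPPER_NAME from `xcodebuild -showBuildSettings -json` output.
// Returns null on malformed JSON or missing keys, mirroring the "return null gracefully" contract.
function appPathFromBuildSettings(json: string): string | null {
  try {
    const parsed: unknown = JSON.parse(json);
    if (!Array.isArray(parsed)) return null;
    for (const entry of parsed) {
      const settings = (entry as { buildSettings?: Record<string, unknown> } | null)?.buildSettings;
      const dir = settings?.["BUILT_PRODUCTS_DIR"];
      const name = settings?.["WRAPPER_NAME"];
      if (typeof dir === "string" && typeof name === "string") return `${dir}/${name}`;
    }
    return null;
  } catch {
    return null; // JSON parse failure
  }
}

console.log(parseApplicationId('defaultConfig { applicationId = "com.example.app" }'));
```

Keeping the parsing separate from process and filesystem access makes the null paths easy to unit test without a real build.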

One-call wrapper that ties Phase 5c artifact discovery (#42) and Phase 5a visual-judge orchestration (#40) together for both platforms.

Returns a Stage1VisualResult shaped to match JudgeInput.visual's per-platform expectation, so callers can pass it through to runJudge directly:

    const visual = await runStage1Visual({
      iosDir: "./out/<slug>/ios",
      androidDir: "./out/<slug>/android",
      spec: domain.displayName,
    });
    const judge = await runJudge({ ..., visual });

Per-platform behavior:
- Pass undefined to skip the platform.
- If discovery fails (build hasn't happened, project layout unexpected), surfaces a structured VisualJudgeResult with ok=false and an actionable error message ("iOS artifact not discovered (run Layer 2 build mode first)"). This is the same shape as a real launch/capture failure, so downstream aggregation in runJudge (#41) doesn't need a special case.

Caller responsibilities:
- Run Layer 2 in build mode first so .app / .apk exists
- Ensure a sim/emulator is booted for each platform being judged
- Decide which platforms to judge (the function judges only those passed)

Tests: 16/16 npm run ci green.
- Structured failure when artifacts missing ✓
- Empty result when no platforms requested ✓

Out of scope (Phase 5e, the final integration):
- CLI flag / env var that opts dispatch into Stage 1 visual
- Forcing Layer 2 build mode when visual is enabled
- Plumbing the runStage1Visual call into dispatch.ts post-Layer-2

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
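The per-platform behavior described above can be sketched with injected dependencies so it runs without a simulator. This is a hedged illustration, not the code from #43: `runStage1VisualSketch`, `Artifact`, and `Deps` are hypothetical, only `ok` and `error` on `VisualJudgeResult` come from the commit message, and the Android error wording is assumed by analogy with the quoted iOS message.

```typescript
type Platform = "ios" | "android";
type VisualJudgeResult = { ok: boolean; error?: string };
type Stage1VisualResult = { ios?: VisualJudgeResult; android?: VisualJudgeResult };
type Artifact = { path: string; id: string }; // hypothetical merged shape for .app / .apk

// Hypothetical seams standing in for the discovery (#42) and visual-judge (#40) calls.
type Deps = {
  discover: (platform: Platform, dir: string) => Promise<Artifact | null>;
  judge: (platform: Platform, artifact: Artifact, spec: string) => Promise<VisualJudgeResult>;
};

async function runStage1VisualSketch(
  input: { iosDir?: string; androidDir?: string; spec: string },
  deps: Deps,
): Promise<Stage1VisualResult> {
  const result: Stage1VisualResult = {};
  const targets = [["ios", input.iosDir], ["android", input.androidDir]] as const;
  for (const [platform, dir] of targets) {
    if (dir === undefined) continue; // undefined dir: skip this platform entirely
    const artifact = await deps.discover(platform, dir);
    if (artifact === null) {
      // Discovery failure uses the same shape as a launch/capture failure.
      const label = platform === "ios" ? "iOS" : "Android";
      result[platform] = {
        ok: false,
        error: `${label} artifact not discovered (run Layer 2 build mode first)`,
      };
      continue;
    }
    result[platform] = await deps.judge(platform, artifact, input.spec);
  }
  return result;
}
```

Because a failed discovery and a failed judge run share one shape, a caller can aggregate the result without branching on where the failure happened.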

dadachi merged commit f5edd38 into main May 2, 2026
1 check passed

dadachi deleted the layer3-judge-visual-integration branch May 2, 2026 12:47

dadachi mentioned this pull request May 2, 2026

Layer 3 Phase 5c: artifact discovery for iOS .app and Android .apk #42

Merged

3 tasks

dadachi mentioned this pull request May 2, 2026

Layer 3 Phase 5d: runStage1Visual — convenience runner (discover + judge) #43

Merged

4 tasks

dadachi merged 1 commit into main from layer3-judge-visual-integration
