aetheronhq · jacsamell · Jun 2, 2026 · Jun 2, 2026
diff --git a/docs/tasks/23-judge4-playwright-mcp.md b/docs/tasks/23-judge4-playwright-mcp.md
@@ -0,0 +1,211 @@
+# Slice B — Wire Playwright MCP into judge_4 + dev-server bring-up + screenshot routing
+
+Builds on slice A (#262), which plumbed per-judge ``mcp_servers`` from
+``cube.yaml`` all the way through ``SDKBridgeAdapter`` →
+``query({options:{mcpServers}})``. Plumbing has zero behaviour change today
+because no judge declares ``mcp_servers``.
+
+This slice flips judge_4 (Frontend & UX) into a browser-augmented
+reviewer: Playwright MCP for navigation + screenshot, a dev-server cube
+brings up before judge_4 runs, and screenshot files attached to the
+GitHub PR review.
+
+## Why
+
+cube's frontend judge today reviews ``.tsx``/``.css`` diffs purely from
+text. It can spot type errors, prop-drilling, ``next-intl`` violations
+— but it can't see that the button is invisible on dark mode, the
+modal clips off-viewport at 1280px, or the toast renders behind the
+nav. Visual regressions slip through every panel. Adding Playwright
+to judge_4 closes the only gap where text review structurally can't
+catch the bug.
+
+## Scope (in)
+
+1. **judge_4 declares Playwright** in ``cube.yaml``'s ``judges.judge_4.mcp_servers``
+   block. Schema mirrors the Anthropic SDK shape:
+   ```yaml
+   judges:
+     judge_4:
+       model: opus
+       label: "Judge Frontend & UX"
+       persona: "frontend-generic"
+       mcp_servers:
+         playwright:
+           command: "npx"
+           args: ["-y", "@playwright/mcp@latest"]
+   ```
+   The opus default ensures the claude-sdk path gets the MCP servers
+   (only path that honours mcp_servers in slice A).
+
+2. **Dev-server bring-up.** Add an optional ``verify.dev_server:``
+   block:
+   ```yaml
+   verify:
+     install: pnpm install --frozen-lockfile
+     check: pnpm lint && pnpm typecheck && pnpm test
+     dev_server:
+       cmd: pnpm dev
+       url: http://localhost:3000
+       ready_marker: "ready in"
+       timeout_seconds: 60
+   ```
+   - ``cmd``: shell command to start the dev server (run in the writer's
+     worktree, NOT cube's repo).
+   - ``url``: the base URL judge_4's prompt is told to navigate to.
+   - ``ready_marker``: substring cube watches for on the dev-server's
+     stdout/stderr before it considers the server ready. Falls back to
+     polling ``url`` with a HEAD request if not set.
+   - ``timeout_seconds``: hard ceiling for ready detection. On timeout,
+     cube logs a warning, kills the dev-server, and runs judge_4 with
+     a ``[BROWSER UNAVAILABLE]`` note in its prompt — judge still
+     reviews statically.
+
+   When the block is missing, judge_4 runs as before (no browser
+   pass). Out-of-the-box repos see no change.
+
+3. **Lifecycle.** A new helper ``cube.core.dev_server`` exposes:
+   - ``start_dev_server(worktree, config) -> DevServerHandle`` —
+     spawns the command in a process group, hooks stdout/stderr to a
+     log file under ``.prompts/dev-server-<task>.log``, watches for
+     ``ready_marker`` (or polls the URL) until ready or timeout.
+   - ``DevServerHandle.stop()`` — SIGTERM the process group with a
+     5s grace, SIGKILL on timeout (mirrors
+     ``terminate_active_subprocesses`` in core.adapters.base).
+   - Used by the judge-panel handler ONLY for the path where any
+     declared judge has Playwright in its ``mcp_servers``. Other
+     panels skip the bring-up entirely so non-frontend tasks pay no
+     cost.
+   - The base URL is exported into the dev-server's process env as
+     ``CUBE_DEV_SERVER_URL`` AND injected into judge_4's prompt
+     addendum (see step 4).
+
+4. **judge_4 persona addendum.** When a judge has non-empty
+   ``mcp_servers`` AND the dev-server came up, prepend a block to
+   the judge-specific prompt explaining the workflow:
+   - "You have Playwright MCP tools — playwright_navigate,
+     playwright_screenshot, playwright_click, playwright_fill,
+     playwright_pdf, plus the standard observation/eval tools."
+   - "The frontend dev server is running at ``{CUBE_DEV_SERVER_URL}``.
+     Navigate to the changed routes and screenshot the relevant
+     UI states (default, hover, dark mode if applicable, mobile
+     viewport)."
+   - "Save screenshots to
+     ``.prompts/screenshots/<task_id>/judge_4/<descriptive-name>.png``
+     under the cube repo root. Reference each one in your decision
+     file's new ``screenshots:`` field (see schema below)."
+   - "If the route fails to load or the screenshot doesn't match the
+     intent, that's a finding — call it out in
+     ``remaining_issues`` with the file:line of the related diff."
+   - The block is only added when ``mcp_servers`` is non-empty AND
+     the dev-server actually came up. Static-only fallback stays
+     unchanged.
+
+5. **Decision schema extension.** Judge decision files gain an
+   optional ``screenshots:`` field:
+   ```json
+   {
+     "decision": "APPROVED",
+     "remaining_issues": [],
+     "screenshots": [
+       {"path": ".prompts/screenshots/.../button-hover-dark.png",
+        "caption": "Primary button hover state on dark mode"}
+     ]
+   }
+   ```
+   - The aggregator in ``cube.core.decision_parser`` reads this and
+     surfaces the screenshot list in ``get_peer_review_status()``
+     output so callers can use it.
+
+6. **PR review attachment.** When cube posts the panel summary review
+   (``cube.github.reviews``), inline each screenshot referenced in the
+   aggregated decisions:
+   - Use GitHub's image attachment via base64 data URLs in the PR
+     review body (no separate upload API — GitHub renders these inline
+     in markdown).
+   - Cap total inline size at 5 MB across all screenshots; beyond
+     that, list paths instead of embedding.
+   - Group under a ``### Screenshots`` collapsible ``<details>`` block
+     so the panel summary text stays scannable.
+
+## Scope (out)
+
+- **Authenticated routes.** Don't try to thread cookies/session
+  storage through. Repos with auth gates will get
+  ``[BROWSER UNAVAILABLE]`` plus an open follow-up task.
+- **Mobile/responsive matrix.** Single viewport (1280x800) for v1;
+  judge_4 can describe its desired viewport in screenshot captions if
+  it manually resizes via Playwright.
+- **Cross-judge MCP sharing.** Only judge_4 gets Playwright by
+  default. Other judges that want browser access opt in via their own
+  ``mcp_servers`` block — same plumbing.
+- **Slice C polish.** File-trigger gating (only run dev-server when
+  ``.tsx``/``.css``/etc diff non-empty) lives in a follow-up — for
+  v1, every panel run with a declared dev-server pays the bring-up
+  cost.
+
+## Acceptance criteria
+
+1. ``cube.yaml`` parses the new ``verify.dev_server:`` block without
+   error; missing block = no behaviour change.
+2. ``judge_4`` declares Playwright in shipped ``cube.yaml``.
+3. ``cube prv`` on a repo with a real dev-server brings up the server,
+   judge_4 navigates + screenshots, files land under
+   ``.prompts/screenshots/``, screenshots appear in the GitHub PR
+   review.
+4. ``cube prv`` on a repo with NO ``dev_server`` config: judge_4
+   reviews statically, no warning spam, behaviour identical to today.
+5. ``cube prv`` when dev_server ``cmd`` fails to start: judge_4
+   reviews statically with a ``[BROWSER UNAVAILABLE]`` note in its
+   prompt, panel completes, no orphan ``pnpm dev`` processes left
+   running.
+6. ``cube prv`` when ``ready_marker`` never appears within
+   ``timeout_seconds``: same fallback as (5).
+7. Decision parser handles missing ``screenshots`` field (every
+   existing decision file) without crashing.
+8. Aggregation surfaces the screenshot list correctly even when one
+   judge emits an empty list and another emits multiple entries.
+
+## Tests required
+
+- ``tests/core/test_dev_server.py`` — ``start_dev_server`` happy path
+  (process spawns, ``ready_marker`` matches, ``stop()`` SIGTERMs the
+  group), timeout path, ``cmd`` exits non-zero, ``ready_marker``
+  detection via URL poll fallback.
+- ``tests/core/test_decision_parser.py`` — ``screenshots`` field
+  round-trips through ``parse_single_decision_file`` and aggregation;
+  missing field defaults to empty list (no crash).
+- ``tests/github/test_reviews.py`` — PR review body embeds
+  screenshots as data URLs, caps total size, falls back to path-only
+  listing when over cap.
+
+## Files likely to touch
+
+- ``python/cube.yaml`` — judge_4 mcp_servers declaration
+- ``python/cube/core/user_config.py`` — VerifyConfig gains
+  ``dev_server`` sub-field
+- ``python/cube/core/dev_server.py`` — NEW, dev-server lifecycle
+- ``python/cube/automation/judge_panel.py`` — bring-up call site,
+  prompt addendum injection
+- ``python/cube/core/decision_parser.py`` — read screenshots field
+- ``python/cube/github/reviews.py`` — embed screenshots in PR body
+- Test files above
+
+## Non-goals to call out explicitly
+
+- Don't rewrite ``cube.github.reviews`` — surgical addition only.
+- Don't add a "frontend-judge-only" code path; the dev-server runs
+  for any judge that declares Playwright. Future judges (Security
+  doing browser security review, for instance) inherit the same
+  bring-up.
+- Don't gate on file-type triggers in this slice. That's slice C.
+
+## Notes for the writer
+
+Take a smaller-PR approach if you can split this into two commits
+within the worktree — (a) dev-server lifecycle helper +
+``VerifyConfig`` extension + tests, (b) judge_4 wiring + PR review
+embed + tests. Either way: keep ``mcp_servers``-free judges unchanged.
+
+Read ``docs/tasks/22-steering-phase-e-telemetry.md`` for the
+conventions cube uses on these task specs; mirror them.
diff --git a/python/cube.yaml b/python/cube.yaml
@@ -147,6 +147,16 @@ judges:
     label: "Judge Frontend & UX"
     color: "blue"
     persona: "frontend-generic"
+    # Browser-augmented review (Slice B). When the repo also configures
+    # `verify.dev_server`, cube spins up the dev server, points judge_4
+    # at it via Playwright MCP, and pulls back screenshots referenced
+    # in the decision file's `screenshots` field. Without dev_server
+    # configured, the judge falls back to static review with a
+    # [BROWSER UNAVAILABLE] note.
+    mcp_servers:
+      playwright:
+        command: "npx"
+        args: ["-y", "@playwright/mcp@latest"]
 
   judge_5:
     model: "gpt"