Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
211 changes: 211 additions & 0 deletions docs/tasks/23-judge4-playwright-mcp.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,211 @@
# Slice B — Wire Playwright MCP into judge_4 + dev-server bring-up + screenshot routing

Builds on slice A (#262), which plumbed per-judge ``mcp_servers`` from
``cube.yaml`` all the way through ``SDKBridgeAdapter`` →
``query({options:{mcpServers}})``. Plumbing has zero behaviour change today
because no judge declares ``mcp_servers``.

This slice flips judge_4 (Frontend & UX) into a browser-augmented
reviewer: Playwright MCP for navigation + screenshot, a dev-server cube
brings up before judge_4 runs, and screenshot files attached to the
GitHub PR review.

## Why

cube's frontend judge today reviews ``.tsx``/``.css`` diffs purely from
text. It can spot type errors, prop-drilling, ``next-intl`` violations
— but it can't see that the button is invisible on dark mode, the
modal clips off-viewport at 1280px, or the toast renders behind the
nav. Visual regressions slip through every panel. Adding Playwright
to judge_4 closes the only gap where text review structurally can't
catch the bug.

## Scope (in)

1. **judge_4 declares Playwright** in ``cube.yaml``'s ``judges.judge_4.mcp_servers``
block. Schema mirrors the Anthropic SDK shape:
```yaml
judges:
judge_4:
model: opus
label: "Judge Frontend & UX"
persona: "frontend-generic"
mcp_servers:
playwright:
command: "npx"
args: ["-y", "@playwright/mcp@latest"]
```
The opus default ensures the claude-sdk path gets the MCP servers
(only path that honours mcp_servers in slice A).

2. **Dev-server bring-up.** Add an optional ``verify.dev_server:``
block:
```yaml
verify:
install: pnpm install --frozen-lockfile
check: pnpm lint && pnpm typecheck && pnpm test
dev_server:
cmd: pnpm dev
url: http://localhost:3000
ready_marker: "ready in"
timeout_seconds: 60
```
- ``cmd``: shell command to start the dev server (run in the writer's
worktree, NOT cube's repo).
- ``url``: the base URL judge_4's prompt is told to navigate to.
- ``ready_marker``: substring cube watches for on the dev-server's
stdout/stderr before it considers the server ready. Falls back to
polling ``url`` with a HEAD request if not set.
- ``timeout_seconds``: hard ceiling for ready detection. On timeout,
cube logs a warning, kills the dev-server, and runs judge_4 with
a ``[BROWSER UNAVAILABLE]`` note in its prompt — judge still
reviews statically.

When the block is missing, judge_4 runs as before (no browser
pass). Out-of-the-box repos see no change.

3. **Lifecycle.** A new helper ``cube.core.dev_server`` exposes:
- ``start_dev_server(worktree, config) -> DevServerHandle`` —
spawns the command in a process group, hooks stdout/stderr to a
log file under ``.prompts/dev-server-<task>.log``, watches for
``ready_marker`` (or polls the URL) until ready or timeout.
- ``DevServerHandle.stop()`` — SIGTERM the process group with a
5s grace, SIGKILL on timeout (mirrors
``terminate_active_subprocesses`` in core.adapters.base).
- Used by the judge-panel handler ONLY for the path where any
declared judge has Playwright in its ``mcp_servers``. Other
panels skip the bring-up entirely so non-frontend tasks pay no
cost.
- The base URL is exported into the dev-server's process env as
``CUBE_DEV_SERVER_URL`` AND injected into judge_4's prompt
addendum (see step 4).

4. **judge_4 persona addendum.** When a judge has non-empty
``mcp_servers`` AND the dev-server came up, prepend a block to
the judge-specific prompt explaining the workflow:
- "You have Playwright MCP tools — playwright_navigate,
playwright_screenshot, playwright_click, playwright_fill,
playwright_pdf, plus the standard observation/eval tools."
- "The frontend dev server is running at ``{CUBE_DEV_SERVER_URL}``.
Navigate to the changed routes and screenshot the relevant
UI states (default, hover, dark mode if applicable, mobile
viewport)."
- "Save screenshots to
``.prompts/screenshots/<task_id>/judge_4/<descriptive-name>.png``
under the cube repo root. Reference each one in your decision
file's new ``screenshots:`` field (see schema below)."
- "If the route fails to load or the screenshot doesn't match the
intent, that's a finding — call it out in
``remaining_issues`` with the file:line of the related diff."
- The block is only added when ``mcp_servers`` is non-empty AND
the dev-server actually came up. Static-only fallback stays
unchanged.

5. **Decision schema extension.** Judge decision files gain an
optional ``screenshots:`` field:
```json
{
"decision": "APPROVED",
"remaining_issues": [],
"screenshots": [
{"path": ".prompts/screenshots/.../button-hover-dark.png",
"caption": "Primary button hover state on dark mode"}
]
}
```
- The aggregator in ``cube.core.decision_parser`` reads this and
surfaces the screenshot list in ``get_peer_review_status()``
output so callers can use it.

6. **PR review attachment.** When cube posts the panel summary review
(``cube.github.reviews``), inline each screenshot referenced in the
aggregated decisions:
- Use GitHub's image attachment via base64 data URLs in the PR
review body (no separate upload API — GitHub renders these inline
in markdown).
- Cap total inline size at 5 MB across all screenshots; beyond
that, list paths instead of embedding.
- Group under a ``### Screenshots`` collapsible ``<details>`` block
so the panel summary text stays scannable.

## Scope (out)

- **Authenticated routes.** Don't try to thread cookies/session
storage through. Repos with auth gates will get
``[BROWSER UNAVAILABLE]`` plus an open follow-up task.
- **Mobile/responsive matrix.** Single viewport (1280x800) for v1;
judge_4 can describe its desired viewport in screenshot captions if
it manually resizes via Playwright.
- **Cross-judge MCP sharing.** Only judge_4 gets Playwright by
default. Other judges that want browser access opt in via their own
``mcp_servers`` block — same plumbing.
- **Slice C polish.** File-trigger gating (only run dev-server when
``.tsx``/``.css``/etc diff non-empty) lives in a follow-up — for
v1, every panel run with a declared dev-server pays the bring-up
cost.

## Acceptance criteria

1. ``cube.yaml`` parses the new ``verify.dev_server:`` block without
error; missing block = no behaviour change.
2. ``judge_4`` declares Playwright in shipped ``cube.yaml``.
3. ``cube prv`` on a repo with a real dev-server brings up the server,
judge_4 navigates + screenshots, files land under
``.prompts/screenshots/``, screenshots appear in the GitHub PR
review.
4. ``cube prv`` on a repo with NO ``dev_server`` config: judge_4
reviews statically, no warning spam, behaviour identical to today.
5. ``cube prv`` when dev_server ``cmd`` fails to start: judge_4
reviews statically with a ``[BROWSER UNAVAILABLE]`` note in its
prompt, panel completes, no orphan ``pnpm dev`` processes left
running.
6. ``cube prv`` when ``ready_marker`` never appears within
``timeout_seconds``: same fallback as (5).
7. Decision parser handles missing ``screenshots`` field (every
existing decision file) without crashing.
8. Aggregation surfaces the screenshot list correctly even when one
judge emits an empty list and another emits multiple entries.

## Tests required

- ``tests/core/test_dev_server.py`` — ``start_dev_server`` happy path
(process spawns, ``ready_marker`` matches, ``stop()`` SIGTERMs the
group), timeout path, ``cmd`` exits non-zero, ``ready_marker``
detection via URL poll fallback.
- ``tests/core/test_decision_parser.py`` — ``screenshots`` field
round-trips through ``parse_single_decision_file`` and aggregation;
missing field defaults to empty list (no crash).
- ``tests/github/test_reviews.py`` — PR review body embeds
screenshots as data URLs, caps total size, falls back to path-only
listing when over cap.

## Files likely to touch

- ``python/cube.yaml`` — judge_4 mcp_servers declaration
- ``python/cube/core/user_config.py`` — VerifyConfig gains
``dev_server`` sub-field
- ``python/cube/core/dev_server.py`` — NEW, dev-server lifecycle
- ``python/cube/automation/judge_panel.py`` — bring-up call site,
prompt addendum injection
- ``python/cube/core/decision_parser.py`` — read screenshots field
- ``python/cube/github/reviews.py`` — embed screenshots in PR body
- Test files above

## Non-goals to call out explicitly

- Don't rewrite ``cube.github.reviews`` — surgical addition only.
- Don't add a "frontend-judge-only" code path; the dev-server runs
for any judge that declares Playwright. Future judges (Security
doing browser security review, for instance) inherit the same
bring-up.
- Don't gate on file-type triggers in this slice. That's slice C.

## Notes for the writer

Take a smaller-PR approach if you can split this into two commits
within the worktree — (a) dev-server lifecycle helper +
``VerifyConfig`` extension + tests, (b) judge_4 wiring + PR review
embed + tests. Either way: keep ``mcp_servers``-free judges unchanged.

Read ``docs/tasks/22-steering-phase-e-telemetry.md`` for the
conventions cube uses on these task specs; mirror them.
10 changes: 10 additions & 0 deletions python/cube.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,16 @@ judges:
label: "Judge Frontend & UX"
color: "blue"
persona: "frontend-generic"
# Browser-augmented review (Slice B). When the repo also configures
# `verify.dev_server`, cube spins up the dev server, points judge_4
# at it via Playwright MCP, and pulls back screenshots referenced
# in the decision file's `screenshots` field. Without dev_server
# configured, the judge falls back to static review with a
# [BROWSER UNAVAILABLE] note.
mcp_servers:
playwright:
command: "npx"
args: ["-y", "@playwright/mcp@latest"]

judge_5:
model: "gpt"
Expand Down
Loading
Loading