Skip to content

feat(judge_4): browser-augmented review via Playwright MCP (Slice B)#263

Merged
jacsamell merged 1 commit into
mainfrom
feat/slice-b-judge4-playwright
Jun 2, 2026
Merged

feat(judge_4): browser-augmented review via Playwright MCP (Slice B)#263
jacsamell merged 1 commit into
mainfrom
feat/slice-b-judge4-playwright

Conversation

@jacsamell
Copy link
Copy Markdown
Contributor

@jacsamell jacsamell commented Jun 2, 2026

Summary

Builds on Slice A (#262). Wires judge_4 (Frontend & UX) to actually use Playwright MCP for visual review.

Component Behavior
cube.yaml judge_4 declares Playwright via mcp_servers
verify.dev_server: new optional block (cmd, url, ready_marker, timeout)
cube.core.dev_server lifecycle helper — process-group spawn, marker/HTTP readiness, idempotent stop
judge_panel brings up dev-server when ≥1 judge has mcp_servers, populates dev_server_url + screenshots_dir on each browser-judge, tears down in finally
Persona addendum injected only for browser-judges; describes the Playwright workflow + where to drop screenshots + how to reference them in the decision file
Decision schema screenshots: [{path, caption}] — optional, tolerant to schema drift
Aggregation get_peer_review_status()["screenshots"] collects across the panel

Failure modes degrade cleanly:

  • No mcp_servers declared → no dev-server bring-up, zero cost
  • mcp_servers declared but no verify.dev_server[BROWSER UNAVAILABLE] block in judge prompt, static review
  • cmd fails to spawn / marker never appears / timeout → same fallback, dev-server cleaned up

Scope NOT in this slice

  • Inline screenshot embedding in GitHub PR body (data URLs) → Slice C
  • File-trigger gating (only bring up on .tsx/.css diff) → Slice C
  • Multi-viewport / responsive matrix → out of scope
  • Auth state for protected routes → out of scope

Tests

  • 679/679 pass
  • 16 new tests: test_dev_server.py (8 — config, declared-check, spawn-fail, timeout, marker-success, idempotent stop, process-group cleanup), test_decision_parser.py (8 — screenshot coercion + panel-level aggregation)

🤖 Generated with Claude Code

Overview

Implements Slice B of browser-augmented review for judge_4 (Frontend & UX) via Playwright MCP, building on Slice A. Introduces dev-server lifecycle management with graceful degradation when the server is unavailable. All 679 tests passing, including 16 new tests.

Architecture Changes

Dev-server lifecycle management (cube/core/dev_server.py): New module providing DevServerConfig and DevServerHandle classes. Spawns configured dev-server in process group, detects readiness via marker or HTTP polling, logs output to .prompts/dev-server-<task>.log, and ensures idempotent cleanup on failure.

Conditional bring-up (cube/automation/judge_panel.py): Judge panel now conditionally starts dev-server only when ≥1 judge declares mcp_servers. Populates per-judge dev_server_url and screenshots_dir context, applies Playwright workflow addendum to prompts, and tears down in finally block. Falls back to [BROWSER UNAVAILABLE] block with static-review guidance on spawn/timeout/marker failures.

Playwright configuration (cube.yaml): Judge_4 declares Playwright MCP via mcp_servers.playwright with npx @playwright/mcp@latest.

Screenshot aggregation (cube/core/decision_parser.py): Extended decision schema with optional screenshots: [{path, caption}] field. New _coerce_screenshot_list() normalises various input shapes (missing, strings, dicts) for backward compatibility. Aggregates across judges via get_peer_review_status()["screenshots"].

Configuration support (cube/core/user_config.py): New optional verify.dev_server block with cmd, url, ready_marker, timeout_seconds semantics.

Type extensions (cube/models/types.py): JudgeInfo now carries dev_server_url, dev_server_unavailable_reason, and screenshots_dir.

Test Coverage

New tests in test_dev_server.py (8 tests): Config declaration, spawn failures, timeout/marker detection, process-group cleanup, and idempotent stopping.

New tests in test_decision_parser.py (8 tests): Screenshot coercion edge cases (None, empty, lists, strings, invalid types), missing field tolerance, and cross-judge aggregation.

Failure Modes

Gracefully degrades to static review: no mcp_servers → skip bring-up; no verify.dev_server → unavailable static block; spawn/marker/timeout failures → unavailable block with cleanup. No exceptions propagate; panel continues.

Built on slice A (#262) which plumbed per-judge mcp_servers through
the bridge. This wires the actual feature.

1) cube.yaml: judge_4 declares Playwright via mcp_servers. Default
   shipped config so anyone on Claude SDK gets the tool. Other judges
   stay tool-free.

2) verify.dev_server: new optional cube.yaml block (cmd, url,
   ready_marker, timeout_seconds). Cube spins the dev-server up in
   the writer's worktree before the panel, polls for ready via the
   marker (preferred) or HEAD-poll of the URL (fallback), tears down
   after asyncio.gather. Missing block = no behaviour change.

3) cube.core.dev_server: lifecycle helper. Spawns under a fresh
   process group so SIGTERM cleans up child workers (next dev forks
   a worker that survives a plain SIGTERM otherwise). Idempotent
   stop(). Returns unavailable=True on any failure path so the panel
   keeps running with static-only review.

4) judge_panel: _maybe_start_dev_server runs ONLY when ≥1 judge
   declares mcp_servers — non-frontend panels pay zero cost. Each
   browser-judge gets dev_server_url + screenshots_dir populated;
   the persona addendum reads these to generate the Playwright
   workflow block. When bring-up fails, judges get
   [BROWSER UNAVAILABLE] instead, with the reason, and explicit
   "do not call Playwright tools" guidance — so they degrade
   cleanly to static review.

5) Decision schema gains optional screenshots field:
   [{path, caption}]. Tolerant to schema drift (single string,
   list of strings, missing field all coerce). aggregate /
   get_peer_review_status surfaces the union across the panel so
   downstream PR-review code can embed them.

Scope NOT in this slice (deferred to slice C):
- Embedding screenshots inline in the GitHub PR review body via
  data URLs.
- File-trigger gating (only run dev_server when .tsx/.css/etc diff
  non-empty).
- Multi-viewport / responsive matrix.
- Auth state for protected routes.

679/679 tests pass. 16 new tests in test_dev_server.py +
test_decision_parser.py covering happy/timeout/spawn-fail paths,
process-group cleanup, schema-drift coercion, and panel-level
aggregation.
@jacsamell jacsamell merged commit c45a452 into main Jun 2, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9a75ed0e-dd56-49da-bab6-ae534a180d45

📥 Commits

Reviewing files that changed from the base of the PR and between 6e5e425 and e63346f.

📒 Files selected for processing (9)
  • docs/tasks/23-judge4-playwright-mcp.md
  • python/cube.yaml
  • python/cube/automation/judge_panel.py
  • python/cube/core/decision_parser.py
  • python/cube/core/dev_server.py
  • python/cube/core/user_config.py
  • python/cube/models/types.py
  • tests/core/test_decision_parser.py
  • tests/core/test_dev_server.py

Walkthrough

This PR implements browser-augmented judge review by adding optional dev-server lifecycle management and Playwright MCP integration. A new DevServerConfig and DevServerHandle module manages process startup with readiness detection via log marker or URL polling. Configuration loading parses verify.dev_server from YAML. The judge panel orchestrator conditionally starts the dev-server before judge execution, populates browser context onto browser-capable judges, and applies conditional prompt addenda (Playwright instructions when available, or a static fallback note). Decision parsing aggregates an optional screenshots field across judges. Comprehensive tests validate dev-server contracts and decision parser screenshot handling.

Possibly related PRs

  • aetheronhq/agent-cube#172: Modifies python/cube/automation/judge_panel.py prompt construction flow; this PR adds Playwright/dev-server addenda and screenshot support, while the retrieved PR adds lens-based file-scope/worktree routing in the same code path.

Comment @coderabbitai help to get the list of available commands and usage tips.

jacsamell added a commit that referenced this pull request Jun 2, 2026
…e C) (#264)

Two small additions on top of slice B (#263):

1) PR review body lists screenshots produced by browser-judges.

   ``get_peer_review_status`` already aggregates the ``screenshots``
   field across the panel (slice B); the panel-review composer now
   appends a collapsible ``<details>`` block listing each screenshot
   path + caption. Inline image rendering via GitHub-side hosting is
   a follow-up — would need to commit the files to a branch or upload
   via the asset API. Listing the paths still surfaces visual findings
   in the PR comment thread, and lets the operator open them locally
   with ``open .prompts/screenshots/<task>/<judge>/<x>.png``.

2) File-trigger gating on dev-server bring-up.

   ``_maybe_start_dev_server`` now also checks that the browser-judge
   actually has at least one frontend file routed to it. Backend-only
   PRs (judge_4 scope empty / all .py / all .md) skip the bring-up
   entirely — saves the ``pnpm dev`` startup + teardown on every
   backend PR. Frontend suffixes covered: tsx/jsx/ts/js/css/scss/
   sass/html/svelte/vue/astro.

   Conservative when ``scoped_files`` is None (file routing didn't
   run): bring the server up. Cost-of-skip (visual regression slips)
   > cost-of-spin-up (wasted seconds).

7 new tests in test_judge_panel_browser_gating.py pin the gate's
contract across the suffix set, the backend-only case, and the
unknown-scope conservative default. 715/715 tests pass.

Inline image rendering via GitHub asset upload is the remaining piece
for full visual review in the PR thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant