feat(judge_4): browser-augmented review via Playwright MCP (Slice B) by jacsamell · Pull Request #263 · aetheronhq/agent-cube

jacsamell · 2026-06-02T08:06:04Z

Summary

Builds on Slice A (#262). Wires judge_4 (Frontend & UX) to actually use Playwright MCP for visual review.

Component	Behavior
`cube.yaml`	judge_4 declares Playwright via `mcp_servers`
`verify.dev_server:`	new optional block (cmd, url, ready_marker, timeout)
`cube.core.dev_server`	lifecycle helper — process-group spawn, marker/HTTP readiness, idempotent stop
`judge_panel`	brings up dev-server when ≥1 judge has `mcp_servers`, populates `dev_server_url` + `screenshots_dir` on each browser-judge, tears down in `finally`
Persona addendum	injected only for browser-judges; describes the Playwright workflow + where to drop screenshots + how to reference them in the decision file
Decision schema	`screenshots: [{path, caption}]` — optional, tolerant to schema drift
Aggregation	`get_peer_review_status()["screenshots"]` collects across the panel

Failure modes degrade cleanly:

No mcp_servers declared → no dev-server bring-up, zero cost
mcp_servers declared but no verify.dev_server → [BROWSER UNAVAILABLE] block in judge prompt, static review
cmd fails to spawn / marker never appears / timeout → same fallback, dev-server cleaned up

Scope NOT in this slice

Inline screenshot embedding in GitHub PR body (data URLs) → Slice C
File-trigger gating (only bring up on .tsx/.css diff) → Slice C
Multi-viewport / responsive matrix → out of scope
Auth state for protected routes → out of scope

Tests

679/679 pass
16 new tests: test_dev_server.py (8 — config, declared-check, spawn-fail, timeout, marker-success, idempotent stop, process-group cleanup), test_decision_parser.py (8 — screenshot coercion + panel-level aggregation)

🤖 Generated with Claude Code

Overview

Implements Slice B of browser-augmented review for judge_4 (Frontend & UX) via Playwright MCP, building on Slice A. Introduces dev-server lifecycle management with graceful degradation when the server is unavailable. All 679 tests passing, including 16 new tests.

Architecture Changes

Dev-server lifecycle management (cube/core/dev_server.py): New module providing DevServerConfig and DevServerHandle classes. Spawns configured dev-server in process group, detects readiness via marker or HTTP polling, logs output to .prompts/dev-server-<task>.log, and ensures idempotent cleanup on failure.

Conditional bring-up (cube/automation/judge_panel.py): Judge panel now conditionally starts dev-server only when ≥1 judge declares mcp_servers. Populates per-judge dev_server_url and screenshots_dir context, applies Playwright workflow addendum to prompts, and tears down in finally block. Falls back to [BROWSER UNAVAILABLE] block with static-review guidance on spawn/timeout/marker failures.

Playwright configuration (cube.yaml): Judge_4 declares Playwright MCP via mcp_servers.playwright with npx @playwright/mcp@latest.

Screenshot aggregation (cube/core/decision_parser.py): Extended decision schema with optional screenshots: [{path, caption}] field. New _coerce_screenshot_list() normalises various input shapes (missing, strings, dicts) for backward compatibility. Aggregates across judges via get_peer_review_status()["screenshots"].

Configuration support (cube/core/user_config.py): New optional verify.dev_server block with cmd, url, ready_marker, timeout_seconds semantics.

Type extensions (cube/models/types.py): JudgeInfo now carries dev_server_url, dev_server_unavailable_reason, and screenshots_dir.

Test Coverage

New tests in test_dev_server.py (8 tests): Config declaration, spawn failures, timeout/marker detection, process-group cleanup, and idempotent stopping.

New tests in test_decision_parser.py (8 tests): Screenshot coercion edge cases (None, empty, lists, strings, invalid types), missing field tolerance, and cross-judge aggregation.

Failure Modes

Gracefully degrades to static review: no mcp_servers → skip bring-up; no verify.dev_server → unavailable static block; spawn/marker/timeout failures → unavailable block with cleanup. No exceptions propagate; panel continues.

Built on slice A (#262) which plumbed per-judge mcp_servers through the bridge. This wires the actual feature. 1) cube.yaml: judge_4 declares Playwright via mcp_servers. Default shipped config so anyone on Claude SDK gets the tool. Other judges stay tool-free. 2) verify.dev_server: new optional cube.yaml block (cmd, url, ready_marker, timeout_seconds). Cube spins the dev-server up in the writer's worktree before the panel, polls for ready via the marker (preferred) or HEAD-poll of the URL (fallback), tears down after asyncio.gather. Missing block = no behaviour change. 3) cube.core.dev_server: lifecycle helper. Spawns under a fresh process group so SIGTERM cleans up child workers (next dev forks a worker that survives a plain SIGTERM otherwise). Idempotent stop(). Returns unavailable=True on any failure path so the panel keeps running with static-only review. 4) judge_panel: _maybe_start_dev_server runs ONLY when ≥1 judge declares mcp_servers — non-frontend panels pay zero cost. Each browser-judge gets dev_server_url + screenshots_dir populated; the persona addendum reads these to generate the Playwright workflow block. When bring-up fails, judges get [BROWSER UNAVAILABLE] instead, with the reason, and explicit "do not call Playwright tools" guidance — so they degrade cleanly to static review. 5) Decision schema gains optional screenshots field: [{path, caption}]. Tolerant to schema drift (single string, list of strings, missing field all coerce). aggregate / get_peer_review_status surfaces the union across the panel so downstream PR-review code can embed them. Scope NOT in this slice (deferred to slice C): - Embedding screenshots inline in the GitHub PR review body via data URLs. - File-trigger gating (only run dev_server when .tsx/.css/etc diff non-empty). - Multi-viewport / responsive matrix. - Auth state for protected routes. 679/679 tests pass. 16 new tests in test_dev_server.py + test_decision_parser.py covering happy/timeout/spawn-fail paths, process-group cleanup, schema-drift coercion, and panel-level aggregation.

coderabbitai · 2026-06-02T08:06:16Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9a75ed0e-dd56-49da-bab6-ae534a180d45

📥 Commits

Reviewing files that changed from the base of the PR and between 6e5e425 and e63346f.

📒 Files selected for processing (9)

docs/tasks/23-judge4-playwright-mcp.md
python/cube.yaml
python/cube/automation/judge_panel.py
python/cube/core/decision_parser.py
python/cube/core/dev_server.py
python/cube/core/user_config.py
python/cube/models/types.py
tests/core/test_decision_parser.py
tests/core/test_dev_server.py

Walkthrough

This PR implements browser-augmented judge review by adding optional dev-server lifecycle management and Playwright MCP integration. A new DevServerConfig and DevServerHandle module manages process startup with readiness detection via log marker or URL polling. Configuration loading parses verify.dev_server from YAML. The judge panel orchestrator conditionally starts the dev-server before judge execution, populates browser context onto browser-capable judges, and applies conditional prompt addenda (Playwright instructions when available, or a static fallback note). Decision parsing aggregates an optional screenshots field across judges. Comprehensive tests validate dev-server contracts and decision parser screenshot handling.

Possibly related PRs

aetheronhq/agent-cube#172: Modifies python/cube/automation/judge_panel.py prompt construction flow; this PR adds Playwright/dev-server addenda and screenshot support, while the retrieved PR adds lens-based file-scope/worktree routing in the same code path.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…e C) (#264) Two small additions on top of slice B (#263): 1) PR review body lists screenshots produced by browser-judges. ``get_peer_review_status`` already aggregates the ``screenshots`` field across the panel (slice B); the panel-review composer now appends a collapsible ``<details>`` block listing each screenshot path + caption. Inline image rendering via GitHub-side hosting is a follow-up — would need to commit the files to a branch or upload via the asset API. Listing the paths still surfaces visual findings in the PR comment thread, and lets the operator open them locally with ``open .prompts/screenshots/<task>/<judge>/<x>.png``. 2) File-trigger gating on dev-server bring-up. ``_maybe_start_dev_server`` now also checks that the browser-judge actually has at least one frontend file routed to it. Backend-only PRs (judge_4 scope empty / all .py / all .md) skip the bring-up entirely — saves the ``pnpm dev`` startup + teardown on every backend PR. Frontend suffixes covered: tsx/jsx/ts/js/css/scss/ sass/html/svelte/vue/astro. Conservative when ``scoped_files`` is None (file routing didn't run): bring the server up. Cost-of-skip (visual regression slips) > cost-of-spin-up (wasted seconds). 7 new tests in test_judge_panel_browser_gating.py pin the gate's contract across the suffix set, the backend-only case, and the unknown-scope conservative default. 715/715 tests pass. Inline image rendering via GitHub asset upload is the remaining piece for full visual review in the PR thread.

jacsamell merged commit c45a452 into main Jun 2, 2026

jacsamell mentioned this pull request Jun 2, 2026

feat(judge_4): screenshot list in PR body + file-trigger gating (Slice C) #264

Merged

This was referenced Jun 2, 2026

fix(dev-server): reuse already-running dev server instead of port-colliding #266

Merged

fix(dev-server): install dependencies in the worktree before bringing up #268

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(judge_4): browser-augmented review via Playwright MCP (Slice B)#263

feat(judge_4): browser-augmented review via Playwright MCP (Slice B)#263
jacsamell merged 1 commit into
mainfrom
feat/slice-b-judge4-playwright

jacsamell commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Review failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jacsamell commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope NOT in this slice

Tests

Overview

Architecture Changes

Test Coverage

Failure Modes

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Possibly related PRs

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jacsamell commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading