feat(judge_4): browser-augmented review via Playwright MCP (Slice B)#263
Conversation
Built on slice A (#262) which plumbed per-judge mcp_servers through the bridge. This wires the actual feature. 1) cube.yaml: judge_4 declares Playwright via mcp_servers. Default shipped config so anyone on Claude SDK gets the tool. Other judges stay tool-free. 2) verify.dev_server: new optional cube.yaml block (cmd, url, ready_marker, timeout_seconds). Cube spins the dev-server up in the writer's worktree before the panel, polls for ready via the marker (preferred) or HEAD-poll of the URL (fallback), tears down after asyncio.gather. Missing block = no behaviour change. 3) cube.core.dev_server: lifecycle helper. Spawns under a fresh process group so SIGTERM cleans up child workers (next dev forks a worker that survives a plain SIGTERM otherwise). Idempotent stop(). Returns unavailable=True on any failure path so the panel keeps running with static-only review. 4) judge_panel: _maybe_start_dev_server runs ONLY when ≥1 judge declares mcp_servers — non-frontend panels pay zero cost. Each browser-judge gets dev_server_url + screenshots_dir populated; the persona addendum reads these to generate the Playwright workflow block. When bring-up fails, judges get [BROWSER UNAVAILABLE] instead, with the reason, and explicit "do not call Playwright tools" guidance — so they degrade cleanly to static review. 5) Decision schema gains optional screenshots field: [{path, caption}]. Tolerant to schema drift (single string, list of strings, missing field all coerce). aggregate / get_peer_review_status surfaces the union across the panel so downstream PR-review code can embed them. Scope NOT in this slice (deferred to slice C): - Embedding screenshots inline in the GitHub PR review body via data URLs. - File-trigger gating (only run dev_server when .tsx/.css/etc diff non-empty). - Multi-viewport / responsive matrix. - Auth state for protected routes. 679/679 tests pass. 16 new tests in test_dev_server.py + test_decision_parser.py covering happy/timeout/spawn-fail paths, process-group cleanup, schema-drift coercion, and panel-level aggregation.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (9)
WalkthroughThis PR implements browser-augmented judge review by adding optional dev-server lifecycle management and Playwright MCP integration. A new Possibly related PRs
Comment |
…e C) (#264) Two small additions on top of slice B (#263): 1) PR review body lists screenshots produced by browser-judges. ``get_peer_review_status`` already aggregates the ``screenshots`` field across the panel (slice B); the panel-review composer now appends a collapsible ``<details>`` block listing each screenshot path + caption. Inline image rendering via GitHub-side hosting is a follow-up — would need to commit the files to a branch or upload via the asset API. Listing the paths still surfaces visual findings in the PR comment thread, and lets the operator open them locally with ``open .prompts/screenshots/<task>/<judge>/<x>.png``. 2) File-trigger gating on dev-server bring-up. ``_maybe_start_dev_server`` now also checks that the browser-judge actually has at least one frontend file routed to it. Backend-only PRs (judge_4 scope empty / all .py / all .md) skip the bring-up entirely — saves the ``pnpm dev`` startup + teardown on every backend PR. Frontend suffixes covered: tsx/jsx/ts/js/css/scss/ sass/html/svelte/vue/astro. Conservative when ``scoped_files`` is None (file routing didn't run): bring the server up. Cost-of-skip (visual regression slips) > cost-of-spin-up (wasted seconds). 7 new tests in test_judge_panel_browser_gating.py pin the gate's contract across the suffix set, the backend-only case, and the unknown-scope conservative default. 715/715 tests pass. Inline image rendering via GitHub asset upload is the remaining piece for full visual review in the PR thread.
Summary
Builds on Slice A (#262). Wires judge_4 (Frontend & UX) to actually use Playwright MCP for visual review.
cube.yamlmcp_serversverify.dev_server:cube.core.dev_serverjudge_panelmcp_servers, populatesdev_server_url+screenshots_diron each browser-judge, tears down infinallyscreenshots: [{path, caption}]— optional, tolerant to schema driftget_peer_review_status()["screenshots"]collects across the panelFailure modes degrade cleanly:
mcp_serversdeclared → no dev-server bring-up, zero costmcp_serversdeclared but noverify.dev_server→[BROWSER UNAVAILABLE]block in judge prompt, static reviewcmdfails to spawn / marker never appears / timeout → same fallback, dev-server cleaned upScope NOT in this slice
Tests
test_dev_server.py(8 — config, declared-check, spawn-fail, timeout, marker-success, idempotent stop, process-group cleanup),test_decision_parser.py(8 — screenshot coercion + panel-level aggregation)🤖 Generated with Claude Code
Overview
Implements Slice B of browser-augmented review for judge_4 (Frontend & UX) via Playwright MCP, building on Slice A. Introduces dev-server lifecycle management with graceful degradation when the server is unavailable. All 679 tests passing, including 16 new tests.
Architecture Changes
Dev-server lifecycle management (
cube/core/dev_server.py): New module providingDevServerConfigandDevServerHandleclasses. Spawns configured dev-server in process group, detects readiness via marker or HTTP polling, logs output to.prompts/dev-server-<task>.log, and ensures idempotent cleanup on failure.Conditional bring-up (
cube/automation/judge_panel.py): Judge panel now conditionally starts dev-server only when ≥1 judge declaresmcp_servers. Populates per-judgedev_server_urlandscreenshots_dircontext, applies Playwright workflow addendum to prompts, and tears down infinallyblock. Falls back to[BROWSER UNAVAILABLE]block with static-review guidance on spawn/timeout/marker failures.Playwright configuration (
cube.yaml): Judge_4 declares Playwright MCP viamcp_servers.playwrightwithnpx@playwright/mcp@latest.Screenshot aggregation (
cube/core/decision_parser.py): Extended decision schema with optionalscreenshots: [{path, caption}]field. New_coerce_screenshot_list()normalises various input shapes (missing, strings, dicts) for backward compatibility. Aggregates across judges viaget_peer_review_status()["screenshots"].Configuration support (
cube/core/user_config.py): New optionalverify.dev_serverblock withcmd,url,ready_marker,timeout_secondssemantics.Type extensions (
cube/models/types.py):JudgeInfonow carriesdev_server_url,dev_server_unavailable_reason, andscreenshots_dir.Test Coverage
New tests in
test_dev_server.py(8 tests): Config declaration, spawn failures, timeout/marker detection, process-group cleanup, and idempotent stopping.New tests in
test_decision_parser.py(8 tests): Screenshot coercion edge cases (None, empty, lists, strings, invalid types), missing field tolerance, and cross-judge aggregation.Failure Modes
Gracefully degrades to static review: no
mcp_servers→ skip bring-up; noverify.dev_server→ unavailable static block; spawn/marker/timeout failures → unavailable block with cleanup. No exceptions propagate; panel continues.