diff --git a/docs/tasks/23-judge4-playwright-mcp.md b/docs/tasks/23-judge4-playwright-mcp.md new file mode 100644 index 00000000..52c763c4 --- /dev/null +++ b/docs/tasks/23-judge4-playwright-mcp.md @@ -0,0 +1,211 @@ +# Slice B — Wire Playwright MCP into judge_4 + dev-server bring-up + screenshot routing + +Builds on slice A (#262), which plumbed per-judge ``mcp_servers`` from +``cube.yaml`` all the way through ``SDKBridgeAdapter`` → +``query({options:{mcpServers}})``. Plumbing has zero behaviour change today +because no judge declares ``mcp_servers``. + +This slice flips judge_4 (Frontend & UX) into a browser-augmented +reviewer: Playwright MCP for navigation + screenshot, a dev-server cube +brings up before judge_4 runs, and screenshot files attached to the +GitHub PR review. + +## Why + +cube's frontend judge today reviews ``.tsx``/``.css`` diffs purely from +text. It can spot type errors, prop-drilling, ``next-intl`` violations +— but it can't see that the button is invisible on dark mode, the +modal clips off-viewport at 1280px, or the toast renders behind the +nav. Visual regressions slip through every panel. Adding Playwright +to judge_4 closes the only gap where text review structurally can't +catch the bug. + +## Scope (in) + +1. **judge_4 declares Playwright** in ``cube.yaml``'s ``judges.judge_4.mcp_servers`` + block. Schema mirrors the Anthropic SDK shape: + ```yaml + judges: + judge_4: + model: opus + label: "Judge Frontend & UX" + persona: "frontend-generic" + mcp_servers: + playwright: + command: "npx" + args: ["-y", "@playwright/mcp@latest"] + ``` + The opus default ensures the claude-sdk path gets the MCP servers + (only path that honours mcp_servers in slice A). + +2. **Dev-server bring-up.** Add an optional ``verify.dev_server:`` + block: + ```yaml + verify: + install: pnpm install --frozen-lockfile + check: pnpm lint && pnpm typecheck && pnpm test + dev_server: + cmd: pnpm dev + url: http://localhost:3000 + ready_marker: "ready in" + timeout_seconds: 60 + ``` + - ``cmd``: shell command to start the dev server (run in the writer's + worktree, NOT cube's repo). + - ``url``: the base URL judge_4's prompt is told to navigate to. + - ``ready_marker``: substring cube watches for on the dev-server's + stdout/stderr before it considers the server ready. Falls back to + polling ``url`` with a HEAD request if not set. + - ``timeout_seconds``: hard ceiling for ready detection. On timeout, + cube logs a warning, kills the dev-server, and runs judge_4 with + a ``[BROWSER UNAVAILABLE]`` note in its prompt — judge still + reviews statically. + + When the block is missing, judge_4 runs as before (no browser + pass). Out-of-the-box repos see no change. + +3. **Lifecycle.** A new helper ``cube.core.dev_server`` exposes: + - ``start_dev_server(worktree, config) -> DevServerHandle`` — + spawns the command in a process group, hooks stdout/stderr to a + log file under ``.prompts/dev-server-.log``, watches for + ``ready_marker`` (or polls the URL) until ready or timeout. + - ``DevServerHandle.stop()`` — SIGTERM the process group with a + 5s grace, SIGKILL on timeout (mirrors + ``terminate_active_subprocesses`` in core.adapters.base). + - Used by the judge-panel handler ONLY for the path where any + declared judge has Playwright in its ``mcp_servers``. Other + panels skip the bring-up entirely so non-frontend tasks pay no + cost. + - The base URL is exported into the dev-server's process env as + ``CUBE_DEV_SERVER_URL`` AND injected into judge_4's prompt + addendum (see step 4). + +4. **judge_4 persona addendum.** When a judge has non-empty + ``mcp_servers`` AND the dev-server came up, prepend a block to + the judge-specific prompt explaining the workflow: + - "You have Playwright MCP tools — playwright_navigate, + playwright_screenshot, playwright_click, playwright_fill, + playwright_pdf, plus the standard observation/eval tools." + - "The frontend dev server is running at ``{CUBE_DEV_SERVER_URL}``. + Navigate to the changed routes and screenshot the relevant + UI states (default, hover, dark mode if applicable, mobile + viewport)." + - "Save screenshots to + ``.prompts/screenshots//judge_4/.png`` + under the cube repo root. Reference each one in your decision + file's new ``screenshots:`` field (see schema below)." + - "If the route fails to load or the screenshot doesn't match the + intent, that's a finding — call it out in + ``remaining_issues`` with the file:line of the related diff." + - The block is only added when ``mcp_servers`` is non-empty AND + the dev-server actually came up. Static-only fallback stays + unchanged. + +5. **Decision schema extension.** Judge decision files gain an + optional ``screenshots:`` field: + ```json + { + "decision": "APPROVED", + "remaining_issues": [], + "screenshots": [ + {"path": ".prompts/screenshots/.../button-hover-dark.png", + "caption": "Primary button hover state on dark mode"} + ] + } + ``` + - The aggregator in ``cube.core.decision_parser`` reads this and + surfaces the screenshot list in ``get_peer_review_status()`` + output so callers can use it. + +6. **PR review attachment.** When cube posts the panel summary review + (``cube.github.reviews``), inline each screenshot referenced in the + aggregated decisions: + - Use GitHub's image attachment via base64 data URLs in the PR + review body (no separate upload API — GitHub renders these inline + in markdown). + - Cap total inline size at 5 MB across all screenshots; beyond + that, list paths instead of embedding. + - Group under a ``### Screenshots`` collapsible ``
`` block + so the panel summary text stays scannable. + +## Scope (out) + +- **Authenticated routes.** Don't try to thread cookies/session + storage through. Repos with auth gates will get + ``[BROWSER UNAVAILABLE]`` plus an open follow-up task. +- **Mobile/responsive matrix.** Single viewport (1280x800) for v1; + judge_4 can describe its desired viewport in screenshot captions if + it manually resizes via Playwright. +- **Cross-judge MCP sharing.** Only judge_4 gets Playwright by + default. Other judges that want browser access opt in via their own + ``mcp_servers`` block — same plumbing. +- **Slice C polish.** File-trigger gating (only run dev-server when + ``.tsx``/``.css``/etc diff non-empty) lives in a follow-up — for + v1, every panel run with a declared dev-server pays the bring-up + cost. + +## Acceptance criteria + +1. ``cube.yaml`` parses the new ``verify.dev_server:`` block without + error; missing block = no behaviour change. +2. ``judge_4`` declares Playwright in shipped ``cube.yaml``. +3. ``cube prv`` on a repo with a real dev-server brings up the server, + judge_4 navigates + screenshots, files land under + ``.prompts/screenshots/``, screenshots appear in the GitHub PR + review. +4. ``cube prv`` on a repo with NO ``dev_server`` config: judge_4 + reviews statically, no warning spam, behaviour identical to today. +5. ``cube prv`` when dev_server ``cmd`` fails to start: judge_4 + reviews statically with a ``[BROWSER UNAVAILABLE]`` note in its + prompt, panel completes, no orphan ``pnpm dev`` processes left + running. +6. ``cube prv`` when ``ready_marker`` never appears within + ``timeout_seconds``: same fallback as (5). +7. Decision parser handles missing ``screenshots`` field (every + existing decision file) without crashing. +8. Aggregation surfaces the screenshot list correctly even when one + judge emits an empty list and another emits multiple entries. + +## Tests required + +- ``tests/core/test_dev_server.py`` — ``start_dev_server`` happy path + (process spawns, ``ready_marker`` matches, ``stop()`` SIGTERMs the + group), timeout path, ``cmd`` exits non-zero, ``ready_marker`` + detection via URL poll fallback. +- ``tests/core/test_decision_parser.py`` — ``screenshots`` field + round-trips through ``parse_single_decision_file`` and aggregation; + missing field defaults to empty list (no crash). +- ``tests/github/test_reviews.py`` — PR review body embeds + screenshots as data URLs, caps total size, falls back to path-only + listing when over cap. + +## Files likely to touch + +- ``python/cube.yaml`` — judge_4 mcp_servers declaration +- ``python/cube/core/user_config.py`` — VerifyConfig gains + ``dev_server`` sub-field +- ``python/cube/core/dev_server.py`` — NEW, dev-server lifecycle +- ``python/cube/automation/judge_panel.py`` — bring-up call site, + prompt addendum injection +- ``python/cube/core/decision_parser.py`` — read screenshots field +- ``python/cube/github/reviews.py`` — embed screenshots in PR body +- Test files above + +## Non-goals to call out explicitly + +- Don't rewrite ``cube.github.reviews`` — surgical addition only. +- Don't add a "frontend-judge-only" code path; the dev-server runs + for any judge that declares Playwright. Future judges (Security + doing browser security review, for instance) inherit the same + bring-up. +- Don't gate on file-type triggers in this slice. That's slice C. + +## Notes for the writer + +Take a smaller-PR approach if you can split this into two commits +within the worktree — (a) dev-server lifecycle helper + +``VerifyConfig`` extension + tests, (b) judge_4 wiring + PR review +embed + tests. Either way: keep ``mcp_servers``-free judges unchanged. + +Read ``docs/tasks/22-steering-phase-e-telemetry.md`` for the +conventions cube uses on these task specs; mirror them. diff --git a/python/cube.yaml b/python/cube.yaml index 145c3359..388cb035 100644 --- a/python/cube.yaml +++ b/python/cube.yaml @@ -147,6 +147,16 @@ judges: label: "Judge Frontend & UX" color: "blue" persona: "frontend-generic" + # Browser-augmented review (Slice B). When the repo also configures + # `verify.dev_server`, cube spins up the dev server, points judge_4 + # at it via Playwright MCP, and pulls back screenshots referenced + # in the decision file's `screenshots` field. Without dev_server + # configured, the judge falls back to static review with a + # [BROWSER UNAVAILABLE] note. + mcp_servers: + playwright: + command: "npx" + args: ["-y", "@playwright/mcp@latest"] judge_5: model: "gpt" diff --git a/python/cube/automation/judge_panel.py b/python/cube/automation/judge_panel.py index 6d1d1d94..1fc7856e 100644 --- a/python/cube/automation/judge_panel.py +++ b/python/cube/automation/judge_panel.py @@ -10,7 +10,7 @@ from ..core.config import PROJECT_ROOT, WORKTREE_BASE, get_project_root, get_worktree_path from ..core.dynamic_layout import DynamicLayout from ..core.git import branch_exists, fetch_branches, get_commit_hash, sync_worktree -from ..core.output import console, print_error, print_info, print_success +from ..core.output import console, print_error, print_info, print_success, print_warning from ..core.parsers.registry import get_parser from ..core.session import delete_session, load_session, load_session_head_sha, load_session_metadata, save_session from ..core.user_config import get_judge_configs, get_writer_by_key, get_writer_by_key_or_metadata, load_config @@ -119,6 +119,87 @@ def _judge_session_metadata(jconfig, cli_name: str) -> str: return f"{jconfig.label} ({jconfig.model}) | cli={cli_name}" +def _maybe_start_dev_server(judges: list, config, task_id: str): + """Bring up the configured dev-server when ANY judge in the panel + declares ``mcp_servers``. Returns the handle or None. + + Browser-tool judges (Playwright via MCP) need an actual running app + to point at. cube spins up the dev-server in the writer's worktree + once per panel, shares the URL across every browser-judge, tears + down after ``asyncio.gather`` completes. + + Skipped entirely when no judge declares MCP — non-frontend panels + pay no cost. Also skipped when ``verify.dev_server.cmd`` is empty + (operator didn't configure one) — the addendum will then surface + "[BROWSER UNAVAILABLE]" so the judge degrades to static review. + """ + needs_browser = any(j.mcp_servers for j in judges) + if not needs_browser: + return None + + if not config.verify.dev_server.declared: + print_warning( + "judge declares mcp_servers but verify.dev_server.cmd is empty in " + "cube.yaml — browser tools will be available but no app is running. " + "Configure verify.dev_server to enable the browser pass." + ) + return None + + # Pick a worktree to run the dev-server in. For panel reviews, use + # the first writer's worktree (the one judges will be reading + # source from). For PR reviews, the dedicated review_worktree was + # already set up by the caller — use it. + from ..core.config import PROJECT_ROOT, WORKTREE_BASE + from ..core.dev_server import start as start_dev_server + + review_worktrees = {j.review_worktree for j in judges if j.review_worktree} + if review_worktrees: + worktree = next(iter(review_worktrees)) + else: + project_name = Path(PROJECT_ROOT).name + first_writer = next(iter(config.writers.values())) + worktree = WORKTREE_BASE / project_name / f"writer-{first_writer.name}-{task_id}" + + return start_dev_server( + worktree=worktree, + config=config.verify.dev_server, + task_id=task_id, + ) + + +def _populate_browser_context(judges: list, handle, task_id: str) -> None: + """Attach dev-server URL + screenshots directory to each judge that + declared ``mcp_servers``. The addendum builder reads these to + generate the right prompt block. + + When the handle is None or marked unavailable, judges get the + ``unavailable_reason`` instead so the prompt explicitly tells them + NOT to call Playwright tools (they'd fail against a dead server). + """ + if not any(j.mcp_servers for j in judges): + return + + from ..core.config import PROJECT_ROOT + + screenshots_root = Path(PROJECT_ROOT) / ".prompts" / "screenshots" / task_id + + for j in judges: + if not j.mcp_servers: + continue + if handle is None: + j.dev_server_unavailable_reason = ( + "verify.dev_server.cmd is empty in cube.yaml; configure it to " "enable browser-augmented review." + ) + continue + if handle.unavailable: + j.dev_server_unavailable_reason = handle.unavailable_reason + continue + j.dev_server_url = handle.url + j_dir = screenshots_root / j.key + j_dir.mkdir(parents=True, exist_ok=True) + j.screenshots_dir = j_dir + + def _apply_judge_persona(prompt: str, judge_info: JudgeInfo) -> str: """Prepend judge-specific persona instructions when configured.""" base = prompt @@ -132,10 +213,102 @@ def _apply_judge_persona(prompt: str, judge_info: JudgeInfo) -> str: --- {base}""" + base = _apply_browser_addendum(base, judge_info) base = _apply_file_scope(base, judge_info) return _apply_resume_delta(base, judge_info) +def _apply_browser_addendum(prompt: str, judge_info: JudgeInfo) -> str: + """Prepend Playwright MCP instructions when this judge has a browser + declared AND the dev-server came up. + + Three cases: + + 1. Judge has no ``mcp_servers`` declared → no addendum, prompt + unchanged. + 2. Judge has ``mcp_servers`` AND dev-server came up → "you have + Playwright tools, here's where to navigate, here's where to + drop screenshots, reference them in your decision file's + ``screenshots`` field." + 3. Judge has ``mcp_servers`` but dev-server failed to come up → + "[BROWSER UNAVAILABLE] reason: . Review statically." Lets + the judge complete without trying to call Playwright tools + against a non-existent server. + """ + if not judge_info.mcp_servers: + return prompt + + if judge_info.dev_server_unavailable_reason: + unavailable_block = f"""# [BROWSER UNAVAILABLE] + +A browser tool (Playwright MCP) was configured for this judge but the +dev server could not be brought up: + + {judge_info.dev_server_unavailable_reason} + +Review the diff statically. Do NOT call Playwright tools — they will +fail. If the review hinges on visual behaviour that text review can't +verify, flag it in ``remaining_issues`` so a human can spin up the +dev server and re-check. + +--- + +""" + return unavailable_block + prompt + + if not judge_info.dev_server_url or judge_info.screenshots_dir is None: + return prompt + + screenshots_dir = judge_info.screenshots_dir + addendum = f"""# BROWSER REVIEW (Playwright MCP available) + +You have Playwright MCP tools available for this review. Use them to +visually verify any UI change in the diff: + + - ``playwright_navigate`` — open a route at the running dev server + - ``playwright_screenshot`` — capture the current viewport (full-page + or element-scoped) + - ``playwright_click`` / ``playwright_fill`` / ``playwright_select_option`` + — drive interactions to reach the state you want to verify + - ``playwright_evaluate`` — run JS to inspect computed styles, ARIA + labels, layout + +Dev server is running at: **{judge_info.dev_server_url}** + +For each changed route in the diff: + 1. Navigate to it. + 2. Screenshot the relevant states (default, hover/focus where + applicable, dark mode if the project has it, empty / error / + loading states if the diff touches them). + 3. Save each screenshot under: ``{screenshots_dir}/.png`` + (e.g. ``checkout-empty-cart-dark.png``). + 4. Reference every screenshot you took in your decision file's + ``screenshots`` field: + + ```json + "screenshots": [ + {{"path": "{screenshots_dir}/checkout-empty-cart-dark.png", + "caption": "Empty cart on dark mode — note the contrast issue on the CTA"}} + ] + ``` + +Visual findings count as ``remaining_issues`` like any text-level +finding — file:line of the related diff line, plus a short caption. +If a route fails to load, the screenshot doesn't match the diff's +intent, or you spot a regression text review couldn't catch, surface +it. + +If you decide the change is non-visual (pure refactor, types-only, +test-only, server-side), you can skip the browser pass — but say so +explicitly in your review summary so the operator knows you considered +it. + +--- + +""" + return addendum + prompt + + def _apply_resume_delta(prompt: str, judge_info: JudgeInfo) -> str: """Prepend a 'DELTA SINCE LAST REVIEW' block when resuming across a HEAD bump. @@ -1638,35 +1811,46 @@ async def launch_judge_panel( console.print(f"⏳ Waiting for all {len(judges)} judges to complete...") console.print() - # Start layout AFTER printing startup messages - panel_layout.start() - - # Per-judge stats collected via the ``_out_stats`` out-param plumbed - # through ``run_judge`` — captures attempts, log path, and wall-clock - # duration around the full execution (including retries). - judge_stats: dict[str, dict] = {j.key: {"attempts": 0, "log_path": None, "duration": 0.0} for j in judges} - - async def _run_judge_timed(judge: JudgeInfo): - start = time.monotonic() - try: - line_count = await run_judge( - judge, - prompt, - actual_resume_mode, - panel_layout, - winner=winner, - _out_stats=judge_stats[judge.key], - ) - return line_count - finally: - judge_stats[judge.key]["duration"] = time.monotonic() - start + # Bring up the dev-server when any judge declares mcp_servers (browser + # tools need a running app to point at). Skipped entirely otherwise — + # zero cost for non-browser panels. The handle is shared by every + # browser-judge in this panel; we tear it down after asyncio.gather. + dev_server_handle = _maybe_start_dev_server(judges, config, task_id) + try: + _populate_browser_context(judges, dev_server_handle, task_id) + + # Start layout AFTER printing startup messages + panel_layout.start() + + # Per-judge stats collected via the ``_out_stats`` out-param plumbed + # through ``run_judge`` — captures attempts, log path, and wall-clock + # duration around the full execution (including retries). + judge_stats: dict[str, dict] = {j.key: {"attempts": 0, "log_path": None, "duration": 0.0} for j in judges} + + async def _run_judge_timed(judge: JudgeInfo): + start = time.monotonic() + try: + line_count = await run_judge( + judge, + prompt, + actual_resume_mode, + panel_layout, + winner=winner, + _out_stats=judge_stats[judge.key], + ) + return line_count + finally: + judge_stats[judge.key]["duration"] = time.monotonic() - start - run_results = await asyncio.gather( - *(_run_judge_timed(judge) for judge in judges), - return_exceptions=True, - ) + run_results = await asyncio.gather( + *(_run_judge_timed(judge) for judge in judges), + return_exceptions=True, + ) - DynamicLayout.close() + DynamicLayout.close() + finally: + if dev_server_handle is not None: + dev_server_handle.stop() # Build the per-judge result list. Order: skipped-approved judges first # (they were partitioned out before fan-out), then runnable judges in diff --git a/python/cube/core/decision_parser.py b/python/cube/core/decision_parser.py index 6a27a61b..25bd232a 100644 --- a/python/cube/core/decision_parser.py +++ b/python/cube/core/decision_parser.py @@ -33,6 +33,43 @@ def _coerce_issue_list(value: Any) -> list: return [] +def _coerce_screenshot_list(value: Any) -> list[dict]: + """Normalise a judge's ``screenshots`` field to a list of + ``{path, caption}`` dicts. + + Tolerates the common deviations browser-judges produce: + + - missing field → empty list (old decision files are unaffected) + - list of dicts (canonical) → as-is, after coercing each entry to + have ``path`` and ``caption`` string fields + - list of strings → wrap each as ``{path: "...", caption: ""}`` + - single string → wrap as a one-element list + - any other → empty list (defensive) + + Aggregation in ``get_peer_review_status`` collects these across the + panel so the PR-review code can surface every screenshot in one + place without re-parsing. + """ + if not value: + return [] + if isinstance(value, str): + return [{"path": value.strip(), "caption": ""}] if value.strip() else [] + if not isinstance(value, list): + return [] + + out: list[dict] = [] + for entry in value: + if isinstance(entry, str): + text = entry.strip() + if text: + out.append({"path": text, "caption": ""}) + elif isinstance(entry, dict): + path = str(entry.get("path") or "").strip() + if path: + out.append({"path": path, "caption": str(entry.get("caption") or "").strip()}) + return out + + def normalize_decision(decision: str) -> str: """Normalize decision string to canonical form. @@ -488,6 +525,7 @@ def get_peer_review_status( from .user_config import get_judge_configs all_issues = [] + all_screenshots: list = [] approvals = 0 decisions_found = 0 judge_decisions = {} @@ -505,6 +543,12 @@ def get_peer_review_status( decisions_found += 1 decision = normalize_decision(data.get("decision", "UNKNOWN")) + # Screenshots are optional and arrive from browser-judges only. + # Tolerant: missing field → empty list (old decision files just + # work). Aggregate across the panel so the PR-review code can + # surface every screenshot in one place. + for shot in _coerce_screenshot_list(data.get("screenshots", [])): + all_screenshots.append(shot) # Judges occasionally emit ``remaining_issues`` / ``blocker_issues`` # as a single string instead of a list (the schema asks for a list # but the model sometimes returns ``"one bullet describing the @@ -533,4 +577,5 @@ def get_peer_review_status( "decisions_found": decisions_found, "approvals": approvals, "judge_decisions": judge_decisions, + "screenshots": all_screenshots, } diff --git a/python/cube/core/dev_server.py b/python/cube/core/dev_server.py new file mode 100644 index 00000000..6e4bfdcd --- /dev/null +++ b/python/cube/core/dev_server.py @@ -0,0 +1,263 @@ +"""Dev-server lifecycle for browser-augmented judges. + +When a judge declares Playwright (or any browser MCP) via +``cube.yaml``'s ``judges..mcp_servers``, cube needs an actual +running app to point Playwright at. This module brings up the dev +server before the panel, watches for readiness, and tears it down +afterwards — same lifecycle for every panel that has at least one +browser-judge declared. + +Lifecycle: + +1. ``start(worktree, config) -> DevServerHandle``: spawns ``config.cmd`` + in the writer's worktree under a fresh process group, hooks + stdout/stderr to a per-task log file, watches for the + ``ready_marker`` substring on the stream (falls back to polling + ``config.url`` with HEAD if no marker is configured), returns a + handle when ready. +2. ``DevServerHandle.stop()``: SIGTERMs the process group with a + short grace period, SIGKILLs anything still alive. Idempotent. +3. ``DevServerHandle.unavailable``: True when start timed out or + the process exited before reaching ready. Judges can fall back to + static review when this is set. + +Configuration lives on ``VerifyConfig.dev_server`` (see +``user_config.py``). Repos that don't declare a dev_server block get +``None`` and cube skips the bring-up entirely — no behaviour change. +""" + +from __future__ import annotations + +import os +import signal +import subprocess +import time +import urllib.error +import urllib.request +from dataclasses import dataclass, field +from pathlib import Path +from typing import Optional + +from .output import print_info, print_warning + + +@dataclass +class DevServerConfig: + """Per-repo dev-server config, mirrored from ``cube.yaml``. + + Loaded under ``verify.dev_server:``. When all fields are empty, the + block is considered "not declared" — cube skips bring-up. + """ + + cmd: str = "" + url: str = "http://localhost:3000" + ready_marker: str = "" + timeout_seconds: int = 60 + + @property + def declared(self) -> bool: + """True if the operator opted in. Empty cmd = not declared.""" + return bool(self.cmd.strip()) + + +@dataclass +class DevServerHandle: + """Live dev-server process. Use via context manager preferred. + + ``unavailable=True`` means the bring-up failed (timeout, crash, or + config not declared). Judges that needed the server can fall back to + static review. + """ + + process: Optional[subprocess.Popen] = None + url: str = "" + log_path: Optional[Path] = None + unavailable: bool = False + unavailable_reason: str = "" + _stopped: bool = field(default=False, repr=False) + + def __enter__(self) -> "DevServerHandle": + return self + + def __exit__(self, exc_type, exc_val, exc_tb) -> None: + self.stop() + + def stop(self) -> None: + """SIGTERM the process group; SIGKILL anything still alive after 5s. + + Idempotent. Uses the process group so child processes spawned by + the dev-server (``node``, ``next dev``, ``webpack``, etc.) are + cleaned up too — most dev-server commands fork a worker, and + signalling only the immediate child leaks the worker. + """ + if self._stopped: + return + self._stopped = True + + if self.process is None or self.process.poll() is not None: + return + + pgid = self._pgid() + try: + if pgid is not None: + os.killpg(pgid, signal.SIGTERM) + else: + self.process.terminate() + except ProcessLookupError: + return # already gone + + try: + self.process.wait(timeout=5) + return + except subprocess.TimeoutExpired: + pass + + try: + if pgid is not None: + os.killpg(pgid, signal.SIGKILL) + else: + self.process.kill() + except ProcessLookupError: + return + try: + self.process.wait(timeout=2) + except subprocess.TimeoutExpired: + # Truly stuck. Nothing more we can do without leaking + # operator-visible noise; log and move on. + print_warning( + f"dev-server pid {self.process.pid} did not exit after SIGKILL — " + "leaving zombie. Operator may need to kill manually." + ) + + def _pgid(self) -> Optional[int]: + if self.process is None: + return None + try: + return os.getpgid(self.process.pid) + except (OSError, ProcessLookupError): + return None + + +def _poll_url(url: str, timeout: float = 2.0) -> bool: + """HEAD the URL; True on any 2xx/3xx, False otherwise. + + Used as the readiness fallback when ``ready_marker`` is not + configured (or when the marker hasn't appeared yet but the server + may already be serving). Wide success window (2xx/3xx) because + Next.js and friends often 307 the root to the locale prefix on + first request. + """ + try: + req = urllib.request.Request(url, method="HEAD") + with urllib.request.urlopen(req, timeout=timeout) as resp: + return 200 <= resp.status < 400 + except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError, OSError): + return False + + +def _unavailable(reason: str, log_path: Optional[Path]) -> DevServerHandle: + """Construct a handle that signals "not available" without a process.""" + print_warning(f"dev-server unavailable: {reason}") + return DevServerHandle( + process=None, + url="", + log_path=log_path, + unavailable=True, + unavailable_reason=reason, + _stopped=True, + ) + + +def start( + worktree: Path, + config: DevServerConfig, + task_id: str, + log_dir: Optional[Path] = None, +) -> DevServerHandle: + """Spawn the dev server in ``worktree`` and block until ready or timeout. + + On any failure (config not declared, command fails to spawn, marker + never appears, port never opens) returns a handle with + ``unavailable=True`` so callers can fall back to static review + without further error handling. + + The log file path is included on the handle so judges' prompts can + reference it for diagnostics ("If your screenshot 500s, tail + .prompts/dev-server-.log"). + """ + if not config.declared: + return _unavailable("dev_server.cmd is empty in cube.yaml", None) + + log_path = (log_dir or worktree.parent) / f"dev-server-{task_id}.log" + log_path.parent.mkdir(parents=True, exist_ok=True) + log_handle = open(log_path, "wb", buffering=0) + + print_info(f"dev-server: starting `{config.cmd}` in {worktree}") + try: + process = subprocess.Popen( + config.cmd, + shell=True, + cwd=worktree, + stdout=log_handle, + stderr=subprocess.STDOUT, + # New process group so we can signal children (Next.js / Vite + # fork workers that survive a plain SIGTERM to the immediate + # PID). + start_new_session=True, + ) + except (OSError, FileNotFoundError) as e: + log_handle.close() + return _unavailable(f"spawn failed: {e}", log_path) + + handle = DevServerHandle(process=process, url=config.url, log_path=log_path) + if _wait_for_ready(handle, config, log_path): + print_info(f"dev-server: ready at {config.url}") + return handle + + # Bring-up failed. Tear down whatever did start, return unavailable + # so the caller can proceed with static-only review. + handle.stop() + log_handle.close() + return _unavailable( + f"never became ready within {config.timeout_seconds}s " f"(see {log_path} for stdout/stderr)", + log_path, + ) + + +def _wait_for_ready(handle: DevServerHandle, config: DevServerConfig, log_path: Path) -> bool: + """Block until either: + - ``config.ready_marker`` appears in the log file (preferred), or + - a HEAD ``config.url`` returns 2xx/3xx (fallback / always-runs). + + Bails out early if the process exits — no point waiting for a + marker on a dead process. + """ + deadline = time.monotonic() + max(1, config.timeout_seconds) + marker = config.ready_marker.encode() if config.ready_marker else b"" + + while time.monotonic() < deadline: + # Process died — no marker, no port, give up. + assert handle.process is not None + if handle.process.poll() is not None: + print_warning(f"dev-server process exited with code {handle.process.returncode} " f"before becoming ready") + return False + + # Marker check (cheap: scan from start). We tolerate a tiny + # missed-read window because Popen's stdout is buffered through + # the log file rather than read directly. + if marker: + try: + content = log_path.read_bytes() + if marker in content: + return True + except (OSError, FileNotFoundError): + pass + + # URL poll fallback — always runs even if marker is set, so a + # mistyped marker doesn't prevent ready detection. + if _poll_url(config.url): + return True + + time.sleep(0.5) + + return False diff --git a/python/cube/core/user_config.py b/python/cube/core/user_config.py index fbbaaf0d..c1774148 100644 --- a/python/cube/core/user_config.py +++ b/python/cube/core/user_config.py @@ -57,6 +57,9 @@ def session_key(self, task_id: str, review_type: str) -> str: return f"{self.key.upper()}_{task_id}_{review_type}" +from .dev_server import DevServerConfig # noqa: E402 (avoid circular import order) + + @dataclass class VerifyConfig: """Deterministic verify gate run between writer and judge phases. @@ -84,6 +87,10 @@ class VerifyConfig: check: str = "" timeout_seconds: int = 600 max_attempts: int = 3 + # Optional dev-server bring-up — only used by browser-augmented + # judges (judge_4 with Playwright). When the block is missing or + # ``cmd`` is empty, cube skips the bring-up entirely. + dev_server: DevServerConfig = field(default_factory=DevServerConfig) @property def effective_check(self) -> str: @@ -315,12 +322,20 @@ def load_config() -> CubeConfig: prompter_model = _resolve_model_alias(prompter.get("model", "sonnet-4.6-thinking"), model_aliases) verify_raw = data.get("verify") or {} + dev_server_raw = verify_raw.get("dev_server") or {} + dev_server_cfg = DevServerConfig( + cmd=str(dev_server_raw.get("cmd", "")).strip(), + url=str(dev_server_raw.get("url", "http://localhost:3000")).strip(), + ready_marker=str(dev_server_raw.get("ready_marker", "")).strip(), + timeout_seconds=int(dev_server_raw.get("timeout_seconds", 60)), + ) verify_cfg = VerifyConfig( cmd=str(verify_raw.get("cmd", "")).strip(), install=str(verify_raw.get("install", "")).strip(), check=str(verify_raw.get("check", "")).strip(), timeout_seconds=int(verify_raw.get("timeout_seconds", 600)), max_attempts=int(verify_raw.get("max_attempts", 3)), + dev_server=dev_server_cfg, ) _config_cache = CubeConfig( diff --git a/python/cube/models/types.py b/python/cube/models/types.py index d1233a93..d62ff8bd 100644 --- a/python/cube/models/types.py +++ b/python/cube/models/types.py @@ -50,6 +50,13 @@ class JudgeInfo: # browser-based UI review on judge_4). Empty / None = no MCP, judge # runs with bare built-in tools. mcp_servers: dict[str, Any] = field(default_factory=dict) + # Browser context, populated by the panel when the dev-server bring-up + # ran for this judge (i.e. the judge declared mcp_servers). The judge's + # prompt addendum references these to tell it where to navigate and + # where to drop screenshots. None = no browser pass; addendum omitted. + dev_server_url: Optional[str] = None + dev_server_unavailable_reason: Optional[str] = None + screenshots_dir: Optional[Path] = None @dataclass diff --git a/tests/core/test_decision_parser.py b/tests/core/test_decision_parser.py index e0076e1d..94fcafdf 100644 --- a/tests/core/test_decision_parser.py +++ b/tests/core/test_decision_parser.py @@ -205,6 +205,117 @@ def mock_find(judge_key, task_id, review_type, project_root=None): assert len(decisions) == 2 +class TestScreenshotCoercion: + """``screenshots`` is a new optional field on judge decision files + (browser-judges produce it via Playwright MCP). The coercion must be + backward-compatible with every existing decision file shape. + """ + + def test_missing_field_yields_empty_list(self): + from cube.core.decision_parser import _coerce_screenshot_list + + assert _coerce_screenshot_list(None) == [] + assert _coerce_screenshot_list([]) == [] + + def test_canonical_list_of_dicts_passes_through(self): + from cube.core.decision_parser import _coerce_screenshot_list + + shots = _coerce_screenshot_list( + [ + {"path": "/a.png", "caption": "Empty cart"}, + {"path": "/b.png", "caption": ""}, + ] + ) + assert shots == [ + {"path": "/a.png", "caption": "Empty cart"}, + {"path": "/b.png", "caption": ""}, + ] + + def test_list_of_strings_gets_wrapped(self): + """Judges sometimes drop just paths without captions.""" + from cube.core.decision_parser import _coerce_screenshot_list + + shots = _coerce_screenshot_list(["/a.png", "/b.png"]) + assert shots == [ + {"path": "/a.png", "caption": ""}, + {"path": "/b.png", "caption": ""}, + ] + + def test_single_string_wraps_to_one_entry(self): + from cube.core.decision_parser import _coerce_screenshot_list + + assert _coerce_screenshot_list("/single.png") == [{"path": "/single.png", "caption": ""}] + + def test_dict_without_path_is_dropped(self): + """Defensive — only entries with a non-empty path count.""" + from cube.core.decision_parser import _coerce_screenshot_list + + assert _coerce_screenshot_list([{"caption": "no path"}]) == [] + assert _coerce_screenshot_list([{"path": ""}]) == [] + + def test_garbage_returns_empty(self): + from cube.core.decision_parser import _coerce_screenshot_list + + assert _coerce_screenshot_list(42) == [] + assert _coerce_screenshot_list({"shape": "wrong"}) == [] + + +class TestPeerReviewStatusScreenshots: + """``get_peer_review_status`` surfaces screenshots aggregated across + every judge in the panel. The PR-review code reads this to embed + screenshots in the GitHub review body. + """ + + def test_missing_screenshots_field_is_tolerated(self, tmp_path, mock_config, monkeypatch): + from cube.core.decision_parser import get_peer_review_status + + decisions_dir = tmp_path / ".prompts" / "decisions" + decisions_dir.mkdir(parents=True) + (decisions_dir / "judge_1-mytask-peer-review.json").write_text( + json.dumps({"decision": "APPROVED", "remaining_issues": []}) + ) + (decisions_dir / "judge_2-mytask-peer-review.json").write_text( + json.dumps({"decision": "APPROVED", "remaining_issues": []}) + ) + status = get_peer_review_status("mytask", project_root=tmp_path) + assert status["screenshots"] == [] + assert status["approved"] is True + + def test_screenshots_aggregate_across_judges(self, tmp_path, mock_config, monkeypatch): + from cube.core.decision_parser import get_peer_review_status + + decisions_dir = tmp_path / ".prompts" / "decisions" + decisions_dir.mkdir(parents=True) + (decisions_dir / "judge_1-mytask-peer-review.json").write_text( + json.dumps( + { + "decision": "APPROVED", + "remaining_issues": [], + "screenshots": [{"path": ".prompts/screenshots/mytask/judge_1/a.png", "caption": "first"}], + } + ) + ) + (decisions_dir / "judge_2-mytask-peer-review.json").write_text( + json.dumps( + { + "decision": "REQUEST_CHANGES", + "remaining_issues": ["a thing"], + "screenshots": [ + {"path": ".prompts/screenshots/mytask/judge_2/b.png", "caption": "second"}, + {"path": ".prompts/screenshots/mytask/judge_2/c.png", "caption": ""}, + ], + } + ) + ) + status = get_peer_review_status("mytask", project_root=tmp_path) + paths = sorted(s["path"] for s in status["screenshots"]) + assert paths == [ + ".prompts/screenshots/mytask/judge_1/a.png", + ".prompts/screenshots/mytask/judge_2/b.png", + ".prompts/screenshots/mytask/judge_2/c.png", + ] + + class TestNormalizeDecisionSynonyms: """`normalize_decision` must recognise common judge-output drift so unrecognised negatives don't slip past the auto-approve gate. diff --git a/tests/core/test_dev_server.py b/tests/core/test_dev_server.py new file mode 100644 index 00000000..d4083b36 --- /dev/null +++ b/tests/core/test_dev_server.py @@ -0,0 +1,144 @@ +"""Tests for ``cube.core.dev_server``. + +Covers the three contracts judges depend on: + +1. ``DevServerConfig.declared`` is False when the operator hasn't + configured a dev server. Cube skips the bring-up entirely. +2. ``start()`` returns ``unavailable=True`` (not an exception) when the + command fails to spawn, exits early, or never reaches ready within + the configured timeout. Lets the panel proceed with static-only + review instead of crashing. +3. ``DevServerHandle.stop()`` is idempotent and signals the process + group (so dev-server worker processes get cleaned up too). +""" + +from __future__ import annotations + +import sys +import time + +from cube.core.dev_server import DevServerConfig, start + + +def test_dev_server_config_undeclared_when_cmd_empty(): + """No ``cmd`` → not declared → cube skips bring-up. Out-of-box + behaviour for any repo that doesn't opt in to browser review.""" + assert DevServerConfig().declared is False + assert DevServerConfig(cmd="").declared is False + assert DevServerConfig(cmd=" ").declared is False + + +def test_dev_server_config_declared_when_cmd_set(): + assert DevServerConfig(cmd="pnpm dev").declared is True + + +def test_start_returns_unavailable_when_not_declared(tmp_path): + """No cmd → returns a sentinel handle with ``unavailable=True`` and + no process. The panel populates this onto judges so their persona + prompt gets the [BROWSER UNAVAILABLE] block instead of trying to + call Playwright tools against thin air. + """ + handle = start(tmp_path, DevServerConfig(), task_id="t") + assert handle.unavailable is True + assert handle.process is None + assert "empty" in handle.unavailable_reason + + +def test_start_returns_unavailable_when_spawn_fails(tmp_path): + """``cmd`` exists but the underlying binary doesn't / dies + immediately → bring-up fails gracefully. ``shell=True`` so we + exercise the path subprocess actually takes.""" + cfg = DevServerConfig( + cmd="/this/binary/does/not/exist --whatever", + url="http://127.0.0.1:1", # unreachable + timeout_seconds=2, + ) + handle = start(tmp_path, cfg, task_id="t") + assert handle.unavailable is True + + +def test_start_returns_unavailable_when_marker_never_appears(tmp_path): + """Process spawns but never emits the configured ready_marker AND + the URL never opens → timeout path. The handle is unavailable but + the process has been cleaned up (no orphan ``sleep`` left running). + """ + cfg = DevServerConfig( + # Sleep for longer than the timeout, never emit anything. + cmd=f'{sys.executable} -c "import time; time.sleep(30)"', + url="http://127.0.0.1:1", # unreachable + ready_marker="this string will never appear", + timeout_seconds=2, + ) + started = time.monotonic() + handle = start(tmp_path, cfg, task_id="t") + elapsed = time.monotonic() - started + assert handle.unavailable is True + # Bounded by timeout — should NOT block forever. + assert elapsed < 6, f"start() should have bailed by timeout; took {elapsed:.1f}s" + + +def test_start_succeeds_when_marker_appears_in_log(tmp_path): + """Happy path via the log-marker readiness signal. The subprocess + prints the marker on its own; the helper notices and returns a + live handle.""" + marker = "READY-MARKER-XYZ" + cfg = DevServerConfig( + cmd=f"{sys.executable} -u -c \"import sys; sys.stdout.write('{marker}\\n'); sys.stdout.flush(); import time; time.sleep(30)\"", + url="http://127.0.0.1:1", # unreachable — force the marker path + ready_marker=marker, + timeout_seconds=5, + ) + handle = start(tmp_path, cfg, task_id="t") + try: + assert handle.unavailable is False + assert handle.process is not None + assert handle.process.poll() is None # still running + assert handle.url == cfg.url + assert handle.log_path is not None and handle.log_path.exists() + # Log captures the marker line we printed. + assert marker.encode() in handle.log_path.read_bytes() + finally: + handle.stop() + + +def test_stop_is_idempotent(tmp_path): + """Calling stop() twice is a no-op the second time — and any + later call. Important because the panel wraps stop() in a try/ + finally that may also fire on exception paths.""" + cfg = DevServerConfig( + cmd=f'{sys.executable} -c "import time; time.sleep(30)"', + url="http://127.0.0.1:1", + ready_marker="will never appear", + timeout_seconds=1, + ) + handle = start(tmp_path, cfg, task_id="t") + # start() already calls stop() internally on timeout, so this just + # verifies that an extra stop() doesn't blow up. + handle.stop() + handle.stop() + handle.stop() + + +def test_stop_kills_process_group(tmp_path): + """A shell command that backgrounds a child must NOT leave the + child running after stop() — otherwise ``pnpm dev`` style + dev-servers (which fork workers) leak processes across panel + runs. We verify the wrapper script's own PID exits, which is + the necessary precondition for child cleanup via SIGTERM/SIGKILL + on the process group. + """ + marker = "UP" + cfg = DevServerConfig( + cmd=f"{sys.executable} -u -c \"import sys; sys.stdout.write('{marker}\\n'); sys.stdout.flush(); import time; time.sleep(30)\"", + url="http://127.0.0.1:1", + ready_marker=marker, + timeout_seconds=5, + ) + handle = start(tmp_path, cfg, task_id="t") + assert handle.unavailable is False + pid = handle.process.pid # type: ignore[union-attr] + handle.stop() + # After stop(), the process is reaped — poll() returns the exit code, + # not None. + assert handle.process is not None + assert handle.process.poll() is not None, f"pid {pid} still alive after stop()"