AppliedLabsAI · hwuiwon · Apr 7, 2026 · Apr 7, 2026
diff --git a/.claude/commands/create-playbook.md b/.claude/commands/create-playbook.md
@@ -36,7 +36,7 @@ Convert the recorded interactions into a playbook YAML file. Apply these optimiz
 
 **Auth handling (IMPORTANT):**
 - **NEVER include login/auth steps** (typing username, password, clicking login) in the playbook.
-- Authentication is handled externally by `DashboardAuth.ensure_authenticated()` before the playbook runs. It restores saved sessions or performs fresh login automatically.
+- Authentication is handled externally by the auth system before the playbook runs. It detects login forms and performs a fresh login automatically.
 - If the recording includes login interactions, **strip them out**. The playbook should start from the first post-login action.
 - Set `auth_required: true` if login was part of the recording — this tells the runner to authenticate before executing.
 

diff --git a/.claude/commands/test-deployment.md b/.claude/commands/test-deployment.md
@@ -0,0 +1,113 @@
+Test the deployed CUA API on Modal to verify it's working correctly.
+
+## Prerequisites
+
+Before running tests, determine the API base URL and ensure `$CUA_API_KEY` is set in the environment (loaded via direnv). The base URL depends on the Modal workspace — check the deploy output or run:
+
+```bash
+.venv/bin/modal app list | grep cua
+```
+
+Set the base URL for the session:
+```bash
+BASE_URL="https://<workspace>--cua-serve.modal.run"
+```
+
+## Test cases
+
+Run all 4 test cases. For Tests 3 and 4, save the `run_id` from the create response and poll until `status` is `completed` or another terminal state (`failed` or `timeout`).
+
+### Test 1: Dry-run validation (config check, no sandbox)
+
+```bash
+curl -s -X POST "$BASE_URL/runs/dry-run" \
+  -H "Authorization: Bearer $CUA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"directive": "Test directive", "max_steps": 10}' | python3 -m json.tool
+```
+
+**Expected**: `"valid": true`, all checks passed.
+
+### Test 2: Input validation (reject invalid config)
+
+```bash
+curl -s -X POST "$BASE_URL/runs/dry-run" \
+  -H "Authorization: Bearer $CUA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"directive": "Test", "max_steps": 0}' | python3 -m json.tool
+```
+
+**Expected**: HTTP 422 with `"code": "INVALID_REQUEST"` and an error about `max_steps`.
+
+### Test 3: Simple directive (example.com heading)
+
+Create a run:
+```bash
+curl -s -X POST "$BASE_URL/runs" \
+  -H "Authorization: Bearer $CUA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "directive": "Go to https://example.com and tell me the heading text on the page",
+    "max_steps": 10,
+    "timeout_seconds": 120,
+    "start_url": "https://example.com"
+  }' | python3 -m json.tool
+```
+
+Poll until completed:
+```bash
+curl -s "$BASE_URL/runs/<run_id>" \
+  -H "Authorization: Bearer $CUA_API_KEY" | python3 -m json.tool
+```
+
+**Expected**: `"status": "completed"`, result mentions "Example Domain".
+
+### Test 4: Structured output extraction (HN top 3)
+
+Create a run with `output_schema`:
+```bash
+curl -s -X POST "$BASE_URL/runs" \
+  -H "Authorization: Bearer $CUA_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "directive": "Go to https://news.ycombinator.com and extract the titles of the top 3 stories",
+    "max_steps": 15,
+    "timeout_seconds": 180,
+    "start_url": "https://news.ycombinator.com",
+    "output_schema": {
+      "type": "object",
+      "properties": {
+        "stories": {
+          "type": "array",
+          "items": {"type": "object", "properties": {"rank": {"type": "integer"}, "title": {"type": "string"}}},
+          "maxItems": 3
+        }
+      },
+      "required": ["stories"]
+    }
+  }' | python3 -m json.tool
+```
+
+Poll until completed (may take 20-30s).
+
+**Expected**: `"status": "completed"`, `data.stories` array with 3 items each having `rank` and `title`, no extract timeout errors in actions.
+
+## Evaluating results
+
+For each test, check:
+1. **Status**: should be `completed` (not `failed` or `timeout`)
+2. **Errors**: `error` field should be `null`
+3. **Actions**: verify no repeated timeout errors (stuck detection should catch these)
+4. **Duration**: simple directives should complete in under 30s, structured output under 60s
+
+## Troubleshooting sandbox logs
+
+If a run fails or behaves unexpectedly, check the sandbox logs in the Modal dashboard.
+
+Look for:
+- `CancelledError` in Starlette lifespan — graceful shutdown issue
+- `AuthError: Token missing` — volume commit called from sandbox context
+- `AsyncUsageWarning` — sync Modal API call in async context
+- `browser_dom.extract failed: Timeout` — selector mismatch (check if href was truncated)
+
+Report a summary table of all test results with status, duration, and any errors found.
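The poll step in Tests 3 and 4 can be scripted instead of repeated by hand. The following is a minimal sketch and not part of this PR: `fetch_run`, `poll_run`, and `TERMINAL_STATES` are hypothetical names, and the terminal statuses (`completed`, `failed`, `timeout`) are assumed from the values mentioned under "Evaluating results". The fetch callable is injected so the loop can be exercised without a live deployment.

```python
import json
import os
import time
import urllib.request
from collections.abc import Callable

# Assumed terminal statuses; adjust to match the API's actual RunStatus values.
TERMINAL_STATES = {"completed", "failed", "timeout"}


def fetch_run(base_url: str, run_id: str) -> dict:
    """GET /runs/{run_id} using the bearer token from $CUA_API_KEY."""
    req = urllib.request.Request(
        f"{base_url}/runs/{run_id}",
        headers={"Authorization": f"Bearer {os.environ['CUA_API_KEY']}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def poll_run(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_wait: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the run is terminal or max_wait seconds elapse."""
    deadline = time.monotonic() + max_wait
    while True:
        run = fetch()
        if run.get("status") in TERMINAL_STATES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {run.get('status')!r} after {max_wait}s")
        sleep(interval)
```

A typical call would be `poll_run(lambda: fetch_run(base_url, run_id))`, with `base_url` and `run_id` taken from the earlier steps.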
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,93 @@
+# CUA — Computer Use Agent
+
+Autonomous browser automation agent deployed on Modal. Accepts natural-language directives, executes them via a headless Chromium browser in a sandboxed VM, and returns structured results.
+
+## Quick reference
+
+```bash
+# Run tests
+.venv/bin/python -m pytest tests/ -x -q
+
+# Lint
+.venv/bin/ruff check .
+
+# Type check
+.venv/bin/ty check
+
+# Deploy to Modal
+.venv/bin/modal deploy api/server.py::modal_app
+
+# Run agent locally (requires DISPLAY / Xvfb)
+.venv/bin/python scripts/run_local.py --directive "..." --start-url "https://..."
+```
+
+## Architecture
+
+```
+api/             → Outer FastAPI service (Modal Function), handles /runs CRUD
+  server.py      → FastAPI app + Modal ASGI entrypoint (modal_app lives in modal_app.py)
+  modal_app.py   → Modal App definition, image builds, volume/dict setup
+  runs/          → RunService, RunRegistry (in-memory + Modal Dict), RunHandle
+  streaming.py   → In-sandbox status API (port 8090), SSE events, status persistence
+  models.py      → RunConfig, RunStatus, GuardrailSettings, ActionEvent
+
+agent/           → Agent loop (runs inside Modal Sandbox)
+  main.py        → Sandbox entrypoint — starts status API, runs session, handles shutdown
+  loop.py        → PydanticAI agent loop with tool definitions
+  session/       → SessionRunner (browser + agent lifecycle), RunFinalizer
+  tools.py       → browser_dom tool implementation
+  hooks.py       → PydanticAI hooks (preflight guardrails, thinking capture, error recovery)
+
+bridge/          → Browser abstraction layer
+  browser.py     → BrowserManager (Patchright wrapper, page lifecycle)
+  execution.py   → Action handlers (click, extract, goto, etc.), SequenceExecutor
+  page_actions.py → Primitive page actions with shared semantics
+  observation.py → DOM snapshots, mutations, screenshots
+  scripts/       → JS injected into pages (page_context.js, recorder.js)
+
+sandbox/         → Modal Sandbox definition
+  image.py       → Ubuntu 24.04 image with desktop env, Patchright, agent runtime
+  entrypoint.sh  → Starts Xvfb + openbox, runs agent/main.py
+
+guardrails/      → Runtime safety checks
+  stuck.py       → Stuck detection (repetition, cycles, failure clusters, URL revisits)
+  scope.py       → Domain allowlist/blocklist, action permissions
+
+blinders/        → Directive classification (goal type, login detection, action filtering)
+playbooks/       → YAML-defined deterministic workflows with LLM fallback
+evaluation/      → Benchmark suite runner and scoring engine
+recording/       → Playwright trace capture and artifact management
+telemetry/       → OpenTelemetry tracing, structured logging, metrics
+```
+
+## Key conventions
+
+- **Python 3.13+**, managed with `uv`. Virtual env at `.venv/`.
+- **Environment**: uses `direnv` — secrets loaded from `.envrc` (not committed).
+- **Settings**: all env vars centralized in `settings.py` via Pydantic Settings. Never scatter `os.environ.get()` calls through the codebase.
+- **Models**: `PRIMARY_MODEL` and `UTILITY_MODEL` constants in `settings.py`. Change there to switch everywhere.
+- **Modal deploy target**: `api/server.py::modal_app` — the app variable is named `modal_app`, not `app`.
+- **Sandbox vs Function**: Code in `agent/` runs inside Modal Sandboxes (no Modal API token). Code in `api/` runs in Modal Functions (has Modal auth). Don't call `modal.Volume.commit()` from sandbox code.
+- **Tests**: `pytest` with `asyncio_mode = "auto"`. Integration tests marked `@pytest.mark.integration`. Run `pytest tests/ -x -q` for the full suite.
+- **Lint**: `ruff` with bugbear, isort, pyupgrade, and pep8-naming rules. Line length 88.
+
+## API endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| POST | /runs/dry-run | Validate config without executing |
+| POST | /runs | Create and start a new run |
+| GET | /runs/{run_id} | Poll run status |
+| POST | /runs/{run_id}/stop | Terminate a run |
+| GET | /runs/{run_id}/stream | SSE event stream |
+| GET | /runs/{run_id}/recording/manifest | List recording artifacts |
+| GET | /runs/{run_id}/recording/trace | Download Playwright trace ZIP |
+
+Auth: `Authorization: Bearer $CUA_API_KEY` (set in Modal secret `cua-secret`).
+
+## Common patterns
+
+- **DOM snapshot truncation**: `page_context.js` truncates hrefs to 60 chars. The extract action has a fallback that retries with `href^=` (starts-with) when the exact match fails.
+- **Stuck detection**: sliding window over recent actions, checks for repetition, cycles, failure clusters, and URL revisits. Configurable via `GuardrailSettings`.
+- **Session memory**: injected into the system prompt before each LLM request so the agent retains awareness of prior work even after context pruning.
+- **Playbook execution**: YAML-defined step sequences with selector fallbacks, verification checks, and LLM handoff on failure.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1 @@
+@AGENTS.md
diff --git a/README.md b/README.md
@@ -195,11 +195,10 @@ cua/
 
 | Topic | Description |
 |---|---|
-| [Architecture](docs/architecture.md) | Full sequence diagram and component overview |
 | [API Reference](docs/api.md) | Endpoints, SSE streaming, replay, multi-container support |
-| [Browser Tools](docs/tools.md) | 9 browser actions, `execute_sequence` batching, design choices |
+| [Browser Tools](docs/tools.md) | 10 browser actions, `execute_sequence` batching, design choices |
 | [Playbooks](docs/playbooks.md) | Deterministic workflows, selector fallbacks, LLM handoff |
-| [Authentication](docs/authentication.md) | Session persistence, credential refs, `SecretValue`, and security caveats |
+| [Authentication](docs/authentication.md) | Credential refs, `SecretValue`, and security caveats |
 | [Guardrails](docs/guardrails.md) | Cognitive Blinders, runtime safety, domain/action controls |
 | [Recording](docs/recording.md) | Playwright tracing, session replay |
 | [Evaluation](docs/evaluation.md) | Benchmark suites, trial scoring, pass/fail expectations |

diff --git a/agent/main.py b/agent/main.py
@@ -18,21 +18,29 @@
 import os
 import signal
 import sys
+from typing import TYPE_CHECKING
 
 from telemetry.logging import setup_logging
 
+if TYPE_CHECKING:
+    import uvicorn
+
 setup_logging()
 logger = logging.getLogger("cua.agent")
 
 STATUS_API_PORT = 8090
 
 
-async def _start_status_api() -> asyncio.Task:
+async def _start_status_api() -> "tuple[asyncio.Task, uvicorn.Server]":
     """Start the in-sandbox status API as a background asyncio task.
 
     Runs uvicorn in the same process so the status API shares module
     globals with the agent loop (push_action / complete_run update the
     same _status and _subscribers objects that GET /events reads from).
+
+    Returns ``(task, server)`` so the caller can trigger a graceful
+    shutdown via ``server.should_exit = True`` instead of cancelling
+    the task (which causes a noisy ``CancelledError`` in Starlette).
     """
     import uvicorn
 
@@ -48,7 +56,7 @@ async def _start_status_api() -> asyncio.Task:
     task = asyncio.create_task(server.serve())
     # Give uvicorn a moment to bind the port
     await asyncio.sleep(0.5)
-    return task
+    return task, server
 
 
 async def main() -> int:
@@ -91,7 +99,7 @@ async def main() -> int:
     logger.info("Directive: %s", config.directive[:200])
 
     # Start status API in-process (shares globals with agent loop)
-    status_task = await _start_status_api()
+    status_task, status_server = await _start_status_api()
     logger.info("Status API started on port %d (in-process)", STATUS_API_PORT)
 
     # Initialize status API state
@@ -143,9 +151,10 @@ def _request_shutdown(sig: int) -> None:
     except asyncio.CancelledError:
         result = 1
 
-    # Cancel the status API after a grace period for final SSE delivery
+    # Keep the status API alive so the outer API can do final polling
+    # during the entrypoint keep-alive window, then shut down gracefully.
     await asyncio.sleep(1)
-    status_task.cancel()
+    status_server.should_exit = True
     with contextlib.suppress(asyncio.CancelledError):
         await status_task
 

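The switch from `status_task.cancel()` to `status_server.should_exit = True` replaces forced cancellation with a cooperative stop. The sketch below illustrates the difference using only asyncio. `FakeServer` is a hypothetical stand-in that mimics a server loop polling a `should_exit` flag; it is not uvicorn's implementation, and the sleep durations are arbitrary.

```python
import asyncio


class FakeServer:
    """Hypothetical stand-in for a server whose serve() polls should_exit."""

    def __init__(self) -> None:
        self.should_exit = False
        self.cleaned_up = False
        self.cancelled = False

    async def serve(self) -> None:
        try:
            while not self.should_exit:
                await asyncio.sleep(0.01)
            # Cooperative path: the loop ends on its own, so cleanup runs.
            self.cleaned_up = True
        except asyncio.CancelledError:
            # Forced path: cancellation interrupts serve() mid-await.
            self.cancelled = True
            raise


async def stop_with_flag() -> FakeServer:
    server = FakeServer()
    task = asyncio.create_task(server.serve())
    await asyncio.sleep(0.05)
    server.should_exit = True  # ask the server to finish
    await task                 # completes without raising
    return server


async def stop_with_cancel() -> FakeServer:
    server = FakeServer()
    task = asyncio.create_task(server.serve())
    await asyncio.sleep(0.05)
    task.cancel()              # injects CancelledError into serve()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return server
```

With the flag, `serve()` returns normally and its cleanup code runs; with `cancel()`, the exception propagates through whatever the coroutine was awaiting, which is what the docstring above describes as the noisy `CancelledError` in Starlette.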
diff --git a/agent/session/finalizer.py b/agent/session/finalizer.py
@@ -24,7 +24,16 @@
 
 
 async def _commit_recording_volume() -> None:
-    """Commit the recordings volume so the outer API can read persisted data."""
+    """Commit the recordings volume so the outer API can read persisted data.
+
+    Inside a Modal sandbox the volume is auto-synced on exit and the Modal
+    API token is unavailable, so we skip the explicit commit.
+    """
+    from settings import get_settings
+
+    if get_settings().modal_sandbox_id != "local":
+        logger.debug("Skipping volume commit (sandbox auto-syncs on exit)")
+        return
     try:
         vol = modal.Volume.from_name(_RECORDING_VOLUME_NAME)
         await vol.commit.aio()

diff --git a/api/runs/registry.py b/api/runs/registry.py
@@ -50,7 +50,7 @@ def add(self, handle: RunHandle) -> None:
     def get(self, run_id: str) -> RunHandle | None:
         raise NotImplementedError
 
-    def remove(self, run_id: str) -> RunHandle | None:
+    async def remove(self, run_id: str) -> RunHandle | None:
         raise NotImplementedError
 
     def contains(self, run_id: str) -> bool:
@@ -69,7 +69,7 @@ def add(self, handle: RunHandle) -> None:
     def get(self, run_id: str) -> RunHandle | None:
         return self._runs.get(run_id)
 
-    def remove(self, run_id: str) -> RunHandle | None:
+    async def remove(self, run_id: str) -> RunHandle | None:
         return self._runs.pop(run_id, None)
 
     def contains(self, run_id: str) -> bool:
@@ -106,10 +106,10 @@ def get(self, run_id: str) -> RunHandle | None:
         # via modal.Sandbox.from_id() and re-adds to the registry.
         return self._local.get(run_id)
 
-    def remove(self, run_id: str) -> RunHandle | None:
+    async def remove(self, run_id: str) -> RunHandle | None:
         handle = self._local.pop(run_id, None)
         with contextlib.suppress(Exception):
-            self._dict.pop(run_id)
+            await self._dict.pop.aio(run_id)
         return handle
 
     def contains(self, run_id: str) -> bool: