2 changes: 1 addition & 1 deletion .claude/commands/create-playbook.md
@@ -36,7 +36,7 @@ Convert the recorded interactions into a playbook YAML file. Apply these optimiz

 **Auth handling (IMPORTANT):**
 - **NEVER include login/auth steps** (typing username, password, clicking login) in the playbook.
-- Authentication is handled externally by `DashboardAuth.ensure_authenticated()` before the playbook runs. It restores saved sessions or performs fresh login automatically.
+- Authentication is handled externally by the auth system before the playbook runs. It detects login forms and performs fresh login automatically.
 - If the recording includes login interactions, **strip them out**. The playbook should start from the first post-login action.
 - Set `auth_required: true` if login was part of the recording — this tells the runner to authenticate before executing.
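
The stripping rule in this hunk could be sketched roughly as follows. The step shape (dicts with `action` and `target` keys), the keyword list, and the names `is_login_step` and `strip_login_steps` are assumptions made for illustration; they are not the recorder's actual format.

```python
# Assumed heuristic: a step is login-related if its target mentions one of these.
LOGIN_KEYWORDS = ("username", "password", "login", "sign in")


def is_login_step(step: dict) -> bool:
    """Heuristically flag a recorded step that touches a login form."""
    target = str(step.get("target", "")).lower()
    return any(keyword in target for keyword in LOGIN_KEYWORDS)


def strip_login_steps(steps: list[dict]) -> dict:
    """Return a playbook-like dict that starts at the first post-login action."""
    last_login = max(
        (i for i, step in enumerate(steps) if is_login_step(step)),
        default=-1,
    )
    return {
        # True only when the recording actually contained login interactions.
        "auth_required": last_login >= 0,
        "steps": steps[last_login + 1:],
    }


recorded = [
    {"action": "type", "target": "input#username"},
    {"action": "type", "target": "input#password"},
    {"action": "click", "target": "button.login"},
    {"action": "click", "target": "a.reports"},
]
playbook = strip_login_steps(recorded)
```

Here `playbook["steps"]` keeps only the final click and `auth_required` is set, which mirrors the two rules in the list.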

113 changes: 113 additions & 0 deletions .claude/commands/test-deployment.md
@@ -0,0 +1,113 @@
Test the deployed CUA API on Modal to verify it's working correctly.

## Prerequisites

Before running tests, determine the API base URL and ensure `$CUA_API_KEY` is set in the environment (loaded via direnv). The base URL depends on the Modal workspace — check the deploy output or run:

```bash
.venv/bin/modal app list | grep cua
```

Set the base URL for the session:
```bash
BASE_URL="https://<workspace>--cua-serve.modal.run"
```

## Test cases

Run all 4 test cases. For Tests 3 and 4, save the `run_id` from the create response and poll until `status` is `completed` or another terminal state (`failed` or `timeout`).

### Test 1: Dry-run validation (config check, no sandbox)

```bash
curl -s -X POST "$BASE_URL/runs/dry-run" \
  -H "Authorization: Bearer $CUA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"directive": "Test directive", "max_steps": 10}' | python3 -m json.tool
```

**Expected**: `"valid": true`, all checks passed.

### Test 2: Input validation (reject invalid config)

```bash
curl -s -X POST "$BASE_URL/runs/dry-run" \
  -H "Authorization: Bearer $CUA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"directive": "Test", "max_steps": 0}' | python3 -m json.tool
```

**Expected**: HTTP 422 with `"code": "INVALID_REQUEST"` and error about `max_steps`.

### Test 3: Simple directive (example.com heading)

Create a run:
```bash
curl -s -X POST "$BASE_URL/runs" \
  -H "Authorization: Bearer $CUA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directive": "Go to https://example.com and tell me the heading text on the page",
    "max_steps": 10,
    "timeout_seconds": 120,
    "start_url": "https://example.com"
  }' | python3 -m json.tool
```

Poll until completed:
```bash
curl -s "$BASE_URL/runs/<run_id>" \
  -H "Authorization: Bearer $CUA_API_KEY" | python3 -m json.tool
```

**Expected**: `"status": "completed"`, result mentions "Example Domain".
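
The "poll until completed" step can be scripted instead of re-running `curl` by hand. The sketch below is one way to do it under stated assumptions: the terminal states are the ones named under "Evaluating results" (possibly not exhaustive), the `running` status and the function name `poll_until_terminal` are hypothetical, and the fetcher is passed in as a callable so no particular HTTP client is implied.

```python
import time
from collections.abc import Callable

# Statuses named in this document; the real API may define others.
TERMINAL_STATES = {"completed", "failed", "timeout"}


def poll_until_terminal(
    fetch_run: Callable[[], dict],
    interval_seconds: float = 2.0,
    max_wait_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_run() until the run reports a terminal status or time runs out."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        run = fetch_run()
        if run.get("status") in TERMINAL_STATES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"run still {run.get('status')!r} after {max_wait_seconds}s"
            )
        sleep(interval_seconds)


# Stand-in fetcher; a real one would GET $BASE_URL/runs/<run_id>.
responses = iter(
    [{"status": "running"}, {"status": "running"}, {"status": "completed"}]
)
final = poll_until_terminal(lambda: next(responses), sleep=lambda _: None)
```

Injecting `sleep` keeps the helper testable without real delays; in normal use the default `time.sleep` applies.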

### Test 4: Structured output extraction (HN top 3)

Create a run with `output_schema`:
```bash
curl -s -X POST "$BASE_URL/runs" \
  -H "Authorization: Bearer $CUA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "directive": "Go to https://news.ycombinator.com and extract the titles of the top 3 stories",
    "max_steps": 15,
    "timeout_seconds": 180,
    "start_url": "https://news.ycombinator.com",
    "output_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {"type": "object", "properties": {"rank": {"type": "integer"}, "title": {"type": "string"}}},
          "maxItems": 3
        }
      },
      "required": ["stories"]
    }
  }' | python3 -m json.tool
```

Poll until completed (may take 20-30s).

**Expected**: `"status": "completed"`, `data.stories` array with 3 items each having `rank` and `title`, no extract timeout errors in actions.
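
The shape check for Test 4 can be automated. This is a minimal sketch that assumes the `data` object from the run response has already been parsed into a dict; the function name `check_stories` and the wording of the problem messages are illustrative only.

```python
def check_stories(data: dict, expected_count: int = 3) -> list[str]:
    """Return a list of problems found in a Test 4 style `data` payload."""
    stories = data.get("stories")
    if not isinstance(stories, list):
        return ["`stories` is missing or not a list"]
    problems = []
    if len(stories) != expected_count:
        problems.append(f"expected {expected_count} stories, got {len(stories)}")
    for index, story in enumerate(stories):
        if not isinstance(story, dict):
            problems.append(f"story {index} is not an object")
            continue
        # type(...) is int rejects booleans, which isinstance would accept.
        if type(story.get("rank")) is not int:
            problems.append(f"story {index} has no integer `rank`")
        if not isinstance(story.get("title"), str) or not story["title"].strip():
            problems.append(f"story {index} has no non-empty `title`")
    return problems


sample = {
    "stories": [
        {"rank": 1, "title": "First"},
        {"rank": 2, "title": "Second"},
        {"rank": 3, "title": "Third"},
    ]
}
sample_problems = check_stories(sample)
```

An empty list means the payload matches the expectation above; any entries describe what to look at.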

## Evaluating results

For each test, check:
1. **Status**: should be `completed` (not `failed` or `timeout`)
2. **Errors**: `error` field should be `null`
3. **Actions**: verify no repeated timeout errors (stuck detection should catch these)
4. **Duration**: simple directives should complete in under 30s, structured output under 60s
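
The four checks above could be applied to a parsed run response along these lines. Only `status` and `error` are field names taken from this document; the `actions` list, the per-action `error` string, and `duration_seconds` are assumed names, and "repeated" is read here as two or more timeout errors.

```python
def evaluate_run(run: dict, max_seconds: float) -> dict[str, bool]:
    """Apply the status, error, actions, and duration checks to one run."""
    actions = run.get("actions") or []  # assumed field name
    timeout_errors = [
        action
        for action in actions
        if "timeout" in str(action.get("error") or "").lower()
    ]
    duration = run.get("duration_seconds")  # assumed field name
    return {
        "status_ok": run.get("status") == "completed",
        "error_ok": run.get("error") is None,
        "actions_ok": len(timeout_errors) < 2,
        "duration_ok": duration is not None and duration < max_seconds,
    }


simple_run = {
    "status": "completed",
    "error": None,
    "actions": [{"error": None}],
    "duration_seconds": 12.5,
}
simple_report = evaluate_run(simple_run, max_seconds=30)
```

Pass `max_seconds=30` for simple directives and `max_seconds=60` for structured output, matching the thresholds in item 4.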

## Troubleshooting sandbox logs

If a run fails or behaves unexpectedly, check the sandbox logs in the Modal dashboard.

Look for:
- `CancelledError` in Starlette lifespan — graceful shutdown issue
- `AuthError: Token missing` — volume commit called from sandbox context
- `AsyncUsageWarning` — sync Modal API call in async context
- `browser_dom.extract failed: Timeout` — selector mismatch (check if href was truncated)
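
If the sandbox logs are copied out of the dashboard as text, the patterns above can be counted with a short script. The pattern strings and hints come from the list; the function name `scan_logs` and the sample log lines are made up for illustration.

```python
# Patterns and hints copied from the troubleshooting list above.
KNOWN_PATTERNS = {
    "CancelledError": "graceful shutdown issue",
    "AuthError: Token missing": "volume commit called from sandbox context",
    "AsyncUsageWarning": "sync Modal API call in async context",
    "browser_dom.extract failed: Timeout": "selector mismatch (check if href was truncated)",
}


def scan_logs(log_text: str) -> dict[str, int]:
    """Count log lines that contain each known pattern, omitting zero counts."""
    counts: dict[str, int] = {}
    for line in log_text.splitlines():
        for pattern in KNOWN_PATTERNS:
            if pattern in line:
                counts[pattern] = counts.get(pattern, 0) + 1
    return counts


# Illustrative log excerpt, not real output.
sample_log = """\
INFO agent started
ERROR browser_dom.extract failed: Timeout
ERROR browser_dom.extract failed: Timeout
WARNING AsyncUsageWarning: blocking call
"""
findings = scan_logs(sample_log)
for pattern, count in findings.items():
    print(f"{count}x {pattern}: {KNOWN_PATTERNS[pattern]}")
```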

Report a summary table of all test results with status, duration, and any errors found.
93 changes: 93 additions & 0 deletions AGENTS.md
@@ -0,0 +1,93 @@
# CUA — Computer Use Agent

Autonomous browser automation agent deployed on Modal. Accepts natural-language directives, executes them via a headless Chromium browser in a sandboxed VM, and returns structured results.

## Quick reference

```bash
# Run tests
.venv/bin/python -m pytest tests/ -x -q

# Lint
.venv/bin/ruff check .

# Type check
.venv/bin/ty check

# Deploy to Modal
.venv/bin/modal deploy api/server.py::modal_app

# Run agent locally (requires DISPLAY / Xvfb)
.venv/bin/python scripts/run_local.py --directive "..." --start-url "https://..."
```

## Architecture

```
api/                → Outer FastAPI service (Modal Function), handles /runs CRUD
  server.py         → FastAPI app + Modal ASGI entrypoint (modal_app lives in modal_app.py)
  modal_app.py      → Modal App definition, image builds, volume/dict setup
  runs/             → RunService, RunRegistry (in-memory + Modal Dict), RunHandle
  streaming.py      → In-sandbox status API (port 8090), SSE events, status persistence
  models.py         → RunConfig, RunStatus, GuardrailSettings, ActionEvent

agent/              → Agent loop (runs inside Modal Sandbox)
  main.py           → Sandbox entrypoint: starts status API, runs session, handles shutdown
  loop.py           → PydanticAI agent loop with tool definitions
  session/          → SessionRunner (browser + agent lifecycle), RunFinalizer
  tools.py          → browser_dom tool implementation
  hooks.py          → PydanticAI hooks (preflight guardrails, thinking capture, error recovery)

bridge/             → Browser abstraction layer
  browser.py        → BrowserManager (Patchright wrapper, page lifecycle)
  execution.py      → Action handlers (click, extract, goto, etc.), SequenceExecutor
  page_actions.py   → Primitive page actions with shared semantics
  observation.py    → DOM snapshots, mutations, screenshots
  scripts/          → JS injected into pages (page_context.js, recorder.js)

sandbox/            → Modal Sandbox definition
  image.py          → Ubuntu 24.04 image with desktop env, Patchright, agent runtime
  entrypoint.sh     → Starts Xvfb + openbox, runs agent/main.py

guardrails/         → Runtime safety checks
  stuck.py          → Stuck detection (repetition, cycles, failure clusters, URL revisits)
  scope.py          → Domain allowlist/blocklist, action permissions

blinders/           → Directive classification (goal type, login detection, action filtering)
playbooks/          → YAML-defined deterministic workflows with LLM fallback
evaluation/         → Benchmark suite runner and scoring engine
recording/          → Playwright trace capture and artifact management
telemetry/          → OpenTelemetry tracing, structured logging, metrics
```

## Key conventions

- **Python 3.13+**, managed with `uv`. Virtual env at `.venv/`.
- **Environment**: uses `direnv` — secrets loaded from `.envrc` (not committed).
- **Settings**: all env vars centralized in `settings.py` via Pydantic Settings. Never scatter `os.environ.get()`.
- **Models**: `PRIMARY_MODEL` and `UTILITY_MODEL` constants in `settings.py`. Change there to switch everywhere.
- **Modal deploy target**: `api/server.py::modal_app` — the app variable is named `modal_app`, not `app`.
- **Sandbox vs Function**: Code in `agent/` runs inside Modal Sandboxes (no Modal API token). Code in `api/` runs in Modal Functions (has Modal auth). Don't call `modal.Volume.commit()` from sandbox code.
- **Tests**: `pytest` with `asyncio_mode = "auto"`. Integration tests marked `@pytest.mark.integration`. Run `pytest tests/ -x -q` for the full suite.
- **Lint**: `ruff` with bugbear, isort, pyupgrade, and pep8-naming rules. Line length 88.
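
The "centralize, never scatter" rule for settings can be illustrated with a small stand-in. This sketch uses a stdlib dataclass in place of Pydantic Settings so it runs without extra dependencies. The names `get_settings` and `modal_sandbox_id` appear elsewhere in this PR, but the environment variable names, the `"local"` default, and the helper `running_in_sandbox` are assumptions.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    cua_api_key: str
    modal_sandbox_id: str


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Read every environment variable in one place and cache the result."""
    return Settings(
        cua_api_key=os.environ.get("CUA_API_KEY", ""),
        # Assumed variable name and default; the real settings.py may differ.
        modal_sandbox_id=os.environ.get("MODAL_SANDBOX_ID", "local"),
    )


def running_in_sandbox() -> bool:
    """Callers ask the settings object instead of reading os.environ themselves."""
    return get_settings().modal_sandbox_id != "local"
```

Because the result is cached, tests that change the environment need to call `get_settings.cache_clear()` before reading settings again.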

## API endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /runs/dry-run | Validate config without executing |
| POST | /runs | Create and start a new run |
| GET | /runs/{run_id} | Poll run status |
| POST | /runs/{run_id}/stop | Terminate a run |
| GET | /runs/{run_id}/stream | SSE event stream |
| GET | /runs/{run_id}/recording/manifest | List recording artifacts |
| GET | /runs/{run_id}/recording/trace | Download Playwright trace ZIP |

Auth: `Authorization: Bearer $CUA_API_KEY` (set in Modal secret `cua-secret`).

## Common patterns

- **DOM snapshot truncation**: `page_context.js` truncates hrefs to 60 chars. The extract action has a fallback that retries with `href^=` (starts-with) when exact match fails.
- **Stuck detection**: sliding window over recent actions, checks for repetition, cycles, failure clusters, and URL revisits. Configurable via `GuardrailSettings`.
- **Session memory**: injected into the system prompt before each LLM request so the agent retains awareness of prior work even after context pruning.
- **Playbook execution**: YAML-defined step sequences with selector fallbacks, verification checks, and LLM handoff on failure.
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -0,0 +1 @@
@AGENTS.md
5 changes: 2 additions & 3 deletions README.md
@@ -195,11 +195,10 @@ cua/

| Topic | Description |
|---|---|
| [Architecture](docs/architecture.md) | Full sequence diagram and component overview |
| [API Reference](docs/api.md) | Endpoints, SSE streaming, replay, multi-container support |
-| [Browser Tools](docs/tools.md) | 9 browser actions, `execute_sequence` batching, design choices |
+| [Browser Tools](docs/tools.md) | 10 browser actions, `execute_sequence` batching, design choices |
| [Playbooks](docs/playbooks.md) | Deterministic workflows, selector fallbacks, LLM handoff |
-| [Authentication](docs/authentication.md) | Session persistence, credential refs, `SecretValue`, and security caveats |
+| [Authentication](docs/authentication.md) | Credential refs, `SecretValue`, and security caveats |
| [Guardrails](docs/guardrails.md) | Cognitive Blinders, runtime safety, domain/action controls |
| [Recording](docs/recording.md) | Playwright tracing, session replay |
| [Evaluation](docs/evaluation.md) | Benchmark suites, trial scoring, pass/fail expectations |
19 changes: 14 additions & 5 deletions agent/main.py
@@ -18,21 +18,29 @@
 import os
 import signal
 import sys
+from typing import TYPE_CHECKING

 from telemetry.logging import setup_logging

+if TYPE_CHECKING:
+    import uvicorn
+
 setup_logging()
 logger = logging.getLogger("cua.agent")

 STATUS_API_PORT = 8090


-async def _start_status_api() -> asyncio.Task:
+async def _start_status_api() -> tuple[asyncio.Task, uvicorn.Server]:
     """Start the in-sandbox status API as a background asyncio task.

     Runs uvicorn in the same process so the status API shares module
     globals with the agent loop (push_action / complete_run update the
     same _status and _subscribers objects that GET /events reads from).
+
+    Returns ``(task, server)`` so the caller can trigger a graceful
+    shutdown via ``server.should_exit = True`` instead of cancelling
+    the task (which causes a noisy ``CancelledError`` in Starlette).
     """
     import uvicorn

@@ -48,7 +56,7 @@ async def _start_status_api() -> asyncio.Task:
     task = asyncio.create_task(server.serve())
     # Give uvicorn a moment to bind the port
     await asyncio.sleep(0.5)
-    return task
+    return task, server


async def main() -> int:
@@ -91,7 +99,7 @@ async def main() -> int:
     logger.info("Directive: %s", config.directive[:200])

     # Start status API in-process (shares globals with agent loop)
-    status_task = await _start_status_api()
+    status_task, status_server = await _start_status_api()
     logger.info("Status API started on port %d (in-process)", STATUS_API_PORT)

     # Initialize status API state
@@ -143,9 +151,10 @@ def _request_shutdown(sig: int) -> None:
     except asyncio.CancelledError:
         result = 1

-    # Cancel the status API after a grace period for final SSE delivery
+    # Keep the status API alive so the outer API can do final polling
+    # during the entrypoint keep-alive window, then shut down gracefully.
     await asyncio.sleep(1)
-    status_task.cancel()
+    status_server.should_exit = True
     with contextlib.suppress(asyncio.CancelledError):
         await status_task
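
The difference between cancelling the task and setting an exit flag can be seen in a small, dependency-free sketch. `FlagServer` is a hypothetical stand-in whose `serve()` loop polls `should_exit`; it is not `uvicorn.Server`, and the timings are arbitrary.

```python
import asyncio


class FlagServer:
    """Hypothetical stand-in for a server whose serve() loop polls should_exit."""

    def __init__(self) -> None:
        self.should_exit = False
        self.cleaned_up = False

    async def serve(self) -> None:
        while not self.should_exit:
            await asyncio.sleep(0.01)
        # Reached only on a cooperative exit; cancellation skips this line.
        self.cleaned_up = True


async def run_and_stop(cooperative: bool) -> bool:
    """Start the server, stop it one way or the other, report if cleanup ran."""
    server = FlagServer()
    task = asyncio.create_task(server.serve())
    await asyncio.sleep(0.05)  # let the serve loop start
    if cooperative:
        server.should_exit = True
    else:
        task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return server.cleaned_up


flag_result = asyncio.run(run_and_stop(cooperative=True))
cancel_result = asyncio.run(run_and_stop(cooperative=False))
```

With the flag, the loop finishes its current iteration and runs its own shutdown path; with `cancel()`, a `CancelledError` is raised inside the loop and the shutdown code never runs, which is the behaviour this hunk moves away from.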

11 changes: 10 additions & 1 deletion agent/session/finalizer.py
@@ -24,7 +24,16 @@


 async def _commit_recording_volume() -> None:
-    """Commit the recordings volume so the outer API can read persisted data."""
+    """Commit the recordings volume so the outer API can read persisted data.
+
+    Inside a Modal sandbox the volume is auto-synced on exit and the Modal
+    API token is unavailable, so we skip the explicit commit.
+    """
+    from settings import get_settings
+
+    if get_settings().modal_sandbox_id != "local":
+        logger.debug("Skipping volume commit (sandbox auto-syncs on exit)")
+        return
     try:
         vol = modal.Volume.from_name(_RECORDING_VOLUME_NAME)
         await vol.commit.aio()
8 changes: 4 additions & 4 deletions api/runs/registry.py
@@ -50,7 +50,7 @@ def add(self, handle: RunHandle) -> None:
     def get(self, run_id: str) -> RunHandle | None:
         raise NotImplementedError

-    def remove(self, run_id: str) -> RunHandle | None:
+    async def remove(self, run_id: str) -> RunHandle | None:
         raise NotImplementedError

     def contains(self, run_id: str) -> bool:
@@ -69,7 +69,7 @@ def add(self, handle: RunHandle) -> None:
     def get(self, run_id: str) -> RunHandle | None:
         return self._runs.get(run_id)

-    def remove(self, run_id: str) -> RunHandle | None:
+    async def remove(self, run_id: str) -> RunHandle | None:
         return self._runs.pop(run_id, None)

     def contains(self, run_id: str) -> bool:
@@ -106,10 +106,10 @@ def get(self, run_id: str) -> RunHandle | None:
         # via modal.Sandbox.from_id() and re-adds to the registry.
         return self._local.get(run_id)

-    def remove(self, run_id: str) -> RunHandle | None:
+    async def remove(self, run_id: str) -> RunHandle | None:
         handle = self._local.pop(run_id, None)
         with contextlib.suppress(Exception):
-            self._dict.pop(run_id)
+            await self._dict.pop.aio(run_id)
         return handle

     def contains(self, run_id: str) -> bool: