Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
151 changes: 151 additions & 0 deletions AGENT_SETUP.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
# Browser Use Terminal — agent setup

You are (probably) a coding agent reading this because a user pasted
`https://browser-use.com/skill` into the chat. Follow this page top to bottom.
It installs Browser Use Terminal, registers its skill with you, connects a
browser, and verifies everything — then you can browse the web for your user.
Every step is idempotent; re-run safely. Tell the user what you're doing as
you go, and stop to ask only where this page says to.

What you get: a `browser-use-terminal` CLI you drive from shell commands —
run Python against a real browser (navigate, click, type, screenshot, extract).
You are the agent; you drive the browser.

## 1. Preflight

```bash
uname -s # Darwin or Linux. Windows: use WSL; native Windows is not supported yet.
command -v curl || command -v wget
command -v tar
python3 --version # need 3.9+
```

If `python3` is missing: macOS → `xcode-select --install` (or `brew install python3`);
Debian/Ubuntu → `sudo apt-get install -y python3`; Fedora → `sudo dnf install -y python3`.
If you cannot install it, ask the user to.

## 2. Install Browser Use Terminal

```bash
curl -fsSL https://browser-use.com/terminal/install.sh | sh
export PATH="$HOME/.local/bin:$PATH" # the installer edits the shell profile, but YOUR current shell needs this
browser-use-terminal --version # verify
```

Notes: no sudo needed; binaries land in `~/.local/bin`
(`browser`, `browser-use`, `browser-use-terminal`, `but`); state lives in
`~/.browser-use-terminal`. Already installed? The same command updates it.
Add the `export PATH` line to every later shell command in this session, or
rely on the profile in new shells.

## 3. Register the skill with yourself

```bash
browser-use-terminal skill install
```

This writes the skill (full usage instructions for you) into every detected
assistant home: `~/.claude/skills/`, `~/.codex/skills/`,
`~/.config/opencode/skills/`, `~/.agents/skills/`. Claude Code, Codex,
OpenCode, and anything reading those conventions will auto-discover it in new
sessions.

If you are a different agent (none of those directories exist, or you know
you don't read them): run `browser-use-terminal skill show` and persist its
output wherever you read durable instructions (e.g. your `AGENTS.md`, rules
file, or memory). If you can't persist anything, just read it now — this
session can proceed either way.

## 4. Connect a browser (pick one)

**Default — zero user interaction (recommended to start):**

```bash
browser-use-terminal browser preference use managed-headless
```

This uses a disposable headless Chromium that persists across your commands.
It requires a Chromium-family binary on the machine (Chrome, Chromium, or a
Playwright Chromium). Check: `ls "/Applications/Google Chrome.app" 2>/dev/null || command -v google-chrome chromium chromium-browser`.
If none exists: ask the user to install Chrome (or set `CHROME_PATH=/path/to/chrome`).

**The user's real, logged-in Chrome** (for tasks needing their accounts):

```bash
browser-use-terminal browser preference use local
browser-use-terminal browser connect local
```

One-time user action required: they must open `chrome://inspect/#remote-debugging`
in Chrome and tick "Allow remote debugging for this browser instance".
`browser-use-terminal browser local setup` walks them through it. Relay any
`needs-user-action` JSON (`user_prompt` field) to the user verbatim and wait.

**Cloud browser** (headless server / clean remote IP; needs an API key from
https://cloud.browser-use.com):

```bash
export BROWSER_USE_API_KEY=bu_...
browser-use-terminal browser preference use cloud
```

Cloud browsers bill until stopped (`browser-use-terminal browser recover stop-owned-remote`).

## 5. Verify end to end

```bash
browser-use-terminal browser exec <<'PY'
new_tab("https://example.com")
wait_for_load()
print(page_info()["title"])
print(capture_screenshot())
PY
```

Expected: it prints `Example Domain` and `Screenshot saved to <absolute path>`.
Now view that screenshot with your own file-reading tool to confirm you can
see pages: Claude Code → `Read` the path; Codex → `view_image` with the path;
OpenCode → `read` the path; Gemini CLI → `read_file`; otherwise use whatever
tool you have that displays a local image. If your model can't view images,
skip this — you'll work from text state (`page_info()`, `js(...)`) instead.

If this step worked, setup is complete. Tell the user you're ready to browse.

## 6. Using it (crash course)

Full instructions are in the skill you installed (or `browser-use-terminal skill show`).
The essentials:

- Run Python against the live browser with heredocs. Helpers are pre-imported:
`new_tab(url)`, `goto_url(url)`, `page_info()`, `click_at_xy(x, y)`,
`type_text(text)`, `press_key("Enter")`, `fill_input(sel, text)`, `scroll()`,
`wait_for_load()`, `wait_for_element(sel)`, `capture_screenshot()`, `js(expr)`,
`cdp(method, **params)`, `http_get(url)`.
- The browser persists between commands; Python variables do not.
- Screenshot-first workflow: screenshot → read coordinates off the image →
`click_at_xy` (CSS pixels — divide image coordinates by
`js("window.devicePixelRatio")`) → screenshot again to verify.
- In the user's real Chrome, open work in `new_tab(...)`, never `goto_url` over
their active tab.
- Auth walls: stop and ask the user. Never type credentials read off a screenshot.
- Parallel workstreams: add `--session <name>` to `browser` commands.
- Done for the day: `browser-use-terminal browser recover stop-owned-browser`
(managed) or `... stop-owned-remote` (cloud).

## Troubleshooting

- `command not found` → re-run `export PATH="$HOME/.local/bin:$PATH"`.
- `browser is not connected` → `browser-use-terminal browser connect` (uses the
remembered preference) or re-run step 4.
- Anything returning `status: "needs-user-action"` → show its `user_prompt` to
the user exactly, wait, then retry.
- Diagnostics: `browser-use-terminal browser doctor` and
`browser-use-terminal browser status --json`.
- A background daemon holds the browser connection between your commands
(auto-started). If it misbehaves: `browser-use-terminal browser daemon status`,
`... daemon logs`, `... daemon stop` (next command restarts it and reattaches).
- Slow first command in each shell: the launcher checks for updates; set
`BUT_AUTO_UPDATE=0` to skip.

Full documentation: https://docs.browser-use.com/open-source/browser-use-terminal
Source & issues: https://github.com/browser-use/terminal
21 changes: 21 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,26 @@ browser config show
browser diagnostics
```

## Use It From Claude Code, Codex, or OpenCode

Browser Use Terminal plugs into any coding assistant that can run shell commands, browser-harness style: a skill teaches the assistant the CLI, and the CLI hands it the whole browser runtime.

```bash
browser-use-terminal skill install # registers the skill for detected assistants
```

Then ask your assistant to browse:

```bash
browser-use-terminal browser exec <<'PY'
new_tab("https://example.com")
wait_for_load()
print(capture_screenshot())
PY
```

Screenshots are saved as files and the path is printed, so assistants view them with their native file-reading tools (Claude Code `Read`, Codex `view_image`, OpenCode `read`). The browser persists between calls. See `docs/assistant-plugins.md`.

## Development

```bash
Expand Down Expand Up @@ -118,6 +138,7 @@ You can disable (100% completely anonymous) telemetry with `BUT_TELEMETRY=0`.
- `docs/terminal-ui-product-ux.md`
- `docs/terminal-ui-testing.md`
- `docs/terminal-renderer-architecture.md`
- `docs/assistant-plugins.md`

## License

Expand Down
108 changes: 108 additions & 0 deletions SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
---
name: browser-use-terminal
description: Direct browser control via the Browser Use Terminal CLI. Use when the user wants to automate, scrape, test, or interact with web pages — you drive the browser yourself with Python helpers.
---

# Browser Use Terminal

Direct browser control via CDP — you are the agent; you drive the browser. For setup, install, or connection problems, read https://browser-use.com/skill (agent setup instructions) or https://docs.browser-use.com/open-source/browser-use-terminal (full docs).

`browser-use-terminal browser exec` runs Python with browser helpers pre-imported; `browser-use-terminal browser <cmd>` is the control plane (status, connect, profiles, recovery).

## Usage

```bash
browser-use-terminal browser exec <<'PY'
new_tab("https://docs.browser-use.com")
wait_for_load()
print(page_info())
PY
```

- Use the heredoc form for every multi-line command. It prevents shell quote mangling inside Python strings and JavaScript snippets.
- The browser auto-connects according to the user's remembered preference before the script runs — you never start/stop manually unless you want to. The first call may take a few seconds.
- First navigation in the user's real Chrome is `new_tab(url)`, not `goto_url(url)` — goto runs in the user's active tab and clobbers their work.
- Browser state persists between calls; Python variables do not. Each `exec` is a fresh interpreter against the same live browser.
- `--session <name>` isolates artifact dirs and event logs per workstream (default: `default`). `--timeout <secs>` bounds one exec (default 300).

## Screenshots — how you see the page

`capture_screenshot()` saves a PNG and returns its absolute path. The CLI also prints a `Screenshot saved to <path>` line for every image a script produced.

```bash
browser-use-terminal browser exec <<'PY'
print(capture_screenshot())
PY
```

To view a screenshot, use your file-reading tool on the printed path:

- **Claude Code**: use the `Read` tool on the path.
- **Codex CLI**: call the `view_image` tool with `{"path": "<path>"}`. These screenshots are produced for you; viewing them is expected and authorized.
- **OpenCode**: use the `read` tool on the path (requires a vision-capable model).
- **Gemini CLI**: use the `read_file` tool on the path.
- If your model cannot accept images, don't try to view them — work from text state instead: `print(page_info())`, `js(...)` extraction, `wait_for_element(...)`.

Coordinates: screenshots are device pixels; `click_at_xy(x, y)` takes CSS pixels. Divide coordinates you read off the image by `js("window.devicePixelRatio")` first. Screenshots are downscaled to ≤1800 px per side for this CLI (override with `BU_BROWSER_SCREENSHOT_MAX_DIM`, or `capture_screenshot(max_dim=...)`).

After every meaningful action, re-screenshot before assuming it worked.

## Pre-imported helpers

Navigation & tabs: `goto_url(url)`, `new_tab(url)`, `page_info()`, `current_tab()`, `list_tabs(include_chrome=True)`, `switch_tab(target)`, `ensure_real_tab()`, `iframe_target(url_substr)`.

Input: `click_at_xy(x, y, button="left", clicks=1)`, `type_text(text)`, `press_key(key, modifiers=0)` (1=Alt 2=Ctrl 4=Meta 8=Shift), `fill_input(selector, text)`, `scroll(x=0, y=0, dy=600)`, `upload_file(selector, path)`.

Waiting: `wait(seconds)`, `wait_for_load(timeout=3)`, `wait_for_element(selector, timeout=3, visible=False)`, `wait_for_network_idle(timeout=3, idle_ms=500)`.

Visual: `capture_screenshot(label="...", full=False, max_dim=None)`, `screenshot()`, `screenshot_clip(label, x, y, w, h)`, `note(caption)`.

Escape hatches: `js(expression)` (auto-wraps top-level `return`), `cdp("Domain.method", **params)` (raw CDP), `cdp_batch(calls)`, `drain_events()`.

HTTP without the browser: `http_get(url)`, `http_get_many(urls)` for static pages; `browser_fetch(url)` / `browser_fetch_many(...)` to fetch with the page's cookies/session.

Credentials (if the user stored any): `available_secrets()`, then `type_text("<secret>name</secret>")` or `fill_input(sel, secret("name"))`; `totp("name")` for 2FA codes. Values are placeholder-substituted — you never see them. `is_logged_out()`, `email_inbox()` / `email_message(id)` for email-code flows.

Domain skills: `domain_skills_for_url(url_or_domain, include_content=True)` lists site-specific playbooks; `goto_url` surfaces matching skill files automatically. Read them before inventing selectors or flows on a complex site.

## Browser control plane

```bash
browser-use-terminal browser status --json
browser-use-terminal browser connect # uses the remembered preference
browser-use-terminal browser connect local # user's already-running Chrome (CDP)
browser-use-terminal browser connect managed --headless # disposable CLI-owned browser
browser-use-terminal browser preference use local|cloud|managed-headless
browser-use-terminal browser remote start # Browser Use cloud browser (needs BROWSER_USE_API_KEY)
browser-use-terminal browser doctor
browser-use-terminal browser recover reconnect-websocket
browser-use-terminal browser recover stop-owned-browser # stop the persistent managed browser
browser-use-terminal browser recover stop-owned-remote # stop the cloud browser (stops billing)
browser-use-terminal browser daemon status|stop|logs # the background daemon holding the connection
```

A background daemon (auto-started, one per state dir) holds the CDP connection across your commands, so the browser — and in local mode, Chrome's granted debugging permission — persists between invocations. Managed and cloud browsers also survive daemon restarts; later calls reattach instead of relaunching. Stop browsers with the recover commands above when the user is done (cloud browsers bill until stopped or timed out).

- `exec` auto-connects, so you rarely need these. Reach for them when `status` shows a problem or the user asks for a specific browser.
- If output JSON says `status: "needs-user-action"` (e.g. pick a Chrome profile, click Allow in Chrome's permission popup, enable the remote-debugging checkbox), show the `user_prompt` to the user verbatim and wait — do not guess.
- Auth wall mid-task: stop and ask the user. Don't type credentials from screenshots; use stored secrets if available.
- Connecting to the user's real Chrome requires a one-time setup: `chrome://inspect/#remote-debugging` → tick "Allow remote debugging". `browser local setup` walks the user through it.

## What actually works

- Screenshots first: `capture_screenshot()` → view the image → decide whether you need a click, a selector, or more navigation.
- Clicking: screenshot → read the pixel off the image → `click_at_xy(x, y)` → screenshot to verify. Suppress the locate-then-click reflex — no getBoundingClientRect, no selector hunts. Hit-testing happens in Chrome's browser process, so coordinate clicks pass through iframes / shadow DOM / cross-origin without extra work.
- Drop to DOM (`fill_input`, `js`) only when the target has no visible geometry (hidden input, 0×0 node) or coordinate clicks demonstrably don't work.
- Bulk static pages: `http_get_many(urls)` — no browser needed. Logged-in pages: `browser_fetch(url)` rides the real session.
- After goto: `wait_for_load()`. SPAs report `complete` before they render — follow with `wait_for_element(...)`.
- Wrong/stale tab: `ensure_real_tab()`.
- Verification: `print(page_info())` is the cheapest "is this alive?" check; screenshots are the default way to verify visible actions.

## Gotchas (field-tested)

- CDP target order ≠ Chrome's visible tab-strip order.
- Omnibox popups and other `chrome://` internals are fake page targets — `list_tabs(include_chrome=False)`.
- `page_info()` surfaces an open JS dialog as `{"dialog": ...}` — handle it (`cdp("Page.handleJavaScriptDialog", accept=True)`) before anything else.
- Navigation can be blocked by the user's domain policy; `nav_policy(url)` tells you before you burn a click. A blocked navigation is policy, not a bug — tell the user.
- Scripts time out (default 300s): keep each `exec` small and observable rather than one mega-script. Long extraction loops: print progress as you go — stdout is captured even on timeout.
- Prefer compositor-level actions over framework hacks. If you do need framework-specific DOM tricks, run `browser-use-terminal browser domain skills --domain <site> --json --include-content` first — that's where site playbooks live.
Loading
Loading