Open-source AI agent desktop sandbox.
Give any AI agent a clean Linux + Chromium environment over HTTP. Take screenshots, send clicks/keys, install apps, watch the screen. Self-hostable, MCP-native, ~$0.005 per agent action with a vision LLM.
Use it for RPA, agent training, demos, scraping, computer-use evaluation, exploratory smoke runs — anywhere an AI needs a real computer to drive and you can tolerate occasional flake (vision-based UI agents are not deterministic). For green-on-green CI, prefer Playwright/Maestro on apps you can instrument; reach for qdesk when DOM/AX hooks aren't available.
👉 AI assistants (Claude Code, Cursor, Aider, …) — qdesk ships an MCP server with 4 tools you can call directly. See
SKILL.md. Editing this codebase? Read AGENTS.md.
┌──────────────────────────────────────────────────────────┐
│ AI agent (Claude / Gemini / GPT / your own loop) │
└──────────────────────┬───────────────────────────────────┘
│ HTTPS — JSON actions
▼
┌──────────────────────────────────────────────────────────┐
│ qdesk-control (multi-session, SQLite, bearer auth) │
│ └─ POST /v1/sessions → spin up a sandbox │
│ └─ GET .../screenshot │
│ └─ POST .../actions {click,type,key,scroll,drag} │
└──────────────────────┬───────────────────────────────────┘
│ Docker
▼
┌──────────────────────────────────────────────────────────┐
│ qdesk/ubuntu-chrome:dev │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Xvfb (virtual display) │ │
│ │ xfwm4 (window manager) │ │
│ │ Chromium │ │
│ │ qdesk-agentd (HTTP daemon, /screenshot /actions) │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
A single sandbox boots in ~1 second (Docker), responds to HTTP, returns real PNG screenshots and accepts real input events. No app instrumentation needed — the sandbox doesn't know or care what's running inside.
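The four endpoint shapes in the diagram are enough to write a client in any language. A minimal Python sketch (stdlib only; error handling and retries omitted; the open_url request field and session_id response key follow the quickstart curl examples further down this README):

```python
import json
import urllib.request

class QdeskClient:
    """Minimal sketch of a qdesk-control client. Endpoint paths come
    from the architecture diagram; request/response field names follow
    the quickstart curl examples in this README."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip("/")
        self.auth = {"Authorization": f"Bearer {api_key}"}

    def _request(self, method: str, path: str, body=None) -> bytes:
        data = json.dumps(body).encode() if body is not None else None
        headers = dict(self.auth)
        if data is not None:
            headers["Content-Type"] = "application/json"
        req = urllib.request.Request(self.base_url + path, data=data,
                                     headers=headers, method=method)
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def create_session(self, open_url: str) -> str:
        raw = self._request("POST", "/v1/sessions", {"open_url": open_url})
        return json.loads(raw)["session_id"]

    def screenshot(self, session_id: str) -> bytes:
        # raw PNG bytes
        return self._request("GET", f"/v1/sessions/{session_id}/screenshot")

    def action(self, session_id: str, **action) -> bytes:
        # e.g. action(sid, type="click", x=500, y=300)
        return self._request("POST", f"/v1/sessions/{session_id}/actions", action)

    def close(self, session_id: str) -> bytes:
        return self._request("DELETE", f"/v1/sessions/{session_id}")
```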
In addition to the Linux Docker sandbox, qdesk now ships a Mac host mode for AI assistants to drive native macOS apps. v1 targets WeChat.
./scripts/install-mac.sh
qdesk-mac doctor # grants Screen Recording + Accessibility
claude mcp add --transport stdio qdesk-mac -- /usr/local/bin/qdesk-mac

The MCP tools live under wechat.*: screenshot, click, type, key,
scroll, ensure_foreground, open_chat. wechat.type automatically
falls back to clipboard paste for non-ASCII text. See
examples/wechat-reply.md.
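The ASCII-vs-clipboard routing rule is simple enough to restate as code. A sketch of the decision (the function name is ours and the actual check inside qdesk-mac may differ; the documented behavior is "clipboard paste for non-ASCII text"):

```python
def needs_clipboard_paste(text: str) -> bool:
    """Decide whether a type call should fall back to clipboard paste:
    any character outside 7-bit ASCII (CJK text, emoji, accented
    letters) takes the paste path, since some apps' input controls
    drop synthetic unicode key events."""
    return any(ord(ch) > 0x7F for ch in text)
```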
v1 limitations: macOS 14+, single WeChat instance, action calls require WeChat to be the foreground app, screenshots are full-screen (includes other apps' windows). No code signing — TCC may re-prompt after rebuild.
Run qdesk-mac as an HTTP server so a client on another machine can
drive your Mac's WeChat. Same MCP JSON-RPC dispatch, just over HTTP.
export QDESK_MAC_API_KEY=$(openssl rand -hex 32)
qdesk-mac --listen 127.0.0.1:8765 --api-key "$QDESK_MAC_API_KEY"
# In another shell or another machine (with --listen 0.0.0.0:8765 + reverse proxy):
curl -X POST http://127.0.0.1:8765/mcp \
-H "Authorization: Bearer $QDESK_MAC_API_KEY" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

While running in HTTP mode, qdesk-mac forks caffeinate -di to keep the
Mac awake and the display unlocked (suppress with --no-caffeinate).
The Mac must remain logged in and unlocked — macOS Secure Event
Input blocks all synthetic keyboard events from the lock screen, so
nothing in qdesk-mac can unlock the screen for you.
Endpoints:
- GET /health — no auth, returns {"ok": true}. Liveness probe.
- POST /mcp — bearer auth required. Body is one JSON-RPC request, response is one JSON-RPC response.
Clients that prefer SSE framing (Claude Desktop's legacy SSE
transport, some Cursor builds) get a single-event stream instead of
plain JSON when they send Accept: text/event-stream. No new
endpoint, no streaming behavior change.
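Each POST /mcp body is a single JSON-RPC request. Beyond the tools/list call shown above, invoking a tool uses the standard MCP tools/call method; a sketch of building one such body (the argument names here are illustrative, not taken from qdesk-mac's tool schemas):

```python
import json

def mcp_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Build one JSON-RPC request body for POST /mcp, using the
    standard MCP tools/call method."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

# Example: click at (100, 200); the {"x": ..., "y": ...} shape is assumed.
body = mcp_tool_call(1, "wechat.click", {"x": 100, "y": 200})
```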
Hardening before exposing to the public internet:
- Front with TLS (caddy / nginx / tailscale serve). qdesk-mac speaks plain HTTP.
- Use a strong --api-key (32+ random bytes). It's the only auth.
- Bind to 127.0.0.1 and tunnel via SSH or Tailscale, OR bind to 0.0.0.0 only behind a reverse proxy with ACLs. Don't expose the raw port.
# 1. Bind only to your Tailscale IP, restrict to the Tailscale CGNAT range:
qdesk-mac --listen $(tailscale ip -4):8765 \
--api-key "$QDESK_MAC_API_KEY" \
--trusted-cidr 100.64.0.0/10
# 2. (optional) Front with `tailscale serve` for HTTPS + identity:
tailscale serve --bg --https=443 http://localhost:8765
qdesk-mac --listen 127.0.0.1:8765 \
--api-key "$QDESK_MAC_API_KEY" \
--trusted-cidr 127.0.0.0/8 \
--trust-tailscale-headers

--trusted-cidr rejects any connection whose source IP is outside the
listed ranges (multiple comma-separated, e.g. 100.64.0.0/10,10.0.0.0/8).
X-Forwarded-For is honored only when the immediate peer is loopback,
so a remote attacker can't spoof their source. --trust-tailscale-headers
logs Tailscale-User-Login for every request — only enable when you
front qdesk-mac with tailscale serve, otherwise an attacker can fake
the headers.
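The source-IP logic described above can be sketched in a few lines. This is an illustration of the documented behavior (reject peers outside the trusted CIDRs; honor X-Forwarded-For only when the immediate peer is loopback), not the actual Go implementation:

```python
import ipaddress

def effective_client_ip(peer_ip: str, xff, trusted_cidrs: str) -> str:
    """Return the client IP to attribute the request to, or raise
    PermissionError if it falls outside every trusted range."""
    client = ipaddress.ip_address(peer_ip)
    if client.is_loopback and xff:
        # Only a local reverse proxy can set X-Forwarded-For;
        # take the first hop it recorded.
        client = ipaddress.ip_address(xff.split(",")[0].strip())
    ranges = [ipaddress.ip_network(c) for c in trusted_cidrs.split(",")]
    if not any(client in net for net in ranges):
        raise PermissionError(f"{client} outside trusted CIDRs")
    return str(client)
```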
A single Go binary, qdesk-win.exe, exposes the same shape as
qdesk-mac --listen for a Windows host. No sidecar — Win32 syscalls
happen directly in Go, so deploying is just one file.
# Cross-compile from your dev box
make win-build
# Deploy to a Windows host over SSH (OpenSSH server enabled)
QDESK_WIN_HOST=Administrator@your-windows-host ./scripts/install-win.sh
# Open inbound port 8765 in Windows Defender Firewall (one-time)
ssh "$QDESK_WIN_HOST" 'powershell New-NetFirewallRule -DisplayName qdesk-win -Direction Inbound -Action Allow -Protocol TCP -LocalPort 8765'
# Launch qdesk-win in the user's INTERACTIVE session — see "session
# isolation" below for why this isn't just `Start-Process`.
KEY=$(openssl rand -hex 32)
ssh "$QDESK_WIN_HOST" "schtasks /create /TN qdesk-win-runner /TR \"C:\\Users\\Administrator\\qdesk-win.exe --listen 0.0.0.0:8765 --api-key $KEY\" /SC ONCE /ST 23:59 /RL HIGHEST /F /RU Administrator /IT"
ssh "$QDESK_WIN_HOST" "schtasks /run /TN qdesk-win-runner"

Tools live under windows.*: front_app, activate, screenshot,
click, type, key, scroll, clipboard_paste. Each action accepts
an optional expected_exe guard (basename, case-insensitive) that
refuses the call if a different exe is in front. windows.type
auto-routes non-ASCII text through the clipboard fallback (mirrors the
macOS WeChat 4.x finding that some apps' input controls drop synthetic
unicode events).
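The expected_exe guard amounts to a case-insensitive basename comparison against the foreground process's executable path. A sketch of the documented rule (function name ours, not the real implementation):

```python
import ntpath  # Windows-style path handling; works on any OS

def foreground_guard(front_exe_path: str, expected_exe=None) -> bool:
    """Return True if the action may proceed: either no guard was
    given, or the foreground exe's basename matches case-insensitively."""
    if expected_exe is None:
        return True
    return ntpath.basename(front_exe_path).lower() == expected_exe.lower()
```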
curl -X POST http://your-windows-host:8765/mcp \
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

Windows Server (and locked Windows desktops) isolate processes by
logon session. SSH into a Windows Server lands in Session 0
(services), which has no desktop — GetForegroundWindow returns
NULL, BitBlt fails, GUI apps started there are invisible. The qdesk
GUI tools require an active interactive logon: a connected RDP
session or a physical-console login. Disconnected RDP sessions count
as logged in but have no display device, so screenshots and
SetForegroundWindow still fail there.
That's why the launch step uses schtasks /IT — it routes the
process into the user's interactive session. Keep an RDP session
connected while qdesk-win serves traffic. If you need full
unattended operation, use a physical-console autologon with the
screensaver/lock disabled, or look at v1.x service-mode on the
roadmap.
- Primary monitor only. Multi-monitor support is v1.x.
- No UIA / accessibility tree. Per the design doc, the v1 approach is screenshot + coordinate input; UIA is deferred and likely won't help apps that paint via DirectX/Skia (Electron, Office, Slack).
- actually_foreground may be false. Windows refuses SetForegroundWindow from non-foreground processes; the tool reports honestly and the caller decides whether to retry.
- No service install / autostart in v1 (the schtasks recipe above is a manual trigger, not a service).
- No code signing. SmartScreen may warn the first time the .exe runs; click "More info → Run anyway".
See docs/superpowers/specs/2026-05-07-windows-host-mode-design.md
for the full design and docs/superpowers/plans/2026-05-07-windows-host-mode.md
for the implementation plan.
Both Mac and Windows host-mode MCP servers expose 8 mirror-image tools
({platform}.front_app, activate, screenshot, click, type,
key, scroll, clipboard_paste). If you're an AI agent picking
which tool to call, read docs/agents/host-mode-tools.md
once — it covers coordinate systems (Mac=logical points, Windows=physical
pixels), the foreground-guard pattern, ASCII-vs-clipboard auto-routing,
and the canonical screenshot-→-vision-→-action loop with cost figures.
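The coordinate-system difference matters when mapping vision-model output (pixel positions on the screenshot) back into click calls. A hypothetical helper, assuming Mac screenshots are captured at physical Retina resolution with a 2.0 scale factor — the authoritative rules live in docs/agents/host-mode-tools.md:

```python
def screenshot_to_click(px: float, py: float, platform: str,
                        mac_scale: float = 2.0):
    """Map screenshot pixel coordinates to what each host-mode click
    tool expects: Mac tools take logical points (divide by the display
    scale factor, assumed 2.0 here), Windows tools take physical
    pixels (pass through unchanged)."""
    if platform == "mac":
        return (px / mac_scale, py / mac_scale)
    return (px, py)
```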
qdesk gives you a primitive: an AI-controllable Linux desktop. What you build on top is up to you.
Give Claude / Gemini / your own agent a real computer to drive. Open URLs, click, type, observe, decide. Same shape as Anthropic Computer Use, but you control the computer and it's open source.
Describe a flow in English; an AI agent attempts it and reports back.
Best for: production canvas-rendered apps (Figma, Excalidraw, custom
Flutter painters), legacy desktop apps with no UI tree, "did this even
load?" smoke checks. Not for: tight CI gates — vision-based agents
mis-click ~5–20 % of the time depending on UI density and model
(gemini-2.5-pro is more reliable than flash but ~10× the cost). For
deterministic testing on apps you can instrument, use the right tool:
Playwright for web DOM,
Maestro for mobile,
flutter-skill for Flutter
in dev, and reach for qdesk only when those don't apply.
# tests/login.qdesk.yaml — smoke check, NOT a CI gate
name: "Landing → sign in (smoke)"
url: http://host.docker.internal:8888
goal: Click "Get started" on the welcome page.
expect:
  - The screen shows "Sign in" near the top-left.

$ qdesk run tests/login.qdesk.yaml
✅ PASS (3 step(s), 41s, ~$0.005)
📄 report: file:///.../report.html

For high-friction targets (apps with dense localized UI, ambiguous
search results, custom widgets) plain "screenshot + vision + click"
is unreliable regardless of model. The path that's empirically working
in this repo is app-specific composite tools that bypass vision
on the hard step — see wechat.open_chat for the pattern (cmd+f →
paste → return packaged as one tool, no LLM in the middle).
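A composite tool is just a fixed primitive sequence executed with no LLM deciding intermediate steps. A sketch of what an open_chat-style tool might expand to — the exact keystrokes and argument schemas are assumptions based on the description above, not the real implementation:

```python
def open_chat_steps(contact: str) -> list:
    """Expand one composite 'open chat' call into primitive tool calls
    (cmd+f -> type/paste -> return), bypassing vision on the hard step."""
    return [
        ("wechat.key", {"key": "f", "modifiers": ["cmd"]}),  # open in-app search
        ("wechat.type", {"text": contact}),                  # non-ASCII auto-pastes
        ("wechat.key", {"key": "return", "modifiers": []}),  # accept top result
    ]
```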
Site has anti-bot measures? Or just complex JS? Drive a real Chromium with an LLM choosing each action. Slower than Playwright but handles cases DOM-based scrapers can't (canvas, dynamic flows, captchas you have to visually decide on).
Internal Linux desktop tools, legacy ERP web frontends, intranet
dashboards, anything that's "only a UI." This is where vision-based
driving genuinely earns its complexity: the alternative is a human
clicking through it, so a 90 % success rate with retry-on-fail is a
huge win, not a weakness. apt install the app inside the sandbox
image (or use mac.* / windows.* host modes for native apps), then
have the AI drive it.
"Show me how to use Photoshop / Figma / our internal admin panel." AI performs the steps in the sandbox while recording — generate animated guides without manual screen-recording.
Record trajectories (screenshots + actions + outcomes) of agents performing tasks. Use as supervised fine-tuning data or as eval harness.
Test how a frontier model handles unusual UIs, multi-step flows, or adversarial pages. Sandbox is disposable — agent can't escape into your host.
There's a healthy 2026 ecosystem. qdesk doesn't try to win every cell.
| | qdesk | Browserbase | E2B | Maestro | flutter-skill | Playwright |
|---|---|---|---|---|---|---|
| Scope | Full Linux desktop | Browser only | Code execution + light desktop | Mobile + web testing | Flutter dev (Dart VM) | Web DOM automation |
| Open source | ✅ Apache 2.0 | ❌ SaaS | Open-core | ✅ | ✅ | ✅ |
| Self-hostable | ✅ | ❌ | Partial | ✅ | ✅ | ✅ |
| MCP server | ✅ built-in | Via partners | ✅ | ✅ | ✅ | Via plugins |
| Funded | ❌ open-source side project | $66M | $20M+ | $$$ | community | F500-funded |
| Best at | Self-hosted desktop sandbox | Managed browser sandbox | Cloud code+browser | Mobile testing | Flutter dev iteration | Web DOM CI |
| Worst at | Speed (LLM screenshot loop) | Anything outside browser | Pure desktop apps | Canvas-only assertions | Production builds | Canvas content |
TL;DR: qdesk is the "open-source self-hosted desktop sandbox" cell —
plus host-mode adapters for the user's Mac (qdesk-mac) and Windows
(qdesk-win) machines for cases where the target is the real desktop,
not a fresh container. If you want managed cloud and only browser →
Browserbase. If you want mobile-specific testing → Maestro. If you want
fast Flutter dev loops → flutter-skill. If you want a real computer your
AI can drive — open-source, self-hostable, MCP-native — and you accept
that vision-based driving is best-effort rather than deterministic, qdesk.
1. Build the sandbox image (~1 min on warm cache):
docker build -t qdesk/ubuntu-chrome:dev -f images/ubuntu-chrome/Dockerfile .

2. Build / install binaries:
make build && sudo make install
# Or one-line via the GitHub release:
curl -fsSL https://raw.githubusercontent.com/jackwangfeng/qdesk/main/scripts/install.sh | bash

Binaries:
- qdesk-agentd — runs inside each sandbox (HTTP daemon)
- qdesk-control — multi-session control plane
- qdesk — CLI runner (testing use case)
- qdesk-mcp — MCP server for AI assistants
3. Run the control plane (one terminal):
export QDESK_DEV_KEY=$(openssl rand -hex 16)
qdesk-control --listen 127.0.0.1:8090 --dev-key "$QDESK_DEV_KEY" \
  --image qdesk/ubuntu-chrome:dev

4. Drive a sandbox via plain HTTP (no LLM):
# Spin up a session
SESSION=$(curl -s -X POST http://127.0.0.1:8090/v1/sessions \
-H "Authorization: Bearer $QDESK_DEV_KEY" \
-H "Content-Type: application/json" \
-d '{"open_url":"https://example.com"}' | jq -r .session_id)
# Take a screenshot
curl http://127.0.0.1:8090/v1/sessions/$SESSION/screenshot \
-H "Authorization: Bearer $QDESK_DEV_KEY" \
--output /tmp/screen.png
# Click somewhere
curl -X POST http://127.0.0.1:8090/v1/sessions/$SESSION/actions \
-H "Authorization: Bearer $QDESK_DEV_KEY" \
-H "Content-Type: application/json" \
-d '{"type":"click","x":500,"y":300}'
# Tear down
curl -X DELETE http://127.0.0.1:8090/v1/sessions/$SESSION \
  -H "Authorization: Bearer $QDESK_DEV_KEY"

5. Or: use it as an AI testing tool with the bundled runner:
export GEMINI_API_KEY=AIza...
qdesk run --control http://127.0.0.1:8090 examples/recompdaily-landing.qdesk.yaml

6. Or: register with Claude Code via MCP:
claude mcp add --transport stdio qdesk -- qdesk-mcp \
--control http://127.0.0.1:8090 \
--api-key "$QDESK_DEV_KEY" \
  --gemini-key "$GEMINI_API_KEY"

After this, Claude Code can call qdesk_screenshot, qdesk_quick_test,
etc. naturally inside a project.
See docs/TEAM_QUICKSTART.md for the full
team-onboarding flow (5 minutes).
pkg/protocol/ wire types — Action, Session, ActionResult, ...
pkg/client/ Go SDK for qdesk-control HTTP API
internal/agentd/ in-sandbox HTTP daemon (Xvfb-driven)
internal/control/ control plane: sessions, runtime, auth, proxy
internal/llm/ VisionAgent backends (Gemini default; Claude/GPT pluggable)
internal/runner/ .qdesk parser, agent loop, HTML report (testing use case)
cmd/qdesk-agentd/ binary that runs INSIDE each sandbox
cmd/qdesk-control/ control plane binary (one per host / cluster)
cmd/qdesk/ CLI runner for testing
cmd/qdesk-mcp/ MCP server for AI assistants
images/ubuntu-chrome/ Dockerfile + entrypoint for default sandbox
docs/superpowers/ design specs and implementation plans
docs/TEAM_QUICKSTART.md team onboarding (5 min)
.claude/skills/ Claude Code skill bundle
SKILL.md integration guide for AI assistants
AGENTS.md conventions for AI assistants editing qdesk itself
make help # all targets
make build test smoke # build + unit tests + e2e smoke
go test ./... # unit tests only

Pure-Go runtime; the only third-party deps are modernc.org/sqlite (pure Go)
and gopkg.in/yaml.v3.
- ✅ v0.1 — sandbox + control plane + Gemini agent loop + HTML report + MCP server. Verified end-to-end on a real Flutter Web app.
- 🔄 v0.2 (planned) — replay mode + self-heal traces, web UI for the control plane, browser cookies/auth persistence, GPU sandbox template.
- 🔮 v0.3+ (planned) — Android emulator template, macOS/iOS simulator template (positioned as agent sandboxes, not testing-only), Firecracker microVM, agent trajectory recording for training data.
Adjacent tools, all worth knowing:
- Managed AI sandbox SaaS — Browserbase (browser only), E2B (code+light desktop), Anchor Browser, Hyperbrowser
- Testing — mobile — Maestro (mobile + web, MCP-native, gold standard)
- Testing — Flutter dev iteration — flutter-skill (Dart VM in-process)
- Testing — web DOM — Playwright (DOM automation), Cypress
- AI agent computer use — Anthropic Computer Use (agent side; qdesk provides the computer)
- Browser-as-agent libraries — Browser Use, Stagehand, Skyvern
If you want managed, polished, well-funded — pick one of those for your specific need. If you want open source, self-hosted, full-Linux, that you can fork and shape — that's qdesk.
Apache 2.0 — see LICENSE.