jackwangfeng/qdesk

qdesk

Open-source AI agent desktop sandbox.

Give any AI agent a clean Linux + Chromium environment over HTTP. Take screenshots, send clicks/keys, install apps, watch the screen. Self-hostable, MCP-native, ~$0.005 per agent action with a vision LLM.

Use it for RPA, agent training, demos, scraping, computer-use evaluation, exploratory smoke runs — anywhere an AI needs a real computer to drive and you can tolerate occasional flake (vision-based UI agents are not deterministic). For green-on-green CI, prefer Playwright/Maestro on apps you can instrument; reach for qdesk when DOM/AX hooks aren't available.

👉 AI assistants (Claude Code, Cursor, Aider, …) — qdesk ships an MCP server with 4 tools you can call directly. See SKILL.md. Editing this codebase? Read AGENTS.md.


What it gives you

┌──────────────────────────────────────────────────────────┐
│  AI agent (Claude / Gemini / GPT / your own loop)        │
└──────────────────────┬───────────────────────────────────┘
                       │ HTTPS — JSON actions
                       ▼
┌──────────────────────────────────────────────────────────┐
│  qdesk-control  (multi-session, SQLite, bearer auth)     │
│      └─ POST /v1/sessions  → spin up a sandbox           │
│      └─ GET  .../screenshot                              │
│      └─ POST .../actions  {click,type,key,scroll,drag}   │
└──────────────────────┬───────────────────────────────────┘
                       │ Docker
                       ▼
┌──────────────────────────────────────────────────────────┐
│  qdesk/ubuntu-chrome:dev                                 │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Xvfb (virtual display)                             │  │
│  │ xfwm4 (window manager)                             │  │
│  │ Chromium                                           │  │
│  │ qdesk-agentd (HTTP daemon, /screenshot /actions)   │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

A single sandbox boots in ~1 second (Docker), responds to HTTP, returns real PNG screenshots and accepts real input events. No app instrumentation needed — the sandbox doesn't know or care what's running inside.


Mac host mode (alpha) — control your local WeChat

In addition to the Linux Docker sandbox, qdesk now ships a Mac host mode for AI assistants to drive native macOS apps. v1 targets WeChat.

./scripts/install-mac.sh
qdesk-mac doctor   # grants Screen Recording + Accessibility
claude mcp add --transport stdio qdesk-mac -- /usr/local/bin/qdesk-mac

The MCP tools live under wechat.*: screenshot, click, type, key, scroll, ensure_foreground, open_chat. wechat.type automatically falls back to clipboard paste for non-ASCII text. See examples/wechat-reply.md.
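A minimal sketch of that routing rule, with hypothetical helper names (the real logic lives inside qdesk-mac):

```python
def needs_clipboard(text: str) -> bool:
    """True when the text contains characters that synthetic key events may drop."""
    return not text.isascii()

def route_type_action(text: str) -> str:
    # wechat.type sends synthetic keystrokes for plain ASCII and falls back
    # to clipboard paste for everything else (CJK, emoji, accented text).
    return "clipboard_paste" if needs_clipboard(text) else "synthetic_keys"
```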

v1 limitations: macOS 14+, single WeChat instance, action calls require WeChat to be the foreground app, screenshots are full-screen (includes other apps' windows). No code signing — TCC may re-prompt after rebuild.

Remote mode (HTTP transport)

Run qdesk-mac as an HTTP server so a client on another machine can drive your Mac's WeChat. Same MCP JSON-RPC dispatch, just over HTTP.

export QDESK_MAC_API_KEY=$(openssl rand -hex 32)
qdesk-mac --listen 127.0.0.1:8765 --api-key "$QDESK_MAC_API_KEY"
# In another shell or another machine (with --listen 0.0.0.0:8765 + reverse proxy):
curl -X POST http://127.0.0.1:8765/mcp \
  -H "Authorization: Bearer $QDESK_MAC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

While running in HTTP mode, qdesk-mac forks caffeinate -di to keep the Mac awake and the display unlocked (suppress with --no-caffeinate). The Mac must remain logged in and unlocked — macOS Secure Event Input blocks all synthetic keyboard events from the lock screen, so nothing in qdesk-mac can unlock the screen for you.

Endpoints:

  • GET /health — no auth, returns {"ok": true}. Liveness probe.
  • POST /mcp — bearer auth required. Body is one JSON-RPC request, response is one JSON-RPC response.

Clients that prefer SSE framing (Claude Desktop's legacy SSE transport, some Cursor builds) get a single-event stream instead of plain JSON when they send Accept: text/event-stream. No new endpoint, no streaming behavior change.
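A client consuming /mcp can handle both framings with one small decoder. This is a sketch under the single-event assumption described above (the function name is illustrative):

```python
import json

def parse_mcp_response(content_type: str, body: str) -> dict:
    """Decode a /mcp response that is either plain JSON or one SSE event.

    Assumes the single-event framing described above: exactly one
    'data: <json>' line when the server answers with text/event-stream.
    """
    if content_type.startswith("text/event-stream"):
        for line in body.splitlines():
            if line.startswith("data:"):
                return json.loads(line[len("data:"):].strip())
        raise ValueError("no data event in SSE body")
    return json.loads(body)
```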

Hardening before exposing to the public internet:

  • Front with TLS (caddy / nginx / tailscale serve). qdesk-mac speaks plain HTTP.
  • Use a strong --api-key (32+ random bytes). It's the only auth.
  • Bind to 127.0.0.1 and tunnel via SSH or Tailscale, OR bind to 0.0.0.0 only behind a reverse proxy with ACLs. Don't expose the raw port.

Tailscale setup (recommended)

# 1. Bind only to your Tailscale IP, restrict to the Tailscale CGNAT range:
qdesk-mac --listen $(tailscale ip -4):8765 \
          --api-key "$QDESK_MAC_API_KEY" \
          --trusted-cidr 100.64.0.0/10

# 2. (optional) Front with `tailscale serve` for HTTPS + identity:
tailscale serve --bg --https=443 http://localhost:8765
qdesk-mac --listen 127.0.0.1:8765 \
          --api-key "$QDESK_MAC_API_KEY" \
          --trusted-cidr 127.0.0.0/8 \
          --trust-tailscale-headers

--trusted-cidr rejects any connection whose source IP falls outside the listed ranges (comma-separate multiple ranges, e.g. 100.64.0.0/10,10.0.0.0/8). X-Forwarded-For is honored only when the immediate peer is loopback, so a remote attacker can't spoof their source address. --trust-tailscale-headers logs Tailscale-User-Login for every request — enable it only when qdesk-mac sits behind tailscale serve; otherwise an attacker can forge the headers.


Windows host mode (alpha) — drive a Windows machine over HTTP

A single Go binary, qdesk-win.exe, exposes the same shape as qdesk-mac --listen for a Windows host. No sidecar — Win32 syscalls happen directly in Go, so deploying is just one file.

# Cross-compile from your dev box
make win-build

# Deploy to a Windows host over SSH (OpenSSH server enabled)
QDESK_WIN_HOST=Administrator@your-windows-host ./scripts/install-win.sh

# Open inbound port 8765 in Windows Defender Firewall (one-time)
ssh "$QDESK_WIN_HOST" 'powershell New-NetFirewallRule -DisplayName qdesk-win -Direction Inbound -Action Allow -Protocol TCP -LocalPort 8765'

# Launch qdesk-win in the user's INTERACTIVE session — see "session
# isolation" below for why this isn't just `Start-Process`.
KEY=$(openssl rand -hex 32)
ssh "$QDESK_WIN_HOST" "schtasks /create /TN qdesk-win-runner /TR \"C:\\Users\\Administrator\\qdesk-win.exe --listen 0.0.0.0:8765 --api-key $KEY\" /SC ONCE /ST 23:59 /RL HIGHEST /F /RU Administrator /IT"
ssh "$QDESK_WIN_HOST" "schtasks /run /TN qdesk-win-runner"

Tools live under windows.*: front_app, activate, screenshot, click, type, key, scroll, clipboard_paste. Each action accepts an optional expected_exe guard (basename, case-insensitive) that refuses the call if a different exe is in front. windows.type auto-routes non-ASCII text through the clipboard fallback (mirrors the macOS WeChat 4.x finding that some apps' input controls drop synthetic unicode events).

curl -X POST http://your-windows-host:8765/mcp \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

Windows session isolation (must read)

Windows Server (and locked Windows desktops) isolate processes by logon session. SSH into a Windows Server lands in Session 0 (services), which has no desktop: GetForegroundWindow returns NULL, BitBlt fails, and GUI apps started there are invisible. The qdesk GUI tools require an active interactive logon: a connected RDP session or a physical-console login. Disconnected RDP sessions count as logged in but have no display device, so screenshots and SetForegroundWindow still fail there.

That's why the launch step uses schtasks /IT — it routes the process into the user's interactive session. Keep an RDP session connected while qdesk-win serves traffic. If you need full unattended operation, use a physical-console autologon with the screensaver/lock disabled, or look at v1.x service-mode on the roadmap.

v1 limitations

  • Primary monitor only. Multi-monitor support is v1.x.
  • No UIA / accessibility tree. Per the design doc, the v1 approach is screenshot + coordinate input; UIA is deferred and likely won't help apps that paint via DirectX/Skia (Electron, Office, Slack).
  • actually_foreground may be false. Windows refuses SetForegroundWindow from non-foreground processes; the tool reports honestly and the caller decides whether to retry.
  • No service install / autostart in v1 (the schtasks recipe above is a manual trigger, not a service).
  • No code signing. SmartScreen may warn the first time the .exe runs; click "More info → Run anyway".

See docs/superpowers/specs/2026-05-07-windows-host-mode-design.md for the full design and docs/superpowers/plans/2026-05-07-windows-host-mode.md for the implementation plan.


For the LLM driving these tools

Both Mac and Windows host-mode MCP servers expose 8 mirror-image tools ({platform}.front_app, activate, screenshot, click, type, key, scroll, clipboard_paste). If you're an AI agent picking which tool to call, read docs/agents/host-mode-tools.md once — it covers coordinate systems (Mac=logical points, Windows=physical pixels), the foreground-guard pattern, ASCII-vs-clipboard auto-routing, and the canonical screenshot → vision → action loop with cost figures.
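The coordinate-space difference matters when a vision model reads pixel positions off a screenshot. A simplified sketch (it assumes the screenshot is captured at physical resolution and a 2.0 Retina backing scale; verify both per display):

```python
def to_click_coords(px_x: float, px_y: float, platform: str, scale: float = 2.0):
    """Map a pixel coordinate measured on the screenshot to tool coordinates.

    windows.* tools take physical pixels (pass through); mac tools take
    logical points, so divide by the display's backing scale factor.
    """
    if platform == "mac":
        return px_x / scale, px_y / scale
    return px_x, px_y
```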


Use cases

qdesk gives you a primitive: an AI-controllable Linux desktop. What you build on top is up to you.

🤖 1. AI agent computer use

Give Claude / Gemini / your own agent a real computer to drive. Open URLs, click, type, observe, decide. Same shape as Anthropic Computer Use, but you control the computer and it's open source.

🧪 2. Exploratory smoke runs on un-instrumentable UIs

Describe a flow in English; an AI agent attempts it and reports back. Best for: production canvas-rendered apps (Figma, Excalidraw, custom Flutter painters), legacy desktop apps with no UI tree, "did this even load?" smoke checks. Not for: tight CI gates — vision-based agents mis-click ~5–20 % of the time depending on UI density and model (gemini-2.5-pro is more reliable than flash but ~10× the cost). For deterministic testing on apps you can instrument, use the right tool: Playwright for web DOM, Maestro for mobile, flutter-skill for Flutter in dev, and reach for qdesk only when those don't apply.

# tests/login.qdesk.yaml — smoke check, NOT a CI gate
name: "Landing → sign in (smoke)"
url: http://host.docker.internal:8888
goal: Click "Get started" on the welcome page.
expect:
  - The screen shows "Sign in" near the top-left.

$ qdesk run tests/login.qdesk.yaml
✅ PASS  (3 step(s), 41s, ~$0.005)
📄 report: file:///.../report.html

For high-friction targets (apps with dense localized UI, ambiguous search results, custom widgets) plain "screenshot + vision + click" is unreliable regardless of model. The path that's empirically working in this repo is app-specific composite tools that bypass vision on the hard step — see wechat.open_chat for the pattern (cmd+f → paste → return packaged as one tool, no LLM in the middle).
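The composite-tool pattern is easy to express. This sketch uses an injected send function (hypothetical signature) so the three primitive actions are packaged as one deterministic call:

```python
def open_chat(send, contact: str) -> None:
    """Composite tool: cmd+f -> paste -> return packaged as one call, no LLM
    in the middle. `send` performs one primitive action; its dict shape here
    is illustrative, not qdesk-mac's wire format.
    """
    send({"tool": "wechat.key", "keys": "cmd+f"})              # focus the search box
    send({"tool": "wechat.clipboard_paste", "text": contact})  # paste the contact name
    send({"tool": "wechat.key", "keys": "return"})             # open the top result
```

Recording the actions instead of executing them (`log = []; open_chat(log.append, "Alice")`) is also how such a composite can be unit-tested without a live desktop.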

🌐 3. AI-driven web automation / scraping

Site has anti-bot measures? Or just complex JS? Drive a real Chromium with an LLM choosing each action. Slower than Playwright but handles cases DOM-based scrapers can't (canvas, dynamic flows, captchas you have to visually decide on).

🦾 4. RPA / driving apps that have no API (the sweet spot)

Internal Linux desktop tools, legacy ERP web frontends, intranet dashboards, anything that's "only a UI." This is where vision-based driving genuinely earns its complexity: the alternative is a human clicking through it, so a 90 % success rate with retry-on-fail is a huge win, not a weakness. apt install the app inside the sandbox image (or use mac.* / windows.* host modes for native apps), then have the AI drive it.

🎬 5. Demos / tutorials

"Show me how to use Photoshop / Figma / our internal admin panel." AI performs the steps in the sandbox while recording — generate animated guides without manual screen-recording.

📊 6. Agent training data / eval

Record trajectories (screenshots + actions + outcomes) of agents performing tasks. Use as supervised fine-tuning data or as eval harness.

🔍 7. Computer-use eval / red-teaming

Test how a frontier model handles unusual UIs, multi-step flows, or adversarial pages. Sandbox is disposable — agent can't escape into your host.


How it compares (honest table)

There's a healthy 2026 ecosystem. qdesk doesn't try to win every cell.

|               | qdesk | Browserbase | E2B | Maestro | flutter-skill | Playwright |
|---------------|-------|-------------|-----|---------|---------------|------------|
| Scope         | Full Linux desktop | Browser only | Code execution + light desktop | Mobile + web testing | Flutter dev (Dart VM) | Web DOM automation |
| Open source   | ✅ Apache 2.0 | ❌ SaaS | Open-core | | | |
| Self-hostable | ✅ | ❌ | Partial | | | |
| MCP server    | ✅ built-in | Via partners | Via plugins | | | |
| Funded        | ❌ open-source side project | $66M | $20M+ | $$$ | community | F500-funded |
| Best at       | Self-hosted desktop sandbox | Managed browser sandbox | Cloud code+browser | Mobile testing | Flutter dev iteration | Web DOM CI |
| Worst at      | Speed (LLM screenshot loop) | Anything outside browser | Pure desktop apps | Canvas-only assertions | Production builds | Canvas content |

TL;DR: qdesk is the "open-source self-hosted desktop sandbox" cell — plus host-mode adapters for the user's Mac (qdesk-mac) and Windows (qdesk-win) machines for cases where the target is the real desktop, not a fresh container. If you want managed cloud and only browser → Browserbase. If you want mobile-specific testing → Maestro. If you want fast Flutter dev loops → flutter-skill. If you want a real computer your AI can drive — open-source, self-hostable, MCP-native — and you accept that vision-based driving is best-effort rather than deterministic, qdesk.


Quickstart

1. Build the sandbox image (~1 min on warm cache):

docker build -t qdesk/ubuntu-chrome:dev -f images/ubuntu-chrome/Dockerfile .

2. Build / install binaries:

make build && sudo make install
# Or one-line via the GitHub release:
curl -fsSL https://raw.githubusercontent.com/jackwangfeng/qdesk/main/scripts/install.sh | bash

Binaries:

  • qdesk-agentd — runs inside each sandbox (HTTP daemon)
  • qdesk-control — multi-session control plane
  • qdesk — CLI runner (testing use case)
  • qdesk-mcp — MCP server for AI assistants

3. Run the control plane (one terminal):

export QDESK_DEV_KEY=$(openssl rand -hex 16)
qdesk-control --listen 127.0.0.1:8090 --dev-key "$QDESK_DEV_KEY" \
              --image qdesk/ubuntu-chrome:dev

4. Drive a sandbox via plain HTTP (no LLM):

# Spin up a session
SESSION=$(curl -s -X POST http://127.0.0.1:8090/v1/sessions \
    -H "Authorization: Bearer $QDESK_DEV_KEY" \
    -H "Content-Type: application/json" \
    -d '{"open_url":"https://example.com"}' | jq -r .session_id)

# Take a screenshot
curl http://127.0.0.1:8090/v1/sessions/$SESSION/screenshot \
    -H "Authorization: Bearer $QDESK_DEV_KEY" \
    --output /tmp/screen.png

# Click somewhere
curl -X POST http://127.0.0.1:8090/v1/sessions/$SESSION/actions \
    -H "Authorization: Bearer $QDESK_DEV_KEY" \
    -H "Content-Type: application/json" \
    -d '{"type":"click","x":500,"y":300}'

# Tear down
curl -X DELETE http://127.0.0.1:8090/v1/sessions/$SESSION \
    -H "Authorization: Bearer $QDESK_DEV_KEY"

5. Or: use it as an AI testing tool with the bundled runner:

export GEMINI_API_KEY=AIza...
qdesk run --control http://127.0.0.1:8090 examples/recompdaily-landing.qdesk.yaml

6. Or: register with Claude Code via MCP:

claude mcp add --transport stdio qdesk -- qdesk-mcp \
    --control http://127.0.0.1:8090 \
    --api-key "$QDESK_DEV_KEY" \
    --gemini-key "$GEMINI_API_KEY"

After this, Claude Code can call qdesk_screenshot, qdesk_quick_test, etc. naturally inside a project.

See docs/TEAM_QUICKSTART.md for the full team-onboarding flow (5 minutes).


Layout

pkg/protocol/         wire types — Action, Session, ActionResult, ...
pkg/client/           Go SDK for qdesk-control HTTP API
internal/agentd/      in-sandbox HTTP daemon (Xvfb-driven)
internal/control/     control plane: sessions, runtime, auth, proxy
internal/llm/         VisionAgent backends (Gemini default; Claude/GPT pluggable)
internal/runner/      .qdesk parser, agent loop, HTML report (testing use case)
cmd/qdesk-agentd/     binary that runs INSIDE each sandbox
cmd/qdesk-control/    control plane binary (one per host / cluster)
cmd/qdesk/            CLI runner for testing
cmd/qdesk-mcp/        MCP server for AI assistants
images/ubuntu-chrome/ Dockerfile + entrypoint for default sandbox
docs/superpowers/     design specs and implementation plans
docs/TEAM_QUICKSTART.md  team onboarding (5 min)
.claude/skills/       Claude Code skill bundle
SKILL.md              integration guide for AI assistants
AGENTS.md             conventions for AI assistants editing qdesk itself

Local development

make help                    # all targets
make build test smoke        # build + unit tests + e2e smoke
go test ./...                # unit tests only

Pure-Go runtime; the only third-party deps are modernc.org/sqlite (pure Go) and gopkg.in/yaml.v3.

Status

  • v0.1 — sandbox + control plane + Gemini agent loop + HTML report + MCP server. Verified end-to-end on a real Flutter Web app.
  • 🔄 v0.2 (planned) — replay mode + self-heal traces, web UI for the control plane, browser cookies/auth persistence, GPU sandbox template.
  • 🔮 v0.3+ (planned) — Android emulator template, macOS/iOS simulator template (positioned as agent sandboxes, not testing-only), Firecracker microVM, agent trajectory recording for training data.

Related projects

Adjacent tools, all worth knowing: Browserbase, E2B, Maestro, flutter-skill, Playwright (see the comparison table above).

If you want managed, polished, well-funded — pick one of those for your specific need. If you want open source, self-hosted, full-Linux, that you can fork and shape — that's qdesk.

License

Apache 2.0 — see LICENSE.
