GUI automation MCP server powered by local Vision LLM (Ollama)
helix-pilot lets AI agents see and control your Windows desktop through the Model Context Protocol (MCP). It captures screenshots, analyzes them with a local Ollama Vision model, and executes mouse/keyboard actions — all running on your machine with zero cloud API cost.
Most GUI automation tools either require expensive cloud APIs, only support macOS, or run inside VMs. helix-pilot is different:
- 100% local — Runs entirely on your machine via Ollama. No cloud API keys, no per-request charges, no data leaving your PC.
- Windows-native — Direct host OS control via Win32 API. Not a VM, not a container — real desktop automation.
- MCP-native — Built as a first-class MCP server. Works instantly with Claude Code, Codex CLI, Cursor, and any MCP client.
- Vision LLM powered — Uses local vision models (Gemma 3, Mistral Small 3.2, etc.) to understand what's on screen, not brittle selectors.
- Safe by design — Built-in action policies, secret detection, emergency stop, and user activity monitoring.
| Feature | helix-pilot | terminator | UI-TARS Desktop | Peekaboo | Cua |
|---|---|---|---|---|---|
| MCP server (CLI-native) | Yes | No | Partial | Yes | No |
| Windows host direct control | Yes | Yes | Yes | No (macOS) | No (VM) |
| Local Vision LLM (Ollama) | Yes | No | No | Yes | No |
| Zero cloud API cost | Yes | No | No | Yes | No |
| Open WebUI integration | Yes | No | No | No | No |
| Built-in safety system | Yes | Partial | No | No | Partial |
| Open source (MIT) | Yes | Yes | Yes | Yes | Yes |
An AI agent calls helix-pilot tools via MCP: `status()` → `screenshot()` → `describe()` → `auto()`. The Vision LLM analyzes the screen and executes GUI actions autonomously.
helix-pilot captures the screen and sends it to a local Ollama Vision model for analysis. The model identifies windows, UI elements, and layout — all running locally with zero API cost.
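Under the hood, a vision call to Ollama is a plain HTTP POST with the screenshot base64-encoded. A minimal sketch of the request shape — the helper name and prompt text are illustrative, not helix-pilot's internals; only the Ollama `/api/generate` payload format is the real API:

```python
import base64

def build_vision_request(png_bytes: bytes, model: str = "mistral-small3.2:latest") -> dict:
    """Build an Ollama /api/generate payload with one base64-encoded screenshot."""
    return {
        "model": model,
        "prompt": "Describe the windows and UI elements visible in this screenshot.",
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,
    }

# The payload would be POSTed to the configured endpoint, e.g.:
#   httpx.post("http://localhost:11434/api/generate", json=payload, timeout=120.0)
payload = build_vision_request(b"\x89PNG\r\n fake-bytes")
print(sorted(payload))  # → ['images', 'model', 'prompt', 'stream']
```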
Example `status()` output:

```json
{
  "ok": true,
  "helix_pilot_version": "2.0.0",
  "ollama": { "available": true, "endpoint": "http://localhost:11434" },
  "screen_size": [3840, 2160],
  "agent_runtime": { "tracked_agents": 1, "running_agents": 0 },
  "safe_mode": true,
  "visible_windows": ["Claude Code", "Google Chrome", "Windows PowerShell", "..."]
}
```

Pull a vision model:

```shell
ollama pull mistral-small3.2
```

Other supported models: `gemma3:27b`, `llava`, `moondream`, or any Ollama vision model.
```shell
git clone https://github.com/tsunamayo7/helix-pilot.git
cd helix-pilot
uv sync
```

Edit `config/helix_pilot.json`:

```json
{
  "ollama_endpoint": "http://localhost:11434",
  "vision_model": "mistral-small3.2:latest"
}
```

See Compatible MCP Clients below for setup instructions.
helix-pilot works with any MCP-compatible client. Here are tested configurations:
Claude Code
Add to your Claude Code MCP settings (.claude.json or project settings):
```json
{
  "mcpServers": {
    "helix-pilot": {
      "command": "uv",
      "args": ["--directory", "/path/to/helix-pilot", "run", "server.py"]
    }
  }
}
```

Codex CLI
Add to your Codex CLI MCP configuration:
```json
{
  "mcpServers": {
    "helix-pilot": {
      "command": "uv",
      "args": ["--directory", "/path/to/helix-pilot", "run", "server.py"]
    }
  }
}
```

Cursor / Windsurf / VS Code (Copilot)
Add to your editor's MCP settings:
```json
{
  "mcpServers": {
    "helix-pilot": {
      "command": "uv",
      "args": ["--directory", "/path/to/helix-pilot", "run", "server.py"]
    }
  }
}
```

Open WebUI + Ollama (via MCPO)
helix-pilot works with Open WebUI and local Ollama models through MCPO (MCP-to-OpenAPI proxy).
1. Install MCPO:

   ```shell
   pip install mcpo
   ```

2. Create `mcpo_config.json`:

   ```json
   {
     "mcpServers": {
       "helix-pilot": {
         "command": "uv",
         "args": ["--directory", "/path/to/helix-pilot", "run", "server.py"]
       }
     }
   }
   ```

3. Start the proxy:

   ```shell
   mcpo --host 127.0.0.1 --port 8300 --config mcpo_config.json
   ```

4. In Open WebUI: Admin Settings > External Tools > Add Server
   - Type: `OpenAPI`
   - URL: `http://127.0.0.1:8300/helix-pilot`
All 20 tools are now available to any Ollama model with function calling support (e.g. `gemma3:27b`, `qwen3.5:122b`).
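Once MCPO is running, each tool becomes a plain HTTP endpoint. As a hedged sketch — assuming MCPO's usual one-route-per-tool mapping, where the tool name is appended to the server path and arguments go in the JSON body — a client could build the request like this:

```python
def mcpo_tool_url(base: str, server: str, tool: str) -> str:
    """Build the URL MCPO exposes for one MCP tool (POST, JSON body = tool arguments)."""
    return f"{base.rstrip('/')}/{server}/{tool}"

url = mcpo_tool_url("http://127.0.0.1:8300", "helix-pilot", "click")
# e.g. httpx.post(url, json={"x": 640, "y": 360})
print(url)  # → http://127.0.0.1:8300/helix-pilot/click
```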
helix-pilot provides 20 MCP tools for comprehensive GUI automation:
| Tool | Description |
|---|---|
| `screenshot` | Capture a screenshot of the screen or a window |
| `click` | Click at screen coordinates |
| `type_text` | Type text (Unicode supported) |
| `hotkey` | Send a keyboard shortcut (e.g. `ctrl+c`) |
| `scroll` | Scroll the mouse wheel |
| `describe` | Describe screen content via the Vision LLM |
| `find` | Find a UI element by description; returns coordinates |
| `verify` | Verify the screen matches an expected state |
| `status` | Check system status (Ollama, models, screen) |
| `list_windows` | List all visible windows |
| `wait_stable` | Wait until the screen stops changing |
| `auto` | Autonomous multi-step GUI task execution |
| `browse` | Browser-specialized automation |
| `click_screenshot` | Click, then immediately take a screenshot |
| `resize_image` | Resize an image to fit AI model size limits |
| `spawn_pilot_agent` | Launch a background GUI worker with `default` / `explorer` / `worker` roles |
| `send_pilot_agent_input` | Continue the same GUI worker with a follow-up instruction |
| `wait_pilot_agent` | Wait for the current agent turn and fetch the last result |
| `list_pilot_agents` | Inspect tracked background GUI agents |
| `close_pilot_agent` | Close an idle GUI agent |
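The purpose of `resize_image` is to keep screenshots within a vision model's input limits, and the underlying arithmetic is simple proportional scaling. A minimal sketch — the function name and the `max_side` default are illustrative, not helix-pilot's actual implementation or limit:

```python
def fit_within(width: int, height: int, max_side: int = 1568) -> tuple[int, int]:
    """Proportionally shrink (width, height) so the longer side is <= max_side; never upscale."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(3840, 2160))  # 4K screenshot → (1568, 882)
```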
The new lifecycle tools let Claude Code treat helix-pilot as a persistent GUI worker instead of only as one-shot tool calls.
- Use `spawn_pilot_agent` to start a background agent in `auto` or `browse` mode.
- Role presets map naturally to Claude Code delegation: `default` for general execution, `explorer` for observation-first `dry_run` planning, `worker` for direct execution.
- Use `send_pilot_agent_input` to continue the same worker with accumulated GUI context.
- Use `wait_pilot_agent`, `list_pilot_agents`, and `close_pilot_agent` to coordinate long-running desktop tasks.
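The lifecycle above can be pictured as a small state machine. This is a toy stand-in, not the real helix-pilot API — the class, method names, and return format are all hypothetical; the comments map each step to the corresponding MCP tool:

```python
class PilotAgentStub:
    """Toy stand-in for the background-agent lifecycle (hypothetical, for illustration)."""

    def __init__(self, role: str = "default") -> None:   # spawn_pilot_agent
        self.role = role
        self.state = "idle"
        self.history: list[str] = []

    def send(self, instruction: str) -> None:            # send_pilot_agent_input
        self.history.append(instruction)
        self.state = "running"

    def wait(self) -> str:                               # wait_pilot_agent
        self.state = "idle"
        return f"done: {self.history[-1]}"

    def close(self) -> None:                             # close_pilot_agent
        if self.state != "idle":
            raise RuntimeError("only idle agents may be closed")
        self.state = "closed"

agent = PilotAgentStub(role="worker")
agent.send("open the display settings")
print(agent.wait())   # → done: open the display settings
agent.close()
print(agent.state)    # → closed
```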
helix-pilot includes multiple safety layers to protect your system:
- Action policies — configurable per-site allow/deny lists
- Immutable policy — blocks secrets (API keys, tokens) from being typed
- Emergency stop — move mouse to screen corner to abort
- User activity detection — pauses when user is actively using the computer
- Window deny list — prevents interaction with sensitive windows (Task Manager, Security, etc.)
- Execution modes — `observe_only`, `draft_only`, `apply_with_approval`
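The secret-blocking layer can be pictured as a pattern check run before any text is dispatched to the keyboard. A hedged sketch — the patterns and function name are illustrative, not helix-pilot's actual rules, which cover more formats:

```python
import re

# Illustrative patterns only; a real immutable policy covers many more secret formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
]

def looks_like_secret(text: str) -> bool:
    """Return True if the text about to be typed matches a known secret pattern."""
    return any(p.search(text) for p in SECRET_PATTERNS)

print(looks_like_secret("hello world"))     # → False
print(looks_like_secret("sk-" + "a" * 24))  # → True
```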
```
Claude Code / Codex CLI / Cursor      Open WebUI + Ollama
            |                                 |
            | MCP (stdio)                     | HTTP (via MCPO)
            v                                 v
    server.py (FastMCP) <--------------> MCPO proxy (optional)
            |
            v
    HelixPilot (src/pilot.py)
            |
            +-- CoreOperations (PyAutoGUI + Win32 API)
            +-- VisionLLM (Ollama API via httpx)
            +-- SafetyGuard (policies + user monitoring)
            +-- ActionContract (policy evaluation)
```
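The server's dispatch layer amounts to a mapping from tool names to handler functions. This is a structural sketch only — `ToolRegistry` and the handler body are hypothetical; the real server registers tools through FastMCP's decorator API:

```python
from typing import Callable

class ToolRegistry:
    """Minimal structural stand-in for MCP tool registration (not FastMCP's real API)."""

    def __init__(self) -> None:
        self.tools: dict[str, Callable] = {}

    def tool(self, name: str) -> Callable:
        def register(fn: Callable) -> Callable:
            self.tools[name] = fn
            return fn
        return register

server = ToolRegistry()

@server.tool("click")
def click(x: int, y: int) -> dict:
    # A real handler would route through SafetyGuard before touching PyAutoGUI.
    return {"ok": True, "action": "click", "x": x, "y": y}

print(sorted(server.tools))   # → ['click']
print(click(640, 360)["ok"])  # → True
```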
```shell
# Run tests
uv run python -m pytest tests/ -v

# Lint
uv run ruff check .

# Syntax check
uv run python -m py_compile server.py

# Run server directly
uv run python server.py
```

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
- helix-ai-studio — All-in-one AI chat studio with 7 providers, RAG, MCP tools, and pipeline
- helix-agent — Extend Claude Code with local Ollama models — cut token costs by 60-80%
- claude-code-codex-agents — MCP bridge to Codex CLI (GPT-5.4) with structured JSONL traces
- helix-sandbox — Secure sandbox MCP server — Docker + Windows Sandbox
MIT - feel free to use this in your own projects.
If you find helix-pilot useful, please consider giving it a star!
Japanese / 日本語

helix-pilot is an MCP server that lets AI agents operate the Windows desktop using a local Vision LLM (Ollama).

Features:

- No cloud API needed: runs entirely locally via Ollama, with zero API cost
- Windows-native: controls the host OS directly (not a VM)
- MCP-ready: works out of the box with Claude Code, Codex CLI, Cursor, VS Code, and more
- Vision LLM driven: takes screenshots and analyzes/operates the screen with a local vision model
- Safe by design: action controls, secret detection, emergency stop, user activity detection

Quick start:

```shell
ollama pull mistral-small3.2
git clone https://github.com/tsunamayo7/helix-pilot.git
cd helix-pilot && uv sync
```

Just connect an MCP client (such as Claude Code) and 20 GUI automation tools become available. See the English documentation above for detailed setup instructions.

