@atomicbotai/computer-use-mcp

Give your AI agent real eyes — not just blurry screenshots.

An MCP server that turns any MCP-compatible client into a desktop operator — with native OCR that actually reads what's on screen.

Why not just screenshots?

Most computer-use tools send a downscaled screenshot to the model and hope it figures out where to click. That works for large buttons. It fails everywhere else.

The problem: when a 2560×1600 screen is downscaled to fit a model's image input limits, small text becomes unreadable. Button labels, menu items, form placeholders, and status bar text all blur into noise. The agent guesses coordinates and misses: it clicks the wrong button, types in the wrong field, or loops retrying the same failed action.

The solution: this server pairs every screenshot with native OCR that extracts text with pixel-precise coordinates. Instead of guessing, the agent gets a structured map of the UI:

Full-resolution screenshot captured (2560×1600) with grid overlay.
OCR anchors: "Send" at (1450, 890); "Cancel" at (1300, 890); "Subject" at (200, 145)
OCR layout: [e1 "Send" center=(1450, 890) box=(1400, 875, 100x30)] ...

The agent knows exactly where "Send" is. No guessing. No retries. No wasted tokens on failed clicks.
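As an illustration of how a client could consume this output, here is a minimal sketch that parses the `OCR layout` line into objects and looks up a click target by label. The `OcrElement` type and both function names are hypothetical, and the line format is inferred only from the sample above, so the real output may differ.

```typescript
// Hypothetical shape for one OCR entry; field names are illustrative only.
interface OcrElement {
  id: string;
  text: string;
  center: { x: number; y: number };
  box: { x: number; y: number; width: number; height: number };
}

// Parses entries such as: [e1 "Send" center=(1450, 890) box=(1400, 875, 100x30)]
// The format is assumed from the sample output, not from a published spec.
function parseOcrLayout(line: string): OcrElement[] {
  const entry = /\[(\w+) "([^"]*)" center=\((\d+), (\d+)\) box=\((\d+), (\d+), (\d+)x(\d+)\)\]/g;
  const elements: OcrElement[] = [];
  let m: RegExpExecArray | null;
  while ((m = entry.exec(line)) !== null) {
    elements.push({
      id: m[1],
      text: m[2],
      center: { x: Number(m[3]), y: Number(m[4]) },
      box: { x: Number(m[5]), y: Number(m[6]), width: Number(m[7]), height: Number(m[8]) },
    });
  }
  return elements;
}

// Returns the center of the first element whose text matches the label exactly.
function findClickTarget(
  elements: OcrElement[],
  label: string
): { x: number; y: number } | undefined {
  for (const el of elements) {
    if (el.text === label) return el.center;
  }
  return undefined;
}
```

With the sample line above, `findClickTarget(parseOcrLayout(line), "Send")` yields the center `(1450, 890)`, which can be passed straight to a `click` action.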

Screenshot-only vs Screenshot + OCR

	Screenshot only	Screenshot + OCR
Small text (12px labels)	Unreadable after downscale	Extracted with exact coordinates
Click accuracy	~60-70% on complex UIs	~95%+ with anchor-guided clicks
Retry loops	Common (3-5 attempts per action)	Rare
Token cost	High (retries burn tokens)	Low (first-attempt success)
Complex forms	Struggles with similar-looking fields	Identifies each field by label text
Status/error text	Misses or hallucinates content	Reads actual text with confidence scores

How OCR works under the hood

The `screenshot_full` action:

1. Takes a full-resolution screenshot (no downscaling)
2. Runs OCR via the native platform engine: zero dependencies, no API keys, fully offline
3. Deduplicates and sorts text elements by reading order (top→bottom, left→right)
4. Returns anchor points with coordinates in screenshot-image space
5. Generates a structured prompt the model can act on immediately
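The deduplicate-and-sort step can be sketched as follows. This is not the library's implementation: the `TextBox` type, the 2-pixel duplicate threshold, and the 10-pixel row tolerance are all assumptions chosen for illustration. Boxes are grouped into rows by vertical center, then each row is ordered left to right.

```typescript
// Hypothetical OCR result shape; x/y are the top-left corner in image pixels.
interface TextBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

function centerY(b: TextBox): number {
  return b.y + b.height / 2;
}

// Drops near-identical detections, then orders boxes top-to-bottom, left-to-right.
function sortReadingOrder(boxes: TextBox[], rowTolerance = 10): TextBox[] {
  // Deduplicate: same text at (almost) the same position counts as one element.
  const unique: TextBox[] = [];
  for (const b of boxes) {
    const isDuplicate = unique.some(
      (u) => u.text === b.text && Math.abs(u.x - b.x) <= 2 && Math.abs(u.y - b.y) <= 2
    );
    if (!isDuplicate) unique.push(b);
  }

  // Group into rows: walk boxes by vertical center and start a new row
  // whenever the center moves more than rowTolerance from the row's first box.
  const byY = unique.slice().sort((a, b) => centerY(a) - centerY(b));
  const rows: TextBox[][] = [];
  for (const box of byY) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(centerY(box) - centerY(last[0])) <= rowTolerance) {
      last.push(box);
    } else {
      rows.push([box]);
    }
  }

  // Within each row, order left to right.
  const ordered: TextBox[] = [];
  for (const row of rows) {
    row.sort((a, b) => a.x - b.x);
    ordered.push(...row);
  }
  return ordered;
}
```

Grouping into rows before sorting avoids a tolerance-based comparator, which would not be a consistent ordering when several boxes sit at slightly different heights.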

Platform	OCR Engine	Setup required
macOS	Apple Vision framework	None — uses `xcrun swift`
Windows	Windows.Media.Ocr	None — built-in UWP API
Linux	—	Not available yet (graceful no-op)

All processing is local and offline. No data leaves the machine.

Quick start

Run with npx (zero install)

npx @atomicbotai/computer-use-mcp

Or install globally

npm install -g @atomicbotai/computer-use-mcp
computer-use-mcp

Connect to your MCP client

Add the following to your MCP client's config file (Claude Desktop, Cursor, Windsurf, Cline, or any other MCP-compatible client):

{
  "mcpServers": {
    "computer-use": {
      "command": "npx",
      "args": ["@atomicbotai/computer-use-mcp"],
      "env": {
        "COMPUTER_USE_OVERLAY_ENABLED": "1",
        "COMPUTER_USE_OVERLAY_LABEL": "AI Agent"
      }
    }
  }
}

All 19 actions

The server exposes a single MCP tool called `computer`:

Action	What it does
`screenshot`	Capture screen (auto-downscaled with grid overlay)
`screenshot_full`	Full-resolution capture + OCR text anchors
`click` / `double_click` / `triple_click`	Click at coordinates
`type`	Insert literal text
`press`	Keyboard shortcuts (`cmd+s`, `enter`, etc.)
`submit_input`	Press Enter to submit
`scroll`	Scroll in any direction
`cursor_position`	Get cursor location
`mouse_move`	Move cursor
`drag`	Drag and drop
`wait`	Pause (max 30s)
`hold_key`	Hold key combo (max 10s)
`display_list`	List connected displays
`read_clipboard` / `write_clipboard`	Clipboard access
`open_app` / `switch_app`	Launch or focus apps by name

Coordinates are automatically mapped from screenshot image space to real screen points — the agent works in screenshot coordinates, the library handles the translation.
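The source does not show how the mapping is computed. A plausible minimal sketch, assuming a simple linear scale between the screenshot's pixel size and the display's size in points, plus an optional display origin for multi-monitor setups, looks like this. The `imageToScreen` function and its parameters are hypothetical.

```typescript
interface Size {
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Maps a point in screenshot-image pixels to screen points by linear scaling.
// `origin` is the display's top-left corner in global screen space (assumed 0,0).
function imageToScreen(
  p: Point,
  image: Size,
  screen: Size,
  origin: Point = { x: 0, y: 0 }
): Point {
  return {
    x: origin.x + Math.round((p.x * screen.width) / image.width),
    y: origin.y + Math.round((p.y * screen.height) / image.height),
  };
}
```

For example, if a 2560×1600 full-resolution screenshot comes from a display that reports 1280×800 points (an assumed 2x scale factor), the "Send" anchor at (1450, 890) maps to screen point (725, 445).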

Configuration

All configuration is done through environment variables:

Variable	Default	Description
`COMPUTER_USE_OVERLAY_ENABLED`	`1`	Set to `0` to disable the "agent active" overlay
`COMPUTER_USE_OVERLAY_COLOR`	`00BFFF`	Overlay color (hex without `#`)
`COMPUTER_USE_OVERLAY_LABEL`	`Atomic bot`	Text shown on the overlay
`COMPUTER_USE_DEBUG_ARTIFACTS`	`0`	Set to `1` to save screenshots, OCR, and results per action
`COMPUTER_USE_DEBUG_DIR`	`./computer-use-debug`	Directory for debug output
`COMPUTER_USE_LOCK_DIR`	`~/.atomic-computeruse`	Directory for session lock file
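To make the defaults concrete, the sketch below resolves these variables into a typed object. The `ComputerUseConfig` type and `readConfig` function are hypothetical and are not part of the package. How the server treats values other than `0` and `1` is not documented here, so the parsing rules in the comments are assumptions.

```typescript
// Hypothetical config shape mirroring the table above.
interface ComputerUseConfig {
  overlayEnabled: boolean;
  overlayColor: string;
  overlayLabel: string;
  debugArtifacts: boolean;
  debugDir: string;
  lockDir: string;
}

// Resolves documented defaults from an environment-like object.
function readConfig(env: Record<string, string | undefined>): ComputerUseConfig {
  return {
    // Assumption: the overlay stays on unless the value is exactly "0".
    overlayEnabled: (env.COMPUTER_USE_OVERLAY_ENABLED ?? "1") !== "0",
    overlayColor: env.COMPUTER_USE_OVERLAY_COLOR ?? "00BFFF",
    overlayLabel: env.COMPUTER_USE_OVERLAY_LABEL ?? "Atomic bot",
    // Assumption: debug artifacts are written only when the value is exactly "1".
    debugArtifacts: env.COMPUTER_USE_DEBUG_ARTIFACTS === "1",
    debugDir: env.COMPUTER_USE_DEBUG_DIR ?? "./computer-use-debug",
    // Note: "~" is not expanded here; a real implementation would resolve it.
    lockDir: env.COMPUTER_USE_LOCK_DIR ?? "~/.atomic-computeruse",
  };
}
```

In a Node.js process this would be called as `readConfig(process.env)`.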

Safety features

Session lock — prevents two agents from controlling the desktop simultaneously
Visual overlay — native "agent active" indicator so you always know when automation is running
Guardrails — blocks misclicks in system dock/launcher and dangerous submit zones
Debug artifacts — save every screenshot, OCR result, and action output for post-mortem analysis
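The session lock is described only at a high level. One common way to get this behavior is an exclusively created lock file, sketched below with Node.js `fs`. This is an assumption about the general technique, not the package's actual code: `acquireSessionLock` and the `session.lock` file name are hypothetical, and stale-lock recovery is left out.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: returns a release function on success,
// or null if another session already holds the lock.
function acquireSessionLock(
  lockDir: string,
  fileName = "session.lock"
): (() => void) | null {
  fs.mkdirSync(lockDir, { recursive: true });
  const lockPath = path.join(lockDir, fileName);
  try {
    // The "wx" flag creates the file and fails with EEXIST if it already exists,
    // so only one process can win the race.
    const fd = fs.openSync(lockPath, "wx");
    fs.writeSync(fd, String(process.pid));
    fs.closeSync(fd);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return null;
    throw err;
  }
  // Releasing the lock simply removes the file.
  return () => fs.unlinkSync(lockPath);
}
```

A caller would acquire the lock before the first action and call the returned function on shutdown. A production version would also need to detect locks left behind by crashed processes, for example by checking whether the recorded PID is still alive.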

Programmatic usage

import { createServer } from "@atomicbotai/computer-use-mcp";

const server = createServer();
// Connect your own MCP transport

Platform support

Feature	macOS	Windows	Linux
Screenshot + actions	✅	✅	✅
OCR	✅ Vision	✅ Media OCR	—
Overlay	✅ Swift	✅ PowerShell	—
Drag (native)	✅	✅	✅ fallback

Built on

@atomicbotai/computer-use — the core desktop automation library with standalone OCR, actions, overlay, and more
@modelcontextprotocol/sdk — official MCP SDK

License

MIT
