Autonomous agent that operates a virtualized Ubuntu desktop to automate workflows in systems that expose no API. It reads screenshots, decides the next action, and executes mouse clicks and keyboard input. The project applies Anthropic's Computer Use capability to back-office automation.
Industrial use case: RPA for legacy systems without APIs, back-office automation in banking/insurance.
Receives a task in natural language ("extract Q3 sales from the legacy system and save to CSV"). Operates an Ubuntu VM in a loop: capture a screenshot → Claude with the computer_use tool decides the next action → xdotool executes it → repeat until the task is complete or the max step count is reached.
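The capture → decide → execute cycle described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `run_task`, `LoopResult`, the injected `capture`/`decide`/`execute` callables, the default of 25 steps, and the convention that `decide` returns `None` when the task is complete are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    completed: bool                      # True if the model signalled completion
    steps: int                           # number of actions actually executed
    actions: list = field(default_factory=list)


def run_task(
    task: str,
    capture: Callable[[], bytes],                          # e.g. wraps scrot
    decide: Callable[[str, bytes, list], Optional[dict]],  # e.g. wraps the Claude call
    execute: Callable[[dict], None],                       # e.g. wraps xdotool
    max_steps: int = 25,                                   # assumed default
) -> LoopResult:
    """Run capture -> decide -> execute until completion or max_steps."""
    actions: list = []
    for _ in range(max_steps):
        screenshot = capture()
        action = decide(task, screenshot, actions)
        if action is None:  # hypothetical convention: None means "task complete"
            return LoopResult(True, len(actions), actions)
        execute(action)
        actions.append(action)
    # Step budget exhausted without a completion signal.
    return LoopResult(False, len(actions), actions)
```

Passing the three stages in as callables keeps the loop testable without a VM: a scripted `decide` can stand in for the model, and `execute` can simply record what it was asked to do.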
Task: "Extract Q3 sales from legacy system"
│
▼
[Planner] Claude → initial plan
│
▼
ACTION LOOP:
├──► [Screenshot capture] scrot on Xvfb
├──► [Vision + Reasoning] Claude with computer_use tool
│ → click(x, y) | type(text) | key(combo) | scroll
├──► [Action Executor] xdotool
├──► [Verifier] Claude — "did the step complete? unexpected dialog?"
├──► [State Updater] LangGraph state with screenshot history
└──► continue until task_complete or max_steps
│
▼
[Final Validator] Claude → does output satisfy original task?
│
▼
[Audit Log] PostgreSQL
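The state that the loop carries between steps might look something like the sketch below. It uses only the standard library as a stand-in and is not the LangGraph state API. The `AgentState` name, its fields, and the bounded history (`history_size`) are assumptions; the pipeline above only says that screenshot history is kept in state.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentState:
    task: str
    max_steps: int = 25        # assumed default
    history_size: int = 5      # assumed cap on retained screenshots
    step: int = 0
    task_complete: bool = False
    screenshots: deque = field(default_factory=deque)
    actions: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Re-create the deque with a maxlen so old screenshots are dropped.
        self.screenshots = deque(self.screenshots, maxlen=self.history_size)

    def record(self, screenshot: bytes, action: dict, verified: bool) -> None:
        """Store one loop iteration: screenshot, action taken, verifier verdict."""
        self.screenshots.append(screenshot)
        self.actions.append({"step": self.step, "action": action, "verified": verified})
        self.step += 1

    def should_continue(self) -> bool:
        """Mirror the diagram's exit condition: task_complete or max_steps."""
        return not self.task_complete and self.step < self.max_steps
```

Capping the screenshot history is one way to keep the prompt size bounded while the full action list remains available for the audit log.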
- Dockerized Ubuntu 22.04 + Xvfb + VNC environment
- xdotool wrappers for click/type/key/scroll
- Screenshot capture with scrot
- LangGraph loop with screenshot history in state
- Action executor with safety filters (no rm -rf, no sudo)
- 20 custom evaluation tasks with ground truth success criteria
- OSWorld benchmark comparison (subset)
- Next.js demo with live VNC viewer
- 5 recorded MP4 videos of task executions
- Safety layer documented (blocked action list)
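As an illustration of the xdotool wrappers and the safety filter listed above, the sketch below turns an action into an xdotool argument list and rejects typed text that matches a blocklist. The action dict shape (`type`, `x`, `y`, `text`, `combo`, `direction`, `amount`), the function names, the 50 ms typing delay, and the regular expressions are all assumptions; the source names only `rm -rf` and `sudo` as blocked.

```python
import re

# Hypothetical blocklist covering the two cases named in the list above.
BLOCKED_PATTERNS = [
    re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+", re.IGNORECASE),  # rm -rf, rm -fr, ...
    re.compile(r"\bsudo\b", re.IGNORECASE),
]

# X11 convention: mouse button 4 scrolls up, button 5 scrolls down.
SCROLL_BUTTONS = {"up": "4", "down": "5"}


class BlockedActionError(ValueError):
    """Raised when an action matches the blocklist."""


def check_action(action: dict) -> None:
    """Reject typed text that matches any blocked pattern."""
    if action.get("type") != "type":
        return
    text = action.get("text", "")
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            raise BlockedActionError(f"blocked text: {text!r}")


def to_xdotool_argv(action: dict) -> list[str]:
    """Translate one action into an xdotool command line (not executed here)."""
    check_action(action)
    kind = action["type"]
    if kind == "click":
        # Move the pointer, then press the left button (button 1).
        return ["xdotool", "mousemove", str(action["x"]), str(action["y"]), "click", "1"]
    if kind == "type":
        return ["xdotool", "type", "--delay", "50", action["text"]]
    if kind == "key":
        return ["xdotool", "key", action["combo"]]
    if kind == "scroll":
        button = SCROLL_BUTTONS[action.get("direction", "down")]
        return ["xdotool", "click", "--repeat", str(action.get("amount", 3)), button]
    raise ValueError(f"unsupported action type: {kind!r}")
```

The returned list can be handed to `subprocess.run(argv, check=True)` with `DISPLAY` pointing at the Xvfb screen. A text-pattern blocklist like this is easy to evade, so it is best read as one layer among several in the safety design.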
| Layer | Technology |
|---|---|
| Computer Use | Anthropic Claude Sonnet 4.5 with computer_20250124 tool |
| VM | Docker container — Ubuntu 22.04 + Xvfb + VNC |
| Screenshots | scrot or gnome-screenshot |
| Input automation | xdotool |
| Orchestration | LangGraph with screenshot history in state |
| Storage | PostgreSQL for action logs, filesystem for screenshots |
| Frontend | Next.js with live VNC viewer |
| Observability | LangSmith with action-by-action traces |
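To show how the Computer Use and VM rows fit together, here is a small sketch that builds the tool entry for the API request and the matching Xvfb command line from the same dimensions. The dict follows Anthropic's published schema for the `computer_20250124` tool as best understood here, so check it against the current documentation (including the required beta flag) before relying on it. The 1024x768 resolution, display number 1, 24-bit depth, and both function names are assumptions.

```python
def computer_tool(width_px: int = 1024, height_px: int = 768, display_number: int = 1) -> dict:
    """Tool entry for the `tools` list of a Messages API request."""
    return {
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": width_px,     # should match the Xvfb screen width
        "display_height_px": height_px,   # should match the Xvfb screen height
        "display_number": display_number,
    }


def xvfb_argv(width_px: int, height_px: int, display_number: int, depth: int = 24) -> list[str]:
    """Command line that starts a virtual X display with the same geometry."""
    return ["Xvfb", f":{display_number}", "-screen", "0", f"{width_px}x{height_px}x{depth}"]
```

Deriving both values from one source avoids a mismatch between the coordinates the model reasons about and the screen that xdotool actually clicks on.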
- Reproducible VM via `docker compose up`
- 20 custom tasks defined with ground truth
- Success rate reported by category (form filling, data extraction, web nav, multi-step)
- Demo lets the user pick a task and watch the agent execute it as a live stream
- 5 recorded videos (MP4 in repo or YouTube unlisted) of real executions
- Safety layer documented (blocked actions list)
Plus the 12 universal DoD blocks.
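The per-category success rate called for above could be computed with something as small as the following. The record shape (a dict with `category` and `success` keys) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def success_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Return the fraction of successful runs for each task category."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in results:
        category = record["category"]
        totals[category] += 1
        passed[category] += int(bool(record["success"]))
    # Every category in totals has at least one run, so no division by zero.
    return {category: passed[category] / totals[category] for category in totals}
```

Each record would come from comparing a task run against its ground-truth success criterion, one record per evaluation task.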
MIT.