Current implementation is a sample with API key from deployment via lightning AI for inference
A Sophisticated MVP Platform That Transforms Any API Key into an Autonomous OS-Level Intelligent Agent
DEMO VIDEO (click to play): In the video the Agent was asked to broswe the github profile YADAV1825 and then write about the person in a blank word document here the mdoel even read the Resume PDF on the website with pypdf2 package of python by downloading and running shell commands on prompting and asking permission of user before running!!
---The Opus Desktop AI Agent is a production-ready MVP platform that enables complete desktop automation through keyboard, mouse, and shell command execution. It features self-healing and self-correcting behavior through visual feedback loops, multi-modal task execution, and human-in-the-loop safety mechanisms.
- ✅ Complete desktop automation through keyboard, mouse, and shell command execution
- ✅ Self-healing and self-correcting behavior through visual feedback loops
- ✅ Multi-modal task execution (visual analysis, code execution, file operations)
- ✅ Human-in-the-loop safety mechanisms with user confirmation gates
- ✅ Voice-based task initiation (optional, gracefully falls back to text)
- ✅ Real-time visual verification through automated screenshot analysis
- ✅ Persistent execution logging and history tracking
- ✅ Reflex command optimization for immediate task execution
ROOT ENTRY POINT
└─ main.py
Initializes AgentApp GUI and starts the event loop
CORE LAYER (core/)
├─ agent.py → DesktopAgent orchestrator
├─ llm.py → LightningClient & OllamaClient (dual API support)
├─ prompts.py → Advanced system prompts with 5 command modes
├─ state.py → AgentState & AgentStep data models
TOOLS LAYER (tools/)
├─ toolkit.py → ToolKit dispatcher & router
├─ os_tools.py → System operations (keystrokes, screenshots, apps)
├─ file_tools.py → File operations (read, write, append, list)
├─ code_tools.py → Python code execution with user confirmation
├─ browser_tools.py → Playwright-based browser automation (optional)
├─ real_browser_tools.py → Alternative browser implementation
UI LAYER (ui/)
└─ gui.py → Tkinter-based Picture-in-Picture floating widget
UTILITIES (utils/)
└─ parsers.py → JSON extraction with error tolerance
LOGS (logs/)
└─ Persistent execution history for audit trail
User Input (Voice/Text)
↓
AgentApp GUI (Tkinter Widget)
↓
DesktopAgent.run_task()
↓
LightningClient.chat() / OllamaClient.chat()
↓
LLM Response → JSON Parsing
↓
ToolKit.execute()
↓
├─→ OSTools (keyboard, screenshot)
├─→ FileTools (file operations)
├─→ CodeTools (Python execution)
└─→ BrowserTools (web automation)
↓
Result/Screenshot Feedback
↓
History & State Update
↓
LOOP or DONE
The agent can execute complex multi-step tasks independently:
- Open applications by name or keyboard shortcuts
- Navigate web browsers with keyboard commands
- Compose and send emails via Gmail
- Execute shell commands (with user confirmation gate)
- Run Python scripts inline
- Read and write files
- Search and process web content
- Take screenshots for visual feedback
- Press keyboard combinations for system control
- Type text into any focused window
The agent implements a sophisticated "See-Think-Act-Verify" loop:
Step 1: Screenshot Analysis
- Captures current screen state before actions
- Uses vision_analysis field to describe screen context
- Identifies UI elements, buttons, text, and content
Step 2: Intelligent Action Selection
- Chooses appropriate keyboard shortcuts or mouse actions
- Uses tab navigation to focus target elements
- Adapts based on visual feedback from previous steps
Step 3: Verification Step
- Takes follow-up screenshots to confirm action success
- Compares expected vs. actual state
- Automatically retries with alternative approaches if action fails
Step 4: Adaptive Recovery
- If action fails, agent tries different methods
- Can switch between clicking, typing, keyboard shortcuts
- Learns from failure patterns within same execution
The agent supports 5 distinct command interpretation modes:
MODE 1: REFLEX COMMANDS (Immediate Direct Actions)
- Direct user commands bypass planning
- Examples: "press alt tab", "type hello", "hit enter"
- Skips plan.txt creation
- Response time: <1 second
- Used for: Quick, immediate actions
MODE 2: SCREEN-CONTEXT COMMANDS (Visual Awareness)
- Commands referencing visible screen content
- Examples: "play the song you see", "click that button"
- Mandatory screenshot as first action
- Full vision_analysis field required
- Self-healing via visual feedback
MODE 3: MULTI-STEP PLANNING (Complex Tasks)
- Traditional task execution with numbered steps
- Creates plan.txt with execution strategy
- Updates plan with [DONE] markers
- Used for: Complex multi-step workflows
- Example: "Open YouTube and search for Faded and play it"
MODE 4: EMAIL COMPOSITION (Gmail Specific)
- Dedicated mode for email operations
- Detects account index (/u/0/, /u/1/)
- Fills To, Subject, Body using keyboard
- Uses Ctrl+Enter for sending
- Verifies with screenshot
MODE 5: WINDOW SWITCHING (Alt-Tab Navigation)
- Advanced Task Switcher menu navigation
- Multi-step process with visual verification
- Holds Alt key across multiple steps
- Captures target window selection
- Confirms final window focus
Built-In Safety Gates:
-
User Confirmation for Dangerous Operations
- Shell command execution: Shows command preview, requires user approval
- Python code execution: Shows code snippet, requires confirmation
- Prevents: Accidental system damage, malware-like behavior
-
40-Step Safety Threshold
- After 40 steps, agent pauses and asks user
- Dialog: "I have reached 40 steps. Should I keep going?"
- User can stop at any milestone
- Prevents: Infinite loops, runaway automation
- Timeout: 120 seconds for user response
-
Stop Request Handling
- User can stop agent at any time (⏹ button)
- Agent gracefully handles InterruptedError
- Browser kept open for manual intervention
- Process and file states preserved
-
Command Validation
- JSON output strictly validated
- Malformed responses detected and logged
- Timeouts on subprocess operations (90 seconds for shell, 60 for Python)
- Keyboard input sanitization
-
Vision-Based Verification
- Actions verified through screenshot analysis
- Prevents blind command execution
- Automatic fallback if action doesn't produce expected result
Dual API Support:
1. LIGHTNING AI (Primary)
- Cloud-based API endpoint
- Authentication: Bearer {api_key}/{teamspace}
- Model: User-specified (default: lightning-ai/gpt-oss-120b)
- Supports streaming responses
- Handles both 'choices' and 'candidates' response formats
2. OLLAMA (Local/Alternative)
- Local model server support
- Base URL configurable
- Useful for: Privacy, offline operation, local models
- Fallback when API unavailable
ENVIRONMENT VARIABLES:
- LIGHTNING_API_KEY: Your Lightning AI authentication token
- LIGHTNING_TEAMSPACE: Your project path (e.g., "username/project")
- MODEL_NAME: Specific model to use
- Gracefully degrades if env vars missing
DesktopAgent Class - Primary orchestrator for the autonomous execution loop
Key Methods:
_is_direct_command()- Detects reflex and screen-context commands_make_messages()- Constructs conversation messages for LLM_call_planner()- Calls LLM to get next action (JSON)_ask_continue()- Thread-safe dialog box after 40 stepsrun_task()- Main execution loop
OllamaClient: Local model server support with streaming LightningClient: Cloud-based API with structured output support
Advanced system prompts containing:
- 13 total available actions
- Output schema specifications
- Reflex command handling
- Screen-context command guidelines
- Human emulation techniques
- Advanced automation techniques
Central command router that dispatches actions to appropriate tools (OS, File, Code, Browser)
- press_key() - Single keys, combos, modifier holds, repeated keys
- type_text() - Safe character input to focused window
- take_screenshot() - PIL ImageGrab for cross-platform compatibility
- open_app() - Application launching with special handling
- open_url() - URL navigation in default browser
- search_web() - Google search queries
- read_url() - HTTP GET with timeout and HTML cleaning
- shell() - Shell command execution with user confirmation
- read_file() - Read file contents with truncation
- write_file() - Create/overwrite files with directory creation
- append_file() - Append to files
- list_dir() - Directory listing with sorting
- run_python() - Sandboxed Python execution with user confirmation
AgentApp Class - Tkinter Picture-in-Picture Widget
Features:
- Floating window (always on top)
- Compact size: 400x190 pixels
- Dark theme (Material Design inspired)
- Voice input support (optional)
- Real-time logging
- Focus management
- History browser
Task: Search for GitHub user and write comprehensive document
[Step 1] Creating plan...
ACTION: write_file
REASON: Creating execution plan
[Step 2] Planning...
ACTION: browser_open
REASON: Gathering information from GitHub
[Step 3-5] Information gathering...
ACTION: browser_read
REASON: Extract profile and repository information
[Step 7-9] Opening Word...
ACTION: press_key (win) → type_text (Word) → press_key (enter)
REASON: Launch Word application
[Step 10-14] Document creation...
ACTION: take_screenshot → press_key (enter) → type_text (content)
REASON: Create blank document and populate with information
[Step 15] Completion...
ACTION: done
REASON: Task complete - all information gathered and documented
TOTAL EXECUTION TIME: ~2 minutes
STEPS COMPLETED: 15
MODE: Multi-step planning with screen verification
Task: "press alt tab"
[Step 1] Executing...
ACTION: press_key
REASON: Reflex: user said press alt tab
RESULT: Pressed key combo: alt+tab
[Step 2]
ACTION: done
REASON: Reflex command completed
MODE: Reflex (direct command bypass)
EXECUTION TIME: <1 second
Task: "play the song you see on the screen"
[Step 1]
ACTION: take_screenshot
REASON: Screen-context command: need to see what's on screen first
[Step 2]
ACTION: press_key
INPUT: space
VISION_ANALYSIS: YouTube with "Faded - Alan Walker" video page
REASON: Playing the visible video using spacebar
[Step 3]
ACTION: take_screenshot
REASON: Verify the song started playing
[Step 4]
ACTION: done
REASON: Verified: Video is now playing
MODE: Screen-context (visual awareness)
EXECUTION TIME: ~5 seconds
┌─────────────────────────────────────────┐
│ AgentApp GUI (Tkinter) │
│ ┌────────────────────────────────────┐ │
│ │ 🎤 [task entry field] ▶ 📋 ⏹ │ │
│ └────────────────────────────────────┘ │
└────────┬────────────────────────────────┘
│
│ Voice/Text Input
↓
┌─────────────────────────────────────────┐
│ start_agent() │
│ └─ Creates agent_thread │
│ └─ run_task(state) │
└─────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ 1. CHECK STOP REQUEST │
│ If stop_requested → break │
└────────┬─────────────────────────────────────────┘
│
┌────────↓─────────────────────────────────────────┐
│ 2. 40-STEP SAFETY CHECK │
│ If turn % 40 == 0 → ask user to continue │
└────────┬─────────────────────────────────────────┘
│
┌────────↓─────────────────────────────────────────┐
│ 3. DETECT COMMAND MODE │
│ ├─ Reflex? → Skip planning │
│ ├─ Screen-context?→ Screenshot first │
│ └─ Complex? → enforce_plan = True │
└────────┬─────────────────────────────────────────┘
│
┌────────↓─────────────────────────────────────────┐
│ 4. CALL PLANNER (LLM) │
│ ├─ Prepare messages with history │
│ ├─ Include screenshots if available │
│ ├─ Post to Lightning/Ollama API │
│ └─ Parse JSON response │
└────────┬─────────────────────────────────────────┘
│
┌────────↓─────────────────────────────────────────┐
│ 5. EXTRACT ACTION │
│ ├─ If action == "done" → break │
│ ├─ Else → prepare for execution │
│ └─ Log action, reason, vision_analysis │
└────────┬─────────────────────────────────────────┘
│
┌────────↓─────────────────────────────────────────┐
│ 6. EXECUTE ACTION │
│ ├─ toolkit.execute(action) │
│ ├─ Route to OS/File/Code/Browser tools │
│ ├─ Apply user confirmations if needed │
│ └─ Capture result/screenshot │
└────────┬─────────────────────────────────────────┘
│
┌────────↓─────────────────────────────────────────┐
│ 7. UPDATE STATE │
│ ├─ Add step to history │
│ ├─ Update task with result context │
│ ├─ Log result │
│ └─ Increment turn counter │
└────────┬─────────────────────────────────────────┘
│
└─────────────────→ LOOP TO STEP 1
┌──────────────────────────────────────┐
│ toolkit.execute(action_dict) │
└────────┬─────────────────────────────┘
│
├─→ action == "open_app"
│ └─→ OSTools.open_app(name)
│
├─→ action == "press_key"
│ └─→ OSTools.press_key(key)
│
├─→ action == "take_screenshot"
│ └─→ OSTools.take_screenshot()
│ Returns: [SCREENSHOT_TAKEN: path]
│
├─→ action == "shell"
│ ├─→ Show confirmation dialog
│ └─→ OSTools.shell(cmd)
│
├─→ action == "run_python"
│ ├─→ Show confirmation dialog
│ └─→ CodeTools.run_python(code)
│
├─→ action == "read_file"
│ └─→ FileTools.read_file(path)
│
├─→ action == "write_file"
│ └─→ FileTools.write_file(path, content)
│
└─→ action == "done"
└─→ Return "done"
Task Initiation:
Voice Recognition: 2-5 seconds (depends on speech length)
Text Entry: <1 second
Agent Thread Start: <100ms
Planning Phase (First Step):
LLM Inference: 3-8 seconds (depends on model, system load)
JSON Parsing: <100ms
Action Execution:
Keyboard Operations: 50-200ms
Screenshot: 1500-2000ms (includes 1.5s delay for stability)
File Operations: <100ms
Shell Commands: 1-60 seconds (variable)
Verification Phase:
Screenshot Analysis: 2-5 seconds (LLM time)
Decision Making: 3-5 seconds (LLM time)
Total Loop Time (average):
Simple Action: 8-15 seconds
Complex Action: 15-30 seconds
Multi-step Task: 5-10 minutes (depending on task complexity)
Throughput:
Maximum steps per minute: 4-8 steps/min
Effective task completion rate: 1-2 complex tasks per hour
Memory:
Baseline: ~80MB (GUI + core)
Per 100 steps history: ~10MB
Screenshot buffer: ~5MB
Total typical: 100-150MB
CPU:
Idle: <1% (event-driven)
During LLM inference: 20-40% (single core)
During image capture: 10-20%
Network:
Per API call: ~50KB (varies with screenshot size)
Screenshots: 200KB-2MB per image
Typical session: 2-10MB data transfer
Disk:
Logs per run: 50KB-500KB
Screenshots: 1-3MB per execution
Retention: Depends on log cleanup frequency
THREAT 1: Malicious shell commands
├─ MITIGATION: User confirmation gate on all shell() calls
├─ PREVIEW: Command preview (150 chars) shown before execution
├─ TIMEOUT: 90-second hard limit prevents long-running exploits
└─ AUDIT: Full command logged in execution history
THREAT 2: Runaway infinite loops
├─ MITIGATION: 40-step safety threshold
├─ USER CHECK: Dialog every 40 steps asking to continue
├─ TIMEOUT: 120-second timeout for user response
├─ EARLY STOP: "done" action checked immediately
└─ LOG: Step count tracked and reported
THREAT 3: Accidental file deletion
├─ MITIGATION: No automatic deletion (write, append, read only)
├─ SCOPE: File operations limited to user directories
├─ VISIBILITY: All file operations logged
└─ VERIFICATION: No wildcard deletion patterns
THREAT 4: Unauthorized code execution
├─ MITIGATION: run_python requires user confirmation
├─ PREVIEW: Code snippet preview (200 chars) shown
├─ TIMEOUT: 60-second limit on Python execution
├─ SANDBOX: Subprocess isolation
└─ OUTPUT: Stderr/stdout captured separately
THREAT 5: Browser-based CSRF/XSS
├─ MITIGATION: Browser tools optional (graceful fallback)
├─ SCOPE: Limited to browser automation, no DOM access
├─ ISOLATION: Separate browser instance
└─ LOG: All browser actions logged
THREAT 6: Stop request not working
├─ MITIGATION: Stop check on every iteration
├─ SUBPROCESS: Process.kill() on stop
├─ LLM: Response stream interrupted
└─ FEEDBACK: InterruptedError caught and logged
- Data Collection: Screenshots sent to LLM for vision analysis
- History Logging: All steps logged to disk in ./logs/ directory
- Network Transmission: HTTPS encryption used for API calls
- Recommendations: Review sensitive data before screenshots, use local Ollama for sensitive tasks
- Python 3.8+
- API key (Lightning AI or Ollama)
-
Clone or download the project
-
Create .env file in project root:
LIGHTNING_API_KEY=your_api_key_here LIGHTNING_TEAMSPACE=username/project MODEL_NAME=lightning-ai/gpt-oss-120b -
Install dependencies:
pip install requests pillow pyautogui python-dotenv pip install playwright speech_recognition # optional pip install beautifulsoup4 # optional but recommended
-
Run application:
python main.py
✅ Core Functionality
✓ Autonomous task execution
✓ Multi-step planning
✓ Screenshot-based visual feedback
✓ Self-healing through verification
✓ Error handling and recovery
✅ Security Features
✓ User confirmation gates (shell, Python)
✓ 40-step safety threshold
✓ Stop request handling
✓ Timeout management
✓ Vision-based verification
✅ API Integration
✓ Lightning AI client
✓ Ollama client (fallback)
✓ Environment variable configuration
✓ Streaming response handling
✅ Desktop Automation
✓ Keyboard control
✓ Screenshot capture
✓ Application launching
✓ URL navigation
✓ File operations
✓ Shell commands
✓ Python execution
✅ User Interface
✓ Tkinter GUI widget
✓ Real-time logging
✓ Voice input (optional)
✓ Start/Stop controls
✓ Persistent history
✅ Advanced Features
✓ Reflex command mode
✓ Screen-context mode
✓ Multi-step planning mode
✓ Email composition mode
✓ Window switching mode
✓ Base64 screenshot encoding
✓ Focus management
✅ Production Ready
✓ Error logging
✓ Execution history
✓ Timeout handling
✓ Cross-platform support
✓ Graceful degradation for optional features
✓ Thread-safe GUI operations
- Run main.py directly
- GUI window visible
- Full logging enabled
- Interactive debugging
- Run with headless mode
- GUI root = None
- Batch task mode
- Auto-continues past 40-step check
- Create task queue
- Pass multiple tasks
- Logs to persistent storage
- Email/webhook on completion
- Include all dependencies
- Volume mount for logs
- API key via environment
- xvfb or virtual display for screenshots
- Multi-Agent Coordination
- Advanced Vision (OCR, object detection)
- Performance Optimization
- Enhanced Logging
- Model Switching
- Tool Extension (database, API calls)
- Learning & Adaptation
- Multi-Modal Input
- Mobile Support
- Natural Language Enhancement
- Advanced Safety (Explainability, rollback)
- Enterprise Features (RBAC, audit compliance)
Project: Opus Desktop AI Agent
Status: MVP - Production Ready
Language: Python 3.8+
Main Dependencies: requests, pillow, pyautogui, python-dotenv
Optional: playwright, speech_recognition, beautifulsoup4
Core Architecture: Modular, Event-Driven, Thread-Based
UI Framework: Tkinter (cross-platform)
API Support: Lightning AI + Ollama (fallback)
Platform: Windows (primary), macOS/Linux (supported)
Total Lines of Code: ~2500
Modules: 8 core + tools
Config Files: .env
Logging: Persistent (./logs/)
History: Full execution traces with timestamps
This project demonstrates a complete autonomous desktop agent implementation suitable for:
- Automated testing scenarios
- Data entry and processing
- Research and information gathering
- Browser automation tasks
- System administration tasks
- Integration testing
- Workflow automation
Rohit Yadav
NIT Jalandhar
(As specified by author)
This MVP represents a complete, modular, and extensible architecture ready for production deployment or further enhancement based on specific use case requirements.
Key Achievements:
- 13 distinct action types
- 5 command interpretation modes
- 40-step safety threshold
- Vision-based self-healing
- Multi-threaded GUI with real-time logging
- Cross-platform desktop control
- Dual LLM API support
- Comprehensive error recovery
Status: ✅ Production-Ready MVP
Last Updated: 2026