✨OpenPhone✨: Mobile Agentic Foundation Models for AI Phone


🦾 New Release: PhoneClaw — Your Autonomous AI Butler for iPhone

PhoneClaw is a tireless AI phone butler that handles any iOS task for you — and gets smarter with every session. Powered by the Ralph Loop (EXECUTE → EVALUATE → FIX → REPEAT), it breaks your request into subtasks, acts on your phone, checks whether each step succeeded, and automatically retries with the failure context — until the job is done.

  • 🧠 UserMemory — builds a persistent profile of who you are (name, city, habits, history) and injects it into every plan, so the butler truly knows its owner
  • 📚 ExperienceLog — records app-specific navigation know-how (tap coords, failure patterns, timing) across sessions, auto-compacted into a lean, high-confidence knowledge base
  • Memory-first answers — repeated questions are answered instantly from the profile with zero device interactions
  • 🤖 Interactive daemon mode — connect once, accept unlimited tasks back-to-back; the screen stays on automatically
  • 🎓 Learning mode — just operate your phone as usual while PhoneClaw watches; it captures screenshots at ~8 fps, detects your taps via computer vision, and distils your actions into reusable navigation lessons that are added to the ExperienceLog immediately

📖 PhoneClaw Full Documentation →    iOS Agent README →
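The Ralph Loop described above (EXECUTE → EVALUATE → FIX → REPEAT) can be sketched as a small retry loop over subtasks. This is an illustrative simplification, not PhoneClaw's actual API — the `execute`, `evaluate`, and `plan_fix` callables are hypothetical names standing in for the real components:

```python
def ralph_loop(subtasks, execute, evaluate, plan_fix, max_retries=3):
    """Run each subtask until its success criteria pass, retrying with
    failure context. Hypothetical sketch of the Ralph Loop, not the
    actual PhoneClaw implementation."""
    for task in subtasks:
        context = None
        for attempt in range(max_retries):
            result = execute(task, context)        # EXECUTE: act on the phone
            ok, failure = evaluate(task, result)   # EVALUATE: check success criteria
            if ok:
                break
            context = plan_fix(task, failure)      # FIX: retry with failure context
        else:
            raise RuntimeError(f"subtask failed after {max_retries} attempts: {task}")
```

The key property is that a failed step feeds its failure context back into the next attempt, rather than blindly repeating the same action.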

🎯 What is OpenPhone?

The Problem: Most AI agents rely on expensive cloud APIs and large models that are impractical for real-world on-device deployment. Users face Privacy Concerns, Latency Issues, and High Costs when their phone needs to call external services for every interaction.

Our Solution: OpenPhone introduces the first Open-Source, 3B-parameter Agentic Foundation Model designed specifically for on-device smartphone interaction. This compact vision-language model runs entirely locally — meaning No Privacy Concerns, No Cloud Dependence, and Zero API Costs.

🤔 Why 3B Parameters?

We believe the future of mobile AI lies not only in making models larger, but in making them smarter and more efficient for real-world constraints. Our 3B model is:

  • Edge-Optimized: Efficient enough for commodity GPUs and next-generation mobile NPUs.
  • 🔒 Privacy-First: All computation stays on your device.
  • 💰 Cost-Free: No cloud inference and no ongoing API fees.
  • 🎯 High-Performance: Achieves performance comparable to 7B–9B models through advanced training.

💡 Research Highlights

🔍 OpenPhone‑3B: Lightweight Agentic Model

Considering the compute limitations of today’s edge devices, models with ≤3B parameters strike a practical balance between capability and deployability. Based on this insight, we introduce OpenPhone‑3B, a lightweight yet powerful on‑device agent model.

  • Model Size & Architecture: Vision-language model engineered for efficient on-device reasoning under tight mobile compute constraints.
  • Edge-Native Design: Primary local agent compatible with consumer GPUs and mobile NPUs, eliminating continuous cloud dependency.
  • GUI‑Aware Action Capabilities: Trained for visual interpretation, instruction following, and structured action generation across real mobile tasks.
  • Open‑Source Release: Full model weights, configurations, and inference stack enabling community deployment and development.
  • Practical Sweet Spot: 3B scale delivers optimal balance—significantly stronger than tiny models while remaining deployable where larger models fail.

Why 3B is the Sweet Spot for Phone Agents

  • Hardware Fit: 3B parameters align perfectly with consumer GPU memory (8-12GB) and emerging mobile NPU computational budgets.
  • Speed Advantage: 3B models deliver 3-5x faster inference than 7B alternatives while maintaining competitive accuracy for sub-second GUI responses.
  • Power Efficiency: A smaller footprint extends battery life, which is essential for mobile deployment where power consumption affects user experience.
  • Privacy-First: Enables phone tasks to run entirely on-device, preserving user privacy while eliminating network dependencies.
  • Cost Savings: Local processing eliminates expensive cloud APIs and per-request charges for sustainable operation.

🦾 PhoneClaw: Your Autonomous AI Butler for iPhone

An autonomous iOS phone butler built on the Ralph Loop — a closed-loop execution methodology that runs until every subtask passes its success criteria. The key differentiator is a two-layer self-learning memory that makes the butler measurably smarter after each session:

  • UserMemory — Maintains a persistent user profile (inferred name, city, app habits, task history) injected into every planning prompt, so the agent makes contextually intelligent decisions from the very first step. Repeated questions are answered directly from memory with zero device interactions.
  • ExperienceLog — Records app-specific navigation knowledge per session: successful tap coordinates, failure patterns, UI timing quirks. Lessons are semantically deduplicated, reinforced on confirmation, and automatically compacted when an app accumulates ≥ 20 entries — keeping the knowledge base lean and high-quality.
  • Intelligent Planning: VLM decomposes each task into subtasks with explicit success criteria, enabling precise per-step evaluation and targeted retries rather than blind repetition.
  • Interactive Daemon Mode: Connect once, accept unlimited tasks indefinitely — the device screen stays on automatically throughout the session.
  • Learning Mode: Just use your phone normally while PhoneClaw watches. It captures screenshots at ~8 fps, detects tap positions via computer vision (HoughCircles + pixel-diff fallback), annotates each frame, and distils your actions into reusable navigation lessons added directly to the ExperienceLog — no manual annotation required.
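The pixel-diff fallback used by Learning Mode for tap detection can be sketched as a frame-difference centroid. This is a hedged illustration of the idea only — the real pipeline also uses `cv2.HoughCircles` on the touch indicator, and the threshold here is an assumed value:

```python
import numpy as np

def detect_tap(prev_frame, curr_frame, threshold=25):
    """Return (x, y) of the centroid of changed pixels between two
    consecutive grayscale frames (H x W uint8 arrays), or None if
    nothing changed enough. Illustrative pixel-diff fallback sketch."""
    diff = np.abs(prev_frame.astype(int) - curr_frame.astype(int))
    ys, xs = np.nonzero(diff > threshold)
    if xs.size == 0:
        return None  # no visible change between frames
    return int(xs.mean()), int(ys.mean())
```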

Full PhoneClaw documentation
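The ExperienceLog behavior described above — reinforce confirmed lessons, compact once an app accumulates ≥ 20 entries — can be sketched as follows. This is an assumption-laden toy: the class name, exact-match dedup (the real system deduplicates semantically), and the kept-entry count are illustrative:

```python
from collections import defaultdict

class ExperienceLog:
    """Toy sketch of a per-app lesson store with reinforcement and
    auto-compaction; not the actual PhoneClaw implementation."""
    COMPACT_AT = 20  # compact once an app accumulates this many lessons

    def __init__(self):
        self.lessons = defaultdict(dict)  # app -> {lesson_text: confidence}

    def add(self, app, lesson):
        entries = self.lessons[app]
        # exact-match dedup here; the real system dedups semantically
        entries[lesson] = entries.get(lesson, 0) + 1  # reinforce on confirmation
        if len(entries) >= self.COMPACT_AT:
            self._compact(app)

    def _compact(self, app, keep=10):
        # keep only the highest-confidence lessons
        top = sorted(self.lessons[app].items(), key=lambda kv: -kv[1])[:keep]
        self.lessons[app] = dict(top)
```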


🚀 Model Release & Resources

📦 Ready-to-Deploy Model

  • Model Weights: OpenPhone-3B is available on Hugging Face with full licensing for research and commercial use.
  • Production-Ready Serving: Pre-configured vLLM inference scripts enable efficient deployment with optimized throughput and memory usage.

🛠️ Complete Training Pipeline

  • Reproducible Recipe: Full training implementation including our novel two-stage approach (SFT + GRPO-style RL with synthetic GUI data).
  • Customization Support: Detailed documentation in model_training/ allows researchers to adapt the model for domain-specific phone tasks or extend it to new mobile platforms.
  • Data Generation Paradigm: Scripts and methodologies for creating high-quality training data at scale.

🚀 Quick Start

This project comprises three core components designed for comprehensive mobile agent development and evaluation:

  • ⚡ For model training, please refer to the training guide README for comprehensive setup and execution instructions.
  • 🔧 For the data generation pipeline, please refer to the data preparation guide README for detailed implementation steps.

Below, we focus on evaluation using the AndroidLab benchmark framework.

📱 AndroidLab Benchmark Setup

Installation: Follow the official AndroidLab documentation for complete setup instructions.

Environment Configuration:

  • Recommended Mode: AVD on Mac (arm64), validated in our experiments.
  • App Setup: Manual installation and task-specific configuration required.
  • Compatibility Note: Original Docker images are not compatible with AVD environments.

🚀 Model Deployment & Inference

vLLM Integration:

  • Inference scripts available in ./vllm_script/ directory
  • Optimized for efficient small model serving

Model Access:

  • OpenPhone Weights: 3B parameter model hosted on HuggingFace
  • Deployment Process: Download weights → Deploy via vLLM → Configure inference service
  • Service Ready: Seamless integration with evaluation pipeline
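The deployment flow above (download weights → deploy via vLLM → configure inference service) typically looks like the following. This is a hedged sketch: substitute the actual Hugging Face repo id for the OpenPhone-3B weights, and adjust the context length and port to your setup:

```shell
# Install vLLM (provides the OpenAI-compatible `vllm serve` entrypoint)
pip install vllm

# Serve the model locally; <hf-repo-id> is a placeholder for the
# actual OpenPhone-3B repository id on Hugging Face
vllm serve <hf-repo-id> --max-model-len 8192 --port 8000
```

Once running, the evaluation pipeline can point its inference URL at `http://localhost:8000`.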

⚙️ Pre-Testing Configuration

  • API Setup Required: Configure cloud model credentials in ./evaluation/evaluation.py: Line 63, Line 75, Line 81
  • Coming Soon: Streamlined configuration interface in development

🌟 Key Features of OpenPhone

🤖 Lightweight Agentic Foundation Models

Compact Architecture: Specialized 3B-scale Vision-Language Models optimized for mobile GUI tasks with minimal computational footprint.
On-Device Deployment: True smartphone-compatible models that maintain competitive performance while running locally without cloud dependency.

☁️ Device-Cloud Collaboration Framework

Dynamic Orchestration: Real-time task complexity assessment that intelligently switches between device and cloud models based on execution requirements.
Cost-Performance Optimization: Strategic resource allocation that leverages cost-efficient on-device models while compensating for their limitations through selective cloud model usage.

🎯 Comprehensive Mobile Agent Evaluation Playground

Extended Benchmark Suite: Beyond AndroidLab, incorporating 25+ additional tasks across popular mobile applications for real-world validation.
Multi-Dimensional Assessment: Comprehensive evaluation covering performance metrics, computational efficiency, and practical deployment scenarios.


🌟 Technical Innovation & Implementation

🧠 Model Training: SFT+RL

Synthetic Data Generation: Leverages advanced MLLMs to create high-quality reasoning chain training data, addressing the scarcity of manual annotations.
Two-Stage Training: SFT injects GUI foundational knowledge, while GRPO reinforcement learning optimizes task completion accuracy.
Small Model Enhancement: Enables 3B models to achieve performance comparable to 7B-9B models on GUI tasks through structured training.
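The GRPO stage's core idea — scoring each rollout against its own group rather than a learned value function — can be shown with the group-relative advantage computation. This is a minimal sketch of that one step for intuition, not the full training objective:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std,
    so policy updates favor responses that beat their group average.
    Simplified sketch of the GRPO-style advantage, not the full objective."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of task-completion rewards [1, 0, 1, 0] yields advantages near [+1, -1, +1, -1], pushing probability mass toward the successful trajectories.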

☁️ Device-Cloud Collaboration Framework

Dynamic Task Assessment: Real-time complexity evaluation determines when and how frequently to monitor device model performance.
Intelligent Orchestration: Seamlessly switches between device and cloud models based on execution progress and failure patterns.
Cost-Performance Optimization: Reduces cloud invocations by ~10% while maintaining high task success rates through strategic resource allocation.
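The orchestration policy above can be sketched as a simple per-step router: prefer the on-device model, fall back to the cloud on failure, and escalate permanently after repeated device failures. The routing rule and all names here are assumptions for illustration, not OpenPhone's actual logic:

```python
def run_step(task, device_model, cloud_model, failures, escalate_after=2):
    """Route one execution step between device and cloud models.
    Illustrative sketch: the real framework assesses task complexity
    and monitors execution progress, not just a failure counter."""
    if failures.get(task, 0) >= escalate_after:
        return "cloud", cloud_model(task)       # escalated: skip the device model
    result = device_model(task)
    if result is None:                          # device model could not complete the step
        failures[task] = failures.get(task, 0) + 1
        return "cloud", cloud_model(task)       # fall back to the cloud model
    return "device", result
```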

💾 Efficient Memory Mechanism for Mobile Agents

Long-Horizon Reasoning: Multi-step chain-of-thought reasoning with reflective error correction to enhance decision-making capabilities.
Text-Based Summarization: Compresses high-resolution screenshots into compact textual representations for efficient memory management.
Structured Context Retention: Maintains 10-20 steps of historical context in resource-constrained environments through optimized token usage.
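The memory mechanism above — compact textual summaries in place of raw screenshots, with a bounded step history — can be sketched with a rolling window. The class and the string format are hypothetical; in the real system the screen summary comes from the model's screenshot-to-text compression:

```python
from collections import deque

class StepMemory:
    """Keep a rolling window of compact textual step summaries
    (the README cites 10-20 retained steps). Illustrative sketch only."""
    def __init__(self, max_steps=20):
        self.history = deque(maxlen=max_steps)  # old steps drop off automatically

    def record(self, step_idx, action, screen_summary):
        # store a short textual summary instead of the raw screenshot
        self.history.append(f"step {step_idx}: {action} | screen: {screen_summary}")

    def context(self):
        # joined history injected into the next planning prompt
        return "\n".join(self.history)
```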



🧪 Testing & Evaluation

Single Task Testing

Test individual tasks using the following command structure:

python eval.py -n <test_name> -c <path/to/config.yaml> --task_id <task_id>

Example Usage:

python eval.py -n all_cloud_v1_hyper -c ./configs/example_xml_cloud_hyper.yaml --task_id zoom_1

Batch Evaluation Scripts

Convenient batch testing scripts are available in ./test_script:

all_test_cloud_v1_hyper.sh: Evaluates all 138 AndroidLab benchmark tasks
all_test_cloud_v1_hyper_add.sh: Evaluates tasks for four additional mobile apps

Additional App Documentation

For comprehensive details about the four additional app tasks, refer to the documentation: Additional Apps Documentation


📊 Result Generation

LLM Evaluator Setup

Required Configuration: Set up LLM service credentials in ./evaluation/tasks/llm_evaluator.py:

• Line 10: API configuration
• Line 12: Service URL

💡 Enhancement: Our implementation replaces AndroidLab's rule-based evaluation with LLM-powered assessment, providing more nuanced and accurate task completion evaluation.

Generate Evaluation Results

Execute result generation with the following command:

python generate_result.py --input_folder ./logs/evaluation/ --output_folder ./logs/evaluation/ --output_excel ./logs/evaluation/test_name.xlsx

Batch Testing File Management

⚠️ Important: When using batch scripts from ./test_script/:
• Manual Transfer Required: Move generated evaluation files from script directory to ./logs/
• Then Execute: Run the result generation command above
• Error Prevention: This step prevents file path conflicts and ensures proper result compilation


📊 Key Evaluation Findings for OpenPhone

🏆 Small Model, Big Performance

  • Size vs Performance: OpenPhone-3B achieves performance comparable to 9B models while maintaining the deployment advantages of a compact architecture.
  • Efficiency Champion: Establishes itself as a genuine "small powerhouse" that challenges the bigger-is-better assumption in mobile AI.

🥊 Competitive Performance

  • Against Proprietary Models: OpenPhone-3B shows respectable performance compared to lightweight versions of proprietary models when evaluated on standard benchmarks.
  • Potential of Small Models: Demonstrates promising results that validate the viability of compact open-source approaches in mobile agent development.

🔄 Device-Cloud Framework Works

  • Performance with Efficiency: OpenPhone's hybrid architecture delivers near-optimal performance while dramatically reducing cloud model usage.
  • Intelligent Routing: Proves that smart task routing creates practical efficiency gains without sacrificing capability.

🧠 Longer Prompts Don't Always Help

  • Context Matters: Extended prompting strategies only improve performance when paired with sufficiently capable cloud models.
  • Smart Matching: Highlights the importance of matching reasoning complexity to model capability rather than assuming longer prompts always help.

📈 Device-Cloud Distribution Analysis for Phone Agents

To evaluate the practical efficiency of our hybrid approach, we measured key metrics across different MLLMs: average total steps per task, the proportion of steps handled by on-device versus cloud models, and cloud call reduction compared to cloud-only baselines.

📊 Workload Distribution

Cloud models still handle approximately 65% of execution steps, reflecting the computational limitations of smaller on-device models for complex reasoning tasks.

💰 Efficiency Gains

Introducing on-device processing achieves roughly 10% reduction in cloud API calls, translating to direct cost savings and reduced latency.

🎯 Model Capability Impact

Advanced cloud models like GLM-4.5V show smaller reductions in cloud dependency, as their superior capabilities enable more independent task completion without requiring on-device assistance.

⚡ Inference Speed Comparison

We evaluated average inference time per step using vLLM across different GPU configurations to assess real-world deployment feasibility. Note that GLM-4.1V-9B-Thinking could not operate on a single 3090 GPU due to context length constraints.

| Model | GPUs | Size | SR | Time Cost / Step |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | Single 3090 | 7B | 10.1 | 6289.15 ms |
| OpenPhone | Single 3090 | 3B | 15.2 | 4170.63 ms |
| GLM-4.1V-9B-Thinking | Two 3090s | 9B | 24.6 | 14584.89 ms |
| Qwen2.5-VL-7B-Instruct | Two 3090s | 7B | 10.1 | 4587.79 ms |
| OpenPhone | Two 3090s | 3B | 15.2 | 3524.25 ms |

🎯 Speed Advantage

  • Clear Winner: OpenPhone demonstrates significant inference speed advantages thanks to its lightweight 3B architecture
  • Real-World Ready: Speed benefits become increasingly pronounced under constrained computational resources, matching typical edge deployment scenarios

📊 Quantified Comparison

  • 3.5x Faster: OpenPhone on single 3090 vs GLM-4.1V-9B-Thinking on dual 3090s.
  • 4x Faster: OpenPhone on dual 3090s vs GLM-4.1V-9B-Thinking on dual 3090s.
  • Deployment Flexibility: GLM-4.1V-9B-Thinking's inability to run on a single 3090 severely limits its edge deployment options.

💡 Practical Implications

The trade-off is clear: while larger models like GLM-4.1V-9B-Thinking achieve higher task performance, OpenPhone's speed advantages make it far more suitable for real-world on-device scenarios where response time and hardware constraints matter.


🌟 Citation

If you find this work helpful to your research, please consider citing our paper.

@article{jiang2025lightagent,
  title={LightAgent: Mobile Agentic Foundation Models},
  author={Jiang, Yangqin and Huang, Chao},
  journal={arXiv preprint arXiv:2510.22009},
  year={2025}
}

🔗 Related Projects

OpenPhone builds upon excellent open-source projects. We sincerely thank their authors and contributors:

  • AndroidLab - The benchmark framework.
  • R1-V - Implementation details for the GRPO training methodology.
  • LLaMA Factory - The unified training framework enabling efficient model fine-tuning.

📜 License

This project is released under the MIT License.

If this project helps you, please give us a Star🌟

🤖 Empower AI Phone with Agents!


❤️ Thanks for visiting ✨ OpenPhone!
