
Commit fecd130

docs: extract docs from README to docs folder

1 parent dbddb3a commit fecd130

File tree

9 files changed: +1328 −841 lines

README.md

Lines changed: 111 additions & 775 deletions
Large diffs are not rendered by default.

docs/assets/Architecture.svg

Lines changed: 0 additions & 66 deletions
This file was deleted.

docs/chat.md

Lines changed: 401 additions & 0 deletions
Large diffs are not rendered by default.

docs/direct-tool-use.md

Lines changed: 41 additions & 0 deletions
# 🛠️ Direct Tool Use

Under the hood, agents use a set of tools. You can access these tools directly.

## Agent OS

The controller for the operating system.

```python
agent.tools.os.click("left", 2)  # clicking
agent.tools.os.mouse_move(100, 100)  # mouse movement
agent.tools.os.keyboard_tap("v", modifier_keys=["control"])  # paste
# and many more
```
## Web browser

The web browser tool, powered by Python's built-in [webbrowser](https://docs.python.org/3/library/webbrowser.html) module, allows you to directly control web browsers in your environment.

```python
agent.tools.webbrowser.open_new("http://www.google.com")
# also check out open and open_new_tab
```

## Clipboard

The clipboard tool, powered by [pyperclip](https://github.com/asweigart/pyperclip), allows you to interact with the clipboard.

```python
agent.tools.clipboard.copy("...")
result = agent.tools.clipboard.paste()
```

## 🖥️ Multi-Monitor Support

Do you have multiple monitors? Choose which one to automate by setting `display` to `1`, `2`, etc. To find the correct display or monitor, you currently have to play around a bit, setting it to different values; we are going to improve this soon. By default, the agent will use display 1.

```python
with VisionAgent(display=1) as agent:
    agent...
```

docs/extracting-data.md

Lines changed: 198 additions & 0 deletions
# Extracting Data

This guide covers how to extract information from screens using AskUI Vision Agent's `get()` method, including structured data extraction, response schemas, and working with different data sources.

## Table of Contents

- [Overview](#overview)
- [Basic Usage](#basic-usage)
- [Working with Different Data Sources](#working-with-different-data-sources)
- [Extracting data other than strings](#extracting-data-other-than-strings)
  - [Structured data extraction](#structured-data-extraction)
  - [Basic Data Types](#basic-data-types)
  - [Complex Data Structures (nested and recursive)](#complex-data-structures-nested-and-recursive)
- [Under the hood: How we extract data from documents](#under-the-hood-how-we-extract-data-from-documents)
- [Limitations](#limitations)

## Overview

The `get()` method allows you to extract information from the screen. You can use it to:

- Get text or data from the screen
- Check the state of UI elements
- Make decisions based on screen content
- Analyze static images and documents

We currently support the following data sources:

- Images (max. 20MB, .jpg, .png)
- PDFs (max. 20MB, .pdf)
- Excel files (.xlsx, .xls)
- Word documents (.docx, .doc)

## Basic Usage

By default, the `get()` method will take a screenshot of the currently selected display and use the `askui` model to extract the textual information as a `str`.

```python
# Get text from screen
url = agent.get("What is the current url shown in the url bar?")
print(url)  # e.g., "github.com/login"

# Check UI state
is_logged_in = agent.get("Is the user logged in? Answer with 'yes' or 'no'.") == "yes"
if is_logged_in:
    agent.click("Logout")
else:
    agent.click("Login")

# Get specific information
page_title = agent.get("What is the page title?")
button_count = agent.get("How many buttons are visible on this page?")
```
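The `== "yes"` comparison above is brittle if the model varies casing or punctuation. A small normalization helper can make the check more robust; `as_bool` below is our own sketch, not part of askui (the `response_schema=bool` option described later in this guide is the built-in alternative):

```python
def as_bool(answer: str) -> bool:
    """Interpret a free-text yes/no answer from the model (hypothetical helper)."""
    return answer.strip().strip(".!?").lower() in ("yes", "y", "true")

# e.g. with: answer = agent.get("Is the user logged in? Answer with 'yes' or 'no'.")
is_logged_in = as_bool("Yes.")  # True
```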
## Working with Different Data Sources

Instead of taking a screenshot, you can analyze specific images or documents:

```python
from pathlib import Path

from PIL import Image

from askui import VisionAgent

with VisionAgent() as agent:
    # From PIL Image
    image = Image.open("screenshot.png")
    result = agent.get("What's in this image?", source=image)

    # From file path as a string
    result = agent.get("What's in this image?", source="screenshot.png")
    result = agent.get("What is this PDF about?", source="document.pdf")

    # From file path as a Path
    result = agent.get("What is this PDF about?", source=Path("document.pdf"))
    result = agent.get("What is in this Excel sheet?", source=Path("table.xlsx"))

    # From a data URL
    result = agent.get("What's in this image?", source="data:image/png;base64,...")
    result = agent.get("What is this PDF about?", source="data:application/pdf;base64,...")
```
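Data URLs like the ones above can be built from raw bytes with the standard library. `to_data_url` below is our own helper sketch, not an askui API:

```python
import base64

def to_data_url(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data URL usable as a `source` value (hypothetical helper)."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"

# e.g. to_data_url(Path("screenshot.png").read_bytes(), "image/png")
url = to_data_url(b"\x89PNG...", "image/png")
```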
## Extracting data other than strings

### Structured data extraction

For structured data extraction, use Pydantic models extending `ResponseSchemaBase`:

```python
import json

from askui import ResponseSchemaBase, VisionAgent

class UserInfo(ResponseSchemaBase):
    username: str
    is_online: bool

class UrlResponse(ResponseSchemaBase):
    url: str

with VisionAgent() as agent:
    # Get structured data
    user_info = agent.get(
        "What is the username and online status?",
        response_schema=UserInfo,
    )
    print(f"User {user_info.username} is {'online' if user_info.is_online else 'offline'}")

    # Get URL as string
    url = agent.get("What is the current url shown in the url bar?")
    print(url)  # e.g., "github.com/login"

    # Get URL as Pydantic model from image at (relative) path
    response = agent.get(
        "What is the current url shown in the url bar?",
        response_schema=UrlResponse,
        source="screenshot.png",
    )

    # Dump the whole model
    print(response.model_dump_json(indent=2))
    # or
    response_json_dict = response.model_dump(mode="json")
    print(json.dumps(response_json_dict, indent=2))
    # or work with a regular dict
    response_dict = response.model_dump()
    print(response_dict["url"])
```
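Conceptually, the model produces JSON that is validated against your schema and returned as a typed object. Here is a rough stdlib-only sketch of that round trip, using a plain dataclass in place of the pydantic-based `ResponseSchemaBase` (this is an illustration of the idea, not the actual askui implementation):

```python
import json
from dataclasses import dataclass

@dataclass
class UrlResponse:
    url: str

def validate_url_response(raw: str) -> UrlResponse:
    """Parse the model's JSON output and check it against the schema."""
    data = json.loads(raw)
    if not isinstance(data.get("url"), str):
        raise ValueError("response does not match schema: missing string field 'url'")
    return UrlResponse(url=data["url"])

response = validate_url_response('{"url": "github.com/login"}')
```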
### Basic Data Types

```python
# Get boolean response
is_login_page = agent.get(
    "Is this a login page?",
    response_schema=bool,
)
print(is_login_page)

# Get integer response
input_count = agent.get(
    "How many input fields are visible on this page?",
    response_schema=int,
)
print(input_count)

# Get float response
design_rating = agent.get(
    "Rate the page design quality from 0 to 1",
    response_schema=float,
)
print(design_rating)
```
### Complex Data Structures (nested and recursive)

```python
class NestedResponse(ResponseSchemaBase):
    nested: UrlResponse

class LinkedListNode(ResponseSchemaBase):
    value: str
    next: "LinkedListNode | None"

# Get nested response
nested = agent.get(
    "Extract the URL and its metadata from the page",
    response_schema=NestedResponse,
)
print(nested.nested.url)

# Get recursive response
linked_list = agent.get(
    "Extract the breadcrumb navigation as a linked list",
    response_schema=LinkedListNode,
)
current = linked_list
while current:
    print(current.value)
    current = current.next
```
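To see why a recursive schema like `LinkedListNode` works, it helps to picture the JSON the model returns and how it maps onto nested objects. A stdlib-only sketch with a dataclass stand-in (not askui's pydantic machinery):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkedListNode:
    value: str
    next: "Optional[LinkedListNode]"

def parse_node(data: Optional[dict]) -> Optional[LinkedListNode]:
    # Recurse over nested "next" objects; a JSON null terminates the list
    if data is None:
        return None
    return LinkedListNode(value=data["value"], next=parse_node(data.get("next")))

raw = '{"value": "Home", "next": {"value": "Docs", "next": null}}'
head = parse_node(json.loads(raw))
```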
## Under the hood: How we extract data from documents

When extracting data from documents such as Word or Excel files, we use the `markitdown` library to convert them into markdown format. We chose `markitdown` over other tools for several reasons:

- **LLM-Friendly Output:** The markdown output is optimized for token usage, which is efficient for subsequent processing with large language models.
- **Includes Sheet Names:** When converting Excel files, the name of each sheet is included in the generated markdown, providing better context.
- **Enhanced Image Descriptions:** It can use an OpenAI client (`llm_client` and `llm_model`) to generate more descriptive captions for images within documents.
- **No Local Inference:** No model inference is performed on the client machine, which means there is no need to install and maintain heavy packages like `torch`.
- **Optional Dependencies:** It allows for optional imports, meaning you only need to install the dependencies for the file types you are working with. This reduces the number of packages to manage.
- **Microsoft Maintained:** Being maintained by Microsoft, it offers robust support for converting Office documents.

## Limitations

- Support for response schemas varies among models. Currently, the `askui` model provides the best support for response schemas, as we try different models under the hood with your schema to see which one works best.
- PDF processing is only supported for Gemini models hosted on AskUI and for PDFs up to 20MB.
- Complex nested schemas may not work with all models.
- Some models may have token limits that affect extraction capabilities.

docs/mcp.md

Lines changed: 8 additions & 0 deletions
# MCP

## Table of Contents

- [What is MCP?](#what-is-mcp)
- [Our MCP Support](#our-mcp-support)
- [How to Use MCP with AskUI](#how-to-use-mcp-with-askui)
  - [With the Library](#with-the-library)
  - [With Chat](#with-chat)

## What is MCP?

The Model Context Protocol (MCP) is a standardized way to provide context and tools to Large Language Models (LLMs). It acts as a universal interface - often described as "the USB-C port for AI" - that allows LLMs to connect to external resources and functionality in a secure, standardized manner.

docs/observability.md

Lines changed: 59 additions & 0 deletions
# Observability

## 📜 Logging

You want a better understanding of what your agent is doing? Set the `log_level` to `logging.DEBUG`.

```python
import logging

from askui import VisionAgent

with VisionAgent(log_level=logging.DEBUG) as agent:
    agent...
```
## 📜 Reporting

You want to see a report of the actions your agent took? Register a reporter using the `reporters` parameter.

```python
from askui import VisionAgent
from askui.reporting import SimpleHtmlReporter

with VisionAgent(reporters=[SimpleHtmlReporter()]) as agent:
    agent...
```
You can also create your own reporter by implementing the `Reporter` interface.

```python
from typing import Optional, Union

from PIL import Image
from typing_extensions import override

from askui import VisionAgent
from askui.reporting import Reporter

class CustomReporter(Reporter):
    @override
    def add_message(
        self,
        role: str,
        content: Union[str, dict, list],
        image: Optional[Image.Image | list[Image.Image]] = None,
    ) -> None:
        # add the message to the report (see the implementation of `SimpleHtmlReporter` as an example)
        pass

    @override
    def generate(self) -> None:
        # generate the report if it is not generated live (see the implementation of `SimpleHtmlReporter` as an example)
        pass

with VisionAgent(reporters=[CustomReporter()]) as agent:
    agent...
```
You can also use multiple reporters at once. Their `generate()` and `add_message()` methods will be called in the order of the reporters in the list.

```python
with VisionAgent(reporters=[SimpleHtmlReporter(), CustomReporter()]) as agent:
    agent...
```
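To illustrate the interface, here is a minimal in-memory reporter sketch. `ListReporter` is our own hypothetical example, not part of askui, and is shown without the `Reporter` base class so it runs standalone:

```python
from typing import Optional, Union

class ListReporter:
    """Collects messages in memory; a stand-in sketch for askui's `Reporter` interface."""

    def __init__(self) -> None:
        self.messages: list[tuple[str, Union[str, dict, list]]] = []

    def add_message(
        self,
        role: str,
        content: Union[str, dict, list],
        image: Optional[object] = None,  # PIL images are omitted in this sketch
    ) -> None:
        self.messages.append((role, content))

    def generate(self) -> str:
        # Render the collected messages as a plain-text report
        return "\n".join(f"{role}: {content}" for role, content in self.messages)

reporter = ListReporter()
reporter.add_message("agent", "clicked Login")
reporter.add_message("agent", "typed username")
report = reporter.generate()
```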

docs/telemetry.md

Lines changed: 11 additions & 0 deletions
# Telemetry

By default, we record usage data to detect and fix bugs inside the package and to improve its UX. This includes:

- the version of the `askui` package used
- information about the environment, e.g., operating system, architecture, device id (hashed to protect privacy), Python version
- the session id
- some of the methods called, including (non-sensitive) method parameters and responses, e.g., the click coordinates in `click(x=100, y=100)`
- exceptions (types and messages)
- the AskUI workspace and user id, if `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` are set

If you would like to disable the recording of usage data, set the `ASKUI__VA__TELEMETRY__ENABLED` environment variable to `False`.
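For example, in a POSIX shell you can opt out for the current session (add the line to your shell profile to make it permanent):

```shell
# Disable askui usage-data recording for this shell session
export ASKUI__VA__TELEMETRY__ENABLED=False
```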
