| 1 | +# Extracting Data |
| 2 | + |
| 3 | +This guide covers how to extract information from screens using AskUI Vision Agent's `get()` method, including structured data extraction, response schemas, and working with different data sources. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +- [Overview](#overview) |
| 8 | +- [Basic Usage](#basic-usage) |
| 9 | +- [Working with Different Data Sources](#working-with-different-data-sources) |
| 10 | +- [Extracting data other than strings](#extracting-data-other-than-strings) |
|  | +  - [Structured data extraction](#structured-data-extraction) |
| 11 | +  - [Basic Data Types](#basic-data-types) |
| 12 | +  - [Complex Data Structures (nested and recursive)](#complex-data-structures-nested-and-recursive) |
| 13 | +- [Under the hood: How we extract data from documents](#under-the-hood-how-we-extract-data-from-documents) |
| 14 | +- [Limitations](#limitations) |
| 15 | + |
| 16 | + |
| 17 | +## Overview |
| 18 | + |
| 19 | +The `get()` method allows you to extract information from the screen. You can use it to: |
| 20 | + |
| 21 | +- Get text or data from the screen |
| 22 | +- Check the state of UI elements |
| 23 | +- Make decisions based on screen content |
| 24 | +- Analyze static images and documents |
| 25 | + |
| 26 | +We currently support the following data sources: |
| 27 | +- Images (max. 20MB, .jpg, .png) |
| 28 | +- PDFs (max. 20MB, .pdf) |
| 29 | +- Excel files (.xlsx, .xls) |
| 30 | +- Word documents (.docx, .doc) |
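
The limits above can be checked before a file is handed to `get()`. The following is a minimal sketch, not part of the AskUI API: `check_source`, `MAX_BYTES`, and the two extension sets are hypothetical names, and reading "20MB" as 20 MiB is an assumption.

```python
from pathlib import Path

# Assumption: "20MB" is interpreted as 20 MiB; the exact threshold may differ.
MAX_BYTES = 20 * 1024 * 1024

# Extensions taken from the list above; only images and PDFs state a size limit.
SIZE_LIMITED = {".jpg", ".png", ".pdf"}
SUPPORTED = SIZE_LIMITED | {".xlsx", ".xls", ".docx", ".doc"}


def check_source(path: Path) -> None:
    """Raise ValueError if the file type or size falls outside the listed limits."""
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    if suffix in SIZE_LIMITED and path.stat().st_size > MAX_BYTES:
        raise ValueError(f"{path.name} exceeds the {MAX_BYTES}-byte limit")
```

Calling `check_source(Path("document.pdf"))` first turns an oversized or unsupported file into an early, local error.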
| 31 | + |
| 32 | +## Basic Usage |
| 33 | + |
| 34 | +By default, the `get()` method takes a screenshot of the currently selected display and uses the `askui` model to extract the requested information as a `str`. |
| 35 | + |
| 36 | +```python |
| 37 | +# Get text from screen |
| 38 | +url = agent.get("What is the current url shown in the url bar?") |
| 39 | +print(url) # e.g., "github.com/login" |
| 40 | + |
| 41 | +# Check UI state |
| 42 | +is_logged_in = agent.get("Is the user logged in? Answer with 'yes' or 'no'.") == "yes" |
| 43 | +if is_logged_in: |
| 44 | +    agent.click("Logout") |
| 45 | +else: |
| 46 | +    agent.click("Login") |
| 47 | + |
| 48 | +# Get specific information |
| 49 | +page_title = agent.get("What is the page title?") |
| 50 | +button_count = agent.get("How many buttons are visible on this page?") |
| 51 | +``` |
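
Comparing a free-text answer to `"yes"` is brittle, because the model may answer `"Yes."` or wrap the word in quotes. The sketch below normalizes such answers. `parse_yes_no` is a hypothetical helper, not part of AskUI, and the set of variants it tolerates is an assumption.

```python
def parse_yes_no(answer: str) -> bool:
    """Map a free-text yes/no answer to a bool, raising on anything else."""
    # Drop surrounding whitespace and quotes, then trailing punctuation, then case.
    normalized = answer.strip().strip("'\"").rstrip(".!").lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    raise ValueError(f"Expected 'yes' or 'no', got: {answer!r}")
```

For example, `parse_yes_no(agent.get("Is the user logged in? Answer with 'yes' or 'no'."))` fails loudly on an unexpected answer instead of silently treating it as "no". Passing `response_schema=bool`, as shown under Basic Data Types, avoids string parsing altogether.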
| 52 | + |
| 53 | +## Working with Different Data Sources |
| 54 | + |
| 55 | +Instead of taking a screenshot, you can analyze specific images or documents: |
| 56 | + |
| 57 | +```python |
| 58 | +from PIL import Image |
| 59 | +from askui import VisionAgent |
| 60 | +from pathlib import Path |
| 61 | + |
| 62 | +with VisionAgent() as agent: |
| 63 | +    # From PIL Image |
| 64 | +    image = Image.open("screenshot.png") |
| 65 | +    result = agent.get("What's in this image?", source=image) |
| 66 | + |
| 67 | +    # From file path |
| 68 | + |
| 69 | +    ## as a string |
| 70 | +    result = agent.get("What's in this image?", source="screenshot.png") |
| 71 | +    result = agent.get("What is this PDF about?", source="document.pdf") |
| 72 | + |
| 73 | +    ## as a Path |
| 74 | +    result = agent.get("What is this PDF about?", source=Path("document.pdf")) |
| 75 | +    result = agent.get("What is this spreadsheet about?", source=Path("table.xlsx")) |
| 76 | + |
| 77 | +    # From a data URL |
| 78 | +    result = agent.get("What's in this image?", source="data:image/png;base64,...") |
| 79 | +    result = agent.get("What is this PDF about?", source="data:application/pdf;base64,...") |
| 80 | +``` |
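
The data URLs in the example above are abbreviated. One way to build a complete one from raw bytes is sketched below using only the standard library. `to_data_url` is a hypothetical helper, not part of AskUI.

```python
import base64


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL with the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Usage (assumes a file named screenshot.png exists):
# png_bytes = open("screenshot.png", "rb").read()
# result = agent.get("What's in this image?", source=to_data_url(png_bytes, "image/png"))
```

This is mainly useful when the image or PDF already lives in memory, for example after a download, and writing it to disk first would be wasteful.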
| 81 | + |
| 82 | +## Extracting data other than strings |
| 83 | + |
| 84 | +### Structured data extraction |
| 85 | + |
| 86 | +For structured data extraction, use Pydantic models extending `ResponseSchemaBase`: |
| 87 | + |
| 88 | +```python |
| 89 | +from askui import ResponseSchemaBase, VisionAgent |
| 90 | +from PIL import Image |
| 91 | +import json |
| 92 | + |
| 93 | +class UserInfo(ResponseSchemaBase): |
| 94 | +    username: str |
| 95 | +    is_online: bool |
| 96 | + |
| 97 | +class UrlResponse(ResponseSchemaBase): |
| 98 | +    url: str |
| 99 | + |
| 100 | +with VisionAgent() as agent: |
| 101 | +    # Get structured data |
| 102 | +    user_info = agent.get( |
| 103 | +        "What is the username and online status?", |
| 104 | +        response_schema=UserInfo |
| 105 | +    ) |
| 106 | +    print(f"User {user_info.username} is {'online' if user_info.is_online else 'offline'}") |
| 107 | + |
| 108 | +    # Get URL as string |
| 109 | +    url = agent.get("What is the current url shown in the url bar?") |
| 110 | +    print(url) # e.g., "github.com/login" |
| 111 | + |
| 112 | +    # Get URL as Pydantic model from image at (relative) path |
| 113 | +    response = agent.get( |
| 114 | +        "What is the current url shown in the url bar?", |
| 115 | +        response_schema=UrlResponse, |
| 116 | +        source="screenshot.png", |
| 117 | +    ) |
| 118 | + |
| 119 | +    # Dump whole model |
| 120 | +    print(response.model_dump_json(indent=2)) |
| 121 | +    # or |
| 122 | +    response_json_dict = response.model_dump(mode="json") |
| 123 | +    print(json.dumps(response_json_dict, indent=2)) |
| 124 | +    # or for regular dict |
| 125 | +    response_dict = response.model_dump() |
| 126 | +    print(response_dict["url"]) |
| 127 | +``` |
| 128 | + |
| 129 | +### Basic Data Types |
| 130 | + |
| 131 | +```python |
| 132 | +# Get boolean response |
| 133 | +is_login_page = agent.get( |
| 134 | +    "Is this a login page?", |
| 135 | +    response_schema=bool, |
| 136 | +) |
| 137 | +print(is_login_page) |
| 138 | + |
| 139 | +# Get integer response |
| 140 | +input_count = agent.get( |
| 141 | +    "How many input fields are visible on this page?", |
| 142 | +    response_schema=int, |
| 143 | +) |
| 144 | +print(input_count) |
| 145 | + |
| 146 | +# Get float response |
| 147 | +design_rating = agent.get( |
| 148 | +    "Rate the page design quality from 0 to 1", |
| 149 | +    response_schema=float, |
| 150 | +) |
| 151 | +print(design_rating) |
| 152 | +``` |
| 153 | + |
| 154 | +### Complex Data Structures (nested and recursive) |
| 155 | + |
| 156 | +```python |
| 157 | +class NestedResponse(ResponseSchemaBase): |
| 158 | +    nested: UrlResponse |
| 159 | + |
| 160 | +class LinkedListNode(ResponseSchemaBase): |
| 161 | +    value: str |
| 162 | +    next: "LinkedListNode | None" |
| 163 | + |
| 164 | +# Get nested response |
| 165 | +nested = agent.get( |
| 166 | +    "Extract the URL and its metadata from the page", |
| 167 | +    response_schema=NestedResponse, |
| 168 | +) |
| 169 | +print(nested.nested.url) |
| 170 | + |
| 171 | +# Get recursive response |
| 172 | +linked_list = agent.get( |
| 173 | +    "Extract the breadcrumb navigation as a linked list", |
| 174 | +    response_schema=LinkedListNode, |
| 175 | +) |
| 176 | +current = linked_list |
| 177 | +while current: |
| 178 | +    print(current.value) |
| 179 | +    current = current.next |
| 180 | +``` |
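
A recursive result like the linked list above is often easier to work with once flattened. The sketch below shows one way to do that. To stay runnable without `askui`, it uses a plain dataclass `Node` as a hypothetical stand-in for `LinkedListNode`; `to_list` is likewise an illustrative helper and only relies on the `value` and `next` attributes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for LinkedListNode (same two fields)."""
    value: str
    next: Optional["Node"] = None


def to_list(head: Optional[Node]) -> List[str]:
    """Walk the chain from head to tail and collect each value in order."""
    values = []
    current = head
    while current is not None:
        values.append(current.value)
        current = current.next
    return values


breadcrumbs = Node("Home", Node("Docs", Node("Extracting Data")))
print(" > ".join(to_list(breadcrumbs)))
```

Because `to_list` reads only the `value` and `next` attributes, it should work unchanged on any object with those two fields.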
| 181 | + |
| 182 | +## Under the hood: How we extract data from documents |
| 183 | + |
| 184 | +When extracting data from documents such as Word or Excel files, we use the `markitdown` library to convert them to Markdown. We chose `markitdown` over other tools for several reasons: |
| 185 | + |
| 186 | +- **LLM-Friendly Output:** The Markdown output is compact, which keeps token usage low when the content is passed on to a large language model. |
| 187 | +- **Includes Sheet Names:** When converting Excel files, the name of the sheet is included in the generated markdown, providing better context. |
| 188 | +- **Enhanced Image Descriptions:** It can use an OpenAI client (`llm_client` and `llm_model`) to generate more descriptive captions for images within documents. |
| 189 | +- **No Local Inference:** No model inference is performed on the client machine, which means no need to install and maintain heavy packages like `torch`. |
| 190 | +- **Optional Dependencies:** It allows for optional imports, meaning you only need to install the dependencies for the file types you are working with. This reduces the number of packages to manage. |
| 191 | +- **Microsoft Maintained:** Being maintained by Microsoft, it offers robust support for converting Office documents. |
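
To make the first two points concrete, the sketch below renders a sheet as a Markdown table under a heading that carries the sheet name. It only illustrates the general shape of such output. It is not `markitdown`'s implementation, its real output may differ in detail, and `sheet_to_markdown` is a hypothetical helper.

```python
from typing import List


def sheet_to_markdown(sheet_name: str, rows: List[List[str]]) -> str:
    """Render rows (first row = header) as a Markdown table titled with the sheet name."""
    if not rows:
        raise ValueError("rows must contain at least a header row")
    header, *body = rows
    lines = [f"## {sheet_name}", ""]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


print(sheet_to_markdown("Q1", [["Item", "Price"], ["Pen", "1.50"]]))
```

A table in this form costs only a few tokens per cell, and the heading tells the model which sheet the values came from.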
| 192 | + |
| 193 | +## Limitations |
| 194 | + |
| 195 | +- Support for response schemas varies among models. Currently, the `askui` model provides the best support for response schemas, because we try different models under the hood with your schema to see which one works best. |
| 196 | +- PDF processing is only supported for Gemini models hosted on AskUI and for PDFs up to 20MB. |
| 197 | +- Complex nested schemas may not work with all models. |
| 198 | +- Some models may have token limits that affect extraction capabilities. |