
Commit fecd130

docs: extract docs from README to docs folder

1 parent dbddb3a commit fecd130

File tree

9 files changed: +1328 −841 lines

README.md

Lines changed: 111 additions & 775 deletions
Large diffs are not rendered by default.

docs/assets/Architecture.svg

Lines changed: 0 additions & 66 deletions
This file was deleted.

docs/chat.md

Lines changed: 401 additions & 0 deletions
Large diffs are not rendered by default.

docs/direct-tool-use.md

Lines changed: 41 additions & 0 deletions
# 🛠️ Direct Tool Use

Under the hood, agents use a set of tools. You can access these tools directly.

## Agent OS

The controller for the operating system.

```python
agent.tools.os.click("left", 2)  # clicking
agent.tools.os.mouse_move(100, 100)  # mouse movement
agent.tools.os.keyboard_tap("v", modifier_keys=["control"])  # paste
# and many more
```
## Web browser

The web browser tool, powered by Python's built-in [webbrowser](https://docs.python.org/3/library/webbrowser.html) module, allows you to directly control web browsers in your environment.

```python
agent.tools.webbrowser.open_new("http://www.google.com")
# also check out open and open_new_tab
```

## Clipboard

The clipboard tool, powered by [pyperclip](https://github.com/asweigart/pyperclip), allows you to interact with the clipboard.

```python
agent.tools.clipboard.copy("...")
result = agent.tools.clipboard.paste()
```

## 🖥️ Multi-Monitor Support

Do you have multiple monitors? Choose which one to automate by setting `display` to `1`, `2`, etc. To find the correct display or monitor, you currently have to play around a bit, setting it to different values; we are going to improve this soon. By default, the agent will use display 1.

```python
with VisionAgent(display=1) as agent:
    agent...
```

docs/extracting-data.md

Lines changed: 198 additions & 0 deletions
# Extracting Data

This guide covers how to extract information from screens using AskUI Vision Agent's `get()` method, including structured data extraction, response schemas, and working with different data sources.

## Table of Contents

- [Overview](#overview)
- [Basic Usage](#basic-usage)
- [Working with Different Data Sources](#working-with-different-data-sources)
- [Extracting data other than strings](#extracting-data-other-than-strings)
  - [Structured data extraction](#structured-data-extraction)
  - [Basic Data Types](#basic-data-types)
  - [Complex Data Structures (nested and recursive)](#complex-data-structures-nested-and-recursive)
- [Under the hood: How we extract data from documents](#under-the-hood-how-we-extract-data-from-documents)
- [Limitations](#limitations)

## Overview

The `get()` method allows you to extract information from the screen. You can use it to:

- Get text or data from the screen
- Check the state of UI elements
- Make decisions based on screen content
- Analyze static images and documents

We currently support the following data sources:

- Images (max. 20MB, .jpg, .png)
- PDFs (max. 20MB, .pdf)
- Excel files (.xlsx, .xls)
- Word documents (.docx, .doc)

## Basic Usage

By default, the `get()` method will take a screenshot of the currently selected display and use the `askui` model to extract the textual information as a `str`.

```python
# Get text from screen
url = agent.get("What is the current url shown in the url bar?")
print(url)  # e.g., "github.com/login"

# Check UI state
is_logged_in = agent.get("Is the user logged in? Answer with 'yes' or 'no'.") == "yes"
if is_logged_in:
    agent.click("Logout")
else:
    agent.click("Login")

# Get specific information
page_title = agent.get("What is the page title?")
button_count = agent.get("How many buttons are visible on this page?")
```
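The `== "yes"` comparison above is brittle if the model varies casing or punctuation. A small normalization helper can make the check more robust; `as_bool` below is our own sketch, not part of askui (the `response_schema=bool` option described later in this guide is the built-in alternative):

```python
def as_bool(answer: str) -> bool:
    """Interpret a free-text yes/no answer from the model (hypothetical helper)."""
    return answer.strip().strip(".!?").lower() in ("yes", "y", "true")

# e.g. with: answer = agent.get("Is the user logged in? Answer with 'yes' or 'no'.")
is_logged_in = as_bool("Yes.")  # True
```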
## Working with Different Data Sources

Instead of taking a screenshot, you can analyze specific images or documents:

```python
from pathlib import Path

from PIL import Image

from askui import VisionAgent

with VisionAgent() as agent:
    # From PIL Image
    image = Image.open("screenshot.png")
    result = agent.get("What's in this image?", source=image)

    # From file path as a string
    result = agent.get("What's in this image?", source="screenshot.png")
    result = agent.get("What is this PDF about?", source="document.pdf")

    # From file path as a Path
    result = agent.get("What is this PDF about?", source=Path("document.pdf"))
    result = agent.get("What is in this Excel sheet?", source=Path("table.xlsx"))

    # From a data URL
    result = agent.get("What's in this image?", source="data:image/png;base64,...")
    result = agent.get("What is this PDF about?", source="data:application/pdf;base64,...")
```
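Data URLs like the ones above can be built from raw bytes with the standard library. `to_data_url` below is our own helper sketch, not an askui API:

```python
import base64

def to_data_url(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data URL usable as a `source` value (hypothetical helper)."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"

# e.g. to_data_url(Path("screenshot.png").read_bytes(), "image/png")
url = to_data_url(b"\x89PNG...", "image/png")
```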
## Extracting data other than strings

### Structured data extraction

For structured data extraction, use Pydantic models extending `ResponseSchemaBase`:

```python
import json

from askui import ResponseSchemaBase, VisionAgent

class UserInfo(ResponseSchemaBase):
    username: str
    is_online: bool

class UrlResponse(ResponseSchemaBase):
    url: str

with VisionAgent() as agent:
    # Get structured data
    user_info = agent.get(
        "What is the username and online status?",
        response_schema=UserInfo,
    )
    print(f"User {user_info.username} is {'online' if user_info.is_online else 'offline'}")

    # Get URL as string
    url = agent.get("What is the current url shown in the url bar?")
    print(url)  # e.g., "github.com/login"

    # Get URL as Pydantic model from image at (relative) path
    response = agent.get(
        "What is the current url shown in the url bar?",
        response_schema=UrlResponse,
        source="screenshot.png",
    )

    # Dump the whole model
    print(response.model_dump_json(indent=2))
    # or
    response_json_dict = response.model_dump(mode="json")
    print(json.dumps(response_json_dict, indent=2))
    # or work with a regular dict
    response_dict = response.model_dump()
    print(response_dict["url"])
```
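Conceptually, the model produces JSON that is validated against your schema and returned as a typed object. Here is a rough stdlib-only sketch of that round trip, using a plain dataclass in place of the pydantic-based `ResponseSchemaBase` (this is an illustration of the idea, not the actual askui implementation):

```python
import json
from dataclasses import dataclass

@dataclass
class UrlResponse:
    url: str

def validate_url_response(raw: str) -> UrlResponse:
    """Parse the model's JSON output and check it against the schema."""
    data = json.loads(raw)
    if not isinstance(data.get("url"), str):
        raise ValueError("response does not match schema: missing string field 'url'")
    return UrlResponse(url=data["url"])

response = validate_url_response('{"url": "github.com/login"}')
```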
### Basic Data Types

```python
# Get boolean response
is_login_page = agent.get(
    "Is this a login page?",
    response_schema=bool,
)
print(is_login_page)

# Get integer response
input_count = agent.get(
    "How many input fields are visible on this page?",
    response_schema=int,
)
print(input_count)

# Get float response
design_rating = agent.get(
    "Rate the page design quality from 0 to 1",
    response_schema=float,
)
print(design_rating)
```
### Complex Data Structures (nested and recursive)

```python
class NestedResponse(ResponseSchemaBase):
    nested: UrlResponse

class LinkedListNode(ResponseSchemaBase):
    value: str
    next: "LinkedListNode | None"

# Get nested response
nested = agent.get(
    "Extract the URL and its metadata from the page",
    response_schema=NestedResponse,
)
print(nested.nested.url)

# Get recursive response
linked_list = agent.get(
    "Extract the breadcrumb navigation as a linked list",
    response_schema=LinkedListNode,
)
current = linked_list
while current:
    print(current.value)
    current = current.next
```
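To see why a recursive schema like `LinkedListNode` works, it helps to picture the JSON the model returns and how it maps onto nested objects. A stdlib-only sketch with a dataclass stand-in (not askui's pydantic machinery):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkedListNode:
    value: str
    next: "Optional[LinkedListNode]"

def parse_node(data: Optional[dict]) -> Optional[LinkedListNode]:
    # Recurse over nested "next" objects; a JSON null terminates the list
    if data is None:
        return None
    return LinkedListNode(value=data["value"], next=parse_node(data.get("next")))

raw = '{"value": "Home", "next": {"value": "Docs", "next": null}}'
head = parse_node(json.loads(raw))
```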
## Under the hood: How we extract data from documents

When extracting data from documents such as Word or Excel files, we use the `markitdown` library to convert them into markdown format. We chose `markitdown` over other tools for several reasons:

- **LLM-Friendly Output:** The markdown output is optimized for token usage, which is efficient for subsequent processing with large language models.
- **Includes Sheet Names:** When converting Excel files, the name of each sheet is included in the generated markdown, providing better context.
- **Enhanced Image Descriptions:** It can use an OpenAI client (`llm_client` and `llm_model`) to generate more descriptive captions for images within documents.
- **No Local Inference:** No model inference is performed on the client machine, which means there is no need to install and maintain heavy packages like `torch`.
- **Optional Dependencies:** It allows for optional imports, meaning you only need to install the dependencies for the file types you are working with. This reduces the number of packages to manage.
- **Microsoft Maintained:** Being maintained by Microsoft, it offers robust support for converting Office documents.

## Limitations

- Support for response schemas varies among models. Currently, the `askui` model provides the best support for response schemas, as we try different models under the hood with your schema to see which one works best.
- PDF processing is only supported for Gemini models hosted on AskUI and for PDFs up to 20MB.
- Complex nested schemas may not work with all models.
- Some models may have token limits that affect extraction capabilities.

docs/mcp.md

Lines changed: 8 additions & 0 deletions
# MCP

## Table of Contents

- [What is MCP?](#what-is-mcp)
- [Our MCP Support](#our-mcp-support)
- [How to Use MCP with AskUI](#how-to-use-mcp-with-askui)
  - [With the Library](#with-the-library)
  - [With Chat](#with-chat)

## What is MCP?

The Model Context Protocol (MCP) is a standardized way to provide context and tools to Large Language Models (LLMs). It acts as a universal interface - often described as "the USB-C port for AI" - that allows LLMs to connect to external resources and functionality in a secure, standardized manner.

docs/observability.md

Lines changed: 59 additions & 0 deletions
# Observability

## 📜 Logging

You want a better understanding of what your agent is doing? Set the `log_level` to `logging.DEBUG`.

```python
import logging

from askui import VisionAgent

with VisionAgent(log_level=logging.DEBUG) as agent:
    agent...
```
## 📜 Reporting

You want to see a report of the actions your agent took? Register a reporter using the `reporters` parameter.

```python
from askui import VisionAgent
from askui.reporting import SimpleHtmlReporter

with VisionAgent(reporters=[SimpleHtmlReporter()]) as agent:
    agent...
```
You can also create your own reporter by implementing the `Reporter` interface.

```python
from typing import Optional, Union

from PIL import Image
from typing_extensions import override

from askui import VisionAgent
from askui.reporting import Reporter

class CustomReporter(Reporter):
    @override
    def add_message(
        self,
        role: str,
        content: Union[str, dict, list],
        image: Optional[Image.Image | list[Image.Image]] = None,
    ) -> None:
        # add the message to the report (see the implementation of `SimpleHtmlReporter` as an example)
        pass

    @override
    def generate(self) -> None:
        # generate the report if it is not generated live (see the implementation of `SimpleHtmlReporter` as an example)
        pass

with VisionAgent(reporters=[CustomReporter()]) as agent:
    agent...
```
You can also use multiple reporters at once. Their `generate()` and `add_message()` methods will be called in the order of the reporters in the list.

```python
with VisionAgent(reporters=[SimpleHtmlReporter(), CustomReporter()]) as agent:
    agent...
```
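To illustrate the interface, here is a minimal in-memory reporter sketch. `ListReporter` is our own hypothetical example, not part of askui, and is shown without the `Reporter` base class so it runs standalone:

```python
from typing import Optional, Union

class ListReporter:
    """Collects messages in memory; a stand-in sketch for askui's `Reporter` interface."""

    def __init__(self) -> None:
        self.messages: list[tuple[str, Union[str, dict, list]]] = []

    def add_message(
        self,
        role: str,
        content: Union[str, dict, list],
        image: Optional[object] = None,  # PIL images are omitted in this sketch
    ) -> None:
        self.messages.append((role, content))

    def generate(self) -> str:
        # Render the collected messages as a plain-text report
        return "\n".join(f"{role}: {content}" for role, content in self.messages)

reporter = ListReporter()
reporter.add_message("agent", "clicked Login")
reporter.add_message("agent", "typed username")
report = reporter.generate()
```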

docs/telemetry.md

Lines changed: 11 additions & 0 deletions
# Telemetry

By default, we record usage data to detect and fix bugs inside the package and to improve its UX. This includes:

- the version of the `askui` package used
- information about the environment, e.g., operating system, architecture, device id (hashed to protect privacy), Python version
- the session id
- some of the methods called, including (non-sensitive) method parameters and responses, e.g., the click coordinates in `click(x=100, y=100)`
- exceptions (types and messages)
- the AskUI workspace and user id, if `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` are set

If you would like to disable the recording of usage data, set the `ASKUI__VA__TELEMETRY__ENABLED` environment variable to `False`.
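For example, in a POSIX shell you can opt out for the current session (add the line to your shell profile to make it permanent):

```shell
# Disable askui usage-data recording for this shell session
export ASKUI__VA__TELEMETRY__ENABLED=False
```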
