diff --git a/README.md b/README.md
index 596614d1..fc3bb70f 100644
--- a/README.md
+++ b/README.md
@@ -88,7 +88,7 @@ pip install askui
| | AskUI [INFO](https://hub.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
|----------|----------|----------|
| ENV Variables | `ASKUI_WORKSPACE_ID`, `ASKUI_TOKEN` | `ANTHROPIC_API_KEY` |
-| Supported Commands | `click()` | `click()`, `get()`, `act()` |
+| Supported Commands | `click()`, `get()`, `locate()`, `mouse_move()` | `act()`, `click()`, `get()`, `locate()`, `mouse_move()` |
| Description | Faster Inference, European Server, Enterprise Ready | Supports complex actions |
To get started, set the environment variables required to authenticate with your chosen model provider.
@@ -130,7 +130,7 @@ You can test the Vision Agent with Huggingface models via their Spaces API. Plea
**Example Code:**
```python
-agent.click("search field", model_name="OS-Copilot/OS-Atlas-Base-7B")
+agent.click("search field", model="OS-Copilot/OS-Atlas-Base-7B")
```
### 3c. Host your own **AI Models**
@@ -143,7 +143,7 @@ You can use Vision Agent with UI-TARS if you provide your own UI-TARS API endpoi
2. Step: Provide the `TARS_URL` and `TARS_API_KEY` environment variables to Vision Agent.
-3. Step: Use the `model_name="tars"` parameter in your `click()`, `get()` and `act()` commands.
+3. Step: Use the `model="tars"` parameter in your `click()`, `get()` and `act()` etc. commands or when initializing the `VisionAgent`.
## ▶️ Start Building
@@ -171,46 +171,68 @@ with VisionAgent() as agent:
### 🎛️ Model Selection
-Instead of relying on the default model for the entire automation script, you can specify a model for each `click` command using the `model_name` parameter.
+Instead of relying on the default model for the entire automation script, you can specify a model for each `click()` (or `act()`, `get()` etc.) command using the `model` parameter or when initializing the `VisionAgent` (overridden by the `model` parameter of individual commands).
| | AskUI | Anthropic |
|----------|----------|----------|
-| `click()` | `askui-combo`, `askui-pta`, `askui-ocr` | `anthropic-claude-3-5-sonnet-20241022` |
+| `act()` | | `anthropic-claude-3-5-sonnet-20241022` |
+| `click()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022` |
+| `get()` | | `askui`, `anthropic-claude-3-5-sonnet-20241022` |
+| `locate()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022` |
+| `mouse_move()` | `askui`, `askui-combo`, `askui-pta`, `askui-ocr`, `askui-ai-element` | `anthropic-claude-3-5-sonnet-20241022` |
-**Example:** `agent.click("Preview", model_name="askui-combo")`
-
- Antrophic AI Models
-
-Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
-| Model Name | Info | Execution Speed | Security | Cost | Reliability |
-|-------------|--------------------|--------------|--------------|--------------|--------------|
-| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Antrophic is a Large Action Model (LAM), which can autonomously achieve goals. e.g. `"Book me a flight from Berlin to Rom"` | slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
-> **Note:** Configure your Antrophic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
+**Example:**
+```python
+from askui import VisionAgent
-
+with VisionAgent() as agent:
+ # Uses the default model (depending on the environment variables set, see above)
+ agent.click("Next")
+
+with VisionAgent(model="askui-combo") as agent:
+ # Uses the "askui-combo" model because it was specified when initializing the agent
+ agent.click("Next")
+ # Uses the "anthropic-claude-3-5-sonnet-20241022" model
+ agent.click("Previous", model="anthropic-claude-3-5-sonnet-20241022")
+ # Uses the "askui-combo" model again as no model was specified
+ agent.click("Next")
+```
AskUI AI Models
-Supported commands are: `click()`, `type()`, `mouse_move()`
+Supported commands are: `click()`, `locate()`, `mouse_move()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|-------------|--------------------|--------------|--------------|--------------|--------------|
+| `askui` | `AskUI` is a combination of all the following models: `askui-pta`, `askui-ocr`, `askui-combo`, `askui-ai-element` where AskUI chooses the best model for the task depending on the input. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be (at least partially) retrained |
| `askui-pta` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) which to address all kinds of UI elements by a textual description e.g. "`Login button`", "`Text login`" | fast, <500ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, can be retrained |
| `askui-ocr` | `AskUI OCR` is an OCR model trained to address texts on UI Screens e.g. "`Login`", "`Search`" | Fast, <500ms per step | Secure hosting by AskUI or on-premise | low, <0,05$ per step | Recommended for production usage, can be retrained |
| `askui-combo` | AskUI Combo is an combination from the `askui-pta` and the `askui-ocr` model to improve the accuracy. | Fast, <500ms per step | Secure hosting by AskUI or on-premise | low, <0,05$ per step | Recommended for production usage, can be retrained |
-| `askui-ai-element`| [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you looking for. Therefore, you have to crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, determinitic behaviour |
+| `askui-ai-element`| [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. Therefore, you have to crop out the element and give it a name. | Very fast, <5ms per step | Secure hosting by AskUI or on-premise | Low, <0,05$ per step | Recommended for production usage, deterministic behaviour |
> **Note:** Configure your AskUI Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
+
+
+
+ Anthropic AI Models
+
+Supported commands are: `act()`, `get()`, `click()`, `locate()`, `mouse_move()`
+| Model Name | Info | Execution Speed | Security | Cost | Reliability |
+|-------------|--------------------|--------------|--------------|--------------|--------------|
+| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals. e.g. `"Book me a flight from Berlin to Rome"` | slow, >1s per step | Model hosting by Anthropic | High, up to 1,5$ per act | Not recommended for production usage |
+> **Note:** Configure your Anthropic Model Provider [here](#3a-authenticate-with-an-ai-model-provider)
+
+
Huggingface AI Models (Spaces API)
-Supported commands are: `click()`, `type()`, `mouse_move()`
+Supported commands are: `click()`, `locate()`, `mouse_move()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|-------------|--------------------|--------------|--------------|--------------|--------------|
| `AskUI/PTA-1` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI](https://www.askui.com/) which to address all kinds of UI elements by a textual description e.g. "`Login button`", "`Text login`" | fast, <500ms per step | Huggingface hosted | Prices for Huggingface hosting | Not recommended for production applications |
@@ -226,7 +248,7 @@ Supported commands are: `click()`, `type()`, `mouse_move()`
Self Hosted UI Models
-Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
+Supported commands are: `click()`, `locate()`, `mouse_move()`, `get()`, `act()`
| Model Name | Info | Execution Speed | Security | Cost | Reliability |
|-------------|--------------------|--------------|--------------|--------------|--------------|
| `tars` | [`UI-Tars`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data e.g. "`Book me a flight to rom`" | slow, >1s per step | Self-hosted | Depening on infrastructure | Out-of-the-box not recommended for production usage |
@@ -269,26 +291,160 @@ agent.tools.clipboard.copy("...")
result = agent.tools.clipboard.paste()
```
-### 📜 Logging & Reporting
+### 📜 Logging
-You want a better understanding of what you agent is doing? Set the `log_level` to DEBUG. You can also generate a report of the automation run by setting `enable_report` to `True`.
+You want a better understanding of what your agent is doing? Set the `log_level` to DEBUG.
```python
import logging
-with VisionAgent(log_level=logging.DEBUG, enable_report=True) as agent:
+with VisionAgent(log_level=logging.DEBUG) as agent:
+ agent...
+```
+
+### 📜 Reporting
+
+You want to see a report of the actions your agent took? Register a reporter using the `reporters` parameter.
+
+```python
+from typing import Optional, Union
+from typing_extensions import override
+from askui.reporting import SimpleHtmlReporter
+from PIL import Image
+
+with VisionAgent(reporters=[SimpleHtmlReporter()]) as agent:
+ agent...
+```
+
+You can also create your own reporter by implementing the `Reporter` interface.
+
+```python
+from askui.reporting import Reporter
+
+class CustomReporter(Reporter):
+ @override
+ def add_message(
+ self,
+ role: str,
+ content: Union[str, dict, list],
+ image: Optional[Image.Image] = None,
+ ) -> None:
+ # adding message to the report (see implementation of `SimpleHtmlReporter` as an example)
+ pass
+
+ @override
+ def generate(self) -> None:
+ # generate the report if not generated live (see implementation of `SimpleHtmlReporter` as an example)
+ pass
+
+
+with VisionAgent(reporters=[CustomReporter()]) as agent:
+ agent...
+```
+
+You can also use multiple reporters at once. Their `generate()` and `add_message()` methods will be called in the order of the reporters in the list.
+
+```python
+with VisionAgent(reporters=[SimpleHtmlReporter(), CustomReporter()]) as agent:
agent...
```
### 🖥️ Multi-Monitor Support
-You have multiple monitors? Choose which one to automate by setting `display` to 1 or 2.
+You have multiple monitors? Choose which one to automate by setting `display` to `1`, `2` etc. To find the correct display or monitor, you have to play around a bit by setting it to different values. We are going to improve this soon. By default, the agent will use display 1.
```python
with VisionAgent(display=1) as agent:
agent...
```
+### 🎯 Locating elements
+
+If you have a hard time locating elements (for clicking, moving the mouse to, etc.) by simply using text, e.g.,
+
+```python
+agent.click("Password textfield")
+agent.type("********")
+```
+
+you can build more sophisticated locators.
+
+**⚠️ Warning:** Support can vary depending on the model you are using. Currently, only the `askui` model provides full support for locators. This model is chosen by default if the `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` environment variables are set and it is not overridden using the `model` parameter.
+
+Example:
+
+```python
+from askui import locators as loc
+
+password_textfield_label = loc.Text("Password")
+password_textfield = loc.Element("textfield").right_of(password_textfield_label)
+
+agent.click(password_textfield)
+agent.type("********")
+```
+
+### 📊 Extracting information
+
+The `get()` method allows you to extract information from the screen. You can use it to:
+
+- Get text or data from the screen
+- Check the state of UI elements
+- Make decisions based on screen content
+- Analyze static images
+
+#### Basic usage
+
+```python
+# Get text from screen
+url = agent.get("What is the current url shown in the url bar?")
+print(url) # e.g., "github.com/login"
+
+# Check UI state
+# Just as an example, may be flaky if used as is, better use a response schema to check for a boolean value (see below)
+is_logged_in = agent.get("Is the user logged in? Answer with 'yes' or 'no'.") == "yes"
+if is_logged_in:
+ agent.click("Logout")
+else:
+ agent.click("Login")
+```
+
+#### Using custom images
+
+Instead of taking a screenshot, you can analyze specific images:
+
+```python
+from PIL import Image
+
+# From PIL Image
+image = Image.open("screenshot.png")
+result = agent.get("What's in this image?", image)
+
+# From file path
+result = agent.get("What's in this image?", "screenshot.png")
+```
+
+#### Using response schemas
+
+For structured data extraction, use Pydantic models extending `ResponseSchemaBase`:
+
+```python
+from askui import ResponseSchemaBase
+
+class UserInfo(ResponseSchemaBase):
+ username: str
+ is_online: bool
+
+# Get structured data
+user_info = agent.get(
+ "What is the username and online status?",
+ response_schema=UserInfo
+)
+print(f"User {user_info.username} is {'online' if user_info.is_online else 'offline'}")
+```
+
+**⚠️ Limitations:**
+- Nested Pydantic schemas are not currently supported
+- Response schema is currently only supported by "askui" model (default model if `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` are set)
## What is AskUI Vision Agent?
diff --git a/pdm.lock b/pdm.lock
index aa9660da..56efc204 100644
--- a/pdm.lock
+++ b/pdm.lock
@@ -5,7 +5,7 @@
groups = ["default", "test"]
strategy = ["inherit_metadata"]
lock_version = "4.5.0"
-content_hash = "sha256:8c2ae022f9280b62be3fc98d0e14053aece0661cc6dfca089149ff784b0b2efe"
+content_hash = "sha256:797a6cf550f6ec6264f8e851a84dff73bd155ed75264cf80adf458c4a3ecb832"
[[metadata.targets]]
requires_python = ">=3.10"
@@ -240,6 +240,17 @@ files = [
{file = "exceptiongroup-1.2.2.tar.gz", hash = "sha256:47c2edf7c6738fafb49fd34290706d1a1a2f4d1c6df275526b62cbb4aa5393cc"},
]
+[[package]]
+name = "execnet"
+version = "2.1.1"
+requires_python = ">=3.8"
+summary = "execnet: rapid multi-Python deployment"
+groups = ["test"]
+files = [
+ {file = "execnet-2.1.1-py3-none-any.whl", hash = "sha256:26dee51f1b80cebd6d0ca8e74dd8745419761d3bef34163928cbebbdc4749fdc"},
+ {file = "execnet-2.1.1.tar.gz", hash = "sha256:5189b52c6121c24feae288166ab41b32549c7e2348652736540b9e6e7d4e72e3"},
+]
+
[[package]]
name = "filelock"
version = "3.16.1"
@@ -980,6 +991,21 @@ files = [
{file = "pytest_mock-3.14.0-py3-none-any.whl", hash = "sha256:0b72c38033392a5f4621342fe11e9219ac11ec9d375f8e2a0c164539e0d70f6f"},
]
+[[package]]
+name = "pytest-xdist"
+version = "3.6.1"
+requires_python = ">=3.8"
+summary = "pytest xdist plugin for distributed testing, most importantly across multiple CPUs"
+groups = ["test"]
+dependencies = [
+ "execnet>=2.1",
+ "pytest>=7.0.0",
+]
+files = [
+ {file = "pytest_xdist-3.6.1-py3-none-any.whl", hash = "sha256:9ed4adfb68a016610848639bb7e02c9352d5d9f03d04809919e2dafc3be4cca7"},
+ {file = "pytest_xdist-3.6.1.tar.gz", hash = "sha256:ead156a4db231eec769737f57668ef58a2084a34b2e55c4a8fa20d861107300d"},
+]
+
[[package]]
name = "python-dateutil"
version = "2.9.0.post0"
diff --git a/pyproject.toml b/pyproject.toml
index 9690c759..e3cc885a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -39,9 +39,10 @@ path = "src/askui/__init__.py"
distribution = true
[tool.pdm.scripts]
-test = "pytest"
-"test:unit" = "pytest tests/unit"
-"test:integration" = "pytest tests/integration"
+test = "pytest -n auto"
+"test:e2e" = "pytest -n auto tests/e2e"
+"test:integration" = "pytest -n auto tests/integration"
+"test:unit" = "pytest -n auto tests/unit"
sort = "isort ."
format = "black ."
lint = "ruff check ."
@@ -56,6 +57,7 @@ test = [
"black>=25.1.0",
"ruff>=0.9.5",
"pytest-mock>=3.14.0",
+ "pytest-xdist>=3.6.1",
]
chat = [
"streamlit>=1.42.0",
diff --git a/src/askui/__init__.py b/src/askui/__init__.py
index 5b7ab018..6cd6a904 100644
--- a/src/askui/__init__.py
+++ b/src/askui/__init__.py
@@ -3,7 +3,19 @@
__version__ = "0.2.4"
from .agent import VisionAgent
+from .models.router import ModelRouter
+from .models.types.response_schemas import ResponseSchema, ResponseSchemaBase
+from .tools.toolbox import AgentToolbox
+from .tools.agent_os import AgentOs, ModifierKey, PcKey
+
__all__ = [
+ "AgentOs",
+ "AgentToolbox",
+ "ModelRouter",
+ "ModifierKey",
+ "PcKey",
+ "ResponseSchema",
+ "ResponseSchemaBase",
"VisionAgent",
]
diff --git a/src/askui/agent.py b/src/askui/agent.py
index 5ca927a7..2948e88e 100644
--- a/src/askui/agent.py
+++ b/src/askui/agent.py
@@ -1,282 +1,452 @@
import logging
import subprocess
-from typing import Annotated, Any, Literal, Optional, Callable
-
-from pydantic import Field, validate_call
+from typing import Annotated, Any, Literal, Optional, Type, overload
+from pydantic import ConfigDict, Field, validate_call
from askui.container import telemetry
+from askui.locators.locators import Locator
+from askui.utils.image_utils import ImageSource, Img
from .tools.askui.askui_controller import (
AskUiControllerClient,
AskUiControllerServer,
- PC_AND_MODIFIER_KEY,
- MODIFIER_KEY,
+ ModifierKey,
+ PcKey,
)
-from .models.anthropic.claude import ClaudeHandler
from .logger import logger, configure_logging
from .tools.toolbox import AgentToolbox
-from .models.router import ModelRouter
-from .reporting.report import SimpleReportGenerator
+from .models import ModelComposition
+from .models.router import ModelRouter, Point
+from .reporting import CompositeReporter, Reporter
import time
from dotenv import load_dotenv
-from PIL import Image
+from .models.types.response_schemas import ResponseSchema
+
class InvalidParameterError(Exception):
pass
class VisionAgent:
- @telemetry.record_call(exclude={"report_callback"})
+ """
+ A vision-based agent that can interact with user interfaces through computer vision and AI.
+
+ This agent can perform various UI interactions like clicking, typing, scrolling, and more.
+ It uses computer vision models to locate UI elements and execute actions on them.
+
+ Parameters:
+ log_level (int, optional):
+ The logging level to use. Defaults to logging.INFO.
+ display (int, optional):
+ The display number to use for screen interactions. Defaults to 1.
+ model_router (ModelRouter | None, optional):
+ Custom model router instance. If None, a default one will be created.
+ reporters (list[Reporter] | None, optional):
+ List of reporter instances for logging and reporting. If None, an empty list is used.
+ tools (AgentToolbox | None, optional):
+ Custom toolbox instance. If None, a default one will be created with AskUiControllerClient.
+ model (ModelComposition | str | None, optional):
+ The default composition or name of the model(s) to be used for vision tasks.
+ Can be overridden by the `model` parameter in the `click()`, `get()`, `act()` etc. methods.
+
+ Example:
+ ```python
+ with VisionAgent() as agent:
+ agent.click("Submit button")
+ agent.type("Hello World")
+ agent.act("Open settings menu")
+ ```
+ """
+ @telemetry.record_call(exclude={"model_router", "reporters", "tools"})
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def __init__(
self,
- log_level=logging.INFO,
- display: int = 1,
- enable_report: bool = False,
- enable_askui_controller: bool = True,
- report_callback: Callable[[str | dict[str, Any]], None] | None = None,
+ log_level: int | str = logging.INFO,
+ display: Annotated[int, Field(ge=1)] = 1,
+ model_router: ModelRouter | None = None,
+ reporters: list[Reporter] | None = None,
+ tools: AgentToolbox | None = None,
+ model: ModelComposition | str | None = None,
) -> None:
load_dotenv()
configure_logging(level=log_level)
- self.report = None
- if enable_report:
- self.report = SimpleReportGenerator(report_callback=report_callback)
- self.controller = None
- self.client = None
- if enable_askui_controller:
- self.controller = AskUiControllerServer()
- self.controller.start(True)
- time.sleep(0.5)
- self.client = AskUiControllerClient(display, self.report)
- self.client.connect()
- self.client.set_display(display)
- self.model_router = ModelRouter(log_level, self.report)
- self.claude = ClaudeHandler(log_level=log_level)
- self.tools = AgentToolbox(os_controller=self.client)
-
- def _check_askui_controller_enabled(self) -> None:
- if not self.client:
- raise ValueError(
- "AskUI Controller is not initialized. Please, set `enable_askui_controller` to `True` when initializing the `VisionAgent`."
- )
-
- @telemetry.record_call(exclude={"instruction"})
- def click(self, instruction: Optional[str] = None, button: Literal['left', 'middle', 'right'] = 'left', repeat: int = 1, model_name: Optional[str] = None) -> None:
+ self._reporter = CompositeReporter(reports=reporters)
+ self.tools = tools or AgentToolbox(
+ agent_os=AskUiControllerClient(
+ display=display,
+ reporter=self._reporter,
+ controller_server=AskUiControllerServer()
+ ),
+ )
+ self.model_router = (
+ ModelRouter(tools=self.tools, reporter=self._reporter) if model_router is None else model_router
+ )
+ self._model = model
+
+ @telemetry.record_call(exclude={"locator"})
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
+ def click(
+ self,
+ locator: Optional[str | Locator] = None,
+ button: Literal['left', 'middle', 'right'] = 'left',
+ repeat: Annotated[int, Field(gt=0)] = 1,
+ model: ModelComposition | str | None = None,
+ ) -> None:
"""
- Simulates a mouse click on the user interface element identified by the provided instruction.
+ Simulates a mouse click on the user interface element identified by the provided locator.
Parameters:
- instruction (str | None): The identifier or description of the element to click.
- button ('left' | 'middle' | 'right'): Specifies which mouse button to click. Defaults to 'left'.
- repeat (int): The number of times to click. Must be greater than 0. Defaults to 1.
- model_name (str | None): The model name to be used for element detection. Optional.
+ locator (str | Locator | None):
+ The identifier or description of the element to click. If None, clicks at current position.
+ button ('left' | 'middle' | 'right'):
+ Specifies which mouse button to click. Defaults to 'left'.
+ repeat (int):
+ The number of times to click. Must be greater than 0. Defaults to 1.
+ model (ModelComposition | str | None):
+ The composition or name of the model(s) to be used for locating the element to click on using the `locator`.
Raises:
InvalidParameterError: If the 'repeat' parameter is less than 1.
Example:
- ```python
- with VisionAgent() as agent:
- agent.click() # Left click on current position
- agent.click("Edit") # Left click on text "Edit"
- agent.click("Edit", button="right") # Right click on text "Edit"
- agent.click(repeat=2) # Double left click on current position
- agent.click("Edit", button="middle", repeat=4) # 4x middle click on text "Edit"
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.click() # Left click on current position
+ agent.click("Edit") # Left click on text "Edit"
+ agent.click("Edit", button="right") # Right click on text "Edit"
+ agent.click(repeat=2) # Double left click on current position
+ agent.click("Edit", button="middle", repeat=4) # 4x middle click on text "Edit"
+ ```
"""
if repeat < 1:
raise InvalidParameterError("InvalidParameterError! The parameter 'repeat' needs to be greater than 0.")
- self._check_askui_controller_enabled()
- if self.report is not None:
- msg = 'click'
- if button != 'left':
- msg = f'{button} ' + msg
- if repeat > 1:
- msg += f' {repeat}x times'
- if instruction is not None:
- msg += f' on "{instruction}"'
- self.report.add_message("User", msg)
- if instruction is not None:
- logger.debug("VisionAgent received instruction to click '%s'", instruction)
- self.__mouse_move(instruction, model_name)
- self.client.click(button, repeat) # type: ignore
-
- def __mouse_move(self, instruction: str, model_name: Optional[str] = None) -> None:
- self._check_askui_controller_enabled()
- screenshot = self.client.screenshot() # type: ignore
- x, y = self.model_router.locate(screenshot, instruction, model_name)
- if self.report is not None:
- self.report.add_message("ModelRouter", f"locate: ({x}, {y})")
- self.client.mouse(x, y) # type: ignore
-
- @telemetry.record_call(exclude={"instruction"})
- def mouse_move(self, instruction: str, model_name: Optional[str] = None) -> None:
+ msg = 'click'
+ if button != 'left':
+ msg = f'{button} ' + msg
+ if repeat > 1:
+ msg += f' {repeat}x times'
+ if locator is not None:
+ msg += f' on {locator}'
+ self._reporter.add_message("User", msg)
+ if locator is not None:
+ logger.debug("VisionAgent received instruction to click on %s", locator)
+ self._mouse_move(locator, model or self._model)
+ self.tools.agent_os.click(button, repeat) # type: ignore
+
+ def _locate(self, locator: str | Locator, screenshot: Optional[Img] = None, model: ModelComposition | str | None = None) -> Point:
+ _screenshot = ImageSource(self.tools.agent_os.screenshot() if screenshot is None else screenshot)
+ point = self.model_router.locate(_screenshot.root, locator, model or self._model)
+ self._reporter.add_message("ModelRouter", f"locate: ({point[0]}, {point[1]})")
+ return point
+
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
+ def locate(
+ self,
+ locator: str | Locator,
+ screenshot: Optional[Img] = None,
+ model: ModelComposition | str | None = None,
+ ) -> Point:
"""
- Moves the mouse cursor to the UI element identified by the provided instruction.
+ Locates the UI element identified by the provided locator.
Parameters:
- instruction (str): The identifier or description of the element to move to.
- model_name (str | None): The model name to be used for element detection. Optional.
+ locator (str | Locator):
+ The identifier or description of the element to locate.
+ screenshot (Img | None, optional):
+ The screenshot to use for locating the element. Can be a path to an image file, a PIL Image object or a data URL.
+ If None, takes a screenshot of the currently selected display.
+ model (ModelComposition | str | None):
+ The composition or name of the model(s) to be used for locating the element using the `locator`.
+
+ Returns:
+ Point: The coordinates of the element as a tuple (x, y).
Example:
- ```python
- with VisionAgent() as agent:
- agent.mouse_move("Submit button") # Moves cursor to submit button
- agent.mouse_move("Close") # Moves cursor to close element
- agent.mouse_move("Profile picture", model_name="custom_model") # Uses specific model
- ```
+ ```python
+ with VisionAgent() as agent:
+ point = agent.locate("Submit button")
+ print(f"Element found at coordinates: {point}")
+ ```
"""
- if self.report is not None:
- self.report.add_message("User", f'mouse_move: "{instruction}"')
- logger.debug("VisionAgent received instruction to mouse_move '%s'", instruction)
- self.__mouse_move(instruction, model_name)
+ self._reporter.add_message("User", f"locate {locator}")
+ logger.debug("VisionAgent received instruction to locate %s", locator)
+ return self._locate(locator, screenshot, model or self._model)
+
+ def _mouse_move(self, locator: str | Locator, model: ModelComposition | str | None = None) -> None:
+ point = self._locate(locator=locator, model=model or self._model)
+ self.tools.agent_os.mouse(point[0], point[1]) # type: ignore
+
+ @telemetry.record_call(exclude={"locator"})
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
+ def mouse_move(
+ self,
+ locator: str | Locator,
+ model: ModelComposition | str | None = None,
+ ) -> None:
+ """
+ Moves the mouse cursor to the UI element identified by the provided locator.
+
+ Parameters:
+ locator (str | Locator):
+ The identifier or description of the element to move to.
+ model (ModelComposition | str | None):
+ The composition or name of the model(s) to be used for locating the element to move the mouse to using the `locator`.
+
+ Example:
+ ```python
+ with VisionAgent() as agent:
+ agent.mouse_move("Submit button") # Moves cursor to submit button
+ agent.mouse_move("Close") # Moves cursor to close element
+ agent.mouse_move("Profile picture", model="custom_model") # Uses specific model
+ ```
+ """
+ self._reporter.add_message("User", f'mouse_move: {locator}')
+ logger.debug("VisionAgent received instruction to mouse_move to %s", locator)
+ self._mouse_move(locator, model or self._model)
@telemetry.record_call()
- def mouse_scroll(self, x: int, y: int) -> None:
+ @validate_call
+ def mouse_scroll(
+ self,
+ x: int,
+ y: int,
+ ) -> None:
"""
Simulates scrolling the mouse wheel by the specified horizontal and vertical amounts.
Parameters:
- x (int): The horizontal scroll amount. Positive values typically scroll right, negative values scroll left.
- y (int): The vertical scroll amount. Positive values typically scroll down, negative values scroll up.
+ x (int):
+ The horizontal scroll amount. Positive values typically scroll right, negative values scroll left.
+ y (int):
+ The vertical scroll amount. Positive values typically scroll down, negative values scroll up.
Note:
- The actual `scroll direction` depends on the operating system's configuration.
+ The actual scroll direction depends on the operating system's configuration.
Some systems may have "natural scrolling" enabled, which reverses the traditional direction.
- The meaning of scroll `units` varies` acro`ss oper`ating` systems and applications.
+ The meaning of scroll units varies across operating systems and applications.
A scroll value of 10 might result in different distances depending on the application and system settings.
Example:
- ```python
- with VisionAgent() as agent:
- agent.mouse_scroll(0, 10) # Usually scrolls down 10 units
- agent.mouse_scroll(0, -5) # Usually scrolls up 5 units
- agent.mouse_scroll(3, 0) # Usually scrolls right 3 units
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.mouse_scroll(0, 10) # Usually scrolls down 10 units
+ agent.mouse_scroll(0, -5) # Usually scrolls up 5 units
+ agent.mouse_scroll(3, 0) # Usually scrolls right 3 units
+ ```
"""
- self._check_askui_controller_enabled()
- if self.report is not None:
- self.report.add_message("User", f'mouse_scroll: "{x}", "{y}"')
- self.client.mouse_scroll(x, y)
+ self._reporter.add_message("User", f'mouse_scroll: "{x}", "{y}"')
+ self.tools.agent_os.mouse_scroll(x, y)
@telemetry.record_call(exclude={"text"})
- def type(self, text: str) -> None:
+ @validate_call
+ def type(
+ self,
+ text: Annotated[str, Field(min_length=1)],
+ ) -> None:
"""
Types the specified text as if it were entered on a keyboard.
Parameters:
- text (str): The text to be typed.
+ text (str):
+ The text to be typed. Must be at least 1 character long.
Example:
- ```python
- with VisionAgent() as agent:
- agent.type("Hello, world!") # Types "Hello, world!"
- agent.type("user@example.com") # Types an email address
- agent.type("password123") # Types a password
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.type("Hello, world!") # Types "Hello, world!"
+ agent.type("user@example.com") # Types an email address
+ agent.type("password123") # Types a password
+ ```
"""
- self._check_askui_controller_enabled()
- if self.report is not None:
- self.report.add_message("User", f'type: "{text}"')
+ self._reporter.add_message("User", f'type: "{text}"')
logger.debug("VisionAgent received instruction to type '%s'", text)
- self.client.type(text) # type: ignore
+ self.tools.agent_os.type(text) # type: ignore
- @telemetry.record_call(exclude={"instruction", "screenshot"})
- def get(self, instruction: str, model_name: Optional[str] = None, screenshot: Optional[Image.Image] = None) -> str:
+
+ @overload
+ def get(
+ self,
+ query: Annotated[str, Field(min_length=1)],
+ response_schema: None = None,
+ image: Optional[Img] = None,
+ model: ModelComposition | str | None = None,
+ ) -> str: ...
+ @overload
+ def get(
+ self,
+ query: Annotated[str, Field(min_length=1)],
+ response_schema: Type[ResponseSchema],
+ image: Optional[Img] = None,
+ model: ModelComposition | str | None = None,
+ ) -> ResponseSchema: ...
+
+ @telemetry.record_call(exclude={"query", "image", "response_schema"})
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
+ def get(
+ self,
+ query: Annotated[str, Field(min_length=1)],
+ image: Optional[Img] = None,
+ response_schema: Type[ResponseSchema] | None = None,
+ model: ModelComposition | str | None = None,
+ ) -> ResponseSchema | str:
"""
- Retrieves text or information from the screen based on the provided instruction.
+ Retrieves information from an image (defaults to a screenshot of the current screen) based on the provided query.
Parameters:
- instruction (str): The instruction describing what information to retrieve.
- model_name (str | None): The model name to be used for information extraction. Optional.
+ query (str):
+ The query describing what information to retrieve.
+ image (Img | None, optional):
+ The image to extract information from. Defaults to a screenshot of the current screen.
+ Can be a path to an image file, a PIL Image object or a data URL.
+ response_schema (Type[ResponseSchema] | None, optional):
+ A Pydantic model class that defines the response schema. If not provided, returns a string.
+ model (ModelComposition | str | None, optional):
+ The composition or name of the model(s) to be used for retrieving information from the screen or image using the `query`.
+ Note: `response_schema` is not supported by all models.
Returns:
- str: The extracted text or information.
+ ResponseSchema | str:
+ The extracted information, either as an instance of ResponseSchema or string if no response_schema is provided.
+
+ Limitations:
+            - Nested Pydantic schemas are not currently supported.
+            - Schema support is only available with the "askui" model (default model if `ASKUI_WORKSPACE_ID` and `ASKUI_TOKEN` are set) at the moment.
Example:
- ```python
- with VisionAgent() as agent:
- price = agent.get("What is the price displayed?")
- username = agent.get("What is the username shown in the profile?")
- error_message = agent.get("What does the error message say?")
- ```
+ ```python
+ from askui import JsonSchemaBase
+ from PIL import Image
+
+ class UrlResponse(JsonSchemaBase):
+ url: str
+
+ with VisionAgent() as agent:
+ # Get URL as string
+ url = agent.get("What is the current url shown in the url bar?")
+
+ # Get URL as Pydantic model from image at (relative) path
+ response = agent.get(
+ "What is the current url shown in the url bar?",
+ response_schema=UrlResponse,
+ image="screenshot.png",
+ )
+ print(response.url)
+
+ # Get boolean response from PIL Image
+ is_login_page = agent.get(
+ "Is this a login page?",
+ response_schema=bool,
+ image=Image.open("screenshot.png"),
+ )
+
+ # Get integer response
+ input_count = agent.get(
+ "How many input fields are visible on this page?",
+ response_schema=int,
+ )
+
+ # Get float response
+ design_rating = agent.get(
+ "Rate the page design quality from 0 to 1",
+ response_schema=float,
+ )
+ ```
"""
- self._check_askui_controller_enabled()
- if self.report is not None:
- self.report.add_message("User", f'get: "{instruction}"')
- logger.debug("VisionAgent received instruction to get '%s'", instruction)
- if screenshot is None:
- screenshot = self.client.screenshot() # type: ignore
- response = self.model_router.get_inference(screenshot, instruction, model_name)
- if self.report is not None:
- self.report.add_message("Agent", response)
+ logger.debug("VisionAgent received instruction to get '%s'", query)
+ _image = ImageSource(self.tools.agent_os.screenshot() if image is None else image) # type: ignore
+ self._reporter.add_message("User", f'get: "{query}"', image=_image.root)
+ response = self.model_router.get_inference(
+ image=_image,
+ query=query,
+ model=model or self._model,
+ response_schema=response_schema,
+ )
+ if self._reporter is not None:
+ message_content = str(response) if isinstance(response, (str, bool, int, float)) else response.model_dump()
+ self._reporter.add_message("Agent", message_content)
return response
@telemetry.record_call()
@validate_call
- def wait(self, sec: Annotated[float, Field(gt=0)]) -> None:
+ def wait(
+ self,
+ sec: Annotated[float, Field(gt=0.0)],
+ ) -> None:
"""
Pauses the execution of the program for the specified number of seconds.
Parameters:
- sec (float): The number of seconds to wait. Must be greater than 0.
+ sec (float):
+ The number of seconds to wait. Must be greater than 0.0.
Raises:
ValueError: If the provided `sec` is negative.
Example:
- ```python
- with VisionAgent() as agent:
- agent.wait(5) # Pauses execution for 5 seconds
- agent.wait(0.5) # Pauses execution for 500 milliseconds
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.wait(5) # Pauses execution for 5 seconds
+ agent.wait(0.5) # Pauses execution for 500 milliseconds
+ ```
"""
time.sleep(sec)
@telemetry.record_call()
- def key_up(self, key: PC_AND_MODIFIER_KEY) -> None:
+ @validate_call
+ def key_up(
+ self,
+ key: PcKey | ModifierKey,
+ ) -> None:
"""
Simulates the release of a key.
Parameters:
- key (PC_AND_MODIFIER_KEY): The key to be released.
+ key (PcKey | ModifierKey):
+ The key to be released.
Example:
- ```python
- with VisionAgent() as agent:
- agent.key_up('a') # Release the 'a' key
- agent.key_up('shift') # Release the 'Shift' key
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.key_up('a') # Release the 'a' key
+ agent.key_up('shift') # Release the 'Shift' key
+ ```
"""
- self._check_askui_controller_enabled()
- if self.report is not None:
- self.report.add_message("User", f'key_up "{key}"')
+ self._reporter.add_message("User", f'key_up "{key}"')
logger.debug("VisionAgent received in key_up '%s'", key)
- self.client.keyboard_release(key)
+ self.tools.agent_os.keyboard_release(key)
@telemetry.record_call()
- def key_down(self, key: PC_AND_MODIFIER_KEY) -> None:
+ @validate_call
+ def key_down(
+ self,
+ key: PcKey | ModifierKey,
+ ) -> None:
"""
Simulates the pressing of a key.
Parameters:
- key (PC_AND_MODIFIER_KEY): The key to be pressed.
+ key (PcKey | ModifierKey):
+ The key to be pressed.
Example:
- ```python
- with VisionAgent() as agent:
- agent.key_down('a') # Press the 'a' key
- agent.key_down('shift') # Press the 'Shift' key
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.key_down('a') # Press the 'a' key
+ agent.key_down('shift') # Press the 'Shift' key
+ ```
"""
- self._check_askui_controller_enabled()
- if self.report is not None:
- self.report.add_message("User", f'key_down "{key}"')
+ self._reporter.add_message("User", f'key_down "{key}"')
logger.debug("VisionAgent received in key_down '%s'", key)
- self.client.keyboard_pressed(key)
+ self.tools.agent_os.keyboard_pressed(key)
@telemetry.record_call(exclude={"goal"})
- def act(self, goal: str, model_name: Optional[str] = None) -> None:
+ @validate_call
+ def act(
+ self,
+ goal: Annotated[str, Field(min_length=1)],
+ model: ModelComposition | str | None = None,
+ ) -> None:
"""
Instructs the agent to achieve a specified goal through autonomous actions.
@@ -285,54 +455,59 @@ def act(self, goal: str, model_name: Optional[str] = None) -> None:
interface interactions.
Parameters:
- goal (str): A description of what the agent should achieve.
- model_name (str | None): The specific model to use for vision analysis.
- If None, uses the default model.
+ goal (str):
+ A description of what the agent should achieve.
+ model (ModelComposition | str | None, optional):
+ The composition or name of the model(s) to be used for achieving the `goal`.
Example:
- ```python
- with VisionAgent() as agent:
- agent.act("Open the settings menu")
- agent.act("Search for 'printer' in the search box")
- agent.act("Log in with username 'admin' and password '1234'")
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.act("Open the settings menu")
+ agent.act("Search for 'printer' in the search box")
+ agent.act("Log in with username 'admin' and password '1234'")
+ ```
"""
- self._check_askui_controller_enabled()
- if self.report is not None:
- self.report.add_message("User", f'act: "{goal}"')
+ self._reporter.add_message("User", f'act: "{goal}"')
logger.debug(
"VisionAgent received instruction to act towards the goal '%s'", goal
)
- self.model_router.act(self.client, goal, model_name)
+ self.model_router.act(goal, model or self._model)
@telemetry.record_call()
+ @validate_call
def keyboard(
- self, key: PC_AND_MODIFIER_KEY, modifier_keys: list[MODIFIER_KEY] | None = None
+ self,
+ key: PcKey | ModifierKey,
+ modifier_keys: Optional[list[ModifierKey]] = None,
) -> None:
"""
Simulates pressing a key or key combination on the keyboard.
Parameters:
- key (PC_AND_MODIFIER_KEY): The main key to press. This can be a letter, number,
- special character, or function key.
- modifier_keys (list[MODIFIER_KEY] | None): Optional list of modifier keys to press
- along with the main key. Common modifier keys include 'ctrl', 'alt', 'shift'.
+ key (PcKey | ModifierKey):
+ The main key to press. This can be a letter, number, special character, or function key.
+ modifier_keys (list[ModifierKey] | None, optional):
+ List of modifier keys to press along with the main key. Common modifier keys include 'ctrl', 'alt', 'shift'.
Example:
- ```python
- with VisionAgent() as agent:
- agent.keyboard('a') # Press 'a' key
- agent.keyboard('enter') # Press 'Enter' key
- agent.keyboard('v', ['control']) # Press Ctrl+V (paste)
- agent.keyboard('s', ['control', 'shift']) # Press Ctrl+Shift+S
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.keyboard('a') # Press 'a' key
+ agent.keyboard('enter') # Press 'Enter' key
+ agent.keyboard('v', ['control']) # Press Ctrl+V (paste)
+ agent.keyboard('s', ['control', 'shift']) # Press Ctrl+Shift+S
+ ```
"""
- self._check_askui_controller_enabled()
logger.debug("VisionAgent received instruction to press '%s'", key)
- self.client.keyboard_tap(key, modifier_keys) # type: ignore
+ self.tools.agent_os.keyboard_tap(key, modifier_keys) # type: ignore
@telemetry.record_call(exclude={"command"})
- def cli(self, command: str) -> None:
+ @validate_call
+ def cli(
+ self,
+ command: Annotated[str, Field(min_length=1)],
+ ) -> None:
"""
Executes a command on the command line interface.
@@ -340,32 +515,39 @@ def cli(self, command: str) -> None:
is split on spaces and executed as a subprocess.
Parameters:
- command (str): The command to execute on the command line.
+ command (str):
+ The command to execute on the command line.
Example:
- ```python
- with VisionAgent() as agent:
- agent.cli("echo Hello World") # Prints "Hello World"
- agent.cli("ls -la") # Lists files in current directory with details
- agent.cli("python --version") # Displays Python version
- ```
+ ```python
+ with VisionAgent() as agent:
+ agent.cli("echo Hello World") # Prints "Hello World"
+ agent.cli("ls -la") # Lists files in current directory with details
+ agent.cli("python --version") # Displays Python version
+ ```
"""
logger.debug("VisionAgent received instruction to execute '%s' on cli", command)
subprocess.run(command.split(" "))
@telemetry.record_call(flush=True)
def close(self) -> None:
- if self.client:
- self.client.disconnect()
- if self.controller:
- self.controller.stop(True)
+ self.tools.agent_os.disconnect()
+ self._reporter.generate()
+
+ @telemetry.record_call()
+ def open(self) -> None:
+ self.tools.agent_os.connect()
@telemetry.record_call()
def __enter__(self) -> "VisionAgent":
+ self.open()
return self
@telemetry.record_call(exclude={"exc_value", "traceback"})
- def __exit__(self, exc_type, exc_value, traceback) -> None:
+ def __exit__(
+ self,
+ exc_type: Optional[Type[BaseException]],
+ exc_value: Optional[BaseException],
+ traceback: Optional[Any],
+ ) -> None:
self.close()
- if self.report is not None:
- self.report.generate_report()
diff --git a/src/askui/chat/__main__.py b/src/askui/chat/__main__.py
index f212521e..97cf18b2 100644
--- a/src/askui/chat/__main__.py
+++ b/src/askui/chat/__main__.py
@@ -1,17 +1,22 @@
from random import randint
from PIL import Image, ImageDraw
-from typing import Any, Callable, Literal
+from typing import Union
+from typing_extensions import override, TypedDict
import streamlit as st
from askui import VisionAgent
import logging
from askui.chat.click_recorder import ClickRecorder
-from askui.utils import base64_to_image, draw_point_on_image
+from askui.models import ModelName
+from askui.reporting import Reporter
+from askui.utils.image_utils import base64_to_image
import json
-from datetime import date, datetime
+from datetime import datetime
import os
import glob
import re
+from askui.utils.image_utils import draw_point_on_image
+
st.set_page_config(
page_title="Vision Agent Chat",
@@ -25,14 +30,6 @@
click_recorder = ClickRecorder()
-def json_serial(obj):
- """JSON serializer for objects not serializable by default json code"""
-
- if isinstance(obj, (datetime, date)):
- return obj.isoformat()
- raise TypeError("Type %s not serializable" % type(obj))
-
-
def setup_chat_dirs():
os.makedirs(CHAT_SESSIONS_DIR_PATH, exist_ok=True)
os.makedirs(CHAT_IMAGES_DIR_PATH, exist_ok=True)
@@ -70,19 +67,24 @@ def get_image(img_b64_str_or_path: str) -> Image.Image:
def write_message(
- role: Literal["User", "Anthropic Computer Use", "AgentOS", "User (Demonstration)"],
- content: str,
+ role: str,
+ content: str | dict | list,
timestamp: str,
- image: Image.Image |str | None = None,
+ image: Image.Image | str | list[str | Image.Image] | list[str] | list[Image.Image] | None = None,
):
_role = ROLE_MAP.get(role.lower(), UNKNOWN_ROLE)
avatar = None if _role != UNKNOWN_ROLE else "❔"
with st.chat_message(_role, avatar=avatar):
st.markdown(f"*{timestamp}* - **{role}**\n\n")
- st.markdown(content)
+ st.markdown(json.dumps(content, indent=2) if isinstance(content, (dict, list)) else content)
if image:
- img = get_image(image) if isinstance(image, str) else image
- st.image(img)
+ if isinstance(image, list):
+ for img in image:
+ img = get_image(img) if isinstance(img, str) else img
+ st.image(img)
+ else:
+ img = get_image(image) if isinstance(image, str) else image
+ st.image(img)
def save_image(image: Image.Image) -> str:
@@ -92,31 +94,44 @@ def save_image(image: Image.Image) -> str:
return image_path
-def chat_history_appender(session_id: str) -> Callable[[str | dict[str, Any]], None]:
- def append_to_chat_history(report: str | dict) -> None:
- if isinstance(report, dict):
- if report.get("image"):
- if not os.path.isfile(report["image"]):
- report["image"] = save_image(base64_to_image(report["image"]))
+class Message(TypedDict):
+ role: str
+ content: str | dict | list
+ timestamp: str
+ image: str | list[str] | None
+
+
+class ChatHistoryAppender(Reporter):
+ def __init__(self, session_id: str) -> None:
+ self._session_id = session_id
+
+ @override
+ def add_message(self, role: str, content: Union[str, dict, list], image: Image.Image | list[Image.Image] | None = None) -> None:
+ image_paths: list[str] = []
+ if image is None:
+ _images = []
+ elif isinstance(image, list):
+ _images = image
else:
- report = {
- "role": "unknown",
- "content": f"🔄 {report}",
- "timestamp": datetime.now().isoformat(),
- }
- write_message(
- report["role"],
- report["content"],
- report["timestamp"],
- report.get("image"),
+ _images = [image]
+ for img in _images:
+ image_paths.append(save_image(img))
+ message = Message(
+ role=role,
+ content=content,
+ timestamp=datetime.now().isoformat(),
+ image=image_paths,
)
+ write_message(**message)
with open(
- os.path.join(CHAT_SESSIONS_DIR_PATH, f"{session_id}.jsonl"), "a"
+ os.path.join(CHAT_SESSIONS_DIR_PATH, f"{self._session_id}.jsonl"), "a"
) as f:
- json.dump(report, f, default=json_serial)
+ json.dump(message, f)
f.write("\n")
- return append_to_chat_history
+ @override
+ def generate(self) -> None:
+ pass
def get_available_sessions():
@@ -200,9 +215,9 @@ def rerun():
screenshot, (x, y)
)
element_description = agent.get(
- prompt,
- screenshot=screenshot_with_crosshair,
- model_name="anthropic-claude-3-5-sonnet-20241022",
+ query=prompt,
+ image=screenshot_with_crosshair,
+ model=ModelName.ANTHROPIC__CLAUDE__3_5__SONNET__20241022,
)
write_message(
message["role"],
@@ -211,8 +226,8 @@ def rerun():
image=screenshot_with_crosshair,
)
agent.mouse_move(
- instruction=element_description.replace('"', ""),
- model_name="anthropic-claude-3-5-sonnet-20241022",
+ locator=element_description.replace('"', ""),
+ model=ModelName.ANTHROPIC__CLAUDE__3_5__SONNET__20241022,
)
else:
write_message(
@@ -255,7 +270,7 @@ def rerun():
st.session_state.session_id = session_id
st.rerun()
-report_callback = chat_history_appender(session_id)
+reporter = ChatHistoryAppender(session_id)
st.title(f"Vision Agent Chat - {session_id}")
st.session_state.messages = load_chat_history(session_id)
@@ -270,26 +285,16 @@ def rerun():
)
if value_to_type := st.chat_input("Simulate Typing for User (Demonstration)"):
- report_callback(
- {
- "role": "User (Demonstration)",
- "content": f'type("{value_to_type}", 50)',
- "timestamp": datetime.now().isoformat(),
- "is_json": False,
- "image": None,
- }
+ reporter.add_message(
+ role="User (Demonstration)",
+ content=f'type("{value_to_type}", 50)',
)
st.rerun()
if st.button("Simulate left click"):
- report_callback(
- {
- "role": "User (Demonstration)",
- "content": 'click("left", 1)',
- "timestamp": datetime.now().isoformat(),
- "is_json": False,
- "image": None,
- }
+ reporter.add_message(
+ role="User (Demonstration)",
+ content='click("left", 1)',
)
st.rerun()
@@ -298,35 +303,24 @@ def rerun():
"Demonstrate where to move mouse"
): # only single step, only click supported for now, independent of click always registered as click
image, coordinates = click_recorder.record()
- report_callback(
- {
- "role": "User (Demonstration)",
- "content": "screenshot()",
- "timestamp": datetime.now().isoformat(),
- "is_json": False,
- "image": save_image(image),
- }
+ reporter.add_message(
+ role="User (Demonstration)",
+ content="screenshot()",
+ image=image,
)
- report_callback(
- {
- "role": "User (Demonstration)",
- "content": f"mouse({coordinates[0]}, {coordinates[1]})",
- "timestamp": datetime.now().isoformat(),
- "is_json": False,
- "image": save_image(
- draw_point_on_image(image, coordinates[0], coordinates[1])
- ),
- }
+ reporter.add_message(
+ role="User (Demonstration)",
+ content=f"mouse({coordinates[0]}, {coordinates[1]})",
+ image=draw_point_on_image(image, coordinates[0], coordinates[1]),
)
st.rerun()
if act_prompt := st.chat_input("Ask AI"):
with VisionAgent(
log_level=logging.DEBUG,
- enable_report=True,
- report_callback=report_callback,
+ reporters=[reporter],
) as agent:
- agent.act(act_prompt, model_name="claude")
+ agent.act(act_prompt, model="claude")
st.rerun()
if st.button("Rerun"):
diff --git a/src/askui/exceptions.py b/src/askui/exceptions.py
new file mode 100644
index 00000000..467882da
--- /dev/null
+++ b/src/askui/exceptions.py
@@ -0,0 +1,8 @@
+class AutomationError(Exception):
+ """Exception raised when the automation step cannot complete."""
+ pass
+
+
+class ElementNotFoundError(AutomationError):
+ """Exception raised when an element cannot be located."""
+ pass
diff --git a/src/askui/locators/__init__.py b/src/askui/locators/__init__.py
new file mode 100644
index 00000000..23964220
--- /dev/null
+++ b/src/askui/locators/__init__.py
@@ -0,0 +1,9 @@
+from askui.locators.locators import AiElement, Element, Prompt, Image, Text
+
+__all__ = [
+ "AiElement",
+ "Element",
+ "Prompt",
+ "Image",
+ "Text",
+]
diff --git a/src/askui/locators/locators.py b/src/askui/locators/locators.py
new file mode 100644
index 00000000..24bc569a
--- /dev/null
+++ b/src/askui/locators/locators.py
@@ -0,0 +1,418 @@
+from abc import ABC
+import pathlib
+from typing import Annotated, Literal, Union
+import uuid
+
+from PIL import Image as PILImage
+from pydantic import ConfigDict, Field, validate_call
+
+from askui.utils.image_utils import ImageSource
+from askui.locators.relatable import Relatable
+
+
+class Locator(Relatable, ABC):
+ """Base class for all locators."""
+
+ def _str(self) -> str:
+ return "locator"
+
+ pass
+
+
+class Prompt(Locator):
+ """Locator for finding ui elements by a textual prompt / description of a ui element, e.g., "green sign up button"."""
+
+ @validate_call
+ def __init__(
+ self,
+ prompt: Annotated[
+ str,
+ Field(
+ description="""A textual prompt / description of a ui element, e.g., "green sign up button"."""
+ ),
+ ],
+ ) -> None:
+ """Initialize a Prompt locator.
+
+ Args:
+ prompt: A textual prompt / description of a ui element, e.g., "green sign up button"
+ """
+ super().__init__()
+ self._prompt = prompt
+
+ @property
+ def prompt(self) -> str:
+ return self._prompt
+
+ def _str(self) -> str:
+ return f'element with prompt "{self.prompt}"'
+
+
+class Element(Locator):
+ """Locator for finding ui elements by a class name assigned to the ui element, e.g., by a computer vision model."""
+
+ @validate_call
+ def __init__(
+ self,
+ class_name: Annotated[
+ Literal["text", "textfield"] | None,
+ Field(
+ description="""The class name of the ui element, e.g., 'text' or 'textfield'."""
+ ),
+ ] = None,
+ ) -> None:
+ """Initialize an Element locator.
+
+ Args:
+ class_name: The class name of the ui element, e.g., 'text' or 'textfield'
+ """
+ super().__init__()
+ self._class_name = class_name
+
+ @property
+ def class_name(self) -> Literal["text", "textfield"] | None:
+ return self._class_name
+
+ def _str(self) -> str:
+ return (
+ f'element with class "{self.class_name}"' if self.class_name else "element"
+ )
+
+
+TextMatchType = Literal["similar", "exact", "contains", "regex"]
+DEFAULT_TEXT_MATCH_TYPE: TextMatchType = "similar"
+DEFAULT_SIMILARITY_THRESHOLD = 70
+
+
+class Text(Element):
+ """Locator for finding text elements by their content."""
+
+ @validate_call
+ def __init__(
+ self,
+ text: Annotated[
+ str | None,
+ Field(
+ description="""The text content of the ui element, e.g., 'Sign up'."""
+ ),
+ ] = None,
+ match_type: Annotated[
+ TextMatchType,
+ Field(
+ description="""The type of match to use. Defaults to 'similar'.
+ 'similar' uses a similarity threshold to determine if the text is a match.
+ 'exact' requires the text to be exactly the same.
+ 'contains' requires the text to contain the specified text.
+ 'regex' uses a regular expression to match the text."""
+ ),
+ ] = DEFAULT_TEXT_MATCH_TYPE,
+ similarity_threshold: Annotated[
+ int,
+ Field(
+ ge=0,
+ le=100,
+ description="""A threshold for how similar the text
+ needs to be to the text content of the ui element to be considered a match.
+ Takes values between 0 and 100 (higher is more similar). Defaults to 70.
+ Only used if match_type is 'similar'.""",
+ ),
+ ] = DEFAULT_SIMILARITY_THRESHOLD,
+ ) -> None:
+ """Initialize a Text locator.
+
+ Args:
+ text: The text content of the ui element, e.g., 'Sign up'
+ match_type: The type of match to use. Defaults to 'similar'. 'similar' uses a similarity threshold to
+ determine if the text is a match. 'exact' requires the text to be exactly the same. 'contains'
+ requires the text to contain the specified text. 'regex' uses a regular expression to match the text.
+ similarity_threshold: A threshold for how similar the text needs to be to the text content of the ui
+ element to be considered a match. Takes values between 0 and 100 (higher is more similar).
+ Defaults to 70. Only used if match_type is 'similar'.
+ """
+ super().__init__()
+ self._text = text
+ self._match_type = match_type
+ self._similarity_threshold = similarity_threshold
+
+ @property
+ def text(self) -> str | None:
+ return self._text
+
+ @property
+ def match_type(self) -> TextMatchType:
+ return self._match_type
+
+ @property
+ def similarity_threshold(self) -> int:
+ return self._similarity_threshold
+
+ def _str(self) -> str:
+ if self.text is None:
+ result = "text"
+ else:
+ result = "text "
+ match self.match_type:
+ case "similar":
+ result += f'similar to "{self.text}" (similarity >= {self.similarity_threshold}%)'
+ case "exact":
+ result += f'"{self.text}"'
+ case "contains":
+ result += f'containing text "{self.text}"'
+ case "regex":
+ result += f'matching regex "{self.text}"'
+ return result
+
+
+class ImageBase(Locator, ABC):
+ def __init__(
+ self,
+ threshold: float,
+ stop_threshold: float,
+ mask: list[tuple[float, float]] | None,
+ rotation_degree_per_step: int,
+ name: str,
+ image_compare_format: Literal["RGB", "grayscale", "edges"],
+ ) -> None:
+ super().__init__()
+ if threshold > stop_threshold:
+ raise ValueError(
+ f"threshold ({threshold}) must be less than or equal to stop_threshold ({stop_threshold})"
+ )
+ self._threshold = threshold
+ self._stop_threshold = stop_threshold
+ self._mask = mask
+ self._rotation_degree_per_step = rotation_degree_per_step
+ self._name = name
+ self._image_compare_format = image_compare_format
+
+ @property
+ def threshold(self) -> float:
+ return self._threshold
+
+ @property
+ def stop_threshold(self) -> float:
+ return self._stop_threshold
+
+ @property
+ def mask(self) -> list[tuple[float, float]] | None:
+ return self._mask
+
+ @property
+ def rotation_degree_per_step(self) -> int:
+ return self._rotation_degree_per_step
+
+ @property
+ def name(self) -> str:
+ return self._name
+
+ @property
+ def image_compare_format(self) -> Literal["RGB", "grayscale", "edges"]:
+ return self._image_compare_format
+
+ def _params_str(self) -> str:
+ return (
+ "("
+ + ", ".join([
+ f"threshold: {self.threshold}",
+ f"stop_threshold: {self.stop_threshold}",
+ f"rotation_degree_per_step: {self.rotation_degree_per_step}",
+ f"image_compare_format: {self.image_compare_format}",
+ f"mask: {self.mask}"
+ ])
+ + ")"
+ )
+
+ def _str(self) -> str:
+ return (
+ f'element "{self.name}" located by image '
+ + self._params_str()
+ )
+
+
+def _generate_name() -> str:
+ return f"anonymous image {uuid.uuid4()}"
+
+
+class Image(ImageBase):
+ """Locator for finding ui elements by an image."""
+
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
+ def __init__(
+ self,
+ image: Union[PILImage.Image, pathlib.Path, str],
+ threshold: Annotated[
+ float,
+ Field(
+ ge=0,
+ le=1,
+ description="""A threshold for how similar UI elements need to be to the image to be considered a match.
+ Takes values between 0.0 (= all elements are recognized) and 1.0 (= elements need to look exactly
+ like defined). Defaults to 0.5. Important: The threshold impacts the prediction quality.""",
+ ),
+ ] = 0.5,
+ stop_threshold: Annotated[
+ float | None,
+ Field(
+ ge=0,
+ le=1,
+ description="""A threshold for when to stop searching for UI elements similar to the image. As soon
+ as UI elements have been found that are at least as similar as the stop_threshold, the search stops. Should
+ be greater than or equal to threshold. Takes values between 0.0 and 1.0. Defaults to value of `threshold` if
+ not provided. Important: The stop_threshold impacts the prediction speed.""",
+ ),
+ ] = None,
+ mask: Annotated[
+ list[tuple[float, float]] | None,
+ Field(
+ min_length=3,
+ description="A polygon to match only a certain area of the image.",
+ ),
+ ] = None,
+ rotation_degree_per_step: Annotated[
+ int,
+ Field(
+ ge=0,
+ lt=360,
+ description="""A step size in rotation degree. Rotates the image by rotation_degree_per_step until
+ 360° is exceeded. Range is between 0° - 360°. Defaults to 0°. Important: This increases the prediction time
+ quite a bit. So only use it when absolutely necessary.""",
+ ),
+ ] = 0,
+ name: str | None = None,
+ image_compare_format: Annotated[
+ Literal["RGB", "grayscale", "edges"],
+ Field(
+ description="""A color compare style. Defaults to 'grayscale'.
+ Important: The image_compare_format impacts the prediction time as well as quality. As a rule of thumb,
+ 'edges' is likely to be faster than 'grayscale' and 'grayscale' is likely to be faster than 'RGB'. For
+ quality it is most often the other way around."""
+ ),
+ ] = "grayscale",
+ ) -> None:
+ """Initialize an Image locator.
+
+ Args:
+ image: The image to match against (PIL Image, path, or string)
+ threshold: A threshold for how similar UI elements need to be to the image to be considered a match.
+ Takes values between 0.0 (= all elements are recognized) and 1.0 (= elements need to look exactly
+ like defined). Defaults to 0.5. Important: The threshold impacts the prediction quality.
+ stop_threshold: A threshold for when to stop searching for UI elements similar to the image. As soon
+ as UI elements have been found that are at least as similar as the stop_threshold, the search stops.
+ Should be greater than or equal to threshold. Takes values between 0.0 and 1.0. Defaults to value of
+ `threshold` if not provided. Important: The stop_threshold impacts the prediction speed.
+ mask: A polygon to match only a certain area of the image. Must have at least 3 points.
+ rotation_degree_per_step: A step size in rotation degree. Rotates the image by rotation_degree_per_step
+ until 360° is exceeded. Range is between 0° - 360°. Defaults to 0°. Important: This increases the
+ prediction time quite a bit. So only use it when absolutely necessary.
+ name: Optional name for the image. Defaults to generated UUID.
+ image_compare_format: A color compare style. Defaults to 'grayscale'. Important: The image_compare_format
+ impacts the prediction time as well as quality. As a rule of thumb, 'edges' is likely to be faster
+ than 'grayscale' and 'grayscale' is likely to be faster than 'RGB'. For quality it is most often
+ the other way around.
+ """
+ super().__init__(
+ threshold=threshold,
+ stop_threshold=stop_threshold or threshold,
+ mask=mask,
+ rotation_degree_per_step=rotation_degree_per_step,
+ image_compare_format=image_compare_format,
+ name=_generate_name() if name is None else name,
+ ) # type: ignore
+ self._image = ImageSource(image)
+
+ @property
+ def image(self) -> ImageSource:
+ return self._image
+
+
+class AiElement(ImageBase):
+ """Locator for finding ui elements by an image and other kinds data saved on the disk."""
+
+ @validate_call(config=ConfigDict(arbitrary_types_allowed=True))
+ def __init__(
+ self,
+ name: str,
+ threshold: Annotated[
+ float,
+ Field(
+ ge=0,
+ le=1,
+ description="""A threshold for how similar UI elements need to be to be considered a match.
+ Takes values between 0.0 (= all elements are recognized) and 1.0 (= elements need to be an exact match).
+ Defaults to 0.5. Important: The threshold impacts the prediction quality.""",
+ ),
+ ] = 0.5,
+ stop_threshold: Annotated[
+ float | None,
+ Field(
+ ge=0,
+ le=1,
+ description="""A threshold for when to stop searching for UI elements. As soon
+ as UI elements have been found that are at least as similar as the stop_threshold, the search stops.
+ Should be greater than or equal to threshold. Takes values between 0.0 and 1.0.
+ Defaults to value of `threshold` if not provided.
+ Important: The stop_threshold impacts the prediction speed.""",
+ ),
+ ] = None,
+ mask: Annotated[
+ list[tuple[float, float]] | None,
+ Field(
+ min_length=3,
+ description="A polygon to match only a certain area of the image of the element saved on disk.",
+ ),
+ ] = None,
+ rotation_degree_per_step: Annotated[
+ int,
+ Field(
+ ge=0,
+ lt=360,
+ description="""A step size in rotation degree. Rotates the image of the element saved on disk by
+ rotation_degree_per_step until 360° is exceeded. Range is between 0° - 360°. Defaults to 0°.
+ Important: This increases the prediction time quite a bit. So only use it when absolutely necessary.""",
+ ),
+ ] = 0,
+ image_compare_format: Annotated[
+ Literal["RGB", "grayscale", "edges"],
+ Field(
+ description="""A color compare style. Defaults to 'grayscale'.
+ Important: The image_compare_format impacts the prediction time as well as quality. As a rule of thumb,
+ 'edges' is likely to be faster than 'grayscale' and 'grayscale' is likely to be faster than 'RGB'. For
+ quality it is most often the other way around."""
+ ),
+ ] = "grayscale",
+ ) -> None:
+ """Initialize an AiElement locator.
+
+ Args:
+ name: Name of the AI element
+ threshold: A threshold for how similar UI elements need to be to be considered a match. Takes values
+ between 0.0 (= all elements are recognized) and 1.0 (= elements need to be an exact match).
+ Defaults to 0.5. Important: The threshold impacts the prediction quality.
+ stop_threshold: A threshold for when to stop searching for UI elements. As soon as UI elements have
+ been found that are at least as similar as the stop_threshold, the search stops. Should be greater
+ than or equal to threshold. Takes values between 0.0 and 1.0. Defaults to value of `threshold` if not
+ provided. Important: The stop_threshold impacts the prediction speed.
+ mask: A polygon to match only a certain area of the image of the element saved on disk. Must have at
+ least 3 points.
+ rotation_degree_per_step: A step size in rotation degree. Rotates the image of the element saved on
+ disk by rotation_degree_per_step until 360° is exceeded. Range is between 0° - 360°. Defaults to 0°.
+ Important: This increases the prediction time quite a bit. So only use it when absolutely necessary.
+ image_compare_format: A color compare style. Defaults to 'grayscale'. Important: The image_compare_format
+ impacts the prediction time as well as quality. As a rule of thumb, 'edges' is likely to be faster
+ than 'grayscale' and 'grayscale' is likely to be faster than 'RGB'. For quality it is most often
+ the other way around.
+ """
+ super().__init__(
+ name=name,
+ threshold=threshold,
+ stop_threshold=stop_threshold or threshold,
+ mask=mask,
+ rotation_degree_per_step=rotation_degree_per_step,
+ image_compare_format=image_compare_format,
+ ) # type: ignore
+
+ def _str(self) -> str:
+ return (
+ f'ai element named "{self.name}" '
+ + self._params_str()
+ )
diff --git a/src/askui/locators/relatable.py b/src/askui/locators/relatable.py
new file mode 100644
index 00000000..1cb4df19
--- /dev/null
+++ b/src/askui/locators/relatable.py
@@ -0,0 +1,912 @@
+from abc import ABC
+from typing import Annotated, Literal
+from pydantic import BaseModel, ConfigDict, Field
+from typing_extensions import Self
+
+
+ReferencePoint = Literal["center", "boundary", "any"]
+
+
+RelationTypeMapping = {
+ "above_of": "above of",
+ "below_of": "below of",
+ "right_of": "right of",
+ "left_of": "left of",
+ "and": "and",
+ "or": "or",
+ "containing": "containing",
+ "inside_of": "inside of",
+ "nearest_to": "nearest to",
+}
+
+
+RelationIndex = Annotated[int, Field(ge=0)]
+
+
+class RelationBase(BaseModel):
+ model_config = ConfigDict(arbitrary_types_allowed=True)
+ other_locator: "Relatable"
+ type: Literal[
+ "above_of",
+ "below_of",
+ "right_of",
+ "left_of",
+ "and",
+ "or",
+ "containing",
+ "inside_of",
+ "nearest_to",
+ ]
+
+ def __str__(self):
+ return f"{RelationTypeMapping[self.type]} {self.other_locator._str_with_relation()}"
+
+
+class NeighborRelation(RelationBase):
+ type: Literal["above_of", "below_of", "right_of", "left_of"]
+ index: RelationIndex
+ reference_point: ReferencePoint
+
+ def __str__(self):
+ i = self.index + 1
+ if i == 11 or i == 12 or i == 13:
+ index_str = f"{i}th"
+ else:
+ index_str = (
+ f"{i}st"
+ if i % 10 == 1
+ else f"{i}nd" if i % 10 == 2 else f"{i}rd" if i % 10 == 3 else f"{i}th"
+ )
+ reference_point_str = (
+ " center of"
+ if self.reference_point == "center"
+ else " boundary of" if self.reference_point == "boundary" else ""
+ )
+ return f"{RelationTypeMapping[self.type]}{reference_point_str} the {index_str} {self.other_locator._str_with_relation()}"
+
+
+class LogicalRelation(RelationBase):
+ type: Literal["and", "or"]
+
+
+class BoundingRelation(RelationBase):
+ type: Literal["containing", "inside_of"]
+
+
+class NearestToRelation(RelationBase):
+ type: Literal["nearest_to"]
+
+
+Relation = NeighborRelation | LogicalRelation | BoundingRelation | NearestToRelation
+
+
+class CircularDependencyError(ValueError):
+ """Exception raised for circular dependencies in locator relations."""
+
+ def __init__(
+ self,
+ message: str = (
+ "Detected circular dependency in locator relations. "
+ "This occurs when locators reference each other in a way that creates an infinite loop "
+ "(e.g., A is above B and B is above A)."
+ ),
+ ) -> None:
+ super().__init__(message)
+
+
+class Relatable(ABC):
+ """Base class for locators that can be related to other locators, e.g., spatially, logically, distance based etc.
+
+ Attributes:
+ relations: List of relations to other locators
+ """
+
+ def __init__(self) -> None:
+ self._relations: list[Relation] = []
+
+ @property
+ def relations(self) -> list[Relation]:
+ return self._relations
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using NeighborRelation
+ def above_of(
+ self,
+ other_locator: "Relatable",
+ index: RelationIndex = 0,
+ reference_point: ReferencePoint = "boundary",
+ ) -> Self:
+ """Defines the element (located by *self*) to be **above** another element /
+ other elements (located by *other_locator*).
+
+ An element **A** is considered to be *above* another element / other elements **B**
+
+ - if most of **A** (or, more specifically, **A**'s bounding box) is *above* **B**
+ (or, more specifically, the **top border** of **B**'s bounding box) **and**
+ - if the **bottom border** of **A** (or, more specifically, **A**'s bounding box)
+ is *above* the **bottom border** of **B** (or, more specifically, **B**'s
+ bounding box).
+
+ Args:
+ other_locator:
+ Locator for an element / elements to relate to
+ index:
+ Index of the element (located by *self*) above the other element(s)
+ (located by *other_locator*), e.g., the first (index=0), second
+ (index=1), third (index=2) etc. element above the other element(s).
+ Elements' (relative) position is determined by the **bottom border**
+ (*y*-coordinate) of their bounding box.
+ We don't guarantee the order of elements with the same bottom border
+ (*y*-coordinate).
+ reference_point:
+ Defines which element (located by *self*) is considered to be above the
+ other element(s) (located by *other_locator*):
+
+ **"center"**: One point of the element (located by *self*) is above the
+ center (in a straight vertical line) of the other element(s) (located
+ by *other_locator*).
+ **"boundary"**: One point of the element (located by *self*) is above
+ any other point (in a straight vertical line) of the other element(s)
+ (located by *other_locator*).
+ **"any"**: No point of the element (located by *self*) has to be above
+ a point (in a straight vertical line) of the other element(s) (located
+ by *other_locator*).
+
+ *Default is **"boundary".***
+
+ Returns:
+ Self: The locator with the relation added
+
+ Examples:
+ ```text
+
+ ===========
+ | A |
+ ===========
+ ===========
+ | B |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element above ("center" of)
+ # text "B"
+ text = loc.Text().above_of(loc.Text("B"), reference_point="center")
+ ```
+
+ ```text
+
+ ===========
+ | A |
+ ===========
+ ===========
+ | B |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element above
+ # ("boundary" of / any point of) text "B"
+ # (reference point "center" won't work here)
+ text = loc.Text().above_of(loc.Text("B"), reference_point="boundary")
+ ```
+
+ ```text
+
+ ===========
+ | A |
+ ===========
+ ===========
+ | B |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element above text "B"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().above_of(loc.Text("B"), reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | A |
+ ===========
+ ===========
+ | B |
+ ===========
+ ===========
+ | C |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element above text "C"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().above_of(loc.Text("C"), index=1, reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | A |
+ ===========
+ ===========
+ =========== | B |
+ | | ===========
+ | C |
+ | |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element above text "C"
+ # (reference point "any")
+ text = loc.Text().above_of(loc.Text("C"), index=1, reference_point="any")
+ # locates also text "A" as it is the first (index 0) element above text "C"
+ # with reference point "boundary"
+ text = loc.Text().above_of(loc.Text("C"), index=0, reference_point="boundary")
+ ```
+ """
+ self._relations.append(
+ NeighborRelation(
+ type="above_of",
+ other_locator=other_locator,
+ index=index,
+ reference_point=reference_point,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using NeighborRelation
+ def below_of(
+ self,
+ other_locator: "Relatable",
+ index: RelationIndex = 0,
+ reference_point: ReferencePoint = "boundary",
+ ) -> Self:
+ """Defines the element (located by *self*) to be **below** another element /
+ other elements (located by *other_locator*).
+
+ An element **A** is considered to be *below* another element / other elements **B**
+
+ - if most of **A** (or, more specifically, **A**'s bounding box) is *below* **B**
+ (or, more specifically, the **bottom border** of **B**'s bounding box) **and**
+ - if the **top border** of **A** (or, more specifically, **A**'s bounding box) is
+ *below* the **top border** of **B** (or, more specifically, **B**'s bounding
+ box).
+
+ Args:
+ other_locator:
+ Locator for an element / elements to relate to.
+ index:
+ Index of the element (located by *self*) **below** the other
+ element(s) (located by *other_locator*), e.g., the first (*index=0*),
+ second (*index=1*), third (*index=2*) etc. element below the other
+ element(s). Elements' (relative) position is determined by the **top
+ border** (*y*-coordinate) of their bounding box.
+ We don't guarantee the order of elements with the same top border
+ (*y*-coordinate).
+ reference_point:
+ Defines which element (located by *self*) is considered to be
+ *below* the other element(s) (located by *other_locator*):
+
+ **"center"**: One point of the element (located by *self*) is
+ **below** the *center* (in a straight vertical line) of the other
+ element(s) (located by *other_locator*).
+ **"boundary"**: One point of the element (located by *self*) is
+ **below** *any* other point (in a straight vertical line) of the
+ other element(s) (located by *other_locator*).
+ **"any"**: No point of the element (located by *self*) has to
+ be **below** a point (in a straight vertical line) of the other
+ element(s) (located by *other_locator*).
+
+ *Default is **"boundary".***
+
+ Returns:
+ Self: The locator with the relation added.
+
+ Examples:
+ ```text
+
+ ===========
+ | B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element below ("center" of)
+ # text "B"
+ text = loc.Text().below_of(loc.Text("B"), reference_point="center")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element below
+ # ("boundary" of / any point of) text "B"
+ # (reference point "center" won't work here)
+ text = loc.Text().below_of(loc.Text("B"), reference_point="boundary")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element below text "B"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().below_of(loc.Text("B"), reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | C |
+ ===========
+ ===========
+ | B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element below text "C"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().below_of(loc.Text("C"), index=1, reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | |
+ | C |
+ | |===========
+ ===========| B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element below text "C"
+ # (reference point "any")
+ text = loc.Text().below_of(loc.Text("C"), index=1, reference_point="any")
+ # locates also text "A" as it is the first (index 0) element below text "C"
+ # with reference point "boundary"
+ text = loc.Text().below_of(loc.Text("C"), index=0, reference_point="boundary")
+ ```
+ """
+ self._relations.append(
+ NeighborRelation(
+ type="below_of",
+ other_locator=other_locator,
+ index=index,
+ reference_point=reference_point,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using NeighborRelation
+ def right_of(
+ self,
+ other_locator: "Relatable",
+ index: RelationIndex = 0,
+ reference_point: ReferencePoint = "center",
+ ) -> Self:
+ """Defines the element (located by *self*) to be **right of** another element /
+ other elements (located by *other_locator*).
+
+ An element **A** is considered to be *right of* another element / other elements **B**
+
+ - if most of **A** (or, more specifically, **A**'s bounding box) is *right of* **B**
+ (or, more specifically, the **right border** of **B**'s bounding box) **and**
+ - if the **left border** of **A** (or, more specifically, **A**'s bounding box) is
+ *right of* the **left border** of **B** (or, more specifically, **B**'s
+ bounding box).
+
+ Args:
+ other_locator:
+ Locator for an element / elements to relate to.
+ index:
+ Index of the element (located by *self*) **right of** the other
+ element(s) (located by *other_locator*), e.g., the first (*index=0*),
+ second (*index=1*), third (*index=2*) etc. element right of the other
+ element(s). Elements' (relative) position is determined by the **left
+ border** (*x*-coordinate) of their bounding box.
+ We don't guarantee the order of elements with the same left border
+ (*x*-coordinate).
+ reference_point:
+ Defines which element (located by *self*) is considered to be
+ *right of* the other element(s) (located by *other_locator*):
+
+ **"center"**: One point of the element (located by *self*) is
+ **right of** the *center* (in a straight horizontal line) of the
+ other element(s) (located by *other_locator*).
+ **"boundary"**: One point of the element (located by *self*) is
+ **right of** *any* other point (in a straight horizontal line) of
+ the other element(s) (located by *other_locator*).
+ **"any"**: No point of the element (located by *self*) has to
+ be **right of** a point (in a straight horizontal line) of the
+ other element(s) (located by *other_locator*).
+
+ *Default is **"center".***
+
+ Returns:
+ Self: The locator with the relation added.
+
+ Examples:
+ ```text
+
+ =========== ===========
+ | B | | A |
+ =========== ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element right of ("center"
+ # of) text "B"
+ text = loc.Text().right_of(loc.Text("B"), reference_point="center")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ =========== ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element right of
+ # ("boundary" of / any point of) text "B"
+ # (reference point "center" won't work here)
+ text = loc.Text().right_of(loc.Text("B"), reference_point="boundary")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element right of text "B"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().right_of(loc.Text("B"), reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | A |
+ ===========
+ =========== ===========
+ | C | | B |
+ =========== ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element right of text "C"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().right_of(loc.Text("C"), index=1, reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ =========== ===========
+ =========== | A |
+ | C | ===========
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element right of text "C"
+ # (reference point "any")
+ text = loc.Text().right_of(loc.Text("C"), index=1, reference_point="any")
+ # locates also text "A" as it is the first (index 0) element right of text
+ # "C" with reference point "boundary"
+ text = loc.Text().right_of(loc.Text("C"), index=0, reference_point="boundary")
+ ```
+ """
+ self._relations.append(
+ NeighborRelation(
+ type="right_of",
+ other_locator=other_locator,
+ index=index,
+ reference_point=reference_point,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using NeighborRelation
+ def left_of(
+ self,
+ other_locator: "Relatable",
+ index: RelationIndex = 0,
+ reference_point: ReferencePoint = "center",
+ ) -> Self:
+ """Defines the element (located by *self*) to be **left of** another element /
+ other elements (located by *other_locator*).
+
+ An element **A** is considered to be *left of* another element / other elements **B**
+
+ - if most of **A** (or, more specifically, **A**'s bounding box) is *left of* **B**
+ (or, more specifically, the **left border** of **B**'s bounding box) **and**
+ - if the **right border** of **A** (or, more specifically, **A**'s bounding box) is
+ *left of* the **right border** of **B** (or, more specifically, **B**'s
+ bounding box).
+
+ Args:
+ other_locator:
+ Locator for an element / elements to relate to.
+ index:
+ Index of the element (located by *self*) **left of** the other
+ element(s) (located by *other_locator*), e.g., the first (*index=0*),
+ second (*index=1*), third (*index=2*) etc. element left of the other
+ element(s). Elements' (relative) position is determined by the **right
+ border** (*x*-coordinate) of their bounding box.
+ We don't guarantee the order of elements with the same right border
+ (*x*-coordinate).
+ reference_point:
+ Defines which element (located by *self*) is considered to be
+ *left of* the other element(s) (located by *other_locator*):
+
+ **"center"** : One point of the element (located by *self*) is
+ **left of** the *center* (in a straight horizontal line) of the
+ other element(s) (located by *other_locator*).
+ **"boundary"**: One point of the element (located by *self*) is
+ **left of** *any* other point (in a straight horizontal line) of
+ the other element(s) (located by *other_locator*).
+ **"any"** : No point of the element (located by *self*) has to
+ be **left of** a point (in a straight horizontal line) of the
+ other element(s) (located by *other_locator*).
+
+ *Default is **"center".***
+
+ Returns:
+ Self: The locator with the relation added.
+
+ Examples:
+ ```text
+
+ =========== ===========
+ | A | | B |
+ =========== ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element left of ("center"
+ # of) text "B"
+ text = loc.Text().left_of(loc.Text("B"), reference_point="center")
+ ```
+
+ ```text
+
+ ===========
+ =========== | B |
+ | A | ===========
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element left of ("boundary"
+ # of / any point of) text "B"
+ # (reference point "center" won't work here)
+ text = loc.Text().left_of(loc.Text("B"), reference_point="boundary")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ ===========
+ ===========
+ | A |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the first (index 0) element left of text "B"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().left_of(loc.Text("B"), reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | A |
+ ===========
+ =========== ===========
+ | B | | C |
+ =========== ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element left of text "C"
+ # (reference point "center" or "boundary" won't work here)
+ text = loc.Text().left_of(loc.Text("C"), index=1, reference_point="any")
+ ```
+
+ ```text
+
+ ===========
+ | B |
+ =========== ===========
+ | A | ===========
+ =========== | C |
+ ===========
+ ```
+ ```python
+ from askui import locators as loc
+ # locates text "A" as it is the second (index 1) element left of text "C"
+ # (reference point "any")
+ text = loc.Text().left_of(loc.Text("C"), index=1, reference_point="any")
+ # locates also text "A" as it is the first (index 0) element left of text
+ # "C" with reference point "boundary"
+ text = loc.Text().left_of(loc.Text("C"), index=0, reference_point="boundary")
+ ```
+ """
+ self._relations.append(
+ NeighborRelation(
+ type="left_of",
+ other_locator=other_locator,
+ index=index,
+ reference_point=reference_point,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using BoundingRelation
+ def containing(self, other_locator: "Relatable") -> Self:
+ """Defines the element (located by *self*) to contain another element (located
+ by *other_locator*).
+
+ Args:
+ other_locator: The locator to check if it's contained
+
+ Returns:
+ Self: The locator with the relation added
+
+ Examples:
+ ```text
+ ---------------------------
+ | textfield |
+ | --------------------- |
+ | | placeholder text | |
+ | --------------------- |
+ | |
+ ---------------------------
+ ```
+ ```python
+ from askui import locators as loc
+
+ # Returns the textfield because it contains the placeholder text
+ textfield = loc.Element("textfield").containing(loc.Text("placeholder"))
+ ```
+ """
+ self._relations.append(
+ BoundingRelation(
+ type="containing",
+ other_locator=other_locator,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using BoundingRelation
+ def inside_of(self, other_locator: "Relatable") -> Self:
+ """Defines the element (located by *self*) to be inside of another element
+ (located by *other_locator*).
+
+ Args:
+ other_locator: The locator to check if it contains this element
+
+ Returns:
+ Self: The locator with the relation added
+
+ Examples:
+ ```text
+ ---------------------------
+ | textfield |
+ | --------------------- |
+ | | placeholder text | |
+ | --------------------- |
+ | |
+ ---------------------------
+ ```
+ ```python
+ from askui import locators as loc
+
+ # Returns the placeholder text of the textfield
+ placeholder_text = loc.Text("placeholder").inside_of(
+ loc.Element("textfield")
+ )
+ ```
+ """
+ self._relations.append(
+ BoundingRelation(
+ type="inside_of",
+ other_locator=other_locator,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using NearestToRelation
+ def nearest_to(self, other_locator: "Relatable") -> Self:
+ """Defines the element (located by *self*) to be the nearest to another element
+ (located by *other_locator*).
+
+ Args:
+ other_locator: The locator to compare distance against
+
+ Returns:
+ Self: The locator with the relation added
+
+ Examples:
+ ```text
+ --------------
+ | text |
+ --------------
+ ---------------
+ | textfield 1 |
+ ---------------
+
+
+
+
+ ---------------
+ | textfield 2 |
+ ---------------
+ ```
+ ```python
+ from askui import locators as loc
+
+ # Returns textfield 1 because it is nearer to the text than textfield 2
+ textfield = loc.Element("textfield").nearest_to(loc.Text())
+ ```
+ """
+ self._relations.append(
+ NearestToRelation(
+ type="nearest_to",
+ other_locator=other_locator,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using LogicalRelation
+ def and_(self, other_locator: "Relatable") -> Self:
+ """Logical and operator to combine multiple locators, e.g., to require an
+ element to match multiple locators.
+
+ Args:
+ other_locator: The locator to combine with
+
+ Returns:
+ Self: The locator with the relation added
+
+ Examples:
+ ```python
+ from askui import locators as loc
+
+ # Searches for an element that contains the text "Google" and is a
+ # multi-colored Google logo (instead of, e.g., simply some text that says
+ # "Google")
+ icon_user = loc.Element().containing(
+ loc.Text("Google").and_(loc.Description("Multi-colored Google logo"))
+ )
+ ```
+ """
+ self._relations.append(
+ LogicalRelation(
+ type="and",
+ other_locator=other_locator,
+ )
+ )
+ return self
+
+ # cannot be validated by pydantic using @validate_call because of the recursive nature of the relations --> validate using LogicalRelation
+ def or_(self, other_locator: "Relatable") -> Self:
+ """Logical or operator to combine multiple locators, e.g., to provide a fallback
+ if no element is found for one of the locators.
+
+ Args:
+ other_locator: The locator to combine with
+
+ Returns:
+ Self: The locator with the relation added
+
+ Examples:
+ ```python
+ from askui import locators as loc
+
+ # Searches for element using a description and if the element cannot be
+ # found, searches for it using an image
+ search_icon = loc.Description("search icon").or_(
+ loc.Image("search_icon.png")
+ )
+ ```
+ """
+ self._relations.append(
+ LogicalRelation(
+ type="or",
+ other_locator=other_locator,
+ )
+ )
+ return self
+
+ def _str(self) -> str:
+ return "relatable"
+
+ def _relations_str(self) -> str:
+ if not self._relations:
+ return ""
+
+ result = []
+ for i, relation in enumerate(self._relations):
+ [other_locator_str, *nested_relation_strs] = str(relation).split("\n")
+ result.append(f" {i + 1}. {other_locator_str}")
+ for nested_relation_str in nested_relation_strs:
+ result.append(f" {nested_relation_str}")
+ return "\n" + "\n".join(result)
+
+ def _str_with_relation(self) -> str:
+ return self._str() + self._relations_str()
+
+ def raise_if_cycle(self) -> None:
+ """Raises CircularDependencyError if the relations form a cycle (see [Cycle (graph theory)](https://en.wikipedia.org/wiki/Cycle_(graph_theory)))."""
+ if self._has_cycle():
+ raise CircularDependencyError()
+
+ def _has_cycle(self) -> bool:
+ """Check if the relations form a cycle (see [Cycle (graph theory)](https://en.wikipedia.org/wiki/Cycle_(graph_theory)))."""
+ visited_ids: set[int] = set()
+ recursion_stack_ids: set[int] = set()
+
+ def _dfs(node: Relatable) -> bool:
+ node_id = id(node)
+ if node_id in recursion_stack_ids:
+ return True
+ if node_id in visited_ids:
+ return False
+
+ visited_ids.add(node_id)
+ recursion_stack_ids.add(node_id)
+
+ for relation in node.relations:
+ if _dfs(relation.other_locator):
+ return True
+
+ recursion_stack_ids.remove(node_id)
+ return False
+
+ return _dfs(self)
+
+ def __str__(self) -> str:
+ self.raise_if_cycle()
+ return self._str_with_relation()
diff --git a/src/askui/locators/serializers.py b/src/askui/locators/serializers.py
new file mode 100644
index 00000000..35e1f180
--- /dev/null
+++ b/src/askui/locators/serializers.py
@@ -0,0 +1,257 @@
+from typing_extensions import NotRequired, TypedDict
+
+from askui.reporting import Reporter
+from askui.utils.image_utils import ImageSource
+from askui.models.askui.ai_element_utils import AiElementCollection
+from .locators import (
+ DEFAULT_SIMILARITY_THRESHOLD,
+ DEFAULT_TEXT_MATCH_TYPE,
+ ImageBase,
+ AiElement as AiElementLocator,
+ Element,
+ Prompt,
+ Image,
+ Text,
+)
+from .relatable import (
+ BoundingRelation,
+ LogicalRelation,
+ NearestToRelation,
+ NeighborRelation,
+ ReferencePoint,
+ Relatable,
+ Relation,
+)
+
+
+class VlmLocatorSerializer:
+ def serialize(self, locator: Relatable) -> str:
+ locator.raise_if_cycle()
+ if len(locator.relations) > 0:
+ raise NotImplementedError(
+ "Serializing locators with relations is not yet supported for VLMs"
+ )
+
+ if isinstance(locator, Text):
+ return self._serialize_text(locator)
+ elif isinstance(locator, Element):
+ return self._serialize_class(locator)
+ elif isinstance(locator, Prompt):
+ return self._serialize_prompt(locator)
+ elif isinstance(locator, Image):
+ raise NotImplementedError(
+ "Serializing image locators is not yet supported for VLMs"
+ )
+ elif isinstance(locator, AiElementLocator):
+ raise NotImplementedError(
+ "Serializing AI element locators is not yet supported for VLMs"
+ )
+ else:
+ raise ValueError(f"Unsupported locator type: {type(locator)}")
+
+ def _serialize_class(self, class_: Element) -> str:
+ if class_.class_name:
+ return f"an arbitrary {class_.class_name} shown"
+ else:
+ return "an arbitrary ui element (e.g., text, button, textfield, etc.)"
+
+ def _serialize_prompt(self, prompt: Prompt) -> str:
+ return prompt.prompt
+
+ def _serialize_text(self, text: Text) -> str:
+ if text.match_type == "similar":
+ return f'text similar to "{text.text}"'
+
+ return str(text)
+
+
+class CustomElement(TypedDict):
+ threshold: NotRequired[float]
+ stopThreshold: NotRequired[float]
+ customImage: str
+ mask: NotRequired[list[tuple[float, float]]]
+ rotationDegreePerStep: NotRequired[int]
+ imageCompareFormat: NotRequired[str]
+ name: NotRequired[str]
+
+
+class AskUiSerializedLocator(TypedDict):
+ instruction: str
+ customElements: list[CustomElement]
+
+
+class AskUiLocatorSerializer:
+ _TEXT_DELIMITER = "<|string|>"
+ _RP_TO_INTERSECTION_AREA_MAPPING: dict[ReferencePoint, str] = {
+ "center": "element_center_line",
+ "boundary": "element_edge_area",
+ "any": "display_edge_area",
+ }
+ _RELATION_TYPE_MAPPING: dict[str, str] = {
+ "above_of": "above",
+ "below_of": "below",
+ "right_of": "right of",
+ "left_of": "left of",
+ "containing": "contains",
+ "inside_of": "in",
+ "nearest_to": "nearest to",
+ "and": "and",
+ "or": "or",
+ }
+
+ def __init__(self, ai_element_collection: AiElementCollection, reporter: Reporter):
+ self._ai_element_collection = ai_element_collection
+ self._reporter = reporter
+
+ def serialize(self, locator: Relatable) -> AskUiSerializedLocator:
+ locator.raise_if_cycle()
+ if len(locator.relations) > 1:
+ # If we lift this constraint, we also have to make sure that custom element references are still working + we need, e.g., some symbol or a structured format to indicate precedence
+ raise NotImplementedError(
+ "Serializing locators with multiple relations is not yet supported by AskUI"
+ )
+
+ result = AskUiSerializedLocator(instruction="", customElements=[])
+ if isinstance(locator, Text):
+ result["instruction"] = self._serialize_text(locator)
+ elif isinstance(locator, Element):
+ result["instruction"] = self._serialize_class(locator)
+ elif isinstance(locator, Prompt):
+ result["instruction"] = self._serialize_prompt(locator)
+ elif isinstance(locator, Image):
+ result = self._serialize_image(locator)
+ elif isinstance(locator, AiElementLocator):
+ result = self._serialize_ai_element(locator)
+ else:
+ raise ValueError(f'Unsupported locator type: "{type(locator)}"')
+
+ if len(locator.relations) == 0:
+ return result
+
+ serialized_relation = self._serialize_relation(locator.relations[0])
+ result["instruction"] += f" {serialized_relation['instruction']}"
+ result["customElements"] += serialized_relation["customElements"]
+ return result
+
+ def _serialize_class(self, class_: Element) -> str:
+ return class_.class_name or "element"
+
+ def _serialize_prompt(self, prompt: Prompt) -> str:
+ return f"pta {self._TEXT_DELIMITER}{prompt.prompt}{self._TEXT_DELIMITER}"
+
+ def _serialize_text(self, text: Text) -> str:
+ match text.match_type:
+ case "similar":
+ if (
+ text.similarity_threshold == DEFAULT_SIMILARITY_THRESHOLD
+ and text.match_type == DEFAULT_TEXT_MATCH_TYPE
+ ):
+ # Necessary so that we can use wordlevel ocr for these texts
+ return (
+ f"text {self._TEXT_DELIMITER}{text.text}{self._TEXT_DELIMITER}"
+ )
+ return f"text with text {self._TEXT_DELIMITER}{text.text}{self._TEXT_DELIMITER} that matches to {text.similarity_threshold} %"
+ case "exact":
+ return f"text equals text {self._TEXT_DELIMITER}{text.text}{self._TEXT_DELIMITER}"
+ case "contains":
+ return f"text contain text {self._TEXT_DELIMITER}{text.text}{self._TEXT_DELIMITER}"
+ case "regex":
+ return f"text match regex pattern {self._TEXT_DELIMITER}{text.text}{self._TEXT_DELIMITER}"
+ case _:
+ raise ValueError(f'Unsupported text match type: "{text.match_type}"')
+
+ def _serialize_relation(self, relation: Relation) -> AskUiSerializedLocator:
+ match relation.type:
+ case "above_of" | "below_of" | "right_of" | "left_of":
+ assert isinstance(relation, NeighborRelation)
+ return self._serialize_neighbor_relation(relation)
+ case "containing" | "inside_of" | "nearest_to" | "and" | "or":
+ assert isinstance(
+ relation, LogicalRelation | BoundingRelation | NearestToRelation
+ )
+ return self._serialize_non_neighbor_relation(relation)
+ case _:
+ raise ValueError(f'Unsupported relation type: "{relation.type}"')
+
+ def _serialize_neighbor_relation(
+ self, relation: NeighborRelation
+ ) -> AskUiSerializedLocator:
+ serialized_other_locator = self.serialize(relation.other_locator)
+ return AskUiSerializedLocator(
+ instruction=f"index {relation.index} {self._RELATION_TYPE_MAPPING[relation.type]} intersection_area {self._RP_TO_INTERSECTION_AREA_MAPPING[relation.reference_point]} {serialized_other_locator['instruction']}",
+ customElements=serialized_other_locator["customElements"],
+ )
+
+ def _serialize_non_neighbor_relation(
+ self, relation: LogicalRelation | BoundingRelation | NearestToRelation
+ ) -> AskUiSerializedLocator:
+ serialized_other_locator = self.serialize(relation.other_locator)
+ return AskUiSerializedLocator(
+ instruction=f"{self._RELATION_TYPE_MAPPING[relation.type]} {serialized_other_locator['instruction']}",
+ customElements=serialized_other_locator["customElements"],
+ )
+
+ def _serialize_image_to_custom_element(
+ self,
+ image_locator: ImageBase,
+ image_source: ImageSource,
+ ) -> CustomElement:
+ custom_element: CustomElement = CustomElement(
+ customImage=image_source.to_data_url(),
+ threshold=image_locator.threshold,
+ stopThreshold=image_locator.stop_threshold,
+ rotationDegreePerStep=image_locator.rotation_degree_per_step,
+ imageCompareFormat=image_locator.image_compare_format,
+ name=image_locator.name,
+ )
+ if image_locator.mask:
+ custom_element["mask"] = image_locator.mask
+ return custom_element
+
+ def _serialize_image_base(
+ self,
+ image_locator: ImageBase,
+ image_sources: list[ImageSource],
+ ) -> AskUiSerializedLocator:
+ custom_elements: list[CustomElement] = [
+ self._serialize_image_to_custom_element(
+ image_locator=image_locator,
+ image_source=image_source,
+ )
+ for image_source in image_sources
+ ]
+ return AskUiSerializedLocator(
+ instruction=f"custom element with text {self._TEXT_DELIMITER}{image_locator.name}{self._TEXT_DELIMITER}",
+ customElements=custom_elements,
+ )
+
+ def _serialize_image(
+ self,
+ image: Image,
+ ) -> AskUiSerializedLocator:
+ self._reporter.add_message(
+ "AskUiLocatorSerializer",
+ f"Image locator: {image}",
+ image=image.image.root,
+ )
+ return self._serialize_image_base(
+ image_locator=image,
+ image_sources=[image.image],
+ )
+
+ def _serialize_ai_element(
+ self, ai_element_locator: AiElementLocator
+ ) -> AskUiSerializedLocator:
+ ai_elements = self._ai_element_collection.find(ai_element_locator.name)
+ self._reporter.add_message(
+ "AskUiLocatorSerializer",
+ f"Found {len(ai_elements)} ai elements named {ai_element_locator.name}",
+ image=[ai_element.image for ai_element in ai_elements],
+ )
+ return self._serialize_image_base(
+ image_locator=ai_element_locator,
+ image_sources=[
+ ImageSource.model_construct(root=ai_element.image)
+ for ai_element in ai_elements
+ ],
+ )
diff --git a/src/askui/logger.py b/src/askui/logger.py
index e6da1743..2038ecf9 100644
--- a/src/askui/logger.py
+++ b/src/askui/logger.py
@@ -11,7 +11,7 @@
logger.setLevel(logging.INFO)
-def configure_logging(level=logging.INFO):
+def configure_logging(level: str | int = logging.INFO):
logger.setLevel(level)
diff --git a/src/askui/models/__init__.py b/src/askui/models/__init__.py
new file mode 100644
index 00000000..efc2755c
--- /dev/null
+++ b/src/askui/models/__init__.py
@@ -0,0 +1,7 @@
+from .models import ModelName, ModelComposition, ModelDefinition
+
+__all__ = [
+ "ModelName",
+ "ModelComposition",
+ "ModelDefinition",
+]
diff --git a/src/askui/models/anthropic/claude.py b/src/askui/models/anthropic/claude.py
index ce5813be..4d54f8e8 100644
--- a/src/askui/models/anthropic/claude.py
+++ b/src/askui/models/anthropic/claude.py
@@ -2,24 +2,25 @@
import anthropic
from PIL import Image
+from askui.utils.image_utils import ImageSource, image_to_base64, scale_coordinates_back, scale_image_with_padding
+
from ...logger import logger
-from ...utils import AutomationError
-from ..utils import scale_image_with_padding, scale_coordinates_back, extract_click_coordinates, image_to_base64
+from ...exceptions import ElementNotFoundError
+from .utils import extract_click_coordinates
class ClaudeHandler:
- def __init__(self, log_level):
- self.model_name = "claude-3-5-sonnet-20241022"
+ def __init__(self):
+ self.model = "claude-3-5-sonnet-20241022"
self.client = anthropic.Anthropic()
self.resolution = (1280, 800)
- self.log_level = log_level
self.authenticated = True
if os.getenv("ANTHROPIC_API_KEY") is None:
self.authenticated = False
- def inference(self, base64_image, prompt, system_prompt) -> list[anthropic.types.ContentBlock]:
+ def _inference(self, base64_image: str, prompt: str, system_prompt: str) -> list[anthropic.types.ContentBlock]:
message = self.client.messages.create(
- model=self.model_name,
+ model=self.model,
max_tokens=1000,
temperature=0,
system=system_prompt,
@@ -32,7 +33,7 @@ def inference(self, base64_image, prompt, system_prompt) -> list[anthropic.types
"source": {
"type": "base64",
"media_type": "image/png",
- "data": base64_image
+ "data": base64_image,
}
},
{
@@ -50,19 +51,27 @@ def locate_inference(self, image: Image.Image, locator: str) -> tuple[int, int]:
screen_width, screen_height = self.resolution[0], self.resolution[1]
system_prompt = f"Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try taking another screenshot.\n* The screen's resolution is {screen_width}x{screen_height}.\n* The display number is 0\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\n"
scaled_image = scale_image_with_padding(image, screen_width, screen_height)
- response = self.inference(image_to_base64(scaled_image), prompt, system_prompt)
+ response = self._inference(image_to_base64(scaled_image), prompt, system_prompt)
response = response[0].text
logger.debug("ClaudeHandler received locator: %s", response)
try:
scaled_x, scaled_y = extract_click_coordinates(response)
except Exception as e:
- raise AutomationError(f"Couldn't locate '{locator}' on the screen.")
+ raise ElementNotFoundError(f"Element not found: {locator}")
x, y = scale_coordinates_back(scaled_x, scaled_y, image.width, image.height, screen_width, screen_height)
return int(x), int(y)
- def get_inference(self, image: Image.Image, instruction: str) -> str:
- scaled_image = scale_image_with_padding(image, self.resolution[0], self.resolution[1])
+ def get_inference(self, image: ImageSource, query: str) -> str:
+ scaled_image = scale_image_with_padding(
+ image=image.root,
+ max_width=self.resolution[0],
+ max_height=self.resolution[1],
+ )
system_prompt = "You are an agent to process screenshots and answer questions about things on the screen or extract information from it. Answer only with the response to the question and keep it short and precise."
- response = self.inference(image_to_base64(scaled_image), instruction, system_prompt)
+ response = self._inference(
+ base64_image=image_to_base64(scaled_image),
+ prompt=query,
+ system_prompt=system_prompt
+ )
response = response[0].text
return response
diff --git a/src/askui/models/anthropic/claude_agent.py b/src/askui/models/anthropic/claude_agent.py
index 54bc7922..05599433 100644
--- a/src/askui/models/anthropic/claude_agent.py
+++ b/src/askui/models/anthropic/claude_agent.py
@@ -20,10 +20,12 @@
BetaToolUseBlockParam,
)
+from askui.tools.agent_os import AgentOs
+
from ...tools.anthropic import ComputerTool, ToolCollection, ToolResult
from ...logger import logger
-from ...utils import truncate_long_strings
-from askui.reporting.report import SimpleReportGenerator
+from ...utils.str_utils import truncate_long_strings
+from askui.reporting import Reporter
COMPUTER_USE_BETA_FLAG = "computer-use-2024-10-22"
@@ -60,10 +62,10 @@
class ClaudeComputerAgent:
- def __init__(self, controller_client, report: SimpleReportGenerator | None = None) -> None:
- self.report = report
+ def __init__(self, agent_os: AgentOs, reporter: Reporter) -> None:
+ self._reporter = reporter
self.tool_collection = ToolCollection(
- ComputerTool(controller_client),
+ ComputerTool(agent_os),
)
self.system = BetaTextBlockParam(
type="text",
@@ -109,8 +111,8 @@ def step(self, messages: list):
}
logger.debug(new_message)
messages.append(new_message)
- if self.report is not None:
- self.report.add_message("Anthropic Computer Use", response_params)
+ if self._reporter is not None:
+ self._reporter.add_message("Anthropic Computer Use", response_params)
tool_result_content: list[BetaToolResultBlockParam] = []
for content_block in response_params:
diff --git a/src/askui/models/anthropic/utils.py b/src/askui/models/anthropic/utils.py
new file mode 100644
index 00000000..7b27a065
--- /dev/null
+++ b/src/askui/models/anthropic/utils.py
@@ -0,0 +1,8 @@
+import re
+
+
+def extract_click_coordinates(text: str):
+ pattern = r'(\d+),\s*(\d+)'
+ matches = re.findall(pattern, text)
+ x, y = matches[-1]
+ return int(x), int(y)
diff --git a/src/askui/models/askui/ai_element_utils.py b/src/askui/models/askui/ai_element_utils.py
index bda8fb73..c8f3ad4b 100644
--- a/src/askui/models/askui/ai_element_utils.py
+++ b/src/askui/models/askui/ai_element_utils.py
@@ -61,48 +61,49 @@ def from_json_file(cls, json_file_path: pathlib.Path) -> "AiElement":
image = Image.open(image_path))
-class AiElementNotFound(Exception):
- pass
+class AiElementNotFound(ValueError):
+ def __init__(self, name: str, locations: list[pathlib.Path]):
+ self.name = name
+ self.locations = locations
+ locations_str = ", ".join([str(location) for location in locations])
+ super().__init__(
+ f'AI element "{name}" not found in {locations_str}\n'
+ 'Solutions:\n'
+ '1. Verify the element exists in these locations and try again if you are sure it is present\n'
+ '2. Add location to ASKUI_AI_ELEMENT_LOCATIONS env var (paths, comma separated)\n'
+ '3. Create new AI element (see https://docs.askui.com/02-api-reference/02-askui-suite/02-askui-suite/AskUIRemoteDeviceSnippingTool/Public/AskUI-NewAIElement)'
+ )
class AiElementCollection:
-
def __init__(self, additional_ai_element_locations: Optional[List[pathlib.Path]] = None):
+ additional_ai_element_locations = additional_ai_element_locations or []
+
workspace_id = os.getenv("ASKUI_WORKSPACE_ID")
if workspace_id is None:
raise ValueError("ASKUI_WORKSPACE_ID is not set")
- if additional_ai_element_locations is None:
- additional_ai_element_locations = []
-
- addional_ai_element_from_env = []
- if os.getenv("ASKUI_AI_ELEMENT_LOCATIONS", "") != "":
- addional_ai_element_from_env = [pathlib.Path(ai_element_loc) for ai_element_loc in os.getenv("ASKUI_AI_ELEMENT_LOCATIONS", "").split(",")],
+ locations_from_env: list[pathlib.Path] = []
+ if locations_env := os.getenv("ASKUI_AI_ELEMENT_LOCATIONS"):
+ locations_from_env = [pathlib.Path(loc) for loc in locations_env.split(",")]
- self.ai_element_locations = [
+ self._ai_element_locations = [
pathlib.Path.home() / ".askui" / "SnippingTool" / "AIElement" / workspace_id,
- *addional_ai_element_from_env,
+ *locations_from_env,
*additional_ai_element_locations
]
- logger.debug("AI Element locations: %s", self.ai_element_locations)
+ logger.debug("AI Element locations: %s", self._ai_element_locations)
- def find(self, name: str):
- ai_elements = []
-
- for location in self.ai_element_locations:
+ def find(self, name: str) -> list[AiElement]:
+ ai_elements: list[AiElement] = []
+ for location in self._ai_element_locations:
path = pathlib.Path(location)
-
json_files = list(path.glob("*.json"))
-
- if not json_files:
- logger.warning(f"No JSON files found in: {location}")
- continue
-
for json_file in json_files:
ai_element = AiElement.from_json_file(json_file)
-
if ai_element.metadata.name == name:
ai_elements.append(ai_element)
-
- return ai_elements
\ No newline at end of file
+ if len(ai_elements) == 0:
+ raise AiElementNotFound(name=name, locations=self._ai_element_locations)
+ return ai_elements
diff --git a/src/askui/models/askui/api.py b/src/askui/models/askui/api.py
index 915bd2de..cc39cc8a 100644
--- a/src/askui/models/askui/api.py
+++ b/src/askui/models/askui/api.py
@@ -1,105 +1,95 @@
import os
import base64
import pathlib
+from pydantic import RootModel
import requests
-
+import json as json_lib
from PIL import Image
-from typing import List, Union
-from askui.models.askui.ai_element_utils import AiElement, AiElementCollection, AiElementNotFound
-from askui.utils import image_to_base64
+from typing import Any, Type, Union
+from askui.models.models import ModelComposition
+from askui.utils.image_utils import ImageSource
+from askui.locators.serializers import AskUiLocatorSerializer
+from askui.locators.locators import Locator
+from askui.utils.image_utils import image_to_base64
from askui.logger import logger
+from ..types.response_schemas import ResponseSchema, to_response_schema
-class AskUIHandler:
- def __init__(self):
+class AskUiInferenceApi:
+ def __init__(self, locator_serializer: AskUiLocatorSerializer):
+ self._locator_serializer = locator_serializer
self.inference_endpoint = os.getenv("ASKUI_INFERENCE_ENDPOINT", "https://inference.askui.com")
self.workspace_id = os.getenv("ASKUI_WORKSPACE_ID")
self.token = os.getenv("ASKUI_TOKEN")
-
self.authenticated = True
if self.workspace_id is None or self.token is None:
logger.warning("ASKUI_WORKSPACE_ID or ASKUI_TOKEN missing.")
self.authenticated = False
- self.ai_element_collection = AiElementCollection()
-
-
-
def _build_askui_token_auth_header(self, bearer_token: str | None = None) -> dict[str, str]:
if bearer_token is not None:
return {"Authorization": f"Bearer {bearer_token}"}
+
+ if self.token is None:
+ raise Exception("ASKUI_TOKEN is not set.")
token_base64 = base64.b64encode(self.token.encode("utf-8")).decode("utf-8")
return {"Authorization": f"Basic {token_base64}"}
-
- def _build_custom_elements(self, ai_elements: List[AiElement] | None):
- """
- Converts AiElements to the CustomElementDto format expected by the backend.
-
- Args:
- ai_elements (List[AiElement]): List of AI elements to convert
-
- Returns:
- dict: Custom elements in the format expected by the backend
- """
- if not ai_elements:
- return {}
-
- custom_elements = []
- for element in ai_elements:
- custom_element = {
- "customImage": "," + image_to_base64(element.image),
- "imageCompareFormat": "grayscale",
- "name": element.metadata.name
- }
- custom_elements.append(custom_element)
-
- return {
- "customElements": custom_elements
- }
- def __build_model_composition(self):
- return {}
-
- def __build_base_url(self, endpoint: str = "inference") -> str:
+
+ def _build_base_url(self, endpoint: str) -> str:
return f"{self.inference_endpoint}/api/v3/workspaces/{self.workspace_id}/{endpoint}"
- def predict(self, image: Union[pathlib.Path, Image.Image], locator: str, ai_elements: List[pathlib.Path] = None) -> tuple[int | None, int | None]:
+ def _request(self, endpoint: str, json: dict[str, Any] | None = None) -> Any:
response = requests.post(
- self.__build_base_url(),
- json={
- "image": f",{image_to_base64(image)}",
- **({"instruction": locator} if locator is not None else {}),
- **self.__build_model_composition(),
- **self._build_custom_elements(ai_elements)
- },
+ self._build_base_url(endpoint),
+ json=json,
headers={"Content-Type": "application/json", **self._build_askui_token_auth_header()},
timeout=30,
)
if response.status_code != 200:
raise Exception(f"{response.status_code}: Unknown Status Code\n", response.text)
- content = response.json()
+ return response.json()
+
+ def predict(self, image: Union[pathlib.Path, Image.Image], locator: Locator, model: ModelComposition | None = None) -> tuple[int | None, int | None]:
+ serialized_locator = self._locator_serializer.serialize(locator=locator)
+ logger.debug(f"serialized_locator:\n{json_lib.dumps(serialized_locator)}")
+ json: dict[str, Any] = {
+ "image": f",{image_to_base64(image)}",
+ "instruction": f"Click on {serialized_locator['instruction']}",
+ }
+ if "customElements" in serialized_locator:
+ json["customElements"] = serialized_locator["customElements"]
+ if model is not None:
+ json["modelComposition"] = model.model_dump(by_alias=True)
+ logger.debug(f"modelComposition:\n{json_lib.dumps(json['modelComposition'])}")
+ content = self._request(endpoint="inference", json=json)
assert content["type"] == "COMMANDS", f"Received unknown content type {content['type']}"
actions = [el for el in content["data"]["actions"] if el["inputEvent"] == "MOUSE_MOVE"]
if len(actions) == 0:
return None, None
- position = actions[0]["position"]
+ position = actions[0]["position"]
return int(position["x"]), int(position["y"])
-
- def locate_pta_prediction(self, image: Union[pathlib.Path, Image.Image], locator: str) -> tuple[int | None, int | None]:
- askui_locator = f'Click on pta "{locator}"'
- return self.predict(image, askui_locator)
-
- def locate_ocr_prediction(self, image: Union[pathlib.Path, Image.Image], locator: str) -> tuple[int | None, int | None]:
- askui_locator = f'Click on with text "{locator}"'
- return self.predict(image, askui_locator)
-
- def locate_ai_element_prediction(self, image: Union[pathlib.Path, Image.Image], name: str) -> tuple[int | None, int | None]:
- ai_elements = self.ai_element_collection.find(name)
- if len(ai_elements) == 0:
- raise AiElementNotFound(f"Could not locate AI element with name '{name}'")
-
- askui_instruction = f'Click on custom element with text "{name}"'
- return self.predict(image, askui_instruction, ai_elements=ai_elements)
+ def get_inference(
+ self,
+ image: ImageSource,
+ query: str,
+ response_schema: Type[ResponseSchema] | None = None
+ ) -> ResponseSchema | str:
+ json: dict[str, Any] = {
+ "image": image.to_data_url(),
+ "prompt": query,
+ }
+ _response_schema = to_response_schema(response_schema)
+ json["config"] = {
+ "json_schema": _response_schema.model_json_schema()
+ }
+ logger.debug(f"json_schema:\n{json_lib.dumps(json['config']['json_schema'])}")
+ content = self._request(endpoint="vqa/inference", json=json)
+ response = content["data"]["response"]
+ validated_response = _response_schema.model_validate(response)
+ if isinstance(validated_response, RootModel):
+ return validated_response.root
+ return validated_response
diff --git a/src/askui/models/huggingface/spaces_api.py b/src/askui/models/huggingface/spaces_api.py
index 5d2b5a7a..f8499206 100644
--- a/src/askui/models/huggingface/spaces_api.py
+++ b/src/askui/models/huggingface/spaces_api.py
@@ -2,7 +2,7 @@
import tempfile
from gradio_client import Client, handle_file
-from askui.utils import AutomationError
+from askui.exceptions import AutomationError
class HFSpacesHandler:
diff --git a/src/askui/models/models.py b/src/askui/models/models.py
new file mode 100644
index 00000000..71da37b2
--- /dev/null
+++ b/src/askui/models/models.py
@@ -0,0 +1,86 @@
+from collections.abc import Iterator
+from enum import Enum
+import re
+from typing import Annotated
+from pydantic import BaseModel, ConfigDict, Field, RootModel
+
+
+class ModelName(str, Enum):
+ ANTHROPIC__CLAUDE__3_5__SONNET__20241022 = "anthropic-claude-3-5-sonnet-20241022"
+ ANTHROPIC = "anthropic"
+ ASKUI = "askui"
+ ASKUI__AI_ELEMENT = "askui-ai-element"
+ ASKUI__COMBO = "askui-combo"
+ ASKUI__OCR = "askui-ocr"
+ ASKUI__PTA = "askui-pta"
+ TARS = "tars"
+
+
+MODEL_DEFINITION_PROPERTY_REGEX_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")
+
+
+ModelDefinitionProperty = Annotated[
+ str, Field(pattern=MODEL_DEFINITION_PROPERTY_REGEX_PATTERN)
+]
+
+
+class ModelDefinition(BaseModel):
+ """
+ A definition of a model.
+ """
+ model_config = ConfigDict(
+ populate_by_name=True,
+ )
+ task: ModelDefinitionProperty = Field(
+ description="The task the model is trained for, e.g., end-to-end OCR (e2e_ocr) or object detection (od)",
+ examples=["e2e_ocr", "od"],
+ )
+ architecture: ModelDefinitionProperty = Field(
+ description="The architecture of the model", examples=["easy_ocr", "yolo"]
+ )
+ version: str = Field(pattern=r"^[0-9]{1,6}$")
+ interface: ModelDefinitionProperty = Field(
+ description="The interface the model is trained for",
+ examples=["online_learning", "offline_learning"],
+ )
+ use_case: ModelDefinitionProperty = Field(
+ description='The use case the model is trained for. In the case of workspace specific AskUI models, this is often the workspace id but with "-" replaced by "_"',
+ examples=[
+ "fb3b9a7b_3aea_41f7_ba02_e55fd66d1c1e",
+ "00000000_0000_0000_0000_000000000000",
+ ],
+ default="00000000_0000_0000_0000_000000000000",
+ serialization_alias="useCase",
+ )
+ tags: list[ModelDefinitionProperty] = Field(
+ default_factory=list,
+ description="Tags for identifying the model that cannot be represented by other properties",
+ examples=["trained", "word_level"],
+ )
+
+ @property
+ def model_name(self) -> str:
+ return (
+ "-".join(
+ [
+ self.task,
+ self.architecture,
+ self.interface,
+ self.use_case,
+ self.version,
+ *self.tags,
+ ]
+ )
+ )
+
+
+class ModelComposition(RootModel[list[ModelDefinition]]):
+ """
+ A composition of models.
+ """
+
+ def __iter__(self):
+ return iter(self.root)
+
+ def __getitem__(self, index: int) -> ModelDefinition:
+ return self.root[index]
diff --git a/src/askui/models/router.py b/src/askui/models/router.py
index 377486dc..7f3395cb 100644
--- a/src/askui/models/router.py
+++ b/src/askui/models/router.py
@@ -1,12 +1,22 @@
-from typing import Optional
+from typing import Type
+from typing_extensions import override
from PIL import Image
from askui.container import telemetry
-from .askui.api import AskUIHandler
+from askui.locators.locators import AiElement, Prompt, Text
+from askui.locators.serializers import AskUiLocatorSerializer, VlmLocatorSerializer
+from askui.locators.locators import Locator
+from askui.models.askui.ai_element_utils import AiElementCollection
+from askui.models.models import ModelComposition, ModelName
+from askui.models.types.response_schemas import ResponseSchema
+from askui.reporting import CompositeReporter, Reporter
+from askui.tools.toolbox import AgentToolbox
+from askui.utils.image_utils import ImageSource
+from .askui.api import AskUiInferenceApi
from .anthropic.claude import ClaudeHandler
from .huggingface.spaces_api import HFSpacesHandler
+from ..exceptions import AutomationError, ElementNotFoundError
from ..logger import logger
-from ..utils import AutomationError
from .ui_tars_ep.ui_tars_api import UITarsAPIHandler
from .anthropic.claude_agent import ClaudeComputerAgent
from abc import ABC, abstractmethod
@@ -14,115 +24,208 @@
Point = tuple[int, int]
-def handle_response(response: tuple[int | None, int | None], locator: str):
+
+def handle_response(response: tuple[int | None, int | None], locator: str | Locator):
if response[0] is None or response[1] is None:
- raise AutomationError(f'Could not locate "{locator}"')
+ raise ElementNotFoundError(f"Element not found: {locator}")
return response
-class GroundingModelRouter(ABC):
+class GroundingModelRouter(ABC):
@abstractmethod
- def locate(self, screenshot: Image.Image, locator: str, model_name: str | None = None) -> Point:
+ def locate(
+ self,
+ screenshot: Image.Image,
+ locator: str | Locator,
+ model: ModelComposition | str | None = None,
+ ) -> Point:
pass
@abstractmethod
- def is_responsible(self, model_name: Optional[str]) -> bool:
+ def is_responsible(self, model: ModelComposition | str | None = None) -> bool:
pass
-
+
@abstractmethod
def is_authenticated(self) -> bool:
pass
-class AskUIModelRouter(GroundingModelRouter):
-
- def __init__(self):
- self.askui = AskUIHandler()
-
- def locate(self, screenshot: Image.Image, locator: str, model_name: str | None = None) -> Point:
- if not self.askui.authenticated:
- raise AutomationError(f"NoAskUIAuthenticationSet! Please set 'AskUI ASKUI_WORKSPACE_ID' or 'ASKUI_TOKEN' as env variables!")
-
- if model_name == "askui-pta":
- logger.debug(f"Routing locate prediction to askui-pta")
- x, y = self.askui.locate_pta_prediction(screenshot, locator)
+class AskUiModelRouter(GroundingModelRouter):
+ def __init__(self, inference_api: AskUiInferenceApi):
+ self._inference_api = inference_api
+
+ def _locate_with_askui_ocr(self, screenshot: Image.Image, locator: str | Text) -> Point:
+ locator = Text(locator) if isinstance(locator, str) else locator
+ x, y = self._inference_api.predict(screenshot, locator)
+ return handle_response((x, y), locator)
+
+ @override
+ def locate(
+ self,
+ screenshot: Image.Image,
+ locator: str | Locator,
+ model: ModelComposition | str | None = None,
+ ) -> Point:
+ if not self._inference_api.authenticated:
+ raise AutomationError(
+ "NoAskUIAuthenticationSet! Please set 'AskUI ASKUI_WORKSPACE_ID' or 'ASKUI_TOKEN' as env variables!"
+ )
+ if not isinstance(model, str) or model == ModelName.ASKUI:
+ logger.debug("Routing locate prediction to askui")
+ locator = Text(locator) if isinstance(locator, str) else locator
+ _model = model if not isinstance(model, str) else None
+ x, y = self._inference_api.predict(screenshot, locator, _model)
return handle_response((x, y), locator)
- if model_name == "askui-ocr":
- logger.debug(f"Routing locate prediction to askui-ocr")
- x, y = self.askui.locate_ocr_prediction(screenshot, locator)
+ if not isinstance(locator, str):
+ raise AutomationError(
+ f'Locators of type `{type(locator)}` are not supported for models "askui-pta", "askui-ocr" and "askui-combo" and "askui-ai-element". Please provide a `str`.'
+ )
+ if model == ModelName.ASKUI__PTA:
+ logger.debug("Routing locate prediction to askui-pta")
+ x, y = self._inference_api.predict(screenshot, Prompt(locator))
return handle_response((x, y), locator)
- if model_name == "askui-combo" or model_name is None:
- logger.debug(f"Routing locate prediction to askui-combo")
- x, y = self.askui.locate_pta_prediction(screenshot, locator)
+ if model == ModelName.ASKUI__OCR:
+ logger.debug("Routing locate prediction to askui-ocr")
+ return self._locate_with_askui_ocr(screenshot, locator)
+ if model == ModelName.ASKUI__COMBO or model is None:
+ logger.debug("Routing locate prediction to askui-combo")
+ prompt_locator = Prompt(locator)
+ x, y = self._inference_api.predict(screenshot, prompt_locator)
if x is None or y is None:
- x, y = self.askui.locate_ocr_prediction(screenshot, locator)
- return handle_response((x, y), locator)
- if model_name == "askui-ai-element":
- logger.debug(f"Routing click prediction to askui-ai-element")
- x, y = self.askui.locate_ai_element_prediction(screenshot, locator)
- return handle_response((x, y), locator)
- raise AutomationError(f"Invalid model name {model_name} for click")
-
- def is_responsible(self, model_name: Optional[str]):
- return model_name is None or model_name.startswith("askui")
-
+ return self._locate_with_askui_ocr(screenshot, locator)
+ return handle_response((x, y), prompt_locator)
+ if model == ModelName.ASKUI__AI_ELEMENT:
+ logger.debug("Routing click prediction to askui-ai-element")
+ _locator = AiElement(locator)
+ x, y = self._inference_api.predict(screenshot, _locator)
+ return handle_response((x, y), _locator)
+ raise AutomationError(f'Invalid model: "{model}"')
+
+ @override
+ def is_responsible(self, model: ModelComposition | str | None = None) -> bool:
+ return not isinstance(model, str) or model.startswith(ModelName.ASKUI)
+
+ @override
def is_authenticated(self) -> bool:
- return self.askui.authenticated
+ return self._inference_api.authenticated
-
class ModelRouter:
- def __init__(self, log_level, report,
- grounding_model_routers: list[GroundingModelRouter] | None = None):
- self.report = report
-
- self.grounding_model_routers = grounding_model_routers or [AskUIModelRouter()]
-
- self.claude = ClaudeHandler(log_level)
- self.huggingface_spaces = HFSpacesHandler()
- self.tars = UITarsAPIHandler(self.report)
-
- def act(self, controller_client, goal: str, model_name: str | None = None):
- if self.tars.authenticated and model_name == "tars":
- return self.tars.act(controller_client, goal)
- if self.claude.authenticated and (model_name == "claude" or model_name is None):
- agent = ClaudeComputerAgent(controller_client, self.report)
- return agent.run(goal)
- raise AutomationError("Invalid model name for act")
-
- def get_inference(self, screenshot: Image.Image, locator: str, model_name: str | None = None):
- if self.tars.authenticated and model_name == "tars":
- return self.tars.get_prediction(screenshot, locator)
- if self.claude.authenticated and (model_name == "anthropic-claude-3-5-sonnet-20241022" or model_name is None):
- return self.claude.get_inference(screenshot, locator)
- raise AutomationError("Executing get commands requires to authenticate with an Automation Model Provider supporting it.")
+ def __init__(
+ self,
+ tools: AgentToolbox,
+ grounding_model_routers: list[GroundingModelRouter] | None = None,
+ reporter: Reporter | None = None,
+ ):
+ _reporter = reporter or CompositeReporter()
+ self._askui = AskUiInferenceApi(
+ locator_serializer=AskUiLocatorSerializer(
+ ai_element_collection=AiElementCollection(),
+ reporter=_reporter,
+ ),
+ )
+ self._grounding_model_routers = grounding_model_routers or [AskUiModelRouter(inference_api=self._askui)]
+ self._claude = ClaudeHandler()
+ self._huggingface_spaces = HFSpacesHandler()
+ self._tars = UITarsAPIHandler(agent_os=tools.agent_os, reporter=_reporter)
+ self._claude_computer_agent = ClaudeComputerAgent(agent_os=tools.agent_os, reporter=_reporter)
+ self._locator_serializer = VlmLocatorSerializer()
+
+ def act(self, goal: str, model: ModelComposition | str | None = None):
+ if self._tars.authenticated and model == ModelName.TARS:
+ return self._tars.act(goal)
+ if self._claude.authenticated and (model is None or isinstance(model, str) and model.startswith(ModelName.ANTHROPIC)):
+ return self._claude_computer_agent.run(goal)
+ raise AutomationError(f"Invalid model for act: {model}")
+
+ def get_inference(
+ self,
+ query: str,
+ image: ImageSource,
+ response_schema: Type[ResponseSchema] | None = None,
+ model: ModelComposition | str | None = None,
+ ) -> ResponseSchema | str:
+ if self._tars.authenticated and model == ModelName.TARS:
+ if response_schema not in [str, None]:
+ raise NotImplementedError("(Non-String) Response schema is not yet supported for UI-TARS models.")
+ return self._tars.get_inference(image=image, query=query)
+ if self._claude.authenticated and (
+ isinstance(model, str) and model.startswith(ModelName.ANTHROPIC)
+ ):
+ if response_schema not in [str, None]:
+ raise NotImplementedError("(Non-String) Response schema is not yet supported for Anthropic models.")
+ return self._claude.get_inference(image=image, query=query)
+ if self._askui.authenticated and (model == ModelName.ASKUI or model is None):
+ return self._askui.get_inference(
+ image=image,
+ query=query,
+ response_schema=response_schema,
+ )
+ raise AutomationError(
+ f"Executing get commands requires to authenticate with an Automation Model Provider supporting it: {model}"
+ )
+
+ def _serialize_locator(self, locator: str | Locator) -> str:
+ if isinstance(locator, Locator):
+ return self._locator_serializer.serialize(locator=locator)
+ return locator
@telemetry.record_call(exclude={"locator", "screenshot"})
- def locate(self, screenshot: Image.Image, locator: str, model_name: str | None = None) -> Point:
- if model_name is not None and model_name in self.huggingface_spaces.get_spaces_names():
- x, y = self.huggingface_spaces.predict(screenshot, locator, model_name)
+ def locate(
+ self,
+ screenshot: Image.Image,
+ locator: str | Locator,
+ model: ModelComposition | str | None = None,
+ ) -> Point:
+ if (
+ isinstance(model, str)
+ and model in self._huggingface_spaces.get_spaces_names()
+ ):
+ x, y = self._huggingface_spaces.predict(
+ screenshot=screenshot,
+ locator=self._serialize_locator(locator),
+ model_name=model,
+ )
return handle_response((x, y), locator)
- if model_name is not None:
- if model_name.startswith("anthropic") and not self.claude.authenticated:
- raise AutomationError("You need to provide Anthropic credentials to use Anthropic models.")
- if model_name.startswith("tars") and not self.tars.authenticated:
- raise AutomationError("You need to provide UI-TARS HF Endpoint credentials to use UI-TARS models.")
- if self.tars.authenticated and model_name == "tars":
- x, y = self.tars.locate_prediction(screenshot, locator)
+ if isinstance(model, str):
+ if model.startswith(ModelName.ANTHROPIC) and not self._claude.authenticated:
+ raise AutomationError(
+ "You need to provide Anthropic credentials to use Anthropic models."
+ )
+ if model.startswith(ModelName.TARS) and not self._tars.authenticated:
+ raise AutomationError(
+ "You need to provide UI-TARS HF Endpoint credentials to use UI-TARS models."
+ )
+ if self._tars.authenticated and model == ModelName.TARS:
+ x, y = self._tars.locate_prediction(
+ screenshot, self._serialize_locator(locator)
+ )
return handle_response((x, y), locator)
- if self.claude.authenticated and model_name == "anthropic-claude-3-5-sonnet-20241022":
+ if (
+ self._claude.authenticated
+ and isinstance(model, str) and model.startswith(ModelName.ANTHROPIC)
+ ):
logger.debug("Routing locate prediction to Anthropic")
- x, y = self.claude.locate_inference(screenshot, locator)
+ x, y = self._claude.locate_inference(
+ screenshot, self._serialize_locator(locator)
+ )
return handle_response((x, y), locator)
-
- for grounding_model_router in self.grounding_model_routers:
- if grounding_model_router.is_responsible(model_name) and grounding_model_router.is_authenticated():
- return grounding_model_router.locate(screenshot, locator, model_name)
- if model_name is None:
- if self.claude.authenticated:
+ for grounding_model_router in self._grounding_model_routers:
+ if (
+ grounding_model_router.is_responsible(model)
+ and grounding_model_router.is_authenticated()
+ ):
+ return grounding_model_router.locate(screenshot, locator, model)
+
+ if model is None:
+ if self._claude.authenticated:
logger.debug("Routing locate prediction to Anthropic")
- x, y = self.claude.locate_inference(screenshot, locator)
+ x, y = self._claude.locate_inference(
+ screenshot, self._serialize_locator(locator)
+ )
return handle_response((x, y), locator)
-
- raise AutomationError("Executing locate commands requires to authenticate with an Automation Model Provider.")
+
+ raise AutomationError(
+ "Executing locate commands requires to authenticate with an Automation Model Provider."
+ )
diff --git a/src/askui/models/types/__init__.py b/src/askui/models/types/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/askui/models/types/response_schemas.py b/src/askui/models/types/response_schemas.py
new file mode 100644
index 00000000..e9eba25c
--- /dev/null
+++ b/src/askui/models/types/response_schemas.py
@@ -0,0 +1,43 @@
+from typing import Type, TypeVar, overload
+from pydantic import BaseModel, ConfigDict, RootModel
+
+
+class ResponseSchemaBase(BaseModel):
+ model_config = ConfigDict(extra="forbid")
+
+
+String = RootModel[str]
+Boolean = RootModel[bool]
+Integer = RootModel[int]
+Float = RootModel[float]
+
+
+ResponseSchema = TypeVar('ResponseSchema', ResponseSchemaBase, str, bool, int, float)
+
+
+@overload
+def to_response_schema(response_schema: None) -> Type[String]: ...
+@overload
+def to_response_schema(response_schema: Type[str]) -> Type[String]: ...
+@overload
+def to_response_schema(response_schema: Type[bool]) -> Type[Boolean]: ...
+@overload
+def to_response_schema(response_schema: Type[int]) -> Type[Integer]: ...
+@overload
+def to_response_schema(response_schema: Type[float]) -> Type[Float]: ...
+@overload
+def to_response_schema(response_schema: Type[ResponseSchemaBase]) -> Type[ResponseSchemaBase]: ...
+def to_response_schema(response_schema: Type[ResponseSchemaBase] | Type[str] | Type[bool] | Type[int] | Type[float] | None = None) -> Type[ResponseSchemaBase] | Type[String] | Type[Boolean] | Type[Integer] | Type[Float]:
+ if response_schema is None:
+ return String
+ if response_schema is str:
+ return String
+ if response_schema is bool:
+ return Boolean
+ if response_schema is int:
+ return Integer
+ if response_schema is float:
+ return Float
+ if issubclass(response_schema, ResponseSchemaBase):
+ return response_schema
+ raise ValueError(f"Invalid response schema type: {response_schema}")
diff --git a/src/askui/models/ui_tars_ep/ui_tars_api.py b/src/askui/models/ui_tars_ep/ui_tars_api.py
index 312e6a56..0bc97c96 100644
--- a/src/askui/models/ui_tars_ep/ui_tars_api.py
+++ b/src/askui/models/ui_tars_ep/ui_tars_api.py
@@ -1,18 +1,23 @@
import re
import os
import pathlib
-from typing import Union
+from typing import Any, Union
from openai import OpenAI
-from askui.utils import image_to_base64
+from askui.reporting import Reporter
+from askui.tools.agent_os import AgentOs
+from askui.utils.image_utils import image_to_base64
from PIL import Image
+
+from askui.utils.image_utils import ImageSource
from .prompts import PROMPT, PROMPT_QA
from .parser import UITarsEPMessage
import time
class UITarsAPIHandler:
- def __init__(self, report):
- self.report = report
+ def __init__(self, agent_os: AgentOs, reporter: Reporter):
+ self._agent_os = agent_os
+ self._reporter = reporter
if os.getenv("TARS_URL") is None or os.getenv("TARS_API_KEY") is None:
self.authenticated = False
else:
@@ -22,7 +27,7 @@ def __init__(self, report):
api_key=os.getenv("TARS_API_KEY")
)
- def predict(self, screenshot, instruction: str, prompt: str):
+ def _predict(self, image_url: str, instruction: str, prompt: str) -> Any:
chat_completion = self.client.chat.completions.create(
model="tgi",
messages=[
@@ -32,7 +37,7 @@ def predict(self, screenshot, instruction: str, prompt: str):
{
"type": "image_url",
"image_url": {
- "url": f"data:image/png;base64,{image_to_base64(screenshot)}"
+ "url": image_url,
}
},
{
@@ -55,7 +60,11 @@ def predict(self, screenshot, instruction: str, prompt: str):
def locate_prediction(self, image: Union[pathlib.Path, Image.Image], locator: str) -> tuple[int | None, int | None]:
askui_locator = f'Click on "{locator}"'
- prediction = self.predict(image, askui_locator, PROMPT)
+ prediction = self._predict(
+ image_url=f"data:image/png;base64,{image_to_base64(image)}",
+ instruction=askui_locator,
+ prompt=PROMPT,
+ )
pattern = r"click\(start_box='(\(\d+,\d+\))'\)"
match = re.search(pattern, prediction)
if match:
@@ -69,11 +78,15 @@ def locate_prediction(self, image: Union[pathlib.Path, Image.Image], locator: st
return x, y
return None, None
- def get_prediction(self, image: Image.Image, instruction: str) -> str:
- return self.predict(image, instruction, PROMPT_QA)
+ def get_inference(self, image: ImageSource, query: str) -> str:
+ return self._predict(
+ image_url=image.to_data_url(),
+ instruction=query,
+ prompt=PROMPT_QA,
+ )
- def act(self, controller_client, goal: str) -> str:
- screenshot = controller_client.screenshot()
+ def act(self, goal: str) -> None:
+ screenshot = self._agent_os.screenshot()
self.act_history = [
{
"role": "user",
@@ -91,10 +104,10 @@ def act(self, controller_client, goal: str) -> str:
]
}
]
- self.execute_act(controller_client, self.act_history)
+ self.execute_act(self.act_history)
- def add_screenshot_to_history(self, controller_client, message_history):
- screenshot = controller_client.screenshot()
+ def add_screenshot_to_history(self, message_history):
+ screenshot = self._agent_os.screenshot()
message_history.append(
{
"role": "user",
@@ -148,7 +161,7 @@ def filter_message_thread(self, message_history, max_screenshots=3):
return filtered_messages
- def execute_act(self, controller_client, message_history):
+ def execute_act(self, message_history):
message_history = self.filter_message_thread(message_history)
chat_completion = self.client.chat.completions.create(
@@ -166,8 +179,8 @@ def execute_act(self, controller_client, message_history):
raw_message = chat_completion.choices[-1].message.content
print(raw_message)
- if self.report is not None:
- self.report.add_message("UI-TARS", raw_message)
+ if self._reporter is not None:
+ self._reporter.add_message("UI-TARS", raw_message)
try:
message = UITarsEPMessage.parse_message(raw_message)
@@ -184,21 +197,21 @@ def execute_act(self, controller_client, message_history):
]
}
)
- self.execute_act(controller_client, message_history)
+ self.execute_act(message_history)
return
action = message.parsed_action
if action.action_type == "click":
- controller_client.mouse(action.start_box.x, action.start_box.y)
- controller_client.click("left")
+ self._agent_os.mouse(action.start_box.x, action.start_box.y)
+ self._agent_os.click("left")
time.sleep(1)
if action.action_type == "type":
- controller_client.click("left")
- controller_client.type(action.content)
+ self._agent_os.click("left")
+ self._agent_os.type(action.content)
time.sleep(0.5)
if action.action_type == "hotkey":
- controller_client.keyboard_pressed(action.content)
- controller_client.keyboard_release(action.content)
+ self._agent_os.keyboard_pressed(action.content)
+ self._agent_os.keyboard_release(action.content)
time.sleep(0.5)
if action.action_type == "call_user":
time.sleep(1)
@@ -207,5 +220,5 @@ def execute_act(self, controller_client, message_history):
if action.action_type == "finished":
return
- self.add_screenshot_to_history(controller_client, message_history)
- self.execute_act(controller_client, message_history)
\ No newline at end of file
+ self.add_screenshot_to_history(message_history)
+ self.execute_act(message_history)
diff --git a/src/askui/models/utils.py b/src/askui/models/utils.py
deleted file mode 100644
index a5f0cd43..00000000
--- a/src/askui/models/utils.py
+++ /dev/null
@@ -1,69 +0,0 @@
-import re
-import base64
-
-from io import BytesIO
-from PIL import Image, ImageOps
-
-
-def scale_image_with_padding(image, max_width, max_height):
- original_width, original_height = image.size
- aspect_ratio = original_width / original_height
- if (max_width / max_height) > aspect_ratio:
- scale_factor = max_height / original_height
- else:
- scale_factor = max_width / original_width
- scaled_width = int(original_width * scale_factor)
- scaled_height = int(original_height * scale_factor)
- scaled_image = image.resize((scaled_width, scaled_height), Image.Resampling.LANCZOS)
- pad_left = (max_width - scaled_width) // 2
- pad_top = (max_height - scaled_height) // 2
- padded_image = ImageOps.expand(
- scaled_image,
- border=(pad_left, pad_top, max_width - scaled_width - pad_left, max_height - scaled_height - pad_top),
- fill=(0, 0, 0) # Black padding
- )
- return padded_image
-
-
-def scale_coordinates_back(x, y, original_width, original_height, max_width, max_height):
- aspect_ratio = original_width / original_height
- if (max_width / max_height) > aspect_ratio:
- scale_factor = max_height / original_height
- scaled_width = int(original_width * scale_factor)
- scaled_height = max_height
- else:
- scale_factor = max_width / original_width
- scaled_width = max_width
- scaled_height = int(original_height * scale_factor)
- pad_left = (max_width - scaled_width) // 2
- pad_top = (max_height - scaled_height) // 2
- adjusted_x = x - pad_left
- adjusted_y = y - pad_top
- if adjusted_x < 0 or adjusted_x > scaled_width or adjusted_y < 0 or adjusted_y > scaled_height:
- raise ValueError("Coordinates are outside the padded image area")
- original_x = adjusted_x / scale_factor
- original_y = adjusted_y / scale_factor
- return original_x, original_y
-
-
-def extract_click_coordinates(text: str):
- pattern = r'(\d+),\s*(\d+)'
- matches = re.findall(pattern, text)
- x, y = matches[-1]
- return int(x), int(y)
-
-
-def base64_to_image(base64_string):
- base64_string = base64_string.split(",")[1]
- while len(base64_string) % 4 != 0:
- base64_string += '='
- image_data = base64.b64decode(base64_string)
- image = Image.open(BytesIO(image_data))
- return image
-
-
-def image_to_base64(image):
- buffered = BytesIO()
- image.save(buffered, format="PNG")
- img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
- return img_str
diff --git a/src/askui/py.typed b/src/askui/py.typed
new file mode 100644
index 00000000..0519ecba
--- /dev/null
+++ b/src/askui/py.typed
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/src/askui/reporting/report.py b/src/askui/reporting.py
similarity index 80%
rename from src/askui/reporting/report.py
rename to src/askui/reporting.py
index accb9a76..c274fc80 100644
--- a/src/askui/reporting/report.py
+++ b/src/askui/reporting.py
@@ -1,7 +1,10 @@
+from abc import ABC, abstractmethod
from pathlib import Path
+import random
from jinja2 import Template
from datetime import datetime
-from typing import Any, List, Dict, Optional, Union, Callable
+from typing import List, Dict, Optional, Union
+from typing_extensions import override
import platform
import sys
from importlib.metadata import distributions
@@ -11,49 +14,96 @@
import json
-class SimpleReportGenerator:
- def __init__(self, report_dir: str = "reports", report_callback: Callable[[str | dict[str, Any]], None] | None = None) -> None:
+class Reporter(ABC):
+ @abstractmethod
+ def add_message(
+ self,
+ role: str,
+ content: Union[str, dict, list],
+ image: Optional[Image.Image | list[Image.Image]] = None,
+ ) -> None:
+ raise NotImplementedError()
+
+ @abstractmethod
+ def generate(self) -> None:
+ raise NotImplementedError()
+
+
+class CompositeReporter(Reporter):
+ def __init__(self, reports: list[Reporter] | None = None) -> None:
+ self._reports = reports or []
+
+ @override
+ def add_message(
+ self,
+ role: str,
+ content: Union[str, dict, list],
+ image: Optional[Image.Image | list[Image.Image]] = None,
+ ) -> None:
+ for report in self._reports:
+ report.add_message(role, content, image)
+
+ @override
+ def generate(self) -> None:
+ for report in self._reports:
+ report.generate()
+
+
+class SimpleHtmlReporter(Reporter):
+ def __init__(self, report_dir: str = "reports") -> None:
self.report_dir = Path(report_dir)
self.report_dir.mkdir(exist_ok=True)
self.messages: List[Dict] = []
self.system_info = self._collect_system_info()
- self.report_callback = report_callback
def _collect_system_info(self) -> Dict[str, str]:
"""Collect system and Python information"""
return {
"platform": platform.platform(),
"python_version": sys.version.split()[0],
- "packages": sorted([f"{dist.metadata['Name']}=={dist.version}"
- for dist in distributions()])
+ "packages": sorted(
+ [f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()]
+ ),
}
-
+
def _image_to_base64(self, image: Image.Image) -> str:
"""Convert PIL Image to base64 string"""
buffered = BytesIO()
image.save(buffered, format="PNG")
return base64.b64encode(buffered.getvalue()).decode()
-
+
def _format_content(self, content: Union[str, dict, list]) -> str:
"""Format content based on its type"""
if isinstance(content, (dict, list)):
return json.dumps(content, indent=2)
return str(content)
-
- def add_message(self, role: str, content: Union[str, dict, list], image: Optional[Image.Image] = None):
+
+ @override
+ def add_message(
+ self,
+ role: str,
+ content: Union[str, dict, list],
+ image: Optional[Image.Image | list[Image.Image]] = None,
+ ) -> None:
"""Add a message to the report, optionally with an image"""
+ if image is None:
+ _images = []
+ elif isinstance(image, list):
+ _images = image
+ else:
+ _images = [image]
+
message = {
"timestamp": datetime.now(),
"role": role,
"content": self._format_content(content),
"is_json": isinstance(content, (dict, list)),
- "image": self._image_to_base64(image) if image else None
+ "images": [self._image_to_base64(img) for img in _images],
}
self.messages.append(message)
- if self.report_callback is not None:
- self.report_callback(message)
- def generate_report(self) -> str:
+ @override
+ def generate(self) -> None:
"""Generate HTML report using a Jinja template"""
template_str = """
@@ -190,12 +240,12 @@ def generate_report(self) -> str:
{% else %}
{{ msg.content }}
{% endif %}
- {% if msg.image %}
+ {% for image in msg.images %}
-
- {% endif %}
+ {% endfor %}
{% endfor %}
@@ -203,14 +253,13 @@ def generate_report(self) -> str: