
feat: add a tool that allows the agent to query the location of the text on the screen.#105

Open
arkrolin wants to merge 2 commits into CursorTouch:main from arkrolin:add-new-tool

Conversation

@arkrolin

Summary

A new tool (LocateText) has been added that allows the agent to query the position of text on the screen. Screenshots are taken with the existing desktop.get_screenshot method, and text queries use WinRT's native OCR. This significantly improves operational accuracy when UIA is unavailable and avoids pixel-level guesswork, especially when the agent's vision capabilities are limited.

The agent can set use_vision (defaults to False) to request an annotated screenshot in the response, allowing it to verify which match is the intended one when the same text appears more than once on the page.

Now, it can even play some simple games.

Why this is needed

1. The Phenomenon of "Pixel Illusions"

UIA may fail to function in certain scenarios, notably in applications or games that do their own rendering. In such cases the agent struggles to perform precise actions, such as clicking on specific text within a UI, even when use_vision is enabled to upload screenshots: the model misjudges pixel coordinates ("pixel illusions"). This issue is particularly acute for models with limited vision capabilities.

2. Centered on Lightweight Design

It uses only nine WinRT libraries and strictly avoids introducing bloated dependencies (such as OpenCV). Icon rendering is implemented natively via the PIL library and generates no cache files on disk. Actual latency is low.
Furthermore, by letting the agent query specific text segments first, token consumption is reduced. The agent can also use approximate spatial descriptions to narrow down text locations and filter out unwanted content.
Given that Windows' native OCR often yields poor recognition results, with particularly chaotic output formatting for some complex languages, these strategies keep the context window from filling up with large volumes of disorganized and incoherent OCR data.
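The "approximate spatial description" filtering can be sketched as follows. This is a minimal illustration only: the actual locate_text.py may use a different region vocabulary or geometry. Here the hints "top"/"bottom"/"left"/"right" split the screen into halves and "center" takes the middle third:

```python
from typing import Dict, List

def filter_by_region(hits: List[Dict], screen_w: int, screen_h: int,
                     region_hint: str) -> List[Dict]:
    """Keep only OCR hits whose center point falls inside the hinted region.

    Illustrative sketch: real region handling in locate_text.py may differ.
    """
    def center(hit):
        b = hit["bounds"]
        return b["x"] + b["w"] / 2, b["y"] + b["h"] / 2

    checks = {
        "top":    lambda cx, cy: cy < screen_h / 2,
        "bottom": lambda cx, cy: cy >= screen_h / 2,
        "left":   lambda cx, cy: cx < screen_w / 2,
        "right":  lambda cx, cy: cx >= screen_w / 2,
        "center": lambda cx, cy: (screen_w / 3 <= cx <= 2 * screen_w / 3
                                  and screen_h / 3 <= cy <= 2 * screen_h / 3),
    }
    check = checks.get(region_hint)
    if check is None:
        return hits  # unknown hint: return everything unfiltered
    return [h for h in hits if check(*center(h))]
```

Filtering early like this also keeps irrelevant OCR hits out of the response, which is what reduces the token cost mentioned above.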

Changes

src/windows_mcp/__main__.py

  • Added the LocateText tool with a display parameter.
  • Added optional use_vision and region_hint parameters.

src/windows_mcp/desktop/locate_text.py

  • _perform_ocr(): run Windows' native OCR on the screenshot.
  • locate_text_tool(): filter OCR results, annotate the image, and return results.
  • clean_ocr_text(): remove spaces that OCR may have inserted between CJK characters.
  • _process_image_for_transfer(): compress the image for transfer.
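As a rough illustration of the CJK cleanup step, here is a regex-based sketch. The character ranges and behavior are assumptions for illustration; the actual clean_ocr_text may cover more ranges or edge cases:

```python
import re

# CJK ideographs plus Japanese kana (assumed scope for this sketch)
_CJK = r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]"
_CJK_SPACE = re.compile(rf"({_CJK})\s+(?={_CJK})")

def clean_ocr_text(text: str) -> str:
    """Drop whitespace between two adjacent CJK characters.

    OCR engines often insert spurious spaces inside CJK runs; spaces
    next to Latin text are left untouched.
    """
    return _CJK_SPACE.sub(r"\1", text)
```

Using a lookahead for the second character (rather than consuming it) lets the pattern handle consecutive spaced characters like "你 好 吗" in a single pass.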

Behavior

Receiving Parameters

{
    "tool": "LocateText",
    "text_query": "Start",
    "use_vision": true,
    "region_hint": "top"
}

Response

[
  {
    "status": "clear",
    "message": "Found a clear match for the query.",
    "data": {
      "center_point": { "x": 150, "y": 300 },
      "bounds": { "x": 100, "y": 290, "w": 100, "h": 20 },
      "id": 1
    }
  },
  {
    <!-- returned as fastmcp.utilities.types.Image (McpImage) -->
    "type": "image",
    "data": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAg...",
    "mimeType": "image/jpeg"
  }
]

Testing

python -m pytest -q tests/test_locate_text.py

