
feat: add a tool that allows the agent to query the location of the text on the screen.#105

Open
arkrolin wants to merge 2 commits into CursorTouch:main from arkrolin:add-new-tool

Conversation

@arkrolin

Summary

A new tool (LocateText) has been added that allows the agent to query the position of text on the screen. Screenshots are taken with the existing desktop.get_screenshot method, and text queries use WinRT's native OCR. This significantly improves operational accuracy when UIA is unavailable and avoids pixel-level guesswork, especially when the agent's vision capabilities are limited.

The agent can set use_vision (defaults to False) to request an annotated screenshot in the response, allowing it to verify which match is the intended one when the same text appears more than once on the page.

Now, it can even play some simple games.

Why this is needed

1. The Phenomenon of "Pixel Illusions"

UIA may fail to function in certain scenarios, notably in applications or games that do their own rendering. In such cases the agent struggles to perform precise actions, such as clicking on specific text within a UI, even when use_vision is enabled to upload screenshots: the model misjudges pixel coordinates ("pixel illusions"). This issue is particularly acute for models with limited vision capabilities.

2. Centered on Lightweight Design

It uses only nine WinRT libraries and strictly avoids introducing bloated dependencies (such as OpenCV). Icon rendering is implemented natively via the PIL library and generates no cache files on disk. Actual latency is low.
Furthermore, by letting the agent query specific text segments first, token consumption is reduced. The agent can also use approximate spatial descriptions to narrow down text locations and filter out unwanted content.
Given that Windows' native OCR often yields poor recognition results, with particularly chaotic output formatting for some complex languages, these strategies keep the context window from filling up with large volumes of disorganized and incoherent OCR data.
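The "approximate spatial description" filtering can be sketched as follows. This is a minimal illustration only: the actual locate_text.py may use a different region vocabulary or geometry. Here the hints "top"/"bottom"/"left"/"right" split the screen into halves and "center" takes the middle third:

```python
from typing import Dict, List

def filter_by_region(hits: List[Dict], screen_w: int, screen_h: int,
                     region_hint: str) -> List[Dict]:
    """Keep only OCR hits whose center point falls inside the hinted region.

    Illustrative sketch: real region handling in locate_text.py may differ.
    """
    def center(hit):
        b = hit["bounds"]
        return b["x"] + b["w"] / 2, b["y"] + b["h"] / 2

    checks = {
        "top":    lambda cx, cy: cy < screen_h / 2,
        "bottom": lambda cx, cy: cy >= screen_h / 2,
        "left":   lambda cx, cy: cx < screen_w / 2,
        "right":  lambda cx, cy: cx >= screen_w / 2,
        "center": lambda cx, cy: (screen_w / 3 <= cx <= 2 * screen_w / 3
                                  and screen_h / 3 <= cy <= 2 * screen_h / 3),
    }
    check = checks.get(region_hint)
    if check is None:
        return hits  # unknown hint: return everything unfiltered
    return [h for h in hits if check(*center(h))]
```

Filtering early like this also keeps irrelevant OCR hits out of the response, which is what reduces the token cost mentioned above.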

Changes

src/windows_mcp/__main__.py

  • Added the LocateText tool with a display parameter.
  • Added optional use_vision and region_hint parameters.

src/windows_mcp/desktop/locate_text.py

  • _perform_ocr(): run Windows' native OCR on the screenshot.
  • locate_text_tool(): filter OCR results, annotate the image, and return results.
  • clean_ocr_text(): remove spaces that OCR may have inserted between CJK characters.
  • _process_image_for_transfer(): compress the image for transfer.
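As a rough illustration of the CJK cleanup step, here is a regex-based sketch. The character ranges and behavior are assumptions for illustration; the actual clean_ocr_text may cover more ranges or edge cases:

```python
import re

# CJK ideographs plus Japanese kana (assumed scope for this sketch)
_CJK = r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]"
_CJK_SPACE = re.compile(rf"({_CJK})\s+(?={_CJK})")

def clean_ocr_text(text: str) -> str:
    """Drop whitespace between two adjacent CJK characters.

    OCR engines often insert spurious spaces inside CJK runs; spaces
    next to Latin text are left untouched.
    """
    return _CJK_SPACE.sub(r"\1", text)
```

Using a lookahead for the second character (rather than consuming it) lets the pattern handle consecutive spaced characters like "你 好 吗" in a single pass.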

Behavior

Receiving Parameters

{
    "tool": "LocateText",
    "text_query": "Start",
    "use_vision": true,
    "region_hint": "top"
}

Response

[
  {
    "status": "clear",
    "message": "Found a clear match for the query.",
    "data": {
      "center_point": { "x": 150, "y": 300 },
      "bounds": { "x": 100, "y": 290, "w": 100, "h": 20 },
      "id": 1
    }
  },
  {
    <!-- returned as fastmcp.utilities.types.Image (McpImage) -->
    "type": "image",
    "data": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAg...",
    "mimeType": "image/jpeg"
  }
]

Testing

python -m pytest -q tests/test_locate_text.py

