feat: add a tool that allows the agent to query the location of the text on the screen. #105
Open
arkrolin wants to merge 2 commits into CursorTouch:main from
Summary
A new tool (LocateText) has been added that allows the agent to query the position of text on the screen. Screenshots are taken with the existing desktop.get_screenshot method, and text queries use WinRT's native OCR. This significantly improves operational accuracy when UIA is unavailable and effectively avoids the "pixel illusion" problem, especially when the agent's vision capabilities are limited. The agent can set use_vision (defaults to False) to request the annotated screenshot, allowing it to verify which occurrence is the intended one when the same text appears multiple times on the page. It can now even play some simple games.
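The matching step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the PR's actual implementation: the OcrWord type and locate_text helper are invented names, and a real version would consume WinRT OCR results and join adjacent words to match multi-word queries.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OcrWord:
    text: str
    # Bounding box in screen pixels: (left, top, width, height)
    bbox: Tuple[int, int, int, int]

def locate_text(words: List[OcrWord], query: str) -> List[Tuple[int, int]]:
    """Return the center point of every OCR word matching the query.

    Matching is case-insensitive; centers are suitable click targets.
    """
    centers = []
    for w in words:
        if query.lower() in w.text.lower():
            left, top, width, height = w.bbox
            centers.append((left + width // 2, top + height // 2))
    return centers

# Example: two OCR words from a hypothetical screenshot.
words = [
    OcrWord("File", (10, 5, 40, 20)),
    OcrWord("Edit", (60, 5, 38, 20)),
]
print(locate_text(words, "edit"))  # [(79, 15)]
```

Returning a coordinate rather than the raw OCR dump is what lets the agent act precisely even when its own vision is too weak to judge pixel positions from a screenshot.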
Why this is needed
1. The Phenomenon of "Pixel Illusions"
UIA may fail to function in certain scenarios, specifically with applications or games that do their own rendering. In such cases, because of "pixel illusions", the agent struggles to perform precise actions, such as clicking specific text within a UI, even when the use_vision feature is enabled to upload screenshots. The issue is particularly acute for models with limited vision capabilities.
2. Lightweight by Design
It utilizes only nine WinRT libraries, strictly avoiding the introduction of any bloated dependencies (such as OpenCV). Icon rendering is implemented natively via the PIL library and generates no cache files on the hard drive. Actual latency is low.
Furthermore, by letting the agent query only the specific text segments it cares about, token consumption is reduced. The agent can also use approximate spatial descriptions to pinpoint text locations and filter out unwanted content.
Given that Windows' native OCR often yields poor recognition results—with particularly chaotic output formatting when processing certain complex languages—the aforementioned strategies effectively prevent the context window from becoming cluttered with large volumes of disorganized and incoherent OCR data.
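The spatial filtering described above could look like the sketch below. This is an assumption-laden illustration rather than the PR's code: the region names, the OcrLine type, and filter_by_region are all hypothetical, but the idea is the same, drop OCR lines whose centers fall outside the region the agent asked about, so noisy recognition output never reaches the context window.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OcrLine:
    text: str
    bbox: Tuple[int, int, int, int]  # (left, top, width, height) in pixels

# Hypothetical region names mapped to fractional screen areas
# as (x0, y0, x1, y1) in [0, 1].
REGIONS = {
    "top-left":     (0.0, 0.0, 0.5, 0.5),
    "top-right":    (0.5, 0.0, 1.0, 0.5),
    "bottom-left":  (0.0, 0.5, 0.5, 1.0),
    "bottom-right": (0.5, 0.5, 1.0, 1.0),
}

def filter_by_region(lines: List[OcrLine], region: str,
                     screen_w: int, screen_h: int) -> List[OcrLine]:
    """Keep only OCR lines whose center falls inside the named region."""
    x0, y0, x1, y1 = REGIONS[region]
    kept = []
    for line in lines:
        left, top, width, height = line.bbox
        cx = (left + width / 2) / screen_w   # fractional center x
        cy = (top + height / 2) / screen_h   # fractional center y
        if x0 <= cx < x1 and y0 <= cy < y1:
            kept.append(line)
    return kept

# Example on a 1920x1080 screen: one line near the top-left corner,
# one near the bottom-right.
lines = [
    OcrLine("Menu", (100, 50, 200, 30)),
    OcrLine("Status: OK", (1500, 900, 300, 30)),
]
print([l.text for l in filter_by_region(lines, "top-left", 1920, 1080)])
```

Pre-filtering like this is what keeps the chaotic output of Windows' native OCR on complex languages from flooding the agent's context.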
Changes
src/windows_mcp/__main__.py
src/windows_mcp/desktop/locate_text.py
Behavior
Receiving Parameters
Response
Testing