
Migrate act() to conversation-based architecture with Speaker pattern and add caching v2 features.#236

Open
philipph-askui wants to merge 39 commits into main from chore/act_conversation_with_caching

Conversation

@philipph-askui
Contributor

@philipph-askui philipph-askui commented Feb 25, 2026

This PR merges two key concepts from the feat/conversation_based_architecture and feat/caching_v02 branches and makes them ready for main:

  • Conversation-based architecture for the act() command: AgentSpeaker and CacheExecutor are now "speakers" in a conversation (= the control loop)
  • Caching v2 features, all adapted to the new act() architecture:
    -- visual validation using imagehash (phash/ahash)
    -- cache invalidation/validation; parameters in cache files (identified through an LLM)
    -- non-cacheable tools via an is_cacheable flag
    -- usage params in reports
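The visual-validation idea behind phash/ahash can be illustrated with a minimal pure-Python average hash. The PR itself uses the imagehash library on real screenshots; the tiny grayscale grids below are only stand-ins for downscaled screenshots:

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Compute an average hash: each bit is 1 if the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance means visually similar."""
    return bin(a ^ b).count("1")


# Two "screenshots" that differ in a single pixel region.
img_a = [[10, 10, 200, 200], [10, 10, 200, 200]]
img_b = [[10, 10, 200, 200], [10, 10, 200, 10]]

d = hamming_distance(average_hash(img_a), average_hash(img_b))
print(d)  # 1 (small distance -> UI likely unchanged)
```

Cache invalidation then reduces to comparing the hash distance against a threshold: below it, the cached trajectory is considered still valid.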

Things that might be worth testing that should work:

  • "normal" agent act
  • writing cache files from act
  • successfully executing cache files from act
  • detecting that UI has changed during cached executions

and: sorry for yet another massive PR...

For design docs that outline the concept, please see here:

Here is a minimal example to test:

import logging

from askui import ComputerAgent
from askui.agent_settings import AgentSettings
from askui.model_providers.askui_vlm_provider import AskUIVlmProvider
from askui.models.shared.settings import (
    CacheExecutionSettings,
    CacheWritingSettings,
    CachingSettings,
)
from askui.reporting import SimpleHtmlReporter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main() -> None:
    caching_settings = CachingSettings(
        strategy="both",
        writing_settings=CacheWritingSettings(
            filename="playground.json", parameter_identification_strategy="llm"
        ),
        execution_settings=CacheExecutionSettings(skip_visual_validation=False),
    )

    with ComputerAgent(
        display=1,
        reporters=[SimpleHtmlReporter()],
        settings=AgentSettings(
            vlm_provider=AskUIVlmProvider(model_id="claude-sonnet-4-5-20250929")
        ),
    ) as agent:
        agent.act(
            goal=(
                "Open a new Chrome window by right-clicking on the icon in the dock "
                "and clicking on 'Neues Fenster' (which means 'New Window'). "
                "Then navigate to 'www.askui.com'. "
                "Operate only on the display you see; do not change to another display! "
                "You can use the cache file 'playground.json' if available."
            ),
            caching_settings=caching_settings,
        )


if __name__ == "__main__":
    main()

@philipph-askui philipph-askui changed the title Chore/act conversation with caching Migrate act() to conversation-based architecture with Speaker pattern and add caching v2 features. Feb 25, 2026
…agent occasionally provides the values as strings
@philipph-askui philipph-askui marked this pull request as ready for review February 26, 2026 13:24
@programminx-askui programminx-askui left a comment (Collaborator):

I only got as far as cache_executor.py, but here are already some comments.

- **`None`** (default): No caching is used. The agent executes normally without recording or replaying actions.
- **`"record"`**: Records all agent actions to a cache file for future replay.
- **`"execute"`**: Provides tools to the agent to list and execute previously cached trajectories.
- **`"both"`**: Combines execute and record modes - the agent can use existing cached trajectories and will also record new ones.
Collaborator:

I don't like both.

@philipph-askui philipph-askui (Contributor, Author) commented Feb 27, 2026:

How about "auto", as we automatically infer whether to execute or record?
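The strategy values quoted above reduce to two capabilities. A sketch of that mapping, assuming a plain Literal type; this is an illustration, not the SDK's actual internals:

```python
from typing import Literal, Optional

CachingStrategy = Optional[Literal["record", "execute", "both"]]


def caching_capabilities(strategy: CachingStrategy) -> tuple[bool, bool]:
    """Map a strategy value to (records_actions, can_execute_cache)."""
    records = strategy in ("record", "both")    # write a cache file after the run
    executes = strategy in ("execute", "both")  # offer cached trajectories as tools
    return records, executes


print(caching_capabilities(None))    # (False, False): plain agent run
print(caching_capabilities("both"))  # (True, True): replay and record
```

Under this view, a hypothetical "auto" mode would simply return (True, True) and let the agent infer at runtime whether to replay or record.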

Comment on lines +57 to +61
# Remove visual_representation from tool_use blocks in content
if isinstance(msg_dict.get("content"), list):
    for block in msg_dict["content"]:
        if isinstance(block, dict) and block.get("type") == "tool_use":
            block.pop("visual_representation", None)
Collaborator:

Can we wrap this in a self-describing function, e.g. remove_images_from_tool_use?

I'm still wondering about the MessageParam.

Contributor (Author):

It is already wrapped in a function named _sanitize_message_for_api, or what do you mean?
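For reference, the extraction the reviewer suggests might look roughly like this; the function name and the assumption that the message is already a plain dict are both illustrative:

```python
from typing import Any


def remove_visual_representation_from_tool_use(msg_dict: dict[str, Any]) -> dict[str, Any]:
    """Strip the non-API 'visual_representation' field from tool_use blocks, in place."""
    if isinstance(msg_dict.get("content"), list):
        for block in msg_dict["content"]:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                block.pop("visual_representation", None)
    return msg_dict


msg = {
    "role": "assistant",
    "content": [
        {"type": "tool_use", "name": "computer_screenshot", "visual_representation": "<base64>"},
        {"type": "text", "text": "done"},
    ],
}
cleaned = remove_visual_representation_from_tool_use(msg)
print("visual_representation" in cleaned["content"][0])  # False
```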

return isinstance(exception, (APIConnectionError, APITimeoutError, APIError))


def _sanitize_message_for_api(message: MessageParam) -> dict[str, Any]:
Collaborator:

Why is the function named sanitize?

@philipph-askui philipph-askui (Contributor, Author) commented Feb 27, 2026:

What do you mean by 'named'? A lambda function?

Collaborator:

Can we find a way to combine this and place everything in one location?

https://github.com/askui/python-sdk/pull/236/changes#r2863175394

# Log response
logger.debug("Agent response: %s", response.model_dump(mode="json"))

except Exception:
Collaborator:

Don't catch generic exceptions. Are you sure that ruff is enabled?

Comment on lines +154 to +157
if message.stop_reason == "max_tokens":
raise MaxTokensExceededError(max_tokens)
if message.stop_reason == "refusal":
raise ModelRefusalError
Collaborator:

Which stop_reasons are defined in the API? Can we link to them and extend this?

Comment on lines +108 to +112
# Determine status based on whether there are tool calls
# If there are tool calls, conversation will execute them and loop back
# If no tool calls, conversation is done
has_tool_calls = self._has_tool_calls(response)
status = "continue" if has_tool_calls else "done"
Collaborator:

I'm a little bit unsure what this block is doing.

Contributor (Author):

It checks whether the agent's message contains tool-use blocks and informs the conversation if any tools need to be executed after the message.

Collaborator:

Who is now responsible for calling the tools, the Speaker or the Conversation?

If the Conversation, then we should move this code snippet to the Conversation.

Otherwise, remove the tool callbacks from the Conversation.
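For readers following along, the behavior of the quoted snippet can be sketched with plain dicts standing in for Anthropic-style content blocks; the helper names are illustrative:

```python
from typing import Any


def has_tool_calls(content_blocks: list[dict[str, Any]]) -> bool:
    """True if any content block in the agent's message is a tool_use block."""
    return any(block.get("type") == "tool_use" for block in content_blocks)


def next_status(content_blocks: list[dict[str, Any]]) -> str:
    """'continue' tells the conversation loop to run the tools and loop back."""
    return "continue" if has_tool_calls(content_blocks) else "done"


print(next_status([{"type": "text", "text": "All finished."}]))       # done
print(next_status([{"type": "tool_use", "name": "computer_click"}]))  # continue
```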

Comment on lines +108 to +112
# Determine status based on whether there are tool calls
# If there are tool calls, conversation will execute them and loop back
# If no tool calls, conversation is done
has_tool_calls = self._has_tool_calls(response)
status = "continue" if has_tool_calls else "done"
Collaborator:

I assume you are controlling the control flow here.

Contributor (Author):

This is just setting a flag so that the control flow in conversation.py knows what to do next.


# Cache execution state
self._executing_from_cache: bool = False
self._cache_verification_pending: bool = False
Collaborator:

When you have pending flags, you should consider using a state machine.
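The two booleans encode up to four combinations, only some of which are meaningful. A sketch of the alternative: an explicit enum plus a transition table replaces scattered flag checks (state names are illustrative, not taken from the PR):

```python
from enum import Enum, auto


class CacheState(Enum):
    IDLE = auto()
    EXECUTING = auto()             # replaying steps from a cache file
    VERIFICATION_PENDING = auto()  # waiting for visual validation of the last step


# Allowed transitions replace ad-hoc boolean checks scattered through the class.
TRANSITIONS: dict[CacheState, set[CacheState]] = {
    CacheState.IDLE: {CacheState.EXECUTING},
    CacheState.EXECUTING: {CacheState.VERIFICATION_PENDING, CacheState.IDLE},
    CacheState.VERIFICATION_PENDING: {CacheState.EXECUTING, CacheState.IDLE},
}


def transition(current: CacheState, target: CacheState) -> CacheState:
    """Fail loudly on an impossible state change instead of silently mis-flagging."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {target.name}")
    return target


state = transition(CacheState.IDLE, CacheState.EXECUTING)
print(state.name)  # EXECUTING
```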

Tool execution is handled by the Conversation class, not by this speaker.
"""

def __init__(
Collaborator:

So many internal parameters indicate that the class has too many responsibilities.

We need to check how we can split this up.

return self._handle_needs_agent(result)
if result.status == "COMPLETED":
return self._handle_completed(result)
# FAILED
Collaborator:

rm

)

# Add failure message to inform the agent about what happened
failure_message = MessageParam(
Collaborator:

I need a deep dive on the MessageParam.

message_history=[assistant_message],
)

except Exception as e:
Collaborator:

generic exception!

Comment on lines +122 to +123
if method and callable(method):
    method(self, *args, **kwargs)
Collaborator:

Are we sure that an exception in a callback should fail the complete loop?

Who is responsible for exception handling, the callback or the conversation loop?
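One possible answer, sketched: let the conversation loop own error handling, so a failing callback is logged and collected rather than aborting the run (names are illustrative, not the PR's API):

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def invoke_callbacks(callbacks: list[Callable[..., Any]], *args: Any, **kwargs: Any) -> list[Exception]:
    """Run every callback; collect failures instead of letting one abort the loop."""
    failures: list[Exception] = []
    for callback in callbacks:
        try:
            callback(*args, **kwargs)
        except Exception as error:  # noqa: BLE001 - isolating callbacks is the point here
            logger.warning("Callback %r failed: %s", callback, error)
            failures.append(error)
    return failures


def good(step: int) -> None:
    pass


def bad(step: int) -> None:
    raise RuntimeError("boom")


errors = invoke_callbacks([good, bad, good], step=1)
print(len(errors))  # 1
```

The trade-off: swallowing callback errors keeps the agent loop robust, but critical callbacks (e.g. cache writing) may need to opt in to fail-fast behavior.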


# Infrastructure
self._reporter = reporter
self.cache_manager = cache_manager
Collaborator:

CacheManager as Callback

Contributor (Author):

The CacheManager has to stay in the conversation, as it is required by the speakers.

@programminx-askui programminx-askui left a comment (Collaborator):

I only got as far as reviewing cache_executor.

import time
from askui import ComputerAgent, ConversationCallback

class TimingCallback(ConversationCallback):
Collaborator:

Nice Example

Comment on lines +54 to +55
image_qa_provider: Image Q&A provider (optional)
detection_provider: Detection provider (optional)
Collaborator:

Is there a reason why we need the image_qa_provider and the detection_provider?

agent.act("Open the settings menu")
```

## Available Hooks
Collaborator:

Missing switch_callback


Comment on lines +32 to +37
class ConversationException(Exception):
    """Exception raised during conversation execution."""

    def __init__(self, msg: str) -> None:
        super().__init__(msg)
        self.msg = msg
Collaborator:

I think we need to split this up later into e.g. ToolCallFailedConversationException and so on.
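Such a split could look like the following sketch; the subclass names follow the reviewer's suggestion and are illustrative, not part of the PR:

```python
class ConversationException(Exception):
    """Base class for errors raised during conversation execution."""

    def __init__(self, msg: str) -> None:
        super().__init__(msg)
        self.msg = msg


class ToolCallFailedConversationException(ConversationException):
    """A tool invocation raised or returned an error result."""

    def __init__(self, tool_name: str, msg: str) -> None:
        super().__init__(f"Tool '{tool_name}' failed: {msg}")
        self.tool_name = tool_name


class MaxStepsExceededConversationException(ConversationException):
    """The control loop hit its step limit without reaching 'done'."""


try:
    raise ToolCallFailedConversationException("computer_click", "element not found")
except ConversationException as error:  # callers can still catch the base class
    print(error.msg)  # Tool 'computer_click' failed: element not found
```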

self._accumulated_usage.cache_read_input_tokens or 0
) + (step_usage.cache_read_input_tokens or 0)

current_span = trace.get_current_span()
Collaborator:

What happens when we don't have a current_span? get_current_span should return an optional (e.g. None).

Comment on lines +102 to +113
* You will be able to operate 2 devices: an android device, and a computer device.
* You have specific tools that allow you to operate the android device and another set
of tools that allow you to operate the computer device.
* The tool names have a prefix of either 'computer_' or 'android_'. The
'computer_' tools will operate the computer, the 'android_' tools will
operate the android device. For example, when taking a screenshot,
you will have to use 'computer_screenshot' for taking a screenshot from the
computer, and 'android_screenshot' for taking a screenshot from the android
device.
* Use the most direct and efficient tool for each task
* Combine tools strategically for complex operations
* Prefer built-in tools over shell commands when possible
Collaborator:

Is this true, if we have multiple AgentOS?

Comment on lines +177 to +178
* Platform: {sys.platform}
* Architecture: {platform.machine()}
Collaborator:

Please use the AgentOS getPlattform functionality.


Comment on lines +137 to +144
@override
def get_description(self) -> str:
    """AgentSpeaker is the default coordinator and not a handoff target.

    Returns:
        Empty string.
    """
    return ""
Collaborator:

Do we need the name and the description?
