diff --git a/.claude/skills/devlog/SKILL.md b/.claude/skills/devlog/SKILL.md new file mode 100644 index 00000000..775f73bb --- /dev/null +++ b/.claude/skills/devlog/SKILL.md @@ -0,0 +1,68 @@ +--- +name: devlog +description: Write or update a devlog entry in the devlog directory. Use when the user asks to write a devlog, record a decision, document what happened, or says "write up what we did". +--- + +# Devlog + +Write or update a devlog entry in the devlog directory. + +## Devlog Format + +Devlog entries are development journal posts that capture decisions, discoveries, and plans as work happens. They're written for the team: concise, honest, and useful for future reference. + +**Filename:** `YYYY-MM-DD-<topic-slug>.md` + +Use today's date and a short kebab-case topic slug. + +## Entry Structure + +```markdown +--- +title: "Short Descriptive Title" +type: devlog +date: YYYY-MM-DD +--- + +# Title + +## Summary +1-3 sentences on what this entry covers. + +## <Section Name> +Use whatever sections make sense for the content. Common ones: +- What Changed +- Key Decisions (and rationale) +- What We Learned +- Open Questions +- Future Work + +Keep it direct. No filler. Write like you're explaining to a teammate who will read this in 3 months. +``` + +## Front Matter Fields + +| Field | Required | Values | +| :--- | :--- | :--- | +| `title` | Yes | Short descriptive title | +| `type` | Yes | `decision` (architectural/technical choice) or `devlog` (development note) | +| `date` | Yes | `YYYY-MM-DD` format | + +Use `type: decision` when recording a significant architectural or technical choice. Use `type: devlog` for development notes, debugging sessions, and implementation details. + +## Guidelines + +- **Be opinionated.** Capture *why* decisions were made, not just what happened. +- **Include the dead ends.** What didn't work and why is often more valuable than what did. +- **Link to context.** Reference PRs, branches, test names, and file paths so the entry is traceable.
+- **One entry per topic.** Don't combine unrelated work. Multiple entries on the same day is fine. + +## Before Writing + +1. Check existing devlog entries to avoid duplicating a topic +2. If updating an existing topic, consider appending to the existing entry rather than creating a new one +3. Review the current conversation context for decisions, discoveries, and rationale worth capturing + +## Invocation + +When the user says `/devlog`, ask what topic to write about if it's not clear from context. If you've been working on something substantial in the current session, suggest writing about that. \ No newline at end of file diff --git a/docs/devlog/2025-10-01-llm-as-compiler.md b/docs/devlog/2025-10-01-llm-as-compiler.md new file mode 100644 index 00000000..d0bb72c1 --- /dev/null +++ b/docs/devlog/2025-10-01-llm-as-compiler.md @@ -0,0 +1,64 @@ +--- +title: "LLM as Compiler Architecture" +type: decision +date: 2025-10-01 +--- + +# LLM as Compiler Architecture + +The core architectural insight behind Trailblaze — treating the LLM as a compiler rather than a chatbot. + +## Background + +Traditional UI test frameworks require developers to write explicit, imperative test code. We want to enable natural language test authoring while maintaining deterministic execution. + +## What we decided + +Trailblaze treats the **LLM as a compiler** that transforms natural language test cases into deterministic tool sequences. 
+ +### The Compiler Metaphor + +``` +Natural Language → LLM + Agent + Tools → Trail Recording + (Source) (Compiler) (Output/IR) +``` + +| Concept | Traditional Compiler | Trailblaze | +| :--- | :--- | :--- | +| Source | Code (.c, .kt) | Natural language test steps | +| Compiler | gcc, kotlinc | LLM + Trailblaze Agent | +| IR/Output | Assembly, bytecode | Trail YAML (tool sequence) | +| Runtime | CPU, JVM | Device + Maestro/Tools | + +### Compilation Flow + +``` +Test Case Steps → LLM interprets steps → Execute tools on device + ↓ ↓ ↓ + Natural Language Agent orchestration Success/Failure + ↓ ↓ ↓ + On failure: retry Record successful run + with context as .trail.yaml +``` + +### Key Properties + +- **Compilation happens once**: First successful run is recorded +- **Replay is deterministic**: Subsequent runs use recording, no LLM needed +- **Self-healing on failure**: LLM can adapt and retry when UI changes +- **Recompilation on demand**: Force AI mode to generate new recording + +### Agent Loop + +1. LLM receives test step + current screen state +2. LLM selects and invokes tools +3. Tools execute via Maestro/device drivers +4. On success → record tool invocation +5. On failure → provide error context, retry +6. After all steps → save complete `.trail.yaml` + +## What changed + +**Positive:** Natural language authoring, deterministic replay, self-healing capability, familiar mental model for engineers. + +**Negative:** Initial "compilation" requires LLM (cost/latency); recordings may need "recompilation" when UI changes significantly. 
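The compile-once, replay-afterwards behaviour described in this entry can be sketched in a few lines of Kotlin. This is a minimal illustration under stated assumptions, not Trailblaze's implementation: `ToolCall`, `TrailStore`, `Outcome`, and `runTest` are hypothetical names, the store is in-memory rather than a `.trail.yaml` file, and the two lambdas stand in for the agent loop and the replay engine.

```kotlin
// Hypothetical types for illustration; none of these names come from Trailblaze.
data class ToolCall(val name: String, val args: Map<String, String> = emptyMap())

enum class Outcome { REPLAYED, COMPILED }

// In-memory stand-in for recordings that would live in .trail.yaml files.
class TrailStore {
    private val recordings = mutableMapOf<String, List<ToolCall>>()
    fun load(testId: String): List<ToolCall>? = recordings[testId]
    fun save(testId: String, calls: List<ToolCall>) {
        recordings[testId] = calls
    }
}

fun runTest(
    testId: String,
    store: TrailStore,
    forceAi: Boolean = false,
    compileWithLlm: () -> List<ToolCall>,   // stands in for the LLM-driven agent loop
    replay: (List<ToolCall>) -> Boolean,    // returns true when every recorded call succeeds
): Outcome {
    // Skip the recording entirely when AI mode is forced.
    val recording = if (forceAi) null else store.load(testId)
    if (recording != null && replay(recording)) return Outcome.REPLAYED

    // No recording, forced AI mode, or a failed replay: "recompile" with the LLM.
    val fresh = compileWithLlm()
    store.save(testId, fresh)
    return Outcome.COMPILED
}
```

Passing `forceAi = true` corresponds to recompilation on demand. A failed replay falls through to the same path, which is the self-healing case.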
diff --git a/docs/devlog/2025-10-01-trail-recording-format.md b/docs/devlog/2025-10-01-trail-recording-format.md new file mode 100644 index 00000000..768b9ee7 --- /dev/null +++ b/docs/devlog/2025-10-01-trail-recording-format.md @@ -0,0 +1,174 @@ +--- +title: "Trail Recording Format (YAML)" +type: decision +date: 2025-10-01 +--- + +# Trail Recording Format (YAML) + +Building on our monorepo structure, we needed a format for recording UI test interactions. + +## Background + +Trailblaze uses an LLM to interpret natural language test steps and execute them. We need a way to capture successful executions as **deterministic recordings** that can replay without LLM involvement, ensuring consistency and reducing costs. + +## What we decided + +Trail recordings use a **YAML format** (`.trail.yaml`) that captures the mapping from natural language steps to tool invocations. + +### Format Structure + +```yaml +- prompts: + - step: Launch the app signed in with user@example.com + recording: + tools: + - app_ios_launchAppSignedIn: + email: user@example.com + password: "12345678" + - step: Add a pizza to the cart and click 'Review sale' + recording: + tools: + - scrollUntilTextIsVisible: + text: Pizza + direction: DOWN + - tapOnElementWithAccessibilityText: + accessibilityText: Pizza + - tapOnElementWithAccessibilityText: + accessibilityText: Review sale 1 item + - step: Verify the total is correct + recordable: false # Always uses AI, never replays from recording +``` + +### Step-Level Recordability + +Each step has a `recordable` flag (default: `true`): +- **`recordable: true`**: Step can be recorded and replayed deterministically +- **`recordable: false`**: Step always requires AI interpretation, even in recorded mode + +Use `recordable: false` for steps that need dynamic behavior (e.g., verification steps that should re-evaluate on each run). + +> **Note:** This is separate from tool-level `isRecordable` (see [Tool Execution Modes](2026-01-01-tool-execution-modes.md)). 
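As a rough sketch of how the step-level `recordable` flag could drive execution, the function below picks a mode per step. `Step`, `StepMode`, and `modeFor` are hypothetical names, and treating a recordable step that has no recording yet as an AI step is an assumption rather than documented behaviour.

```kotlin
// Hypothetical model of one entry under `prompts:`; not the real Trailblaze types.
enum class StepMode { REPLAY, AI }

data class Step(
    val text: String,
    val recordable: Boolean = true,                 // mirrors the documented default
    val recordedTools: List<String> = emptyList(),  // tool names captured in `recording.tools`
)

// A step replays only when it is recordable AND a recording exists.
// Assumption: a recordable step without a recording falls back to AI.
fun modeFor(step: Step): StepMode =
    if (step.recordable && step.recordedTools.isNotEmpty()) StepMode.REPLAY else StepMode.AI
```

With this rule, a verification step marked `recordable: false` is sent to the LLM on every run even if tool calls were captured for it earlier.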
+ +### Key Properties + +- **Human-readable**: YAML is easy to inspect, edit, and version control +- **Deterministic**: Recordings replay exactly the same tool sequence +- **Step-aligned**: Each natural language step maps to its tool invocations +- **Platform-specific**: Trails are stored per platform/device (e.g., `ios-iphone.trail.yaml`) + +### Storage Convention + +Trails are organized by test case hierarchy: +``` +trails/suite_{id}/section_{id}/case_{id}/ +├── ios-iphone.trail.yaml +├── ios-ipad.trail.yaml +└── android-phone.trail.yaml +``` + +### Execution Modes + +1. **AI Mode**: LLM interprets steps, executes tools, records successful runs +2. **Recorded Mode**: Replay existing `.trail.yaml` without LLM (fast, deterministic) + +### Raw Maestro Blocks (Deprecated) + +The trail format supports a `maestro:` block for raw Maestro commands: + +```yaml +# Deprecated - avoid in new trails +- maestro: + - tapOn: + id: "com.example:id/button" + - assertVisible: + text: "Success" +``` + +**This is deprecated.** Prefer using Trailblaze tools instead: + +```yaml +# Preferred - tools can be recorded and processed by the agent +- prompts: + - step: Tap the submit button and verify success + recording: + tools: + - tapOnElementWithText: + text: Submit + - assertVisible: + text: Success +``` + +**Principle:** Trailblaze supports a limited subset of Maestro. Every supported Maestro command should have a corresponding Trailblaze tool that: +- Can be selected by the LLM agent +- Can be recorded in trails +- Provides a consistent abstraction across platforms + +Raw `maestro:` blocks bypass the agent and recording system, making them harder to maintain and migrate. + +### No Conditionals in Trail Recordings + +Trail recordings intentionally contain **no conditional logic or branching**. A recording is simply a list of Trailblaze tool invocations that execute sequentially. 
+ +```yaml +# This is what a recording looks like - just tool calls, no conditionals +- prompts: + - step: Navigate to settings + recording: + tools: + - tapOnElementWithAccessibilityText: + accessibilityText: Settings + - waitForElementWithText: + text: Account Settings +``` + +**Why no conditionals?** + +1. **Simplicity**: Recordings are easy to read, review, and debug +2. **Determinism**: No runtime branching means predictable, reproducible execution +3. **Code is better for logic**: Conditional behavior belongs in custom Trailblaze tools (see [Tool Naming Convention](2026-01-14-tool-naming-convention.md) and [Custom Tool Authoring](2026-01-28-custom-tool-authoring.md)) + +**Where conditionals belong:** + +- **Custom tools**: App-specific or platform-specific tools can contain arbitrary code, including conditionals. For example, a `myapp_ios_handleOptionalPopup` tool might check for and dismiss a popup if present. +- **Within a single natural language step**: Test authors can write conditionals in the step text for LLM interpretation (e.g., "If a popup appears, dismiss it"). However, this requires AI mode and cannot be recorded. + +**What doesn't work:** Branching from one natural language step to different subsequent steps based on conditions. The step sequence in `trail.yaml` is always linear. + +### Non-Goal: Code Generation + +Trailblaze intentionally does **not** generate traditional test code (Playwright scripts, XCUITest, Espresso, etc.). While technically possible—recorded tool calls contain all necessary information—this is explicitly not a goal. + +**Trailblaze is a runtime, not a codegen tool.** + +Think of it like the difference between: +- **Java bytecode**: Runs on the JVM, not compiled to native code +- **Trail files**: Run on Trailblaze, not compiled to test scripts + +The trail format is the artifact. Trailblaze interprets and executes it. 
+ +**Why not generate code?** + +| Capability | Trail Runtime | Generated Code | +| :--- | :--- | :--- | +| AI Fallback | ✅ Re-derive from prompt when recording fails | ❌ Static—fails are just failures | +| Self-healing | ✅ Natural language is always available for recovery | ❌ Once generated, prompt is gone | +| Visual debugging | ✅ Desktop app replays with screenshots | ❌ Stack traces and logs only | +| Edit by non-engineers | ✅ Modify natural language steps | ❌ Must edit TypeScript/Swift/Kotlin | +| Cross-platform | ✅ One prompt, multiple recordings | ❌ Separate codegen per platform | + +**Positioning clarity:** + +Code generation would position Trailblaze as "yet another test recorder"—competing with Playwright Codegen, Appium Inspector, Maestro Studio, etc. These tools are mature and do codegen well. + +Trailblaze's value is different: **tests defined in natural language, recorded for deterministic replay, with AI fallback when recordings break**. The trail file is not an intermediate artifact to be compiled away—it's the test definition that retains its semantic meaning at runtime. + +**What about exporting for debugging?** + +For debugging purposes, Trailblaze could provide a "view as code" feature that shows what the equivalent Playwright/XCUITest code would look like—without actually generating runnable files. This helps developers understand what a recording does in familiar terms, while keeping the trail as the source of truth. + +## What changed + +**Positive:** Reproducible tests, reduced LLM costs on replay, easy debugging via readable YAML, version-controllable recordings. + +**Negative:** Platform-specific recordings may diverge; recordings become stale if UI changes. 
diff --git a/docs/devlog/2026-01-01-maestro-integration.md b/docs/devlog/2026-01-01-maestro-integration.md new file mode 100644 index 00000000..39a54269 --- /dev/null +++ b/docs/devlog/2026-01-01-maestro-integration.md @@ -0,0 +1,60 @@ +--- +title: "Maestro as Current Execution Backend" +type: decision +date: 2026-01-01 +--- + +# Maestro as Current Execution Backend + +Choosing our execution backend for driving UI interactions. + +## Background + +Trailblaze needs to interact with mobile devices to perform UI actions (taps, swipes, text input) and query screen state. Building and maintaining these low-level device interaction implementations across multiple platforms (Android, iOS) requires significant effort and ongoing maintenance. + +[Maestro](https://maestro.mobile.dev/) is an open source mobile UI testing framework that already provides robust, cross-platform device interaction capabilities with an active community. + +## What we decided + +**Trailblaze currently uses Maestro as its primary execution backend for device interactions, but Maestro is not an intrinsic part of the Trailblaze architecture.** + +Maestro handles the majority of UI interactions, but Trailblaze also uses **ADB commands and shell commands** directly for certain device control operations. This hybrid approach gives us flexibility—Maestro for high-level UI actions, and direct device commands when lower-level control is needed. 
+ +### Why Maestro (For Now) + +- **Avoids reinventing the wheel**: Maestro provides battle-tested implementations for taps, swipes, scrolls, text input, and screen queries across Android and iOS +- **Community maintenance**: We benefit from bug fixes, platform updates (new Android/iOS versions), and improvements contributed by the broader community +- **Reduced dependency surface**: Using a focused tool means we don't need to pull in larger testing framework dependencies + +### Not a Permanent Coupling + +Trailblaze's core value is in its LLM-driven test generation and trail recording/replay architecture—not in how device interactions are executed. We may choose to replace Maestro in the future if: + +- A better-suited tool emerges +- Our requirements diverge from Maestro's direction +- We need tighter control over the execution layer + +Tool implementations should remain abstracted such that swapping execution backends is feasible. + +### On-Device Orchestra Fork + +Maestro's standard architecture assumes a host machine driving a connected device. For Trailblaze's on-device execution mode, we maintain **a copy of Maestro's Orchestra code** in our codebase. + +This is necessary because: +- Maestro's base implementation doesn't work when running directly on the device +- Pulling in the full Maestro dependency would bring unnecessary transitive dependencies +- We need a minimal, self-contained implementation for the on-device use case + +**Maintenance requirement**: When upgrading Maestro versions, the Orchestra copy must be reviewed and updated to incorporate relevant changes while preserving on-device compatibility. 
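The requirement under "Not a Permanent Coupling" that tool implementations stay abstracted from the backend might look like the following sketch. `ExecutionBackend`, `RecordingFakeBackend`, and `TapTool` are hypothetical names for illustration; the actual Trailblaze interfaces are not shown in this entry.

```kotlin
// Hypothetical abstraction boundary: tools depend on this interface,
// never on Maestro, ADB, or another concrete driver directly.
interface ExecutionBackend {
    fun tap(text: String): Boolean
}

// Test double standing in for a Maestro-backed or ADB-backed implementation.
class RecordingFakeBackend : ExecutionBackend {
    val log = mutableListOf<String>()
    override fun tap(text: String): Boolean {
        log += "tap:$text"
        return true
    }
}

// The tool only knows the interface, so the backend can be swapped
// without touching tool code or recorded trails.
class TapTool(private val backend: ExecutionBackend) {
    fun execute(text: String): Boolean = backend.tap(text)
}
```

Swapping Maestro for another driver would then mean providing a new `ExecutionBackend` implementation while the tools stay unchanged.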
+ +## What changed + +**Positive:** +- Faster time-to-market by leveraging existing device interaction code +- Benefit from community improvements without maintaining low-level platform code +- Clear abstraction boundary makes future migration possible + +**Negative:** +- Dependent on external project's stability and direction +- Orchestra fork requires manual sync during Maestro upgrades +- Must track Maestro releases for security patches and compatibility updates diff --git a/docs/devlog/2026-01-01-tool-execution-modes.md b/docs/devlog/2026-01-01-tool-execution-modes.md new file mode 100644 index 00000000..3fc67281 --- /dev/null +++ b/docs/devlog/2026-01-01-tool-execution-modes.md @@ -0,0 +1,72 @@ +--- +title: "Tool Execution Modes" +type: decision +date: 2026-01-01 +--- + +# Tool Execution Modes + +As the tool system grew, we needed to formalize how tools execute across different environments. + +## Background + +Tools in Trailblaze serve different purposes. Some are meant for LLM selection, while others are precise implementation details for recordings. We need a way to classify tools by their execution characteristics. + +## What we decided + +Tools declare two boolean properties that determine their execution mode: + +```kotlin +@TrailblazeToolClass( + name = "tapOnElementWithText", + isForLlm = true, // Can LLM select this tool? + isRecordable = true, // Can this tool appear in recordings? +) +``` + +### Property Definitions + +| Property | Default | Meaning | +| :--- | :--- | :--- | +| `isForLlm` | `true` | Whether the LLM can select this tool. Set to `false` for implementation-detail tools that use unstable identifiers. | +| `isRecordable` | `true` | Whether this tool can appear in trail recordings. Set to `false` for wrapper tools that delegate to more precise tools. 
| + +### Tool Type Matrix + +| `isForLlm` | `isRecordable` | Type | Use Case | +| :--- | :--- | :--- | :--- | +| `true` | `true` | **Standard** | Normal tools (default) | +| `true` | `false` | **LLM-only** | Wrapper tools that delegate to more precise tools | +| `false` | `true` | **Recording-only** | Precise tools with unstable identifiers | +| `false` | `false` | **Internal** | Helper tools, not directly usable | + +### Use Cases + +**Standard (`isForLlm=true`, `isRecordable=true`)**: Most tools. LLM can select them, and they appear in recordings. + +**LLM-only (`isForLlm=true`, `isRecordable=false`)**: High-level tools the LLM selects, but recordings capture the delegated call instead. + +```kotlin +// LLM selects this (stable, text-based) +@TrailblazeToolClass(name = "tapOnElementWithText", isForLlm = true, isRecordable = false) +class TapOnElementWithText : Tool { + override fun execute(args: Args): Result { + val nodeId = findNodeIdByText(args.text) + return tapOnElementByNodeId.execute(nodeId) // Recording captures this + } +} +``` + +**Recording-only (`isForLlm=false`, `isRecordable=true`)**: Precise tools that use unstable identifiers (node IDs, coordinates). Not LLM-selectable because the identifiers change between runs. + +```kotlin +// Recording stores this (precise, but node IDs are unstable) +@TrailblazeToolClass(name = "tapOnElementByNodeId", isForLlm = false, isRecordable = true) +class TapOnElementByNodeId : Tool { ... } +``` + +## What changed + +**Positive:** Clean separation between LLM-facing and implementation tools; recordings can use more precise identifiers than what LLM reasons about. + +**Negative:** Requires understanding the delegation pattern; tool authors must choose modes deliberately. 
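The tool type matrix above reduces to a two-flag classification. The sketch below restates it in Kotlin; `ToolFlags`, `ToolType`, `classify`, and the two filter helpers are hypothetical names, not part of `@TrailblazeToolClass`.

```kotlin
// Hypothetical restatement of the isForLlm / isRecordable matrix.
enum class ToolType { STANDARD, LLM_ONLY, RECORDING_ONLY, INTERNAL }

data class ToolFlags(
    val name: String,
    val isForLlm: Boolean = true,      // documented default
    val isRecordable: Boolean = true,  // documented default
)

fun classify(tool: ToolFlags): ToolType = when {
    tool.isForLlm && tool.isRecordable -> ToolType.STANDARD
    tool.isForLlm -> ToolType.LLM_ONLY
    tool.isRecordable -> ToolType.RECORDING_ONLY
    else -> ToolType.INTERNAL
}

// Tools offered to the LLM for selection.
fun toolsForLlm(all: List<ToolFlags>): List<ToolFlags> = all.filter { it.isForLlm }

// Tools allowed to appear in a trail recording.
fun toolsForRecording(all: List<ToolFlags>): List<ToolFlags> = all.filter { it.isRecordable }
```

In the delegation example, `tapOnElementWithText` would classify as `LLM_ONLY` and `tapOnElementByNodeId` as `RECORDING_ONLY`, so the LLM sees the first and the recording stores the second.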
diff --git a/docs/devlog/2026-01-14-tool-naming-convention.md b/docs/devlog/2026-01-14-tool-naming-convention.md new file mode 100644 index 00000000..29db5992 --- /dev/null +++ b/docs/devlog/2026-01-14-tool-naming-convention.md @@ -0,0 +1,81 @@ +--- +title: "Tool Naming Convention" +type: decision +date: 2026-01-14 +--- + +# Tool Naming Convention + +With multiple tool authors contributing, we needed naming consistency. + +## Background + +Tool names must be **globally unique** because our serialization system uses the tool name as the sole lookup key. Additionally, we want to minimize LLM context by only exposing relevant tools, and keep schemas simple by avoiding platform-conditional parameters. + +## What we decided + +### Naming Convention + +| Category | Format | Example | +| :--- | :--- | :--- | +| Universal primitive | `{verbNoun}` | `tap`, `scroll`, `inputText` | +| Platform primitive | `{platform}_{verbNoun}` | `ios_clearKeychain`, `android_pressSystemBack` | +| Org-wide | `org_{verbNoun}` | `org_mockServer`, `org_resetTestEnvironment` | +| Org-wide + platform | `org_{platform}_{verbNoun}` | `org_ios_configureTestUser` | +| App-specific | `{app}_{verbNoun}` | `myapp_launchAppSignedIn`, `payments_addFunds` | +| App + platform | `{app}_{platform}_{verbNoun}` | `myapp_ios_scroll` | + +**Key rules:** +- Use **underscores** as separators (dots aren't supported in OpenAI function names) +- **Device type** (phone/tablet) should NOT appear in names—use execution context instead +- **Versioning**: Append `_v2`, `_v3` for breaking changes (e.g., `myapp_ios_scroll_v2`) + +### When to Use App + Platform Tools + +Only when parameter schemas differ materially between platforms: + +```kotlin +// Good: focused schemas +myapp_ios_launchAppSignedIn(permissions: [String], virtualCard: Bool) +myapp_android_launchAppSignedIn(permissions: [String], overlayPermission: Bool) + +// Avoid: confusing conditional parameters +myapp_launchAppSignedIn(permissions: [String], virtualCard: 
Bool?, overlayPermission: Bool?) +``` + +### Tool Metadata + +```kotlin +@TrailblazeToolClass( + name = "myapp_ios_launchAppSignedIn", + isForLlm = true, // false = deprecated or implementation-detail tool + isRecordable = true, + platforms = [Platform.IOS], +) +``` + +Tools are filtered before LLM exposure based on `isForLlm`, target app, and current platform. The executor must also reject unsupported tool calls at runtime. + +### Reserved Names + +Centrally owned and validated at build time: +- Global primitives (`tap`, `scroll`, etc.) +- Platform primitives (`ios_*`, `android_*`) +- Org-wide (`org_*`) +- App prefixes (`myapp`, `payments`, etc.) + +### Shared Implementation + +Platform-specific tools can share logic via delegation: + +```kotlin +class MyAppIosLaunchAppSignedIn : Tool { + override fun execute(args: Args) = sharedLaunchLogic(args, platformConfig = iosConfig) +} +``` + +## What changed + +**Positive:** Globally unique names, minimized LLM context, simple schemas, clear tool provenance. + +**Negative:** Requires naming discipline and potential migration of existing tools. diff --git a/docs/devlog/2026-01-28-agent-loop-implementation.md b/docs/devlog/2026-01-28-agent-loop-implementation.md new file mode 100644 index 00000000..d587076a --- /dev/null +++ b/docs/devlog/2026-01-28-agent-loop-implementation.md @@ -0,0 +1,165 @@ +--- +title: "Handwritten Agent Loop" +type: decision +date: 2026-01-28 +--- + +# Handwritten Agent Loop + +A core architectural choice — why we hand-wrote the agent loop instead of using a framework. + +## Background + +AI agents require an execution loop that orchestrates: + +1. Gathering context (screen state, test instructions, history) +2. Calling the LLM for reasoning and tool selection +3. Executing selected tools +4. Processing results and deciding next steps +5. Handling errors and retries +6. Recording successful runs + +Many agent frameworks exist (LangChain, AutoGen, CrewAI, etc.) that provide abstractions for this loop. 
We needed to decide whether to adopt an existing framework or implement our own. + +## What we decided + +**Trailblaze uses a handwritten while loop for its core agent execution.** + +### Implementation Overview + +The agent loop is a straightforward `while` loop that continues until the test completes (success or failure) or a termination condition is met: + +```kotlin +// Simplified conceptual representation +suspend fun runAgent(objective: Objective): TestResult { + val completedTools = mutableListOf<CompletedTool>() + + while (completedTools.size < MAX_ITERATIONS) { + // 1. Capture current screen state + val screenshot = captureScreenshot() + val viewHierarchy = captureViewHierarchy() + + // 2. Build fresh LLM request with current context + val request = buildRequest( + systemPrompt = SYSTEM_PROMPT, + objective = objective, + completedTools = completedTools, + screenshot = screenshot, + viewHierarchy = viewHierarchy + ) + + // 3. Call LLM for next action + val response = llmClient.chat(request) + + // 4. Execute tool calls sequentially + for (toolCall in response.toolCalls) { + val result = driver.executeTool(toolCall) + completedTools.add(CompletedTool(toolCall, result)) + + // Check for terminal conditions + if (result.isObjectiveComplete || result.isFailure) { + return result.toTestResult() + } + } + } + + return TestResult.Timeout("Exceeded $MAX_ITERATIONS iterations") +} +``` + +### Why Handwritten + +#### 1. Simplicity and Transparency + +A while loop is easy to understand, debug, and modify. New team members can read the code and understand exactly what the agent does. There's no framework abstraction layer to learn or work around. + +#### 2. Control Over Execution + +We have precise control over: + +- When and how the LLM is called +- How context is constructed for each request +- Tool execution ordering and sequencing +- What gets recorded and when +- Termination conditions and limits + +#### 3.
Mobile-Specific Requirements + +Trailblaze has unique requirements that existing agent frameworks don't address: + +- On-device execution with resource constraints +- Integration with platform drivers for device interactions +- Trail recording format (see [Trail Recording Format](2025-10-01-trail-recording-format.md)) +- Trail mode that replays recorded tool sequences without LLM calls + +#### 4. Avoiding Dependency Risk + +Agent frameworks are evolving rapidly. Depending on an external framework means: + +- Tracking breaking changes in a fast-moving ecosystem +- Working around framework limitations +- Framework bugs becoming our bugs +- Potential abandonment or direction changes + +### Loop Termination + +The loop terminates under the following conditions: + +- **Objective completion**: The agent calls the `objectiveStatus` tool with a `COMPLETED` or `FAILED` status, indicating the test objective has been achieved or cannot be completed +- **Assertion failure**: An assertion tool (e.g., `assertVisibleWithText`) fails, indicating an unexpected state +- **Element not found**: A required UI element cannot be located after the agent's attempts +- **Iteration limit**: A maximum of **50 LLM calls per step** prevents runaway execution + +Future improvements may include more sophisticated loop detection to identify when the agent is stuck repeating ineffective actions. + +### Tool Execution + +Tools execute **sequentially**, one at a time. Parallel tool execution is not supported because Trailblaze interacts with a UI—only one interaction can happen at a time on a device. + +Tools execute **once** without automatic retries at the loop level. If a tool needs retry logic, it must implement that internally. 
When a tool completes (successfully or not), the agent proceeds based on the result: + +- For terminal results (assertions, objective status), the loop may end +- For non-terminal results, the agent continues and relies on subsequent steps to detect any issues + +Tool calls delegate to platform drivers (Android or iOS) to perform actual device interactions. The Trailblaze tools provide a high-level abstraction, while drivers handle the device-specific implementation details. + +### Context Window Management + +Rather than maintaining a growing conversation history, Trailblaze constructs **each LLM request fresh**. On every iteration, the agent sends: + +- System prompt with instructions +- Current objective +- List of previously completed tools (providing execution history) +- Latest screenshot +- Current view hierarchy + +This "subagent" pattern keeps the context window manageable—typically under 10,000 input tokens—well within LLM limits. By always including the latest screen state and omitting stale information, we reduce LLM confusion and improve decision quality. + +### Running Trails (Replay Mode) + +When a test has a recorded trail, it can run in **trail mode** which bypasses the LLM entirely. The recorded tool sequence from the `.trail.yaml` file executes deterministically. See [Trail Recording Format](2025-10-01-trail-recording-format.md) for details on how trails are structured and when they're used. 
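The fresh-request approach from "Context Window Management" can be sketched as a pure function that rebuilds the request from scratch on every iteration. All names here (`LlmRequest`, `CompletedTool`, `buildRequest`) and the string-based screenshot reference are assumptions made for illustration.

```kotlin
// Hypothetical shapes; the real request and history types are not shown in this entry.
data class CompletedTool(val name: String, val succeeded: Boolean)

data class LlmRequest(
    val systemPrompt: String,
    val objective: String,
    val history: List<String>,   // compact summary of completed tools, not a chat transcript
    val screenshotRef: String,   // only the latest screenshot is ever included
    val viewHierarchy: String,   // only the current hierarchy is ever included
)

// Builds a new request each iteration instead of appending to a conversation.
fun buildRequest(
    systemPrompt: String,
    objective: String,
    completed: List<CompletedTool>,
    screenshotRef: String,
    viewHierarchy: String,
): LlmRequest = LlmRequest(
    systemPrompt = systemPrompt,
    objective = objective,
    history = completed.map { "${it.name}: ${if (it.succeeded) "ok" else "failed"}" },
    screenshotRef = screenshotRef,
    viewHierarchy = viewHierarchy,
)
```

Because earlier screenshots and hierarchies never enter the request, its size grows only with the short tool history, which is what keeps the input token count bounded.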
+ +### What the Loop Handles + +- **Context construction**: Building fresh LLM requests with current screen state, objectives, and execution history +- **LLM communication**: Calling the LLM, parsing responses, extracting tool calls +- **Tool execution**: Invoking tools sequentially, delegating to platform drivers +- **Recording**: Capturing successful tool sequences for trail replay +- **Termination**: Recognizing completion, failure, and limit conditions + +## What changed + +**Positive:** + +- Complete control over agent behavior +- Easy to understand, debug, and modify +- No external framework dependencies to manage +- Can optimize for mobile and on-device constraints +- Straightforward to add new capabilities + +**Negative:** + +- Must implement features that frameworks provide out-of-the-box +- No automatic benefit from framework improvements +- Requires more upfront implementation work +- Team must maintain all agent logic internally diff --git a/docs/devlog/2026-01-28-custom-tool-authoring.md b/docs/devlog/2026-01-28-custom-tool-authoring.md new file mode 100644 index 00000000..d674cb2d --- /dev/null +++ b/docs/devlog/2026-01-28-custom-tool-authoring.md @@ -0,0 +1,92 @@ +--- +title: "Custom Tool Authoring" +type: decision +date: 2026-01-28 +--- + +# Custom Tool Authoring + +Enabling teams to extend the framework with domain-specific tools. + +## Background + +Trailblaze's AI agent interacts with applications through tools—discrete operations like "tap on element," "enter text," or "verify screen content." While general-purpose tools handle most UI interactions, **custom tools** are critical for: + +1. **Speed** — A custom tool can perform complex multi-step operations in a single call, reducing LLM round-trips +2. **Reliability** — Custom tools can encode domain-specific knowledge, handle edge cases, and provide deterministic behavior where general tools might be brittle +3. 
**Abstraction** — Custom tools hide implementation complexity from the LLM, making prompts simpler and more focused + +Example: A `login(username, password)` custom tool is faster and more reliable than instructing the LLM to "tap the username field, enter the username, tap the password field, enter the password, tap the login button." + +## What we decided + +**Custom tools are currently authored in Kotlin code and registered programmatically with the agent.** + +### Current Approach + +Tools are defined as Kotlin classes or functions that: + +1. Implement a tool interface with name, description, and parameters +2. Contain execution logic that interacts with the device (via Maestro, ADB, etc.) +3. Return results that the LLM can interpret +4. Are registered with the agent at initialization time + +```kotlin +// Simplified example of custom tool definition +class LoginTool : Tool { + override val name = "login" + override val description = "Log into the app with provided credentials" + override val parameters = listOf( + Parameter("username", "string", "The username to log in with"), + Parameter("password", "string", "The password to use") + ) + + override suspend fun execute(params: Map): ToolResult { + // Implementation using Maestro/device commands + } +} +``` + +### Workflow for Adding Custom Tools + +1. Write tool implementation in Kotlin +2. Add tool to the appropriate module (framework or internal) +3. Register tool with the agent +4. 
Rebuild and redeploy + +### Recognized Limitations + +This approach has significant friction: + +- **Heavy process**: Adding a tool requires code changes, compilation, and redeployment +- **Developer expertise required**: Tool authors must know Kotlin and the Trailblaze internals +- **Slow iteration**: Testing and refining tools requires the full build cycle +- **Limited accessibility**: Non-developers (QA engineers, product managers) cannot easily create or modify tools + +### Future Opportunities + +We recognize this is an area for improvement. Potential future directions include: + +- **Declarative tool definitions**: YAML/JSON specifications that can be loaded at runtime +- **Scripting layer**: Lightweight scripting (e.g., Kotlin Script, or an embedded language) for tool logic +- **Dynamic tool loading**: Hot-reload tools without agent restart +- **Visual tool builder**: UI for non-developers to compose tools from primitives +- **MCP tool sources**: Loading tools dynamically from MCP servers + +These enhancements would lower the barrier to creating custom tools while preserving the power and flexibility of native Kotlin tools when needed. 
+ +## What changed + +**Positive:** + +- Full power of Kotlin for complex tool implementations +- Type safety and IDE support during development +- Tools can leverage any Kotlin/JVM library +- Consistent with the rest of the Trailblaze codebase + +**Negative:** + +- High barrier to entry for tool authoring +- Slow iteration cycle (code → compile → deploy → test) +- Requires developer involvement for all tool changes +- Not accessible to non-technical team members diff --git a/docs/devlog/2026-01-28-desktop-application.md b/docs/devlog/2026-01-28-desktop-application.md new file mode 100644 index 00000000..a6691e33 --- /dev/null +++ b/docs/devlog/2026-01-28-desktop-application.md @@ -0,0 +1,169 @@ +--- +title: "Desktop Application (Moving Away from IDE-based Execution)" +type: decision +date: 2026-01-28 +--- + +# Trailblaze Decision 016: Desktop Application (Moving Away from IDE-based Execution) + +## Context + +Early Trailblaze prototypes ran as an IntelliJ/Android Studio plugin. This made sense initially: QE engineers and mobile developers already have Android Studio open, and plugins can access IDE features like project context, device management, and integrated tooling. + +However, as Trailblaze evolved, the IDE-based approach revealed significant limitations: + +### IDE Plugin Constraints + +1. **IDE version coupling** — IntelliJ and Android Studio release frequently. Plugin APIs change between versions, requiring continuous maintenance to support the latest IDE releases alongside older versions still in use across teams. + +2. **Installation friction** — Users must install the plugin through the IDE's plugin marketplace, manage plugin updates separately from Trailblaze framework updates, and troubleshoot version conflicts with other plugins. + +3. **Resource contention** — Running AI-powered test automation within the IDE competes for memory and CPU with the IDE itself, Gradle builds, and other development tasks. Heavy operations can make the IDE sluggish. 
+ +4. **Limited audience** — Not all Trailblaze users need or want an IDE. QE engineers authoring tests, CI/CD pipelines executing tests, and MCP clients controlling devices don't require a full IDE. + +5. **Platform limitations** — IDE plugins can't easily provide native OS integrations (menu bar apps, system notifications, global hotkeys) that enhance the desktop experience. + +6. **Deployment complexity** — Different plugin versions for internal vs. open source builds, plugin signing requirements, and marketplace review processes add distribution overhead. + +### Evolving Usage Patterns + +Trailblaze usage has shifted toward patterns that don't require IDE integration: + +- **Trail authoring via MCP** — Engineers use Cursor, Claude Desktop, or other MCP clients to author trails conversationally (see [Decision 008](2026-01-28-trailblaze-mcp.md)) +- **CLI-driven execution** — CI/CD pipelines and local scripts invoke Trailblaze from the command line +- **Standalone test management** — QE teams want a dedicated interface for organizing, running, and debugging trails + +## Decision + +**Trailblaze is distributed as a standalone desktop application rather than an IDE plugin.** + +### Application Architecture + +The Trailblaze desktop application is a Compose Multiplatform app built with Kotlin (see [Decision 009](../../devlog/2026-01-28-kotlin-language.md)). It runs as a native application on macOS, with Linux support planned. 
+ +``` +┌─────────────────────────────────────────────────────────┐ +│ Trailblaze Desktop App │ +├─────────────────────────────────────────────────────────┤ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ +│ │ Trail │ │ Device │ │ Test Run │ │ +│ │ Editor │ │ Manager │ │ Dashboard │ │ +│ └─────────────┘ └─────────────┘ └─────────────────┘ │ +├─────────────────────────────────────────────────────────┤ +│ ┌─────────────────────────────────────────────────────┐│ +│ │ MCP Server (embedded) ││ +│ │ - Client Agent mode (default) ││ +│ │ - Runner mode ││ +│ │ - Trailblaze Agent mode ││ +│ └─────────────────────────────────────────────────────┘│ +├─────────────────────────────────────────────────────────┤ +│ ┌─────────────────────────────────────────────────────┐│ +│ │ Trailblaze Agent Core ││ +│ │ - Tool execution (Maestro, Playwright) ││ +│ │ - Trail recording & replay ││ +│ │ - Custom tools (app-specific) ││ +│ └─────────────────────────────────────────────────────┘│ +└─────────────────────────────────────────────────────────┘ + │ │ + ▼ ▼ + ┌──────────┐ ┌──────────────┐ + │ Android │ │ iOS │ + │ (ADB) │ │ (Simulator) │ + └──────────┘ └──────────────┘ +``` + +### Core Capabilities + +| Capability | Description | +| :--- | :--- | +| **Trail management** | Browse, edit, run, and debug trails. View step-by-step execution with screenshots. | +| **Device control** | Connect to Android devices (via ADB) and iOS simulators. Live screen mirroring. | +| **MCP server** | Embedded MCP server for integration with Cursor, Claude Desktop, and other clients. | +| **Test dashboard** | View test results, execution history, and failure analysis. | +| **Settings & configuration** | LLM provider setup, target app selection, platform preferences. | + +### Interaction Modes + +Trailblaze supports multiple interaction paradigms, all sharing the same underlying agent core: + +1. 
**CLI-driven** — Primary interface for scripting, CI/CD, and terminal workflows (`trailblaze run`, `trailblaze mcp`, etc.) +2. **GUI-driven** — Launch the desktop app (`trailblaze app`) for visual trail editing, debugging, and test management +3. **MCP-driven** — External agents connect via MCP to control devices and author trails (works with both headless and GUI modes) + +All three modes can be used interchangeably and share configuration. + +### Menu Bar Integration (macOS) + +The app runs primarily as a menu bar application, staying out of the way while providing quick access to: + +- Device status and connection +- Active MCP sessions +- Quick trail execution +- Recent test results + +A full window can be opened for detailed trail editing, test management, and debugging. + +### Relationship to IDE Workflows + +While Trailblaze no longer runs *within* the IDE, it integrates seamlessly with IDE-based workflows: + +- **MCP integration** — Cursor and other AI-enabled editors connect to Trailblaze via MCP +- **File watching** — The app can watch for trail file changes, enabling edit-in-IDE, run-in-Trailblaze workflows +- **Project awareness** — When launched from a project directory, Trailblaze discovers trails and configuration automatically + +Developers keep their IDE for code editing; Trailblaze handles UI test automation as a complementary tool. + +### Distribution + +Trailblaze is distributed as a **CLI tool that bundles the desktop application** (see [Decision 013](2026-01-28-distribution-model.md)): + +| Audience | Channel | Command | +| :--- | :--- | :--- | +| **Open source** | Homebrew | `brew install block/tap/trailblaze` | +| **Block internal** | Internal package source | `brew install block-internal/tap/trailblaze` | + +The `trailblaze` CLI is the primary entry point. 
It supports headless operation for CI/CD and scripting, and can launch the desktop GUI when needed: + +```bash +# Run a trail headlessly (CI/CD, scripts) +trailblaze run my-trail.yaml + +# Start the MCP server (headless) +trailblaze mcp + +# Launch the desktop application +trailblaze app +# or simply double-click the app bundle + +# Other CLI commands +trailblaze list # List available trails +trailblaze devices # Show connected devices +trailblaze config # Manage configuration +``` + +This approach provides: + +- **Single installation** — One `brew install` gives you both CLI and GUI +- **Terminal-first workflow** — CLI is the default; GUI is available when you need visual debugging or trail editing +- **CI/CD compatibility** — Headless operation works in automated pipelines +- **Version consistency** — CLI and desktop app are always the same version + +## Consequences + +**Positive:** + +- **Decoupled from IDE releases** — No more plugin API compatibility maintenance across IDE versions +- **Simplified installation** — Single package manager command instead of IDE plugin marketplace +- **Better performance** — Dedicated process with its own resources, doesn't compete with IDE +- **Broader audience** — Useful for QE engineers, CI/CD pipelines, and MCP clients without requiring an IDE +- **Native experience** — Menu bar integration, system notifications, and OS-level features +- **Unified distribution** — CLI and desktop app bundled together, always in sync +- **Flexible interaction** — CLI, GUI, and MCP modes all supported from one package + +**Negative:** + +- **Separate window** — Users must context-switch between IDE and Trailblaze app (mitigated by MCP integration) +- **No IDE project context** — Can't automatically access IDE's understanding of the codebase (partially mitigated by project awareness features) +- **Additional process** — Another application running alongside the IDE +- **macOS-first** — Linux and Windows support requires additional effort (Linux 
planned, Windows not currently prioritized) diff --git a/docs/devlog/2026-01-28-koog-llm-client.md b/docs/devlog/2026-01-28-koog-llm-client.md new file mode 100644 index 00000000..bbfc0fc1 --- /dev/null +++ b/docs/devlog/2026-01-28-koog-llm-client.md @@ -0,0 +1,167 @@ +--- +title: "Koog Library for LLM Communication" +type: decision +date: 2026-01-28 +--- + +# Koog Library for LLM Communication + +Selecting a Kotlin-native library for LLM communication. + +## Background + +Trailblaze needs to communicate with Large Language Models (LLMs) to power its AI-driven test generation and execution. This requires: + +1. A client library that handles LLM API communication +2. Support for multiple LLM providers (OpenAI, Anthropic, Azure, etc.) +3. Compatibility with our Kotlin codebase +4. Multiplatform support for on-device and host-based execution + +## What we decided + +**Trailblaze uses [Koog](https://github.com/JetBrains/koog)/[koog.ai](https://koog.ai) as its LLM client library.** + +### What is KOOG + +KOOG (Kotlin AI Orchestration and Operations Gateway) is JetBrains' Kotlin-native library for LLM interactions. It provides: + +- A unified API for multiple LLM backends +- First-class Kotlin support with coroutines and type safety +- Kotlin Multiplatform (KMP) support +- Tool/function calling abstractions +- Streaming response support + +### Why KOOG + +#### 1. Kotlin-Native + +KOOG is written in Kotlin for Kotlin developers. It leverages Kotlin idioms like coroutines, sealed classes, and extension functions. This aligns with Trailblaze being a Kotlin-first project (see [Kotlin Language](2026-01-28-kotlin-language.md)). + +```kotlin +// KOOG provides idiomatic Kotlin APIs +val response = llm.chat { + system("You are a UI testing agent...") + user(buildPrompt(screenState, testStep)) + tools(availableTools) +} +``` + +#### 2. Standard for Kotlin + +KOOG is developed by JetBrains, the creators of Kotlin. 
This makes it the de facto standard for LLM communication in the Kotlin ecosystem. Using a standard library means: + +- Better community support and documentation +- More likely to receive long-term maintenance +- Easier to find developers familiar with the library +- Integration with other JetBrains tooling + +#### 3. Multiple Backend Support + +KOOG provides a unified interface across LLM providers: + +| Provider | Support | +| :--- | :--- | +| OpenAI | Full support | +| Anthropic | Full support | +| Azure OpenAI | Full support | +| Google AI | Full support | +| Local models (Ollama) | Supported | + +This allows Trailblaze to: + +- Switch providers without code changes +- Use different providers for different use cases +- Support customer/enterprise requirements for specific providers + +#### 4. Multiplatform Support + +KOOG supports Kotlin Multiplatform (KMP), which is essential for Trailblaze's cross-platform goals: + +- **JVM**: Host-based execution on developer machines and CI +- **Android**: On-device agent execution +- **iOS** (future): Potential iOS agent support + +The same LLM communication code can run across all platforms. + +#### 5. Tool Calling Abstractions + +KOOG provides built-in support for function/tool calling, which is central to how Trailblaze agents work. The library handles: + +- Tool schema generation from Kotlin types +- Parsing tool calls from LLM responses +- Serialization of tool results back to the LLM + +### Integration with Trailblaze + +KOOG integrates into the Trailblaze architecture at the LLM communication layer: + +``` +Agent Loop → KOOG Client → LLM Provider + ↓ ↓ +Tool Execution Response Parsing +``` + +The agent loop (see [Agent Loop Implementation](2026-01-28-agent-loop-implementation.md)) calls KOOG to communicate with the LLM, passing screen state and receiving tool calls to execute. + +### Authentication and Configuration + +LLM authentication is handled through **environment variables** containing API keys. 
The specific variables depend on the provider: + +- **OpenAI**: `OPENAI_API_KEY` +- **Anthropic**: `ANTHROPIC_API_KEY` +- **OpenAI-compatible endpoints**: `OPENAI_API_KEY` + `OPENAI_BASE_URL` + +See the [Open Source LLM Documentation](https://block.github.io/trailblaze/llms) for the full list of supported providers and configuration options. + +### Model Selection + +The LLM model is selected **before test execution** — currently, a single model is used for all requests within a test run. Future enhancements may include: + +- Using different models for different request types (e.g., cheaper models for simple decisions, more capable models for complex reasoning) +- Optimizing for cost and speed based on task complexity + +### Rate Limiting and Retries + +Rate limiting and retry logic is handled at the **agent loop level** rather than within KOOG. The agent loop implements iteration limits and handles transient failures. See [Agent Loop Implementation](2026-01-28-agent-loop-implementation.md) for details on execution control and termination conditions. + +### Streaming + +While KOOG supports streaming responses, Trailblaze does **not currently use streaming** for LLM responses. This is because the agent primarily operates through tool calls, which are typically returned as complete responses rather than streamed. + +However, the **Trailblaze desktop app provides real-time updates** to users — logs and progress are displayed after every step. This "streaming" experience comes from the logging layer, not the LLM client itself. See [Logging and Reporting](2026-01-28-logging-and-reporting.md) for details. + +### Token and Cost Tracking + +Trailblaze maintains a **list of models and their pricing** to calculate and display: + +- **Cost per test**: Individual test execution costs +- **Suite pricing**: Combined pricing for full test suite runs in CI + +This helps teams understand and optimize their LLM usage costs. 
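
The per-test and per-suite cost calculation described above could look roughly like the following. This is a sketch under stated assumptions, not the Trailblaze implementation: `ModelPricing`, `LlmUsage`, the function names, the model names, and every price in the table are hypothetical placeholders.

```kotlin
// Hypothetical sketch of per-test and per-suite cost tracking.
// None of these names come from the Trailblaze codebase, and the prices are placeholders.

/** Price of a model in USD per one million tokens. */
data class ModelPricing(val inputPerMillion: Double, val outputPerMillion: Double)

/** Token usage reported for a single LLM request. */
data class LlmUsage(val model: String, val inputTokens: Long, val outputTokens: Long)

// Placeholder price table; real prices would come from each provider's published rates.
val pricing: Map<String, ModelPricing> = mapOf(
    "example-small" to ModelPricing(inputPerMillion = 1.0, outputPerMillion = 2.0),
    "example-large" to ModelPricing(inputPerMillion = 10.0, outputPerMillion = 20.0),
)

/** Cost of one request, or null when the model has no known price. */
fun requestCostUsd(usage: LlmUsage): Double? {
    val price = pricing[usage.model] ?: return null
    return usage.inputTokens / 1_000_000.0 * price.inputPerMillion +
        usage.outputTokens / 1_000_000.0 * price.outputPerMillion
}

/** Cost per test: sum over the requests made during one test (unknown models count as zero). */
fun testCostUsd(requests: List<LlmUsage>): Double =
    requests.sumOf { requestCostUsd(it) ?: 0.0 }

/** Suite pricing: sum of the per-test costs for a full run. */
fun suiteCostUsd(tests: Map<String, List<LlmUsage>>): Double =
    tests.values.sumOf { testCostUsd(it) }
```

Returning `null` for an unpriced model keeps "unknown" distinguishable from "free" at the request level; whether a suite total should silently treat unknown models as zero is a design choice this sketch makes only for brevity.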
+ +### Local Models (Ollama) + +Support for local models via Ollama serves the **open source community**. This allows developers to: + +- Try Trailblaze without paying for or configuring a remote LLM provider +- Run tests in air-gapped or offline environments +- Experiment with different models locally + +Ideally, local model support enables Trailblaze to be **used by anyone, anywhere** without requiring paid LLM services. + +## What changed + +**Positive:** + +- Idiomatic Kotlin APIs that fit naturally in our codebase +- JetBrains backing provides confidence in long-term support +- Multi-provider support gives flexibility in LLM selection +- KMP support enables our cross-platform strategy +- Built-in tool calling reduces boilerplate + +**Negative:** + +- Younger library compared to Python alternatives (LangChain, etc.) — though as noted in [Kotlin Language](2026-01-28-kotlin-language.md), choosing Kotlin inherently means a smaller AI/ML ecosystem, and KOOG is the standard within that ecosystem +- Smaller ecosystem of extensions and integrations +- Must track KOOG releases and handle any breaking changes +- Some advanced features may lag behind Python-first libraries diff --git a/docs/devlog/2026-01-28-kotlin-language.md b/docs/devlog/2026-01-28-kotlin-language.md new file mode 100644 index 00000000..18b10d28 --- /dev/null +++ b/docs/devlog/2026-01-28-kotlin-language.md @@ -0,0 +1,80 @@ +--- +title: "Kotlin as Primary Language" +type: decision +date: 2026-01-28 +--- + +# Kotlin as Primary Language + +Choosing a primary language that works across Android, desktop, and server environments. + +## Background + +Trailblaze is an AI-powered UI testing agent that needs to run in multiple environments: + +1. **On Android devices** — directly within the Android instrumentation process, enabling execution on remote device farms without a connected host machine +2. 
**On host machines** — driving connected devices for local development and CI workflows + +We needed a language that could satisfy these runtime requirements while also integrating well with the existing mobile testing ecosystem. + +A key requirement was leveraging Block's existing Android device farm infrastructure, which enables massive parallelization for test execution. Running the agent on-device (rather than from a host) allows us to use this infrastructure without building custom connectivity solutions. + +## What we decided + +**Trailblaze is written in Kotlin.** + +### Why Kotlin + +#### 1. Android Runtime Compatibility + +Kotlin runs natively in Android's runtime environment (ART). This is critical for our on-device execution mode, where the Trailblaze agent runs directly within the Android instrumentation process on the device being tested. No additional runtimes, interpreters, or bridges are needed—Kotlin code compiles to bytecode that runs on the device like any other Android application. + +This enables two distinct execution modes: + +- **On-Device Android Driver** — The agent runs entirely on the device within instrumentation, ideal for device farm execution +- **Android Host Driver** — The agent runs on a connected machine using standard Maestro mechanics, useful for local development + +#### 2. Maestro Integration + +The Maestro framework, which provides device drivers and commands for mobile platforms, is written in Kotlin. By choosing Kotlin for Trailblaze, we can: + +- Directly leverage Maestro's APIs without cross-language bridges +- Extend and customize Maestro components easily +- Maintain a fork of Maestro's Orchestra code for on-device execution (see [Maestro Integration](2026-01-01-maestro-integration.md)) +- Contribute upstream to Maestro when appropriate + +For the on-device version of Trailblaze, we exclude Maestro dependencies that are not required for on-device execution (such as JavaScript engines and Playwright web drivers). 
This keeps the on-device artifact lean. See [Maestro Integration](2026-01-01-maestro-integration.md) for details on our Maestro integration strategy. + +#### 3. Compose Web for Reporting + +Kotlin enables us to use Compose Web (Wasm) for rendering test reports in the browser. This allows us to share models and UI components between the agent and the reporting interface without any modifications or translation layers. This is critical for delivering rich, interactive reports in CI/CD environments. + +This would not be possible with Java, which lacks equivalent web compilation targets. + +#### 4. Strong Typing and Developer Experience + +Kotlin's type system catches errors at compile time, which is valuable for an agent framework where tool definitions, parameters, and LLM responses need careful handling. IDE support (IntelliJ/Android Studio) provides excellent autocomplete, refactoring, and debugging. + +### Considered Alternatives + +| Language | Pros | Why Not | +| :--- | :--- | :--- | +| Java | Mature ecosystem, Android-native | No Compose Web support for shared reporting UI | +| Python | Rich AI/ML ecosystem, rapid prototyping | No native Android execution, would require embedding an interpreter | +| TypeScript | Popular for tooling, async-native | Same runtime challenges as Python on Android | + +## What changed + +**Positive:** + +- Single codebase runs on Android devices (via instrumentation) and host machines +- Enables execution on Block's device farm infrastructure without custom solutions +- Seamless integration with Maestro and Android ecosystem +- Shared models and UI components between agent and web-based reports via Compose Web +- Strong typing reduces runtime errors in tool definitions and LLM parsing +- Team can leverage existing Kotlin/JVM expertise + +**Negative:** + +- Smaller AI/ML library ecosystem compared to Python — however, [KOOG](https://github.com/JetBrains/koog) (JetBrains' Kotlin-native LLM library) enables LLM communication and 
multi-provider support within the Kotlin ecosystem. See [Koog LLM Client](2026-01-28-koog-llm-client.md). +- Learning curve for team members not familiar with Kotlin diff --git a/docs/devlog/2026-01-28-logging-and-reporting.md b/docs/devlog/2026-01-28-logging-and-reporting.md new file mode 100644 index 00000000..58d1c309 --- /dev/null +++ b/docs/devlog/2026-01-28-logging-and-reporting.md @@ -0,0 +1,133 @@ +--- +title: "Logging and Reporting Architecture" +type: decision +date: 2026-01-28 +--- + +# Logging and Reporting Architecture + +Designing structured logging that works across agent runs, CI, and desktop. + +## Background + +AI agents are notoriously difficult to debug. When a test fails, understanding *why* requires visibility into: + +- What the agent "saw" (screen state, view hierarchy) +- What the LLM was asked and what it responded +- Which tools were executed and their results +- The sequence of events leading to failure + +Without detailed logging, debugging becomes guesswork. Additionally, we need to present this information in ways that are accessible during development (desktop app) and in CI results (web reports). + +## What we decided + +**Trailblaze implements a structured logging system (`TrailblazeLog`) that captures detailed agent activity, which powers both the desktop app's real-time view and generated web reports.** + +### Structured Log Events + +All agent activity is captured as typed log events that inherit from `TrailblazeLog`. 
Each log includes: + +- **Session ID**: Groups logs for a single test execution +- **Timestamp**: Precise timing for event ordering + +Key log types capture different aspects of agent behavior: + +| Log Type | Purpose | +| :--- | :--- | +| `TrailblazeSessionStatusChangeLog` | Test lifecycle (started, completed, failed) | +| `TrailblazeLlmRequestLog` | LLM prompts, responses, tool calls, and cost | +| `TrailblazeToolLog` | Tool execution results and timing | +| `MaestroDriverLog` | Low-level device interactions | +| `MaestroCommandLog` | Maestro command execution details | +| `ObjectiveStartLog` / `ObjectiveCompleteLog` | Test step progress | +| `TrailblazeSnapshotLog` | User-initiated screen captures | +| `TrailblazeAgentTaskStatusChangeLog` | Agent task state transitions | + +### Rich Context Capture + +Logs capture rich context for debugging: + +- **Screenshots**: Screen captures at key moments (LLM requests, tool execution) +- **View Hierarchies**: Full and filtered UI tree for element inspection +- **LLM Messages**: Complete conversation history with the model +- **Tool Options**: Available tools at each decision point +- **Usage/Cost**: Token counts and estimated costs per LLM request +- **Durations**: Timing for each operation + +### Log Storage + +Logs are written to disk as JSON files organized by session: + +``` +logs/ +└── 2026-01-28_14-30-00_LoginTest/ + ├── 001_TrailblazeSessionStatusChangeLog.json + ├── 002_TrailblazeLlmRequestLog.json + ├── 002_screenshot.png + ├── 003_TrailblazeToolLog.json + ├── 004_MaestroDriverLog.json + └── ... 
+``` + +This file-based approach enables: + +- Persistence across restarts +- Easy sharing of debug artifacts +- Simple archiving in CI systems +- Reactive file watching for live updates + +### Desktop App Integration + +The desktop app uses `LogsRepo` to provide a real-time view of test execution: + +- **Live updates**: File watchers detect new logs and update the UI immediately +- **Session list**: Browse all test sessions with status indicators +- **Log timeline**: Step through events chronologically +- **Screenshot viewer**: See exactly what the agent saw +- **View hierarchy inspector**: Explore the UI tree at any point +- **LLM conversation viewer**: Review prompts and responses + +This makes the desktop app an essential development tool—engineers can watch tests execute in real-time and immediately understand failures. + +### Web Report Generation + +The `trailblaze-report` module generates static HTML/WASM reports from log data: + +1. **Log collection**: Gather logs from test execution (local or CI) +2. **Report generation**: Bundle logs with a WebAssembly-based viewer +3. **Static output**: Single-file HTML that can be viewed in any browser + +Reports provide the same inspection capabilities as the desktop app but as a shareable artifact. This is critical for CI pipelines where: + +- Test failures need investigation without access to the original machine +- Results must be archived for compliance or historical analysis +- Multiple team members need to review the same failure + +### Why Custom Logging (Not Standard Logging Frameworks) + +We chose structured `TrailblazeLog` events over traditional logging (Log4j, SLF4J) because: + +1. **Type safety**: Sealed class hierarchy ensures all logs have required fields +2. **Rich data**: Screenshots and view hierarchies can't be captured in text logs +3. **Queryable**: Logs can be filtered by type, searched, and analyzed programmatically +4. **UI-friendly**: Typed events map directly to UI components +5. 
**Cross-platform**: Same log format works on Android, desktop, and web + +Traditional logging is still used for framework-level debugging, but `TrailblazeLog` captures the semantically meaningful agent events. + +## What changed + +**Positive:** + +- Debugging agent failures becomes tractable with full context +- Desktop app provides immediate feedback during development +- CI reports enable async investigation of failures +- Screenshots and hierarchies make visual debugging possible +- Structured format enables tooling (analysis, comparison, search) + +**Negative:** + +- Log files can become large (especially with screenshots) +- Disk I/O overhead during test execution +- Custom log viewer required (can't use standard log tools) +- Log format changes require updates to viewers diff --git a/docs/devlog/2026-01-28-trailblaze-mcp.md b/docs/devlog/2026-01-28-trailblaze-mcp.md new file mode 100644 index 00000000..554a3845 --- /dev/null +++ b/docs/devlog/2026-01-28-trailblaze-mcp.md @@ -0,0 +1,272 @@ +--- +title: "Trailblaze MCP" +type: decision +date: 2026-01-28 +--- + +# Trailblaze Decision 008: Trailblaze MCP + +## Context + +Trailblaze provides LLM-driven UI automation for mobile applications. + +Historically, single-agent approaches to UI automation required the agent to maintain screen state (view hierarchies, screenshots) within its own conversation. This caused two problems: + +1. **Context window bloat**: Each step added more screen state to the conversation, eventually exhausting the context limit +2. **LLM confusion**: Multiple screen states in the same conversation led to the model reasoning about outdated UI or conflating different screens + +Trailblaze addresses this with a **subagent architecture**: each step is handled by a fresh agent conversation that only receives the current screen state. The orchestrating layer maintains continuity while subagents operate statelessly on the latest UI. 
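
The subagent pattern described above can be sketched as a loop that builds a brand-new conversation for every step. This is only an illustration of the idea: `ScreenState`, `Message`, `Subagent`, and `runSteps` are hypothetical names, and the LLM call is replaced by a pluggable function.

```kotlin
// Hypothetical sketch of the subagent pattern; names are illustrative, not Trailblaze APIs.

/** Snapshot of the device UI at one moment (view hierarchy text only, for brevity). */
data class ScreenState(val viewHierarchy: String)

/** A single message in an LLM conversation. */
data class Message(val role: String, val content: String)

/** Stand-in for an LLM call: takes a conversation and returns the chosen action. */
fun interface Subagent {
    fun decide(conversation: List<Message>): String
}

/**
 * Runs each step with a fresh conversation that contains only the current
 * screen state, so earlier screens never accumulate in the context window.
 * Returns the actions that were executed, in order.
 */
fun runSteps(
    steps: List<String>,
    captureScreen: () -> ScreenState,
    subagent: Subagent,
    execute: (String) -> Unit,
): List<String> {
    val actions = mutableListOf<String>()
    for (step in steps) {
        val screen = captureScreen() // always the latest UI, never a stale one
        val conversation = listOf(
            Message("system", "You are a UI testing agent."),
            Message("user", "Step: $step\nScreen: ${screen.viewHierarchy}"),
        )
        val action = subagent.decide(conversation)
        execute(action)
        actions += action // continuity lives in the orchestrator, not in the conversation
    }
    return actions
}
```

Because the conversation is rebuilt on every iteration, its size stays constant regardless of how many steps have already run, which is the property that avoids both context bloat and stale-screen confusion.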
+ +External and internal teams have expressed interest in integrating with Trailblaze via MCP for device control: + +- **Block mobile engineers and Firebender**: Automate mobile UI interactions to remove human-in-the-loop friction during development—typically throwaway trails for quick validation of a flow +- **Test authoring, execution, and infrastructure**: Enable developers and QE to create, run, and manage persistent UI tests that run continuously +- **General device control**: Provide MCP-based mobile device control for any agent or tool that needs to interact with mobile applications + +A key principle: **author once, run deterministically**. While the subagent approach is used during initial authoring (exploring the UI, figuring out the right steps), the result is a recorded trail. Subsequent runs use the trail deterministically without LLM reasoning—fast, predictable, and cost-free. + +**Trail recording** works through sessions: a new session starts automatically when interactions begin, and everything within that session is recorded. Users explicitly indicate when they want to finalize a trail from their actions, allowing them to review in the Trailblaze desktop app before sending it for automated execution. + +**Trail storage**: Trails are persisted as `trail.yaml` files on disk. At Block, trails are stored in a dedicated directory and referenced by path. For the internal test infrastructure, if a trail doesn't exist on disk, it can be generated from natural language via the TestTrail system. + +**AI fallback** can recover from trail failures due to UI changes, but is disabled by default. This preserves determinism and avoids LLM costs. When a trail step fails, Trailblaze reports the failure to the MCP client, which can then decide whether to invoke AI-assisted recovery using natural language prompts. + +**Custom tools** are a key benefit of Trailblaze. 
By specifying a target app, teams get access to app-specific tools that expose functionality beyond standard UI interactions. For example, an app target can provide a tool for quickly logging into staging or test accounts, providing the same access as debug menus without navigating through the UI. + +The [Model Context Protocol (MCP)](https://modelcontextprotocol.io) provides a standardized interface for exposing Trailblaze capabilities to external AI systems. Trailblaze uses the **Streamable HTTP** transport, which allows MCP clients to connect via HTTP POST requests to a session-based endpoint. See the [MCP setup guide](../../mcp/index.md) for connection details. + +## Decision + +**Introduce a Trailblaze MCP server with multiple modes that support different integration patterns. Tools are dynamically registered based on the current mode, and clients can switch modes during a session.** + +### Operating Modes + +The modes are defined by two questions: +1. **Who is the agent?** (Who decides what actions to take) +2. **Where does the LLM come from?** (Who provides the "brain") + +--- + +#### Mode 1: `MCP_CLIENT_LIKE_GOOSE_AS_AGENT` (Dumb Tools) + +| Aspect | Value | +|--------|-------| +| **Who's the agent** | MCP client (e.g., Goose, Firebender) | +| **LLM source** | MCP client's LLM | +| **Trailblaze exposes** | Primitive tools only (`tap`, `swipe`, `inputText`, `getScreenshot`, `viewHierarchy`) | +| **Trailblaze role** | Dumb tool executor - no reasoning | + +``` +Goose: "I see login button" → tap(150, 300) +Goose: "I see text field" → inputText("username") +Goose: "I see password field" → inputText("password") +``` + +Trailblaze is completely dumb. Just executes what the MCP client tells it. 
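
As a rough sketch of what "dumb tool executor" means in Mode 1, the server side can be a plain dispatch from a primitive call to a device action, with no LLM involved. `PrimitiveCall`, `Tap`, `InputText`, `Device`, and `executePrimitive` are hypothetical names chosen for this example, and only two of the primitives are shown.

```kotlin
// Hypothetical sketch of a reasoning-free primitive tool executor.
// Type and function names are illustrative, not Trailblaze APIs.

/** The primitive calls an MCP client may send in this mode (subset). */
sealed interface PrimitiveCall
data class Tap(val x: Int, val y: Int) : PrimitiveCall
data class InputText(val text: String) : PrimitiveCall

/** Minimal device abstraction; a real driver would sit on top of Maestro or ADB. */
interface Device {
    fun tap(x: Int, y: Int)
    fun inputText(text: String)
}

/** Executes exactly what was requested and reports back; no LLM, no decisions. */
fun executePrimitive(call: PrimitiveCall, device: Device): String = when (call) {
    is Tap -> {
        device.tap(call.x, call.y)
        "tapped (${call.x}, ${call.y})"
    }
    is InputText -> {
        device.inputText(call.text)
        "typed ${call.text.length} characters"
    }
}
```

The sealed hierarchy makes the `when` exhaustive, so adding a new primitive forces the executor to handle it at compile time.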
+ +--- + +#### Mode 2: `TRAILBLAZE_AGENT_WHILE_LOOP` (Local LLM) + +| Aspect | Value | +|--------|-------| +| **Who's the agent** | Trailblaze | +| **LLM source** | Trailblaze's local LLM (configured provider) | +| **Trailblaze exposes** | `runPrompt()` only | +| **Trailblaze role** | Full agent - does all reasoning and execution | + +``` +Goose: runPrompt("login to the app") +Trailblaze: *thinks using configured LLM* → tap → type → tap → done +Goose: *waits, gets result* +``` + +The MCP client just kicks off the task. Trailblaze does everything internally. + +--- + +#### Mode 3: `MCP_CLIENT_LIKE_GOOSE_WITH_SAMPLING` (Tunneled LLM) + +| Aspect | Value | +|--------|-------| +| **Who's the agent** | MCP client (high-level) + Trailblaze (low-level execution) | +| **LLM source** | MCP client's LLM (tunneled via MCP Sampling) | +| **Trailblaze exposes** | High-level tools (`runPrompt`, `switchToolSet`) | +| **Trailblaze role** | Sub-agent that borrows MCP client's brain | + +``` +Goose: runPrompt("tap the login button") ← Goose decides WHAT task +Trailblaze: *needs to think* → asks Goose via sampling: "where is login button?" +Goose's LLM: "it's at (150, 300)" +Trailblaze: *taps* → returns result +Goose: runPrompt("enter username sam") ← Goose decides NEXT task +``` + +**Goose drives the conversation** (decides what tasks to do next). +**Trailblaze borrows Goose's brain** for the low-level "how" decisions via MCP Sampling. + +--- + +#### Mode 4: `TRAILBLAZE_AGENT_RECURSIVE_MCP` (Future - Self-Connection) + +| Aspect | Value | +|--------|-------| +| **Who's the agent** | Trailblaze | +| **LLM source** | Trailblaze's local LLM (configured provider) | +| **Trailblaze exposes** | `runPrompt()` only | +| **Trailblaze role** | Full agent that calls its OWN MCP tools | + +``` +Goose: runPrompt("login to the app") +Trailblaze Agent: *thinks using local LLM* +Trailblaze Agent: → calls tap() via MCP (to itself!) +Trailblaze Agent: → calls inputText() via MCP (to itself!) 
+Trailblaze Agent: → done, returns to Goose +``` + +Same external interface as Mode 2, but internally the agent uses MCP for tool execution (self-connection). This creates **architectural symmetry** - external MCP clients and internal agent use the exact same tool interface. + +--- + +### Mode Summary Table + +| Mode | Agent | LLM Source | Trailblaze Role | Status | +|------|-------|------------|-----------------|--------| +| `MCP_CLIENT_LIKE_GOOSE_AS_AGENT` | MCP client | MCP client | Dumb tool executor | ✅ Implemented | +| `TRAILBLAZE_AGENT_WHILE_LOOP` | Trailblaze | Local (configured LLM) | Full agent | ✅ Implemented | +| `MCP_CLIENT_LIKE_GOOSE_WITH_SAMPLING` | MCP client + Trailblaze | MCP client (tunneled) | Sub-agent, borrows brain | ✅ Implemented | +| `TRAILBLAZE_AGENT_RECURSIVE_MCP` | Trailblaze | Local (configured LLM) | Full agent via self-MCP | 🔮 Future | + +**Note on deterministic execution**: Trail recording/playback is orthogonal to these modes. If a trail has recordings, it runs deterministically without LLM calls regardless of mode. + +--- + +### Session State + +The MCP server is single-tenant—one session controls one device at a time. Settings like target device, target app, and platform are retained within a session and persist across reconnections. + +### Dynamic Tool Management + +Tools change based on multiple dimensions: +- **Mode**: Switching between modes changes available tools +- **Target app**: App-specific tools for your configured app targets +- **Target platform**: iOS vs Android may expose different capabilities +- **Tool categories**: Subagents can dynamically swap toolsets to reduce context window usage + +Users can configure settings via the Trailblaze desktop app or via MCP tools (e.g., `setMode`, `setTargetApp`). + +### Scope + +This design assumes **local device control**: the MCP server runs on the same machine as the MCP client, with devices connected directly via ADB (Android) or as physical/simulated iOS devices. 
Remote device farms and cloud-based device provisioning are out of scope. + +## Consequences + +**Positive:** +- Single MCP server supports multiple integration patterns +- Client Agent mode requires no Trailblaze LLM configuration +- Dynamic mode switching enables seamless transitions between execution and authoring +- MCP Sampling enables the subagent pattern, preventing context window exhaustion +- Deterministic trail execution by default keeps costs low and behavior predictable + +**Negative:** +- TRAILBLAZE_AS_AGENT mode requires LLM configuration +- MCP_CLIENT_AS_AGENT mode with subagent orchestration requires clients that support MCP Sampling +- Single-tenant design limits to one device per session + +--- + +## Implementation Summary + +### What Was Built + +| Component | Description | +|-----------|-------------| +| **Session Configuration** | `TrailblazeMcpMode`, `ScreenshotFormat`, `ViewHierarchyVerbosity`, `LlmCallStrategy` enums and configurable `TrailblazeMcpSessionContext` | +| **Session Config Tools** | `getSessionConfig`, `setMode`, `setScreenshotFormat`, `setAutoIncludeScreenshot`, `setViewHierarchyVerbosity`, `setLlmCallStrategy`, `configureSession` - all use enum parameters directly for type safety | +| **Dynamic Tool Categories** | `ToolSetCategory` enum with `DynamicToolSetManager` for per-session tool state | +| **Tool Management Tools** | `listToolCategories`, `enableToolCategories`, `addToolCategory`, `removeToolCategory`, `focusOnCategory`, plus presets (`useMinimalTools`, `useStandardTools`, `useTestingTools`) | +| **MCP Sampling Support** | `McpSamplingClient` using MCP Kotlin SDK's `ServerSession.createMessage()` and `SubagentOrchestrator` for multi-step automation | +| **Progress Notifications** | `McpProgressNotifier` bridges LogsRepo events to MCP progress notifications | +| **Multi-Session Support** | New transport + MCP server instance per client, allowing simultaneous connections | +| **Bridge Entry Point** | `runYamlBlocking()` 
method encapsulates MCP-specific blocking execution with progress callbacks | +| **Cancellation Propagation** | MCP session lifecycle wired to automation cancellation | +| **MCP Tool Executor** | `McpToolExecutor` interface with `DirectMcpToolExecutor` for in-process tool execution | +| **Dual Sampling Source** | `SamplingSource` interface with `LocalLlmSamplingSource`, `McpClientSamplingSource`, and `SamplingSourceResolver` | +| **Koog MCP Agent** | `KoogMcpAgent` using Koog's native `AIAgent` with MCP tools via self-connection | +| **LLM Call Strategy** | `LlmCallStrategy` enum (DIRECT/MCP_SAMPLING) for selecting how LLM API calls are made | +| **Agent Metrics** | `AgentMetricsCollector` tracking success/failure rates, `getAgentMetrics` and `clearAgentMetrics` tools | +| **LLM Wiring** | Optional `llmClientProvider` and `llmModelProvider` in `TrailblazeMcpServer` for local LLM fallback | + +### Type-Safe Enum Parameters + +MCP tool parameters use **enum types directly** instead of strings. Koog and the MCP SDK serialize enums automatically via kotlinx.serialization. + +**Benefits**: +- **LLM visibility**: Enum values are enumerated in the tool schema, so LLMs see all valid options +- **Type safety**: No runtime parsing errors from invalid string values +- **Cleaner code**: No `fromString()` boilerplate in enum companions + +**Example**: `setMode(mode: TrailblazeMcpMode)` instead of `setMode(mode: String)`. The LLM sees the schema includes `MCP_CLIENT_AS_AGENT` and `TRAILBLAZE_AS_AGENT` as valid values. + +### Two-Tier Tool Management Pattern + +For subagents to reduce context window usage: + +1. **Parent LLM** selects initial tool categories based on the high-level task +2. **Subagent** can swap categories as it discovers what it needs + +This reduces context window usage by 50-80% compared to exposing all tools. 
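The two-tier pattern can be sketched roughly as follows. This is a simplified illustration, assuming a hypothetical `ToolCategory` enum and `SessionToolSet` class; the real `ToolSetCategory` values and the `DynamicToolSetManager` API are not shown in this entry and may differ.

```kotlin
// Hypothetical categories; the real ToolSetCategory values are not listed in this entry.
enum class ToolCategory(val tools: List<String>) {
    INTERACTION(listOf("tap", "swipe", "inputText")),
    OBSERVATION(listOf("getScreenshot", "viewHierarchy")),
    SESSION(listOf("setMode", "setTargetApp")),
}

// Per-session tool state: only enabled categories contribute tools to the schema the LLM sees.
class SessionToolSet(initial: Set<ToolCategory>) {
    private val enabled = initial.toMutableSet()

    fun add(category: ToolCategory) { enabled += category }
    fun remove(category: ToolCategory) { enabled -= category }

    // Drop everything except a single category.
    fun focusOn(category: ToolCategory) {
        enabled.clear()
        enabled += category
    }

    // Tools are listed in enum declaration order, regardless of the order they were enabled.
    fun visibleTools(): List<String> =
        ToolCategory.values().filter { it in enabled }.flatMap { it.tools }
}

fun main() {
    // Tier 1: the parent LLM picks initial categories for the high-level task.
    val session = SessionToolSet(setOf(ToolCategory.OBSERVATION))
    println(session.visibleTools())

    // Tier 2: the subagent swaps categories once it knows it needs to act.
    session.focusOn(ToolCategory.INTERACTION)
    println(session.visibleTools())
}
```

Because each session owns its own set, one client narrowing its tools does not affect what another session exposes.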
+ +### MCP Logging Infrastructure + +Structured `TrailblazeLog` events for MCP agent operations, enabling visibility in the Trailblaze desktop app and debugging: + +| Log Type | Purpose | +|----------|---------| +| `McpAgentRunLog` | Full agent run lifecycle - objective, transport mode, iteration count, final result | +| `McpAgentIterationLog` | Per-iteration details - iteration number, LLM completion, tool called, result | +| `McpSamplingLog` | LLM completion requests - messages, model, tokens, duration, strategy | +| `McpAgentToolLog` | Tool execution - tool name, arguments, result, duration, transport mode | + +All log types use enum types (`AgentToolTransport`, `LlmCallStrategy`) for type safety, defined in `trailblaze-models` so logs can reference them. + +--- + +## Known Limitations + +1. **MCP Sampling**: Most MCP clients (Cursor, Firebender) don't support `sampling/createMessage`. Goose does support it. Use TRAILBLAZE_AS_AGENT mode with DIRECT LLM strategy as the recommended fallback. + +2. **Manual Refresh Required After Server Restart**: Sessions are in-memory only. Trailblaze returns HTTP 404 per [MCP spec](https://modelcontextprotocol.io/specification/draft/basic/transports#session-management), but Cursor/Firebender don't auto-reconnect ([known client bug](https://forum.cursor.com/t/mcp-client-wrong-handling-of-http-not-found-in-session-management-stateful-mcp-server/134781)). Manual refresh is required. + +3. **Single Device Per Session**: Each MCP session controls one device at a time. + +--- + +## Future Direction: Two-Tier Agent Architecture + +> **See [Decision 025: Two-Tier Agent Architecture](./2026-02-04-trail-blaze-agent-architecture.md)** for the next evolution of agent design. 
+ +The two-tier architecture separates concerns: +- **Outer Agent** (MCP client like Goose, or Koog in standalone): Planning, replanning, cross-system orchestration +- **Inner Agent** (Trailblaze): Screen understanding, action recommendation, device execution + +This enables **model specialization** (cheap vision model for screen analysis, expensive reasoning model for planning) and **cross-system testing** where the outer agent coordinates mobile UI + filesystem + database + API verification. + +--- + +## Architecture: TrailblazeMcpBridge + +`TrailblazeMcpBridgeImpl` is the **primary entry point** for all MCP-specific operations, bridging MCP's request/response model and Trailblaze's internal async architecture. + +| Aspect | Desktop UI | MCP | +|--------|-----------|-----| +| Execution model | Fire-and-forget | Must block until completion | +| Progress | Shown in UI | Streamed as MCP notifications | +| Session continuity | UI maintains state | Bridge manages per-device sessions | +| Cancellation | User clicks Stop | MCP session close triggers cancellation | + +**Bridge Responsibilities:** +- Device selection and session management +- YAML execution (`runYaml()` fire-and-forget, `runYamlBlocking()` for MCP) +- Screen state access and tool execution +- Cancellation propagation diff --git a/docs/devlog/2026-01-29-ai-fallback.md b/docs/devlog/2026-01-29-ai-fallback.md new file mode 100644 index 00000000..986888a1 --- /dev/null +++ b/docs/devlog/2026-01-29-ai-fallback.md @@ -0,0 +1,155 @@ +--- +title: "AI Fallback" +type: decision +date: 2026-01-29 +--- + +# Trailblaze Decision 021: AI Fallback + +## Context + +A core value proposition of Trailblaze is that **natural language is always the source of truth** for test definitions. 
As described in [Decision 002](../../devlog/2025-10-01-trail-recording-format.md), trail recordings (`.trail.yaml` files) are an *optimization*—they capture successful executions as deterministic tool sequences that can replay without LLM involvement, reducing costs and ensuring consistency. + +However, recordings are inherently tied to the application state at the time they were captured. When the application changes—a new onboarding popup appears, button text is updated, a feature flag changes the UI flow—recorded tool calls may fail. Rather than treating this as an immediate test failure, Trailblaze can leverage the natural language source of truth to attempt recovery. + +This is **AI Fallback**: when recorded steps fail, Trailblaze falls back to AI interpretation of the natural language steps, allowing tests to navigate through UI inconsistencies and complete successfully. + +## Decision + +**Trailblaze implements AI Fallback as a configurable execution feature that re-interprets natural language steps when recorded tool calls fail, distinguishing these recoveries with a specific test result status.** + +### Natural Language as Source of Truth + +Every Trailblaze test is defined by natural language steps: + +```yaml +- prompts: + - step: Launch the app and sign in with user@example.com + - step: Navigate to Settings + - step: Verify the account email is displayed +``` + +These steps represent the *intent* of the test. A recording captures *one way* to accomplish that intent: + +```yaml +- prompts: + - step: Navigate to Settings + recording: + tools: + - tapOnElementWithAccessibilityText: + accessibilityText: Settings + - waitForElementWithText: + text: Account Settings +``` + +When the recording fails (e.g., the "Settings" button was renamed to "Preferences"), the natural language step "Navigate to Settings" still clearly describes what should happen. AI Fallback uses this to recover. + +### How AI Fallback Works + +1. 
**Recorded execution begins**: Trailblaze executes the recorded tool calls for each step +2. **Tool call fails**: A tool call returns an error (element not found, assertion failed, timeout, etc.) +3. **Fallback triggered**: Instead of failing immediately, Trailblaze switches to AI mode for the current step +4. **LLM interprets step**: The natural language step is sent to the LLM, which analyzes the current screen state and determines the appropriate actions +5. **Execution continues**: If the LLM successfully completes the step, execution proceeds to the next step (which may continue in recorded or fallback mode depending on configuration) +6. **Result marked**: The test result is marked with a distinct status indicating AI Fallback was used + +### Configuration Options + +AI Fallback can be enabled or disabled based on execution context: + +| Configuration | Behavior | +| :--- | :--- | +| `aiFallback: enabled` | When recorded steps fail, fall back to AI interpretation | +| `aiFallback: disabled` | Recorded step failures immediately fail the test | + +**When to enable fallback:** + +- CI pipelines where test stability is prioritized over strict determinism +- Tests running against frequently-changing areas of the application +- Environments where minor UI inconsistencies are expected (e.g., feature flags, A/B tests) + +**When to disable fallback:** + +- Recording new trails (fallback would mask recording issues) +- Validating that recordings are up-to-date +- Performance-critical pipelines where LLM latency is unacceptable +- Debugging specific recording failures + +### Test Result Statuses + +AI Fallback introduces a distinct test result status to provide visibility into how tests succeeded: + +| Status | Description | +| :--- | :--- | +| `PASSED` | Test succeeded using recordings only (no AI involvement) | +| `PASSED_WITH_AI_FALLBACK` | Test succeeded, but one or more steps required AI fallback | +| `PASSED_AI_MODE` | Test ran entirely in AI mode (no recording or 
recording intentionally skipped) | +| `FAILED` | Test failed (even after AI fallback attempts, if enabled) | + +The `PASSED_WITH_AI_FALLBACK` status is critical for several reasons: + +1. **Recording staleness detection**: A high rate of fallback-assisted passes indicates recordings need updating +2. **Pipeline health monitoring**: Teams can track fallback usage over time and set thresholds +3. **Debugging context**: When investigating test behavior, knowing fallback was used helps explain differences from expected execution +4. **Cost awareness**: AI fallback incurs LLM costs; tracking helps with budget planning + +### Interaction with Step-Level Recordability + +As noted in [Decision 002](../../devlog/2025-10-01-trail-recording-format.md), individual steps can be marked `recordable: false` to always use AI interpretation: + +```yaml +- step: Verify the total matches the expected value + recordable: false # Always uses AI +``` + +AI Fallback is different—it applies to steps that *have* recordings but whose recordings fail at runtime. The two features are complementary: + +- **`recordable: false`**: Intentionally always use AI (design decision) +- **AI Fallback**: Gracefully recover when recordings unexpectedly fail (resilience mechanism) + +### Fallback Scope and Continuation + +When AI Fallback is triggered for a step: + +1. **Step scope**: The LLM re-interprets only the failing step, not the entire test +2. **Screen context**: The LLM receives the current screen state (screenshot, view hierarchy) +3. **Continuation**: After successful fallback, the next step attempts recorded execution first (if available) +4. **Cascading fallback**: If subsequent recorded steps also fail, fallback is triggered for each independently + +This step-by-step approach minimizes LLM usage while maximizing recovery opportunities. 
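The step-scoped fallback and the resulting statuses can be sketched as below. `Step`, `runTrail`, and `interpretWithAi` are hypothetical names for illustration, and the recording is modeled as a plain function that returns whether replay succeeded. How a trail that mixes recorded and unrecorded steps should be reported is not defined in this entry; the sketch reports it as `PASSED`, which is an assumption.

```kotlin
// Hypothetical types for illustration; not Trailblaze's actual API.
enum class TestStatus { PASSED, PASSED_WITH_AI_FALLBACK, PASSED_AI_MODE, FAILED }

// A step always has natural language text; the recording is optional.
class Step(val text: String, val recording: (() -> Boolean)? = null)

// `interpretWithAi` stands in for LLM-driven execution of a single step.
fun runTrail(
    steps: List<Step>,
    aiFallbackEnabled: Boolean,
    interpretWithAi: (String) -> Boolean,
): TestStatus {
    var usedFallback = false
    var usedRecording = false
    for (step in steps) {
        val recording = step.recording
        if (recording == null) {
            // No recording: the step runs in AI mode by design.
            if (!interpretWithAi(step.text)) return TestStatus.FAILED
            continue
        }
        usedRecording = true
        if (recording()) continue // recorded replay succeeded, no LLM call
        // Recorded replay failed: fall back for this step only, if allowed.
        if (!aiFallbackEnabled || !interpretWithAi(step.text)) return TestStatus.FAILED
        usedFallback = true
        // The next step tries its own recording first again.
    }
    return when {
        usedFallback -> TestStatus.PASSED_WITH_AI_FALLBACK
        usedRecording -> TestStatus.PASSED // assumption: mixed trails also land here
        else -> TestStatus.PASSED_AI_MODE
    }
}

fun main() {
    val steps = listOf(
        Step("Launch the app and sign in") { true },
        // Simulates a stale recording, e.g. "Settings" was renamed to "Preferences".
        Step("Navigate to Settings") { false },
        Step("Verify the account email is displayed") { true },
    )
    println(runTrail(steps, aiFallbackEnabled = true) { true })  // PASSED_WITH_AI_FALLBACK
    println(runTrail(steps, aiFallbackEnabled = false) { true }) // FAILED
}
```

Keeping the fallback flag per run rather than per step matches the configuration table above, where `aiFallback` is either enabled or disabled for the whole execution.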
+ +### Example Scenario + +Consider a test with this step: + +```yaml +- step: Dismiss any promotional popups and navigate to the main screen + recording: + tools: + - waitForElementWithText: + text: Welcome to MyApp + - tapOnElementWithAccessibilityText: + accessibilityText: Home +``` + +**Without AI Fallback:** If a new "What's New" popup appears before the Welcome screen, the `waitForElementWithText` call fails, and the test fails immediately. + +**With AI Fallback:** The tool call fails, fallback is triggered, the LLM sees the "What's New" popup, dismisses it, then proceeds to navigate to the main screen. The test passes with `PASSED_WITH_AI_FALLBACK` status. + +## Consequences + +**Positive:** + +- Tests are more resilient to minor UI changes, reducing flakiness +- Natural language remains the authoritative test definition, with recordings as an optimization +- Clear visibility into fallback usage enables informed decisions about recording maintenance +- Teams can balance determinism and resilience based on their specific needs +- Recordings can remain valid longer, reducing maintenance burden + +**Negative:** + +- AI Fallback incurs LLM costs when triggered +- Fallback-assisted passes may mask recordings that need updating if not monitored +- Execution time increases when fallback is triggered (LLM latency) +- Test behavior may vary slightly between recorded and fallback execution paths +- Requires monitoring and alerting on fallback rates to maintain recording health diff --git a/docs/devlog/2026-01-29-device-specific-trail-recordings.md b/docs/devlog/2026-01-29-device-specific-trail-recordings.md new file mode 100644 index 00000000..87baa389 --- /dev/null +++ b/docs/devlog/2026-01-29-device-specific-trail-recordings.md @@ -0,0 +1,208 @@ +--- +title: "Device-Specific Trail Recordings" +type: decision +date: 2026-01-29 +--- + +# Trailblaze Decision 019: Device-Specific Trail Recordings + +## Context + +Trailblaze tests need to execute across multiple 
platforms (Android, iOS, web) and device types (phones, tablets, custom hardware). While the *intent* of a test is the same—"Sign in and verify the dashboard loads"—the actual UI interactions often differ significantly between platforms and device form factors. + +Consider these real-world differences: + +- **Platform differences**: iOS uses accessibility identifiers while Android uses resource IDs; navigation patterns differ (back gesture vs. back button) +- **Form factor differences**: Tablets may show split-view layouts where phones show single screens; tablet tests might need additional taps to select items from a master list +- **Hardware-specific UI**: Custom hardware devices may have unique screen sizes and hardware buttons that differ from consumer devices +- **Resolution/density differences**: Scroll distances, tap coordinates, and visible element counts vary between device sizes + +Attempting to use a single recording across all platforms and devices leads to flaky tests and false failures. + +## Decision + +**Trail recordings are stored per-platform and per-device type, with `trail.yaml` (natural language) as the authoritative source of truth.** + +### The Source of Truth: `trail.yaml` + +The `trail.yaml` file contains the natural language test steps. This file defines *what* the test should do, independent of *how* it's executed on any specific device: + +```yaml +- prompts: + - step: Launch the app signed in with test@example.com + - step: Navigate to the Items screen + - step: Create a new item named "Test Product" + - step: Verify the item appears in the list +``` + +This `trail.yaml` is the source of truth because: + +1. **It represents the test intent**: The natural language steps capture what the test is verifying, not implementation details +2. **It's human-readable and reviewable**: QE engineers can understand and update test steps without knowing platform-specific interactions +3. 
**It can be managed externally**: Test case management systems can generate or sync the natural language steps +4. **It enables AI interpretation**: When no recording exists, the LLM agent interprets these steps to execute the test + +> **Important:** If a device-specific recording exists (e.g., `android-phone.trail.yaml`), it is used instead of interpreting the natural language. The recording contains the same steps but with captured tool invocations. + +### Device Classifiers + +Trailblaze uses a **classifier system** to identify devices and resolve the appropriate recording. Classifiers are ordered from most general to most specific: + +1. **Platform classifier** (required, always first): `android`, `ios`, `web`, or custom platform identifiers +2. **Form factor classifier**: `phone`, `tablet`, `iphone`, `ipad`, etc. +3. **Future classifiers** (extensible): API version, orientation (`landscape`/`portrait`), specific device models + +At runtime, the `TrailblazeDeviceClassifiersProvider` interface detects the current device and returns its classifiers. For example, a Pixel phone returns `["android", "phone"]` and an iPad returns `["ios", "ipad"]`. + +The framework provides a default implementation for standard Android and iOS devices. Custom implementations can support additional hardware platforms. + +Recording filenames are derived by joining classifiers with dashes: `{platform}-{form_factor}.trail.yaml` + +> **Note:** Classifier names themselves cannot contain dashes, as dashes are used as the delimiter in filenames. 
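A short Kotlin sketch of how classifiers could map to a recording filename, and how the choice between a recording and `trail.yaml` might be made. The function names are hypothetical and the set of filenames stands in for a directory listing; this is not the `TrailblazeDeviceClassifiersProvider` API itself.

```kotlin
// Hypothetical helpers for illustration; not Trailblaze's actual API.

// Joins classifiers (most general first) into a recording filename.
fun recordingFileName(classifiers: List<String>): String {
    require(classifiers.isNotEmpty()) { "The platform classifier is required" }
    classifiers.forEach {
        // Dashes are the filename delimiter, so a classifier may not contain one.
        require('-' !in it) { "Classifier '$it' must not contain a dash" }
    }
    return classifiers.joinToString("-") + ".trail.yaml"
}

// Picks the device-specific recording if present, otherwise the natural language trail.
fun resolveTrailFile(classifiers: List<String>, filesInTrailDir: Set<String>): String {
    val recording = recordingFileName(classifiers)
    return if (recording in filesInTrailDir) recording else "trail.yaml"
}

fun main() {
    val files = setOf("trail.yaml", "android-phone.trail.yaml")
    println(resolveTrailFile(listOf("android", "phone"), files)) // android-phone.trail.yaml
    println(resolveTrailFile(listOf("ios", "ipad"), files))      // trail.yaml
}
```

Rejecting dashes up front keeps the filename unambiguous: `android-phone` can only ever mean the two classifiers `android` and `phone`.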
+ + ### Platform and Device-Specific Recordings + + Each successful execution generates a recording specific to that device's classifiers: + + ``` +trails/login/ +├── trail.yaml # Source of truth (natural language) +├── android-phone.trail.yaml # Android phone recording +├── android-tablet.trail.yaml # Android tablet recording +├── ios-iphone.trail.yaml # iPhone recording +└── ios-ipad.trail.yaml # iPad recording +``` + + Each recording captures the exact tool calls that succeeded on that device type: + + ```yaml +# android-phone.trail.yaml +- prompts: + - step: Navigate to the Items screen + recording: + tools: + - tapOnElementWithResourceId: + resourceId: "com.example.myapp:id/nav_items" +``` + + ```yaml +# ios-iphone.trail.yaml +- prompts: + - step: Navigate to the Items screen + recording: + tools: + - tapOnElementWithAccessibilityIdentifier: + accessibilityIdentifier: "items_tab" +``` + + ### Why This Matters + + #### 1. Higher Confidence in CI Failures + + When tests use deterministic recordings, failures signal real issues, not LLM interpretation variance or platform-specific flakiness. A failing `ios-iphone.trail.yaml` recording tells you exactly which tool call failed, which narrows down what changed in the app. + + #### 2. Faster Execution + + Recorded mode skips LLM interpretation entirely, replaying tool calls directly. This speeds up CI runs and removes LLM API costs for recorded steps. + + #### 3. Independent Recording Lifecycle + + Each platform recording can be updated independently. If the Android UI changes, only the Android recordings need re-recording; iOS recordings remain stable. + + #### 4. Gradual Coverage Expansion + + Teams can start with recordings for their primary device type (e.g., `ios-iphone`) and expand coverage to tablets and other platforms over time. Tests still run on unrecorded platforms via AI interpretation. + + ### Runtime Resolution + + When executing a test, Trailblaze resolves the recording based on the device's classifiers: + + 1. 
**Device-specific recording exists**: Use `android-phone.trail.yaml` (deterministic replay) +2. **No matching recording**: Fall back to `trail.yaml` (AI interprets natural language) + +There is intentionally **no middle-tier fallback** (e.g., `android.trail.yaml` for all Android devices). Device-specific recordings scale better for automated generation and require less manual maintenance—when a recording is generated, it's tied to the exact device type that produced it. + +> **Note on AI Fallback:** When a recorded test fails during execution, Trailblaze can attempt to recover using AI interpretation. This "AI fallback" functionality is covered in [Decision 021: AI Fallback](2026-01-29-ai-fallback.md). + +### Recording Generation Workflow + +Recordings are created through two primary mechanisms: + +#### 1. Desktop Application (Current) + +After a successful test execution, the recording is available in the test report. The desktop application provides a **"Save Recording"** feature that writes the recording to the correct location on disk (based on the test directory and device classifiers). Once committed, this recording is used for subsequent CI runs. + +#### 2. CI Auto-Generation (Planned) + +To scale recording coverage without requiring local runs, we are building CI pipeline automation that: + +1. Detects successful AI-interpreted test executions (tests without existing recordings) +2. Automatically generates a pull request containing the new device-specific recording +3. Links the PR to the successful CI run for traceability + +**Human-in-the-loop approval is required.** A QE engineer must review and approve each recording PR to ensure the recorded tool sequence correctly implements the test intent. 
+ +**Recording PR Review Checklist:** +- Verify the recorded steps match the test's intention (not just that it passed) +- Check for extraneous steps that shouldn't be part of the recording +- Confirm the tool calls are appropriate for the device type +- Review the linked CI run to understand the execution context + +### Out-of-Sync Recordings + +When `trail.yaml` changes (steps added, modified, or removed), existing device recordings may become **out of sync** with the natural language source. + +**Current behavior:** +- Out-of-sync recordings are **still allowed to execute**—they do not block test runs +- The desktop application displays indicators when a recording is out of sync (see [Decision 016](2026-01-28-desktop-application.md)) +- It is the test author's responsibility to reconcile out-of-sync recordings (re-record or update) + +This approach prioritizes test continuity while providing visibility into sync status. However, keeping recordings in sync is important—out-of-sync recordings undermine the value of having natural language as the source of truth. + +### Custom Tools for Platform-Specific Logic + +Rather than embedding platform conditionals in recordings, platform-specific behavior is encapsulated in **custom Trailblaze tools** (see [Decision 005: Tool Naming Convention](2026-01-14-tool-naming-convention.md)). These tools: + +- Can be app-specific (e.g., `myapp_ios_launchAppSignedIn`) or platform-specific +- Encapsulate complex, multi-step actions into a single tool call +- Handle internal conditionals in code, making recordings simpler and more stable + +This keeps trail recordings as simple lists of tool invocations while allowing sophisticated platform-specific behavior where needed. 
+ +### Recording Naming Convention + +Device-specific recordings follow the pattern: `{platform}-{form_factor}.trail.yaml` + +| Platform | Form Factor | Filename | +| :--- | :--- | :--- | +| Android | Phone | `android-phone.trail.yaml` | +| Android | Tablet | `android-tablet.trail.yaml` | +| iOS | iPhone | `ios-iphone.trail.yaml` | +| iOS | iPad | `ios-ipad.trail.yaml` | +| Web | (TBD) | `web.trail.yaml` (or `web-chromium.trail.yaml`, etc.) | + +> **Why `ios-iphone` instead of `ios-phone`?** Form factor classifiers use terminology natural to each platform. iOS users and engineers refer to "iPhone" and "iPad," not "phone" and "tablet." This makes recordings immediately recognizable. + +The classifier system is extensible. Future classifiers could enable recordings like: +- `android-phone-api34.trail.yaml` (API version-specific) +- `ios-ipad-landscape.trail.yaml` (orientation-specific) +- `web-chromium-mobile.trail.yaml` (browser + viewport) + +Currently, platform + form factor is sufficient for most test differentiation needs. Web platform classifiers are still being defined and may include browser type and/or viewport size. + +## Consequences + +**Positive:** + +- Clear separation between test intent (`trail.yaml`) and platform-specific execution (recordings) +- Higher CI reliability through deterministic, device-specific replay +- Independent recording updates per platform without affecting others +- Graceful fallback enables testing on new device types without upfront recording +- Natural language source of truth enables external test case management integration + +**Negative:** + +- Multiple recordings per test increases maintenance surface area +- Recordings may drift out of sync with each other or with `trail.yaml` +- Storage grows linearly with supported device types +- Teams must decide which device types warrant dedicated recordings vs. 
AI interpretation diff --git a/docs/devlog/2026-02-04-mobile-agent-v3-integration.md b/docs/devlog/2026-02-04-mobile-agent-v3-integration.md new file mode 100644 index 00000000..690ac9cc --- /dev/null +++ b/docs/devlog/2026-02-04-mobile-agent-v3-integration.md @@ -0,0 +1,199 @@ +--- +title: "Mobile-Agent-v3 Integration Plan" +type: decision +date: 2026-02-04 +--- + +# Trailblaze Decision 032b: Mobile-Agent-v3 Integration Plan + +## Executive Summary + +This plan integrates [Mobile-Agent-v3](https://arxiv.org/abs/2508.15144) innovations into Trailblaze while preserving our **trail/blaze** architecture. Key decisions: + +- **Tiered model approach**: Frontier models for vision/reasoning, mini models for text-only planning. No fine-tuned models (e.g., GUI-Owl); frontier models improve over time without maintenance on our side. +- **"Blaze once, trail forever"**: After exploration, the recorded trail executes with **zero LLM calls**, so CI/CD runs incur no LLM cost at any scale. + +## Deployment Architecture + +### Host Mode (Development) + +The agent runs on the **desktop machine** and controls devices remotely via ADB/XCTest/DevTools. + +**Use cases:** Local development, recording new trails, interactive testing. + +### Remote Device Farm Mode (CI/CD) + +The agent runs **on the Android device** inside the test APK. Same `trailblaze-agent` code, different execution context. + +**Why:** Remote device farms (Firebase Test Lab, AWS Device Farm) only provide the device, with no external host process. Test APKs must be self-contained. 
+ +| Component | Host Mode | Remote Device Farm Mode | +|-----------|-----------|------------------------| +| `MultiAgentV3Runner` | Desktop JVM | Android JVM (in APK) | +| `SamplingSource` | `LocalLlmSamplingSource` | `KoogLlmSamplingSource` | +| `UiActionExecutor` | Remote (ADB) | Local (UIAutomator) | +| LLM Access | Direct HTTP | Direct HTTP | + +## Tiered Model Strategy + +| Tier | Models | Use For | Cost Impact | +|------|--------|---------|-------------| +| **Frontier** | Claude Sonnet 4.5, GPT-5 | Screen analysis, complex decisions | HIGH (every iteration) | +| **Mid-Tier** | Claude Haiku 4.5, GPT-4.1 | General reasoning, fallback | MEDIUM | +| **Mini** | GPT-4.1-mini, GPT-5-mini | Task planning, decomposition (text-only) | LOW (once + replans) | +| **Zero Cost** | Trail mode (recordings), Reflection (heuristic) | CI/CD execution | ZERO | + +Configured via `BlazeConfig.analyzerModel` (vision, called every iteration) and `BlazeConfig.plannerModel` (text-only, called at start + replans). + +## Value Proposition: Blaze Once, Trail Forever + +| Aspect | Mobile-Agent-v3 | Trailblaze | +|--------|-----------------|------------| +| Every execution | LLM calls required | **Zero LLM** (trail mode) | +| CI/CD at scale | Linear cost growth | **Constant $0** | +| Determinism | Non-deterministic | **100% reproducible** | +| Speed | 2-5s per action | **~100ms per action** (recordings) | + +**Enterprise impact:** 1000 test runs/day with LLM = ~$500/day. Trail mode = $0/day. + +## On-Device Execution + +**Constraints:** No Python (all Kotlin), limited memory, self-contained APK, HTTP for LLM. + +Memory-optimized config (`BlazeConfig.ON_DEVICE`): reduced iterations (20), frequent reflection (every 5), bounded backtrack (3 steps), limited subtasks (6), bounded screenshot retention (`WorkingMemory.MAX_SCREENSHOTS_ON_DEVICE = 3`). 
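The memory limits above can be pictured with a small Kotlin sketch. Only the numeric values come from this entry; `OnDeviceLimits`, `ScreenshotBuffer`, and every field name except `maxIterations` are hypothetical and do not describe the actual `BlazeConfig.ON_DEVICE` or `WorkingMemory` shapes.

```kotlin
// Hypothetical field names; only the numeric limits come from this entry.
data class OnDeviceLimits(
    val maxIterations: Int = 20,
    val reflectEveryNIterations: Int = 5,
    val maxBacktrackSteps: Int = 3,
    val maxSubtasks: Int = 6,
    val maxScreenshots: Int = 3,
)

// Keeps only the most recent screenshots so memory stays bounded on device.
class ScreenshotBuffer(private val capacity: Int) {
    init {
        require(capacity > 0) { "capacity must be positive" }
    }

    private val items = ArrayDeque<ByteArray>()

    fun add(screenshot: ByteArray) {
        if (items.size == capacity) items.removeFirst() // evict the oldest
        items.addLast(screenshot)
    }

    val size: Int get() = items.size

    fun latest(): ByteArray? = items.lastOrNull()
}

fun main() {
    val limits = OnDeviceLimits()
    val buffer = ScreenshotBuffer(limits.maxScreenshots)
    // Ten iterations each produce a (fake, one-byte) screenshot.
    repeat(10) { i -> buffer.add(byteArrayOf(i.toByte())) }
    println(buffer.size)              // 3
    println(buffer.latest()?.first()) // 9
}
```

Evicting the oldest entry keeps the most recent screens available for reflection while capping retained image data at a fixed count.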
+ +| Device Farm | Timeout | Network | Notes | +|-------------|---------|---------|-------| +| Firebase Test Lab | 45 min | Available | Set `maxIterations` accordingly | +| AWS Device Farm | 60 min | Available | Supports custom environments | +| Sauce Labs | Varies | Available | Real devices available | + +## Benchmarks + +### AndroidWorld (Google Research) + +116 programmatic tasks across 20 real Android apps. [GitHub](https://github.com/google-research/android_world) | [Paper](https://arxiv.org/abs/2405.14573) + +| Agent | Score | +|-------|-------| +| Mobile-Agent-v3 | **73.3%** | +| **Trailblaze (target)** | **70%+** | +| AppAgent | 34.2% | + +### OSWorld (XLang AI) + +369 tasks across Ubuntu, Windows, macOS. [GitHub](https://github.com/xlang-ai/OSWorld) | [Paper](https://arxiv.org/abs/2404.07972) + +| Agent | Score | +|-------|-------| +| Mobile-Agent-v3 | **37.7%** | +| **Trailblaze (target)** | **35%+** | + +## Implementation Summary (All Phases Complete) + +Six phases were implemented, inspired by Mobile-Agent-v3's multi-agent framework: + +| Phase | Feature | Key Deliverables | +|-------|---------|-----------------| +| 1 | Exception Handling & Recovery | `ExceptionalScreenState`, `RecoveryAction`, `handleExceptionalState()` | +| 2 | Reflection & Self-Correction | `ReflectionNode`, loop detection, backtracking | +| 3 | Dynamic Task Decomposition | `PlanningNode`, `TaskPlan`/`Subtask`, replan support | +| 4 | Cross-Application Memory | `WorkingMemory`, `MemoryOperation`, `MemoryNode`, OCR extraction | +| 5 | Trail Recording Enhancement | `EnhancedRecording` with pre/post conditions, `RecordingValidator` | +| 6 | MCP Progress Reporting | 12 event types, `ExecutionStatus`, `ProgressEventListener` | + +### Key Files + +| File | Module | Description | +|------|--------|-------------| +| `ScreenAnalysis.kt` | trailblaze-models | ExceptionalScreenState, RecoveryAction | +| `TrailblazeModels.kt` | trailblaze-models | TaskPlan, Subtask, WorkingMemory, 
MemoryOperation, ReflectionResult | +| `TrailblazeConfig.kt` | trailblaze-models | Task decomposition config, presets | +| `ProgressReporting.kt` | trailblaze-models | Progress events and execution status | +| `ReflectionNode.kt` | trailblaze-agent/blaze | Reflection and self-correction | +| `PlanningNode.kt` | trailblaze-agent/blaze | Task decomposition | +| `MemoryNode.kt` | trailblaze-agent/blaze | Cross-app memory | +| `BlazeGoalPlanner.kt` | trailblaze-agent/blaze | Integration of all nodes | +| `EnhancedRecording.kt` | trailblaze-agent/trail | Smart recordings with validation | +| `MultiAgentV3Runner.kt` | trailblaze-agent | High-level orchestration | + +## Architecture + +``` +MULTI_AGENT_V3 Architecture: + Planning Node (Decomposition) → Decision Node (ScreenAnalyzer) → Execution Node (UiActionExecutor) + ↓ + Exception Node (Popup/Ad/Error) + ↓ + Reflection Node (Loop detection, progress, course correction) + ↓ + Working Memory (Facts, key screenshots, cross-app clipboard) + +Execution Modes: + trail() — Zero LLM, deterministic recordings, fast CI/CD + blaze() — Full agent loop, generates recordings for trail() +``` + +## Success Metrics + +| Metric | Target | +|--------|--------| +| AndroidWorld benchmark | 70%+ | +| OSWorld benchmark | 35%+ | +| Trail (recorded) success | 99% | +| Trail (AI fallback) success | 90% | +| Blaze → Trail conversion | 80% | +| Exception recovery | 90% | + +## Parallel Work Assignments + +### Priority Order + +| Priority | Agents | Focus | +|----------|--------|-------| +| **P1 (Core)** | J (Device ID), G (ScreenAnalyzer), H (Trail Mode V3) | Core agent must work before benchmarking | +| **P2 (Validation)** | D (Unit Tests), F (MCP Integration, **after J**), K (On-Device Config) | Testing, config, validation | +| **P3 (Benchmarks/Docs)** | E (Benchmark Integration), I (Documentation) | Measure performance, document | + +### Agent Summary + +| Agent | Task | Dependencies | Creates/Modifies | 
+|-------|------|-------------|------------------| +| **D** | Unit Tests | None | New `*Test.kt` files only | +| **E** | AndroidWorld Benchmark | None | New `:benchmarks-androidworld` module | +| **F** | MCP Tool Integration | **Blocked by J** | `RunYamlRequestHandler.kt`, progress handlers | +| **G** | ScreenAnalyzer Enhancement | None | `ScreenAnalysis.kt`, `ScreenAnalyzerImpl.kt` | +| **H** | Trail Mode V3 | None (coordinate with J on `MultiAgentV3Runner.kt`) | `MultiAgentV3Runner.kt`, trail/*.kt | +| **I** | Documentation | None | New docs only | +| **J** | Device ID Threading | None (**blocks F**) | `ProgressReporting.kt`, `TrailblazeModels.kt`, `MultiAgentV3Runner.kt` | +| **K** | On-Device Config & Tiered Models | None | `TrailblazeConfig.kt`, `TrailblazeModels.kt` | + +### Conflict Resolution + +1. **H + J on `MultiAgentV3Runner.kt`**: J adds `deviceId` to `create()`, H adds `trail()` method — non-overlapping, second to merge rebases. +2. **J + K on `TrailblazeModels.kt`**: J adds `targetDeviceId` to state classes, K adds memory limits to `WorkingMemory` — different classes. +3. **D tests existing API**: Does not assume new fields from J/G/K. +4. **F depends on J**: F must not start until J merges (needs `deviceId` fields). + +## Open Questions + +1. How do we run AndroidWorld/OSWorld against Trailblaze's MCP interface? +2. Can we implement Mobile-Agent-v3's self-evolving trajectory production? +3. Can we auto-generate trail files from successful benchmark runs? +4. Do we need an iOS equivalent of AndroidWorld? +5. Should we support on-device model inference (Gemma, Phi) for fully offline blaze? +6. Should `WorkingMemory` facts persist to disk for crash recovery? + +## Resolved Questions + +1. **On-Device Execution**: Yes — all Kotlin, agent JVM code is a dependency in the test APK. +2. **Model Strategy**: No fine-tuned models. Tiered frontier/mini approach. +3. **Cost at Scale**: "Blaze once, trail forever." Trail mode = zero LLM calls. 
+ +## References + +- [Mobile-Agent-v3 Paper](https://arxiv.org/abs/2508.15144) +- [X-PLUG/MobileAgent GitHub](https://github.com/X-PLUG/MobileAgent) +- [AndroidWorld Benchmark](https://github.com/google-research/android_world) +- [OSWorld Benchmark](https://github.com/xlang-ai/OSWorld) +- [Decision 032](./2026-02-04-trail-blaze-agent-architecture.md) — Original trail/blaze architecture diff --git a/docs/devlog/2026-02-04-trail-blaze-agent-architecture.md b/docs/devlog/2026-02-04-trail-blaze-agent-architecture.md new file mode 100644 index 00000000..10bf5e51 --- /dev/null +++ b/docs/devlog/2026-02-04-trail-blaze-agent-architecture.md @@ -0,0 +1,119 @@ +--- +title: "Trail/Blaze Agent Architecture" +type: decision +date: 2026-02-04 +--- + +# Trail/Blaze Agent Architecture + +## Status + +**Accepted** - `MULTI_AGENT_V3` is the active modern agent path. `TWO_TIER_AGENT` has been removed. + +| Implementation | Status | Description | +|----------------|--------|-------------| +| `TRAILBLAZE_RUNNER` | **Keep** | Battle-tested YAML-based loop. Do not modify. | +| `MULTI_AGENT_V3` | **Active** | This decision. Koog planners + trail/blaze modes. | + +## Context + +The two-tier agent architecture (Decision 025, superseded) separates screen analysis from planning but has limitations: fixed two tiers, custom agent loops instead of Koog's proven strategies, no distinction between executing known paths vs. exploring new ones, and limited composability. + +Research from [Mobile-Agent-v3](https://arxiv.org/abs/2508.15144) demonstrates that multi-agent GUI automation with specialized components (planning, reflection, memory) achieves significantly better results — **73.3 on AndroidWorld** vs 66.4 with their foundational model alone. 
+ +Trailblaze serves two fundamentally different use cases: + +| Use Case | Input | Goal | Recordings | +|----------|-------|------|------------| +| **Automated Testing** | `.trail.yaml` with steps + recordings | Execute known path reliably | Consumed | +| **Mobile Device Control** | Natural language objective | Accomplish goal | Generated (optional) | + +## Decision + +Implement a **Trail/Blaze agent architecture** using Koog's native strategy infrastructure: + +- **`trail<>`** - Execute a known path from a trail file using **goal-oriented action planning** +- **`blaze<>`** - Explore and accomplish an objective using a **strategy graph**, optionally generating a trail + +Both leverage Koog's `AIAgent` infrastructure rather than custom loops. + +``` +trail<> blaze<> +"Follow the path" "Cut a new path" +───────────────── ───────────────── +Input: .trail.yaml Input: Natural language +Pattern: Koog Goal Planner Pattern: Koog Strategy Graph +Planning: A* through steps Planning: Dynamic per-screen +Recordings: CONSUMED Recordings: GENERATED +Speed: Fast (cached paths) Speed: Slower (exploring) + +Workflow: + blaze("objective") → generates → trail.yaml → trail(file) → executes + ↑ │ + └──────── AI Fallback when recording fails ────┘ +``` + +## trail<>: Goal Planner with Predefined Actions + +**When a trail file has complete recordings for all steps, execution uses zero LLM calls.** The A* search through predefined steps is deterministic — no LLM needed when the plan is already known. LLM calls only happen for steps without recordings or recordings that fail at runtime (if `aiFallbackEnabled`). 
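The rule for when an LLM call happens can be written as a small decision function. This is a minimal sketch under assumed names (`StepSketch`, `StepOutcome`, `resolveStep` are hypothetical and not part of Trailblaze); only the rule itself and the `aiFallbackEnabled` flag come from the text.

```kotlin
// Hypothetical model of a trail step; not the real trail file schema.
data class StepSketch(
    val prompt: String,
    val hasRecording: Boolean,
    val recordingFails: Boolean = false, // simulated runtime failure
)

enum class StepOutcome { REPLAYED, LLM_CALLED, FAILED }

fun resolveStep(step: StepSketch, aiFallbackEnabled: Boolean): StepOutcome = when {
    !step.hasRecording -> StepOutcome.LLM_CALLED   // nothing to replay
    !step.recordingFails -> StepOutcome.REPLAYED   // deterministic, zero LLM calls
    aiFallbackEnabled -> StepOutcome.LLM_CALLED    // recording failed, fall back to AI
    else -> StepOutcome.FAILED                     // strict mode: surface the failure
}

fun main() {
    val trail = listOf(
        StepSketch("Open settings", hasRecording = true),
        StepSketch("Toggle dark mode", hasRecording = true, recordingFails = true),
        StepSketch("Verify theme", hasRecording = false),
    )
    val outcomes = trail.map { resolveStep(it, aiFallbackEnabled = true) }
    println(outcomes)
    println("LLM calls: " + outcomes.count { it == StepOutcome.LLM_CALLED })
}
```

With every step recorded and no runtime failures, the count of `LLM_CALLED` outcomes is zero, which is the property trail mode relies on for CI.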
+ +| Scenario | Executor | LLM Calls | +|----------|----------|-----------| +| All steps have recordings, strict mode | `DeterministicTrailExecutor` | **0** | +| All steps have recordings, fallback enabled | `DeterministicTrailExecutor` | 0 (unless recording fails) | +| Some steps missing recordings | `GoalPlannerTrailExecutor` | Only for missing steps | +| Complex branching/conditional trails | `GoalPlannerTrailExecutor` | As needed | + +Goal planner mapping: each `step:` prompt becomes an action with preconditions (step N requires N-1 done), `recording.tools` provide optimistic beliefs, cost model prefers recordings (cost 1.0) over AI (cost 5.0), and failed recordings trigger AI Fallback ([Decision 021](./2026-01-29-ai-fallback.md)). + +## blaze<>: Strategy Graph for Exploration + +For exploratory mobile control, we use Koog's [custom strategy graphs](https://docs.koog.ai/complex-workflow-agents/): + +``` +nodeStart → nodeCapture → nodeAnalyze → nodeDecide + ↑ │ + │ ↓ + nodeExecute ← [Continue] + │ + ├─ [Complete] → nodeFinalize → nodeFinish + └─ [Failed] → nodeFinish + +Optional tiers (Mobile-Agent-v3 inspired): + • nodeReflect - Review actions, suggest corrections + • nodeProgress - Track multi-step progress + • nodeMemory - Persist cross-context information +``` + +Blaze accumulates `RecordedAction` entries during exploration. On success, these convert to trail steps via `toTrailSteps()`, enabling the blaze→trail workflow: explore once, replay deterministically. + +## MCP Integration + +When `blaze()` runs via MCP, progress callbacks report iteration status, action summaries, and objective progress to the MCP client. An optional interactive mode allows the client to inspect, redirect, or abort mid-execution. + +## Dynamic Tool Management + +`trail<>` uses a **fixed tool set** based on what the trail requires. `blaze<>` can **dynamically request additional tool categories** as it discovers needs, leveraging the existing `DynamicToolSetManager` infrastructure. 
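The difference between a fixed and a growable tool set can be sketched as follows. This is purely illustrative: `ToolCategory` and `ToolSetSketch` are hypothetical names and do not reflect the actual `DynamicToolSetManager` API, which this entry does not describe.

```kotlin
// Hypothetical categories; the real tool categories are not listed in this entry.
enum class ToolCategory { CORE, TEXT_INPUT, NAVIGATION }

class ToolSetSketch(initial: Set<ToolCategory>, private val allowGrowth: Boolean) {
    private val enabled = initial.toMutableSet()

    val categories: Set<ToolCategory> get() = enabled.toSet()

    /** Returns true if [category] is available after the request. */
    fun request(category: ToolCategory): Boolean {
        if (category in enabled) return true
        if (!allowGrowth) return false // trail-style: the set is fixed up front
        enabled += category            // blaze-style: grow as needs are discovered
        return true
    }
}

fun main() {
    val trailTools = ToolSetSketch(setOf(ToolCategory.CORE), allowGrowth = false)
    val blazeTools = ToolSetSketch(setOf(ToolCategory.CORE), allowGrowth = true)
    println(trailTools.request(ToolCategory.TEXT_INPUT)) // prints false
    println(blazeTools.request(ToolCategory.TEXT_INPUT)) // prints true
}
```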
+ +## What changed + +**Positive:** +- Unified Koog infrastructure for both modes (same `AIAgent`, same state management) +- A* cost optimization naturally prefers recordings over AI +- Both modes support replanning and recovery from failures +- Extensible via strategy graph nodes (reflection, memory, progress tracking) +- Clear separation: `trail<>` for reliability, `blaze<>` for exploration + +**Negative:** +- Additional Koog dependency (`koog-agents-planner`) +- Goal planner may be overkill for linear step sequences +- Two execution paths to maintain + +## Related Documents + +- [021: AI Fallback](./2026-01-29-ai-fallback.md) - Integrated into `trail<>` execution +- [002: Trail Recording Format](../../devlog/2025-10-01-trail-recording-format.md) - Trail file format used by both modes +- [011: Agent Loop Implementation](../../devlog/2026-01-28-agent-loop-implementation.md) - Custom loops being replaced +- [012: Koog LLM Client](../../devlog/2026-01-28-koog-llm-client.md) - Koog integration foundation +- [Mobile-Agent-v3 Paper](https://arxiv.org/abs/2508.15144) - Multi-agent GUI automation research +- [Koog Complex Workflow Agents](https://docs.koog.ai/complex-workflow-agents/) - Strategy graph documentation diff --git a/docs/devlog/2026-02-09-agent-resilience-and-driver-architecture.md b/docs/devlog/2026-02-09-agent-resilience-and-driver-architecture.md new file mode 100644 index 00000000..e6eaa277 --- /dev/null +++ b/docs/devlog/2026-02-09-agent-resilience-and-driver-architecture.md @@ -0,0 +1,115 @@ +--- +title: "Agent Resilience, Maestro Decoupling, and Driver-Specific Hierarchies" +type: decision +date: 2026-02-09 +--- + +# Agent Resilience, Maestro Decoupling, and Driver-Specific Hierarchies + +## Summary + +Architectural evolution for Trailblaze's device interaction layer addressing three interconnected concerns: agent resilience when view hierarchies are insufficient, decoupling from Maestro's private APIs, and preserving full platform fidelity in view 
hierarchies. Organized as seven phases, each delivering standalone value. + +## Context + +### View Hierarchy Insufficiency + +The agent relies on the accessibility tree for element identification. When the VH is incomplete (WebViews, custom-drawn UIs, React Native/Flutter with inconsistent accessibility, unsettled dynamic content), the agent enters a failure loop — picking bad nodeIds, failing, retrying with the same insufficient VH until max iterations. + +There is **no mechanism** that compares the screenshot against the view hierarchy to detect insufficiency. Additionally, screen dimensions (`screenState.deviceWidth/Height`) are available but not included in LLM messages, degrading coordinate estimation quality. + +### tapOnPoint Gap + +`tapOnPoint` is treated as a last resort with no intelligence — it doesn't attempt to find a VH node at the coordinates or compute a recordable selector. Meanwhile, `tapOnElementByNodeId` computes selectors but requires a valid nodeId, which is exactly what's missing when the VH is insufficient. No tool combines coordinate simplicity with selector intelligence. + +### Maestro Coupling + +Trailblaze's Maestro dependency exists at three layers: +- **Matching**: `ElementMatcherUsingMaestro` uses reflection on private `Orchestra` methods, instantiating a fake `Maestro` with `ViewHierarchyOnlyDriver`. Fragile and breaks on version changes. +- **Execution**: Tools extending `MapsToMaestroCommands` produce Maestro `Command` objects. +- **Type conversion**: `TreeNodeExt.kt` and `TrailblazeElementSelectorExt.kt` exist solely to bridge the matching dependency. + +Per [Decision 006](../../devlog/2026-01-01-maestro-integration.md), Maestro was explicitly "not a permanent coupling." + +### ViewHierarchyTreeNode Loses Fidelity + +The conversion pipeline `Raw Platform Tree → Maestro TreeNode → ViewHierarchyTreeNode` drops information at each step. Android loses `package`, `long-clickable`, `NAF`. 
iOS loses accessibility traits, `value`, `identifier`. Web/Playwright loses HTML tags, CSS classes, ARIA roles, `href`, input types, `data-*` attributes — collapsing everything into a single `resource-id` string. The correct granularity is **driver-specific** (not just platform-specific), since the same platform can have multiple drivers with fundamentally different UI projections (UiAutomator vs Espresso vs Compose on Android). + +## Decision + +Seven phases, ordered by dependency: + +### Phase 1: View Hierarchy Resilience (Quick Wins) + +**1a. Pre-LLM VH insufficiency heuristic** — After filtering interactable nodes, if count is below a threshold, inject a warning into the LLM message giving explicit permission to use `tapOnPoint`. Zero LLM cost. + +**1b. Screen dimensions in LLM messages** — Include device dimensions and scale factor in every user message. Data already available, just not included. + +### Phase 2: LLM-Reported VH Quality + +**2a.** Add `viewHierarchyQuality` (COMPLETE/PARTIAL/INSUFFICIENT/EMPTY) as an optional field on `ScreenAnalysis`. Zero extra LLM calls — piggybacks on existing analysis. + +**2b.** Outer agent uses quality signal for adaptive tool switching — dropping nodeId tools and switching to coordinate-based tools when VH is insufficient. + +**2c.** `ReflectionNode` uses VH quality history to produce targeted diagnostics instead of generic "try something different." + +### Phase 3: Augmented tapOnPoint + +Convert `tapOnPoint` from `MapsToMaestroCommands` to `DelegatingTrailblazeTool`. It hit-tests the VH at the given coordinates, computes a recordable selector if a node is found, validates the selector resolves back to the intended coordinates (within tolerance), and falls back to raw coordinates if not. The LLM's coordinates are ground truth; the selector is an optimization for recordability. 
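The Phase 3 flow (hit-test, compute a selector, validate it, fall back to coordinates) might look roughly like the sketch below. All types here (`Bounds`, `NodeSketch`, `TapTarget`, `resolveTap`) are hypothetical stand-ins, the selector is simplified to a unique resource ID, and "resolves back to the intended coordinates" is interpreted as the node's center lying within a pixel tolerance of the requested point. None of this is the actual `tapOnPoint` implementation.

```kotlin
import kotlin.math.abs

data class Bounds(val left: Int, val top: Int, val right: Int, val bottom: Int) {
    fun contains(x: Int, y: Int): Boolean = x in left until right && y in top until bottom
    val centerX: Int get() = (left + right) / 2
    val centerY: Int get() = (top + bottom) / 2
    val area: Int get() = (right - left) * (bottom - top)
}

// Hypothetical, heavily simplified view hierarchy node.
data class NodeSketch(
    val resourceId: String?,
    val bounds: Bounds,
    val children: List<NodeSketch> = emptyList(),
)

sealed interface TapTarget
data class BySelector(val resourceId: String) : TapTarget
data class ByPoint(val x: Int, val y: Int) : TapTarget

fun flatten(node: NodeSketch): List<NodeSketch> = listOf(node) + node.children.flatMap(::flatten)

fun resolveTap(root: NodeSketch, x: Int, y: Int, tolerancePx: Int): TapTarget {
    val all = flatten(root)
    // 1. Hit-test: pick the smallest node whose bounds contain the point.
    val hit = all.filter { it.bounds.contains(x, y) }.minByOrNull { it.bounds.area }
    val id = hit?.resourceId ?: return ByPoint(x, y)
    // 2. The selector is only usable if it identifies exactly one node.
    val resolved = all.filter { it.resourceId == id }.singleOrNull() ?: return ByPoint(x, y)
    // 3. Validate that tapping via the selector lands near the requested point.
    val closeEnough = abs(resolved.bounds.centerX - x) <= tolerancePx &&
        abs(resolved.bounds.centerY - y) <= tolerancePx
    // 4. The coordinates are ground truth; fall back to them when validation fails.
    return if (closeEnough) BySelector(id) else ByPoint(x, y)
}

fun main() {
    val button = NodeSketch("submit", Bounds(100, 200, 300, 260))
    val root = NodeSketch(null, Bounds(0, 0, 1080, 1920), listOf(button))
    println(resolveTap(root, 200, 230, tolerancePx = 20)) // selector: tap is at the button center
    println(resolveTap(root, 105, 205, tolerancePx = 20)) // raw point: too far from the center
    println(resolveTap(root, 10, 10, tolerancePx = 20))   // raw point: no identifiable node
}
```

Every failure path returns `ByPoint`, which mirrors the stated design choice: coordinates always work, and the selector is an optimization for recordability.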
+ +### Phase 4: Trailblaze-Native Element Matcher + +Replace `ElementMatcherUsingMaestro` with `TrailblazeElementMatcher` operating directly on `ViewHierarchyTreeNode`. Reimplement text/ID regex matching, state filtering, bounds filtering, `childOf`/`containsChild` traversal, spatial relationships, and index-based selection. Delete `ElementMatcherUsingMaestro.kt`, `ViewHierarchyOnlyDriver.kt`, and `TreeNodeExt.kt`. + +### Phase 5: Driver-Specific View Hierarchies + +Replace `ViewHierarchyTreeNode` with a `ViewHierarchyNode` interface and driver-specific implementations: `AndroidUiAutomatorNode`, `PlaywrightDomNode`, `IosAccessibilityNode`. Each provides a `descriptionForLlm()` in its platform's native style (UiAutomator dump format, HTML-like, XCTest output). `ViewHierarchyTreeNode` becomes a legacy adapter during migration. + +### Phase 6: Driver-Specific Selectors + +`TrailblazeSelector` sealed interface with `UiAutomatorSelector`, `PlaywrightSelector` (full locator API: `ByRole`, `ByTestId`, `ByCss`, etc.), and `XCTestSelector`. Each driver gets a `SelectorEngine` implementation. Existing `TapSelectorV2` strategies are refactored under the `UiAutomatorSelectorEngine`. + +### Phase 7: Execution Layer Abstraction + +`DeviceCommandExecutor` interface makes Maestro a pluggable backend. Implementations: `MaestroDeviceCommandExecutor` (current behavior), `AdbDeviceCommandExecutor` (builds on existing benchmarks code), `PlaywrightDeviceCommandExecutor` (future). + +## Dependency Graph + +``` +Phase 1 (VH Resilience) + | + v +Phase 2 (LLM VH Quality) Phase 4 (Own Element Matcher) + | | + v v +Phase 3 (Augmented tapOnPoint) --> Phase 5 (Driver-Specific Hierarchies) + | + v + Phase 6 (Driver-Specific Selectors) + | + v + Phase 7 (Execution Abstraction) +``` + +Phases 1 and 4 can proceed in parallel. 
+ +## What changed + +**Positive:** +- Phases 1-2: Immediately reduce wasted iterations on insufficient VH screens at zero LLM cost +- Phase 3: `tapOnPoint` becomes recordable; graceful degradation (coordinates always work, selectors are an optimization) +- Phase 4: Eliminates fragile private API reflection into Maestro +- Phases 5-6: Full platform fidelity; web gets Playwright-native selectors; new drivers can be added cleanly +- Phase 7: Maestro becomes optional; alternative backends possible + +**Negative:** +- Phase 4 must exactly replicate Maestro's matching behavior (significant test effort) +- Phases 5-6 introduce multiple hierarchy/selector types to maintain; large migration surface +- Phase 7 requires maintaining parity across backends + +## Related Documents + +- [006: Maestro Integration](../../devlog/2026-01-01-maestro-integration.md) - This makes 006's "not a permanent coupling" concrete +- [032: Trail/Blaze Agent](2026-02-04-trail-blaze-agent-architecture.md) - Phase 2's adaptive tool switching applies to both trail and blaze +- [032b: Mobile-Agent-v3](2026-02-04-mobile-agent-v3-integration.md) - Decoupled; V3 benefits from but doesn't depend on this work +- [002: Trail Recording Format](../../devlog/2025-10-01-trail-recording-format.md) - Phases 3 and 6 improve recording quality diff --git a/docs/devlog/2026-02-20-recording-memory-template-substitution.md b/docs/devlog/2026-02-20-recording-memory-template-substitution.md new file mode 100644 index 00000000..fc09e9d8 --- /dev/null +++ b/docs/devlog/2026-02-20-recording-memory-template-substitution.md @@ -0,0 +1,164 @@ +--- +title: "Recording Memory Template Substitution" +type: decision +date: 2026-02-20 +--- + +# Trailblaze Decision 024: Recording Memory Template Substitution + +## Context + +Trailblaze recordings are a core differentiator: an LLM-driven session records the exact tool +calls it made, and subsequent runs replay those recordings deterministically **without any LLM +involvement**. 
See Decision 002 for the recording format. + +A gap exists in how recordings capture tool parameters that originated from `AgentMemory`. + +### The Problem: Literal Values in Recordings + +When a trail step is driven by the LLM, `AgentMemory.interpolateVariables()` resolves template +variables _before_ the tool executes: + +``` +${merchant_email} → "trailblaze+merchant.coffee-shop.abc123@example.com" +``` + +The tool then executes with the resolved value. When the session is recorded, the tool is +serialized with the already-resolved literal string. The resulting recording looks like: + +```yaml +- step: Launch Square app signed in as the coffee shop merchant + recording: + tools: + - myapp_ios_launchAppSignedIn: + email: trailblaze+merchant.coffee-shop.abc123@example.com # ← literal + password: password +``` + +On a future replay, if a different `account.json` has been committed (e.g., after the staging +account was regenerated), `merchantFactory_loadAccount` runs and puts a _new_ email into memory +— but the recording still hardcodes the old email. The replay will attempt to log in with a +stale address and fail. + +The recording should instead capture: + +```yaml +- myapp_ios_launchAppSignedIn: + email: ${merchant_email} # ← template variable + password: password +``` + +This way, replay always uses the current session's memory value regardless of which account was +loaded. + +### Current State of `TrailblazeToolLog` + +The log entry that drives recording generation (`TrailblazeToolLog`) does not currently store +memory state: + +```kotlin +data class TrailblazeToolLog( + override val trailblazeTool: TrailblazeTool, // already-interpolated parameters + val toolName: String, + val successful: Boolean, + // ... no memory snapshot +) +``` + +`generateRecordedYaml()` receives only a `List<TrailblazeToolLog>` and has no access to the memory +state at the time each tool executed. 
+ +## Decision + +### Approach: Memory Snapshot per Tool Log + Post-Processing Reverse Substitution + +Rather than changing tool execution to preserve pre-interpolation state (which would require +threading the raw template strings through the full call chain), recording generation performs +a **post-processing reverse substitution** using a memory snapshot captured at tool execution +time. + +#### Step 1: Add `memorySnapshot` to `TrailblazeToolLog` + +```kotlin +data class TrailblazeToolLog( + override val trailblazeTool: TrailblazeTool, + val toolName: String, + val successful: Boolean, + override val traceId: TraceId?, + val exceptionMessage: String? = null, + override val durationMs: Long, + override val session: SessionId, + override val timestamp: Instant, + val memorySnapshot: Map<String, String>? = null, // ← new field +) +``` + +When a tool executes in `MaestroTrailblazeAgent.handleExecutableTool()`, the current +`AgentMemory.variables` map is captured as an immutable snapshot and stored alongside the tool +log. The snapshot is taken _after_ the tool executes (so memory written by the tool itself is +also captured). + +The field is nullable and defaults to null for backward compatibility with existing log files +and serialized sessions. + +#### Step 2: Reverse Substitution in `generateRecordedYaml()` + +During recording generation, for each `TrailblazeToolLog` that has a non-null `memorySnapshot`, +the tool's serialized YAML is post-processed to substitute literal values back to `${key}`: + +``` +"trailblaze+merchant.coffee-shop.abc123@example.com" → "${merchant_email}" +"ML4XV8YWNMESK" → "${merchant_token}" +``` + +The algorithm: +1. Serialize the tool to its YAML/JSON representation +2. For each `(key, value)` pair in the `memorySnapshot`, if `value` appears as a string + parameter in the serialized tool, replace it with `${key}` +3. 
Prefer longer/more specific values first to avoid partial substring collisions + +#### Mitigating False Positives + +Not all parameter values that happen to match a memory value should be templated. Mitigations: + +- **Minimum value length**: Only substitute values of 8+ characters. Short values like `"US"`, + `"password"`, or `"true"` are too ambiguous. +- **Known memory-writing tools**: Give higher confidence to substitutions where the memory key + was written by a known provisioning tool (`merchantFactory_*`, `rememberWithAi`, etc.). These + can be substituted regardless of length. +- **Exact match only**: Only substitute exact full-field matches, not substrings within a + longer string. + +#### Why Not Pre-Interpolation Capture? + +An alternative is to not call `interpolateVariables()` before recording — store the raw +template syntax and resolve it only at execution time. This is cleaner in theory but requires +significant changes to the tool execution path: the raw template form would need to be +threaded through serialization, stored separately from the runtime form, and kept in sync. The +memory snapshot + post-processing approach is additive and localized to the recording +generation layer with no changes to execution semantics. 
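The reverse substitution and its false-positive mitigations can be sketched as a pure function. This is a minimal sketch under stated assumptions: tool parameters are modeled as a flat `Map<String, String>` rather than serialized YAML, and `reverseSubstitute`, `trustedKeys`, and `MIN_VALUE_LENGTH` are hypothetical names standing in for whatever `generateRecordedYaml()` would actually use. `trustedKeys` represents memory keys written by known provisioning tools.

```kotlin
// Threshold from the mitigation list above; the constant name is illustrative.
const val MIN_VALUE_LENGTH = 8

fun reverseSubstitute(
    params: Map<String, String>,
    memorySnapshot: Map<String, String>?,
    trustedKeys: Set<String> = emptySet(),
): Map<String, String> {
    // Logs written before the snapshot field existed have no snapshot: leave them untouched.
    if (memorySnapshot == null) return params

    val candidates = memorySnapshot.entries
        // Skip short, ambiguous values unless the key comes from a trusted tool.
        .filter { (key, value) ->
            value.isNotEmpty() && (value.length >= MIN_VALUE_LENGTH || key in trustedKeys)
        }
        // Longer values first; ties broken by key so the output is deterministic.
        .sortedWith(
            compareByDescending<Map.Entry<String, String>> { it.value.length }.thenBy { it.key },
        )

    return params.mapValues { (_, paramValue) ->
        // Exact full-field match only: substrings inside longer strings are never replaced.
        val match = candidates.firstOrNull { it.value == paramValue }
        if (match != null) "\${${match.key}}" else paramValue
    }
}

fun main() {
    val email = "trailblaze+merchant.coffee-shop.abc123@example.com"
    val recorded = reverseSubstitute(
        params = mapOf("email" to email, "country" to "US", "note" to "contact $email"),
        memorySnapshot = mapOf("merchant_email" to email, "country_code" to "US"),
    )
    println(recorded)
}
```

In this example only the `email` parameter is templated: `country` is below the length threshold and `note` merely contains the email as a substring. When two memory keys hold the same value, the sketch picks the alphabetically first key, which is one arbitrary but deterministic choice; the text leaves that edge case to the tests.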
+ +## Consequences + +**Positive:** + +- Recordings using memory variables become durable across session boundaries — replaying with a + different loaded account still works correctly +- No changes to tool execution semantics; post-processing is isolated to recording generation +- Backward compatible: existing recordings without `memorySnapshot` continue to work +- The memory snapshot also provides useful debugging context (what was in memory when a tool + ran) + +**Negative:** + +- Risk of false-positive substitutions for short or coincidentally matching values (mitigated + by length threshold and exact-match-only rule) +- `TrailblazeToolLog` grows slightly in serialized size (one map per tool log) +- The reverse substitution logic needs its own tests to verify edge cases (multiple keys with + same value, values that are substrings of other values, etc.) + +## Related Decisions + +- Decision 002: Trail Recording Format (YAML) — defines the recording schema this improves +- Decision 023: Merchant Factory Provisioning Trails — the primary consumer of this improvement + (`${merchant_email}`, `${merchant_token}` being the most common affected variables) diff --git a/docs/devlog/2026-02-20-scripted-tools-vision.md b/docs/devlog/2026-02-20-scripted-tools-vision.md new file mode 100644 index 00000000..c5996fea --- /dev/null +++ b/docs/devlog/2026-02-20-scripted-tools-vision.md @@ -0,0 +1,216 @@ +--- +title: "Scripted Tools Vision (TypeScript/QuickJS)" +type: decision +date: 2026-02-20 +--- + +# Trailblaze Decision 025: Scripted Tools Vision (TypeScript/QuickJS) + +## Context + +All Trailblaze tools are currently authored in Kotlin and registered at compile time. This +works well for stable, well-defined tools, but creates friction for **conditional logic** in +tools — branching behavior based on device state, memory values, or runtime conditions. 
+ +### Current Limitation: Conditional Logic Requires Kotlin + +When a tool needs to do something like "if the merchant has a subscription, take path A; +otherwise take path B," the only option today is to write a `DelegatingTrailblazeTool` in +Kotlin: + +```kotlin +class EnsureMerchantReadyTool(...) : DelegatingTrailblazeTool { + override fun toExecutableTrailblazeTools(ctx): List { + return if (ctx.trailblazeAgent.memory.variables["merchant_token"] != null) { + listOf(LaunchAppSignedInTool(...)) + } else { + listOf(LoadAccountTool(...), LaunchAppSignedInTool(...)) + } + } +} +``` + +This requires Kotlin knowledge, a full rebuild, and a code review cycle. It is not accessible +to test engineers who primarily work in TypeScript/JavaScript. + +### Trail YAML Is Static + +Trail YAML files are a flat list of steps. They are expressive for sequencing but have no +conditional syntax. The LLM handles selection and orchestration at a natural-language level, +but deterministic conditional logic (things that should always behave the same way regardless +of LLM interpretation) currently cannot be expressed without Kotlin. + +### `DelegatingTrailblazeTool` Is the Right Pattern + +The existing `DelegatingTrailblazeTool` interface is already the correct abstraction: + +```kotlin +interface DelegatingTrailblazeTool : TrailblazeTool { + fun toExecutableTrailblazeTools(ctx: TrailblazeToolExecutionContext): List +} +``` + +It takes an execution context (device info, memory, screen state) and returns a list of +concrete tool calls. This is a pure function: context in, tool list out. Any scripting layer +that produces the same input/output contract would integrate naturally. + +## Decision + +### Vision: TypeScript as a First-Class Tool Authoring Surface + +We intend to allow TypeScript files to define Trailblaze tools with conditional logic, compiled +to JavaScript ahead of time and executed at runtime via an embedded JavaScript engine. 
This is +a **future investment** — current work remains Kotlin-based. + +### Design Principles + +#### 1. One-to-One: TypeScript File → Named Tool(s) + +A TypeScript file is the unit of authorship for scripted tools. A single file may export one +or more tool definitions, each conforming to Trailblaze's tool naming convention (Decision 005, +e.g., `merchantFactory_*`, `myapp_*`). The tool names defined in the script become first-class +citizens in the tool registry alongside Kotlin-defined tools. + +```typescript +// merchant-factory/scripts/ensure-account.ts + +export const merchantFactory_ensureAccount = tool({ + name: "merchantFactory_ensureAccount", + description: "Loads a merchant account if not already in memory, otherwise no-ops.", + params: { key: string() }, + run(ctx, { key }) { + if (ctx.memory.has("merchant_token")) return []; + return [{ tool: "merchantFactory_loadAccount", params: { key } }]; + }, +}); +``` + +#### 2. Precompile Step: TypeScript → JavaScript + +TypeScript source is compiled to JavaScript (`tsc` or `esbuild`) at authoring time, not at +test execution time. The resulting `.js` files are bundled alongside the trail assets. No +TypeScript toolchain is required on CI runners or Android test devices. + +``` +scripts/ensure-account.ts → (tsc compile) → scripts/ensure-account.js +``` + +#### 3. Runtime: QuickJS + +JavaScript is executed at runtime using **QuickJS**, a lightweight embeddable JS engine by +Fabrice Bellard. QuickJS is suitable because: + +- Tiny binary footprint (~400KB native library) +- Runs on Android (ART) — not just JVM host mode +- No external process required — runs in-process via JNI bindings +- Sandboxed by default (no file system, no network, no `require`) +- Supports ES2020 including async/await via promise resolution driven by the host + +Recommended binding: **`quickjs-kt`** (Kotlin-idiomatic, coroutine-backed async, Maven +Central). 
Cash App's **Zipline** library uses QuickJS under the hood for dynamic code loading +on Android. + +#### 4. Restricted API Surface + +Scripts have access to a **limited, intentionally small API**. This is enforced by the QuickJS +host — globals not explicitly provided simply do not exist in the script environment: + +```typescript +declare const trailblaze: { + memory: { + get(key: string): string | undefined; + has(key: string): boolean; + // Note: memory.set() is NOT exposed. Scripts emit tool calls; they don't + // mutate memory directly. Tool execution mutates memory via the Kotlin path. + }; + emit(toolName: string, params: Record): void; +}; +``` + +Scripts decide **what tools to call** (`trailblaze.emit()`). Kotlin then executes those tool +calls through the normal execution path, which handles memory writes, logging, and recording. + +**HTTP is explicitly excluded** from the on-device scripting surface. API-calling tools (like +`merchantFactory_*`) must remain Kotlin-based, as they require proper network error handling, +authentication (Cloudflare tokens, etc.), and robust retry logic that is already implemented +in Kotlin. If HTTP is needed in scripts running on the JVM host (not on device), that can be +revisited separately. + +#### 5. 
Integration as `DelegatingTrailblazeTool` + +A `ScriptDelegatingTool` Kotlin wrapper executes the compiled JavaScript and collects emitted +tool calls as the delegation output: + +```kotlin +@TrailblazeToolClass("script", isRecordable = false) +data class ScriptDelegatingTool( + val scriptPath: String, + val params: Map = emptyMap(), +) : DelegatingTrailblazeTool { + override fun toExecutableTrailblazeTools( + ctx: TrailblazeToolExecutionContext, + ): List { + val js = loadScriptAsset(scriptPath) + val engine = TrailblazeQuickJsEngine(memory = ctx.trailblazeAgent.memory) + return engine.evaluate(js, params).toTrailblazeTools() + } +} +``` + +Because `ScriptDelegatingTool` is `isRecordable = false`, the script itself does not appear +in recordings. The **expanded tool calls it emits** are what get recorded, just like any other +delegating tool. This means: + +- On **Android on-device replay**, the recording's expanded tool calls execute directly — + QuickJS never runs during replay +- On **LLM-driven sessions** (always JVM host), the script runs and emits tool calls that are + then captured in the recording + +This resolves the Android on-device constraint: QuickJS only needs to run where the LLM runs +(JVM host), and on-device replay uses pre-recorded tool calls. + +### LLM as Dynamic Orchestrator + +For account selection and similar decisions, **the LLM already provides dynamic logic at no +additional cost**. A trail step like "Sign in with a US coffee shop merchant that has a Free +subscription" causes the LLM to call `merchantFactory_loadAccount(account: COFFEE_SHOP)` — +no scripting required. TypeScript scripting is intended for **deterministic conditional logic** +that must behave identically on every run regardless of LLM inference, not for decisions that +are naturally expressed in natural language. + +### Current State + +**This vision is not yet implemented.** All conditional tool logic remains Kotlin-based. 
+The merchant factory module is the intended first real-world use case once scripting is +available, but its current implementation is pure Kotlin and meets current needs. + +## Consequences + +**Positive:** + +- Conditional tool logic becomes accessible to test engineers without Kotlin expertise +- TypeScript is familiar to the broader mobile/web engineering community at Block +- The precompile step keeps runtime simple and eliminates scripting toolchain dependencies + from test execution environments +- Compatible with Android on-device tests via the recording model (scripts run during + authoring, not replay) +- The QuickJS sandbox naturally limits the blast radius of poorly-authored scripts +- Tool naming conventions are enforced at the script export level, maintaining consistency + +**Negative:** + +- Adds a precompile step to the test authoring workflow (TypeScript → JavaScript) +- Two-language codebase for tools (Kotlin + TypeScript); contributors need to know which to use +- QuickJS adds a native library dependency to the Android APK (~400KB) +- Debugging scripted tools is harder than debugging Kotlin tools (no IDE integration, + stack traces from QuickJS are less ergonomic) +- API-calling tools (HTTP, gRPC) must remain Kotlin — TypeScript scripts cannot reach external + services on device + +## Related Decisions + +- Decision 009: Kotlin as Primary Language — TypeScript scripting is an additive layer, not a + replacement; Kotlin remains primary for framework code and API-calling tools +- Decision 010: Custom Tool Authoring — this decision extends the future directions noted there +- Decision 023: Merchant Factory Provisioning Trails — intended first real-world application + for scripted conditional provisioning logic diff --git a/docs/devlog/2026-03-04-trailblaze-node-view-hierarchy.md b/docs/devlog/2026-03-04-trailblaze-node-view-hierarchy.md new file mode 100644 index 00000000..5f103080 --- /dev/null +++ 
b/docs/devlog/2026-03-04-trailblaze-node-view-hierarchy.md @@ -0,0 +1,161 @@ +--- +title: "TrailblazeNode — Type-Safe Driver-Specific View Hierarchy" +type: decision +date: 2026-03-04 +--- + +# TrailblazeNode — Type-Safe Driver-Specific View Hierarchy + +Creating a type-safe abstraction over platform-specific UI trees. + +## Background + +Trailblaze interacts with UI across four driver backends: Android via Maestro, Android via the native Accessibility Service, Web via Playwright, and Desktop via Compose. Each driver captures fundamentally different information about the UI. + +The original shared model, `ViewHierarchyTreeNode`, was designed as a lowest-common-denominator representation that mirrors Maestro's `TreeNode`. Every driver was forced to map its native data into this single shape, losing platform-specific information in the process: + +- **Android Accessibility** captures ~30 properties that `ViewHierarchyTreeNode` drops entirely: `className`, `inputType`, `collectionItemInfo` (list position), `labeledByText`, `stateDescription`, `isEditable`, `isHeading`, `isCheckable`, and more. These are the properties that would make element disambiguation dramatically more reliable. +- **Playwright** naturally identifies elements by ARIA descriptor + occurrence index, not by integer node IDs. Forcing it through `ViewHierarchyTreeNode` requires an artificial mapping that doesn't represent how Playwright actually works. +- **Compose** uses semantic roles, test tags, and toggleable state — none of which map cleanly to `ViewHierarchyTreeNode`'s fields. + +This forced normalization had a direct impact on selector quality. The `TapSelectorV2` selector generator could only use properties available in `ViewHierarchyTreeNode`: text, resource ID, and a few boolean state flags. 
When five list items share the same text, it had to fall back to brittle strategies like `index` or complex hierarchy traversals, even though the native accessibility tree contained `collectionItemInfo.rowIndex` that would disambiguate instantly. + +The key insight is that Maestro's YAML-based selectors were designed for humans to write by hand, so they use a small, simple set of properties. Trailblaze's selectors are generated by AI and computed programmatically — we can leverage arbitrarily rich property sets because the complexity is managed by the system, not the user. + +## What we decided + +### 1. Introduce `TrailblazeNode` as the universal tree model + +`TrailblazeNode` is a minimal data class with only truly universal properties: + +```kotlin +data class TrailblazeNode( + val nodeId: Long, + val children: List<TrailblazeNode>, + val bounds: Bounds?, + val driverDetail: DriverNodeDetail, +) +``` + +There is no shared `text`, `role`, or `isEnabled` field. Those concepts mean different things on different platforms (Android's `className` vs ARIA roles vs Compose semantic roles), and normalizing them loses information. All meaningful properties live in `driverDetail`. + +### 2. `DriverNodeDetail` sealed interface with strongly-typed driver variants + +Each driver has its own data class with full native properties: + +```kotlin +sealed interface DriverNodeDetail { + data class AndroidAccessibility(...) : DriverNodeDetail // ~30 properties + data class AndroidMaestro(...) : DriverNodeDetail // Maestro-compatible subset + data class Web(...) : DriverNodeDetail // ARIA + CSS selectors + data class Compose(...) : DriverNodeDetail // Semantics + testTag +} +``` + +This is explicitly NOT a `Map`. 
Strongly-typed data classes provide: +- Compile-time safety — no typos in property names, no wrong types +- IDE autocompletion when writing matchers and generators +- Exhaustive `when` matching ensures new drivers are handled everywhere +- Serialization via kotlinx.serialization for recording persistence + +### 3. Property matchability annotations + +Each property in `DriverNodeDetail` is documented as either **matchable** or **display-only**: + +- **Matchable** properties are stable across runs and safe for recorded selectors (e.g., `className`, `resourceId`, `text`, `labeledByText`, `collectionItemInfo`). +- **Display-only** properties are transient and useful for the LLM to recognize elements but must not appear in recordings (e.g., `error`, `isVisibleToUser`, `isShowingHintText`, `drawingOrder`). + +Each variant exposes a `matchablePropertyNames: Set<String>` for programmatic access, ensuring selector generators only use stable properties. + +### 4. `TrailblazeNodeSelector` with driver-specific matchers + +A new selector model parallels the `DriverNodeDetail` sealed hierarchy: + +```kotlin +data class TrailblazeNodeSelector( + val driverMatch: DriverNodeMatch?, // Driver-specific property matching + val above: TrailblazeNodeSelector?, // Spatial relationships + val below: TrailblazeNodeSelector?, + val leftOf: TrailblazeNodeSelector?, + val rightOf: TrailblazeNodeSelector?, + val childOf: TrailblazeNodeSelector?, // Hierarchy + val containsChild: TrailblazeNodeSelector?, + val containsDescendants: List<TrailblazeNodeSelector>?, + val index: Int?, // Last resort +) +``` + +`DriverNodeMatch` mirrors the sealed hierarchy — each driver has its own matcher that can match on all matchable properties: + +```kotlin +sealed interface DriverNodeMatch { + data class AndroidAccessibility( + val classNameRegex: String?, + val textRegex: String?, + val labeledByTextRegex: String?, + val collectionItemRowIndex: Int?, + val inputType: Int?, + // ... all matchable properties + ) : DriverNodeMatch + // ... 
Web, Compose, AndroidMaestro +} +``` + +### 5. Selector generation via cascading strategies + +`TrailblazeNodeSelectorGenerator` computes the simplest selector that uniquely identifies a target node, trying strategies from most stable to most brittle: + +For Android Accessibility (11 strategies): +1. Unique stable ID (uniqueId or resourceId) +2. Text alone +3. Text + className (e.g., "Fries" in a `TextView` vs `EditText`) +4. labeledByText (form fields: "the input labeled Email") +5. labeledByText + className +6. className + state flags (editable, checkable, heading) +7. childOf unique parent +8. collectionItemInfo (semantic list position) +9. containsChild (unique child content) +10. Text + childOf parent +11. Index (positional fallback) + +The generator verifies each candidate selector against the resolver to confirm it produces exactly one match before returning it. + +### 6. Coexistence with existing Maestro path + +The existing production pipeline is untouched: +- `ViewHierarchyTreeNode` stays as-is for Maestro-based paths +- `TrailblazeElementSelector` stays as-is for Maestro selector matching +- `TapSelectorV2` stays as-is for Maestro selector generation +- `AccessibilityElementResolver` stays as-is for existing accessibility playback + +`TrailblazeNode` and `TrailblazeNodeSelector` are new, parallel systems. Recordings can store either selector type, with the driver determining which to use. 
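
The cascading generation described in section 5 can be illustrated with a much smaller model. The sketch below is not the real `TrailblazeNodeSelectorGenerator`: `Node`, `Selector`, `resolve`, and `generateSelector` are hypothetical stand-ins, and only three of the eleven strategies plus the index fallback are shown. It assumes exact-match constraints rather than regexes, and it keeps the key property that every candidate is verified against the resolver before it is returned.

```kotlin
// Hypothetical, simplified stand-ins for TrailblazeNode / TrailblazeNodeSelector.
data class Node(
    val id: Long,
    val text: String? = null,
    val className: String? = null,
    val resourceId: String? = null,
)

// Null fields mean "no constraint"; index is the positional last resort.
data class Selector(
    val resourceId: String? = null,
    val text: String? = null,
    val className: String? = null,
    val index: Int? = null,
)

// Resolver: every node that satisfies all non-null constraints,
// optionally narrowed to a single position.
fun resolve(selector: Selector, nodes: List<Node>): List<Node> {
    val matches = nodes.filter { n ->
        (selector.resourceId == null || n.resourceId == selector.resourceId) &&
            (selector.text == null || n.text == selector.text) &&
            (selector.className == null || n.className == selector.className)
    }
    return if (selector.index == null) matches else listOfNotNull(matches.getOrNull(selector.index))
}

// Generator: try candidates from most stable to most brittle and return the
// first one the resolver confirms matches exactly the target node.
fun generateSelector(target: Node, nodes: List<Node>): Selector {
    val candidates = sequenceOf(
        target.resourceId?.let { Selector(resourceId = it) },
        target.text?.let { Selector(text = it) },
        target.text?.let { t -> target.className?.let { c -> Selector(text = t, className = c) } },
    ).filterNotNull()

    candidates
        .firstOrNull { resolve(it, nodes).singleOrNull()?.id == target.id }
        ?.let { return it }

    // Positional fallback: index of the target among all nodes.
    return Selector(index = nodes.indexOfFirst { it.id == target.id })
}

fun main() {
    val nodes = listOf(
        Node(1, text = "Fries", className = "TextView"),
        Node(2, text = "Fries", className = "EditText"),
        Node(3, text = "Checkout", resourceId = "btn_checkout"),
    )
    println(generateSelector(nodes[2], nodes)) // stable ID is enough
    println(generateSelector(nodes[0], nodes)) // text alone is ambiguous, so className is added
}
```

Because the candidates are built lazily in a `Sequence`, later and more brittle strategies are only evaluated when the earlier ones fail to produce a unique match.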
+ +## File Layout + +``` +trailblaze-models/src/commonMain/.../api/ + TrailblazeNode.kt — Universal tree node + DriverNodeDetail.kt — Sealed interface (4 driver variants) + TrailblazeNodeSelector.kt — Rich selector model + TrailblazeNodeSelectorResolver.kt — Matches selectors against trees + TrailblazeNodeSelectorGenerator.kt — Computes selectors for target nodes + +trailblaze-accessibility/.../accessibility/ + TrailblazeNodeMapper.kt — AccessibilityNode -> TrailblazeNode +``` + +## What changed + +**Positive:** + +- Selectors can now use `className`, `inputType`, `collectionItemInfo`, `labeledByText`, and 20+ other properties that were previously invisible — dramatically improving disambiguation for duplicate elements in lists, forms, and complex UIs. +- Each driver keeps its native richness intact — no lossy normalization into a shared model that doesn't fit any platform well. +- Adding a new driver (e.g., iOS UIAutomation) means adding one `DriverNodeDetail` data class and one `DriverNodeMatch` data class. The resolver and selector model handle it automatically via sealed class exhaustive matching. +- The matchability annotation system ensures recordings only capture stable properties, preventing flaky tests from transient state. +- Type-safe sealed interfaces prevent the "string key" bugs that come with HashMap-based property bags. + +**Negative:** + +- Two parallel view hierarchy models exist during migration (`ViewHierarchyTreeNode` for Maestro, `TrailblazeNode` for everything else). This is intentional — forcing migration would risk regressions in the Maestro path. +- Each driver-specific matcher has its own matching function in the resolver, leading to some code duplication. This is the trade-off for type safety over a generic property-matching system. +- Selector generation strategies are driver-specific, meaning each new driver needs its own strategy list. However, the spatial and hierarchy strategies are shared. 
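
As a rough illustration of the per-driver matching functions and the exhaustive `when` mentioned above, the sketch below uses a two-variant miniature of the sealed hierarchies. `Detail`, `Match`, `regexMatches`, and `matches` are hypothetical names, and the `Web` fields (`ariaRole`, `ariaName`) are assumptions chosen for the example rather than the real property set.

```kotlin
// Hypothetical miniature of DriverNodeDetail: one data class per driver.
sealed interface Detail {
    data class AndroidAccessibility(val className: String?, val text: String?) : Detail
    data class Web(val ariaRole: String?, val ariaName: String?) : Detail
}

// Hypothetical miniature of DriverNodeMatch, mirroring the variants above.
sealed interface Match {
    data class AndroidAccessibility(
        val classNameRegex: String? = null,
        val textRegex: String? = null,
    ) : Match

    data class Web(
        val ariaRoleRegex: String? = null,
        val ariaNameRegex: String? = null,
    ) : Match
}

// A null pattern means "no constraint"; otherwise the pattern must match the whole value.
private fun regexMatches(pattern: String?, value: String?): Boolean =
    pattern == null || (value != null && Regex(pattern).matches(value))

// No `else` branch: adding a third Match variant makes this `when` fail to
// compile until the new driver is handled here.
fun matches(match: Match, detail: Detail): Boolean = when (match) {
    is Match.AndroidAccessibility ->
        detail is Detail.AndroidAccessibility &&
            regexMatches(match.classNameRegex, detail.className) &&
            regexMatches(match.textRegex, detail.text)
    is Match.Web ->
        detail is Detail.Web &&
            regexMatches(match.ariaRoleRegex, detail.ariaRole) &&
            regexMatches(match.ariaNameRegex, detail.ariaName)
}

fun main() {
    val field = Detail.AndroidAccessibility(className = "android.widget.EditText", text = "Email")
    println(matches(Match.AndroidAccessibility(classNameRegex = ".*EditText"), field)) // true
    println(matches(Match.Web(ariaRoleRegex = "textbox"), field)) // false: matcher is for another driver
}
```

The duplication noted in the trade-offs is visible here: each branch repeats the same shape, but every property access is checked by the compiler.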
diff --git a/docs/devlog/2026-03-06-trail-yaml-v2-syntax.md b/docs/devlog/2026-03-06-trail-yaml-v2-syntax.md new file mode 100644 index 00000000..c2f83799 --- /dev/null +++ b/docs/devlog/2026-03-06-trail-yaml-v2-syntax.md @@ -0,0 +1,282 @@ +--- +title: "Trail YAML v2 Syntax" +type: decision +date: 2026-03-06 +--- + +# Trail YAML v2 Syntax + +Evolving our YAML syntax based on months of real-world trail authoring. + +## Background + +The current `.trail.yaml` syntax uses generic keywords (`prompts`, `tools`, `config`) that don't convey Trailblaze's identity and create unnecessary nesting. Key pain points: + +- **`prompts` → `recording` → `tools`** is deeply nested for what's conceptually "here's a step and what was recorded." +- **`config`** is generic and doesn't communicate what this block represents in the Trailblaze mental model. +- **`verify`** is a separate keyword from `step`, but verification is really just a type of step — the distinction is better expressed by the tools used (e.g., `assertVisibleBySelector`). +- **`context`** doesn't communicate that the value is injected into the LLM system prompt. +- **No support for pre-seeded variables** — test data like emails, card numbers, and credentials must be hardcoded in step text or the `context` string rather than declared as structured, referenceable values. +- **No test setup concept** — setup steps (launch, sign in, navigate) are mixed in with test steps. There's no checkpoint to replay when iterating, and no way to distinguish "couldn't reach the starting point" from "test failed." +- **File is a list of items** — `[config, prompts, tools]` when there's exactly one of each. A document with named properties is simpler. +- **Maestro was a YAML primitive** — now replaced by `MaestroTrailblazeTool` (see PR #1944), but the broader syntax should be updated to match. 
+ +## What we decided + +### v2 Structure + +The file is a YAML **mapping** (not a list) with two named sections: + +| Section | Purpose | Contains | +| :--- | :--- | :--- | +| `trailhead` | Trail identity, configuration, and setup | id, title, systemPrompt, memory, target, platform, metadata, setup | +| `trail` | The test itself | NL steps with optional recordings | + +The `trailhead` is everything about the starting point: what this trail is, how it's configured, and the steps to get there. The `trail` is the test itself. + +### Keyword Changes + +| v1 Keyword | v2 Keyword | Rationale | +| :--- | :--- | :--- | +| `config` | `trailhead` | The trailhead is where the trail begins — identity, configuration, and setup all live here. | +| `prompts` | `trail` | The test steps — the path you walk. Whether blazing (AI) or following a recording, it's the trail. | +| `recording.tools` | `recording` (under each step) | Flattened from 2 levels to 1. The `tools` wrapper is removed. | +| `tools` (top-level) | `tools` (in step lists) | No longer a standalone top-level block. Now a directly authored deterministic primitive alongside `step` entries in `setup` and `trail`. | +| `context` | `systemPrompt` | No ambiguity — this text is injected into the LLM system prompt. | +| `verify` | removed | Use `step` for everything. Verification intent is expressed by the tools used. | +| `config` fields | `trailhead` fields | `id`, `title`, `systemPrompt`, `memory`, `target`, `platform`, `metadata` move into `trailhead`. | +| (none) | `setup` | Setup steps nested under `trailhead` — a checkpoint for recording iteration and deterministic replay. | + +### v2 Syntax — Full Example + +```yaml +# ── Trailhead: identity, configuration, and setup ────────────────── +trailhead: + id: testrail/suite_71172/section_838052/case_4837714 + title: Verify user cannot load more than $2,000 onto a Gift Card within 24 hours + priority: P0 + + # Optional — unlocks app-specific custom tools (e.g. 
launchAppSignedIn, deeplinks) + # Without a target, trails run with generic tools only. + target: myapp + platform: ios + + # Injected into the LLM system prompt for this trail + systemPrompt: > + The gift card number to use is {{giftCardNumber}}. + Always dismiss any promotional dialogs before proceeding. + + # Pre-seeded runtime variables — available as {{varName}} in steps and tool params + memory: + giftCardNumber: "7783 3224 0646 3436" + email: testuser+giftcards@example.com + password: "12345678" + + # Informational — never used at runtime, only for reporting/traceability + metadata: + caseId: "4837714" + sectionId: "838052" + testRailUrl: https://testrail.example.com/index.php?/cases/view/12345 + + # Setup steps (checkpoint for recording iteration) + setup: + - step: Launch the app with email {{email}} and password {{password}} + recording: + - myapp_ios_launchAppSignedIn: + email: "{{email}}" + password: "{{password}}" + - tools: + - tap: "Gift cards" + +# ── Trail: the test steps ──────────────────────────────────────────── +trail: + - step: Tap Reload card or check balance + recording: + - tap: "Check balance or reload card" + + - step: Enter the gift card number + recording: + - tap: "0000 0000 0000 0000" + - inputText: "{{giftCardNumber}}" + + - step: Tap Next + recording: + - tap: "Next" + + - step: Tap Add Value + recording: + - tap: "Add value" + + - step: Select $50 option + recording: + - tap: "$50" + + - step: Wait and tap Review sale + recording: + - tap: "Review sale 1 item" + + - step: Tap Charge $50.00 + recording: + - tap: "Charge $50.00" + + - step: Tap on $50 amount + recording: + - tap: "$50" + + # Non-recordable step — AI always handles this, recording is never overwritten + - step: Dismiss any payment confirmation dialogs + recordable: false + + # Direct tools block — hand-authored deterministic sequence, no NL step needed + - tools: + - assertVisible: "Amount exceeds gift card balance limit." 
+ - assertVisible: "Declined" + - assertVisible: "Cancel Payment" +``` + +### blaze.yaml — NL Definition (Cross-Platform) + +The blaze file is primarily NL — no recordings. `tools` blocks are allowed for platform-agnostic deterministic sequences, but platform-specific recordings live in `.trail.yaml` files. + +```yaml +trailhead: + id: suite/71172/section/838052/case/4837714 + title: Verify gift card load limit + memory: + giftCardNumber: "7783 3224 0646 3436" + email: testuser+giftcards@example.com + setup: + - step: Launch the app and sign in with {{email}} + - step: Navigate to Gift Cards + +trail: + - step: Tap Reload card or check balance + - step: Enter the gift card number + - step: Tap Next + - step: Tap Add Value + - step: Select $50 option + - step: Wait and tap Review sale + - step: Tap Charge $50.00 + - step: Tap on $50 amount + - step: Dismiss any payment confirmation dialogs + recordable: false + - step: > + Verify the message "Amount exceeds gift card balance limit" appears. + Verify the message "Declined" appears. + Verify "Cancel Payment" button is visible. +``` + +### Key Design Principles + +**1. Two sections, each with one job.** `trailhead` is where the trail begins — identity, configuration, and setup. `trail` is what you're testing — the test itself. + +**2. Trailhead is the starting point.** Everything about _getting ready_ lives here: what this trail is (`id`, `title`), how it's configured (`systemPrompt`, `memory`, `target`), and the steps to reach the starting state (`setup`). The trailhead is a complete description of where the trail begins. + +**3. Setup is a checkpoint.** During recording, `setup` is a save point. Mess up the test? Replay setup instantly, re-record. This is the primary motivation — it serves the recording and iteration workflow. + +**4. Two kinds of entries: `step` and `tools`.** Both `setup` and `trail` are lists that can contain either kind. 
A `step` has an NL description (the durable intent) with an optional `recording` (ephemeral derived cache). A `tools` block is a directly authored deterministic sequence — no NL, no recording, hand-written by the author. "Blazing" (AI exploration) is a process/verb, not a keyword. + +**5. `recording` is flat.** `recording.tools` becomes just `recording` — one level of nesting removed. + +**6. `memory` is active, `metadata` is passive.** Memory variables are interpolated at runtime via `{{varName}}`. Metadata is never touched by the framework — purely for reporting and traceability. + +**7. `systemPrompt` is honest.** Calling it what it is removes all ambiguity about where this text ends up. + +**8. `verify` is just a `step` (or `tools`).** Any step can perform verification — the intent is expressed by the tools used, not by a separate keyword. Pure assertion sequences can also be written as direct `tools` blocks. + +**9. `recordable: false` remains per-step.** This flag means "never overwrite this step's recording during re-recording" — useful for steps that should always be handled by the AI. + +**10. File is a mapping, not a list.** Since there's exactly one of each section, named properties are simpler than an anonymous list of items. + +**11. No top-level interleaving.** v1 allowed multiple `prompts` and `tools` blocks interleaved at the top level. v2 has exactly one `trailhead` and one `trail`. Within each list, `step` and `tools` entries can be freely mixed — but the top-level structure is fixed. + +**12. `step` is source of truth, `recording` is ephemeral cache.** The semantic boundary is clear: `step` (NL intent) is the durable, authoritative description. `recording` is a derived materialization — replaceable, rebuildable, secondary. `tools` blocks are different: they are hand-authored and authoritative in their own right. + +### Setup Behavior + +**Execution policy:** +1. If recording exists → replay deterministically (no AI, instant) +2. 
If no recording → blaze via AI (first run), then save recording +3. If recording fails → re-blaze from NL description, save new recording + +**Failure semantics:** +- Setup failure = "couldn't reach the starting point" → test is **skipped/retried**, not failed +- Trail failure = "the test ran and something didn't work" → test is **failed** + +**Reuse via custom tools:** +Setup is shared across tests through custom tools. A recorded setup sequence can be promoted to a custom tool (e.g., `setupMoneyTab`), then referenced by NL in other tests' setup. + +### Naming Glossary + +| Term | What it is | +| :--- | :--- | +| `trailhead` | Trail identity, configuration, and setup — where the trail begins | +| `setup` | Setup steps within the trailhead (checkpoint for recording iteration) | +| `trail` | Test steps — the path you walk (the test) | +| `step` | Individual action within setup or trail | +| `recording` | Ephemeral derived cache for a step (deterministic replay, replaceable) | +| `tools` | Directly authored deterministic block — hand-written, not derived from a step | +| *blazing* | AI exploration when no recording exists (verb, not keyword) | +| `blaze.yaml` | NL definition file — the plan before you go | +| `*.trail.yaml` | Platform recording file — the trail left behind | +| `memory` | Pre-seeded variables for template interpolation | +| `systemPrompt` | Text injected into LLM system prompt | + +### Future: Execution Mode + +A future enhancement will add a `mode` property to `trailhead` that controls the speed/accuracy tradeoff: + +```yaml +trailhead: + mode: fast # fast | accurate | custom +``` + +| Mode | View Hierarchy | Tool Sets | Target | +| :--- | :--- | :--- | :--- | +| `fast` | Filtered, minimal nodes | Core tools only | Local LLMs, small context windows | +| `accurate` | Full hierarchy, all nodes/bounds | All available tools | Frontier models, max reliability | +| `custom` | Explicit per-trail config | Explicit per-trail config | Fine-tuned control | + 
+This is intentionally deferred — the right API will emerge from real usage with local vs. frontier models. + +## Migration Strategy + +1. **Build a new v2 parser** alongside the existing one in `trailblaze-models/commonMain`. +2. **Try-catch fallback**: attempt v2 parsing first, fall back to v1 on failure. +3. **Bulk migrate** all `.trail.yaml` and `blaze.yaml` files once v2 is stable. +4. **Delete v1 parser** after migration is complete. + +### v1 → v2 Mapping + +| v1 | v2 | +| :--- | :--- | +| `- config:` (list item) | `trailhead:` (mapping key) | +| `- prompts:` (list item, multiple allowed) | `trail:` (mapping key, exactly one) | +| `- tools:` (list item, standalone top-level) | `- tools:` (entry in `setup`/`trail` lists) | +| `step:` + `recording: tools:` | `step:` + `recording:` | +| `verify:` | `step:` (with assertion tools) | +| `context:` (in config) | `systemPrompt:` (in trailhead) | +| multiple interleaved blocks | single `trailhead` + `trail` | + +## What changed + +**Positive:** +- Two clearly distinct sections — trailhead (starting point) and trail (the test) +- Setup as a checkpoint within trailhead enables recording iteration and deterministic setup replay +- `trailhead` semantically groups identity + config + setup as "everything about the starting point" +- Flat `recording` syntax (one fewer nesting level) +- File is a mapping — simpler than a list when there's one of each section +- Clear semantic boundary: `step` is source of truth, `recording` is ephemeral cache +- `tools` blocks preserved as a hand-authored deterministic primitive alongside `step` entries +- `setup` and `trail` share the same authoring model (mixed `step` and `tools`) +- Structured variable support via `memory` +- `systemPrompt` removes confusion about where context text is used +- Removing `verify` simplifies the model — one fewer concept to learn +- `tools` as a first-class primitive gives authors an escape hatch for deterministic sequences without forcing NL wrapping +- 
Foundation for future `mode`-based execution configuration +- Setup failure vs trail failure distinction improves test reporting + +**Negative:** +- All existing `.trail.yaml` and `blaze.yaml` files must be migrated (mitigated by try-catch fallback period) +- External tools/scripts that parse trail files need updating +- Two parsers coexist temporarily during migration diff --git a/docs/devlog/2026-03-09-agentic-dev-loop.md b/docs/devlog/2026-03-09-agentic-dev-loop.md new file mode 100644 index 00000000..31ab057c --- /dev/null +++ b/docs/devlog/2026-03-09-agentic-dev-loop.md @@ -0,0 +1,322 @@ +--- +title: "Agentic Development Loop" +type: decision +date: 2026-03-09 +--- + +# Trailblaze Decision 035: Agentic Development Loop + +## Context + +Mobile developers currently debug UI issues through a manual cycle: edit code, build, deploy, manually navigate to the screen, check the fix, repeat. This cycle is slow and breaks flow — the developer spends more time navigating to the right screen than actually thinking about the fix. + +Coding agents (Claude Code, Cursor, etc.) can already edit code and run builds autonomously. What they can't do is interact with the device — tap buttons, navigate screens, verify UI state. Trailblaze fills this gap via MCP. + +## Decision + +### Vision + +A coding agent autonomously: +1. Edits code to fix a bug or build a feature +2. Builds and deploys the app (the coding agent handles this natively) +3. Uses Trailblaze via MCP to control the device, test UI, and read results +4. Iterates until the fix is verified + +Trailblaze is the **hands and eyes for the device**. The coding agent is the **brain for the code**. They communicate via MCP. The coding agent talks to Trailblaze like a user — high-level goals, not individual taps. 
+ +### Architecture + +``` +Coding Agent (Claude Code) Trailblaze MCP Server +┌─────────────────────────┐ ┌──────────────────────────┐ +│ • Reads/writes code │ MCP │ • MultiAgentV3 handles │ +│ • Runs builds │◄──────────►│ multi-step navigation │ +│ • Sends goals: │ (STDIO) │ • Screen analysis (vision)│ +│ "navigate to settings"│ │ • Course correction │ +│ • Reads file paths for │ │ • Trail record/replay │ +│ logcat, screenshots │ │ • Crash detection │ +│ • Never sees screenshots│ │ • Logcat capture │ +│ or view hierarchies │ │ • Returns text summaries │ +└─────────────────────────┘ └──────────────────────────┘ +``` + +### Design Principles + +**1. MCP is the primary interface.** All device interaction goes through MCP. The CLI is for builds (coding agent handles natively) and batch test runs in CI. + +**2. Text-only MCP responses.** No screenshots, no view hierarchies in responses. This protects the coding agent's context window. Trailblaze absorbs all vision/UI data internally and returns natural language summaries. + +**3. File paths for raw data.** MCP responses include `sessionDir` pointing to logcat, screenshots, session logs. The coding agent reads these with its native file tools when needed (e.g., reading crash stack traces). + +**4. Goal-level interaction.** The coding agent says "navigate to account settings", not "tap menu, tap settings, tap account." MultiAgentV3 handles multi-step execution with course correction. + +**5. Blaze once, trail forever.** Setup navigation is explored once by AI (costs LLM). After that, it replays deterministically (free, instant). After every rebuild, the agent replays the setup trail to get back to the screen under test without AI cost. + +**6. Mode-aware behavior.** When `TRAILBLAZE_AS_AGENT`, `step()` uses MultiAgentV3 for multi-step goals. When `MCP_CLIENT_AS_AGENT`, `step()` keeps single-action behavior for clients that want fine-grained control. 
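
Principles 2, 3, and 6 can be sketched together as a small dispatch function. This is an assumption-laden illustration, not the actual `step()` implementation: `StepResponse`, the `step` signature, and the two runner lambdas are hypothetical, and the runners stand in for MultiAgentV3 and the single-action path respectively. Only the two mode names come from the text above.

```kotlin
// Mode names as described in principle 6.
enum class TrailblazeMcpMode { TRAILBLAZE_AS_AGENT, MCP_CLIENT_AS_AGENT }

// Hypothetical response shape: text summary plus a path, never raw screenshots
// or view hierarchies (principles 2 and 3).
data class StepResponse(
    val success: Boolean,
    val result: String,
    val sessionDir: String,
)

// Hypothetical dispatch: the runners are injected so the sketch stays self-contained.
fun step(
    goal: String,
    mode: TrailblazeMcpMode,
    sessionDir: String,
    runMultiStep: (String) -> String,
    runSingleAction: (String) -> String,
): StepResponse {
    val summary = when (mode) {
        // Trailblaze plans and executes the whole goal internally.
        TrailblazeMcpMode.TRAILBLAZE_AS_AGENT -> runMultiStep(goal)
        // The MCP client drives; each call performs one action.
        TrailblazeMcpMode.MCP_CLIENT_AS_AGENT -> runSingleAction(goal)
    }
    return StepResponse(success = true, result = summary, sessionDir = sessionDir)
}

fun main() {
    val response = step(
        goal = "navigate to settings",
        mode = TrailblazeMcpMode.TRAILBLAZE_AS_AGENT,
        sessionDir = "/path/to/logs/session_abc/",
        runMultiStep = { goal -> "Completed multi-step goal: $goal" },
        runSingleAction = { goal -> "Performed one action for: $goal" },
    )
    println(response.result)
}
```

Keeping the response to plain strings is what protects the coding agent's context window; anything heavier is reachable through `sessionDir` when the agent decides it needs it.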
+ +### The Dev Loop in Practice + +``` +Developer: "The withdraw button doesn't work after entering an amount" + +Coding Agent: + 1. Reads the relevant code, identifies the bug + 2. Fixes the code + 3. Runs: ./gradlew installDebug + 4. Calls: trail(RUN, "navigate-to-money-tab") ← instant replay, no AI + 5. Calls: step("enter $50 and tap Withdraw") ← AI navigates + 6. Calls: verify("withdrawal confirmation shown") ← AI checks + 7. Result: "Verification failed — error dialog shown: 'Invalid amount'" + 8. Reads: /logcat.log ← finds stack trace + 9. Fixes the bug based on the stack trace + 10. Repeats from step 3 +``` + +Steps 4-6 take seconds. The developer's fix-build-test cycle drops from minutes to seconds for the navigation portion. + +### MCP Tools + +| Tool | Purpose | Multi-step? | +|---|---|---| +| `device()` | Connect to devices (LIST/CONNECT/ANDROID/IOS) | No | +| `step()` | Execute a UI goal ("navigate to settings") | Yes — MultiAgentV3 | +| `verify()` | Assert something about the screen ("login button is visible") | No — single screen analysis | +| `ask()` | Question about screen state ("what error is shown?") | No — single screen analysis | +| `trail()` | Manage trails (START/SAVE/RUN/LIST/END) | No | +| `setAppTarget()` | Set or create the target app by package name | No | + +#### step() — Goal-Level Execution + +When `TrailblazeMcpMode == TRAILBLAZE_AS_AGENT`: + +``` +step("navigate to settings") +→ Builds single-objective YAML from the goal +→ Calls runYamlBlocking() with MULTI_AGENT_V3 +→ MultiAgentV3 runs multi-step: tap menu → tap Settings → done +→ Captures final screen as NL summary +→ Returns: { + "success": true, + "result": "Navigated to settings. Screen shows: Account, Notifications, Privacy.", + "sessionDir": "/path/to/logs/session_abc/" + } +``` + +#### MCP Response Format + +All tool responses include: + +| Field | Purpose | +|---|---| +| `success` | Did the action succeed? 
| +| `result` | NL summary of what happened and current screen state | +| `sessionDir` | Absolute path to session logs (logcat, screenshots, hierarchies) | +| `appState` | RUNNING, CRASHED, NOT_RESPONDING, NOT_RUNNING | + +The coding agent uses `sessionDir` to read raw data when needed: +- `/logcat.log` — crash stack traces +- `/screenshots/` — visual state (only when debugging Trailblaze itself) +- `/session.log` — detailed execution log + +### Trailhead: Setup as a Checkpoint + +> See [Decision 026: Trail YAML v2 Syntax](../../devlog/2026-03-06-trail-yaml-v2-syntax.md) for the full trailhead specification. + +The trailhead is the setup portion of a test — launch the app, sign in, navigate to the target screen. In the dev loop: + +1. **First time**: The agent blazes the setup ("launch app and navigate to Money tab"). Trailblaze explores via AI. +2. **Recording saved**: The trailhead steps get recorded as a trail. +3. **Every rebuild after**: `trail(RUN, "navigate-to-money-tab")` replays the setup instantly. No AI cost, deterministic. + +If the app's UI changes (new onboarding flow, redesigned nav), the trail breaks. The system falls back to re-blazing from the NL descriptions and saves a new recording. The developer doesn't need to intervene. + +### Recording Optimization + +> See [Decision 034: Recording Optimization Pipeline](../../devlog/2026-03-09-recording-optimization-pipeline.md) for the full specification. + +In the dev loop, recordings are a **cache**, not a commitment: + +- **One-shot post-processing**: after the first blaze, compute best-effort selectors and extract memory variables +- **If replay works**: saved an LLM call +- **If replay fails**: use data from both runs (blaze + failed replay) to refine selectors once +- **If still fails**: fall back to NL and keep going + +The trailhead recording is the most valuable to optimize — it's replayed dozens of times during a debugging session. 
Test steps may blaze every time since the code under test is changing. + +### Dynamic App Targets + +External developers create app targets on the fly: + +``` +setAppTarget(packageName="com.example.myapp", alias="myapp") +``` + +- If an existing target matches the package, switches to it +- Otherwise, creates a lightweight `DynamicAppTarget` with the package name +- Dynamic targets persist for the MCP session (not across restarts) +- Built-in app targets continue to work with their custom tools + +### Crash Detection and Recovery + +When the app crashes during a step: + +1. `step()` detects the crash via `ExceptionalScreenState` or process check +2. Response includes `appState: CRASHED` and `sessionDir` +3. The coding agent reads `/logcat.log` for the stack trace +4. The coding agent fixes the code, rebuilds, and replays the trailhead to get back to the crash point + +This closes the loop — the agent can autonomously detect crashes, read the cause, fix the code, and retry. + +### Documentation for Developers + +#### .mcp.json — Drop-in MCP Config + +```json +{ + "mcpServers": { + "trailblaze": { + "command": "./trailblaze", + "args": ["mcp"] + } + } +} +``` + +#### Agent Instructions Template (for CLAUDE.md) + +```markdown +## Mobile UI Testing with Trailblaze + +Trailblaze is connected as an MCP server for mobile device control. + +### Quick start +1. Connect: device(action=ANDROID) or device(action=IOS) +2. Set app: setAppTarget(packageName="com.example.myapp", alias="myapp") +3. Interact: step("navigate to the sign-in screen") +4. Verify: verify("the sign-in form is visible") +5. Ask: ask("what error message is shown?") + +### After code changes +1. Build: ./gradlew installDebug (or your build command) +2. Replay setup: trail(action=RUN, name="navigate-to-signin") +3. Test your change: step("enter email and tap Next") +4. Check results: verify("password screen appears") + +### Recording reusable setup +First time: +1. 
step("launch myapp and navigate to the target screen") +2. trail(action=SAVE, name="my-setup-trail") + +After that: trail(action=RUN, name="my-setup-trail") — instant, free + +### On failures +- Read sessionDir from the step() response for logcat and screenshots +- step() returns appState: CRASHED when the app crashes +- Read /logcat.log for crash stack traces + +### Guidelines +- Talk to Trailblaze like a user: "navigate to settings" not "tap menu icon" +- One goal at a time. Read the result before deciding the next step. +- Use trail(RUN) after rebuilds to restore state (free, no AI cost) +- Never plan multiple steps ahead — always base next action on current screen +``` + +## Implementation Phases + +### Phase 1: Enhanced step() — Goal-Level Execution (Critical) + +Make `step()` accept user-level goals and execute via MultiAgentV3 when in `TRAILBLAZE_AS_AGENT` mode. + +**Files:** +- `opensource/trailblaze-server/.../StepTool.kt` +- `opensource/trailblaze-host/.../TrailblazeMcpBridgeImpl.kt` + +**Approach:** Build single-objective YAML from the goal, call `runYamlBlocking()` with MultiAgentV3, capture final screen as NL summary. `verify()` and `ask()` remain single-action. + +### Phase 2: Enriched MCP Responses (Critical) + +Add `sessionDir` and `appState` to all MCP tool responses. + +**Files:** +- `opensource/trailblaze-server/.../StepTool.kt` — StepResult, VerifyResult, AskResult +- `opensource/trailblaze-server/.../TrailblazeMcpSessionContext.kt` + +**Approach:** Wire `LogsRepo.getSessionDir()` through to StepTool. Check `AdbCommandUtil.isAppRunning()` on errors to determine `appState`. + +### Phase 3: setAppTarget() — Dynamic App Targets (High) + +External developers create app targets by package name. 
+ +**Files:** +- `opensource/trailblaze-server/.../DeviceManagerToolSet.kt` +- `opensource/trailblaze-models/.../TrailblazeHostAppTarget.kt` +- `opensource/trailblaze-host/.../TrailblazeMcpBridgeImpl.kt` + +### Phase 4: YAML v2 Syntax with Trailhead (High) + +> See [Decision 026](../../devlog/2026-03-06-trail-yaml-v2-syntax.md) + +Implement the v2 YAML parser with `config`, `trailhead`, and `trail` sections. Mapping-based format, flat `recording` syntax, compact tool syntax. + +### Phase 5: Recording Optimization Pipeline (Medium) + +> See [Decision 034](../../devlog/2026-03-09-recording-optimization-pipeline.md) + +Raw capture during blazing, post-processing for selectors/slots/generalization, validation loop for test authoring, best-effort caching for dev loop. + +### Phase 6: Documentation (High) + +Ship `.mcp.json` example and agent instructions template so developers can start immediately. + +### Priority + +| Phase | What | Effort | Impact | +|---|---|---|---| +| **Phase 1** | Enhanced step() with MultiAgentV3 | Medium | **Critical** — enables goal-level interaction | +| **Phase 2** | sessionDir + appState in responses | Small | **Critical** — closes the feedback loop | +| **Phase 6** | Documentation | Small | **High** — unblocks developers immediately | +| **Phase 3** | setAppTarget() dynamic creation | Small | **High** — unblocks external developers | +| **Phase 4** | YAML v2 with trailhead | Medium | **High** — enables setup checkpoints | +| **Phase 5** | Recording optimization pipeline | Large | **Medium** — improves trail stability | + +Phases 1 + 2 + 6 are the MVP. Once those ship, a developer can add Trailblaze as an MCP server to their coding agent and run the full autonomous dev loop. 
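
With Phases 1 and 2 in place, a coding agent can decide its next move from the response fields alone. The following is a minimal Python sketch of that branching, assuming the response arrives as a dict with the `success`, `result`, `sessionDir`, and `appState` fields described above. The function name `next_action` and the action labels it returns are hypothetical and not part of Trailblaze.

```python
import os
from typing import Optional, Tuple


def next_action(response: dict) -> Tuple[str, Optional[str]]:
    """Decide what a coding agent might do after one tool call.

    Returns (action, path). `action` is an illustrative label only;
    `path` is a log file worth reading, or None when no log is needed.
    """
    session_dir = response.get("sessionDir", "")
    logcat = os.path.join(session_dir, "logcat.log") if session_dir else None

    # A crash takes priority over the success flag: the stack trace in
    # logcat is the most direct route to the cause.
    if response.get("appState") == "CRASHED":
        return ("read_logcat_then_fix", logcat)

    if response.get("success"):
        return ("continue", None)

    # The call failed but the app is still alive. The NL summary in
    # `result` describes the screen, so start from that text.
    return ("inspect_result", None)
```

The point of the design is that the agent only opens raw logs when `appState` says it has to, which keeps screenshots and hierarchies out of its context window.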
+ +## Key Files Reference + +| File | Purpose | +|---|---| +| `opensource/trailblaze-server/.../StepTool.kt` | MCP tools: step(), verify(), ask() | +| `opensource/trailblaze-server/.../TrailTool.kt` | Trail management: START/SAVE/RUN/LIST/END | +| `opensource/trailblaze-server/.../DeviceManagerToolSet.kt` | Device connection, app targets, runPrompt | +| `opensource/trailblaze-host/.../TrailblazeMcpBridgeImpl.kt` | Bridge: runYaml, runYamlBlocking, device selection | +| `opensource/trailblaze-models/.../TrailblazeMcpBridge.kt` | Bridge interface | +| `opensource/trailblaze-models/.../AgentImplementation.kt` | MULTI_AGENT_V3 enum | +| `opensource/trailblaze-models/.../TrailblazeMcpMode.kt` | TRAILBLAZE_AS_AGENT vs MCP_CLIENT_AS_AGENT | +| `opensource/trailblaze-models/.../TrailblazeHostAppTarget.kt` | App target abstract class | +| `opensource/trailblaze-models/.../ScreenAnalysis.kt` | ExceptionalScreenState (crash detection) | +| `opensource/trailblaze-agent/.../MultiAgentV3Runner.kt` | MultiAgentV3 implementation | +| `opensource/trailblaze-server/.../TrailblazeMcpSessionContext.kt` | Per-MCP-session state | +| `opensource/trailblaze-report/.../LogsRepo.kt` | Session log storage | +| `opensource/trailblaze-host/.../TrailblazeCli.kt` | CLI commands | +| `opensource/trailblaze-common/.../AdbCommandUtil.kt` | ADB shell commands, isAppRunning() | + +## Consequences + +**Positive:** +- Developers get autonomous fix-build-test cycles for mobile UI +- Setup navigation replays instantly after rebuilds (no AI cost) +- Crash detection with automatic logcat access closes the debugging loop +- Goal-level interaction protects the coding agent's context window +- Trail recordings make the dev loop faster with each iteration +- Same infrastructure supports manual recording, dev loop, and CI + +**Negative:** +- Depends on MCP support in the coding agent (Claude Code, Cursor) +- MultiAgentV3 execution adds latency to the first blaze of each goal +- `TRAILBLAZE_AS_AGENT` mode 
changes step() behavior — existing MCP clients need awareness +- Trail recordings can break when app UI changes significantly (mitigated by NL fallback) + +## Related Decisions + +- [Decision 026: Trail YAML v2 Syntax](../../devlog/2026-03-06-trail-yaml-v2-syntax.md) — trailhead, trail, config sections +- [Decision 034: Recording Optimization Pipeline](../../devlog/2026-03-09-recording-optimization-pipeline.md) — post-processing, selectors, memory slots diff --git a/docs/devlog/2026-03-09-recording-optimization-pipeline.md b/docs/devlog/2026-03-09-recording-optimization-pipeline.md new file mode 100644 index 00000000..7257de70 --- /dev/null +++ b/docs/devlog/2026-03-09-recording-optimization-pipeline.md @@ -0,0 +1,354 @@ +--- +title: "Recording Optimization Pipeline" +type: decision +date: 2026-03-09 +--- + +# Recording Optimization Pipeline + +Post-processing recorded trails to make them more reliable and concise. + +## Background + +When the AI blazes a test (explores UI via natural language), it produces a raw execution trace — XY coordinates, view hierarchies, screenshots, and memory state at each action. Currently, selectors are computed at runtime during the blaze, which can be inaccurate — the AI picks whatever works fastest (often text-based selectors or even XY coordinates) without considering long-term repeatability. + +This creates two problems: + +1. **Runtime selectors can be wrong.** The AI guesses a selector, it matches something slightly off, but the tap still works because the coordinates are right. The recording inherits the wrong selector. +2. **Recordings are brittle.** Hardcoded values, text-based selectors, and no variable extraction mean recordings break when data changes, UI shifts, or the test runs against different backend state. + +## What we decided + +### Separate Capture from Optimization + +**During blazing**: capture ground truth only — XY coordinates + full view hierarchy + screenshots + memory state at each action. 
Do not compute selectors at runtime. + +**After blazing**: a post-processing pipeline transforms raw capture data into optimized, stable recordings using full context from the execution. + +**Optionally before blazing**: a pre-processing step analyzes NL steps to identify memory slots, giving the AI awareness of named variables to capture. + +### Pipeline Architecture + +``` +NL Steps (authored by human or LLM) + │ + ▼ + Pre-Processing (optional) + Analyze NL → identify memory slots + │ + ▼ + Blazing (runtime) + AI executes, raw capture only + XY + hierarchy + screenshot + memory + │ + ▼ + Post-Processing + Selectors, slots, generalization (mode-aware) + │ + ▼ + Validation Loop (policy-dependent) + Replay → compare → refine → repeat + │ + ▼ + Stable Trail ✓ +``` + +### Raw Capture Format + +Each action during blazing captures: + +``` +{ + action: "tap", + coordinates: { x: 340, y: 720 }, + viewHierarchy: { ... }, // full snapshot at action time + screenshot: "path/to/img", // visual context + nlStep: "Tap Add to Cart", // what the AI was trying to do + memoryState: { ... }, // current memory at this point + timestamp: ... +} +``` + +This data already exists in session logs (except `memoryState`, which is easy to add). The raw capture is the **source of truth** that never changes. Post-processing is a lens applied to it — re-run with different settings without re-blazing. + +### Pre-Processing: Slot Analysis + +Before blazing, an LLM analyzes NL steps to identify memory slots: + +**Input:** +```yaml +trail: + - step: Note how many apples are in the cart + - step: Add 2 more apples + - step: Verify apple count increased by 2 +``` + +**Output:** +- Named slots: `appleCount` (captured from screen) +- Relationships: verification uses `appleCount + 2` +- AI instructions injected into system prompt: "You have a memory variable `appleCount`. When you observe the apple count on screen, call `memory.set("appleCount", value)` to store it." 
+ +**Two kinds of slots:** +- **Input slots** — values provided before the test (email, password). Seeded in `config.memory`. +- **Captured slots** — values read from screen at runtime. The AI uses `memory.set()` to store them. + +Pre-processing is optional. Without it, post-processing still extracts slots from the execution log. Pre-processing makes the AI aware of variable names upfront, producing cleaner recordings with meaningful names. + +### Post-Processing + +Post-processing transforms raw capture data into an optimized recording. It has four responsibilities: + +#### 1. Selector Computation + +Resolve XY coordinates to the best available selector using the view hierarchy: + +**Process:** +1. Resolve XY → element (find element whose bounds contain coordinates) +2. Walk up the selector ranking — pick highest-durability property that uniquely identifies the element +3. Validate uniqueness against full hierarchy +4. If not unique, combine properties or add parent context + +**Selector ranking (most to least durable):** + +| Selector type | Durability | Example | +|---|---|---| +| `id` | Best | `id: "add_to_cart_btn"` | +| `contentDescription` | Great | `contentDescription: "Add to cart"` | +| `type + parent context` | Good | `type: Button, parent: "#product-detail"` | +| `text` | Okay | `text: "Add to Cart"` | +| `class + index` | Fragile | `class: "CartButton", index: 2` | +| `xy coordinates` | Worst | `xy: [340, 720]` | + +Text-based selectors are what the AI naturally picks during blazing because they're human-readable. But they break with dynamic data, localization, or minor copy changes. The post-processor upgrades to structural selectors while the NL description preserves readability. + +#### 2. 
Slot Extraction + +Identify hardcoded values that should be variables: + +**Heuristics:** +- Strings in `inputText` calls → likely input slots (credentials, search terms) +- Values in both a `readText` and a later assertion → captured slots +- Values from `config.systemPrompt` that appear in tool calls → memory variables +- Repeated values across multiple steps → slot candidates + +**Process:** +1. Scan all tool calls for literal values +2. Group values by semantic role (using NL context) +3. Generate meaningful variable names (LLM call using NL descriptions) +4. Replace literals with `{{variableName}}` references +5. Populate `config.memory` with input slot values + +#### 3. Value Generalization + +Replace exact values with patterns where the intent is format, not value: + +| NL Intent | Raw | Generalized | +|---|---|---| +| "Verify a price is shown" | `equals: "$50.00"` | `matches: "\\$\\d+\\.\\d{2}"` | +| "Verify a date appears" | `equals: "March 9, 2026"` | `matches: "\\w+ \\d{1,2}, \\d{4}"` | +| "Verify item count shown" | `equals: "3 items"` | `matches: "\\d+ items"` | +| "Verify total is correct" | `equals: "$50.00"` | `equals: "{{expectedTotal}}"` | + +The decision between regex and expression depends on NL intent — does the test care about a specific computed value, or just that something of the right format appeared? + +#### 4. Expression Detection + +Identify mathematical or logical relationships between captured values: + +- AI read "5", later asserted "7", NL says "increased by 2" → `{{appleCount + 2}}` +- AI read "$25.00" twice, asserted "$50.00", NL says "total" → `{{price * quantity}}` + +### Selector Modes + +Different use cases want different selector strategies. 
Mode is set per-test or per-step: + +```yaml +config: + selectorMode: adaptive # default for whole test + +trail: + - step: Tap the exact submit button + selectorMode: strict # override for this step +``` + +| Mode | Behavior | Use case | +|---|---|---| +| **strict** | Exact match on id or unique property. Fail if not found. | Regression — must hit this exact element | +| **flexible** | Text or content description. Tolerate minor changes. | Smoke testing — verify the flow works | +| **adaptive** | Fallback chain: id → contentDescription → text → position | General purpose (default) | + +The mode controls how post-processing generates selectors from the same raw data. Re-run post-processing with a different mode to get different recordings without re-blazing. + +### Validation Loop + +After post-processing, validate the recording works by replaying it: + +``` +┌─→ Replay recording deterministically +│ │ +│ ▼ +│ Capture new run data (XY, hierarchies, screenshots) +│ │ +│ ▼ +│ Compare with blaze data: +│ - Did each selector resolve to the correct element? +│ - Same elements hit (compare bounds/properties)? +│ - Assertions produced same results? +│ - Memory slots captured expected values? +│ │ +│ ▼ +│ All matched? ──Yes──→ Trail is stable ✓ +│ │ +│ No +│ │ +│ ▼ +│ Refine using data from BOTH runs: +│ - Two sets of hierarchies to compare +│ - Identify what changed vs what's stable +│ - Pick selectors that work across both runs +│ - If can't stabilize after N iterations → recordable: false +│ │ +└──────┘ +``` + +**Exit criteria:** +- All steps passed on deterministic replay (not blaze) +- Every selector resolved to the correct element (validated by comparing bounds across runs) +- All memory slots populated correctly +- No XY fallbacks needed + +**Convergence failure:** If a step can't stabilize after N iterations (default 3), mark it `recordable: false`. The AI handles it every time. This is an honest answer rather than a flaky test. 
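
The validation loop and its convergence rule can be sketched as a bounded retry. This is a minimal Python illustration rather than the Trailblaze implementation: `replay`, `matches_blaze`, and `refine` are hypothetical callables standing in for deterministic replay, the comparison against blaze data, and selector refinement, and the sketch assumes "N iterations" means N replay attempts per step.

```python
from typing import Any, Callable, Dict

Step = Dict[str, Any]
RunData = Dict[str, Any]


def validate(
    step: Step,
    blaze_data: RunData,
    replay: Callable[[Step], RunData],
    matches_blaze: Callable[[RunData, RunData], bool],
    refine: Callable[[Step, RunData, RunData], Step],
    max_iterations: int = 3,
) -> Step:
    """Replay, compare, and refine until stable or out of attempts."""
    for attempt in range(max_iterations):
        replay_data = replay(step)
        if matches_blaze(blaze_data, replay_data):
            return {**step, "recordable": True}
        if attempt < max_iterations - 1:
            # Refinement sees both runs, so it can prefer selectors
            # that resolved to the same element in each of them.
            step = refine(step, blaze_data, replay_data)
    # Did not converge: leave this step to the AI on every run.
    return {**step, "recordable": False}
```

Passing the three operations in as callables keeps the policy (how many attempts, what counts as stable) separate from the mechanics, which is also what lets the dev loop reuse the same pieces with a single refinement pass.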
+ +### Workflow Policies + +The same infrastructure serves different workflows via different policies: + +| Workflow | Pre-process | Post-process | Validate | On failure | +|---|---|---|---|---| +| **Test authoring** | Full slot analysis | Full optimization | Loop until stable | Flag unstable steps | +| **Dev loop** | Skip | One-shot, best effort | If fails, refine once with both sessions | Fall back to NL | +| **CI regression** | N/A (done) | N/A (done) | N/A (done) | Re-blaze from NL, alert | + +#### Dev Loop Policy + +The trail is a **cache**, not a commitment. One-shot post-processing, try the replay — if it works, saved an LLM call. If it fails, you now have two runs of data (the blaze and the failed replay), so refine selectors once using both sessions. If that still fails, fall back to NL and keep moving. + +The trailhead trail is the most valuable to optimize — it's replayed dozens of times during debugging. Test steps may blaze every time since the code under test is changing. + +#### Test Authoring Policy + +Full pipeline — the recording will run thousands of times in CI. Pre-process for slots, full post-processing, validation loop until stable. Flag unstable steps. Measurable trail quality. + +### Memory Tools + +The AI uses recordable tools to read/write memory during blazing: + +- **`memory.set(name, value)`** — store a captured value. Recorded as `storeAs`. +- **`memory.get(name)`** — retrieve a stored value. Recorded as `{{name}}`. 
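
A minimal Python sketch of how stored values might be substituted into `{{...}}` references at replay time. The `Memory` class and its `render` method are hypothetical names, not the Trailblaze API. The sketch assumes captured values are stored as numbers whenever arithmetic is applied to them, and it supports only `+`, `-`, `*`, and `/`.

```python
import ast
import operator
import re

# Arithmetic operators the sketch is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_REF = re.compile(r"\{\{(.+?)\}\}")


class Memory:
    def __init__(self, initial=None):
        self._values = dict(initial or {})

    def set(self, name, value):
        self._values[name] = value

    def get(self, name):
        return self._values[name]

    def _eval(self, node):
        # Walk the parsed expression and allow only names, numeric
        # literals, and the binary operators listed in _OPS.
        if isinstance(node, ast.Name):
            return self.get(node.id)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](self._eval(node.left), self._eval(node.right))
        raise ValueError("unsupported expression")

    def render(self, template):
        """Replace every {{expression}} in the template with its value."""
        def replace(match):
            tree = ast.parse(match.group(1).strip(), mode="eval")
            return str(self._eval(tree.body))
        return _REF.sub(replace, template)
```

Parsing with `ast` instead of calling `eval` keeps the expression language small and rejects anything beyond variable names, numbers, and basic arithmetic.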
+ +In the recording: + +```yaml +- step: Note the current inventory count + recording: + - readText: + selector: "#inventory-count" + storeAs: inventoryCount + +- step: Verify inventory increased by 2 + recording: + - assertText: + selector: "#inventory-count" + equals: "{{inventoryCount + 2}}" +``` + +### Expression Support + +Recordings support expressions in `{{}}` template syntax: + +- Variable reference: `{{email}}` +- Arithmetic: `{{inventoryCount + 2}}`, `{{price * quantity}}` +- String interpolation: `"Hello {{firstName}}"` + +Expression evaluation happens at replay time after memory slots are populated. + +## Example: Before and After + +### Raw recording (from blaze) + +```yaml +trail: + - step: Sign in with the test account + recording: + - inputText: "alice@example.com" + - tap: "Next" + - inputText: "password123" + - tap: "Sign In" + - step: Note the current inventory count + recording: + - readText: "5" + - step: Add 2 items + recording: + - tap: "Add item" + - tap: "Add item" + - step: Verify inventory increased by 2 + recording: + - assertVisible: "7" +``` + +### After post-processing + +```yaml +config: + memory: + email: alice@example.com + password: password123 + +trail: + - step: Sign in with the test account + recording: + - inputText: "{{email}}" + - tap: + id: "next-button" + - inputText: "{{password}}" + - tap: + id: "sign-in-button" + - step: Note the current inventory count + recording: + - readText: + selector: + id: "inventory-count" + storeAs: inventoryCount + - step: Add 2 items + recording: + - tap: + id: "add-item-button" + - tap: + id: "add-item-button" + - step: Verify inventory increased by 2 + recording: + - assertText: + selector: + id: "inventory-count" + equals: "{{inventoryCount + 2}}" +``` + +Text selectors replaced with ids. Hardcoded credentials replaced with memory variables. Hardcoded inventory values replaced with captured slot + expression. 
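
The move from text selectors to ids in this example corresponds to the selector computation step: resolve the tap coordinates to an element, then take the most durable property that identifies it uniquely. Below is a minimal Python sketch under simplifying assumptions. The hierarchy is a flat list of dicts with a hypothetical `bounds` tuple of `(left, top, right, bottom)`, only three of the ranked selector types are considered, and choosing the smallest containing element is an assumption of the sketch rather than documented behavior.

```python
from typing import Dict, List, Optional

# Most to least durable; a subset of the ranking table.
RANKING = ["id", "contentDescription", "text"]


def _area(element: dict) -> int:
    left, top, right, bottom = element["bounds"]
    return (right - left) * (bottom - top)


def element_at(hierarchy: List[dict], x: int, y: int) -> Optional[dict]:
    """Return the smallest element whose bounds contain (x, y)."""
    hits = [
        e for e in hierarchy
        if e["bounds"][0] <= x <= e["bounds"][2]
        and e["bounds"][1] <= y <= e["bounds"][3]
    ]
    return min(hits, key=_area) if hits else None


def best_selector(hierarchy: List[dict], x: int, y: int) -> Dict[str, object]:
    """Pick the most durable property that is unique in the hierarchy."""
    target = element_at(hierarchy, x, y)
    if target is None:
        return {"xy": [x, y]}
    for prop in RANKING:
        value = target.get(prop)
        if not value:
            continue
        # Uniqueness is checked against the full hierarchy.
        if sum(1 for e in hierarchy if e.get(prop) == value) == 1:
            return {prop: value}
    # Nothing unique was found, so keep the raw coordinates.
    return {"xy": [x, y]}
```

A fuller version would combine properties or add parent context before falling back to coordinates, as the process list describes.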
+ +## What changed + +**Positive:** +- Selectors computed from ground truth (XY + hierarchy) rather than runtime guesses +- Recordings are templatized — work with different data, accounts, environments +- Same raw capture supports different selector modes without re-blazing +- Validation loop proves repeatability instead of hoping for it +- Progressive enhancement — start with simple post-processing, add sophistication over time +- Dev loop benefits from trails as cache without requiring perfection +- Unstable steps honestly flagged as `recordable: false` rather than producing flaky tests + +**Negative:** +- Post-processing adds time between blaze and usable recording +- Expression evaluation adds complexity to the replay engine +- Pre-processing requires an additional LLM call before blazing +- Selector ranking heuristics will need tuning based on real-world UI patterns +- Memory tools add to the AI's tool surface during blazing diff --git a/docs/devlog/2026-03-11-waypoints-and-app-navigation-graphs.md b/docs/devlog/2026-03-11-waypoints-and-app-navigation-graphs.md new file mode 100644 index 00000000..d7b7a722 --- /dev/null +++ b/docs/devlog/2026-03-11-waypoints-and-app-navigation-graphs.md @@ -0,0 +1,74 @@ +--- +title: "Waypoints and App Navigation Graphs" +type: decision +date: 2026-03-11 +--- + +# Waypoints and App Navigation Graphs + +## Context + +Today, every trail figures out navigation from scratch. If ten trails need to get from the home screen to Settings, the AI navigates there ten times and ten recordings each encode their own copy of that path. When the app changes, all ten break independently. + +What's missing is **structural knowledge about the app** — a reusable map of where you can be and how to get between places. The building blocks exist (`blaze`, `ask`, recording optimization, template substitution, AI fallback), but there's no structural layer on top. 
+ +Consider "Set a 7am weekday alarm": steps 1-2 are **navigation** (launch app, go to Alarm tab), steps 3-4 are **task execution** (create alarm, verify). These are fundamentally different concerns but interleaved in a single trail with no separation. This causes redundant exploration, redundant recordings, no spatial reasoning, and brittle composition. + +## Decision + +### Core Concepts + +**Waypoints** — Named, assertable locations within an app. Defined by identity (e.g., `clock:alarm-tab`), structural assertions (which elements are present/absent/selected), and optional captures (observable values like alarm count). A waypoint is not a screenshot — if the set of available user actions changes, it's a different waypoint. If only content changes, it's the same waypoint. State qualifiers handle different interaction modes (e.g., `clock:stopwatch:idle` vs `clock:stopwatch:running`). + +**Edges** — Short, recorded trail segments moving between exactly two waypoints. Defined by from/to waypoints, tool call steps, and optional variables. Edges are unidirectional — `alarm-tab → settings` (tap menu) is separate from `settings → alarm-tab` (press Back). Edges can be parameterized with `{{variables}}` via template substitution (Decision 024). + +**Navigation Graph** — Directed graph of all waypoints and edges for an app, stored as a `.nav.yaml` file. Multi-hop navigation becomes **pathfinding** — "get from Cities to Stopwatch" resolves to `cities-list → clock-tab → stopwatch:idle` by replaying two edges sequentially. No LLM needed. + +**Edges as Discoverable Skills** — Edges carry metadata (name, description, variables) making them discoverable skills the agent can look up and invoke. A `create-alarm` edge with `alarm_time` and `repeat_days` variables is a skill: the LLM finds it, fills in variables, and it executes deterministically. + +**Trail-to-Trail References** — Trails declare `startAt:` and `endAt:` waypoint references. 
The execution engine navigates to the starting waypoint using graph edges, then runs only the task-specific steps. Intermediate `checkpoint:` waypoints provide validation and recovery points. + +### Implementation Steps + +Each step depends on the one before it: + +1. **Waypoint Schema and File Format** — Define YAML schema for waypoints and nav graph files. Hand-author an example for a simple app. Open questions: assertion types, mapping to TrailblazeNode model, naming conventions, file location. + +2. **Waypoint Assertion Engine** — `checkWaypoint(waypoint, screenState) → WaypointMatch` resolving assertions against the existing view hierarchy model. Foundation for everything else. + +3. **"Where Am I?" Screen Identification** — `identifyCurrentWaypoint(graph, screenState) → WaypointMatch?` checking all waypoints and returning the best match. More specific waypoints (more assertions) win over less specific ones. + +4. **Edge Recording** — Recording mode: assert `from` waypoint, record navigation steps, assert `to` waypoint. Integrates with existing session recording, post-processed through recording optimization (Decision 034) and variable extraction (Decision 024). + +5. **Edge Playback and Validation** — Execute a recorded edge: assert `from`, replay steps, assert `to`. Validation mode runs every edge in a graph to produce a pass/fail report. AI fallback (Decision 021) can optionally attempt recovery on step failure. + +### Future Phases + +With steps 1-5, the system has all primitives. 
Built on top: + +- **Graph Pathfinding** — BFS shortest path between waypoints; enables `startAt:` in trail files +- **Automated Exploration** — AI-driven exploration agent discovers waypoints and edges, human reviews +- **Graph Maintenance** — Failed edges trigger localized re-exploration; track volatile app areas +- **Compositional Trail Authoring** — Trails focused purely on task logic; navigation fully separated + +## What changed + +**Positive:** +- Navigation knowledge captured once, reused across all trails +- Multi-hop navigation becomes deterministic pathfinding — no LLM needed +- Trail recordings become shorter, focused on task logic +- Navigation failures fixed in one place, not across every trail +- Graph provides structural app map for coverage analysis + +**Negative:** +- Initial graph creation requires exploration time per app +- Graph maintenance is a new ongoing cost +- State explosion for complex apps — heuristics needed for what differences matter +- Waypoint assertions require tuning (too strict = false negatives, too loose = false positives) + +## Related Decisions + +- Decision 002: Trail Recording Format — edges are small trail recordings +- Decision 021: AI Fallback — recovery when edge playback fails +- Decision 024: Recording Memory Template Substitution — waypoint captures feed edge variables +- Decision 034: Recording Optimization Pipeline — edge steps go through same optimization diff --git a/docs/devlog/2026-03-15-mcp-stdio-http-proxy-architecture.md b/docs/devlog/2026-03-15-mcp-stdio-http-proxy-architecture.md new file mode 100644 index 00000000..fea84d2c --- /dev/null +++ b/docs/devlog/2026-03-15-mcp-stdio-http-proxy-architecture.md @@ -0,0 +1,70 @@ +--- +title: "MCP STDIO-to-HTTP Proxy for Development" +type: devlog +date: 2026-03-15 +--- + +# MCP STDIO-to-HTTP Proxy for Development + +## Summary + +Designed and implemented a lightweight STDIO-to-HTTP proxy (`trailblaze mcp-proxy`) that decouples the MCP client connection from the 
Trailblaze daemon process. This lets developers restart the daemon for code changes without losing the MCP client connection. + +## The Problem + +Every code change to the MCP server requires a full rebuild and restart. In the current architecture, the STDIO MCP server runs in-process — when it restarts, the stdin/stdout pipe breaks and the MCP client (Claude Desktop, Cursor, etc.) disconnects. MCP clients don't implement reconnection. You have to manually re-add the connection every time. This makes iterative MCP development painfully slow. + +We considered whether Streamable HTTP transport (already implemented) would help, but it's the same problem from the client's perspective — the client loses its session and doesn't know how to recover. + +## The Design + +``` +MCP Client <-- STDIO (stable) --> Proxy <-- HTTP (reconnects) --> Daemon +``` + +The proxy is a long-lived process that: +1. Accepts STDIO from the MCP client (the connection that must not break) +2. Forwards all JSON-RPC to the daemon's `POST /mcp` endpoint +3. Holds a `GET /mcp` SSE connection for daemon-to-client notifications +4. When the daemon dies, queues requests and retries until it comes back +5. On reconnect, replays the `initialize` handshake and `device()` connect call +6. Sends `notifications/tools/list_changed` to the client so it re-fetches tools + +There are two sessions: Client-to-Proxy (never breaks) and Proxy-to-Daemon (breaks and reconnects). The client only sees the first one. + +## Key Decisions + +**Raw HTTP forwarding, not SDK-level proxying.** The Kotlin MCP SDK (v0.8.3) has no built-in proxy mechanism. Its `Server` class interprets messages (parses JSON-RPC, routes to tool handlers). A proxy should forward raw bytes without interpreting them. This makes it simpler, more resilient, and future-proof — it doesn't break when new MCP methods are added. The proxy uses only `java.net.HttpURLConnection` with zero Trailblaze server dependencies. 
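
The queue-and-retry behavior (item 4 in the design list) and the setup replay on reconnect (item 5) can be illustrated without any networking. This is a minimal Python sketch, not the logic in `McpProxyCommand.kt`: `send` is a hypothetical callable that posts one raw JSON-RPC message to the daemon and raises `ConnectionError` when the daemon is down, and `setup_messages` stands in for the recorded `initialize` and `device()` calls. The `tools/list_changed` notification is left out.

```python
from collections import deque
from typing import Callable, Deque, List


class Forwarder:
    def __init__(self, send: Callable[[str], str], setup_messages: List[str]):
        self._send = send
        self._setup = list(setup_messages)
        self._pending: Deque[str] = deque()
        self._connected = True

    def forward(self, message: str) -> List[str]:
        """Queue one client message, then try to deliver the queue."""
        self._pending.append(message)
        return self.flush()

    def flush(self) -> List[str]:
        """Deliver queued messages in order and stop at the first failure."""
        responses: List[str] = []
        try:
            if not self._connected:
                # The daemon lost its session on restart, so replay the
                # setup calls first. Their responses are not forwarded.
                for setup in self._setup:
                    self._send(setup)
                self._connected = True
            while self._pending:
                responses.append(self._send(self._pending[0]))
                # Dequeue only after a successful send so nothing is lost.
                self._pending.popleft()
        except ConnectionError:
            self._connected = False
        return responses
```

Messages are treated as opaque strings throughout, which mirrors the raw-forwarding decision: the forwarder never needs to know which MCP method it is carrying. A real proxy would also call `flush()` on a timer so that queued requests drain once the daemon returns.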
+ +**Session replay, not session persistence.** When the daemon restarts, all session state is lost (device connections, claims, cached screen state). Rather than persisting state, the proxy replays the setup calls: `initialize` + `notifications/initialized` + the last `device()` connect. Session logs from previous work are durable in the logs repo, so mid-trail work is recoverable. + +**Separate command for now, unification later.** The proxy lives as `trailblaze mcp-proxy`. The long-term plan is for `trailblaze mcp` to become the proxy internally — it would auto-start the daemon if none is running (via `ensureServerRunning()`), then proxy to it. This matches how `trailblaze run` already works as a client of the daemon. For now, keeping them separate avoids touching the existing STDIO code path. + +**Never auto-kill the daemon.** When the proxy exits (client disconnects), it does not shut down the daemon — even if it could have started one. This avoids the edge case where multiple proxy instances share a daemon and one exiting kills it for the others. `trailblaze stop` is the explicit cleanup. + +## Dead Ends Considered + +- **Hot-reloading the JVM server** — too complex for a Kotlin/Gradle project, JVM startup cost makes this impractical. +- **Having MCP clients reconnect natively** — they don't, and we can't change them. +- **Using SDK transport primitives to build the proxy** — the SDK's `Server`/`Client` classes parse and interpret messages, which is the opposite of what a proxy wants. Lower-level transport wiring would have been more complex than raw HTTP forwarding for no benefit. 
+ +## Development Workflow + +```bash +# Terminal 1 (proxy — start once, stays running) +trailblaze mcp-proxy + +# Terminal 2 (daemon — restart freely for code changes) +trailblaze + +# After code changes: +trailblaze stop +./gradlew :trailblaze-host:classes +trailblaze +# Proxy reconnects automatically, MCP client doesn't notice +``` + +## Implementation + +- `McpProxyCommand.kt` — the proxy command, registered as `trailblaze mcp-proxy` +- Added to `TrailblazeCliCommand` subcommands in `TrailblazeCli.kt` diff --git a/docs/devlog/2026-03-17-ios-trailblaze-node-detail.md b/docs/devlog/2026-03-17-ios-trailblaze-node-detail.md new file mode 100644 index 00000000..a7e29fb5 --- /dev/null +++ b/docs/devlog/2026-03-17-ios-trailblaze-node-detail.md @@ -0,0 +1,77 @@ +--- +title: "iOS TrailblazeNode Support via IosMaestro" +type: devlog +date: 2026-03-17 +--- + +# iOS TrailblazeNode Support via IosMaestro + +## Summary + +Added `DriverNodeDetail.IosMaestro` so iOS view hierarchies get preserved as typed data in `TrailblazeNode` trees instead of being flattened into the LCD `ViewHierarchyTreeNode` model. This populates `ScreenState.trailblazeNodeTree` for iOS (previously always `null`), enabling future selector generation that can match on `className`, separate text fields, and boolean states — things `TrailblazeElementSelector` can't do. 
+ +## What Changed + +- **New variant:** `DriverNodeDetail.IosMaestro` — same shape as `AndroidMaestro` plus `visible` and `ignoreBoundsFiltering` (iOS-specific filtering flags) +- **New match type:** `DriverNodeMatch.IosMaestro` + `TrailblazeNodeSelector.iosMaestro` field +- **Two conversion paths:** `TreeNode.toTrailblazeNodeIosMaestro()` (Maestro fallback) and `ViewHierarchyTreeNode.toIosMaestroTrailblazeNode()` (Square custom hierarchy) +- **Backward compat adapter:** `TrailblazeNode.toViewHierarchyTreeNode()` for all 5 driver types +- **Wired into drivers:** `HostMaestroDriverScreenState` and `SquareTrailblazeIosDriver` both populate `trailblazeNodeTree` +- **39 new tests** covering conversion, serialization, resolver matching, and round-trip compat + +## Key Decisions + +### One iOS variant, not two + +The original plan had `DriverNodeDetail.IosSquare` (Square custom hierarchy) and `DriverNodeDetail.IosMaestro` (Maestro fallback) as separate variants. We dropped `IosSquare` after realizing the Square on-device service serializes to `ViewHierarchyTreeNode` — the same ~18 properties Maestro captures. There's no fidelity gain from having a separate type. If the on-device service later sends richer data (accessibility traits, custom UIKit properties), we can reintroduce a dedicated variant then. + +### Remove non-native iOS boolean properties from matchable set + +`clickable`, `enabled`, and `checked` don't exist natively on iOS — Maestro infers or defaults them. Including them in `MATCHABLE_PROPERTIES` would let the selector generator produce selectors against values that are guesses, not ground truth. Removed all three from the matchable set and from `DriverNodeMatch.IosMaestro`. The `DriverNodeDetail.IosMaestro` data class still *carries* these values (for display/LLM context) but they're marked display-only and can't appear in recorded selectors. 
Only `text`, `resourceId`, `accessibilityText`, `className`, `hintText`, `focused`, and `selected` are matchable — all properties iOS actually provides. + +### Keep AndroidMaestro and IosMaestro separate (don't merge into one "Maestro" type) + +Even though the schemas are nearly identical (just `visible` and `ignoreBoundsFiltering` extra on iOS), keeping them separate means we can remove unreliable properties from one platform without affecting the other. iOS `checked` is the concrete case: it is only Maestro's best guess from accessibility traits, so we removed it from `IosMaestro` matching without touching `AndroidMaestro`. + +### Platform-based inspector badges, not driver-based + +Changed the UI inspector badges from `"a11y"` / `"maestro"` / `"ios-maestro"` to `"android"` / `"ios"` / `"web"` / `"compose"`. Consumers don't need to know about Maestro internals — it caused confusion about Trailblaze's relationship to Maestro. + +### Selector generation stubs, not implementations + +The generator returns `emptyList()` for `IosMaestro` — strategy implementations are Phase 4A work. The resolver and match types are fully functional, so once strategies land, recording and playback via `TrailblazeNodeSelectorResolver` will work end-to-end. + +## Dead Ends + +### Heterogeneous tree builder + +Initially built `buildHeterogeneousTrailblazeNodeTree()` that walked the custom hierarchy, detected system placeholders, and replaced them with Maestro-sourced `TrailblazeNode` subtrees. Realized this duplicated the merge that `mergeHierarchies()` already performs at the `ViewHierarchyTreeNode` level. Deleted it and just call `mergedHierarchy.toIosMaestroTrailblazeNode()` on the already-merged result. + +## Known Gap + +`HostMaestroDriverScreenState` builds its `stableTrailblazeNodeTree` from the raw Maestro `TreeNode` (before custom hierarchy merge). 
When `SquareTrailblazeIosDriver` is active, its `lastTrailblazeNodeTree` has richer merged data, but the host module can't read it due to module boundaries (trailblaze-host can't depend on uitests-block). Fixing this needs a callback/provider pattern or shared interface — future work. + +## Future Work + +Plan saved in `.agents/knowledge/ios-trailblaze-node-phase4-plan.md`. Priority order: + +1. **Selector generation strategies for IosMaestro** — enables recording with `className`, separate text fields, boolean states +2. **Element-to-node selector bridge** — converts old `TrailblazeElementSelector` recordings to `TrailblazeNodeSelector` for playback via the new resolver +3. **TrailblazeNode-aware filtering** — replaces `ViewHierarchyFilter` for the TrailblazeNode path +4. **TrailblazeNode compact formatter** — replaces `ViewHierarchyCompactFormatter` for LLM context +5. **Migrate ElementMatcherUsingMaestro** — final unification on `TrailblazeNodeSelectorResolver` + +## Files + +| File | Action | +|------|--------| +| `DriverNodeDetail.kt` | Add `IosMaestro` variant | +| `TrailblazeNodeSelector.kt` | Add `iosMaestro` field + `DriverNodeMatch.IosMaestro` | +| `TrailblazeNodeSelectorGenerator.kt` | Stub branches + `buildStructuralMatch`/`buildTargetMatch` | +| `TrailblazeNodeSelectorResolver.kt` | `matchesIosMaestro()` | +| `InspectTrailblazeNodeComposable.kt` | Display branches + `IosMaestroProperties` + platform badges | +| `TrailblazeNodeMapperMaestro.kt` | **NEW** — `TreeNode.toTrailblazeNodeIosMaestro()` | +| `TrailblazeNodeMapperIosMaestro.kt` | **NEW** — `ViewHierarchyTreeNode.toIosMaestroTrailblazeNode()` | +| `TrailblazeNodeCompat.kt` | **NEW** — backward compat adapter | +| `HostMaestroDriverScreenState.kt` | Populate `trailblazeNodeTree` for iOS | +| `SquareTrailblazeIosDriver.kt` | `lastTrailblazeNodeTree` at all return paths | diff --git a/docs/devlog/2026-03-17-mcp-api-redesign-and-ios-fixes.md b/docs/devlog/2026-03-17-mcp-api-redesign-and-ios-fixes.md 
new file mode 100644 index 00000000..bc5f7b0c --- /dev/null +++ b/docs/devlog/2026-03-17-mcp-api-redesign-and-ios-fixes.md @@ -0,0 +1,84 @@ +--- +title: "MCP API Redesign: verify→blaze, Mode Defaults, iOS launchApp Fix" +type: devlog +date: 2026-03-17 +--- + +# MCP API Redesign: verify→blaze, Mode Defaults, iOS launchApp Fix + +## Summary + +A session of design discussion and bug fixes covering: collapsing `verify()` into `blaze(hint="VERIFY")`, removing the standalone `verify` MCP tool, making `TRAILBLAZE_AS_AGENT` vs `MCP_CLIENT_AS_AGENT` mode configurable with flavor-appropriate defaults, and fixing iOS `launchApp` failures on system apps. + +--- + +## What Changed (Already Landed) + +### Session progress UI: child tool blocks inside objectives +- **Problem:** In MCP mode, tool logs (e.g. `launchApp`) arrive *after* `ObjectiveCompleteLog` due to fire-and-forget timing. They were rendering as a separate sibling row rather than inside the objective's expanded section. +- **Fix:** `buildProgressItems` (SessionProgressHelpers.kt) now gathers `toolsBetween` before emitting the `ObjectiveItem` so a `ToolBlockItem` always follows its parent. The composable (`SessionProgressComposable.kt`) was updated to pass the child tool block into `ObjectiveStepRow` and render it inside the `AnimatedVisibility` expanded section. Sibling rendering with `padding(start=32.dp)` was removed. +- **Files:** `opensource/trailblaze-ui/src/commonMain/kotlin/xyz/block/trailblaze/ui/tabs/session/SessionProgressComposable.kt`, `SessionProgressHelpers.kt` + +### iOS launchApp on system apps +- **Problem:** `launchApp(appId="com.apple.mobilecal", launchMode=REINSTALL)` always failed on iOS system apps because `clearState=true` triggers `simctl erase`/uninstall, which is prohibited for system-defined apps. The exception was thrown before the actual launch. 
+- **Fix:** `Orchestra.kt` now catches `clearAppState` and `setPermissions` failures individually, logs warnings, and proceeds to `maestro.launchApp()` rather than aborting. This makes `launchApp` resilient for system apps without breaking user app behavior. +- **File:** `opensource/trailblaze-android/src/main/java/xyz/block/trailblaze/android/maestro/orchestra/Orchestra.kt` + +--- + +## What Changed (Landed in Follow-up) + +### Collapse `verify()` into `blaze(hint="VERIFY")` + +**Decision:** Drop the standalone `verify()` MCP tool. `toolHint` is the right abstraction for "what kind of tools should the inner agent use" — no need for a separate tool. + +**Vocabulary:** +- `blaze(goal, hint="VERIFY")` → read-only assertion tools, recorded as `VerificationStep`, returns `passed: Boolean?` +- `blaze(goal)` → interactive, recorded as `DirectionStep` +- `ask(question)` → pure vision analysis, not recorded (unchanged) + +**What landed in `StepToolSet.kt`:** +- `isVerify = toolHint?.uppercase()?.trim() == "VERIFY"` detected at start of `blaze()` +- `RecommendationContext.hint` set to "Verify this assertion using read-only tools only. Do not tap, swipe, or type." for verify mode +- Early returns (`objectiveAppearsAchieved`, `objectiveAppearsImpossible`) return `passed = true/false` when `isVerify` +- `promptStep` is `VerificationStep(verify = goal)` vs `DirectionStep(step = goal)` depending on `isVerify` +- `RecordedStepType.VERIFY` used for recording instead of `STEP` +- `passed: Boolean? = null` added to `StepResult` +- `VerifyResult` data class removed +- `"verify"` removed from `McpToolProfile.MINIMAL_TOOL_NAMES` +- `McpRealDeviceIntegrationTest` updated to use `blaze("...", hint="VERIFY")` + +### Mode default configurable per build flavor + +**Decision:** `TRAILBLAZE_AS_AGENT` stays the internal default; OSS CLI gets `MCP_CLIENT_AS_AGENT`. 
+ +**What landed:** +- `var defaultMode: TrailblazeMcpMode = TrailblazeMcpMode.TRAILBLAZE_AS_AGENT` added to `TrailblazeMcpServer` alongside `defaultToolProfile` +- Both session creation sites in `TrailblazeMcpServer` now pass `mode = defaultMode` +- `TrailblazeCli.kt` sets `app.trailblazeMcpServer.defaultMode = TrailblazeMcpMode.MCP_CLIENT_AS_AGENT` for both HTTP and direct STDIO transports + +--- + +## Design Context (for future reference) + +### Two-tier tool architecture +- **Session level:** `toolProfile=MINIMAL` sets the baseline for the whole session (what the outer MCP client sees) +- **Call level:** `toolHint` on `blaze()` overrides for a single call (what Trailblaze's inner agent can use) +- `VERIFY` hint = OBSERVATION + VERIFICATION tools (read-only, no tap/swipe/type) +- `NAVIGATION` hint = MINIMAL + launchApp/openUrl/scroll + +### `ask()` stays as a distinct tool +Kept separate because it has no device interaction at all — pure vision analysis of a screenshot. Returns information, not pass/fail. `blaze(toolHint=VERIFY)` can use tools and returns pass/fail. The naming asymmetry (`blaze` = imperative, `ask` = interrogative) is intentional and reflects the different character of the operation. + +### MCP session timeline view (future work, not started) +A `TRAILBLAZE_AS_AGENT` MCP session currently re-uses the objective/trail timeline view, which feels forced for interactive blaze sessions. Discussed: +- Each `blaze()` call → renders like an objective row (goal + tools + screenshots) +- `verify` calls (now blaze+hint) → lighter assertion row +- `ask()` calls → minimal annotation or hidden +- "Save as trail step" affordance on blaze rows +- This requires detecting session type from logs (`McpAgentRunLog` presence) and rendering differently from trail sessions. Not yet started. + +### `TrailblazeMcpMode` context +- `MCP_CLIENT_AS_AGENT`: Client is the agent, Trailblaze exposes primitives. Recommended for OSS. 
+- `TRAILBLAZE_AS_AGENT`: Trailblaze is the agent, client sends goals via `blaze`/`ask`. This is the currently active usage, with the `MINIMAL` toolset passed as a start argument. +- Mode is already configurable at runtime via `config(action=SET, key="mode", value=...)`. The missing piece is just the default. diff --git a/docs/devlog/2026-03-20-on-device-screenshot-memory-optimization.md b/docs/devlog/2026-03-20-on-device-screenshot-memory-optimization.md new file mode 100644 index 00000000..a5cc5bdb --- /dev/null +++ b/docs/devlog/2026-03-20-on-device-screenshot-memory-optimization.md @@ -0,0 +1,72 @@ +--- +title: "Screenshot Format Optimization (WebP Everywhere)" +type: devlog +date: 2026-03-20 +--- + +# Screenshot Format Optimization (WebP Everywhere) + +## Summary + +Switched all screenshot encoding to WebP across every platform (Android, JVM host, Playwright) +by fixing a PNG encoding bug and adding Skia-based WebP encoding on JVM. Measured ~4x +reduction in screenshot sizes at equivalent visual quality. + +## The Problem + +On-device remote device farm runs were hitting OOM during long agent sessions. Screenshots were a major +memory contributor due to two bugs: + +1. **Encoding bug**: `bitmap.toByteArray()` defaulted to PNG (lossless, ~100 KB per screenshot) + even though the config specified JPEG. Both `AndroidOnDeviceUiAutomatorScreenState` and + `AccessibilityServiceScreenState` had this bug. + +2. **Missing scaling**: `AccessibilityServiceScreenState` had no `ScreenshotScalingConfig` at + all — full 1080x1920 device resolution, PNG encoded. + +## Measured CI Results + +Compared screenshots from the same test (`creditCardSignature` accessibility) across builds: + +| Config | Format | Resolution | Avg size/screenshot | +| :--- | :--- | :--- | :--- | +| PNG bug (before) | PNG | 1080x1920 | **101.6 KB** | +| Low-res test | WebP | 910x512 | **12.8 KB** | +| Full-res WebP (final) | WebP | 768x1365 | **25.1 KB** | + +The format change (PNG → WebP) provided the ~4x reduction.
Resolution reduction added +another ~2x but degraded LLM quality, so we kept full 1536x768 resolution. + +## What We Changed + +| Change | Impact | +| :--- | :--- | +| Fix encoding bug in both Android ScreenState classes | PNG → WebP, ~4x smaller | +| Apply `ScreenshotScalingConfig` to `AccessibilityServiceScreenState` | Was missing entirely | +| Add `WEBP` to `TrailblazeImageFormat` + `ImageFormatDetector` | Framework-wide WebP support | +| WebP encoding on JVM host via Skia (`BufferedImageUtils`) | Host screenshots now WebP too | +| WebP encoding in Playwright via Skiko | Consistent output everywhere | +| Unify `DEFAULT` and `ON_DEVICE` configs to WebP 1536x768 | Single config, no special cases | +| Extract `scaleAndEncode()` / `annotateScreenshotBytes()` into `AndroidBitmapUtils` | Shared code, -34 lines | + +### Platform encoding matrix + +| Platform | Encoder | +| :--- | :--- | +| Android instrumentation | `Bitmap.CompressFormat.WEBP` / `WEBP_LOSSY` | +| Android accessibility | Same | +| JVM host (Maestro driver) | Skia via Skiko (bundled with Compose Desktop) | +| Playwright | Skia via Skiko | +| WASM/browser (decoding only) | Native browser WebP support | + +### Memory analysis + +- **Stored bytes**: ~25 KB WebP per screenshot in memory (was ~100 KB PNG) +- **Peak bitmap during annotation**: 768x1365 = ~4.0 MB (was 1080x1920 = ~7.9 MB) +- **Double lossy compression**: Android annotation path encodes clean → decodes → annotates → + re-encodes. Minimal quality impact at 80%. Host path avoids this by annotating before scaling. + +## Also Fixed + +- `List.removeFirst()` → `removeAt(0)` in `PromptStepStatus.kt`. Java 21+ API not available + on Android runtime. Caused ~31 test failures across 3 CI steps. 
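The peak-bitmap figures in the memory analysis above can be reproduced with a short calculation. This is a sketch under one assumption that this entry does not state explicitly: 4 bytes per pixel, as for an `ARGB_8888` bitmap. The helper names are illustrative.

```kotlin
// Assumption: 4 bytes per pixel, as for an ARGB_8888 bitmap.
const val BYTES_PER_PIXEL = 4

/** Uncompressed size of a decoded bitmap, in bytes. */
fun bitmapBytes(width: Int, height: Int): Long =
    width.toLong() * height * BYTES_PER_PIXEL

/** Converts a byte count to mebibytes for comparison with the quoted figures. */
fun toMiB(bytes: Long): Double = bytes / (1024.0 * 1024.0)
```

With these helpers, `toMiB(bitmapBytes(768, 1365))` comes to roughly 4.0 and `toMiB(bitmapBytes(1080, 1920))` to roughly 7.9, matching the numbers quoted for the scaled and full-resolution cases.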
diff --git a/docs/devlog/2026-03-20-on-device-screenshot-optimization.md b/docs/devlog/2026-03-20-on-device-screenshot-optimization.md new file mode 100644 index 00000000..29e1a737 --- /dev/null +++ b/docs/devlog/2026-03-20-on-device-screenshot-optimization.md @@ -0,0 +1,138 @@ +--- +title: "Screenshot Format Optimization (WebP Everywhere)" +type: decision +date: 2026-03-20 +--- + +# Screenshot Format Optimization (WebP Everywhere) + +Optimizing screenshot capture for memory-constrained on-device execution. + +## Background + +On-device remote device farm execution runs the Trailblaze agent in-process on the +Android device, where memory is heavily constrained. OOM crashes were observed during +long-running agent sessions. Investigation revealed two issues: + +1. **Screenshots encoded as PNG regardless of config.** Both `AndroidOnDeviceUiAutomatorScreenState` + and `AccessibilityServiceScreenState` called `bitmap.toByteArray()` with default parameters + (PNG, quality 100) even though `ScreenshotScalingConfig` specified JPEG at 80%. PNG is + lossless and produces byte arrays ~4x larger than lossy formats for typical UI screenshots. + +2. **`AccessibilityServiceScreenState` had no scaling at all.** It captured at full device + resolution (1080x1920) with no `ScreenshotScalingConfig`, while the instrumentation driver + at least applied dimension scaling (though with the wrong format). 
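The first issue is a default-parameter trap, and a stripped-down sketch may make the shape of the bug easier to see. Everything below is a hypothetical stand-in: the real code works with `android.graphics.Bitmap` and `ScreenshotScalingConfig`, which are not reproduced here, and `describeEncoding` only reports which settings would be used.

```kotlin
// Hypothetical stand-ins for illustration only; not the Trailblaze types.
enum class ImageFormat { PNG, JPEG, WEBP }

data class ScalingConfig(
    val imageFormat: ImageFormat = ImageFormat.JPEG,
    val compressionQuality: Int = 80,
)

/** Reports the encoding that would be applied; the defaults mirror the buggy helper. */
fun describeEncoding(
    format: ImageFormat = ImageFormat.PNG,
    quality: Int = 100,
): String = "$format@$quality"

// Buggy call shape: a config exists but is never passed, so the defaults win.
fun encodeIgnoringConfig(config: ScalingConfig): String = describeEncoding()

// Fixed call shape: format and quality come from the config.
fun encodeUsingConfig(config: ScalingConfig): String =
    describeEncoding(config.imageFormat, config.compressionQuality)
```

Because the buggy call still compiles and still produces a valid image, nothing fails loudly; the only symptom is larger byte arrays, which is consistent with the problem surfacing as OOM rather than as a test failure.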
+ +### LLM Provider Image Tiling Analysis + +The 1536x768 default was chosen to optimize across all three LLM vision providers: + +| Provider | Tiling Mechanism | 1536x768 Cost | +| :--- | :--- | :--- | +| OpenAI | Scales shortest side to 768px, tiles into 512x512 squares | For phone screenshots (~2.2:1 aspect), internally upscaled to ~768x1706 → 8 tiles | +| Anthropic | Proportional to pixel count, long edge capped at 1568px | ~1,600 tokens (fits under 1568px auto-downscale threshold) | +| Google Gemini | Tiles at 768x768, each tile = 258 tokens | 2 tiles (516 tokens) | + +**1536x768** sits at the optimal boundary — fits under Anthropic's 1568px limit, matches +OpenAI's tile grid, and is exactly 2 Google tiles. + +### Image Format Comparison + +Theoretical estimates vs **measured CI data** (accessibility test suite, phone screenshot capture): + +| Format | Estimated | Measured (768x1365 phone screenshot) | +| :--- | :--- | :--- | +| PNG (was actual default due to bug) | ~50-280 KB | **avg 101.6 KB** (baseline, main branch) | +| WebP 80% (final) | ~20-50 KB | **avg 25.1 KB** (after fix) | + +**Measured ~4x reduction** from PNG → WebP at the same 1536x768 resolution. 
+ +### WebP Platform Support + +WebP is universally supported across all platforms in the Trailblaze stack: + +| Platform | Encoding | Decoding | +| :--- | :--- | :--- | +| Android (API 28+) | `Bitmap.CompressFormat.WEBP` (API 14+) / `WEBP_LOSSY` (API 30+) | Native | +| JVM host (Compose Desktop) | Skia via Skiko (already bundled) | Skia / Coil3 | +| Playwright (JVM) | Skia via Skiko | Skia | +| WASM (browser) | N/A (browser handles) | Native browser support (97%) | +| LLM providers | N/A | OpenAI, Anthropic, Google all accept `image/webp` | + +### Resolution Decision + +We initially tried 1024x512 to further reduce memory, but CI data showed: +- The format change (PNG → WebP) was the dominant win (~4x) +- Resolution reduction added ~2x more savings but degraded LLM quality +- Accessibility tests at 1024x512 had worse pass rates than main + +Since the format fix alone provides sufficient memory savings, we kept full 1536x768 +resolution to preserve LLM vision quality. + +## What we decided + +### 1. WebP as the universal default + +`ScreenshotScalingConfig.DEFAULT` uses WebP 80% at 1536x768 for all platforms. `ON_DEVICE` +is an alias for `DEFAULT` — there is no longer a separate on-device config since the format +is now the same everywhere. + +### 2. Fix encoding bug + +Both `AndroidOnDeviceUiAutomatorScreenState` and `AccessibilityServiceScreenState` now use +the config's `imageFormat` and `compressionQuality` instead of hardcoded PNG defaults. + +### 3. WebP encoding on all JVM platforms via Skia + +Added `WEBP` to `TrailblazeImageFormat`. JVM host code (`BufferedImageUtils`, +`PlaywrightScreenState`) encodes WebP via Skia (`org.jetbrains.skia.Image.encodeToData`), +which is already bundled with Compose Desktop via Skiko. No new dependencies — Skiko is +made explicit in `trailblaze-playwright`'s `build.gradle.kts` but was already in the +transitive dependency tree. + +### 4. 
Shared bitmap helpers (Android) + +Extracted `scaleAndEncode()` and `annotateScreenshotBytes()` into `AndroidBitmapUtils` to +eliminate duplicated bitmap pipeline code between the two Android ScreenState implementations. +The shared `annotateScreenshotBytes()` includes OOM diagnostics via `MemoryDiagnostics`. + +### 5. Memory profile during annotation + +The `annotatedScreenshotBytes` path decodes the already-scaled `_screenshotBytes` (WebP, +~25 KB) back to a bitmap, draws set-of-mark overlays, and re-encodes to WebP. This means: + +- **Peak bitmap memory**: one 768x1365 ARGB_8888 bitmap = ~4.0 MB (down from ~7.9 MB at + full 1080x1920 before scaling was applied) +- **Double lossy compression**: the clean screenshot is WebP-encoded, then decoded for + annotation and re-encoded. At 80% quality, one extra round-trip produces minimal artifacts. + The old PNG path had zero degradation (lossless), but the memory savings justify the + tradeoff — set-of-mark bounding boxes and labels are high-contrast and survive compression + artifacts well. +- **Stored bytes**: both `screenshotBytes` (~25 KB) and `annotatedScreenshotBytes` (~25-40 KB) + are WebP. Only the compressed byte arrays remain in memory; bitmaps are recycled immediately. + +### 6. Filename detection + +Screenshot filenames use `ImageFormatDetector.detectFormat()` which inspects the actual byte +content (RIFF/WEBP magic numbers), not the config. This ensures correct `.webp` file +extensions regardless of which encoding path produced the bytes. 
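Content-based detection of this kind usually comes down to checking a few leading bytes. The sketch below is an assumption about the general technique, not the actual `ImageFormatDetector` code; `sniffImageExtension` is a hypothetical name. It relies on the standard signatures: WebP files start with `RIFF`, a 4-byte size, then `WEBP`; PNG starts with `89 50 4E 47 0D 0A 1A 0A`; JPEG starts with `FF D8 FF`.

```kotlin
// Hypothetical sketch of content-based detection; not the ImageFormatDetector source.
fun sniffImageExtension(bytes: ByteArray): String? {
    // True when the unsigned bytes starting at `offset` equal `expected`.
    fun hasAt(offset: Int, vararg expected: Int): Boolean =
        bytes.size >= offset + expected.size &&
            expected.indices.all { (bytes[offset + it].toInt() and 0xFF) == expected[it] }

    return when {
        // WebP: "RIFF" <4-byte chunk size> "WEBP"
        hasAt(0, 'R'.code, 'I'.code, 'F'.code, 'F'.code) &&
            hasAt(8, 'W'.code, 'E'.code, 'B'.code, 'P'.code) -> "webp"
        // PNG signature
        hasAt(0, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A) -> "png"
        // JPEG: SOI marker FF D8 followed by the FF of the next marker
        hasAt(0, 0xFF, 0xD8, 0xFF) -> "jpg"
        else -> null
    }
}
```

Deriving the extension from the bytes rather than the config means a file cannot end up mislabeled if some code path still encodes with a different format.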
+ +## What changed + +**Positive:** + +- ~4x measured reduction in screenshot byte array size everywhere (PNG → WebP) +- ~2x reduction in peak bitmap memory during on-device annotation (768x1365 vs 1080x1920) +- Consistent WebP output across all platforms — no special cases or fallbacks +- Smaller CI artifacts for storage cost savings +- Shared bitmap code between Android ScreenState implementations +- No code changes needed at call sites — the default config handles it +- Full LLM vision quality preserved (same 1536x768 resolution, WebP ≥ JPEG quality) + +**Negative:** + +- Double lossy compression on the Android annotation path (encode clean → decode → annotate + → encode). Minimal impact at 80% quality on high-contrast set-of-mark overlays. + The host annotation path avoids this by annotating the full-res BufferedImage before scaling. +- Uses deprecated `Bitmap.CompressFormat.WEBP` on Android API 28-29 (suppressed warning). + Unavoidable until minSdk is raised to 30. diff --git a/docs/devlog/index.md b/docs/devlog/index.md new file mode 100644 index 00000000..a5bf578c --- /dev/null +++ b/docs/devlog/index.md @@ -0,0 +1,32 @@ +# Devlog + +This is a chronological record of decisions and development notes for the Trailblaze project. Each entry captures a moment in time — what we were thinking, what we decided, and why. + +Entries tagged as **Decision** record significant architectural or technical choices. Other entries are development notes that capture implementation details, debugging sessions, and lessons learned. + +## Index + + +> *Auto-generated. 
Do not edit manually.* + +| Date | Title | Type | +| :--- | :--- | :--- | +| 2026-03-20 | [Screenshot Format Optimization (WebP Everywhere)](2026-03-20-on-device-screenshot-optimization.md) | Decision | +| 2026-03-17 | [MCP API Redesign: verify→blaze, Mode Defaults, iOS launchApp Fix](2026-03-17-mcp-api-redesign-and-ios-fixes.md) | Devlog | +| 2026-03-17 | [iOS TrailblazeNode Support via IosMaestro](2026-03-17-ios-trailblaze-node-detail.md) | Devlog | +| 2026-03-15 | [MCP STDIO-to-HTTP Proxy for Development](2026-03-15-mcp-stdio-http-proxy-architecture.md) | Devlog | +| 2026-03-09 | [Recording Optimization Pipeline](2026-03-09-recording-optimization-pipeline.md) | Decision | +| 2026-03-06 | [Trail YAML v2 Syntax](2026-03-06-trail-yaml-v2-syntax.md) | Decision | +| 2026-03-04 | [TrailblazeNode — Type-Safe Driver-Specific View Hierarchy](2026-03-04-trailblaze-node-view-hierarchy.md) | Decision | +| 2026-01-29 | [Device-Specific Trail Recordings](2026-01-29-device-specific-trail-recordings.md) | Decision | +| 2026-01-28 | [Logging and Reporting Architecture](2026-01-28-logging-and-reporting.md) | Decision | +| 2026-01-28 | [Kotlin as Primary Language](2026-01-28-kotlin-language.md) | Decision | +| 2026-01-28 | [Koog Library for LLM Communication](2026-01-28-koog-llm-client.md) | Decision | +| 2026-01-28 | [Custom Tool Authoring](2026-01-28-custom-tool-authoring.md) | Decision | +| 2026-01-28 | [Handwritten Agent Loop](2026-01-28-agent-loop-implementation.md) | Decision | +| 2026-01-14 | [Tool Naming Convention](2026-01-14-tool-naming-convention.md) | Decision | +| 2026-01-01 | [Tool Execution Modes](2026-01-01-tool-execution-modes.md) | Decision | +| 2026-01-01 | [Maestro as Current Execution Backend](2026-01-01-maestro-integration.md) | Decision | +| 2025-10-01 | [Trail Recording Format (YAML)](2025-10-01-trail-recording-format.md) | Decision | +| 2025-10-01 | [LLM as Compiler Architecture](2025-10-01-llm-as-compiler.md) | Decision | + diff --git a/docs/mcp/index.md 
b/docs/mcp/index.md index f5a00d08..d3c6cb29 100644 --- a/docs/mcp/index.md +++ b/docs/mcp/index.md @@ -300,7 +300,7 @@ This enables: - **Custom replanning logic** - Your client decides when to retry or try alternatives - **Cost optimization** - Trailblaze uses a cheap vision model for screen analysis -See the migration guide under `docs/decisions/009_migration_guide.md` for details. +See the [Kotlin language decision](../devlog/2026-01-28-kotlin-language.md) for details on migration. ## Troubleshooting diff --git a/examples/android-sample-app-uitests/build.gradle.kts b/examples/android-sample-app-uitests/build.gradle.kts index df8bb6fa..d8904ef8 100644 --- a/examples/android-sample-app-uitests/build.gradle.kts +++ b/examples/android-sample-app-uitests/build.gradle.kts @@ -57,11 +57,11 @@ dependencies { // --------------------------------------------------------------------------- // generateSampleAppTests // Scans ../android-sample-app/trails/android-ondevice-instrumentation/**/*.trail.yaml -// and writes a JUnit test class so the sample-app trails can run on Test Farm. +// and writes a JUnit test class so the sample-app trails can run on a remote device farm. 
// Usage: ./gradlew :examples:android-sample-app-uitests:generateSampleAppTests // --------------------------------------------------------------------------- tasks.register("generateSampleAppTests") { - description = "Generate JUnit instrumentation tests from trail YAML files for Test Farm" + description = "Generate JUnit instrumentation tests from trail YAML files for remote device farm" group = "trailblaze" val trailsDir = file("../android-sample-app/trails/android-ondevice-instrumentation") diff --git a/trailblaze-common/src/jvmAndAndroid/kotlin/xyz/block/trailblaze/util/GitUtils.kt b/trailblaze-common/src/jvmAndAndroid/kotlin/xyz/block/trailblaze/util/GitUtils.kt index 1acb9344..bdff44bf 100644 --- a/trailblaze-common/src/jvmAndAndroid/kotlin/xyz/block/trailblaze/util/GitUtils.kt +++ b/trailblaze-common/src/jvmAndAndroid/kotlin/xyz/block/trailblaze/util/GitUtils.kt @@ -18,6 +18,11 @@ object GitUtils { null } + /** + * Returns the git repository root directory, or null if not in a git repo. + * Callers must handle null gracefully — release/binary builds may not be in a git repo + * and should fall back to configured or default paths (e.g., ~/.trailblaze/logs). + */ fun getGitRootViaCommand(): String? 
= runGitCommand("rev-parse", "--show-toplevel") // Helper to get git root directory diff --git a/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/CliConfigHelper.kt b/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/CliConfigHelper.kt index 966bce5e..a1fb39fe 100644 --- a/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/CliConfigHelper.kt +++ b/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/CliConfigHelper.kt @@ -5,6 +5,7 @@ import xyz.block.trailblaze.devices.TrailblazeDevicePlatform import xyz.block.trailblaze.devices.TrailblazeDriverType import xyz.block.trailblaze.logs.client.TrailblazeJson import xyz.block.trailblaze.mcp.AgentImplementation +import xyz.block.trailblaze.model.TrailblazeHostAppTarget import xyz.block.trailblaze.ui.TrailblazePortManager import xyz.block.trailblaze.ui.TrailblazeDesktopUtil import xyz.block.trailblaze.ui.models.TrailblazeServerState.SavedTrailblazeAppConfig @@ -53,6 +54,16 @@ val CONFIG_KEYS: Map = listOf( get = { config -> config.llmModel }, set = { config, value -> config.copy(llmModel = value) }, ), + ConfigKey( + name = "app", + description = "Target app for device connections and custom tools", + validValues = "App target ID (e.g., square, cash, none)", + get = { config -> config.selectedTargetAppId ?: "not set" }, + set = { config, value -> + val targetId = if (value.equals(TrailblazeHostAppTarget.DefaultTrailblazeHostAppTarget.id, ignoreCase = true)) null else value.lowercase() + config.copy(selectedTargetAppId = targetId) + }, + ), ConfigKey( name = "agent", description = "Agent implementation", diff --git a/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/McpProxy.kt b/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/McpProxy.kt index e7ec24d5..c8df7298 100644 --- a/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/McpProxy.kt +++ b/trailblaze-host/src/main/java/xyz/block/trailblaze/cli/McpProxy.kt @@ -94,6 +94,9 @@ class McpProxy( // Whether the proxy is shutting down private val 
shutdownRequested = AtomicBoolean(false) + // The daemon process we launched (if any) -- prevents double-launching + private val daemonProcess = AtomicReference<Process?>(null) + fun run(): Int { System.setProperty("java.awt.headless", "true") @@ -211,8 +214,15 @@ class McpProxy( /** * Start the daemon in headless mode. + * Skips if a previously launched daemon process is still alive. */ private fun startDaemon(log: (String) -> Unit) { + val existing = daemonProcess.get() + if (existing != null && existing.isAlive) { + log("Daemon process still starting -- skipping duplicate launch.") + return + } + val launcher = findLauncher() if (launcher == null) { log("Cannot auto-start daemon: trailblaze launcher not found. Start it manually with: trailblaze app") @@ -230,7 +240,8 @@ class McpProxy( } pb.redirectOutput(ProcessBuilder.Redirect.DISCARD) pb.redirectError(ProcessBuilder.Redirect.DISCARD) - pb.start() + val process = pb.start() + daemonProcess.set(process) log("Daemon process launched.") } catch (e: Exception) { log("Failed to start daemon: ${e.message}") diff --git a/trailblaze-host/src/main/java/xyz/block/trailblaze/host/TrailblazeHostYamlRunner.kt b/trailblaze-host/src/main/java/xyz/block/trailblaze/host/TrailblazeHostYamlRunner.kt index ae744c45..6dc40eaa 100644 --- a/trailblaze-host/src/main/java/xyz/block/trailblaze/host/TrailblazeHostYamlRunner.kt +++ b/trailblaze-host/src/main/java/xyz/block/trailblaze/host/TrailblazeHostYamlRunner.kt @@ -68,6 +68,7 @@ import xyz.block.trailblaze.toolcalls.TrailblazeToolResult import xyz.block.trailblaze.toolcalls.TrailblazeToolSet import xyz.block.trailblaze.toolcalls.toolName import xyz.block.trailblaze.tracing.TrailblazeTraceExporter +import xyz.block.trailblaze.ui.TrailblazeDesktopUtil import xyz.block.trailblaze.ui.TrailblazeDeviceManager import xyz.block.trailblaze.util.Console import xyz.block.trailblaze.util.GitUtils @@ -116,7 +117,9 @@ object TrailblazeHostYamlRunner { client = loggingRule.trailblazeLogServerClient, 
isServerAvailable = true, // Host runner always has a server running writeToDisk = { traceJson -> - val logsDir = File(GitUtils.getGitRootViaCommand(), "logs") + val gitRoot = GitUtils.getGitRootViaCommand() + val logsDir = if (gitRoot != null) File(gitRoot, "logs") + else File(TrailblazeDesktopUtil.getDefaultAppDataDirectory(), "logs") val sessionDir = File(logsDir, sessionId.value) sessionDir.mkdirs() File(sessionDir, "trace.json").writeText(traceJson) diff --git a/trailblaze-host/src/main/java/xyz/block/trailblaze/host/rules/HostTrailblazeLoggingRule.kt b/trailblaze-host/src/main/java/xyz/block/trailblaze/host/rules/HostTrailblazeLoggingRule.kt index dfec7c9e..4932753f 100644 --- a/trailblaze-host/src/main/java/xyz/block/trailblaze/host/rules/HostTrailblazeLoggingRule.kt +++ b/trailblaze-host/src/main/java/xyz/block/trailblaze/host/rules/HostTrailblazeLoggingRule.kt @@ -7,6 +7,7 @@ import xyz.block.trailblaze.logs.client.TrailblazeScreenStateLog import xyz.block.trailblaze.logs.model.SessionId import xyz.block.trailblaze.report.utils.LogsRepo import xyz.block.trailblaze.rules.TrailblazeLoggingRule +import xyz.block.trailblaze.ui.TrailblazeDesktopUtil import xyz.block.trailblaze.ui.TrailblazePortManager import xyz.block.trailblaze.util.GitUtils @@ -46,7 +47,10 @@ class HostTrailblazeLoggingRule( explicitLogsDir?.let { return it } val gitRoot = GitUtils.getGitRootViaCommand() - return File(gitRoot, "logs") + if (gitRoot != null) return File(gitRoot, "logs") + + // Release/binary builds: use ~/.trailblaze/logs as the default logs directory + return File(TrailblazeDesktopUtil.getDefaultAppDataDirectory(), "logs") } } } diff --git a/trailblaze-host/src/main/java/xyz/block/trailblaze/ui/MainTrailblazeApp.kt b/trailblaze-host/src/main/java/xyz/block/trailblaze/ui/MainTrailblazeApp.kt index f81e847a..c7909692 100644 --- a/trailblaze-host/src/main/java/xyz/block/trailblaze/ui/MainTrailblazeApp.kt +++ 
b/trailblaze-host/src/main/java/xyz/block/trailblaze/ui/MainTrailblazeApp.kt @@ -74,6 +74,15 @@ import java.awt.GraphicsEnvironment import java.awt.Window +/** + * When true, always show the main window on macOS instead of starting minimized to tray. + * This improves discoverability of the app vs being hidden in the menu bar. + * + * Note: When --headless is passed (e.g. via `trailblaze mcp`), the window is always hidden + * regardless of this flag — headless mode only starts the daemon/server with a tray icon. + */ +private const val ALWAYS_SHOW_WINDOW = true + private data class WindowStateSnapshot( val width: Int, val height: Int, @@ -146,9 +155,9 @@ class MainTrailblazeApp( application { val currentServerState by trailblazeSavedSettingsRepo.serverStateFlow.collectAsState() - // Window visibility state - starts hidden in headless mode, visible otherwise - // Closing the window hides it instead of quitting (can reopen from tray) - var windowVisible by remember { mutableStateOf(!headless) } + // Window visibility state - starts hidden in headless mode, visible otherwise. + // Headless mode always keeps the window hidden; otherwise ALWAYS_SHOW_WINDOW sets the initial visibility. + var windowVisible by remember { mutableStateOf(if (headless) false else ALWAYS_SHOW_WINDOW) } // Track the AWT window for bringing to front var awtWindow by remember { mutableStateOf(null) } diff --git a/trailblaze-models/src/commonMain/kotlin/xyz/block/trailblaze/api/ScreenshotScalingConfig.kt b/trailblaze-models/src/commonMain/kotlin/xyz/block/trailblaze/api/ScreenshotScalingConfig.kt index f6341e80..b2d68a14 100644 --- a/trailblaze-models/src/commonMain/kotlin/xyz/block/trailblaze/api/ScreenshotScalingConfig.kt +++ b/trailblaze-models/src/commonMain/kotlin/xyz/block/trailblaze/api/ScreenshotScalingConfig.kt @@ -31,7 +31,7 @@ data class ScreenshotScalingConfig( val DEFAULT = ScreenshotScalingConfig() /** - * Alias for [DEFAULT]. On-device in-process execution (e.g.
Android Test Farm) uses the + * Alias for [DEFAULT]. On-device in-process execution (e.g. remote device farm) uses the * same config as host — WebP at 1536x768. Kept as a named constant for clarity at call * sites where the on-device context matters. */ diff --git a/trailblaze-report/src/main/java/xyz/block/trailblaze/report/snapshot/SnapshotMetadata.kt b/trailblaze-report/src/main/java/xyz/block/trailblaze/report/snapshot/SnapshotMetadata.kt index 7e8ec99b..79351f33 100644 --- a/trailblaze-report/src/main/java/xyz/block/trailblaze/report/snapshot/SnapshotMetadata.kt +++ b/trailblaze-report/src/main/java/xyz/block/trailblaze/report/snapshot/SnapshotMetadata.kt @@ -43,7 +43,7 @@ data class SnapshotMetadata( // Extract filename from screenshotFile (could be a filename or a full URL) val screenshotPath = snapshotLog.screenshotFile val screenshotFileName = if (screenshotPath.startsWith("http://") || screenshotPath.startsWith("https://")) { - // Extract filename from URL (e.g., from S3 URL after ATF upload) + // Extract filename from URL (e.g., from S3 URL after device farm upload) // URL format: https://...?key=...%2Ffilename.png val keyParam = screenshotPath.substringAfter("key=", "") if (keyParam.isNotEmpty()) {