68 changes: 68 additions & 0 deletions .claude/skills/devlog/SKILL.md
---
name: devlog
description: Write or update a devlog entry in the devlog directory. Use when the user asks to write a devlog, record a decision, document what happened, or says "write up what we did".
---

# Devlog

Write or update a devlog entry in the devlog directory.

## Devlog Format

Devlog entries are development journal posts that capture decisions, discoveries, and plans as work happens. They're written for the team — concise, honest, and useful for future reference.

**Filename:** `YYYY-MM-DD-<topic-slug>.md`

Use today's date and a short kebab-case topic slug.

## Entry Structure

```markdown
---
title: "Short Descriptive Title"
type: devlog
date: YYYY-MM-DD
---

# Title

## Summary
1-3 sentences on what this entry covers.

## <Sections as needed>
Use whatever sections make sense for the content. Common ones:
- What Changed
- Key Decisions (and rationale)
- What We Learned
- Open Questions
- Future Work

Keep it direct. No filler. Write like you're explaining to a teammate who will read this in 3 months.
```

## Front Matter Fields

| Field | Required | Values |
| :--- | :--- | :--- |
| `title` | Yes | Short descriptive title |
| `type` | Yes | `decision` (architectural/technical choice) or `devlog` (development note) |
| `date` | Yes | `YYYY-MM-DD` format |

Use `type: decision` when recording a significant architectural or technical choice. Use `type: devlog` for development notes, debugging sessions, and implementation details.

## Guidelines

- **Be opinionated.** Capture *why* decisions were made, not just what happened.
- **Include the dead ends.** What didn't work and why is often more valuable than what did.
- **Link to context.** Reference PRs, branches, test names, file paths — make it traceable.
- **One entry per topic.** Don't combine unrelated work. Multiple entries on the same day are fine.

## Before Writing

1. Check existing devlog entries to avoid duplicating a topic
2. If updating an existing topic, consider appending to the existing entry rather than creating a new one
3. Review the current conversation context for decisions, discoveries, and rationale worth capturing

## Invocation

When the user says `/devlog`, ask what topic to write about if it's not clear from context. If you've been working on something substantial in the current session, suggest writing about that.
64 changes: 64 additions & 0 deletions docs/devlog/2025-10-01-llm-as-compiler.md
---
title: "LLM as Compiler Architecture"
type: decision
date: 2025-10-01
---

# LLM as Compiler Architecture

The core architectural insight behind Trailblaze — treating the LLM as a compiler rather than a chatbot.

## Background

Traditional UI test frameworks require developers to write explicit, imperative test code. We want to enable natural language test authoring while maintaining deterministic execution.

## What we decided

Trailblaze treats the **LLM as a compiler** that transforms natural language test cases into deterministic tool sequences.

### The Compiler Metaphor

```
Natural Language  →  LLM + Agent + Tools  →  Trail Recording
    (Source)             (Compiler)            (Output/IR)
```

| Concept | Traditional Compiler | Trailblaze |
| :--- | :--- | :--- |
| Source | Code (.c, .kt) | Natural language test steps |
| Compiler | gcc, kotlinc | LLM + Trailblaze Agent |
| IR/Output | Assembly, bytecode | Trail YAML (tool sequence) |
| Runtime | CPU, JVM | Device + Maestro/Tools |

### Compilation Flow

```
Test Case Steps  →  LLM interprets steps  →  Execute tools on device
       ↓                     ↓                        ↓
Natural Language    Agent orchestration        Success/Failure
                             ↓                        ↓
                    On failure: retry        Record successful run
                    with context             as .trail.yaml
```

### Key Properties

- **Compilation happens once**: First successful run is recorded
- **Replay is deterministic**: Subsequent runs use recording, no LLM needed
- **Self-healing on failure**: LLM can adapt and retry when UI changes
- **Recompilation on demand**: Force AI mode to generate new recording

### Agent Loop

1. LLM receives test step + current screen state
2. LLM selects and invokes tools
3. Tools execute via Maestro/device drivers
4. On success → record tool invocation
5. On failure → provide error context, retry
6. After all steps → save complete `.trail.yaml`
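The loop above can be sketched in code. This is an illustrative Python sketch, not the real Trailblaze agent: `llm`, `execute_tool`, and the retry policy are hypothetical stand-ins, and screen-state capture is omitted for brevity.

```python
def compile_step(step: str, llm, execute_tool, max_retries: int = 3) -> list:
    """Run one natural-language step through the LLM, returning recorded tool calls."""
    error_context = None
    for _ in range(max_retries):
        recorded = []
        # 1-2: the LLM sees the step (plus any prior error) and selects tool invocations.
        for name, args in llm(step, error_context):
            ok, error = execute_tool(name, args)  # 3: execute via device driver
            if not ok:
                error_context = error             # 5: keep error context for the retry
                break
            recorded.append({name: args})         # 4: record the successful invocation
        else:
            return recorded                       # every tool in this step succeeded
    raise RuntimeError(f"Step failed after {max_retries} attempts: {step}")
```

After all steps succeed, the accumulated recordings would be written out as the `.trail.yaml` file (step 6).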

## What changed

**Positive:** Natural language authoring, deterministic replay, self-healing capability, familiar mental model for engineers.

**Negative:** Initial "compilation" requires LLM (cost/latency); recordings may need "recompilation" when UI changes significantly.
174 changes: 174 additions & 0 deletions docs/devlog/2025-10-01-trail-recording-format.md
---
title: "Trail Recording Format (YAML)"
type: decision
date: 2025-10-01
---

# Trail Recording Format (YAML)

Building on our monorepo structure, we needed a format for recording UI test interactions.

## Background

Trailblaze uses an LLM to interpret natural language test steps and execute them. We need a way to capture successful executions as **deterministic recordings** that can replay without LLM involvement, ensuring consistency and reducing costs.

## What we decided

Trail recordings use a **YAML format** (`.trail.yaml`) that captures the mapping from natural language steps to tool invocations.

### Format Structure

```yaml
- prompts:
- step: Launch the app signed in with user@example.com
recording:
tools:
- app_ios_launchAppSignedIn:
email: user@example.com
password: "12345678"
- step: Add a pizza to the cart and click 'Review sale'
recording:
tools:
- scrollUntilTextIsVisible:
text: Pizza
direction: DOWN
- tapOnElementWithAccessibilityText:
accessibilityText: Pizza
- tapOnElementWithAccessibilityText:
accessibilityText: Review sale 1 item
- step: Verify the total is correct
recordable: false # Always uses AI, never replays from recording
```

### Step-Level Recordability

Each step has a `recordable` flag (default: `true`):
- **`recordable: true`**: Step can be recorded and replayed deterministically
- **`recordable: false`**: Step always requires AI interpretation, even in recorded mode

Use `recordable: false` for steps that need dynamic behavior (e.g., verification steps that should re-evaluate on each run).

> **Note:** This is separate from tool-level `isRecordable` (see [Tool Execution Modes](2026-01-01-tool-execution-modes.md)).

### Key Properties

- **Human-readable**: YAML is easy to inspect, edit, and version control
- **Deterministic**: Recordings replay exactly the same tool sequence
- **Step-aligned**: Each natural language step maps to its tool invocations
- **Platform-specific**: Trails are stored per platform/device (e.g., `ios-iphone.trail.yaml`)

### Storage Convention

Trails are organized by test case hierarchy:
```
trails/suite_{id}/section_{id}/case_{id}/
├── ios-iphone.trail.yaml
├── ios-ipad.trail.yaml
└── android-phone.trail.yaml
```

### Execution Modes

1. **AI Mode**: LLM interprets steps, executes tools, records successful runs
2. **Recorded Mode**: Replay existing `.trail.yaml` without LLM (fast, deterministic)
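A runner that combines the two modes has to honor the step-level `recordable` flag: replay recorded tools when it can, and fall back to the LLM otherwise. A minimal Python sketch, assuming the steps have already been parsed from YAML into dicts (`ai_runner` and `tool_runner` are hypothetical callbacks):

```python
executed = []

def run_step(step: dict, ai_runner, tool_runner) -> None:
    """Replay a step's recorded tools, unless it opts out with recordable: false."""
    if step.get("recordable", True) and "recording" in step:
        for tool_call in step["recording"]["tools"]:
            (name, args), = tool_call.items()  # each entry is {toolName: {args...}}
            tool_runner(name, args)
    else:
        # No recording yet, or recordable: false -> always interpret via the LLM.
        ai_runner(step["step"])

# Steps as they would look after parsing a trail file:
steps = [
    {"step": "Navigate to settings",
     "recording": {"tools": [
         {"tapOnElementWithAccessibilityText": {"accessibilityText": "Settings"}}]}},
    {"step": "Verify the total is correct", "recordable": False},
]
for s in steps:
    run_step(s,
             ai_runner=lambda text: executed.append(("ai", text)),
             tool_runner=lambda name, args: executed.append(("tool", name)))
print(executed)
```

The first step replays deterministically; the second is routed to the LLM on every run, matching the `recordable: false` semantics described above.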

### Raw Maestro Blocks (Deprecated)

The trail format supports a `maestro:` block for raw Maestro commands:

```yaml
# Deprecated - avoid in new trails
- maestro:
- tapOn:
id: "com.example:id/button"
- assertVisible:
text: "Success"
```

**This is deprecated.** Prefer using Trailblaze tools instead:

```yaml
# Preferred - tools can be recorded and processed by the agent
- prompts:
- step: Tap the submit button and verify success
recording:
tools:
- tapOnElementWithText:
text: Submit
- assertVisible:
text: Success
```

**Principle:** Trailblaze supports a limited subset of Maestro. Every supported Maestro command should have a corresponding Trailblaze tool that:
- Can be selected by the LLM agent
- Can be recorded in trails
- Provides a consistent abstraction across platforms

Raw `maestro:` blocks bypass the agent and recording system, making them harder to maintain and migrate.

### No Conditionals in Trail Recordings

Trail recordings intentionally contain **no conditional logic or branching**. A recording is simply a list of Trailblaze tool invocations that execute sequentially.

```yaml
# This is what a recording looks like - just tool calls, no conditionals
- prompts:
- step: Navigate to settings
recording:
tools:
- tapOnElementWithAccessibilityText:
accessibilityText: Settings
- waitForElementWithText:
text: Account Settings
```

**Why no conditionals?**

1. **Simplicity**: Recordings are easy to read, review, and debug
2. **Determinism**: No runtime branching means predictable, reproducible execution
3. **Code is better for logic**: Conditional behavior belongs in custom Trailblaze tools (see [Tool Naming Convention](2026-01-14-tool-naming-convention.md) and [Custom Tool Authoring](2026-01-28-custom-tool-authoring.md))

**Where conditionals belong:**

- **Custom tools**: App-specific or platform-specific tools can contain arbitrary code, including conditionals. For example, a `myapp_ios_handleOptionalPopup` tool might check for and dismiss a popup if present.
- **Within a single natural language step**: Test authors can write conditionals in the step text for LLM interpretation (e.g., "If a popup appears, dismiss it"). However, this requires AI mode and cannot be recorded.

**What doesn't work:** Branching from one natural language step to different subsequent steps based on conditions. The step sequence in `trail.yaml` is always linear.

### Non-Goal: Code Generation

Trailblaze intentionally does **not** generate traditional test code (Playwright scripts, XCUITest, Espresso, etc.). While technically possible—recorded tool calls contain all necessary information—this is explicitly not a goal.

**Trailblaze is a runtime, not a codegen tool.**

Think of it like the difference between:
- **Java bytecode**: Runs on the JVM, not compiled to native code
- **Trail files**: Run on Trailblaze, not compiled to test scripts

The trail format is the artifact. Trailblaze interprets and executes it.

**Why not generate code?**

| Capability | Trail Runtime | Generated Code |
| :--- | :--- | :--- |
| AI Fallback | ✅ Re-derive from prompt when recording fails | ❌ Static; a failed run is just a failure |
| Self-healing | ✅ Natural language is always available for recovery | ❌ Once generated, prompt is gone |
| Visual debugging | ✅ Desktop app replays with screenshots | ❌ Stack traces and logs only |
| Edit by non-engineers | ✅ Modify natural language steps | ❌ Must edit TypeScript/Swift/Kotlin |
| Cross-platform | ✅ One prompt, multiple recordings | ❌ Separate codegen per platform |

**Positioning clarity:**

Code generation would position Trailblaze as "yet another test recorder"—competing with Playwright Codegen, Appium Inspector, Maestro Studio, etc. These tools are mature and do codegen well.

Trailblaze's value is different: **tests defined in natural language, recorded for deterministic replay, with AI fallback when recordings break**. The trail file is not an intermediate artifact to be compiled away—it's the test definition that retains its semantic meaning at runtime.

**What about exporting for debugging?**

For debugging purposes, Trailblaze could provide a "view as code" feature that shows what the equivalent Playwright/XCUITest code would look like—without actually generating runnable files. This helps developers understand what a recording does in familiar terms, while keeping the trail as the source of truth.

## What changed

**Positive:** Reproducible tests, reduced LLM costs on replay, easy debugging via readable YAML, version-controllable recordings.

**Negative:** Platform-specific recordings may diverge; recordings become stale if UI changes.
60 changes: 60 additions & 0 deletions docs/devlog/2026-01-01-maestro-integration.md
---
title: "Maestro as Current Execution Backend"
type: decision
date: 2026-01-01
---

# Maestro as Current Execution Backend

Choosing our execution backend for driving UI interactions.

## Background

Trailblaze needs to interact with mobile devices to perform UI actions (taps, swipes, text input) and query screen state. Building and maintaining these low-level device interaction implementations across multiple platforms (Android, iOS) requires significant effort and ongoing maintenance.

[Maestro](https://maestro.mobile.dev/) is an open source mobile UI testing framework that already provides robust, cross-platform device interaction capabilities with an active community.

## What we decided

**Trailblaze currently uses Maestro as its primary execution backend for device interactions, but Maestro is not an intrinsic part of the Trailblaze architecture.**

Maestro handles the majority of UI interactions, but Trailblaze also uses **ADB commands and shell commands** directly for certain device control operations. This hybrid approach gives us flexibility—Maestro for high-level UI actions, and direct device commands when lower-level control is needed.

### Why Maestro (For Now)

- **Avoids reinventing the wheel**: Maestro provides battle-tested implementations for taps, swipes, scrolls, text input, and screen queries across Android and iOS
- **Community maintenance**: We benefit from bug fixes, platform updates (new Android/iOS versions), and improvements contributed by the broader community
- **Reduced dependency surface**: Using a focused tool means we don't need to pull in larger testing framework dependencies

### Not a Permanent Coupling

Trailblaze's core value is in its LLM-driven test generation and trail recording/replay architecture—not in how device interactions are executed. We may choose to replace Maestro in the future if:

- A better-suited tool emerges
- Our requirements diverge from Maestro's direction
- We need tighter control over the execution layer

Tool implementations should remain abstracted such that swapping execution backends is feasible.
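One way to keep that boundary explicit is to have tool implementations depend on a narrow interface, with Maestro as just one implementation behind it. A minimal Python sketch; the method names are invented for illustration and are not the real Trailblaze API:

```python
from typing import Protocol

class ExecutionBackend(Protocol):
    """The narrow device-interaction surface that tools are allowed to depend on."""
    def tap(self, accessibility_text: str) -> None: ...
    def input_text(self, text: str) -> None: ...
    def view_hierarchy(self) -> str: ...

class FakeBackend:
    """Test double standing in for a Maestro- or ADB-backed implementation."""
    def __init__(self) -> None:
        self.actions: list[tuple[str, str]] = []
    def tap(self, accessibility_text: str) -> None:
        self.actions.append(("tap", accessibility_text))
    def input_text(self, text: str) -> None:
        self.actions.append(("input", text))
    def view_hierarchy(self) -> str:
        return "<hierarchy/>"

def tap_on_element_with_text(backend: ExecutionBackend, text: str) -> None:
    """A tool implementation sees only the interface, never Maestro directly."""
    backend.tap(text)

backend = FakeBackend()
tap_on_element_with_text(backend, "Submit")
print(backend.actions)  # → [('tap', 'Submit')]
```

Swapping execution backends then means providing another implementation of the interface, without touching tool code.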

### On-Device Orchestra Fork

Maestro's standard architecture assumes a host machine driving a connected device. For Trailblaze's on-device execution mode, we maintain **a copy of Maestro's Orchestra code** in our codebase.

This is necessary because:
- Maestro's base implementation doesn't work when running directly on the device
- Pulling in the full Maestro dependency would bring unnecessary transitive dependencies
- We need a minimal, self-contained implementation for the on-device use case

**Maintenance requirement**: When upgrading Maestro versions, the Orchestra copy must be reviewed and updated to incorporate relevant changes while preserving on-device compatibility.

## What changed

**Positive:**
- Faster time-to-market by leveraging existing device interaction code
- Benefit from community improvements without maintaining low-level platform code
- Clear abstraction boundary makes future migration possible

**Negative:**
- Dependent on external project's stability and direction
- Orchestra fork requires manual sync during Maestro upgrades
- Must track Maestro releases for security patches and compatibility updates