68 changes: 68 additions & 0 deletions .claude/skills/devlog/SKILL.md
---
name: devlog
description: Write or update a devlog entry in the devlog directory. Use when the user asks to write a devlog, record a decision, document what happened, or says "write up what we did".
---

# Devlog

Write or update a devlog entry in the devlog directory.

## Devlog Format

Devlog entries are development journal posts that capture decisions, discoveries, and plans as work happens. They're written for the team — concise, honest, and useful for future reference.

**Filename:** `YYYY-MM-DD-<topic-slug>.md`

Use today's date and a short kebab-case topic slug.

## Entry Structure

```markdown
---
title: "Short Descriptive Title"
type: devlog
date: YYYY-MM-DD
---

# Title

## Summary
1-3 sentences on what this entry covers.

## <Sections as needed>
Use whatever sections make sense for the content. Common ones:
- What Changed
- Key Decisions (and rationale)
- What We Learned
- Open Questions
- Future Work

Keep it direct. No filler. Write like you're explaining to a teammate who will read this in 3 months.
```

## Front Matter Fields

| Field | Required | Values |
| :--- | :--- | :--- |
| `title` | Yes | Short descriptive title |
| `type` | Yes | `decision` (architectural/technical choice) or `devlog` (development note) |
| `date` | Yes | `YYYY-MM-DD` format |

Use `type: decision` when recording a significant architectural or technical choice. Use `type: devlog` for development notes, debugging sessions, and implementation details.

## Guidelines

- **Be opinionated.** Capture *why* decisions were made, not just what happened.
- **Include the dead ends.** What didn't work and why is often more valuable than what did.
- **Link to context.** Reference PRs, branches, test names, file paths — make it traceable.
- **One entry per topic.** Don't combine unrelated work. Multiple entries on the same day are fine.

## Before Writing

1. Check existing devlog entries to avoid duplicating a topic
2. If updating an existing topic, consider appending to the existing entry rather than creating a new one
3. Review the current conversation context for decisions, discoveries, and rationale worth capturing

## Invocation

When the user says `/devlog`, ask what topic to write about if it's not clear from context. If you've been working on something substantial in the current session, suggest writing about that.
64 changes: 64 additions & 0 deletions docs/devlog/2025-10-01-llm-as-compiler.md
---
title: "LLM as Compiler Architecture"
type: decision
date: 2025-10-01
---

# LLM as Compiler Architecture

The core architectural insight behind Trailblaze — treating the LLM as a compiler rather than a chatbot.

## Background

Traditional UI test frameworks require developers to write explicit, imperative test code. We want to enable natural language test authoring while maintaining deterministic execution.

## What we decided

Trailblaze treats the **LLM as a compiler** that transforms natural language test cases into deterministic tool sequences.

### The Compiler Metaphor

```
Natural Language  →  LLM + Agent + Tools  →  Trail Recording
    (Source)             (Compiler)            (Output/IR)
```

| Concept | Traditional Compiler | Trailblaze |
| :--- | :--- | :--- |
| Source | Code (.c, .kt) | Natural language test steps |
| Compiler | gcc, kotlinc | LLM + Trailblaze Agent |
| IR/Output | Assembly, bytecode | Trail YAML (tool sequence) |
| Runtime | CPU, JVM | Device + Maestro/Tools |

### Compilation Flow

```
Test Case Steps  →  LLM interprets steps  →  Execute tools on device
       ↓                     ↓                        ↓
Natural Language    Agent orchestration        Success/Failure
                             ↓                        ↓
                    On failure: retry        Record successful run
                    with context             as .trail.yaml
```

### Key Properties

- **Compilation happens once**: First successful run is recorded
- **Replay is deterministic**: Subsequent runs use recording, no LLM needed
- **Self-healing on failure**: LLM can adapt and retry when UI changes
- **Recompilation on demand**: Force AI mode to generate new recording

### Agent Loop

1. LLM receives test step + current screen state
2. LLM selects and invokes tools
3. Tools execute via Maestro/device drivers
4. On success → record tool invocation
5. On failure → provide error context, retry
6. After all steps → save complete `.trail.yaml`
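The loop above can be sketched in code. This is an illustrative Python sketch, not the real Trailblaze agent: `llm`, `execute_tool`, and the retry policy are hypothetical stand-ins, and screen-state capture is omitted for brevity.

```python
def compile_step(step: str, llm, execute_tool, max_retries: int = 3) -> list:
    """Run one natural-language step through the LLM, returning recorded tool calls."""
    error_context = None
    for _ in range(max_retries):
        recorded = []
        # 1-2: the LLM sees the step (plus any prior error) and selects tool invocations.
        for name, args in llm(step, error_context):
            ok, error = execute_tool(name, args)  # 3: execute via device driver
            if not ok:
                error_context = error             # 5: keep error context for the retry
                break
            recorded.append({name: args})         # 4: record the successful invocation
        else:
            return recorded                       # every tool in this step succeeded
    raise RuntimeError(f"Step failed after {max_retries} attempts: {step}")
```

After all steps succeed, the accumulated recordings would be written out as the `.trail.yaml` file (step 6).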

## What changed

**Positive:** Natural language authoring, deterministic replay, self-healing capability, familiar mental model for engineers.

**Negative:** Initial "compilation" requires LLM (cost/latency); recordings may need "recompilation" when UI changes significantly.
174 changes: 174 additions & 0 deletions docs/devlog/2025-10-01-trail-recording-format.md
---
title: "Trail Recording Format (YAML)"
type: decision
date: 2025-10-01
---

# Trail Recording Format (YAML)

Building on our monorepo structure, we needed a format for recording UI test interactions.

## Background

Trailblaze uses an LLM to interpret natural language test steps and execute them. We need a way to capture successful executions as **deterministic recordings** that can replay without LLM involvement, ensuring consistency and reducing costs.

## What we decided

Trail recordings use a **YAML format** (`.trail.yaml`) that captures the mapping from natural language steps to tool invocations.

### Format Structure

```yaml
- prompts:
- step: Launch the app signed in with user@example.com
recording:
tools:
- app_ios_launchAppSignedIn:
email: user@example.com
password: "12345678"
- step: Add a pizza to the cart and click 'Review sale'
recording:
tools:
- scrollUntilTextIsVisible:
text: Pizza
direction: DOWN
- tapOnElementWithAccessibilityText:
accessibilityText: Pizza
- tapOnElementWithAccessibilityText:
accessibilityText: Review sale 1 item
- step: Verify the total is correct
recordable: false # Always uses AI, never replays from recording
```

### Step-Level Recordability

Each step has a `recordable` flag (default: `true`):
- **`recordable: true`**: Step can be recorded and replayed deterministically
- **`recordable: false`**: Step always requires AI interpretation, even in recorded mode

Use `recordable: false` for steps that need dynamic behavior (e.g., verification steps that should re-evaluate on each run).

> **Note:** This is separate from tool-level `isRecordable` (see [Tool Execution Modes](2026-01-01-tool-execution-modes.md)).

### Key Properties

- **Human-readable**: YAML is easy to inspect, edit, and version control
- **Deterministic**: Recordings replay exactly the same tool sequence
- **Step-aligned**: Each natural language step maps to its tool invocations
- **Platform-specific**: Trails are stored per platform/device (e.g., `ios-iphone.trail.yaml`)

### Storage Convention

Trails are organized by test case hierarchy:
```
trails/suite_{id}/section_{id}/case_{id}/
├── ios-iphone.trail.yaml
├── ios-ipad.trail.yaml
└── android-phone.trail.yaml
```

### Execution Modes

1. **AI Mode**: LLM interprets steps, executes tools, records successful runs
2. **Recorded Mode**: Replay existing `.trail.yaml` without LLM (fast, deterministic)
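A runner that combines the two modes has to honor the step-level `recordable` flag: replay recorded tools when it can, and fall back to the LLM otherwise. A minimal Python sketch, assuming the steps have already been parsed from YAML into dicts (`ai_runner` and `tool_runner` are hypothetical callbacks):

```python
executed = []

def run_step(step: dict, ai_runner, tool_runner) -> None:
    """Replay a step's recorded tools, unless it opts out with recordable: false."""
    if step.get("recordable", True) and "recording" in step:
        for tool_call in step["recording"]["tools"]:
            (name, args), = tool_call.items()  # each entry is {toolName: {args...}}
            tool_runner(name, args)
    else:
        # No recording yet, or recordable: false -> always interpret via the LLM.
        ai_runner(step["step"])

# Steps as they would look after parsing a trail file:
steps = [
    {"step": "Navigate to settings",
     "recording": {"tools": [
         {"tapOnElementWithAccessibilityText": {"accessibilityText": "Settings"}}]}},
    {"step": "Verify the total is correct", "recordable": False},
]
for s in steps:
    run_step(s,
             ai_runner=lambda text: executed.append(("ai", text)),
             tool_runner=lambda name, args: executed.append(("tool", name)))
print(executed)
```

The first step replays deterministically; the second is routed to the LLM on every run, matching the `recordable: false` semantics described above.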

### Raw Maestro Blocks (Deprecated)

The trail format supports a `maestro:` block for raw Maestro commands:

```yaml
# Deprecated - avoid in new trails
- maestro:
- tapOn:
id: "com.example:id/button"
- assertVisible:
text: "Success"
```

**This is deprecated.** Prefer using Trailblaze tools instead:

```yaml
# Preferred - tools can be recorded and processed by the agent
- prompts:
- step: Tap the submit button and verify success
recording:
tools:
- tapOnElementWithText:
text: Submit
- assertVisible:
text: Success
```

**Principle:** Trailblaze supports a limited subset of Maestro. Every supported Maestro command should have a corresponding Trailblaze tool that:
- Can be selected by the LLM agent
- Can be recorded in trails
- Provides a consistent abstraction across platforms

Raw `maestro:` blocks bypass the agent and recording system, making them harder to maintain and migrate.

### No Conditionals in Trail Recordings

Trail recordings intentionally contain **no conditional logic or branching**. A recording is simply a list of Trailblaze tool invocations that execute sequentially.

```yaml
# This is what a recording looks like - just tool calls, no conditionals
- prompts:
- step: Navigate to settings
recording:
tools:
- tapOnElementWithAccessibilityText:
accessibilityText: Settings
- waitForElementWithText:
text: Account Settings
```

**Why no conditionals?**

1. **Simplicity**: Recordings are easy to read, review, and debug
2. **Determinism**: No runtime branching means predictable, reproducible execution
3. **Code is better for logic**: Conditional behavior belongs in custom Trailblaze tools (see [Tool Naming Convention](2026-01-14-tool-naming-convention.md) and [Custom Tool Authoring](2026-01-28-custom-tool-authoring.md))

**Where conditionals belong:**

- **Custom tools**: App-specific or platform-specific tools can contain arbitrary code, including conditionals. For example, a `myapp_ios_handleOptionalPopup` tool might check for and dismiss a popup if present.
- **Within a single natural language step**: Test authors can write conditionals in the step text for LLM interpretation (e.g., "If a popup appears, dismiss it"). However, this requires AI mode and cannot be recorded.

**What doesn't work:** Branching from one natural language step to different subsequent steps based on conditions. The step sequence in `trail.yaml` is always linear.

### Non-Goal: Code Generation

Trailblaze intentionally does **not** generate traditional test code (Playwright scripts, XCUITest, Espresso, etc.). While technically possible—recorded tool calls contain all necessary information—this is explicitly not a goal.

**Trailblaze is a runtime, not a codegen tool.**

Think of it like the difference between:
- **Java bytecode**: Runs on the JVM, not compiled to native code
- **Trail files**: Run on Trailblaze, not compiled to test scripts

The trail format is the artifact. Trailblaze interprets and executes it.

**Why not generate code?**

| Capability | Trail Runtime | Generated Code |
| :--- | :--- | :--- |
| AI Fallback | ✅ Re-derive from prompt when recording fails | ❌ Static; a failed run is just a failure |
| Self-healing | ✅ Natural language is always available for recovery | ❌ Once generated, prompt is gone |
| Visual debugging | ✅ Desktop app replays with screenshots | ❌ Stack traces and logs only |
| Edit by non-engineers | ✅ Modify natural language steps | ❌ Must edit TypeScript/Swift/Kotlin |
| Cross-platform | ✅ One prompt, multiple recordings | ❌ Separate codegen per platform |

**Positioning clarity:**

Code generation would position Trailblaze as "yet another test recorder"—competing with Playwright Codegen, Appium Inspector, Maestro Studio, etc. These tools are mature and do codegen well.

Trailblaze's value is different: **tests defined in natural language, recorded for deterministic replay, with AI fallback when recordings break**. The trail file is not an intermediate artifact to be compiled away—it's the test definition that retains its semantic meaning at runtime.

**What about exporting for debugging?**

For debugging purposes, Trailblaze could provide a "view as code" feature that shows what the equivalent Playwright/XCUITest code would look like—without actually generating runnable files. This helps developers understand what a recording does in familiar terms, while keeping the trail as the source of truth.

## What changed

**Positive:** Reproducible tests, reduced LLM costs on replay, easy debugging via readable YAML, version-controllable recordings.

**Negative:** Platform-specific recordings may diverge; recordings become stale if UI changes.
60 changes: 60 additions & 0 deletions docs/devlog/2026-01-01-maestro-integration.md
---
title: "Maestro as Current Execution Backend"
type: decision
date: 2026-01-01
---

# Maestro as Current Execution Backend

Choosing our execution backend for driving UI interactions.

## Background

Trailblaze needs to interact with mobile devices to perform UI actions (taps, swipes, text input) and query screen state. Building and maintaining these low-level device interaction implementations across multiple platforms (Android, iOS) requires significant effort and ongoing maintenance.

[Maestro](https://maestro.mobile.dev/) is an open source mobile UI testing framework that already provides robust, cross-platform device interaction capabilities with an active community.

## What we decided

**Trailblaze currently uses Maestro as its primary execution backend for device interactions, but Maestro is not an intrinsic part of the Trailblaze architecture.**

Maestro handles the majority of UI interactions, but Trailblaze also uses **ADB commands and shell commands** directly for certain device control operations. This hybrid approach gives us flexibility—Maestro for high-level UI actions, and direct device commands when lower-level control is needed.

### Why Maestro (For Now)

- **Avoids reinventing the wheel**: Maestro provides battle-tested implementations for taps, swipes, scrolls, text input, and screen queries across Android and iOS
- **Community maintenance**: We benefit from bug fixes, platform updates (new Android/iOS versions), and improvements contributed by the broader community
- **Reduced dependency surface**: Using a focused tool means we don't need to pull in larger testing framework dependencies

### Not a Permanent Coupling

Trailblaze's core value is in its LLM-driven test generation and trail recording/replay architecture—not in how device interactions are executed. We may choose to replace Maestro in the future if:

- A better-suited tool emerges
- Our requirements diverge from Maestro's direction
- We need tighter control over the execution layer

Tool implementations should remain abstracted such that swapping execution backends is feasible.
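One way to keep that boundary explicit is to have tool implementations depend on a narrow interface, with Maestro as just one implementation behind it. A minimal Python sketch; the method names are invented for illustration and are not the real Trailblaze API:

```python
from typing import Protocol

class ExecutionBackend(Protocol):
    """The narrow device-interaction surface that tools are allowed to depend on."""
    def tap(self, accessibility_text: str) -> None: ...
    def input_text(self, text: str) -> None: ...
    def view_hierarchy(self) -> str: ...

class FakeBackend:
    """Test double standing in for a Maestro- or ADB-backed implementation."""
    def __init__(self) -> None:
        self.actions: list[tuple[str, str]] = []
    def tap(self, accessibility_text: str) -> None:
        self.actions.append(("tap", accessibility_text))
    def input_text(self, text: str) -> None:
        self.actions.append(("input", text))
    def view_hierarchy(self) -> str:
        return "<hierarchy/>"

def tap_on_element_with_text(backend: ExecutionBackend, text: str) -> None:
    """A tool implementation sees only the interface, never Maestro directly."""
    backend.tap(text)

backend = FakeBackend()
tap_on_element_with_text(backend, "Submit")
print(backend.actions)  # → [('tap', 'Submit')]
```

Swapping execution backends then means providing another implementation of the interface, without touching tool code.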

### On-Device Orchestra Fork

Maestro's standard architecture assumes a host machine driving a connected device. For Trailblaze's on-device execution mode, we maintain **a copy of Maestro's Orchestra code** in our codebase.

This is necessary because:
- Maestro's base implementation doesn't work when running directly on the device
- Pulling in the full Maestro dependency would bring unnecessary transitive dependencies
- We need a minimal, self-contained implementation for the on-device use case

**Maintenance requirement**: When upgrading Maestro versions, the Orchestra copy must be reviewed and updated to incorporate relevant changes while preserving on-device compatibility.

## What changed

**Positive:**
- Faster time-to-market by leveraging existing device interaction code
- Benefit from community improvements without maintaining low-level platform code
- Clear abstraction boundary makes future migration possible

**Negative:**
- Dependent on external project's stability and direction
- Orchestra fork requires manual sync during Maestro upgrades
- Must track Maestro releases for security patches and compatibility updates