diff --git a/skills/audit-agent-experience/LICENSE.txt b/skills/audit-agent-experience/LICENSE.txt new file mode 100644 index 0000000..f2f4397 --- /dev/null +++ b/skills/audit-agent-experience/LICENSE.txt @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2026 Browserbase, Inc. + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/skills/audit-agent-experience/SKILL.md b/skills/audit-agent-experience/SKILL.md new file mode 100644 index 0000000..bcaf0a9 --- /dev/null +++ b/skills/audit-agent-experience/SKILL.md @@ -0,0 +1,444 @@ +--- +name: audit-agent-experience +description: "Audit the developer experience of a product, SDK, docs site, or SKILL.md by dropping multiple Claude subagents at it with only a tiny task prompt and real tools (WebFetch, Bash, Write). Agents must discover the docs themselves, install deps, ask for credentials if needed, and attempt real execution. The skill captures each agent's trace — tool calls, retries, wall time, errors — and scores on Setup Friction, Speed, Efficiency, Error Recovery, and Doc Quality, then emits an HTML report with an A–F grade and concrete fixes. Use when the user asks to audit agent experience, test a skill, audit docs for agents, check if a SDK is agent-friendly, validate a SKILL.md, measure agent DX, or benchmark how painful onboarding is for an AI agent. Triggers: 'audit agent experience', 'test this skill', 'audit docs for agents', 'is my SDK agent-friendly', 'run a DX audit', 'agent experience test', 'test my docs', 'how do agents do with my product'." +license: MIT +metadata: + author: jay + version: "1.4.0" +allowed-tools: Read WebFetch Write Bash AskUserQuestion Agent +--- + +# Audit Agent Experience + +Evaluate how well a product/SDK/docs surface works when an AI agent actually tries to onboard and do a realistic task — **starting from a short one-sentence prompt**, with nothing pasted in. The agent must find the docs, install what it needs, and attempt real work. That's the only honest test of agent DX. + +The skill spawns multiple subagents in parallel, captures each one's tool-call trace, and scores the experience using the same dimensions as the Skill Test Arena dashboard: Setup Friction, Speed, Efficiency, Error Recovery, Doc Quality. + +## Core principle + +**Do not spoonfeed.** The subagent gets a tiny prompt like *"Get started with {product} and {do its primary thing}"*. It must discover the docs, choose the path, and hit real failures. A good doc survives this; a bad doc does not. + +## Workflow + +Execute these steps in order. 
Do not skip ahead. + +### Step 1 — Identify the target and define the abstract goal + +Resolve what the user is asking to audit. The target may arrive in one of three forms: + +- **URL** — a docs site or product page (e.g., `https://docs.example.com`). This is the *seed* the subagents start from. +- **Repo / file path** — for SKILL.md audits or SDK repos. +- **Product name** — if the user is vague ("test my product"), ask via `AskUserQuestion` for the URL or repo. + +**This skill is product-agnostic. Never assume what the user wants to audit.** Do not infer a target from environment signals (operator's email domain, git remote, repo name, recent files, memory, CLAUDE.md). Even if context strongly suggests a particular company, the user-facing question must NOT pre-fill or default to any specific product, URL, or company name. Ask open-endedly with neutral options only: e.g., "Paste a URL", "Paste a local path", "Type a product name". If the user did not name a target in their invocation, ask them — start fresh, no priors. + +**Research lightly** *after* the user has named a target. 1 WebFetch max, enough to confirm: what is this product, and does it have a getting-started guide? You're identifying *that there is a flow to follow*, not extracting the steps. The whole point is to let the docs dictate the path. + +**Define ONE abstract goal, not a step-by-step checklist.** The goal should be at the level of "complete the onboarding" or "make the product do its primary thing once" — NOT a list of specific actions. + +Why: prescriptive checklists steer agents. If you tell them "navigate to example.com" but the docs' quickstart navigates to a different URL, the agent is torn between your instruction and the docs. That pollutes the test. + +Examples of good abstract goals (the target product is supplied by the user — the examples below are illustrative only, not defaults): +- A search API → *"Complete the getting-started guide. Success = your code successfully calls the API and prints whatever the docs treat as a meaningful result."* +- A payments API → *"Complete the getting-started flow for making a test charge. Success = you have a charge ID or equivalent confirmation."* +- A browser-automation SDK → *"Complete the getting-started guide end-to-end. Success = you have code that runs a cloud browser session using whatever approach the docs recommend."* +- A SKILL.md → *"Follow the skill's instructions and produce a successful outcome for its advertised job."* + +Examples of BAD goals (too prescriptive — don't do this): +- ~~"Navigate to https://example.com"~~ (steers — the docs may pick a different URL) +- ~~"Use Playwright"~~ (the docs may recommend Stagehand or Selenium) +- ~~"Print the page title"~~ (the docs may print session ID, response body, anything) + +The subagent will self-report against the abstract goal: *did I complete the onboarding as the docs described?* (yes / no / partial). The concrete sub-outcomes the agent *actually achieved* live in their trace under `primary_outcome_achieved`, not in a pre-defined checklist. + +If the target has no clear getting-started flow (rare — even a README is a flow), ask the user what "done" means before continuing. + +### Step 2 — Gather audit config via AskUserQuestion + +Use `AskUserQuestion` in a **single call with 4 questions**. Options: max 4 per question. + +1. **Test depth** (single-select, header: `"Depth"`): + - `5 agents (Recommended)` — balanced coverage + - `3 agents` — quick sanity check + - `10 agents` — thorough, higher cost + +2. 
**Programming languages** (multiSelect, header: `"Languages"`): pick up to 4 — `Python`, `TypeScript`, `Go`, `Shell/Bash` (let user deselect). + +3. **Personas** (multiSelect, header: `"Personas"`): + - `Standard (Recommended)` — neutral baseline, no behavioral flavoring. Just "do the task." Best for unbiased measurement. + - `Pragmatic` — just get it working, fastest path + - `Thorough` — read the docs end-to-end before coding + - `Skeptical` — verify claims the docs make + +4. **Execution mode** (single-select, header: `"Exec mode"`): + - `Allow Bash (Recommended)` — subagents can run `npm install`, `curl`, etc. on your machine. Most realistic. + - `Draft-only` — subagents may fetch docs and write code but won't execute anything. Safer. + +After the user answers, gather one more question about model choice: + +5. **Model** (single-select, header: `"Model"`): + - `Sonnet (Recommended)` — balanced cost/quality, default for most audits + - `Opus` — strongest reasoning, highest cost; good for dense/ambiguous docs + - `Haiku` — cheapest, fastest; good for checking if docs are agent-friendly to smaller models + - `Mixed comparison` — split agents across Opus + Sonnet + Haiku so you can see how doc quality varies by model size. Useful for "are my docs robust even to weaker models?" + +Pass the chosen model to each `Agent` invocation via the `model` parameter. If `Mixed`, distribute N agents roughly equally across the 3 models (round-robin by slot index) and record which model each agent used in the trace + report. + +After the user answers, you have: `depth` (N), `languages[]`, `personas[]`, `exec_mode`, `model`. + +**If `exec_mode = "Allow Bash"`**, follow up with a second AskUserQuestion asking about credentials: + +- **Credentials** (single-select, header: `"Credentials"`): + - `Auto-discover (Recommended)` — skill checks the user's env vars, common dotfiles, and credential managers; only prompts for paste if nothing found. Best for repeat use and for cases where another operator is running the audit. + - `None — let agents block (friction test)` — agents hit the credential wall, counts as Setup Friction. Best for pure docs audits. + - `Paste manually` — you paste keys directly; skill injects them. Use when you don't have keys stored locally yet. + +If user picks `Auto-discover`, run **Step 2.5** below before continuing. If `Paste manually` (or auto-discover falls back), AskUserQuestion asks for the credential **values** — not the names. The skill then writes them to each workspace `.env` using **generic, product-agnostic names**: + +- Primary credential → `API_KEY` +- Secondary (e.g. project/org ID) → `PROJECT_ID` +- Third (e.g. webhook secret) → `SECRET` + +**Do NOT use product-specific names** like `BROWSERBASE_API_KEY`, `EXA_API_KEY`, `STRIPE_SECRET_KEY`. Those names steer the agent — they see `BROWSERBASE_API_KEY` in env and skip ever reading the docs to find out what env var the SDK actually expects. The generic name forces them to: + +1. Read the docs to discover the product's actual env var name (e.g. `BROWSERBASE_API_KEY`). +2. Map the generic `API_KEY` value into whatever form the SDK requires — either re-export (`export BROWSERBASE_API_KEY=$API_KEY`) or pass inline in code (`new Browserbase({ apiKey: process.env.API_KEY })`). + +If an agent fails to figure out the mapping, that's a doc quality signal — the docs weren't clear about credential naming. + +### Step 2.5 — Credential auto-discovery (only if user picked `Auto-discover`) + +Run a tiered lookup. 
**Stop at the first tier that produces a usable candidate.** Never print credential values to chat — only names and source paths. The user picks by name; the skill internally maps name → value → workspace `.env`. + +**Derive the product slug** from the target URL/repo to bias toward relevant matches. e.g. `https://docs.browserbase.com` → slug `browserbase`. Use lowercase substring match (case-insensitive) when ranking candidates. + +**Tier 1 — Already-exported env vars (free, zero side effects):** + +```bash +printenv | grep -iE '^[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' | cut -d= -f1 +``` + +This returns names only. If any names contain the product slug, those are top candidates. + +**Tier 2 — Narrow dotfile scan (a hardcoded short list, NOT a recursive grep):** + +```bash +grep -hE '^[[:space:]]*export[[:space:]]+[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' \ + ~/.zshrc ~/.bashrc ~/.bash_profile ~/.zprofile ~/.env ./.env ./.envrc 2>/dev/null \ + | sed -E 's/^[[:space:]]*export[[:space:]]+([A-Z0-9_]+)=.*/\1/' \ + | sort -u +``` + +Files allowed: `~/.zshrc`, `~/.bashrc`, `~/.bash_profile`, `~/.zprofile`, `~/.env`, `./.env`, `./.envrc`. **Do NOT expand this list. Do NOT recurse. Do NOT scan `~/Library`, `~/.config/`, `~/Documents`, etc.** This is the entire allowlist; anything else is out of scope and risks leaking unrelated secrets. + +For each match, record `(NAME, source_path)`. Read the value lazily — only when the user has confirmed the choice — by re-grepping the specific source file for that exact name. + +**Tier 3 — Credential manager (only if `op` or `security` is on PATH AND tiers 1–2 had no good match):** + +- 1Password CLI: skip unless `op account list` exits 0 (i.e. user is signed in). Don't trigger an interactive auth flow inside the skill. +- macOS Keychain: `security find-generic-password -l "<NAME>" -w` — try once with the most likely name (e.g. `BROWSERBASE_API_KEY`); silent failure means not stored. + +If a credential manager produces hits, list them as candidates the same way as tiers 1–2. + +**Tier 4 — Fallback to paste:** If all tiers above produced zero candidates, fall through to the manual paste flow described in Step 2. + +**Presenting candidates to the user.** After tiers 1–3: + +- **If exactly 1 candidate** and its name contains the product slug → use it silently. Log a one-line confirmation in chat: `Using BROWSERBASE_API_KEY from ~/.zshrc.` (Name + source only — never the value.) +- **If multiple candidates**, AskUserQuestion (single-select, header: `"Use which credential?"`) with up to 4 options: + - One option per top candidate, formatted `<NAME> (from <source_path>)` + - Plus a `Paste manually instead` escape hatch + - If >3 candidates, show the top 3 by slug-relevance and add a `Show all` option that re-asks with the rest. +- **If no candidates** → fall through to Tier 4 (paste). + +**Reading the value.** Once the user has confirmed a choice (or it was auto-selected), read the value: +- Tier 1: `printenv <NAME>` (capture stdout, do not echo). +- Tier 2: re-grep the specific source file for the exact `export <NAME>=` line and parse the RHS, stripping surrounding quotes. +- Tier 3: `op read "op://<vault>/<item>/<field>"` or `security find-generic-password -l <NAME> -w`. + +Write the value into per-agent workspace `.env` files using the same generic names (`API_KEY`, `PROJECT_ID`, `SECRET`) as the paste flow — see Step 2. The discovery layer is upstream of injection; downstream behavior (generic names, agent must read docs to map them) is unchanged.
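+
+For concreteness, a minimal injection sketch. The `$CONFIRMED_NAME` variable, the Tier 1 read, and the `./dx-audit-tmp/agent-<n>/` layout are illustrative assumptions, not requirements of the skill:
+
+```bash
+# Minimal sketch of the injection step (illustrative names: the confirmed candidate
+# is assumed to be in $CONFIRMED_NAME, and workspaces live under ./dx-audit-tmp/agent-<n>/).
+CONFIRMED_NAME="BROWSERBASE_API_KEY"              # example Tier 1 candidate
+CONFIRMED_VALUE="$(printenv "$CONFIRMED_NAME")"   # read once; never print it
+
+N=5                                               # agent count from Step 2
+for i in $(seq 1 "$N"); do
+  dir="./dx-audit-tmp/agent-$i"
+  mkdir -p "$dir"
+  # Generic, product-agnostic name only; the subagent maps it via the docs.
+  printf 'API_KEY=%s\n' "$CONFIRMED_VALUE" > "$dir/.env"
+  chmod 600 "$dir/.env"
+done
+```
+
+The value goes straight from `printenv` into each workspace `.env`; nothing is echoed to chat, matching the privacy guarantees below.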
+ +**Orchestrator-retained credentials.** After writing per-agent `.env` files, the orchestrator keeps the **original product-specific names → values** (e.g. `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID`) available to itself for downstream verification work in Steps 6 / 6.5 / 8 — for example, calling the product's API with `curl` to confirm that a session ID an agent reported actually resolves, or fetching session metadata to enrich the report. The orchestrator can read them with `printenv` (no need to store anywhere — the parent shell already has them since auto-discover sourced them from there). + +This is asymmetric on purpose: the subagents see only generic `API_KEY` / `PROJECT_ID` / `SECRET` so the doc-quality test stays honest (they must read the docs to discover the real var name). The orchestrator is not being audited, so it can use the real names freely for verification. + +**Privacy guarantees the skill must uphold:** +- Never write a credential value to chat output, the trace, the report, or any file outside the per-agent workspace `.env`. +- Never re-export the value into a **subagent's** workspace under a product-specific name. Subagents only see the generic names. +- Treat values as opaque strings — do not log length, prefix, or fingerprint. +- The HTML report records that auto-discovery happened (and which name was used) but never the value. + +### Step 3 — Safety check + +If `exec_mode = "Allow Bash"`, print a brief warning to chat before spawning: *"Agents may run real shell commands (npm install, curl, pip install, git clone) on this machine. Make sure you're in a directory you're okay with agents modifying. Continue in 5 seconds or Ctrl-C to abort."* — then continue. + +Do not run `sleep` — just proceed after printing. The user reads the warning before the agents start working. + +### Step 4 — Generate tiny prompts (no checklist) + +For each of N variants, produce a `(persona, language, prompt)` tuple. The prompt is **one or two sentences**, stating the abstract goal + language. **No sub-checklist, no prescriptive steps.** + +Template: + +``` +{persona_prefix} {product}'s getting-started guide using {language}. You've completed it when you've done whatever the guide treats as the primary successful outcome. +``` + +Examples (using `Acme` as a placeholder — substitute the user-supplied product name): +- Pragmatic × TypeScript → *"Get started with Acme using TypeScript (Node.js). Complete whatever the getting-started guide considers a successful first run."* +- Thorough × Python → *"Read Acme's getting-started guide and follow it end-to-end using Python. You're done when the guide's expected outcome is achieved."* +- Skeptical × Shell → *"Figure out how to complete Acme's getting-started flow using bash/curl. Note anything in the docs that seems wrong as you go."* + +The subagent is NOT told what the success outcome is — they have to read the docs to figure that out. That's the point: if the docs are good, they'll convey it clearly. If the docs are bad, the agent won't know when they're done, which IS a finding. + +Read `references/prompt-variants.md` for the persona prefix library. Cross-product personas × languages, truncate to N. If cells < N, repeat with slight wording variation on the prefix. + +Never paste doc content into the prompt. + +### Step 5 — Spawn N subagents in parallel + +Read `references/subagent-brief.md` — the full brief each subagent receives. 
It tells them: +- You are a real developer doing a real task +- Use your real tools (`WebFetch`, `Bash` if allowed, `Write`) +- If you need credentials, ask the user via a clear stop-and-ask message (the skill captures this as friction) +- Return a structured trace at the end with tool calls, errors, timing estimates, completion status + +For each variant, invoke the `Agent` tool (subagent_type: `general-purpose`). Pass `model: "opus" | "sonnet" | "haiku"` per the user's choice. For `Mixed`, rotate models across the N slots deterministically (agent 1 → opus, agent 2 → sonnet, agent 3 → haiku, agent 4 → opus, …) and record the assigned model in the per-agent report row. + +All N calls in **one message** so they run in parallel. + +The subagent's prompt = the brief + their tiny task. The brief passes through `exec_mode` so the subagent knows whether Bash is available. + +**Wait for all N agents to return before continuing to Step 6.** When agents are run in the background, completion notifications arrive one at a time and it is easy to lose count. Maintain a simple in-memory tally of returned-vs-spawned and, when the last agent reports back, print one explicit milestone line to chat: *"All N agents returned — moving to trace parsing."* Do not start Step 6 until that line has been printed. If the user asks "are the agents still running?" mid-flight, answer with the current `/` count from your tally, not from re-counting prior chat output. + +**Verification of agent claims using orchestrator credentials.** Before scoring, if Step 2.5 retained product-specific credentials, the orchestrator may use them to spot-check claims that subagents made (e.g. confirming a session ID with `curl -H "X-BB-API-Key: $BROWSERBASE_API_KEY" https://api.browserbase.com/v1/sessions/`). Treat any unresolved IDs as evidence the agent may have hallucinated. Never include the credential header in the report — only the verification result (resolved / not resolved). + +### Step 6 — Parse structured traces AND keep the full prose + +Each subagent returns two things in one response: +1. A fenced JSON trace at the end (structured self-report). +2. All the prose before it — reasoning, tool output, and what the agent actually did. + +**Retain both.** Do not throw the prose away after extracting JSON. The prose is where you catch things the JSON self-report misses. + +Extract JSON using: `/```json\s*(\{[\s\S]*?\})\s*```\s*$/`. Mark malformed/missing as `errored` with a `raw_tail`. If >50% errored, warn and offer retry. + +Compute the top-line numbers from the JSON: +- **Onboarding success rate** = fraction of agents with `onboarding_status = "completed"`. +- **Docs-promise-match rate** = fraction of agents with `docs_promise_met = true`. + +### Step 6.25 — Annotate URL provenance per-WebFetch (inline in trace) + +**Subagents don't have search** — they guess URLs from training-data priors. Reports must show *per WebFetch call* where the URL came from, rendered as a small muted line directly under the tool input block in the trace. Do NOT put this at the top of the report as a general callout — it's only useful inline where the reader can correlate it to the specific call. + +Classify each `WebFetch` URL into one of four provenance categories and render with the matching label + color: + +- **`TRAINING PRIOR`** (violet) — URL is a guess from training data (product name + common doc-site conventions like `/introduction`, `/quickstart`, `/sdk/{lang}`). Typical for the first 1–2 WebFetch calls. 
+- **`FROM LLMS.TXT`** (blue) — URL appears in the output of a prior `llms.txt` fetch in the same trace. +- **`FROM PREV PAGE`** (green) — URL was listed in the output of a previous WebFetch or Bash tool call in the same trace. +- **`GUESS · 404`** (amber) — URL was guessed but 404'd — this is the most interesting category for doc-quality scoring (the URL *should* exist by convention but doesn't). + +Classification heuristic: +1. If the same trace earlier contained a successful `llms.txt` WebFetch whose output mentioned this URL → `FROM LLMS.TXT` +2. Else if the same trace earlier contained any WebFetch/Bash output that mentioned this exact URL → `FROM PREV PAGE` +3. Else if the subsequent tool_result has `err: true` with 404 content → `GUESS · 404` +4. Else → `TRAINING PRIOR` + +Score interpretation: +- **Lots of `TRAINING PRIOR` that succeed** = product is well-represented in training data (head start). +- **Lots of `GUESS · 404`** = URL taxonomy drifts from common conventions → real doc-discoverability finding. +- **`FROM LLMS.TXT` appearing often after `GUESS · 404`** = `llms.txt` is carrying the docs' discoverability. Credit it explicitly in the findings. + +### Step 6.5 — Narrative cross-agent review (CRITICAL) + +Before scoring, re-read the **full prose** from every agent. The JSON trace is the agent's self-report — an agent that hallucinated a wrong package name will also describe it correctly in its own trace. The truth lives in the tool output and the prose. + +Scan for these patterns across the N transcripts: + +1. **Convergent mistakes.** Did multiple agents try the same wrong thing? Wrong npm package name (e.g., `exa` vs `exa-js`), wrong endpoint, wrong env var, wrong import? If 3/3 agents used the wrong package, that's a **doc quality disaster** even if each "completed" the task. Agents don't invent identical wrong answers — shared training-data residue means the docs aren't overriding the model's wrong priors. + +2. **Hallucinated artifacts.** Compare each agent's `primary_outcome_achieved` claim against what their tool output actually shows. If they claim "printed the title" but no title-fetching tool call appears in their Bash output, they're confabulating. Likely means the doc was unclear enough that the agent pattern-matched instead of reading. + +3. **Inconsistent outcomes.** If 3 agents describe 3 different "successful" end-states, the docs don't clearly define success. + +4. **Silent workarounds.** Did agents patch a bug (missing `await`, wrong env var name, undocumented required parameter) that a human copy-paster wouldn't have? Flag these — they're invisible DX taxes only captured in prose. + +5. **Tool-output vs. narrative contradictions.** Sometimes an agent says "it worked" but the stderr from their Bash call says otherwise, and they failed to notice. Grep tool outputs in the prose for `error`, `404`, `401`, `deprecated`, `warning`. + +Write a 3–5 sentence **Narrative Review** summary and include it prominently in the final report. This often surfaces the highest-value findings of the whole audit. + +### Step 7 — Score the 5 Arena dimensions + +Read `references/evaluation-rubric.md` for full criteria. Score 0–100 based on aggregated evidence. + +**Onboarding success rate is the primary sanity check.** See `references/evaluation-rubric.md` § 0 for the exact cap tiers — at <50% completion, every dimension is capped at 55 regardless of other evidence. + +- **Setup Friction (25%)** — credential prompts, auth retries, install errors. 
Goal items in the "setup" phase failing = big hit. +- **Speed (20%)** — total wall time, time-to-first-working-code. +- **Efficiency (20%)** — tool calls per passed goal item, wasted calls. +- **Error Recovery (15%)** — did errors block goal items, or did agents route around? +- **Doc Quality (20%)** — did docs supply what was needed to pass the checklist? + +Weighted total → letter grade (90+ A, 75+ B, 60+ C, 45+ D, else F). + +### Step 8 — Synthesise findings + +Produce: + +- **Executive summary** — 2–3 sentences. Lead with the grade and the single biggest friction. +- **What went well** — 3–5 bullets. +- **What didn't** — 3–5 bullets. +- **Common friction patterns** — anything hit by ≥2 agents (the high-signal fixes). +- **Session timeline** — aggregate phases across agents (Research, Setup, Execution, Validation) with rough times. +- **Tool call breakdown** — totals across all agents by tool type. +- **Recommended fixes** — prioritised, each citing the doc section or SDK method and a specific rewrite. + +### Step 9 — Render the HTML report + +Read `assets/report-template.html`. Fill placeholders: + +`{{TITLE}}`, `{{TARGET_REF}}`, `{{META}}`, `{{GRADE_LETTER}}`, `{{GRADE_CLASS}}`, `{{OVERALL_SCORE}}`, `{{AGENT_COUNT}}`, `{{COMPLETED_COUNT}}`, `{{STUCK_COUNT}}`, `{{ERRORED_COUNT}}`, `{{NARRATIVE_REVIEW_SECTION}}` (see format below), `{{EXEC_SUMMARY}}`, `{{WENT_WELL_ITEMS}}`, `{{DIDNT_GO_WELL_ITEMS}}`, `{{TIMELINE_SECTION}}`, `{{TOOL_BREAKDOWN_SECTION}}`, `{{METRICS_GRID}}`, `{{PATTERNS_SECTION}}`, `{{FIXES_LIST}}`, `{{AGENT_RESULTS_TABLE}}` (at-a-glance summary table — see format below), `{{AGENT_TRACES_SECTION}}` (full collapsible per-agent trace cards — see format below). + +**Section order in the rendered report** (the template enforces this — do not reorder): +1. Scorecard + agent-status stat grid +2. Narrative Review (`{{NARRATIVE_REVIEW_SECTION}}`) +3. Executive Summary +4. Recommended Fixes +5. What Agents Said (worked / didn't) +6. Common Friction Patterns (`{{PATTERNS_SECTION}}`) +7. Quantitative Metrics +8. Tool Call Breakdown +9. Session Timeline +10. Per-agent Runs (results table + traces) + +Rationale: opinion before data. The reader needs the verdict (narrative + exec summary) and the actionable fix list before being asked to absorb metrics or timelines. Reference-y sections (timeline, tool breakdown) sit near the bottom for verification, not framing. + +**`{{NARRATIVE_REVIEW_SECTION}}` format.** A `
<div>` containing a `<h2>Narrative Review</h2>` and a `<p>` with the 3–5 sentence cross-agent summary from Step 6.5. This is the highest-value finding of the audit — keep prose tight, lead with the strongest observation. If Step 6.5 produced no notable cross-agent finding, render the section with a one-line body: `No cross-agent patterns of note — agents converged on the docs' intended path with minor individual variation.` Do not omit the section. + +The 5 dimension scores are still computed (they feed the overall weighted score and letter grade), but **do not render a per-dimension breakdown section** — it adds visual weight without giving the reader anything actionable beyond what the narrative review and recommended fixes already cover. Keep dimension scoring internal. + +**`{{AGENT_RESULTS_TABLE}}` format.** A `<table>` rendered immediately above the per-agent cards. One row per agent with these columns (in order): + +1. **#** — slot index (1-based), right-aligned, monospace. +2. **Persona × Language** — e.g. `Standard · TypeScript`. Use the values from the agent's JSON trace. +3. **Model** — render this column ONLY when `model = Mixed` (otherwise omit the column entirely; the single model is named in the header `{{META}}` line). +4. **Status** — a `<span>` status pill matching `onboarding_status` from the JSON (`completed`, `partial`, `stuck`, `blocked-on-credentials`). Map `errored` (parser-failed traces) to its own pill. +5. **Tool calls** — sum of `count` across the agent's `tool_calls[]` array. Right-aligned, monospace. +6. **Time** — `wall_time_estimate_sec` from the JSON, formatted as `92s` (or `2m 14s` if ≥120s). Right-aligned, monospace. + +Rationale: the cards below are detailed but require expanding each one. The table gives a one-screen comparison so the reader can spot outliers (the agent that took 3× as long, the one that fired 2× the tool calls) before drilling in. + +**`{{AGENT_TRACES_SECTION}}` format.** One `<details>
` per agent. Each card's summary line MUST include the model used (e.g. `opus`) alongside persona/language chips. The card expands to show: + +1. **Event log (from `detailed_trace`)** — rendered in **compact Arena-style**: monospace rows with color-coded bracketed labels, minimal chrome, no dots or timeline lines. Each row is one line of text; Input/Output blocks appear as indented `
<pre>` blocks directly under their tool call (always visible, not click-to-expand — users want to scan the flow).
+
+   The FIRST event in every log is the **prompt that was sent to that subagent**, rendered with the gold `[PROMPT]` label at timestamp `[setup]`. The full prompt is behind a small click-to-expand button (the only collapsible in the stream — prompts are long and users don't always need them).
+
+   Visual structure:
+
+   ```
+   [setup]     [PROMPT]       Task prompt sent to subagent   [▸ Show full prompt]
+   [+0ms]      [MILESTONE]    agent_started
+   [+100ms]    [THOUGHT]      I'll start by discovering the docs.
+   [+1.2s]     [TOOL_USE #1]   WebFetch
+                 Input: { "url": "...", "prompt": "..." }
+   [+3.4s]     [TOOL_RESULT #1]
+                 Output: # Example Product ...
+   [+4.5s]     [TOOL_USE #2]   Bash
+                 Input: { "command": "npm install ...", "description": "..." }
+   [+9.2s]     [TOOL_RESULT #2]
+                 Output: added 12 packages ...
+   [+12s]      [ERROR]         install · PEP 668 blocked · recovered
+   [+45s]      [RESULT ✓]      Session created, task done.
+   ```
+
+   CSS conventions (compact monospace, light background):
+   - Container: `.trace-timeline` — light gray background (`#fafaf9`), monospace font throughout, 0.78rem font-size, scrollable (max 640px)
+   - Each row: no grid, just inline text. `[time]` (muted) + `[LABEL]` (colored, bold) + body content
+   - Bracketed label color per type:
+     - `[PROMPT]`: gold
+     - `[MILESTONE]`: blue
+     - `[THOUGHT]`: violet (body text also italic + muted)
+     - `[TOOL_USE]`: orange (`#ff6b35` or whatever brand accent the report uses)
+     - `[TOOL_RESULT]`: green (or red if errored)
+     - `[ERROR]`: red (body also red)
+     - `[RESULT ✓]`: green (body green, bold)
+   - Tool-name: orange + semibold
+   - Input/Output: visible inline as `.trace-io` blocks with colored left-border (orange for input, green for output, red for errors). `
Input` block shows the **full** tool input as JSON — never abbreviate. For `WebFetch` specifically, that means showing *both* the `url` AND the `prompt` args — the `prompt` is what the agent asked the page's content to be distilled to, and it's critical signal for understanding agent intent. If the input is large, truncate the value (not the structure) with `…` inside the relevant string.
+   - Prompt block is the exception — it's collapsed by default (prompts are long). Its summary IS visible as a small "▸ Show full prompt" button.
+   - Never revert to dark background — clashes with rest of report.
+   - No grid, no dots, no vertical line — keep it text-flow.
+
+**The main agent keeps each subagent's prompt.** When spawning agents in Step 5, cache the full prompt text keyed by agent index so you can retrieve it for the report. Future-you (rendering) needs to look up what was sent to which agent.
+
+2. **Agent's final prose summary** — kept as a secondary scrollable box below the event log (this is the self-report; the trace is the ground truth).
+3. Tool calls summary grid (name, count, purpose) — quick reference
+4. Evidence (session ID, stdout, etc)
+5. Friction points
+6. Errors (if any)
+7. Positive moments
+
+The event log is the star of the show — this is what gives users the same "I can see exactly what the agent did and thought" experience as the Arena trace view. The prose summary is a narrative recap but the trace is the primary record.
+
+**Per-agent results table** must include a `Model` column when `model = Mixed`, so cross-model comparison is visible at a glance. When a single model was used, mention it once in the header `{{META}}` line instead.
+
+HTML-escape all user-supplied strings. Doc quotes go in `<code>` or `<blockquote>
`. + +**All URLs must be clickable.** When the report references: +- Relative doc paths (e.g. `/quickstart`) → wrap as `<a class="doc-link" href="{TARGET_BASE_URL}{path}"><code>{path}</code></a>` where `{TARGET_BASE_URL}` is the audit target's origin (e.g. `https://docs.example.com`) +- Session/resource IDs (e.g. `f0ec58cc`) → link to the full resource URL (e.g. `https://app.example.com/sessions/{full-id}`) with a `↗` suffix indicating external link +- Full URLs appearing in prose → already linkable, just ensure they're wrapped in `<a>` not just `<code>` + +The CSS for these link classes: +```css +a.doc-link { text-decoration: none; color: inherit; } +a.doc-link:hover code { background: #fff4ef; border-color: var(--brand); color: var(--brand); } +a.session-link { color: #166534; text-decoration: none; } +a.session-link:hover { text-decoration: underline; } +``` + +Rationale: a 404 finding is useless if the user can't click to verify. A session ID is useless if the user can't click through to the recording. Every URL-like string in the report should be one click away from verification. + +### Step 10 — Save and surface + +Save to `./audit-agent-experience-<slug>-<timestamp>.html` (cwd). Slug = lowercase target basename with non-alphanumerics → `-`. Timestamp = `YYYYMMDD-HHMMSS`. + +Print to chat: +- Grade, overall score, and the single biggest fix. +- Count summary: N agents, M completed, K stuck. +- The full file path. + +Open via `Bash: open <path>` on macOS if `exec_mode` allowed it; otherwise just print the path. + +### Step 11 — Clean up workspaces + +If `exec_mode = "Allow Bash"` and you created per-agent subdirectories under `./dx-audit-tmp/` (or similar), delete that tree after the report is rendered: + +```bash +rm -rf ./dx-audit-tmp/ +``` + +Rationale: agents install node_modules, venvs, Go modules, etc. — often tens of MB per agent. Leaving them around pollutes the user's repo and wastes disk. + +**Exception:** if a subagent's `onboarding_status` is `stuck`, or its trace was marked `errored` (JSON parse failed), leave that agent's subdir in place and note it in chat — the user may want to inspect the failing state. Delete only the completed / blocked-on-creds agents' dirs. + +If `exec_mode = "Draft-only"`, no cleanup is needed (no files were written outside the report). + +## Reference files + +- **`references/evaluation-rubric.md`** — 5-dimension scoring rubric (Arena methodology). +- **`references/prompt-variants.md`** — Persona prefix library and core-task heuristics. +- **`references/subagent-brief.md`** — Verbatim brief + trace JSON schema. + +## Assets + +- **`assets/report-template.html`** — HTML template with placeholders, stamped into the final report. + +## Constraints + +- Never paste the target doc into the subagent's prompt — that's the whole point. +- `exec_mode = Draft-only` must disable Bash execution in the subagent brief. +- Never test a target the user didn't explicitly name. +- **Never pre-fill a product, URL, or company in any user-facing question.** Ignore environment signals (email domain, git remote, repo name, memory). Start fresh — the operator may be auditing anyone. +- If a subagent asks for credentials, **that counts as friction** in the score — don't "help" it by auto-providing. Let the agent hit the wall and record it. +- Never write to files outside cwd except the HTML report.
diff --git a/skills/audit-agent-experience/assets/report-template.html b/skills/audit-agent-experience/assets/report-template.html new file mode 100644 index 0000000..476b678 --- /dev/null +++ b/skills/audit-agent-experience/assets/report-template.html @@ -0,0 +1,364 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>Audit Agent Experience — {{TITLE}}</title>
+<style>
+  /* Layout styles are kept minimal here; the link and trace classes below are the
+     ones SKILL.md Step 9 references directly. */
+  body { font-family: -apple-system, "Segoe UI", sans-serif; margin: 0; color: #1c1917; }
+  .container { max-width: 960px; margin: 0 auto; padding: 32px 24px; }
+  section { margin-bottom: 40px; }
+  a.doc-link { text-decoration: none; color: inherit; }
+  a.doc-link:hover code { background: #fff4ef; border-color: var(--brand); color: var(--brand); }
+  a.session-link { color: #166534; text-decoration: none; }
+  a.session-link:hover { text-decoration: underline; }
+  .trace-timeline { background: #fafaf9; font-family: monospace; font-size: 0.78rem; max-height: 640px; overflow-y: auto; }
+</style>
+</head>
+<body>
+<div class="container">
+
+  <header>
+    <div class="kicker">Audit Agent Experience</div>
+    <h1>{{TARGET_REF}}</h1>
+    <div class="meta">{{META}}</div>
+  </header>
+
+  <section class="scorecard">
+    <div class="grade {{GRADE_CLASS}}">{{GRADE_LETTER}}</div>
+    <div class="score">
+      <div class="score-label">Overall DX Score</div>
+      <div class="score-value">{{OVERALL_SCORE}} / 100</div>
+      <div class="score-note">Weighted across 5 dimensions · {{AGENT_COUNT}} agents tested</div>
+    </div>
+  </section>
+
+  <section class="stat-grid">
+    <div class="stat"><div class="stat-label">Agents</div><div class="stat-value">{{AGENT_COUNT}}</div></div>
+    <div class="stat"><div class="stat-label">Completed</div><div class="stat-value">{{COMPLETED_COUNT}}</div></div>
+    <div class="stat"><div class="stat-label">Stuck</div><div class="stat-value">{{STUCK_COUNT}}</div></div>
+    <div class="stat"><div class="stat-label">Errored</div><div class="stat-value">{{ERRORED_COUNT}}</div></div>
+  </section>
+
+  {{NARRATIVE_REVIEW_SECTION}}
+
+  <section>
+    <h2>Executive Summary</h2>
+    <p>{{EXEC_SUMMARY}}</p>
+  </section>
+
+  <section>
+    <h2>Recommended Fixes</h2>
+    {{FIXES_LIST}}
+  </section>
+
+  <section>
+    <h2>What Agents Said</h2>
+    <div class="two-col">
+      <div>
+        <h3>What worked</h3>
+        <ul>{{WENT_WELL_ITEMS}}</ul>
+      </div>
+      <div>
+        <h3>What didn't</h3>
+        <ul>{{DIDNT_GO_WELL_ITEMS}}</ul>
+      </div>
+    </div>
+  </section>
+
+  {{PATTERNS_SECTION}}
+
+  <section>
+    <h2>Quantitative Metrics</h2>
+    {{METRICS_GRID}}
+  </section>
+
+  <section>
+    <h2>Tool Call Breakdown</h2>
+    {{TOOL_BREAKDOWN_SECTION}}
+  </section>
+
+  <section>
+    <h2>Session Timeline</h2>
+    {{TIMELINE_SECTION}}
+  </section>
+
+  <section>
+    <h2>Per-agent Runs <span class="count">{{AGENT_COUNT}}</span></h2>
+    {{AGENT_RESULTS_TABLE}}
+    {{AGENT_TRACES_SECTION}}
+  </section>
+
+</div>
+</body>
+</html>
+ +
+ + + diff --git a/skills/audit-agent-experience/references/evaluation-rubric.md b/skills/audit-agent-experience/references/evaluation-rubric.md new file mode 100644 index 0000000..24f806a --- /dev/null +++ b/skills/audit-agent-experience/references/evaluation-rubric.md @@ -0,0 +1,160 @@ +# Evaluation Rubric — Arena Methodology + +Score each dimension 0–100 based on aggregated trace evidence. Ground every score in specific trace fields: `tool_calls`, `errors`, `retries`, `interruptions_asking_for_creds`, `onboarding_status`, `friction_points`. + +## Table of contents + +1. **Goal completion rate (primary sanity check)** +2. Setup Friction (25%) +3. Speed (20%) +4. Efficiency (20%) +5. Error Recovery (15%) +6. Doc Quality (20%) +7. Score-to-grade mapping +8. Calibration notes + +--- + +## 0. Onboarding success rate — the sanity floor + +Before scoring any dimension, compute two rates across all valid agents: + +``` +onboarding_success_rate = count(onboarding_status == "completed") / count(agents) +docs_promise_met_rate = count(docs_promise_met == true) / count(agents) +``` + +**Cap rule (based on onboarding_success_rate):** + +- ≥ 0.9 → no cap +- ≥ 0.7 → cap any dimension at 85 +- ≥ 0.5 → cap any dimension at 70 +- < 0.5 → cap any dimension at 55 (docs fundamentally failed) + +Rationale: if agents couldn't complete onboarding, no amount of nice prose or fast fetches earns the docs an A. + +**docs_promise_met_rate is independent signal.** An agent can "complete" something (status = completed) but have `docs_promise_met = false` if the docs didn't clearly state what completion looks like. A low docs_promise_met_rate with a high onboarding_success_rate = "agents are succeeding in spite of the docs, not because of them." Flag this explicitly in the report. + +**Look at `primary_outcome_achieved` across agents.** If all 5 agents achieved slightly different outcomes, the docs are ambiguous about what success means — that's a doc quality issue. If they all converge on the same outcome, the docs are clear. + +**Narrative review findings (from Step 6.5) dominate over structured scores.** If the prose review surfaces convergent hallucinations (e.g., all agents used the wrong npm package) or systematic doc bugs invisible to the JSON self-report, cap Doc Quality at 50 regardless of other signals. An agent completing the task using a wrong-but-similar package isn't success — it means the docs left enough ambiguity for training-data priors to take over. That's a fundamental doc failure. + +**Model-mix analysis (when `model = Mixed`).** If the user ran a mixed-model audit, compare onboarding success + docs_promise_met + friction count across models: +- If Opus succeeds but Haiku fails → the docs lean on reasoning the smaller model can't do. That's a doc clarity gap (docs should work for all capable agents, not just the best). +- If Haiku and Sonnet both succeed but flag more friction than Opus → same finding at lower severity. +- If all three succeed equally cleanly → the docs are robust. Rare and great signal. + +Flag the model-mix gap in the report's narrative review section: *"Haiku struggled at X where Opus breezed through — docs too reliant on model-level inference."* + +--- + +## 1. Setup Friction (weight 25%) + +**Question:** How much ceremony stands between "I want to try this" and "I have code running"? + +**Signals:** +- `interruptions_asking_for_creds` — every ask is friction. +- `errors[].stage = "config"` — misconfig issues. 
+- `errors[].message` containing `401`, `403`, `auth`, `unauthorized` — credential-related. +- `friction_points[].phase = "setup"` — setup-stage pain. +- Retries on install or first auth attempt. + +**Anchors:** +- 90+: Zero credential friction (agent found keys, or none required). No auth retries. Install worked first try. +- 70: One small friction — a credential prompt or a single retry. +- 50: Multiple setup frictions — e.g., credential hunt + install conflict. +- 30: Agent spent most of its effort just trying to get set up. +- <20: Agent never got past setup. + +## 2. Speed (weight 20%) + +**Question:** How fast did agents get to working code? + +**Signals:** +- `wall_time_estimate_sec` — total run time. +- `time_to_first_working_code_sec` — time to a running snippet. +- Relative to task complexity (a Stripe charge should take longer than a `curl`). + +**Anchors:** +- 90+: Under 2 minutes to working code for a simple task. +- 70: 2–5 minutes — reasonable. +- 50: 5–10 minutes — noticeable drag. +- 30: Over 10 minutes — painful. +- <20: Never finished within the run. + +Adjust for task complexity. A payments flow is not a cloud browser session. + +## 3. Efficiency (weight 20%) + +**Question:** Did agents get there in a straight line, or did they wander? + +**Signals:** +- Sum of `tool_calls[].count` across all agents — total work. +- `code_attempts` — how many drafts before working. +- `retries` — repeated failing calls. +- `doc_pages_fetched` — if >5 pages for a simple task, docs are fragmented. +- `completed_subtasks` / total `tool_calls` ratio. + +**Anchors:** +- 90+: Under 10 tool calls for a simple task, zero wasted calls, single working draft. +- 70: 10–20 calls, one retry or minor exploration. +- 50: 20–40 calls, some wandering — agent wasn't sure where to look. +- 30: 40+ calls — agent is thrashing. +- <20: Pathological loop or massive exploration. + +## 4. Error Recovery (weight 15%) + +**Question:** When something broke, did the agent (and the docs) help each other recover? + +**Signals:** +- `errors[].recovered` — recovery rate. +- Whether errors led to `retries` that succeeded or to `onboarding_status` degradation. +- Docs surfaces relevant error info when fetched after an error (check `friction_points` for "no troubleshooting" notes). +- `friction_points[].severity = critical | high` at the execution phase. + +**Anchors:** +- 90+: Zero errors, or all errors recovered cleanly with clear doc guidance. +- 70: Errors happened but agents recovered — minor friction. +- 50: Errors slowed progress noticeably; docs didn't help. +- 30: Errors frequently fatal; docs silent on failure modes. +- <20: Every error killed the run. + +## 5. Doc Quality (weight 20%) + +**Question:** Did the docs provide what agents needed, when they needed it? + +**Signals:** +- `doc_pages_fetched` (fragmentation signal if high for a simple task). +- `friction_points` mentioning broken examples, missing info, unclear sections. +- `positive_moments` citing concrete doc wins. +- Whether the code in docs was copy-pasteable and worked. +- Presence/absence of a `llms.txt`, quickstart, or clear API reference. + +**Anchors:** +- 90+: A single quickstart page + working code got the agent to done. Minimal fragmentation. +- 70: Had to piece it together from 2–3 pages, but each was correct. +- 50: Fragmented or stale in places — examples needed adaptation. +- 30: Docs omit critical info (error handling, session lifecycle, etc.). +- <20: Docs either wrong, absent, or actively misleading. + +## 6. 
Score-to-grade mapping + +``` +total = setup*0.25 + speed*0.20 + efficiency*0.20 + recovery*0.15 + doc*0.20 +``` + +| Total | Grade | +|---------|-------| +| 90–100 | A | +| 75–89 | B | +| 60–74 | C | +| 45–59 | D | +| 0–44 | F | + +## 7. Calibration notes + +- **Don't inflate.** Every dimension at 80+ requires evidence. Default-to-B, move to A only with clear wins. +- **Don't deflate.** An absent signal is not a bad signal — a dim with no complaints starts at 75, not 50. +- **Weight severity.** One `critical` friction_point is worth 5 `low` ones. +- **Cite evidence per dimension.** The report shows a one-line rationale per dim — always quote or reference a specific trace field. +- **Blocked-on-credentials is not a failure of the docs** (unless the docs pretend credentials aren't needed). Score setup friction accordingly, but don't dock Doc Quality for an agent correctly refusing to invent keys. diff --git a/skills/audit-agent-experience/references/prompt-variants.md b/skills/audit-agent-experience/references/prompt-variants.md new file mode 100644 index 0000000..8939665 --- /dev/null +++ b/skills/audit-agent-experience/references/prompt-variants.md @@ -0,0 +1,101 @@ +# Prompt Variants + +The subagent gets **one or two sentences**. Do not paste docs. Do not list "rules". The realism of the audit depends on the prompt being as thin as a real developer's first thought. + +## Template + +``` +{persona_prefix} {product}'s getting-started guide using {language}. You've completed it when you've done whatever the guide treats as its primary successful outcome. +``` + +No checklist. No prescriptive steps. The agent reads the docs and decides what "done" means. Only `{persona_prefix}` and `{language}` vary between agents. + +## Persona prefixes + +### Standard (default, no persona flavoring) +> Follow + +No adjectives, no role-play, no behavioral hint. Just the task. Use this as the neutral baseline — removes the Hawthorne effect of telling the agent "you are X type of developer" and lets you measure the docs against an agent doing its natural thing. + +### Pragmatic +> Skim and then follow + +Behavioural hint: shortest path to working. Skips docs when possible. Flags friction bluntly. + +### Thorough +> Read and then follow + +Behavioural hint: reads end-to-end before coding. Surfaces ambiguity. Catches docs that don't survive a close read. + +### Skeptical +> Follow — note anything in the docs that seems wrong or unclear as you go while following + +Behavioural hint: verifies claims. Calls out marketing vs. code. + +## Core task (single phrase — derived from the target) + +Pick **one** core task for the whole audit. All subagents do the same task — only persona and language change. + +Heuristics: + +- **SDK / product with docs site** → "use it to do its primary function once". Examples of how to derive the core task (illustrative only — apply the same logic to whatever product the user named): + - A payments API → "charge a test card" + - An auth provider → "sign up a user" + - A database/backend → "insert a row and read it back" + - A browser-automation SDK → "run a session" + - A search API → "make one query and parse the result" +- **SKILL.md** → "use this skill to do its advertised job" +- **API reference page** → "make one representative call and handle the response" +- **Tutorial / guide** → "follow the guide to completion" +- **README for library** → "install it and run the smallest meaningful example" + +If the target is ambiguous, ask the user via AskUserQuestion before Step 4. 
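+
+To make the assembly mechanical, a throwaway sketch in Bash. The product name is a placeholder, the prefixes mirror this file, and the language tails come from the next section; truncation to N and the per-agent model choice stay in SKILL.md Step 4:
+
+```bash
+# Illustrative prompt assembly (Step 4 of SKILL.md). "Acme" is a placeholder product;
+# prefixes and language tails are the ones defined in this file.
+product="Acme"
+prefixes=("Follow" "Skim and then follow" "Read and then follow")
+tails=("using Python" "using TypeScript (Node.js)")
+
+i=0
+for prefix in "${prefixes[@]}"; do
+  for tail in "${tails[@]}"; do
+    i=$((i + 1))
+    printf "%d. %s %s's getting-started guide %s. You've completed it when you've done whatever the guide treats as its primary successful outcome.\n" \
+      "$i" "$prefix" "$product" "$tail"
+  done
+done
+```
+
+Row-major order (persona rotation, then language) matches the cross-product rule further down.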
+ +## Language tail + +Append after the core task: + +- Python → `using Python` +- TypeScript → `using TypeScript (Node.js)` +- Go → `using Go` +- Shell/Bash → `using bash/curl only` + +## Final shape — examples + +Substitute `{product}` with whatever the user named in Step 1. The shape is identical regardless of target. + +Worked example with `{product} = Acme` (placeholder — not a default): + +- **Standard × TypeScript** → *"Follow Acme's getting-started guide using TypeScript (Node.js). You've completed it when you've done whatever the guide treats as its primary successful outcome."* +- **Pragmatic × Python** → *"Skim and then follow Acme's getting-started guide using Python. You've completed it when you've done whatever the guide treats as its primary successful outcome."* +- **Thorough × Go** → *"Read and then follow Acme's getting-started guide using Go. You've completed it when you've done whatever the guide treats as its primary successful outcome."* +- **Skeptical × Shell** → *"Follow Acme's getting-started guide using bash/curl only — note anything in the docs that seems wrong or unclear as you go. You've completed it when you've done whatever the guide treats as its primary successful outcome."* + +Each agent figures out on its own what the success outcome is from the docs. + +## Cross-product rule + +Generate cells = |personas| × |languages|. Truncate or repeat to hit N: + +- If cells ≥ N → take the first N in row-major order (persona rotation, then language). +- If cells < N → reuse cells in the same order; distinguish reruns by appending a slight task variation to the tail, e.g. `"and print the full error if anything fails"` or `"and also capture a screenshot if possible"`. + +## What NOT to do + +- Do **not** paste doc content, URL content, or code examples into the prompt. +- Do **not** give a numbered list of rules ("1. Use only X. 2. Do Y..."). +- Do **not** say "simulate a newbie" or use theatrical persona role-play — it just invites model performance-for-performance's-sake. The **prefix alone** shapes behaviour enough. +- Do **not** provide the answer in the prompt. If you find yourself writing "hint: use the Playwright quickstart at …", delete it. + +## Seed URL placement + +If the target's **name** alone might be ambiguous (e.g., "Clerk" could be many things), you may include the URL as the tail: `"…starting from https://clerk.com/docs"`. Prefer name-only when unambiguous. + +## Final check + +Before passing the prompt to the subagent: + +- [ ] Under ~30 words? +- [ ] No checklist, no step-by-step, no doc content pasted in? +- [ ] The success criterion is left implicit ("whatever the guide treats as its primary successful outcome") — not dictated? +- [ ] Persona is expressed by the prefix, not by "you are simulating…"? diff --git a/skills/audit-agent-experience/references/subagent-brief.md b/skills/audit-agent-experience/references/subagent-brief.md new file mode 100644 index 0000000..4019945 --- /dev/null +++ b/skills/audit-agent-experience/references/subagent-brief.md @@ -0,0 +1,158 @@ +# Subagent Brief + +The brief wraps the tiny task prompt. Fill `{{EXEC_MODE}}` and `{{TASK_PROMPT}}`. Nothing else. + +## Template + +``` +You are an AI agent being benchmarked on how well you can onboard to a product using only its public documentation and your real tools. Your goal is to actually attempt the task below end-to-end. 
+ +Task: +{{TASK_PROMPT}} + +You are succeeding when you complete whatever the product's getting-started guide treats as "done" — its primary successful outcome. The docs tell you what that is; you decide what counts as success based on them, and honestly report whether you got there. + +Do NOT assume the success criterion is any particular action. Read the docs, infer what the guide is trying to accomplish, attempt it, and tell me whether you succeeded. + +Rules of engagement: +- Use your real tools: WebFetch to read docs, Read/Write for files, {{EXEC_MODE_TOOLS}}. +- Discover the docs yourself. Do NOT expect the docs to be pasted into this prompt — there are none attached. +- {{CREDENTIALS_RULE}} +- Work in the current directory; do not cd elsewhere. Clean up temp files you create. +- If a step fails, try once more with a different approach before giving up. Do not loop indefinitely. +- Stay on-task: you are not allowed to browse unrelated docs or explore the web outside what the task requires. + +Execution mode: {{EXEC_MODE}} +{{EXEC_MODE_NOTE}} + +As you work, internally track: +- Every tool call you make (name + 1-line purpose). +- Every error you hit (short message + whether you recovered). +- Every page you fetched from the target's documentation. +- Every credential/config prompt you issued to the user. +- Roughly how long the agent-reasoning portion took (seconds — your best estimate). +- Whether you achieved the end-state the task asked for. + +## Output format + +Write your working thoughts, attempts, and final outcome in free prose. Then — at the very end — emit exactly one fenced JSON code block matching this schema. Nothing after it. + +```json +{ + "persona": "", + "language": "", + "task_prompt": "", + "onboarding_status": "completed | partial | stuck | blocked-on-credentials", + "primary_outcome_achieved": "", + "success_criterion_from_docs": "", + "docs_promise_met": , + "evidence": "", + "wall_time_estimate_sec": , + "time_to_first_working_code_sec": , + "tool_calls": [ + {"tool": "WebFetch", "count": 4, "purpose": "fetched docs pages"}, + {"tool": "Bash", "count": 7, "purpose": "npm install, ran script"}, + {"tool": "Write", "count": 1, "purpose": "created index.ts"} + ], + "doc_pages_fetched": , + "errors": [ + {"stage": "install | config | execution | doc-fetch", "message": "", "recovered": true} + ], + "retries": , + "interruptions_asking_for_creds": , + "code_attempts": , + "completed_subtasks": [ + "fetched quickstart", + "installed deps", + "wrote minimal script", + "ran successfully with real credentials" + ], + "detailed_trace": [ + {"t_ms": 0, "type": "milestone", "message": "agent_started"}, + {"t_ms": 800, "type": "assistant_thought", "message": "I'll start by discovering the docs."}, + {"t_ms": 1200, "type": "tool_use", "n": 1, "tool": "WebFetch", "input": {"url": "https://docs.example.com", "prompt": "What is this?"}}, + {"t_ms": 3400, "type": "tool_result", "n": 1, "output": "# Example Product\n\nGet started by..."}, + {"t_ms": 4000, "type": "assistant_thought", "message": "Now I need to install the SDK."}, + {"t_ms": 4500, "type": "tool_use", "n": 2, "tool": "Bash", "input": {"command": "npm install example-sdk", "description": "Install SDK"}}, + {"t_ms": 9200, "type": "tool_result", "n": 2, "output": "added 12 packages, audited 13 packages..."}, + {"t_ms": 12000, "type": "error", "n": 3, "stage": "install", "message": "PEP 668 blocked system pip", "recovered": true}, + {"t_ms": 45000, "type": "result", "success": true, "summary": "Session created, task 
done."} + ], + "friction_points": [ + { + "severity": "critical | high | medium | low", + "phase": "setup | config | execution | teardown", + "quote_or_section": "", + "issue": "" + } + ], + "positive_moments": [ + "" + ] +} +``` + +## Detailed trace — required field + +`detailed_trace` is an ordered array of every significant event during your run: milestones, your own reasoning steps, every tool call with its input, every tool result (truncate output to ~500 chars if very long), errors, and the final result. This mirrors how a real trace viewer shows an agent run — the user should be able to follow exactly what you did and why. + +Event types (use `type` field): +- `milestone` — stage markers (e.g. `"agent_started"`, `"sdk_installed"`, `"first_code_written"`, `"first_run_attempted"`) +- `assistant_thought` — your own reasoning/narration between tool calls. One per distinct thought. Be genuine — this is the "thinking out loud" between actions. +- `tool_use` — every tool call. Include `n` (1-indexed sequence), `tool` (name), `input` (full input args as an object). +- `tool_result` — every tool result. Include matching `n`, `output` (trimmed to ≤500 chars with `...` if truncated). If the result was an error, also include `error: true`. +- `error` — any error that wasn't already covered by a `tool_result` (e.g. a thrown runtime error, an assertion failure). Include `stage`, `message`, `recovered` (bool). +- `result` — final outcome. Include `success` (bool) and `summary` (one-liner). + +Timestamps: `t_ms` = milliseconds since you started the task. You don't have real timestamps — estimate based on how long each step felt. It's fine to round to the nearest 100ms. The ordering is what matters most; the spacing just gives a visual sense of pacing. + +**Redact credentials.** If a tool input or result contains a raw API key, token, or secret value, replace it with `[REDACTED]` before writing it into the trace. + +## Final checks before you output + +- The JSON block is the LAST thing in your response. +- **`success_criterion_from_docs` is extracted from the docs, not invented.** Quote or paraphrase what the getting-started guide says the end-state is ("you should see…", "the session is now running…", "you've successfully made your first charge", etc). If the docs never say what success looks like, that's itself a finding — state `"docs do not define a success criterion"` and set `docs_promise_met: false`. +- `onboarding_status = "completed"` only when you believe you reached the docs' success criterion. +- `"blocked-on-credentials"` if the only thing stopping you is missing keys. +- `docs_promise_met` is independent of `onboarding_status` — the docs might set a bar you met (true) or might have been silent/misleading (false) even if you "completed" something. +- `friction_points` cite specific passages, absences, or errors — not vague complaints. +- If nothing went wrong, `friction_points` may be `[]`. If nothing stood out, `positive_moments` may be `[]`. +- `wall_time_estimate_sec` is your best estimate of how long this would take a real agent doing real tool calls. +``` + +## Placeholders to fill + +- `{{TASK_PROMPT}}` — the one-sentence task (from prompt-variants). Abstract, no steps. +- `{{CREDENTIALS_RULE}}` — depends on whether the user opted to auto-inject credentials: + + **If credentials = None (friction test):** + > If the task needs credentials (API keys, tokens), STOP and ask the user in one clear message. That ask counts as friction — do not try to fake or skip it. 
Even in draft-only mode, you must still issue the credential request — mark it in your trace and continue drafting what you can. + + **If credentials = Auto-inject (user provided):** + > Credentials have been provided in your workspace `.env` file under **generic, product-agnostic names** (`API_KEY`, and optionally `PROJECT_ID` / `SECRET`). The names `API_KEY` / `PROJECT_ID` / `SECRET` are NOT what the product's SDK necessarily expects — they are just neutral containers for the values. + > + > To use them: (1) read the product's docs to find out what env var / config option the SDK actually expects (e.g. the docs might say "set `FOOBAR_API_KEY`"). (2) Then either re-export in shell (`export FOOBAR_API_KEY=$API_KEY`) before running, or pass directly in code (`new Foo({ apiKey: process.env.API_KEY })`). You must not assume a specific env var name — discover it from the docs. + > + > Do NOT log, echo, or print the raw credential values. If you hit an auth error despite having credentials, record it in `errors[]` with stage="config" — that's real friction (and usually a doc-clarity issue). `interruptions_asking_for_creds` should be 0 since the values are provided; but if you can't figure out the mapping from the docs, record that as a friction_point. + +## Exec mode resolution + +Fill placeholders based on user's `exec_mode`: + +- `exec_mode = "Allow Bash (Recommended)"`: + - `{{EXEC_MODE}}` → `Allow Bash (real execution on host machine)` + - `{{EXEC_MODE_TOOLS}}` → `Bash to install deps and run code` + - `{{EXEC_MODE_NOTE}}` → `You may run npm, pip, curl, git clone, node, python, etc. Be conservative — don't modify files outside cwd unless the task requires it.` + +- `exec_mode = "Draft-only"`: + - `{{EXEC_MODE}}` → `Draft-only (no shell execution)` + - `{{EXEC_MODE_TOOLS}}` → `but NOT Bash` + - `{{EXEC_MODE_NOTE}}` → `Do not run any Bash commands. Draft the code you would run, list the commands you would execute, and score the docs based on what you could gather from WebFetch alone.` + +## Notes for the skill driver (you, the parent) + +- Invoke each Agent in parallel (one message, multiple Agent calls). +- Use `subagent_type: "general-purpose"`. +- Parse the last fenced JSON block with regex: `/```json\s*(\{[\s\S]*?\})\s*```\s*$/`. +- If parsing fails, mark trace `errored` with a `raw_tail` (last 500 chars) for debugging. +- If a subagent stops to ask for credentials, its `onboarding_status` should be `"blocked-on-credentials"`. The ask itself is captured in `interruptions_asking_for_creds` — do not interactively answer during the run.
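+
+A parent-side extraction sketch to pair with the notes above. File paths, the helper name, and the `jq` dependency are assumptions; only the behaviour (last fenced JSON block, else `errored` with a `raw_tail`) is what this skill prescribes:
+
+```bash
+# Illustrative trace extraction for the parent agent. Pulls the LAST fenced ```json
+# block from an agent's saved response; if that fails, emits the "errored" stub with
+# the last 500 characters as raw_tail.
+extract_trace() {
+  local file="$1" json
+  json=$(awk '
+    /^```json/ { capture=1; buf=""; next }
+    /^```$/    { if (capture) { last=buf; capture=0 }; next }
+    capture    { buf = buf $0 "\n" }
+    END        { printf "%s", last }
+  ' "$file")
+  if [ -n "$json" ] && printf '%s' "$json" | jq -e . >/dev/null 2>&1; then
+    printf '%s\n' "$json"
+  else
+    jq -n --arg tail "$(tail -c 500 "$file")" '{onboarding_status: "errored", raw_tail: $tail}'
+  fi
+}
+
+extract_trace ./dx-audit-tmp/agent-3/response.txt > ./dx-audit-tmp/agent-3/trace.json
+```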