From 9dc73e4066bb01ceb4c82a21ead01e0c828709d5 Mon Sep 17 00:00:00 2001
From: Phil Leggetter <phil@leggetter.co.uk>
Date: Wed, 22 Apr 2026 20:15:07 +0100
Subject: [PATCH 1/3] docs: refresh TESTING.md with three-level model and eval
 framing

- Introduce code examples, static quality (Tessl), and agent scenario layers; connect scenarios to evals as the missing feedback loop.
- Expand agent scenario docs: prerequisites, scenario comparison table, score-delta framing, Layer 2 titled Evals.
- Keep test-examples.sh examples aligned with main (event-gateway only until outpost ships examples/).
- Drop outpost-specific example paths that are not on main yet.

Made-with: Cursor
---
 TESTING.md | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/TESTING.md b/TESTING.md
index 5c4e0c2..93ef3ef 100644
--- a/TESTING.md
+++ b/TESTING.md
@@ -1,6 +1,12 @@
 # Testing Hookdeck Agent Skills
 
-This document covers automated testing for code examples in the `event-gateway` skill. Follows the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md).
+Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed).
+
+The first level validates that the code examples work. The second validates that the skill files are well-formed. The third answers the harder question: **can agents actually use these skills to get things done?**
+
+That third level, agent scenario testing, is where testing becomes evaluation. Traditional developer experience had natural feedback loops: support tickets, onboarding funnels, user interviews. With agents, that signal disappears: the agent either succeeds or silently moves on. Evals restore that feedback loop.
+
+---
 
 ## Code Example Tests
 
@@ -114,9 +120,9 @@ We use two layers to evaluate skills: **skill quality** (static) and **agent sce
 
 Baseline: run `npm run skill:review` periodically and record scores; use them to guide skill improvements.
 
-### Layer 2: Agent Scenarios (Custom Tool)
+### Layer 2: Agent Scenarios (Evals)
 
-The scenario tester installs skills, runs Claude Code with a scenario prompt, and writes a scored report. Use it to check that an agent can actually follow the staged workflow.
+This is where testing becomes evaluation. The scenario tester installs skills, runs Claude Code with a scenario prompt, and writes a scored report. It answers: can an agent actually follow the staged workflow to accomplish a real task?
 
 **Prerequisites:** [Claude Code CLI](https://claude.ai/download) installed and logged in (`ANTHROPIC_API_KEY` or `claude login`). The tool runs a preflight that sends a short prompt to the CLI; if you see "Claude CLI did not respond within 15s", the CLI may be blocked (e.g. in a restricted sandbox). Run with a full environment or ensure the CLI can reach the API.
 
@@ -135,12 +141,18 @@ npx tsx tools/agent-scenario-tester/src/index.ts run receive-webhooks express
 
 **Options:** `--dry-run`, `--verbose`, `--timeout <seconds>` (default 300).
 
-**Scenarios:** Defined in `scenarios.yaml`. Initial set:
+**Scenarios:** Defined in `scenarios.yaml`. Three scenarios test progressively harder agent behaviors (core skill usage, composition, and discovery):
 
 - **receive-webhooks** — Setup Hookdeck, build handler with signature verification, run `hookdeck listen`, document inspect/retry workflow. Tests stages 01–04 (iterate is documentation-only: agent documents how to list request → event → attempt and retry; no live traffic required).
 - **receive-provider-webhooks** — Same plus a provider (e.g. Stripe). Use `--provider stripe`. Only the event-gateway skill is pre-installed; the agent is expected to discover and use the provider skill from webhook-skills (e.g. `npx skills add hookdeck/webhook-skills --skill stripe-webhooks -y -g`) and use the provider SDK in the handler. Tests composition and the provider-webhooks checklist.
 - **investigate-delivery-health** — Documentation-only: assume the user has had webhooks for a week and wants to understand delivery performance (success vs failure, backlog, latency). The prompt does **not** mention "metrics" or "hookdeck gateway metrics"; the assessor checks whether the agent used metrics CLI commands. Use to verify that agents discover and use metrics from the skill when the task implies it.
 
+| Scenario | Tests | Key question |
+|----------|-------|-------------|
+| `receive-webhooks` | Core skill usage | Can the agent follow the skill to set up webhook receiving? |
+| `receive-provider-webhooks` | Composition | Does the agent discover and install the provider skill (e.g. `stripe-webhooks`) on its own? |
+| `investigate-delivery-health` | Discovery | Does the agent find diagnostic tools (CLI metrics, MCP) when they aren't mentioned in the prompt? |
+
 ### Scenario run checklist
 
 Run these and evaluate results; iterate on skills or prompts as needed.
@@ -164,6 +176,8 @@ Run these and evaluate results; iterate on skills or prompts as needed.
    - **Tool:** Refine prompts in `scenarios.yaml`, adjust the evaluation rubric, or add scenarios.
 4. **Re-run** to confirm improvements; repeat as needed.
 
+The unit of comparison is **skill version × scenario set → score delta**: change a skill, re-run the same scenarios, and compare the scores against the previous run.
+
 CI runs scenario tests on-demand (workflow_dispatch) and weekly (schedule). Use artifacts to monitor regressions and guide further skill improvements.
 
 ### Evaluation Rubric (receive-webhooks)

From dd425fa32d2ce12610cb47e9294d343ec2bcbc25 Mon Sep 17 00:00:00 2001
From: Phil Leggetter <phil@leggetter.co.uk>
Date: Wed, 22 Apr 2026 20:17:39 +0100
Subject: [PATCH 2/3] docs(TESTING): restore webhook-skills lineage for code
 examples

Made-with: Cursor
---
 TESTING.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/TESTING.md b/TESTING.md
index 93ef3ef..0d9ebfb 100644
--- a/TESTING.md
+++ b/TESTING.md
@@ -1,5 +1,7 @@
 # Testing Hookdeck Agent Skills
 
+This document covers automated testing for code examples in the **`event-gateway`** skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md).
+
 Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed).
 
 The first level validates that the code examples work. The second validates that the skill files are well-formed. The third answers the harder question: **can agents actually use these skills to get things done?**

From e501272ae04ed41cbc2d6e85a6f5dadc41817cd5 Mon Sep 17 00:00:00 2001
From: Phil Leggetter <phil@leggetter.co.uk>
Date: Wed, 22 Apr 2026 20:24:37 +0100
Subject: [PATCH 3/3] chore(docs): remove bold

---
 TESTING.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/TESTING.md b/TESTING.md
index 0d9ebfb..2b85887 100644
--- a/TESTING.md
+++ b/TESTING.md
@@ -1,6 +1,6 @@
 # Testing Hookdeck Agent Skills
 
-This document covers automated testing for code examples in the **`event-gateway`** skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md).
+This document covers automated testing for code examples in the `event-gateway` skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md).
 
 Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed).