From 9dc73e4066bb01ceb4c82a21ead01e0c828709d5 Mon Sep 17 00:00:00 2001 From: Phil Leggetter Date: Wed, 22 Apr 2026 20:15:07 +0100 Subject: [PATCH 1/3] docs: refresh TESTING.md with three-level model and eval framing - Introduce code examples, static quality (Tessl), and agent scenario layers; connect scenarios to evals as the missing feedback loop. - Expand agent scenario docs: prerequisites, scenario comparison table, score-delta framing, Layer 2 titled Evals. - Keep test-examples.sh examples aligned with main (event-gateway only until outpost ships examples/). - Drop outpost-specific example paths that are not on main yet. Made-with: Cursor --- TESTING.md | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/TESTING.md b/TESTING.md index 5c4e0c2..93ef3ef 100644 --- a/TESTING.md +++ b/TESTING.md @@ -1,6 +1,12 @@ # Testing Hookdeck Agent Skills -This document covers automated testing for code examples in the `event-gateway` skill. Follows the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md). +Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed). + +The first level validates that the code examples work. The second validates that the skill files are well-formed. The third answers the harder question: **can agents actually use these skills to get things done?** + +That third level — agent scenario testing — is where testing becomes evaluation. Traditional developer experience had natural feedback loops: support tickets, onboarding funnels, user interviews. With agents, that signal disappears. The agent either succeeds or silently moves on. Evals are the feedback loop you get back. + +--- ## Code Example Tests @@ -114,9 +120,9 @@ We use two layers to evaluate skills: **skill quality** (static) and **agent sce Baseline: run `npm run skill:review` periodically and record scores; use them to guide skill improvements. -### Layer 2: Agent Scenarios (Custom Tool) +### Layer 2: Agent Scenarios (Evals) -The scenario tester installs skills, runs Claude Code with a scenario prompt, and writes a scored report. Use it to check that an agent can actually follow the staged workflow. +This is where testing becomes evaluation. The scenario tester installs skills, runs Claude Code with a scenario prompt, and writes a scored report. It answers: can an agent actually follow the staged workflow to accomplish a real task? **Prerequisites:** [Claude Code CLI](https://claude.ai/download) installed and logged in (`ANTHROPIC_API_KEY` or `claude login`). The tool runs a preflight that sends a short prompt to the CLI; if you see "Claude CLI did not respond within 15s", the CLI may be blocked (e.g. in a restricted sandbox). Run with a full environment or ensure the CLI can reach the API. @@ -135,12 +141,18 @@ npx tsx tools/agent-scenario-tester/src/index.ts run receive-webhooks express **Options:** `--dry-run`, `--verbose`, `--timeout ` (default 300). -**Scenarios:** Defined in `scenarios.yaml`. Initial set: +**Scenarios:** Defined in `scenarios.yaml`. Three scenarios test increasingly interesting agent behaviors: - **receive-webhooks** — Setup Hookdeck, build handler with signature verification, run `hookdeck listen`, document inspect/retry workflow. Tests stages 01–04 (iterate is documentation-only: agent documents how to list request → event → attempt and retry; no live traffic required). - **receive-provider-webhooks** — Same plus a provider (e.g. Stripe). Use `--provider stripe`. Only the event-gateway skill is pre-installed; the agent is expected to discover and use the provider skill from webhook-skills (e.g. `npx skills add hookdeck/webhook-skills --skill stripe-webhooks -y -g`) and use the provider SDK in the handler. Tests composition and the provider-webhooks checklist. - **investigate-delivery-health** — Documentation-only: assume the user has had webhooks for a week and wants to understand delivery performance (success vs failure, backlog, latency). The prompt does **not** mention "metrics" or "hookdeck gateway metrics"; the assessor checks whether the agent used metrics CLI commands. Use to verify that agents discover and use metrics from the skill when the task implies it. +| Scenario | Tests | Key question | +|----------|-------|-------------| +| `receive-webhooks` | Core skill usage | Can the agent follow the skill to set up webhook receiving? | +| `receive-provider-webhooks` | Composition | Does the agent discover and install a Stripe-specific skill on its own? | +| `investigate-delivery-health` | Discovery | Does the agent find diagnostic tools (CLI metrics, MCP) when they aren't mentioned in the prompt? | + ### Scenario run checklist Run these and evaluate results; iterate on skills or prompts as needed. @@ -164,6 +176,8 @@ Run these and evaluate results; iterate on skills or prompts as needed. - **Tool:** Refine prompts in `scenarios.yaml`, adjust the evaluation rubric, or add scenarios. 4. **Re-run** to confirm improvements; repeat as needed. +The A/B unit is **skill version × scenario set → score delta**. Change a skill, re-run the scenarios, compare scores. + CI runs scenario tests on-demand (workflow_dispatch) and weekly (schedule). Use artifacts to monitor regressions and guide further skill improvements. ### Evaluation Rubric (receive-webhooks) From dd425fa32d2ce12610cb47e9294d343ec2bcbc25 Mon Sep 17 00:00:00 2001 From: Phil Leggetter Date: Wed, 22 Apr 2026 20:17:39 +0100 Subject: [PATCH 2/3] docs(TESTING): restore webhook-skills lineage for code examples Made-with: Cursor --- TESTING.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/TESTING.md b/TESTING.md index 93ef3ef..0d9ebfb 100644 --- a/TESTING.md +++ b/TESTING.md @@ -1,5 +1,7 @@ # Testing Hookdeck Agent Skills +This document covers automated testing for code examples in the **`event-gateway`** skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md). + Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed). The first level validates that the code examples work. The second validates that the skill files are well-formed. The third answers the harder question: **can agents actually use these skills to get things done?** From e501272ae04ed41cbc2d6e85a6f5dadc41817cd5 Mon Sep 17 00:00:00 2001 From: Phil Leggetter Date: Wed, 22 Apr 2026 20:24:37 +0100 Subject: [PATCH 3/3] chore(docs): remove bold --- TESTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TESTING.md b/TESTING.md index 0d9ebfb..2b85887 100644 --- a/TESTING.md +++ b/TESTING.md @@ -1,6 +1,6 @@ # Testing Hookdeck Agent Skills -This document covers automated testing for code examples in the **`event-gateway`** skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md). +This document covers automated testing for code examples in the `event-gateway` skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md). Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed).