24 changes: 20 additions & 4 deletions TESTING.md
@@ -1,6 +1,14 @@
# Testing Hookdeck Agent Skills

This document covers automated testing for code examples in the `event-gateway` skill. Follows the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md).
This document covers automated testing for code examples in the `event-gateway` skill. The example tests follow the same patterns as [hookdeck/webhook-skills](https://github.com/hookdeck/webhook-skills/blob/main/TESTING.md).

Hookdeck tests its agent skills at three levels: **code example tests** (unit/integration tests for the example applications shipped with each skill), **static quality checks** (linting and scoring skill files), and **agent scenario testing** (giving real agents tasks and scoring whether they succeed).

The first level validates that the code examples work. The second validates that the skill files are well-formed. The third answers the harder question: **can agents actually use these skills to get things done?**

That third level — agent scenario testing — is where testing becomes evaluation. Traditional developer experience had natural feedback loops: support tickets, onboarding funnels, user interviews. With agents, that signal disappears. The agent either succeeds or silently moves on. Evals are the feedback loop you get back.

---

## Code Example Tests

@@ -114,9 +122,9 @@ We use two layers to evaluate skills: **skill quality** (static) and **agent sce

Baseline: run `npm run skill:review` periodically and record scores; use them to guide skill improvements.

### Layer 2: Agent Scenarios (Custom Tool)
### Layer 2: Agent Scenarios (Evals)

The scenario tester installs skills, runs Claude Code with a scenario prompt, and writes a scored report. Use it to check that an agent can actually follow the staged workflow.
This is where testing becomes evaluation. The scenario tester installs skills, runs Claude Code with a scenario prompt, and writes a scored report. It answers: can an agent actually follow the staged workflow to accomplish a real task?

**Prerequisites:** [Claude Code CLI](https://claude.ai/download) installed and logged in (`ANTHROPIC_API_KEY` or `claude login`). The tool runs a preflight that sends a short prompt to the CLI; if you see "Claude CLI did not respond within 15s", the CLI may be blocked (e.g. in a restricted sandbox). Run with a full environment or ensure the CLI can reach the API.

@@ -135,12 +143,18 @@ npx tsx tools/agent-scenario-tester/src/index.ts run receive-webhooks express

**Options:** `--dry-run`, `--verbose`, `--timeout <seconds>` (default 300).

**Scenarios:** Defined in `scenarios.yaml`. Initial set:
**Scenarios:** Defined in `scenarios.yaml`. Three scenarios test increasingly interesting agent behaviors:

- **receive-webhooks** — Set up Hookdeck, build a handler with signature verification, run `hookdeck listen`, and document the inspect/retry workflow. Tests stages 01–04 (iterate is documentation-only: the agent documents how to list request → event → attempt and retry; no live traffic required).
- **receive-provider-webhooks** — Same plus a provider (e.g. Stripe). Use `--provider stripe`. Only the event-gateway skill is pre-installed; the agent is expected to discover and use the provider skill from webhook-skills (e.g. `npx skills add hookdeck/webhook-skills --skill stripe-webhooks -y -g`) and use the provider SDK in the handler. Tests composition and the provider-webhooks checklist.
- **investigate-delivery-health** — Documentation-only: assume the user has had webhooks for a week and wants to understand delivery performance (success vs failure, backlog, latency). The prompt does **not** mention "metrics" or "hookdeck gateway metrics"; the assessor checks whether the agent used metrics CLI commands. Use to verify that agents discover and use metrics from the skill when the task implies it.

| Scenario | Tests | Key question |
|----------|-------|-------------|
| `receive-webhooks` | Core skill usage | Can the agent follow the skill to set up webhook receiving? |
| `receive-provider-webhooks` | Composition | Does the agent discover and install a Stripe-specific skill on its own? |
| `investigate-delivery-health` | Discovery | Does the agent find diagnostic tools (CLI metrics, MCP) when they aren't mentioned in the prompt? |
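
The discovery check for `investigate-delivery-health` can be pictured as a simple scan of the agent's output. The sketch below is hypothetical: `findMetricsCommands`, `passedDiscoveryCheck`, and the matching rule are illustrative names and assumptions, not the assessor's actual implementation, which may score the run quite differently.

```typescript
// Hypothetical sketch of a discovery check. The real assessor in
// tools/agent-scenario-tester may work differently.

// Assumed rule: a line counts if it invokes `hookdeck gateway metrics`.
const METRICS_COMMAND = /\bhookdeck\s+gateway\s+metrics\b/;

/**
 * Returns the distinct lines of the agent's output that invoke the metrics CLI.
 * `agentOutput` stands in for whatever transcript or report a run produces.
 */
function findMetricsCommands(agentOutput: string): string[] {
  const found = new Set<string>();
  for (const line of agentOutput.split("\n")) {
    if (METRICS_COMMAND.test(line)) {
      found.add(line.trim());
    }
  }
  return Array.from(found);
}

/** Passes only if the agent reached for metrics without being told to. */
function passedDiscoveryCheck(prompt: string, agentOutput: string): boolean {
  // Guard: a prompt that mentions metrics would make the check meaningless.
  if (/\bmetrics\b/i.test(prompt)) {
    throw new Error("Scenario prompt must not mention metrics");
  }
  return findMetricsCommands(agentOutput).length > 0;
}
```

The guard on the prompt mirrors the scenario's design: the check only measures discovery if the task description leaves the tool unnamed.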

### Scenario run checklist

Run these and evaluate results; iterate on skills or prompts as needed.
Expand All @@ -164,6 +178,8 @@ Run these and evaluate results; iterate on skills or prompts as needed.
- **Tool:** Refine prompts in `scenarios.yaml`, adjust the evaluation rubric, or add scenarios.
4. **Re-run** to confirm improvements; repeat as needed.

The A/B unit is **skill version × scenario set → score delta**. Change a skill, re-run the scenarios, compare scores.
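
As a rough illustration of that comparison, the sketch below computes a per-scenario score delta between two runs. The `ScenarioScores` shape, the function names, and the numbers in the usage note are assumptions made for the example; the report format the scenario tester actually writes is not shown here.

```typescript
// Hypothetical shapes: not the scenario tester's real report format.
type ScenarioScores = Record<string, number>; // scenario name -> total score

interface ScoreDelta {
  scenario: string;
  before: number;
  after: number;
  delta: number; // after - before; negative means the skill change hurt
}

/** Compares two runs of the same scenario set, one per skill version. */
function compareRuns(before: ScenarioScores, after: ScenarioScores): ScoreDelta[] {
  // Only scenarios present in both runs are comparable.
  return Object.keys(before)
    .filter((scenario) => scenario in after)
    .sort()
    .map((scenario) => ({
      scenario,
      before: before[scenario],
      after: after[scenario],
      delta: after[scenario] - before[scenario],
    }));
}

/** Names of scenarios whose score dropped after the skill change. */
function regressions(deltas: ScoreDelta[]): string[] {
  return deltas.filter((d) => d.delta < 0).map((d) => d.scenario);
}
```

With made-up scores, `compareRuns({ "receive-webhooks": 14 }, { "receive-webhooks": 15 })` yields a single entry with `delta: 1`. Keeping the scenario set fixed between runs is what lets the delta be attributed to the skill change.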

CI runs scenario tests on-demand (workflow_dispatch) and weekly (schedule). Use artifacts to monitor regressions and guide further skill improvements.

### Evaluation Rubric (receive-webhooks)