diff --git a/README.md b/README.md index e984bb7..8f07287 100644 --- a/README.md +++ b/README.md @@ -1,46 +1,52 @@ # AgentAssert -**Proof-of-concept:** Playwright tests + reusable assertion helpers for tool-calling LLM agents (MCP-shaped tool schemas, in-process tools in the demo). +**What this is:** Automated tests for an **AI agent** that can **call tools** (for example: “read this file,” “call this API”). The tests check **what the agent decided to do** (which tools it picked, in what order, and what it said at the end)—not just the final text. -This repository is **`"private": true`** in `package.json` — it is **not** published to npm and is **not** a productized “framework.” It is a **reference implementation** you can clone, read, and copy from. A clean entry point exists for imports (`index.ts` → `framework/`), but publishing a real package would add build steps, versioning, and semver guarantees — out of scope for this POC. +**For QA:** Think of it like testing **business rules on decisions**, not only checking a screen or a single return value. Failures attach **traces** (a step-by-step log) so you can see prompts, tool calls, and outputs in the Playwright HTML report. + +This repo is a **sample / reference project** (`"private": true` in `package.json`—not published as a product). You can copy patterns from it into your own codebase. --- ## Repository layout -| Path | Role | -|------|------| -| **`framework/`** | Reusable assertions (`AgentAssert`, `HeuristicContractMatcher`, `BehaviorContract`) and **shared types** (`framework/types.ts` — traces, contracts, tool shapes). | -| **`index.ts`** | Barrel export so you can `import { AgentAssert, … } from 'agent-assert'` when vendoring this repo. | -| **`examples/agent/`** | **Demo system under test:** LLM loop, `ToolRegistry`, file-reader / api-caller tools. Replace with your own agent; keep compatible `AgentTrace` / `AgentOutput` shapes if you reuse the assertions. | -| **`tests/`** | Playwright specs and fixtures wired to the demo agent via `tests/fixtures/setup.ts`. | + +| Folder / file | In plain terms | +|---------------|----------------| +| **`framework/`** | **Checks you reuse:** “Was tool X used?” “Does the answer follow our rules?” Helpers include `AgentAssert`, `HeuristicContractMatcher`, and shared types for traces and contracts. | +| **`index.ts`** | Single entry point for importing the framework from this repo when you copy it in. | +| **`examples/agent/`** | **Demo agent** (the app under test): chat loop, tool list, sample tools (`file-reader`, `api-caller`). Swap this for your real agent; keep the same trace shape if you keep these tests. | +| **`tests/`** | Playwright test files plus **`tests/fixtures/setup.ts`**, which builds the demo agent for each scenario. | + --- ## About -A working proof-of-concept that demonstrates five testing patterns for AI agents that call tools through a **`ToolRegistry`** (the same tool definitions map cleanly to Anthropic/OpenAI tool formats and to MCP-style schemas). The repo does not run a live MCP server by default; tools execute in-process so tests stay fast and deterministic. +**What gets tested:** Five patterns that cover **tool choice**, **output rules**, **order of steps**, **staying inside allowed tools**, and **behavior when something fails**. + +**Tool list:** Tools are registered in a **`ToolRegistry`** (a named list the model sees and that your code runs when the model asks). 
There is **no separate MCP server** in the default demo—tools run **inside the test process** so runs are **fast and stable**. -**Behavior contracts** are checked by **`HeuristicContractMatcher`**: required fields, **case-insensitive keyword overlap**, **regex** forbidden phrases, and optional custom validators — *not* embedding similarity or LLM-as-judge semantics. The weighted **`confidence`** score is a tuning signal, not a calibrated measure of meaning. That avoids pretending the cheap matcher is “semantic” while still letting you reject exact-string tests for variable LLM wording. See **Heuristic matching: scope and limits** below. +**“Behavior contracts”:** Rules like “must include these fields,” “should mention these words,” “must not match these patterns.” The matcher is **rule-based** (keywords and patterns)—**not** a second AI judging the first, and **not** “true” semantic understanding. The **`confidence`** score is a **rough pass/fail helper** for tuning, not a scientific accuracy percentage. More detail: **Heuristic matching: scope and limits** below. -The deliberate use of Playwright (not Jest, not Vitest) as the test runner is itself a publishable insight. +**Why Playwright:** Same tool as browser E2E tests, but here it drives **API + agent** flows. You get **retries**, **timeouts per test**, and **HTML reports** with attachments (good for flaky LLM output). --- ## Architecture: How the Pieces Connect ``` -┌─────────────────────────────────────────────────────────┐ -│ YOUR TEST FILE │ +┌───────────────────────────────────────────────────────────────────────────────────┐ +│ YOUR TEST FILE │ │ import { AgentAssert } from 'agent-assert' // or ../../framework/AgentAssert.js │ -│ const trace = await agent.run("some prompt") │ -│ AgentAssert.toolWasInvoked(trace, 'file-reader') │ -│ AgentAssert.satisfiesContract(trace.output, CONTRACT) │ -└─────────────────┬──────────────────────────┬────────────┘ +│ const trace = await agent.run("some prompt") │ +│ AgentAssert.toolWasInvoked(trace, 'file-reader') │ +│ AgentAssert.satisfiesContract(trace.output, CONTRACT) │ +└─────────────────┬──────────────────────────┬──────────────────────────────────────┘ │ │ ┌───────────▼─────────┐ ┌───────────▼─────────────┐ - │ Agent (SUT) │ │ AgentAssert │ - │ │ │ (assertion library) │ + │ Agent (app under │ │ AgentAssert │ + │ test) │ │ (checks / asserts) │ │ 1. Send prompt to │ │ │ │ LLM (Anthropic / │ │ - toolWasInvoked() │ │ OpenAI / Ollama)│ │ - satisfiesContract() │ @@ -48,87 +54,131 @@ The deliberate use of Playwright (not Jest, not Vitest) as the test runner is it │ from model │ │ │ │ 3. Execute tool │ │ - traceFollowsSequence()│ │ 4. Send result back │ │ │ - │ 5. Capture TRACE │ └───────────┬─────────────┘ + │ 5. Build TRACE │ └───────────┬─────────────┘ └────────┬────────────┘ │ - │ ┌──────────▼──────────────┐ - ┌──────────▼───────────┐ │ HeuristicContractMatcher │ - │ ToolRegistry │ │ │ - │ │ │ Layer 1: Structure │ - │ file-reader → exec() │ │ Layer 2: Keywords (BoW) │ - │ api-caller → exec() │ │ Layer 3: Forbidden (regex)│ - └──────────────────────┘ │ Layer 4: Custom │ - │ │ - │ Returns: MatchResult │ - │ { confidence: 0.82 } │ - └─────────────────────────┘ + │ ┌────────────▼──────────────┐ + ┌──────────▼───────────┐ │ HeuristicContractMatcher │ + │ ToolRegistry │ │ │ + │ │ │ 1. Structure (fields) │ + │ file-reader → run() │ │ 2. Keywords (word checks) │ + │ api-caller → run() │ │ 3. Forbidden (patterns) │ + └──────────────────────┘ │ 4. 
Optional custom rule │ + │ │ + │ → MatchResult + score │ + └───────────────────────────┘ ``` ### Data Flow (Step by Step) -1. **Test calls `agent.run(prompt)`** — sends a natural language task -2. **Agent sends prompt to the configured LLM API** (Anthropic, OpenAI, or Ollama via OpenAI-compatible API) — along with tool definitions from ToolRegistry -3. **The model responds with tool calls** — Anthropic: `tool_use` blocks; OpenAI: `function` tool calls — e.g. "call file-reader with path X" -4. **Agent executes the tool** via ToolRegistry — gets back `ToolResult` -5. **Agent sends tool result back to the model** — the model may request more tools -6. **Loop continues** until the model produces a final text response -7. **Agent builds `AgentTrace`** — captures EVERY step (tool calls, tool results, reasoning, output) -8. **Test receives the trace** and passes it to AgentAssert methods -9. **AgentAssert uses HeuristicContractMatcher** to score output against BehaviorContracts -10. **MatchResult returned** with heuristic confidence and per-layer details +```mermaid +flowchart TB + subgraph T["1 Test"] + A["test calls agent.run(prompt)"] + end + + subgraph G["2 Agent loop"] + B["Send prompt plus tool defs from ToolRegistry to LLM API"] + C{"Model response?"} + D["Tool calls: Anthropic tool_use / OpenAI function calls e.g. file-reader path X"] + E["ToolRegistry.execute returns ToolResult"] + F["Send tool results back to model"] + H["Final text answer"] + end + + subgraph TR["3 Trace and assertions"] + I["Build AgentTrace: reasoning, tool_call, tool_result, output"] + J["Test passes trace to AgentAssert"] + K["AgentAssert plus HeuristicContractMatcher vs BehaviorContract"] + L["MatchResult: confidence and per-layer detail"] + end + + A --> B + B --> C + C -->|tools| D + D --> E + E --> F + F --> C + C -->|done| H + H --> I + I --> J + J --> K + K --> L +``` + + + +1. The test starts the agent with a **plain-language task** (`agent.run(prompt)`). +2. The agent sends that task to the **LLM** (Anthropic, OpenAI, or local Ollama) and includes the **list of tools** from the ToolRegistry. +3. The model may answer with **“please run tool X with these inputs”** (the exact format differs by vendor; see diagram above). +4. The agent runs the real tool code through the ToolRegistry and gets a **ToolResult** (success/failure + data). +5. That result goes **back to the model**; steps 3–5 may **repeat** until the model stops asking for tools. +6. The model eventually returns a **final text** answer (or stops). +7. The agent assembles an **`AgentTrace`**: a **full log** of reasoning, each tool call, each tool result, and final output. +8. The test sends that trace into **AgentAssert** helpers (pass/fail checks). +9. For text rules, **HeuristicContractMatcher** compares the output to a **BehaviorContract** (keywords, patterns, etc.). +10. You get a **`MatchResult`**: pass/fail style outcome, a **confidence** score, and **details** per check layer (useful when debugging failures). --- ## The Five Testing Patterns ### Pattern 1: Tool Invocation Assertion + **File:** `tests/behavioral/intent-routing.spec.ts` -**What it tests:** Did the agent select the correct tool for the given intent? +**What it tests:** For a user request like “read this log file,” did the agent actually call the **right tool** (e.g. file-reader) and not something else? -**Why it's unique:** Traditional tests check return values. Agent tests check *decisions*. 
The agent might return a plausible-looking summary even when it called the wrong tool (or no tool at all). +**Why it matters:** A bad answer can still *sound* fine. This checks the **decision** (which tool ran), not only the final text. **Key assertion:** + ```typescript const result = AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /.*\.log$/ }); AgentAssert.expectMatched(result, 'file-reader should be invoked'); // embeds AgentAssert.formatResult(result) on failure ``` **What to look at in the code:** -- `AgentAssert.toolWasInvoked()` — walks the trace looking for tool_call steps -- `paramMatchers` — regex patterns validated against tool input parameters -- Negative assertion — `AgentAssert.expectNotMatched(result, '...')` verifies a tool was NOT called (rich failure output via `formatResult`) + +- `AgentAssert.toolWasInvoked()` — finds **tool_call** steps in the trace +- `paramMatchers` — optional checks on **arguments** passed to the tool (e.g. file path) +- `AgentAssert.expectNotMatched()` — asserts a tool was **not** used (when the test expects a negative case) --- ### Pattern 2: Behavior Contract Validation + **File:** `tests/behavioral/output-contract.spec.ts` -**What it tests:** Does the output satisfy a **heuristic** contract (fields + keywords + patterns), not an exact string match? +**What it tests:** Does the final answer follow **agreed rules** (required fields, must-include words, forbidden phrases)—without requiring an **exact** copy-paste string match? -**Why it's unique:** `expect(output).toBe("...")` breaks on every LLM run. Contracts define cheap rules that often track “good enough” outputs. Synonymous phrasing can still fail if keywords don’t align — widen keywords, lower thresholds, add a **customValidator**, or upgrade to embeddings / LLM-judge (see limits section below). +**Why it matters:** LLM wording changes every run. Contracts check **flexible rules** instead of one frozen sentence. If wording drifts, add keywords, relax thresholds, or add a **customValidator** (see **Heuristic matching: scope and limits**). **Key assertion:** + ```typescript const result = AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION, 0.5); AgentAssert.expectMatched(result, 'SUMMARIZATION contract should pass'); ``` **What to look at in the code:** -- `BehaviorContract.ts` — pre-built contracts with required fields, keywords, forbidden patterns -- `HeuristicContractMatcher.evaluate()` — structure + keyword overlap + forbidden regex (+ optional custom) -- `minKeywordMatchRatio` — how much of the keyword list must appear as substrings -- `forbiddenPatterns` — regex matches force a contract failure path + +- `BehaviorContract.ts` — ready-made rule sets (fields, keywords, forbidden patterns) +- `HeuristicContractMatcher.evaluate()` — runs structure + keywords + forbidden patterns (+ optional custom rule) +- `minKeywordMatchRatio` — how many of the listed keywords must appear +- `forbiddenPatterns` — if the output matches these, the contract **fails** --- ### Pattern 3: Multi-Step Trace Verification + **File:** `tests/behavioral/tool-invocation.spec.ts` -**What it tests:** Did the agent follow a valid reasoning path through multiple tool calls? +**What it tests:** For workflows that need **more than one tool** (e.g. read file, then call API), did the agent follow a **sensible order** of steps? -**Why it's unique:** No equivalent in Selenium/Playwright browser testing. Browser tests check page state, not the application's intermediate reasoning steps. 
+**Why it matters:** UI tests usually see **screens**. Here you inspect the **sequence of tool calls** inside the trace—something classic UI checks do not cover. **Key assertion:** + ```typescript const result = AgentAssert.traceFollowsSequence(trace, [ { type: 'tool_call', toolName: 'file-reader' }, @@ -138,41 +188,47 @@ AgentAssert.expectMatched(result, 'trace should show file-reader then api-caller ``` **What to look at in the code:** -- `AgentAssert.traceFollowsSequence()` — checks steps appear in order (not necessarily consecutive) -- The trace captures reasoning steps between tool calls -- `toolCallCountInRange()` — sanity check on how many tools were called + +- `AgentAssert.traceFollowsSequence()` — checks that steps appear in the **right order** (they do not have to be back-to-back) +- The trace can include **reasoning** lines between tool calls +- `toolCallCountInRange()` — optional check that the agent did not call tools **too many** or **too few** times --- ### Pattern 4: Boundary/Scope Enforcement + **File:** `tests/boundary/hallucination-guard.spec.ts` -**What it tests:** Did the agent stay within its defined task boundaries? +**What it tests:** Did the agent **only use tools it is allowed to use** and avoid **out-of-scope** actions? -**Why it's unique:** LLMs hallucinate tool calls, fabricate data, and take unsolicited actions. Nobody tests for this systematically. +**Why it matters:** Models sometimes invent tool names or try extra actions. This pattern catches **scope creep** and **fake tool calls**. **Key assertion:** + ```typescript const result = AgentAssert.boundaryNotViolated(trace, ['file-reader']); AgentAssert.expectMatched(result, 'only file-reader should be used'); ``` **What to look at in the code:** -- `boundaryNotViolated()` — checks that ONLY allowed tools were called -- `createFileOnlyAgent()` — test factory that registers limited tools -- `HALLUCINATION_PROMPTS` — adversarial prompts designed to trigger out-of-scope behavior -- `SCOPE_BOUNDED` contract — forbidden patterns for scope-creep language + +- `boundaryNotViolated()` — fails if any **tool outside the allow-list** was called +- `createFileOnlyAgent()` — builds an agent with **only** certain tools registered (tight boundary) +- `HALLUCINATION_PROMPTS` — tricky prompts meant to provoke **wrong** or **extra** tool use +- `SCOPE_BOUNDED` contract — rules that flag **forbidden** wording when the task should stay narrow --- ### Pattern 5: Failure & Retry Observability + **File:** `tests/boundary/retry-behavior.spec.ts` -**What it tests:** Does the agent degrade gracefully when tools fail? +**What it tests:** When a tool **errors** (file missing, API down), does the agent **admit the failure** and avoid claiming success? Does it **retry** within reasonable limits? -**Why it's unique:** Most agent tests don't simulate tool failures at all. +**Why it matters:** Many demos only test the happy path. Here failures are **forced** so you can see honest vs misleading behavior. 
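+One way to force the error path (a sketch only; the real factory is `createFailingFileReaderAgent()` in `tests/fixtures/setup.ts`, and the exact `ToolResult` shape there may differ):
+
+```typescript
+// Illustrative stand-in: a file-reader that always fails instead of touching disk.
+// Keep the name and inputSchema the model sees; only the execute() result changes.
+const failingFileReader = {
+  name: 'file-reader',
+  description: 'Reads a file from disk (always fails in this test)',
+  inputSchema: { type: 'object', properties: { filePath: { type: 'string' } } },
+  execute: async () => ({
+    success: false,
+    error: 'ENOENT: file not found (forced by the test)',
+  }),
+};
+```
+
+Register that in place of the real tool and run the agent; the assertions below then check that the agent **admits** the failure instead of inventing file contents.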
**Key assertions (examples):** + ```typescript // Honest reporting: output mentions failure / contract passes expect(mentionsFailure, 'output should reflect tool failure').toBe(true); @@ -193,130 +249,144 @@ AgentAssert.expectMatched( ``` **What to look at in the code:** -- `createFailingFileReaderAgent()` — agent with tools wired to always fail -- `GRACEFUL_FAILURE` contract — requires error keywords, forbids success-claiming language -- `retryCount` in trace metadata — tracks how many tool results had `success: false` in the agent loop -- Cascade failure test — verifies downstream behavior when file-reader fails (see `textReflectsUpstreamFailure` / tool traces) + +- `createFailingFileReaderAgent()` — file-reader always **fails** on purpose (error-path testing) +- `GRACEFUL_FAILURE` contract — expects **honest** error language; forbids **false success** claims +- `retryCount` in trace metadata — counts tool results with **`success: false`** +- Cascade tests — what happens **after** the first tool fails (downstream steps) --- ## Key Files Explained ### framework/types.ts -Shared types for traces, contracts, and (in the demo) tool definitions. Assertions and `examples/agent/` both import from here so the SUT and matchers stay aligned. -- `AgentTrace` — what assertions operate on -- `TraceStep`, `AgentOutput`, `ContractDefinition`, `MatchResult`, `ToolDefinition`, … +**Shared shapes** for traces, contracts, and tools. Tests and the demo agent both use these types so “what we record” and “what we assert” stay in sync. + +- **`AgentTrace`** — the main object assertions read +- Other types: steps, output, contracts, tool definitions, match results ### examples/agent/agent.ts -**Demo system under test** — not part of the reusable assertion layer. The tool-calling loop is the reference pattern: -1. Send prompt + tool definitions to **Anthropic Messages API** or **OpenAI Chat Completions** (see `AgentConfig.provider`) -2. The model responds with text and/or tool calls (`tool_use` vs `function` / `tool_calls` depending on provider) -3. Execute requested tools -4. Send tool results back in the provider-specific message format -5. Repeat until the model gives a final answer -6. Build the AgentTrace from everything that happened +**Demo agent** (the application under test). It is **not** the reusable assertion library. Flow in plain terms: + +1. Send the user message and tool list to the LLM (**Anthropic** or **OpenAI-style** API, including Ollama). +2. Read the model reply: plain text and/or **requests to call tools**. +3. Run the requested tools; send results back to the model. +4. Repeat until the conversation finishes. +5. Package everything into an **`AgentTrace`**. -**Provider selection:** Pass `provider: 'anthropic' | 'openai' | 'ollama'`, or set `LLM_PROVIDER` in the environment. Keys: `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` (or pass `apiKey`). For **`LLM_PROVIDER=ollama`**, the OpenAI client uses `OPENAI_BASE_URL`, then `OLLAMA_BASE_URL`, then defaults to `http://127.0.0.1:11434/v1`. For **`LLM_PROVIDER=openai`**, only **`OPENAI_BASE_URL`** is used for a custom base (Azure, proxy, etc.); **`OLLAMA_BASE_URL` is ignored** so a stale shell variable cannot send OpenAI traffic to Ollama by accident. +**How to pick a provider:** Set `LLM_PROVIDER` (or `provider` in code) to `anthropic`, `openai`, or `ollama`. Use **`ANTHROPIC_API_KEY`** / **`OPENAI_API_KEY`** as needed. For Ollama, the client uses **`OPENAI_BASE_URL`**, then **`OLLAMA_BASE_URL`**, then **`http://127.0.0.1:11434/v1`**. 
For **`LLM_PROVIDER=openai`**, only **`OPENAI_BASE_URL`** sets a custom base (Azure, proxy, etc.); **`OLLAMA_BASE_URL`** is not used so you do not accidentally send OpenAI traffic to Ollama. -**Important:** The system prompt in this file shapes agent behavior. If you change it, update the test contracts to match. +**If you edit the system prompt here**, update test expectations and contracts so they still match what you want the agent to do. ### examples/agent/tools/file-reader.ts and api-caller.ts -Demo tools use MCP-aligned JSON schemas and register through **`ToolRegistry`**. In this POC they run locally (file-reader reads from disk, api-caller uses mock responses). To connect them to a real MCP server, replace the `execute` function with MCP transport calls — the schema stays the same. -**Security note:** `file-reader.ts` includes path traversal protection. Read the comments. +**Sample tools:** JSON schema + **run** function. In the demo they run **locally** (real disk read; API tool uses **mock** responses). To use a real MCP server later, you would replace the **run** logic—not necessarily the schema. + +**Security:** `file-reader.ts` blocks path tricks; read the file comments before changing paths. ### examples/agent/tools/registry.ts -Maps tool names to definitions. Provides `toAnthropicTools()` and `toOpenAITools()` so the same tool definitions work with either API. This is the bridge between your tool definitions and the LLM. + +**Name → tool implementation.** Also turns the same definitions into the **Anthropic** or **OpenAI** tool format the API expects. ### framework/HeuristicContractMatcher.ts -**Heuristic evaluation (not deep semantics).** Layers: -1. **Structural** (40% weight) — required fields (and optional length) -2. **Keywords** (35% weight, or 25% + 10% custom when `customValidator` is set) — bag-of-words style: substring presence for each listed keyword -3. **Forbidden** (25% weight) — regex patterns; any hit triggers the contract-failure path -4. **Custom** (optional) — your own validator in the contract +**Rule-based scoring (not “AI understands meaning”).** Rough weights: + +1. **Structure** (~40%) — required fields (and optional length checks) +2. **Keywords** (~35%, adjusted if you add a custom rule) — listed words must appear as **substrings** (after lowercasing) +3. **Forbidden** (~25%) — if a **regex** matches, the contract fails that path +4. **Custom** (optional) — your own function for extra checks + +The **`confidence`** number is a **blend of those checks**—for **pass/fail thresholds** and tuning, not a true “semantic similarity” score. -The headline `confidence` is a **weighted average of those scores** — useful for ranking and thresholds, not as a semantic similarity score. 
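+To make the blend concrete, here is an illustrative calculation (numbers only; the real logic lives in `framework/HeuristicContractMatcher.ts` and may differ in detail):
+
+```typescript
+// Sketch: blend per-layer scores (each 0..1) into one confidence number.
+// A forbidden-pattern hit is expected to fail the contract outright,
+// no matter how high the blended number is.
+const layers = { structure: 1.0, keywords: 0.6, forbidden: 1.0 }; // example scores
+const confidence =
+  0.40 * layers.structure + // all required fields present
+  0.35 * layers.keywords +  // 60% of listed keywords found
+  0.25 * layers.forbidden;  // no forbidden pattern matched
+// confidence ≈ 0.86; compare against your assertion threshold (e.g. 0.5)
+```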
+**Tuning (for less flaky runs):** -**Tuning knobs:** -- `minKeywordMatchRatio` — lower = more lenient keyword layer -- Assertion threshold on `result.confidence` — lower = fewer flaky tests -- `forbiddenPatterns` — stricter guardrails (regex can be brittle; test them) -- **Synonyms** — add alternate phrasings to `requiredIntentKeywords`, or use `customValidator` / external judges (below) +- Lower **`minKeywordMatchRatio`** — easier keyword pass +- Lower the **assertion threshold** on `confidence` — fewer false failures +- Tighten **`forbiddenPatterns`** — stricter “must not say” rules (test regexes carefully) +- Add **synonyms** to keyword lists or use **`customValidator`** when wording varies a lot ### Heuristic matching: scope and limits -| Approach | What this repo does | What would be “more semantic” | -|----------|---------------------|--------------------------------| -| Keyword list | Substring checks after lowercasing | LLM-as-judge, entailment models | -| Confidence | Weighted heuristic blend | Calibrated metrics or judge scores | -| Same meaning, different words | Can **fail** unless keywords or patterns cover both | Embeddings vs reference texts, synonym lists | -**Possible upgrades (not implemented here):** call a second model to grade outputs against the contract; embed output and reference snippets and compare cosine similarity; use an NLP library for paraphrase / NLI. Those add latency, cost, and complexity — the heuristic matcher stays intentionally cheap and explicit. +| Topic | This repo | “Stronger” approaches (not built in) | +| -------------------------- | ---------------------------------------------------- | --------------------------------------------- | +| Wording | Keyword **substring** checks | Second model as judge, specialized NLP models | +| Score | Weighted **rules** | Calibrated scores from another system | +| Same idea, different words | Can **fail** if keywords do not cover both phrasings | Embeddings, synonym lists, judges | + + +**Not included here on purpose:** a second LLM to grade answers, vector similarity, or heavy NLP—those add **cost**, **latency**, and **ops**. This matcher is meant to stay **simple and inspectable**. ### framework/BehaviorContract.ts -Pre-built contracts for common task types. Each contract defines what "correct" means for that task type. The five contracts: SUMMARIZATION, API_ACTION, MULTI_STEP, SCOPE_BOUNDED, GRACEFUL_FAILURE. + +Ready-made **named rule sets** for typical tasks: **SUMMARIZATION**, **API_ACTION**, **MULTI_STEP**, **SCOPE_BOUNDED**, **GRACEFUL_FAILURE**. ### framework/AgentAssert.ts -The public API. Core assertions (each returns a **`MatchResult`**): -1. `toolWasInvoked(trace, toolName, paramMatchers?)` — was a tool called? -2. `satisfiesContract(output, contract, minConfidence?)` — does output meet the contract? -3. `boundaryNotViolated(trace, allowedTools)` — only allowed tools used? -4. `traceFollowsSequence(trace, expectedSequence)` — correct execution order? -5. `toolCallCountInRange(trace, min, max)` — reasonable number of tool calls? +**Main API** for tests (each method returns a **`MatchResult`**): + +1. `toolWasInvoked(...)` — was this tool used (and optional argument checks)? +2. `satisfiesContract(...)` — does the final output follow the contract? +3. `boundaryNotViolated(...)` — only tools from this allow-list? +4. `traceFollowsSequence(...)` — did steps happen in this order? +5. `toolCallCountInRange(...)` — tool call count within min/max? 
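+Put together in a spec, the methods above look roughly like this (a sketch: the import paths follow this repo's layout, `createTestAgent` comes from the demo fixtures, and the prompt and log file name are made up):
+
+```typescript
+import { test } from '@playwright/test';
+import { AgentAssert } from '../framework/AgentAssert.js';
+import { BehaviorContract } from '../framework/BehaviorContract.js';
+import { createTestAgent } from './fixtures/setup.js';
+
+test('reads the log with the right tool and an honest summary', async () => {
+  const agent = createTestAgent();
+  const trace = await agent.run('Summarize app.log'); // hypothetical fixture file
+
+  AgentAssert.expectMatched(
+    AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /\.log$/ }),
+    'file-reader should be invoked');
+  AgentAssert.expectMatched(
+    AgentAssert.boundaryNotViolated(trace, ['file-reader']),
+    'only file-reader should be used');
+  AgentAssert.expectMatched(
+    AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION, 0.5),
+    'output should satisfy the SUMMARIZATION contract');
+});
+```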
Helpers: -- `formatResult(result, debugHint?)` — pretty-print a `MatchResult` for logs or messages -- `expectMatched(result, context)` / `expectNotMatched(result, context)` — thin wrappers around Playwright’s `expect` so failures include `formatResult` automatically +- `formatResult(...)` — readable dump for logs or failure messages +- `expectMatched` / `expectNotMatched` — Playwright-friendly helpers that attach **`formatResult`** on failure --- ### tests/env-llm.ts and `.env` -Playwright loads **`tests/env-llm.ts`** from **`playwright.config.ts`** (`applyLlmVarsFromDotEnv()`). Selected keys from a project-root **`.env`** file are merged into `process.env` (`.env` wins over existing shell vars for those keys): `LLM_PROVIDER`, `LLM_MODEL`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `OLLAMA_API_KEY`, `OLLAMA_BASE_URL`. Copy **`.env.example`** to **`.env`** and fill in keys so tests and IDE runs see the same configuration without exporting variables manually. +**Local runs:** Copy **`.env.example`** to **`.env`** and set keys (provider, model, API keys, Ollama URL). Playwright loads these via **`tests/env-llm.ts`** so tests see the same settings as your IDE. + +**GitHub Actions:** CI reads the same kind of values from repo **Settings → Secrets and variables → Actions** (no code edit needed). Workflow file: **`.github/workflows/ci.yml`**. + -**GitHub Actions CI** (`.github/workflows/ci.yml`) installs Ollama, pulls the configured model, and runs the smoke test. **Provider, model, and optional base URL** are read from repository **Variables** or **Secrets** (no code change): +| Variable / secret | What it’s for | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `LLM_PROVIDER` | Usually `ollama` in CI | +| `LLM_MODEL` | e.g. `llama3.2:3b` (local) or a `*:cloud` model | +| `OLLAMA_BASE_URL` | Optional; default `http://127.0.0.1:11434/v1` | +| `OLLAMA_API_KEY` | **Secret** — needed only for **Ollama Cloud** models (`*:cloud`). Without it, you may see **HTTP 500**. [Keys](https://ollama.com/settings/keys) | -| Name | Purpose | -|------|--------| -| `LLM_PROVIDER` | e.g. `ollama` | -| `LLM_MODEL` | e.g. `llama3.2:3b` or `deepseek-v3.2:cloud` | -| `OLLAMA_BASE_URL` | Optional; default `http://127.0.0.1:11434/v1` | -| `OLLAMA_API_KEY` | **Repository secret** — required when `LLM_MODEL` is an Ollama **Cloud** tag (`*:cloud`). Without it, Ollama returns **HTTP 500**. [Create a key](https://ollama.com/settings/keys). | -**Best practice:** put **non-sensitive** values under **Variables**. Use **Secrets** for **`OLLAMA_API_KEY`** and other keys. Storing `LLM_MODEL` as a *secret* works but **masks** it in logs—prefer **Variables** for model name unless needed. If neither Variable nor Secret is set for provider/model, CI defaults to **`ollama`** + **`llama3.2:3b`** (local, no key; stronger tool-calling than `1b`). +**Tip:** Put **non-secret** values in **Variables**; put **keys** in **Secrets**. If provider/model are unset in CI, defaults are **`ollama`** + **`llama3.2:3b`**. -**Run CI on a chosen branch:** **Actions** → **CI** → **Run workflow**. GitHub shows **Use workflow from** — that dropdown is the branch selector for manual runs (the workflow file and checkout use that branch unless you override). Optionally fill **Checkout ref** in the form to checkout a different branch or full ref (e.g. `refs/heads/feature/x`). +**Manual CI run:** **Actions → CI → Run workflow**. 
Use **Use workflow from** to pick the branch. Optional **Checkout ref** overrides which git ref is checked out. -**Local vs Cloud tags:** tags like `llama3.2:3b` run fully locally. Tags ending in **`:cloud`** need **`OLLAMA_API_KEY`** in Secrets. Override **`LLM_MODEL`** in Variables if you want another local model (e.g. newer builds when Ollama adds them). +**Local vs Cloud model names:** Names like `llama3.2:3b` run on the runner. Names ending in **`*:cloud`** need **`OLLAMA_API_KEY`**. --- ### playwright.config.ts (high level) -- **`retries: 1`** — each failed test runs one more time (LLM outputs vary) -- **Timeouts:** default **45s**; **`behavioral`** project **120s** locally, **300s** when **`CI=true`** (e.g. GitHub Actions + Ollama on CPU); **`boundary`** **45s** -- **`trace: 'off'`** — browser-style Playwright traces are disabled (this suite does not use a browser). Failures still get rich attachments from **`registerAgentTraceForDiagnostics`** in `tests/fixtures/setup.ts` (see below) -- **`workers: 3`** — tune for your API rate limits -- **HTML report `title`** — includes resolved LLM provider and model for quick scanning +- **`retries: 1`** — one retry per failed test (helps with flaky LLM output) +- **Timeouts:** **45s** default; **behavioral** tests **120s** locally, **300s** in CI (`CI=true`) because Ollama on CPU can be slow; **boundary** stays **45s** +- **`trace: 'off'`** — no browser trace zip (these tests are not UI tests). Failures still get **attachments** from **`registerAgentTraceForDiagnostics`** +- **`workers: 3`** — parallel runs (watch API rate limits) +- **Report title** — shows which provider/model ran ### tests/fixtures/setup.ts -Shared test factories (`createTestAgent`, `createFailingFileReaderAgent`, `createFileOnlyAgent`, etc.), fixture file paths (`FIXTURE_DIR` / `test-fixtures/`), and **`registerAgentTraceForDiagnostics`** — wires trace attachments into the Playwright report (see **Failure diagnostics** above). +**Test helpers:** builds agents (`createTestAgent`, `createFailingFileReaderAgent`, …), test file paths, and **`registerAgentTraceForDiagnostics`** so traces show up in the HTML report when something fails. --- ## Setup & Running ### Prerequisites -- Node.js 18+ -- An API key for cloud use, **or** a local [Ollama](https://ollama.com/) (or other OpenAI-compatible) server — see **Local Ollama** below + +- **Node.js 18+** +- Either **API keys** for a cloud LLM **or** a local **[Ollama](https://ollama.com/)** install (free local models; see below) ### Install + ```bash cd agent-assert # or your clone folder name npm install @@ -325,9 +395,9 @@ npx playwright install # Playwright still expects browser binaries to be presen ### Configure API keys -Prefer a **`.env`** file at the repo root (see **`.env.example`**). The same variables work if you `export` them in the shell. +Use a **`.env`** file at the repo root (see **`.env.example`**), or `export` the same names in your terminal. -**Anthropic (default)** — set the key and optionally pin the provider (default is `anthropic` if `LLM_PROVIDER` is unset): +**Anthropic (default if unset)** — set the key; optionally set `LLM_PROVIDER=anthropic`: ```bash export ANTHROPIC_API_KEY="sk-ant-..." @@ -335,7 +405,7 @@ export ANTHROPIC_API_KEY="sk-ant-..." export LLM_PROVIDER="anthropic" ``` -**OpenAI** — set the OpenAI key and select the provider. If you omit `LLM_MODEL`, the agent defaults to `gpt-4o`: +**OpenAI** — set `OPENAI_API_KEY` and `LLM_PROVIDER=openai`. 
Default model is `gpt-4o` if you omit `LLM_MODEL`:

```bash
export OPENAI_API_KEY="sk-..."
@@ -344,7 +414,7 @@ export LLM_PROVIDER="openai"
export LLM_MODEL="gpt-4o-mini"
```

-**Local Ollama** — no paid API key required; uses the OpenAI-compatible endpoint (`http://127.0.0.1:11434/v1` by default). Start Ollama, pull a model (e.g. `ollama pull qwen3.5`), then:
+**Local Ollama** — no cloud API key. Start Ollama, `ollama pull <model>`, then:

```bash
export LLM_PROVIDER="ollama"
@@ -354,25 +424,27 @@ export LLM_MODEL="qwen3.5:latest"
npx playwright test
```

-Alternatively, keep `LLM_PROVIDER=openai` and point at Ollama with **`OPENAI_BASE_URL`** only (do not rely on `OLLAMA_BASE_URL` for this provider).
+Or use `LLM_PROVIDER=openai` with **`OPENAI_BASE_URL`** pointing at Ollama (do not depend on **`OLLAMA_BASE_URL`** for that mode).

-You can also pass `provider`, `apiKey`, `baseURL`, and `model` when constructing `Agent` in code instead of using environment variables.
+You can also set `provider`, `apiKey`, `baseURL`, and `model` in code when creating the `Agent`.

### Failure diagnostics (HTML report)

-Tests that call **`registerAgentTraceForDiagnostics(testInfo, trace)`** attach:
+When tests use **`registerAgentTraceForDiagnostics`**, the report can include:

-- **`agent-run-summary.txt`** — every run (provider, model, tool order, output preview)
-- On failure: **`agent-diagnostics.txt`** (full trace dump) and **`playwright-failure.txt`** (Playwright error context)
+- **`agent-run-summary.txt`** — every run: provider, model, tools used, output snippet
+- On failure: **`agent-diagnostics.txt`** (full trace) and **`playwright-failure.txt`**

-Open the **Attachments** tab in the HTML report for the failing test.
+Open the failing test → **Attachments** in the Playwright HTML report.

### Run All Tests
+
```bash
npx playwright test
```

### Run Specific Pattern
+
```bash
npx playwright test tests/behavioral/intent-routing.spec.ts
npx playwright test tests/boundary/
@@ -381,6 +453,7 @@ npx playwright test --project=boundary
```

### View HTML Report
+
```bash
npx playwright show-report
```

@@ -390,97 +463,99 @@ npx playwright show-report

## How to Extend

### Add a New Tool
-1. Create `examples/agent/tools/your-tool.ts` following the same factory pattern as `file-reader.ts`
-2. Register it in the ToolRegistry in your test setup
-3. Add mock responses in `setup.ts`
-4. Write tests using `AgentAssert.toolWasInvoked(trace, 'your-tool')`
+
+1. Add `examples/agent/tools/your-tool.ts` (same style as `file-reader.ts`)
+2. **Register** it on the ToolRegistry in test setup
+3. Add any **mock data** in `setup.ts` if needed
+4. Write tests with `AgentAssert.toolWasInvoked(trace, 'your-tool')`

### Add a New Contract
-1. Add a new static property to `BehaviorContract.ts`
-2. Define requiredFields, requiredIntentKeywords, forbiddenPatterns
-3. Set minKeywordMatchRatio (start with 0.2, tune from there)
-4. Add a customValidator if you need logic beyond keywords/patterns
-5. Write a test that asserts against your new contract
+
+1. Add a new entry in `BehaviorContract.ts`
+2. Set **required fields**, **keywords**, **forbidden patterns**
+3. Set **`minKeywordMatchRatio`** (try ~0.2, then tune)
+4. Optional: **`customValidator`** for rules keywords cannot express
+5. Add a test that calls `satisfiesContract` with your contract

### Add a New Assertion Method
-1. Add a static method to `AgentAssert.ts`
-2. Accept `AgentTrace` or `AgentOutput` as input
-3. Return `MatchResult`
-4. 
Use `HeuristicContractMatcher` methods internally if needed -5. Include detailed reasons in the `details` array - -### Adapt for Another LLM Provider (beyond Anthropic, OpenAI, and Ollama) -1. Add a branch in `examples/agent/agent.ts` alongside the existing Anthropic and OpenAI-compatible loops -2. Add a `toYourProviderTools()` (or equivalent) on `ToolRegistry` if the tool schema differs -3. Map that provider’s tool-call and tool-result messages into the same `TraceStep` shapes the framework already expects -4. The framework layer (AgentAssert, HeuristicContractMatcher, BehaviorContract) stays UNCHANGED — it operates on `AgentTrace`, which is provider-agnostic + +1. Add a static method on `AgentAssert.ts` +2. Input: **`AgentTrace`** or **`AgentOutput`** +3. Return: **`MatchResult`** +4. Reuse `HeuristicContractMatcher` where it fits +5. Put clear **reason strings** in `details` for failures + +### Another LLM Provider (not Anthropic / OpenAI / Ollama) + +1. Add a new code path in `examples/agent/agent.ts` for that API +2. If tool JSON differs, add a converter (like `toOpenAITools`) on `ToolRegistry` +3. Map responses into the same **`TraceStep`** shapes as today +4. **`framework/`** can stay the same—it only reads **`AgentTrace`** ### Connect to a Real MCP Server -1. Replace the `execute` function in your tool with MCP client calls -2. Use `@modelcontextprotocol/sdk` for the transport layer -3. Keep the same `inputSchema` — MCP and Anthropic tool schemas are aligned by design -4. Update mock configurations in test setup to toggle between mock and live modes + +1. In each tool, replace local `execute` with **MCP client** calls +2. Optional: `@modelcontextprotocol/sdk` for transport +3. Keep **`inputSchema`** stable where possible +4. In tests, switch **mock vs live** in `setup.ts` --- ## Why Playwright (Not Jest or Vitest) -| Feature | Jest/Vitest | Playwright | -|---------|------------|------------| -| Parallel isolation | Shared process | Separate worker processes | -| Built-in retries | Manual config | `retries: 1` in this project’s config | -| HTML reports | Needs plugin | Built in | -| Timeout granularity | Per-suite | Per-test, per-suite, global | -| Trace capture | None | This suite uses `trace: 'off'` + custom attachments on failure | -| Future browser testing | Separate framework | Same framework | -| Fixture system | beforeEach | test.extend() with typed fixtures | -The deliberate choice of Playwright for non-browser testing is itself a publishable insight for the article. +| | Jest / Vitest | Playwright (this repo) | +| ------------------------- | ------------------ | ----------------------------------------------- | +| Isolation | Often one process | **Workers** (separate processes) | +| Retries | Extra setup | **`retries: 1`** built in | +| HTML report | Add-on | **Built in** | +| Timeouts | Mostly suite-level | **Per test** + suite + global | +| This project | — | Custom **attachments** on failure (agent trace) | +| Later: real browser tests | Another tool | **Same** Playwright | + + +Using Playwright without a browser is intentional: **reports, retries, and timeouts** fit long, flaky LLM runs well. --- ## Cost Awareness -Each test run calls a real LLM API. Costs depend on provider and model. +Cloud LLM calls **cost money** per run. **Local Ollama** costs $0 but uses CPU/time. + +**Rough idea (Anthropic, example model):** small tests ~**$0.01–0.03** per run; multi-tool tests a bit more. 
**Full suite** scales with how many tests run and whether **retries** fire (up to about **2×** if every test retries once). -**Anthropic** — with `claude-sonnet-4-20250514` (order-of-magnitude; varies by prompt length): -- Simple tests (1 tool call): ~$0.01-0.03 per run -- Multi-step tests (2+ tool calls): ~$0.03-0.08 per run -- Full suite (**25** tests × 1 run): scale the above by test mix -- Full suite with Playwright retries (**25** tests × up to **2** attempts each when flaky): up to ~2× the single-run cost +**OpenAI:** See [OpenAI pricing](https://openai.com/pricing) for your model. -**OpenAI** — pricing follows [OpenAI’s current rates](https://openai.com/pricing) for the model you set (e.g. `gpt-4o`, `gpt-4o-mini`). +**Save money while developing:** -**To reduce costs during development:** -- Use a smaller/cheaper model (`claude-haiku`, `gpt-4o-mini`, etc.) via `AgentConfig.model` or `LLM_MODEL` -- Run individual test files, not the full suite -- Keep `maxToolRounds` low (default 10 is already conservative) +- Use **cheaper / smaller** models (`LLM_MODEL`, or `Agent` config) +- Run **one file** or one test, not the whole suite +- **`maxToolRounds`** is already capped (default 10) --- ## Troubleshooting -**Tests timeout (behavioral tests: 120s local, 300s on CI):** -LLM APIs and local Ollama on CPU can be slow (especially in GitHub Actions — first inference after `ollama pull` can take minutes). The **`behavioral`** project uses a longer cap when **`CI=true`**. You can raise `behavioralTimeoutMs` in `playwright.config.ts` if needed. For Ollama in CI, ensure the model is pulled before tests and the runner has enough RAM. +**Timeouts (behavioral tests — 120s local, 300s on CI):** +Cloud APIs and **Ollama on CPU** (especially in GitHub Actions) can be **slow**. CI already uses a **longer** timeout. If needed, increase **`behavioralTimeoutMs`** in `playwright.config.ts`. In CI, confirm **`ollama pull`** finished and the runner has **enough RAM**. + +**Flaky pass/fail:** +Normal for LLMs. Try: relax **`minKeywordMatchRatio`**, add **keywords**, lower the **confidence** threshold, rely on **`retries: 1`**. -**Tests are flaky (pass sometimes, fail sometimes):** -This is expected with LLM testing. Three strategies: -1. Lower `minKeywordMatchRatio` in the contract -2. Add more keywords to the contract -3. Lower the confidence threshold in the assertion -4. Playwright’s `retries: 1` handles transient variation +**401 / wrong host:** +Match **`LLM_PROVIDER`** to the key you use. For **`openai`**, custom bases use **`OPENAI_BASE_URL`** (not `OLLAMA_BASE_URL`). For **local Ollama**, use **`LLM_PROVIDER=ollama`** and the Ollama `/v1` URL. -**Wrong provider or API URL (401 / unexpected host):** -Confirm `LLM_PROVIDER` matches the key you set. For `openai`, set **`OPENAI_BASE_URL`** for a custom endpoint; **`OLLAMA_BASE_URL` is not read** for that provider. Use **`LLM_PROVIDER=ollama`** with Ollama’s `/v1` base if you intend local Ollama. +**Ollama 500 with `*:cloud` models:** +Set **`OLLAMA_API_KEY`**, or use a **local** model name without `:cloud`. -**Ollama `500` / `internal service error` with `*:cloud` models:** -Cloud-tagged models may require an Ollama Cloud API key; use a **local** model tag for the same behavior without keys, or set optional **`OLLAMA_API_KEY`**. +**Ollama 400 — “does not support tools”:** +Pick a model that supports **tool calling** (see [Ollama models with Tools](https://ollama.com/search?c=tools)). Reasoning-only models (e.g. some **`deepseek-r1`** tags) may fail here. 
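+A quick pre-flight check (assuming a reasonably recent Ollama CLI; output format varies by version):
+
+```bash
+# Show model metadata; newer Ollama builds list capabilities such as "tools"
+ollama show llama3.2:3b
+```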
-**Agent output is not JSON:** -The system prompt tells Claude to respond in JSON, but it sometimes wraps it in markdown fences. The `parseOutput()` method in `agent.ts` handles this. If you see `taskType: "unknown"`, the JSON parsing failed entirely — check the raw text in the trace. +**Output not JSON / `taskType: "unknown"`:** +The demo expects JSON; the parser in **`examples/agent/agent.ts`** strips common markdown fences. If it still fails, inspect **raw text** in the trace attachment. -**File-reader returns "access denied":** -Path traversal protection. The file path must be within the configured `basePath`. Check that `FIXTURE_DIR` resolves correctly. +**File-reader "access denied":** +Paths must stay under the allowed folder (**path safety**). Check **`FIXTURE_DIR`** and paths in the test. -**"Tool X is not registered" error in trace:** -The agent tried to call a tool that wasn't in the ToolRegistry. This is actually a valid test finding — it means the agent hallucinated a tool call. Check the `boundaryNotViolated` assertion. +**"Tool X is not registered":** +The model asked for a tool that **was not registered**—often a **hallucinated** tool name. Relevant checks: **`boundaryNotViolated`**, allow-lists in setup. \ No newline at end of file