From c4bc91bac3748be57b3a54878dd1f6475bb4e89a Mon Sep 17 00:00:00 2001 From: Biresh Patel Date: Sun, 19 Apr 2026 23:57:23 -0500 Subject: [PATCH] Revise README to clarify the purpose and structure of the project, emphasizing automated testing for AI agents that call tools. Update sections on repository layout, testing patterns, and behavior contracts for improved understanding. Enhance descriptions of the demo agent and tool registry functionalities. --- README.md | 479 +++++++++++++++++++++++++++++++----------------------- 1 file changed, 277 insertions(+), 202 deletions(-) diff --git a/README.md b/README.md index e984bb7..8f07287 100644 --- a/README.md +++ b/README.md @@ -1,46 +1,52 @@ # AgentAssert -**Proof-of-concept:** Playwright tests + reusable assertion helpers for tool-calling LLM agents (MCP-shaped tool schemas, in-process tools in the demo). +**What this is:** Automated tests for an **AI agent** that can **call tools** (for example: “read this file,” “call this API”). The tests check **what the agent decided to do** (which tools it picked, in what order, and what it said at the end)—not just the final text. -This repository is **`"private": true`** in `package.json` — it is **not** published to npm and is **not** a productized “framework.” It is a **reference implementation** you can clone, read, and copy from. A clean entry point exists for imports (`index.ts` → `framework/`), but publishing a real package would add build steps, versioning, and semver guarantees — out of scope for this POC. +**For QA:** Think of it like testing **business rules on decisions**, not only checking a screen or a single return value. Failures attach **traces** (a step-by-step log) so you can see prompts, tool calls, and outputs in the Playwright HTML report. + +This repo is a **sample / reference project** (`"private": true` in `package.json`—not published as a product). You can copy patterns from it into your own codebase. --- ## Repository layout -| Path | Role | -|------|------| -| **`framework/`** | Reusable assertions (`AgentAssert`, `HeuristicContractMatcher`, `BehaviorContract`) and **shared types** (`framework/types.ts` — traces, contracts, tool shapes). | -| **`index.ts`** | Barrel export so you can `import { AgentAssert, … } from 'agent-assert'` when vendoring this repo. | -| **`examples/agent/`** | **Demo system under test:** LLM loop, `ToolRegistry`, file-reader / api-caller tools. Replace with your own agent; keep compatible `AgentTrace` / `AgentOutput` shapes if you reuse the assertions. | -| **`tests/`** | Playwright specs and fixtures wired to the demo agent via `tests/fixtures/setup.ts`. | + +| Folder / file | In plain terms | +|---------------|----------------| +| **`framework/`** | **Checks you reuse:** “Was tool X used?” “Does the answer follow our rules?” Helpers include `AgentAssert`, `HeuristicContractMatcher`, and shared types for traces and contracts. | +| **`index.ts`** | Single entry point for importing the framework from this repo when you copy it in. | +| **`examples/agent/`** | **Demo agent** (the app under test): chat loop, tool list, sample tools (`file-reader`, `api-caller`). Swap this for your real agent; keep the same trace shape if you keep these tests. | +| **`tests/`** | Playwright test files plus **`tests/fixtures/setup.ts`**, which builds the demo agent for each scenario. 
| + --- ## About -A working proof-of-concept that demonstrates five testing patterns for AI agents that call tools through a **`ToolRegistry`** (the same tool definitions map cleanly to Anthropic/OpenAI tool formats and to MCP-style schemas). The repo does not run a live MCP server by default; tools execute in-process so tests stay fast and deterministic. +**What gets tested:** Five patterns that cover **tool choice**, **output rules**, **order of steps**, **staying inside allowed tools**, and **behavior when something fails**. + +**Tool list:** Tools are registered in a **`ToolRegistry`** (a named list the model sees and that your code runs when the model asks). There is **no separate MCP server** in the default demo—tools run **inside the test process** so runs are **fast and stable**. -**Behavior contracts** are checked by **`HeuristicContractMatcher`**: required fields, **case-insensitive keyword overlap**, **regex** forbidden phrases, and optional custom validators — *not* embedding similarity or LLM-as-judge semantics. The weighted **`confidence`** score is a tuning signal, not a calibrated measure of meaning. That avoids pretending the cheap matcher is “semantic” while still letting you reject exact-string tests for variable LLM wording. See **Heuristic matching: scope and limits** below. +**“Behavior contracts”:** Rules like “must include these fields,” “should mention these words,” “must not match these patterns.” The matcher is **rule-based** (keywords and patterns)—**not** a second AI judging the first, and **not** “true” semantic understanding. The **`confidence`** score is a **rough pass/fail helper** for tuning, not a scientific accuracy percentage. More detail: **Heuristic matching: scope and limits** below. -The deliberate use of Playwright (not Jest, not Vitest) as the test runner is itself a publishable insight. +**Why Playwright:** Same tool as browser E2E tests, but here it drives **API + agent** flows. You get **retries**, **timeouts per test**, and **HTML reports** with attachments (good for flaky LLM output). --- ## Architecture: How the Pieces Connect ``` -┌─────────────────────────────────────────────────────────┐ -│ YOUR TEST FILE │ +┌───────────────────────────────────────────────────────────────────────────────────┐ +│ YOUR TEST FILE │ │ import { AgentAssert } from 'agent-assert' // or ../../framework/AgentAssert.js │ -│ const trace = await agent.run("some prompt") │ -│ AgentAssert.toolWasInvoked(trace, 'file-reader') │ -│ AgentAssert.satisfiesContract(trace.output, CONTRACT) │ -└─────────────────┬──────────────────────────┬────────────┘ +│ const trace = await agent.run("some prompt") │ +│ AgentAssert.toolWasInvoked(trace, 'file-reader') │ +│ AgentAssert.satisfiesContract(trace.output, CONTRACT) │ +└─────────────────┬──────────────────────────┬──────────────────────────────────────┘ │ │ ┌───────────▼─────────┐ ┌───────────▼─────────────┐ - │ Agent (SUT) │ │ AgentAssert │ - │ │ │ (assertion library) │ + │ Agent (app under │ │ AgentAssert │ + │ test) │ │ (checks / asserts) │ │ 1. Send prompt to │ │ │ │ LLM (Anthropic / │ │ - toolWasInvoked() │ │ OpenAI / Ollama)│ │ - satisfiesContract() │ @@ -48,87 +54,131 @@ The deliberate use of Playwright (not Jest, not Vitest) as the test runner is it │ from model │ │ │ │ 3. Execute tool │ │ - traceFollowsSequence()│ │ 4. Send result back │ │ │ - │ 5. Capture TRACE │ └───────────┬─────────────┘ + │ 5. 
Build TRACE │ └───────────┬─────────────┘ └────────┬────────────┘ │ - │ ┌──────────▼──────────────┐ - ┌──────────▼───────────┐ │ HeuristicContractMatcher │ - │ ToolRegistry │ │ │ - │ │ │ Layer 1: Structure │ - │ file-reader → exec() │ │ Layer 2: Keywords (BoW) │ - │ api-caller → exec() │ │ Layer 3: Forbidden (regex)│ - └──────────────────────┘ │ Layer 4: Custom │ - │ │ - │ Returns: MatchResult │ - │ { confidence: 0.82 } │ - └─────────────────────────┘ + │ ┌────────────▼──────────────┐ + ┌──────────▼───────────┐ │ HeuristicContractMatcher │ + │ ToolRegistry │ │ │ + │ │ │ 1. Structure (fields) │ + │ file-reader → run() │ │ 2. Keywords (word checks) │ + │ api-caller → run() │ │ 3. Forbidden (patterns) │ + └──────────────────────┘ │ 4. Optional custom rule │ + │ │ + │ → MatchResult + score │ + └───────────────────────────┘ ``` ### Data Flow (Step by Step) -1. **Test calls `agent.run(prompt)`** — sends a natural language task -2. **Agent sends prompt to the configured LLM API** (Anthropic, OpenAI, or Ollama via OpenAI-compatible API) — along with tool definitions from ToolRegistry -3. **The model responds with tool calls** — Anthropic: `tool_use` blocks; OpenAI: `function` tool calls — e.g. "call file-reader with path X" -4. **Agent executes the tool** via ToolRegistry — gets back `ToolResult` -5. **Agent sends tool result back to the model** — the model may request more tools -6. **Loop continues** until the model produces a final text response -7. **Agent builds `AgentTrace`** — captures EVERY step (tool calls, tool results, reasoning, output) -8. **Test receives the trace** and passes it to AgentAssert methods -9. **AgentAssert uses HeuristicContractMatcher** to score output against BehaviorContracts -10. **MatchResult returned** with heuristic confidence and per-layer details +```mermaid +flowchart TB + subgraph T["1 Test"] + A["test calls agent.run(prompt)"] + end + + subgraph G["2 Agent loop"] + B["Send prompt plus tool defs from ToolRegistry to LLM API"] + C{"Model response?"} + D["Tool calls: Anthropic tool_use / OpenAI function calls e.g. file-reader path X"] + E["ToolRegistry.execute returns ToolResult"] + F["Send tool results back to model"] + H["Final text answer"] + end + + subgraph TR["3 Trace and assertions"] + I["Build AgentTrace: reasoning, tool_call, tool_result, output"] + J["Test passes trace to AgentAssert"] + K["AgentAssert plus HeuristicContractMatcher vs BehaviorContract"] + L["MatchResult: confidence and per-layer detail"] + end + + A --> B + B --> C + C -->|tools| D + D --> E + E --> F + F --> C + C -->|done| H + H --> I + I --> J + J --> K + K --> L +``` + + + +1. The test starts the agent with a **plain-language task** (`agent.run(prompt)`). +2. The agent sends that task to the **LLM** (Anthropic, OpenAI, or local Ollama) and includes the **list of tools** from the ToolRegistry. +3. The model may answer with **“please run tool X with these inputs”** (the exact format differs by vendor; see diagram above). +4. The agent runs the real tool code through the ToolRegistry and gets a **ToolResult** (success/failure + data). +5. That result goes **back to the model**; steps 3–5 may **repeat** until the model stops asking for tools. +6. The model eventually returns a **final text** answer (or stops). +7. The agent assembles an **`AgentTrace`**: a **full log** of reasoning, each tool call, each tool result, and final output. +8. The test sends that trace into **AgentAssert** helpers (pass/fail checks). +9. 
For text rules, **HeuristicContractMatcher** compares the output to a **BehaviorContract** (keywords, patterns, etc.). +10. You get a **`MatchResult`**: pass/fail style outcome, a **confidence** score, and **details** per check layer (useful when debugging failures). --- ## The Five Testing Patterns ### Pattern 1: Tool Invocation Assertion + **File:** `tests/behavioral/intent-routing.spec.ts` -**What it tests:** Did the agent select the correct tool for the given intent? +**What it tests:** For a user request like “read this log file,” did the agent actually call the **right tool** (e.g. file-reader) and not something else? -**Why it's unique:** Traditional tests check return values. Agent tests check *decisions*. The agent might return a plausible-looking summary even when it called the wrong tool (or no tool at all). +**Why it matters:** A bad answer can still *sound* fine. This checks the **decision** (which tool ran), not only the final text. **Key assertion:** + ```typescript const result = AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /.*\.log$/ }); AgentAssert.expectMatched(result, 'file-reader should be invoked'); // embeds AgentAssert.formatResult(result) on failure ``` **What to look at in the code:** -- `AgentAssert.toolWasInvoked()` — walks the trace looking for tool_call steps -- `paramMatchers` — regex patterns validated against tool input parameters -- Negative assertion — `AgentAssert.expectNotMatched(result, '...')` verifies a tool was NOT called (rich failure output via `formatResult`) + +- `AgentAssert.toolWasInvoked()` — finds **tool_call** steps in the trace +- `paramMatchers` — optional checks on **arguments** passed to the tool (e.g. file path) +- `AgentAssert.expectNotMatched()` — asserts a tool was **not** used (when the test expects a negative case) --- ### Pattern 2: Behavior Contract Validation + **File:** `tests/behavioral/output-contract.spec.ts` -**What it tests:** Does the output satisfy a **heuristic** contract (fields + keywords + patterns), not an exact string match? +**What it tests:** Does the final answer follow **agreed rules** (required fields, must-include words, forbidden phrases)—without requiring an **exact** copy-paste string match? -**Why it's unique:** `expect(output).toBe("...")` breaks on every LLM run. Contracts define cheap rules that often track “good enough” outputs. Synonymous phrasing can still fail if keywords don’t align — widen keywords, lower thresholds, add a **customValidator**, or upgrade to embeddings / LLM-judge (see limits section below). +**Why it matters:** LLM wording changes every run. Contracts check **flexible rules** instead of one frozen sentence. If wording drifts, add keywords, relax thresholds, or add a **customValidator** (see **Heuristic matching: scope and limits**). 
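To make "agreed rules" concrete, here is a minimal sketch of what a custom contract for a log-summary task might look like. This is a hypothetical example, not a contract shipped in `BehaviorContract.ts`; the property names follow the ones this README mentions elsewhere (`requiredFields`, `requiredIntentKeywords`, `minKeywordMatchRatio`, `forbiddenPatterns`, `customValidator`), so check `ContractDefinition` in `framework/types.ts` for the exact shape before copying.

```typescript
// Hypothetical contract sketch — field names assumed from this README;
// verify against ContractDefinition in framework/types.ts.
const LOG_SUMMARY_CONTRACT = {
  name: 'LOG_SUMMARY',
  // Structure layer: these fields must exist on the output object
  requiredFields: ['taskType', 'summary'],
  // Keyword layer: a share of these must appear as case-insensitive substrings
  requiredIntentKeywords: ['error', 'warning', 'log', 'summary'],
  minKeywordMatchRatio: 0.2,
  // Forbidden layer: any regex hit fails the contract
  forbiddenPatterns: [/i (deleted|modified) the file/i],
  // Optional custom rule for logic keywords cannot express
  customValidator: (output: { summary?: string }) =>
    typeof output.summary === 'string' && output.summary.length > 20,
};

// Used the same way as the built-in contracts, e.g.:
// const result = AgentAssert.satisfiesContract(trace.output, LOG_SUMMARY_CONTRACT, 0.5);
```
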
**Key assertion:** + ```typescript const result = AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION, 0.5); AgentAssert.expectMatched(result, 'SUMMARIZATION contract should pass'); ``` **What to look at in the code:** -- `BehaviorContract.ts` — pre-built contracts with required fields, keywords, forbidden patterns -- `HeuristicContractMatcher.evaluate()` — structure + keyword overlap + forbidden regex (+ optional custom) -- `minKeywordMatchRatio` — how much of the keyword list must appear as substrings -- `forbiddenPatterns` — regex matches force a contract failure path + +- `BehaviorContract.ts` — ready-made rule sets (fields, keywords, forbidden patterns) +- `HeuristicContractMatcher.evaluate()` — runs structure + keywords + forbidden patterns (+ optional custom rule) +- `minKeywordMatchRatio` — how many of the listed keywords must appear +- `forbiddenPatterns` — if the output matches these, the contract **fails** --- ### Pattern 3: Multi-Step Trace Verification + **File:** `tests/behavioral/tool-invocation.spec.ts` -**What it tests:** Did the agent follow a valid reasoning path through multiple tool calls? +**What it tests:** For workflows that need **more than one tool** (e.g. read file, then call API), did the agent follow a **sensible order** of steps? -**Why it's unique:** No equivalent in Selenium/Playwright browser testing. Browser tests check page state, not the application's intermediate reasoning steps. +**Why it matters:** UI tests usually see **screens**. Here you inspect the **sequence of tool calls** inside the trace—something classic UI checks do not cover. **Key assertion:** + ```typescript const result = AgentAssert.traceFollowsSequence(trace, [ { type: 'tool_call', toolName: 'file-reader' }, @@ -138,41 +188,47 @@ AgentAssert.expectMatched(result, 'trace should show file-reader then api-caller ``` **What to look at in the code:** -- `AgentAssert.traceFollowsSequence()` — checks steps appear in order (not necessarily consecutive) -- The trace captures reasoning steps between tool calls -- `toolCallCountInRange()` — sanity check on how many tools were called + +- `AgentAssert.traceFollowsSequence()` — checks that steps appear in the **right order** (they do not have to be back-to-back) +- The trace can include **reasoning** lines between tool calls +- `toolCallCountInRange()` — optional check that the agent did not call tools **too many** or **too few** times --- ### Pattern 4: Boundary/Scope Enforcement + **File:** `tests/boundary/hallucination-guard.spec.ts` -**What it tests:** Did the agent stay within its defined task boundaries? +**What it tests:** Did the agent **only use tools it is allowed to use** and avoid **out-of-scope** actions? -**Why it's unique:** LLMs hallucinate tool calls, fabricate data, and take unsolicited actions. Nobody tests for this systematically. +**Why it matters:** Models sometimes invent tool names or try extra actions. This pattern catches **scope creep** and **fake tool calls**. 
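A full boundary test might look roughly like the sketch below. The helper names (`createFileOnlyAgent`, `HALLUCINATION_PROMPTS`) and import paths are taken from this README's description of `tests/fixtures/setup.ts` and are assumptions about the exact signatures; compare with the real specs in `tests/boundary/` before reusing.

```typescript
// Hypothetical boundary-test sketch — helper names and paths assumed from
// this README's description of tests/fixtures/setup.ts.
import { test } from '@playwright/test';
import { AgentAssert } from '../../framework/AgentAssert.js';
import { createFileOnlyAgent, HALLUCINATION_PROMPTS } from '../fixtures/setup';

test('agent stays inside the file-reader allow-list', async () => {
  // Agent registered with file-reader only — no api-caller available
  const agent = createFileOnlyAgent();

  // Adversarial prompt that tries to push the agent toward an API call
  const trace = await agent.run(HALLUCINATION_PROMPTS[0]);

  // Fails if any tool outside the allow-list shows up in the trace
  const boundary = AgentAssert.boundaryNotViolated(trace, ['file-reader']);
  AgentAssert.expectMatched(boundary, 'only file-reader should be used');

  // Negative check: the out-of-scope tool must not have been invoked
  const apiCall = AgentAssert.toolWasInvoked(trace, 'api-caller');
  AgentAssert.expectNotMatched(apiCall, 'api-caller is out of scope here');
});
```
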
**Key assertion:** + ```typescript const result = AgentAssert.boundaryNotViolated(trace, ['file-reader']); AgentAssert.expectMatched(result, 'only file-reader should be used'); ``` **What to look at in the code:** -- `boundaryNotViolated()` — checks that ONLY allowed tools were called -- `createFileOnlyAgent()` — test factory that registers limited tools -- `HALLUCINATION_PROMPTS` — adversarial prompts designed to trigger out-of-scope behavior -- `SCOPE_BOUNDED` contract — forbidden patterns for scope-creep language + +- `boundaryNotViolated()` — fails if any **tool outside the allow-list** was called +- `createFileOnlyAgent()` — builds an agent with **only** certain tools registered (tight boundary) +- `HALLUCINATION_PROMPTS` — tricky prompts meant to provoke **wrong** or **extra** tool use +- `SCOPE_BOUNDED` contract — rules that flag **forbidden** wording when the task should stay narrow --- ### Pattern 5: Failure & Retry Observability + **File:** `tests/boundary/retry-behavior.spec.ts` -**What it tests:** Does the agent degrade gracefully when tools fail? +**What it tests:** When a tool **errors** (file missing, API down), does the agent **admit the failure** and avoid claiming success? Does it **retry** within reasonable limits? -**Why it's unique:** Most agent tests don't simulate tool failures at all. +**Why it matters:** Many demos only test the happy path. Here failures are **forced** so you can see honest vs misleading behavior. **Key assertions (examples):** + ```typescript // Honest reporting: output mentions failure / contract passes expect(mentionsFailure, 'output should reflect tool failure').toBe(true); @@ -193,130 +249,144 @@ AgentAssert.expectMatched( ``` **What to look at in the code:** -- `createFailingFileReaderAgent()` — agent with tools wired to always fail -- `GRACEFUL_FAILURE` contract — requires error keywords, forbids success-claiming language -- `retryCount` in trace metadata — tracks how many tool results had `success: false` in the agent loop -- Cascade failure test — verifies downstream behavior when file-reader fails (see `textReflectsUpstreamFailure` / tool traces) + +- `createFailingFileReaderAgent()` — file-reader always **fails** on purpose (error-path testing) +- `GRACEFUL_FAILURE` contract — expects **honest** error language; forbids **false success** claims +- `retryCount` in trace metadata — counts tool results with **`success: false`** +- Cascade tests — what happens **after** the first tool fails (downstream steps) --- ## Key Files Explained ### framework/types.ts -Shared types for traces, contracts, and (in the demo) tool definitions. Assertions and `examples/agent/` both import from here so the SUT and matchers stay aligned. -- `AgentTrace` — what assertions operate on -- `TraceStep`, `AgentOutput`, `ContractDefinition`, `MatchResult`, `ToolDefinition`, … +**Shared shapes** for traces, contracts, and tools. Tests and the demo agent both use these types so “what we record” and “what we assert” stay in sync. + +- **`AgentTrace`** — the main object assertions read +- Other types: steps, output, contracts, tool definitions, match results ### examples/agent/agent.ts -**Demo system under test** — not part of the reusable assertion layer. The tool-calling loop is the reference pattern: -1. Send prompt + tool definitions to **Anthropic Messages API** or **OpenAI Chat Completions** (see `AgentConfig.provider`) -2. The model responds with text and/or tool calls (`tool_use` vs `function` / `tool_calls` depending on provider) -3. Execute requested tools -4. 
Send tool results back in the provider-specific message format -5. Repeat until the model gives a final answer -6. Build the AgentTrace from everything that happened +**Demo agent** (the application under test). It is **not** the reusable assertion library. Flow in plain terms: + +1. Send the user message and tool list to the LLM (**Anthropic** or **OpenAI-style** API, including Ollama). +2. Read the model reply: plain text and/or **requests to call tools**. +3. Run the requested tools; send results back to the model. +4. Repeat until the conversation finishes. +5. Package everything into an **`AgentTrace`**. -**Provider selection:** Pass `provider: 'anthropic' | 'openai' | 'ollama'`, or set `LLM_PROVIDER` in the environment. Keys: `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` (or pass `apiKey`). For **`LLM_PROVIDER=ollama`**, the OpenAI client uses `OPENAI_BASE_URL`, then `OLLAMA_BASE_URL`, then defaults to `http://127.0.0.1:11434/v1`. For **`LLM_PROVIDER=openai`**, only **`OPENAI_BASE_URL`** is used for a custom base (Azure, proxy, etc.); **`OLLAMA_BASE_URL` is ignored** so a stale shell variable cannot send OpenAI traffic to Ollama by accident. +**How to pick a provider:** Set `LLM_PROVIDER` (or `provider` in code) to `anthropic`, `openai`, or `ollama`. Use **`ANTHROPIC_API_KEY`** / **`OPENAI_API_KEY`** as needed. For Ollama, the client uses **`OPENAI_BASE_URL`**, then **`OLLAMA_BASE_URL`**, then **`http://127.0.0.1:11434/v1`**. For **`LLM_PROVIDER=openai`**, only **`OPENAI_BASE_URL`** sets a custom base (Azure, proxy, etc.); **`OLLAMA_BASE_URL`** is not used so you do not accidentally send OpenAI traffic to Ollama. -**Important:** The system prompt in this file shapes agent behavior. If you change it, update the test contracts to match. +**If you edit the system prompt here**, update test expectations and contracts so they still match what you want the agent to do. ### examples/agent/tools/file-reader.ts and api-caller.ts -Demo tools use MCP-aligned JSON schemas and register through **`ToolRegistry`**. In this POC they run locally (file-reader reads from disk, api-caller uses mock responses). To connect them to a real MCP server, replace the `execute` function with MCP transport calls — the schema stays the same. -**Security note:** `file-reader.ts` includes path traversal protection. Read the comments. +**Sample tools:** JSON schema + **run** function. In the demo they run **locally** (real disk read; API tool uses **mock** responses). To use a real MCP server later, you would replace the **run** logic—not necessarily the schema. + +**Security:** `file-reader.ts` blocks path tricks; read the file comments before changing paths. ### examples/agent/tools/registry.ts -Maps tool names to definitions. Provides `toAnthropicTools()` and `toOpenAITools()` so the same tool definitions work with either API. This is the bridge between your tool definitions and the LLM. + +**Name → tool implementation.** Also turns the same definitions into the **Anthropic** or **OpenAI** tool format the API expects. ### framework/HeuristicContractMatcher.ts -**Heuristic evaluation (not deep semantics).** Layers: -1. **Structural** (40% weight) — required fields (and optional length) -2. **Keywords** (35% weight, or 25% + 10% custom when `customValidator` is set) — bag-of-words style: substring presence for each listed keyword -3. **Forbidden** (25% weight) — regex patterns; any hit triggers the contract-failure path -4. 
**Custom** (optional) — your own validator in the contract +**Rule-based scoring (not “AI understands meaning”).** Rough weights: + +1. **Structure** (~40%) — required fields (and optional length checks) +2. **Keywords** (~35%, adjusted if you add a custom rule) — listed words must appear as **substrings** (after lowercasing) +3. **Forbidden** (~25%) — if a **regex** matches, the contract fails that path +4. **Custom** (optional) — your own function for extra checks + +The **`confidence`** number is a **blend of those checks**—for **pass/fail thresholds** and tuning, not a true “semantic similarity” score. -The headline `confidence` is a **weighted average of those scores** — useful for ranking and thresholds, not as a semantic similarity score. +**Tuning (for less flaky runs):** -**Tuning knobs:** -- `minKeywordMatchRatio` — lower = more lenient keyword layer -- Assertion threshold on `result.confidence` — lower = fewer flaky tests -- `forbiddenPatterns` — stricter guardrails (regex can be brittle; test them) -- **Synonyms** — add alternate phrasings to `requiredIntentKeywords`, or use `customValidator` / external judges (below) +- Lower **`minKeywordMatchRatio`** — easier keyword pass +- Lower the **assertion threshold** on `confidence` — fewer false failures +- Tighten **`forbiddenPatterns`** — stricter “must not say” rules (test regexes carefully) +- Add **synonyms** to keyword lists or use **`customValidator`** when wording varies a lot ### Heuristic matching: scope and limits -| Approach | What this repo does | What would be “more semantic” | -|----------|---------------------|--------------------------------| -| Keyword list | Substring checks after lowercasing | LLM-as-judge, entailment models | -| Confidence | Weighted heuristic blend | Calibrated metrics or judge scores | -| Same meaning, different words | Can **fail** unless keywords or patterns cover both | Embeddings vs reference texts, synonym lists | -**Possible upgrades (not implemented here):** call a second model to grade outputs against the contract; embed output and reference snippets and compare cosine similarity; use an NLP library for paraphrase / NLI. Those add latency, cost, and complexity — the heuristic matcher stays intentionally cheap and explicit. +| Topic | This repo | “Stronger” approaches (not built in) | +| -------------------------- | ---------------------------------------------------- | --------------------------------------------- | +| Wording | Keyword **substring** checks | Second model as judge, specialized NLP models | +| Score | Weighted **rules** | Calibrated scores from another system | +| Same idea, different words | Can **fail** if keywords do not cover both phrasings | Embeddings, synonym lists, judges | + + +**Not included here on purpose:** a second LLM to grade answers, vector similarity, or heavy NLP—those add **cost**, **latency**, and **ops**. This matcher is meant to stay **simple and inspectable**. ### framework/BehaviorContract.ts -Pre-built contracts for common task types. Each contract defines what "correct" means for that task type. The five contracts: SUMMARIZATION, API_ACTION, MULTI_STEP, SCOPE_BOUNDED, GRACEFUL_FAILURE. + +Ready-made **named rule sets** for typical tasks: **SUMMARIZATION**, **API_ACTION**, **MULTI_STEP**, **SCOPE_BOUNDED**, **GRACEFUL_FAILURE**. ### framework/AgentAssert.ts -The public API. Core assertions (each returns a **`MatchResult`**): -1. `toolWasInvoked(trace, toolName, paramMatchers?)` — was a tool called? -2. 
`satisfiesContract(output, contract, minConfidence?)` — does output meet the contract? -3. `boundaryNotViolated(trace, allowedTools)` — only allowed tools used? -4. `traceFollowsSequence(trace, expectedSequence)` — correct execution order? -5. `toolCallCountInRange(trace, min, max)` — reasonable number of tool calls? +**Main API** for tests (each method returns a **`MatchResult`**): + +1. `toolWasInvoked(...)` — was this tool used (and optional argument checks)? +2. `satisfiesContract(...)` — does the final output follow the contract? +3. `boundaryNotViolated(...)` — only tools from this allow-list? +4. `traceFollowsSequence(...)` — did steps happen in this order? +5. `toolCallCountInRange(...)` — tool call count within min/max? Helpers: -- `formatResult(result, debugHint?)` — pretty-print a `MatchResult` for logs or messages -- `expectMatched(result, context)` / `expectNotMatched(result, context)` — thin wrappers around Playwright’s `expect` so failures include `formatResult` automatically +- `formatResult(...)` — readable dump for logs or failure messages +- `expectMatched` / `expectNotMatched` — Playwright-friendly helpers that attach **`formatResult`** on failure --- ### tests/env-llm.ts and `.env` -Playwright loads **`tests/env-llm.ts`** from **`playwright.config.ts`** (`applyLlmVarsFromDotEnv()`). Selected keys from a project-root **`.env`** file are merged into `process.env` (`.env` wins over existing shell vars for those keys): `LLM_PROVIDER`, `LLM_MODEL`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `OLLAMA_API_KEY`, `OLLAMA_BASE_URL`. Copy **`.env.example`** to **`.env`** and fill in keys so tests and IDE runs see the same configuration without exporting variables manually. +**Local runs:** Copy **`.env.example`** to **`.env`** and set keys (provider, model, API keys, Ollama URL). Playwright loads these via **`tests/env-llm.ts`** so tests see the same settings as your IDE. + +**GitHub Actions:** CI reads the same kind of values from repo **Settings → Secrets and variables → Actions** (no code edit needed). Workflow file: **`.github/workflows/ci.yml`**. + -**GitHub Actions CI** (`.github/workflows/ci.yml`) installs Ollama, pulls the configured model, and runs the smoke test. **Provider, model, and optional base URL** are read from repository **Variables** or **Secrets** (no code change): +| Variable / secret | What it’s for | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `LLM_PROVIDER` | Usually `ollama` in CI | +| `LLM_MODEL` | e.g. `llama3.2:3b` (local) or a `*:cloud` model | +| `OLLAMA_BASE_URL` | Optional; default `http://127.0.0.1:11434/v1` | +| `OLLAMA_API_KEY` | **Secret** — needed only for **Ollama Cloud** models (`*:cloud`). Without it, you may see **HTTP 500**. [Keys](https://ollama.com/settings/keys) | -| Name | Purpose | -|------|--------| -| `LLM_PROVIDER` | e.g. `ollama` | -| `LLM_MODEL` | e.g. `llama3.2:3b` or `deepseek-v3.2:cloud` | -| `OLLAMA_BASE_URL` | Optional; default `http://127.0.0.1:11434/v1` | -| `OLLAMA_API_KEY` | **Repository secret** — required when `LLM_MODEL` is an Ollama **Cloud** tag (`*:cloud`). Without it, Ollama returns **HTTP 500**. [Create a key](https://ollama.com/settings/keys). | -**Best practice:** put **non-sensitive** values under **Variables**. Use **Secrets** for **`OLLAMA_API_KEY`** and other keys. Storing `LLM_MODEL` as a *secret* works but **masks** it in logs—prefer **Variables** for model name unless needed. 
If neither Variable nor Secret is set for provider/model, CI defaults to **`ollama`** + **`llama3.2:3b`** (local, no key; stronger tool-calling than `1b`). +**Tip:** Put **non-secret** values in **Variables**; put **keys** in **Secrets**. If provider/model are unset in CI, defaults are **`ollama`** + **`llama3.2:3b`**. -**Run CI on a chosen branch:** **Actions** → **CI** → **Run workflow**. GitHub shows **Use workflow from** — that dropdown is the branch selector for manual runs (the workflow file and checkout use that branch unless you override). Optionally fill **Checkout ref** in the form to checkout a different branch or full ref (e.g. `refs/heads/feature/x`). +**Manual CI run:** **Actions → CI → Run workflow**. Use **Use workflow from** to pick the branch. Optional **Checkout ref** overrides which git ref is checked out. -**Local vs Cloud tags:** tags like `llama3.2:3b` run fully locally. Tags ending in **`:cloud`** need **`OLLAMA_API_KEY`** in Secrets. Override **`LLM_MODEL`** in Variables if you want another local model (e.g. newer builds when Ollama adds them). +**Local vs Cloud model names:** Names like `llama3.2:3b` run on the runner. Names ending in **`*:cloud`** need **`OLLAMA_API_KEY`**. --- ### playwright.config.ts (high level) -- **`retries: 1`** — each failed test runs one more time (LLM outputs vary) -- **Timeouts:** default **45s**; **`behavioral`** project **120s** locally, **300s** when **`CI=true`** (e.g. GitHub Actions + Ollama on CPU); **`boundary`** **45s** -- **`trace: 'off'`** — browser-style Playwright traces are disabled (this suite does not use a browser). Failures still get rich attachments from **`registerAgentTraceForDiagnostics`** in `tests/fixtures/setup.ts` (see below) -- **`workers: 3`** — tune for your API rate limits -- **HTML report `title`** — includes resolved LLM provider and model for quick scanning +- **`retries: 1`** — one retry per failed test (helps with flaky LLM output) +- **Timeouts:** **45s** default; **behavioral** tests **120s** locally, **300s** in CI (`CI=true`) because Ollama on CPU can be slow; **boundary** stays **45s** +- **`trace: 'off'`** — no browser trace zip (these tests are not UI tests). Failures still get **attachments** from **`registerAgentTraceForDiagnostics`** +- **`workers: 3`** — parallel runs (watch API rate limits) +- **Report title** — shows which provider/model ran ### tests/fixtures/setup.ts -Shared test factories (`createTestAgent`, `createFailingFileReaderAgent`, `createFileOnlyAgent`, etc.), fixture file paths (`FIXTURE_DIR` / `test-fixtures/`), and **`registerAgentTraceForDiagnostics`** — wires trace attachments into the Playwright report (see **Failure diagnostics** above). +**Test helpers:** builds agents (`createTestAgent`, `createFailingFileReaderAgent`, …), test file paths, and **`registerAgentTraceForDiagnostics`** so traces show up in the HTML report when something fails. --- ## Setup & Running ### Prerequisites -- Node.js 18+ -- An API key for cloud use, **or** a local [Ollama](https://ollama.com/) (or other OpenAI-compatible) server — see **Local Ollama** below + +- **Node.js 18+** +- Either **API keys** for a cloud LLM **or** a local **[Ollama](https://ollama.com/)** install (free local models; see below) ### Install + ```bash cd agent-assert # or your clone folder name npm install @@ -325,9 +395,9 @@ npx playwright install # Playwright still expects browser binaries to be presen ### Configure API keys -Prefer a **`.env`** file at the repo root (see **`.env.example`**). 
The same variables work if you `export` them in the shell. +Use a **`.env`** file at the repo root (see **`.env.example`**), or `export` the same names in your terminal. -**Anthropic (default)** — set the key and optionally pin the provider (default is `anthropic` if `LLM_PROVIDER` is unset): +**Anthropic (default if unset)** — set the key; optionally set `LLM_PROVIDER=anthropic`: ```bash export ANTHROPIC_API_KEY="sk-ant-..." @@ -335,7 +405,7 @@ export ANTHROPIC_API_KEY="sk-ant-..." export LLM_PROVIDER="anthropic" ``` -**OpenAI** — set the OpenAI key and select the provider. If you omit `LLM_MODEL`, the agent defaults to `gpt-4o`: +**OpenAI** — set `OPENAI_API_KEY` and `LLM_PROVIDER=openai`. Default model is `gpt-4o` if you omit `LLM_MODEL`: ```bash export OPENAI_API_KEY="sk-..." @@ -344,7 +414,7 @@ export LLM_PROVIDER="openai" export LLM_MODEL="gpt-4o-mini" ``` -**Local Ollama** — no paid API key required; uses the OpenAI-compatible endpoint (`http://127.0.0.1:11434/v1` by default). Start Ollama, pull a model (e.g. `ollama pull qwen3.5`), then: +**Local Ollama** — no cloud API key. Start Ollama, `ollama pull `, then: ```bash export LLM_PROVIDER="ollama" @@ -354,25 +424,27 @@ export LLM_MODEL="qwen3.5:latest" npx playwright test ``` -Alternatively, keep `LLM_PROVIDER=openai` and point at Ollama with **`OPENAI_BASE_URL`** only (do not rely on `OLLAMA_BASE_URL` for this provider). +Or use `LLM_PROVIDER=openai` with **`OPENAI_BASE_URL`** pointing at Ollama (do not depend on **`OLLAMA_BASE_URL`** for that mode). -You can also pass `provider`, `apiKey`, `baseURL`, and `model` when constructing `Agent` in code instead of using environment variables. +You can also set `provider`, `apiKey`, `baseURL`, and `model` in code when creating the `Agent`. ### Failure diagnostics (HTML report) -Tests that call **`registerAgentTraceForDiagnostics(testInfo, trace)`** attach: +When tests use **`registerAgentTraceForDiagnostics`**, the report can include: -- **`agent-run-summary.txt`** — every run (provider, model, tool order, output preview) -- On failure: **`agent-diagnostics.txt`** (full trace dump) and **`playwright-failure.txt`** (Playwright error context) +- **`agent-run-summary.txt`** — every run: provider, model, tools used, output snippet +- On failure: **`agent-diagnostics.txt`** (full trace) and **`playwright-failure.txt`** -Open the **Attachments** tab in the HTML report for the failing test. +Open the failing test → **Attachments** in the Playwright HTML report. ### Run All Tests + ```bash npx playwright test ``` ### Run Specific Pattern + ```bash npx playwright test tests/behavioral/intent-routing.spec.ts npx playwright test tests/boundary/ @@ -381,6 +453,7 @@ npx playwright test --project=boundary ``` ### View HTML Report + ```bash npx playwright show-report ``` @@ -390,97 +463,99 @@ npx playwright show-report ## How to Extend ### Add a New Tool -1. Create `examples/agent/tools/your-tool.ts` following the same factory pattern as `file-reader.ts` -2. Register it in the ToolRegistry in your test setup -3. Add mock responses in `setup.ts` -4. Write tests using `AgentAssert.toolWasInvoked(trace, 'your-tool')` + +1. Add `examples/agent/tools/your-tool.ts` (same style as `file-reader.ts`) +2. **Register** it on the ToolRegistry in test setup +3. Add any **mock data** in `setup.ts` if needed +4. Write tests with `AgentAssert.toolWasInvoked(trace, 'your-tool')` ### Add a New Contract -1. Add a new static property to `BehaviorContract.ts` -2. 
Define requiredFields, requiredIntentKeywords, forbiddenPatterns -3. Set minKeywordMatchRatio (start with 0.2, tune from there) -4. Add a customValidator if you need logic beyond keywords/patterns -5. Write a test that asserts against your new contract + +1. Add a new entry in `BehaviorContract.ts` +2. Set **required fields**, **keywords**, **forbidden patterns** +3. Set **`minKeywordMatchRatio`** (try ~0.2, then tune) +4. Optional: **`customValidator`** for rules keywords cannot express +5. Add a test that calls `satisfiesContract` with your contract ### Add a New Assertion Method -1. Add a static method to `AgentAssert.ts` -2. Accept `AgentTrace` or `AgentOutput` as input -3. Return `MatchResult` -4. Use `HeuristicContractMatcher` methods internally if needed -5. Include detailed reasons in the `details` array - -### Adapt for Another LLM Provider (beyond Anthropic, OpenAI, and Ollama) -1. Add a branch in `examples/agent/agent.ts` alongside the existing Anthropic and OpenAI-compatible loops -2. Add a `toYourProviderTools()` (or equivalent) on `ToolRegistry` if the tool schema differs -3. Map that provider’s tool-call and tool-result messages into the same `TraceStep` shapes the framework already expects -4. The framework layer (AgentAssert, HeuristicContractMatcher, BehaviorContract) stays UNCHANGED — it operates on `AgentTrace`, which is provider-agnostic + +1. Add a static method on `AgentAssert.ts` +2. Input: **`AgentTrace`** or **`AgentOutput`** +3. Return: **`MatchResult`** +4. Reuse `HeuristicContractMatcher` where it fits +5. Put clear **reason strings** in `details` for failures + +### Another LLM Provider (not Anthropic / OpenAI / Ollama) + +1. Add a new code path in `examples/agent/agent.ts` for that API +2. If tool JSON differs, add a converter (like `toOpenAITools`) on `ToolRegistry` +3. Map responses into the same **`TraceStep`** shapes as today +4. **`framework/`** can stay the same—it only reads **`AgentTrace`** ### Connect to a Real MCP Server -1. Replace the `execute` function in your tool with MCP client calls -2. Use `@modelcontextprotocol/sdk` for the transport layer -3. Keep the same `inputSchema` — MCP and Anthropic tool schemas are aligned by design -4. Update mock configurations in test setup to toggle between mock and live modes + +1. In each tool, replace local `execute` with **MCP client** calls +2. Optional: `@modelcontextprotocol/sdk` for transport +3. Keep **`inputSchema`** stable where possible +4. In tests, switch **mock vs live** in `setup.ts` --- ## Why Playwright (Not Jest or Vitest) -| Feature | Jest/Vitest | Playwright | -|---------|------------|------------| -| Parallel isolation | Shared process | Separate worker processes | -| Built-in retries | Manual config | `retries: 1` in this project’s config | -| HTML reports | Needs plugin | Built in | -| Timeout granularity | Per-suite | Per-test, per-suite, global | -| Trace capture | None | This suite uses `trace: 'off'` + custom attachments on failure | -| Future browser testing | Separate framework | Same framework | -| Fixture system | beforeEach | test.extend() with typed fixtures | -The deliberate choice of Playwright for non-browser testing is itself a publishable insight for the article. 
+| | Jest / Vitest | Playwright (this repo) | +| ------------------------- | ------------------ | ----------------------------------------------- | +| Isolation | Often one process | **Workers** (separate processes) | +| Retries | Extra setup | **`retries: 1`** built in | +| HTML report | Add-on | **Built in** | +| Timeouts | Mostly suite-level | **Per test** + suite + global | +| This project | — | Custom **attachments** on failure (agent trace) | +| Later: real browser tests | Another tool | **Same** Playwright | + + +Using Playwright without a browser is intentional: **reports, retries, and timeouts** fit long, flaky LLM runs well. --- ## Cost Awareness -Each test run calls a real LLM API. Costs depend on provider and model. +Cloud LLM calls **cost money** per run. **Local Ollama** costs $0 but uses CPU/time. + +**Rough idea (Anthropic, example model):** small tests ~**$0.01–0.03** per run; multi-tool tests a bit more. **Full suite** scales with how many tests run and whether **retries** fire (up to about **2×** if every test retries once). -**Anthropic** — with `claude-sonnet-4-20250514` (order-of-magnitude; varies by prompt length): -- Simple tests (1 tool call): ~$0.01-0.03 per run -- Multi-step tests (2+ tool calls): ~$0.03-0.08 per run -- Full suite (**25** tests × 1 run): scale the above by test mix -- Full suite with Playwright retries (**25** tests × up to **2** attempts each when flaky): up to ~2× the single-run cost +**OpenAI:** See [OpenAI pricing](https://openai.com/pricing) for your model. -**OpenAI** — pricing follows [OpenAI’s current rates](https://openai.com/pricing) for the model you set (e.g. `gpt-4o`, `gpt-4o-mini`). +**Save money while developing:** -**To reduce costs during development:** -- Use a smaller/cheaper model (`claude-haiku`, `gpt-4o-mini`, etc.) via `AgentConfig.model` or `LLM_MODEL` -- Run individual test files, not the full suite -- Keep `maxToolRounds` low (default 10 is already conservative) +- Use **cheaper / smaller** models (`LLM_MODEL`, or `Agent` config) +- Run **one file** or one test, not the whole suite +- **`maxToolRounds`** is already capped (default 10) --- ## Troubleshooting -**Tests timeout (behavioral tests: 120s local, 300s on CI):** -LLM APIs and local Ollama on CPU can be slow (especially in GitHub Actions — first inference after `ollama pull` can take minutes). The **`behavioral`** project uses a longer cap when **`CI=true`**. You can raise `behavioralTimeoutMs` in `playwright.config.ts` if needed. For Ollama in CI, ensure the model is pulled before tests and the runner has enough RAM. +**Timeouts (behavioral tests — 120s local, 300s on CI):** +Cloud APIs and **Ollama on CPU** (especially in GitHub Actions) can be **slow**. CI already uses a **longer** timeout. If needed, increase **`behavioralTimeoutMs`** in `playwright.config.ts`. In CI, confirm **`ollama pull`** finished and the runner has **enough RAM**. + +**Flaky pass/fail:** +Normal for LLMs. Try: relax **`minKeywordMatchRatio`**, add **keywords**, lower the **confidence** threshold, rely on **`retries: 1`**. -**Tests are flaky (pass sometimes, fail sometimes):** -This is expected with LLM testing. Three strategies: -1. Lower `minKeywordMatchRatio` in the contract -2. Add more keywords to the contract -3. Lower the confidence threshold in the assertion -4. Playwright’s `retries: 1` handles transient variation +**401 / wrong host:** +Match **`LLM_PROVIDER`** to the key you use. For **`openai`**, custom bases use **`OPENAI_BASE_URL`** (not `OLLAMA_BASE_URL`). 
For **local Ollama**, use **`LLM_PROVIDER=ollama`** and the Ollama `/v1` URL. -**Wrong provider or API URL (401 / unexpected host):** -Confirm `LLM_PROVIDER` matches the key you set. For `openai`, set **`OPENAI_BASE_URL`** for a custom endpoint; **`OLLAMA_BASE_URL` is not read** for that provider. Use **`LLM_PROVIDER=ollama`** with Ollama’s `/v1` base if you intend local Ollama. +**Ollama 500 with `*:cloud` models:** +Set **`OLLAMA_API_KEY`**, or use a **local** model name without `:cloud`. -**Ollama `500` / `internal service error` with `*:cloud` models:** -Cloud-tagged models may require an Ollama Cloud API key; use a **local** model tag for the same behavior without keys, or set optional **`OLLAMA_API_KEY`**. +**Ollama 400 — “does not support tools”:** +Pick a model that supports **tool calling** (see [Ollama models with Tools](https://ollama.com/search?c=tools)). Reasoning-only models (e.g. some **`deepseek-r1`** tags) may fail here. -**Agent output is not JSON:** -The system prompt tells Claude to respond in JSON, but it sometimes wraps it in markdown fences. The `parseOutput()` method in `agent.ts` handles this. If you see `taskType: "unknown"`, the JSON parsing failed entirely — check the raw text in the trace. +**Output not JSON / `taskType: "unknown"`:** +The demo expects JSON; the parser in **`examples/agent/agent.ts`** strips common markdown fences. If it still fails, inspect **raw text** in the trace attachment. -**File-reader returns "access denied":** -Path traversal protection. The file path must be within the configured `basePath`. Check that `FIXTURE_DIR` resolves correctly. +**File-reader "access denied":** +Paths must stay under the allowed folder (**path safety**). Check **`FIXTURE_DIR`** and paths in the test. -**"Tool X is not registered" error in trace:** -The agent tried to call a tool that wasn't in the ToolRegistry. This is actually a valid test finding — it means the agent hallucinated a tool call. Check the `boundaryNotViolated` assertion. +**"Tool X is not registered":** +The model asked for a tool that **was not registered**—often a **hallucinated** tool name. Relevant checks: **`boundaryNotViolated`**, allow-lists in setup. \ No newline at end of file