Testing and evaluation framework for AI agents. Define test suites in YAML, grade agent outputs with 6 pluggable graders, track results over time, and detect regressions with statistical comparison.
AI agents are hard to test. They're non-deterministic, they call tools, and their outputs vary between runs. Traditional unit tests don't cut it.
- YAML-based test suites – Define inputs, expected outputs, and grading criteria declaratively
- Statistical regression detection – Welch's t-test across multiple runs, not just pass/fail
- 6 built-in graders – Exact match, contains, regex, tool-check, LLM-judge, and custom
- AgentLens integration – Import real production sessions as test cases
- Cost & latency tracking – Know what each eval costs in tokens and dollars
- SQLite result storage – Every run is persisted for historical comparison
```bash
pip install agentevalkit
```

```yaml
# suite.yaml
name: my-agent-tests
agent: my_agent:run

cases:
  - name: basic-math
    input: "What is 2 + 2?"
    expected:
      output_contains: ["4"]
    grader: contains

  - name: tool-usage
    input: "Search for the weather in NYC"
    expected:
      tools_called: ["web_search"]
    grader: tool-check

  - name: format-check
    input: "List 3 colors"
    expected:
      pattern: "\\d\\.\\s+\\w+"
    grader: regex
```

```python
# my_agent.py
from agenteval.models import AgentResult


def run(input_text: str) -> AgentResult:
    # Your agent logic here
    return AgentResult(
        output="The answer is 4.",
        tools_called=[{"name": "web_search", "args": {"query": "weather NYC"}}],
        tokens_in=12,
        tokens_out=8,
        cost_usd=0.0003,
    )
```

```console
$ agenteval run --suite suite.yaml --verbose
============================================================
Suite: my-agent-tests | Run: c1c6493118d5
============================================================
PASS basic-addition (score=1.00, 150ms)
PASS capital-city (score=1.00, 200ms)
PASS quantum-summary (score=1.00, 350ms)
PASS tool-usage (score=1.00, 280ms)
PASS list-format (score=1.00, 120ms)
Total: 5  Passed: 5  Failed: 0  Pass rate: 100%
Cost: $0.0023  Avg latency: 220ms
```
| Grader | What it checks | Expected fields |
|---|---|---|
| `exact` | Exact string match | `output` |
| `contains` | Substring presence | `output_contains: [list]` |
| `regex` | Pattern matching | `pattern` |
| `tool-check` | Tools were called | `tools_called: [list]` |
| `llm-judge` | LLM evaluates quality | `criteria` (free-form) |
| `custom` | Your own function | `grader_config: {function: "mod:fn"}` |
Compare runs with Welch's t-test to detect statistically significant regressions:
```console
$ agenteval compare c1c6493118d5,d17a2dce0222 4ee7e40601e3,ba5b0dde212b
============================================================================
Comparing: c1c6493118d5,d17a2dce0222 vs 4ee7e40601e3,ba5b0dde212b
Alpha: 0.05   Regression threshold: 0.0
============================================================================
Case             Base   Target  Diff    p-value  Sig  Status
----------------------------------------------------------------------------
basic-addition   1.000  1.000   +0.000  -
capital-city     1.000  0.500   -0.500  0.4533
quantum-summary  1.000  0.500   -0.500  0.4533
tool-usage       1.000  0.000   -1.000  0.0000   *    regressed
list-format      1.000  0.500   -0.500  0.4533

Summary: 0 improved, 1 regressed, 4 unchanged
1 regression(s) detected!
```
Run the same suite multiple times and compare the groups: `agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2`. The comparison uses scipy when available and falls back to a pure-Python implementation.
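For reference, the statistic behind the comparison is a standard two-sample Welch's t-test over per-case scores. Below is a minimal sketch using scipy; the scores are made up for illustration and this is not AgentEval's internal code.

```python
# Illustrative only: the statistic `agenteval compare` reports, computed with scipy.
from scipy import stats

base_scores = [1.0, 1.0, 0.5, 1.0]    # hypothetical scores for one case across base runs
target_scores = [0.5, 0.0, 0.5, 0.5]  # the same case across target runs

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(base_scores, target_scores, equal_var=False)

alpha = 0.05
base_mean = sum(base_scores) / len(base_scores)
target_mean = sum(target_scores) / len(target_scores)
regressed = p_value < alpha and target_mean < base_mean
print(f"t={t_stat:.3f}  p={p_value:.4f}  regressed={regressed}")
```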
Import real agent sessions from AgentLens as test suites:
```bash
agenteval import --from agentlens --db sessions.db --output suite.yaml --grader contains
# Imported 42 cases → suite.yaml
```

Turn production traffic into regression tests – no manual test writing needed.
Every eval tracks tokens and cost. Your agent callable returns `AgentResult` with `tokens_in`, `tokens_out`, and `cost_usd`, and AgentEval aggregates them per run.
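For example, an agent callable might fill in `cost_usd` from the token counts it already has. The sketch below assumes placeholder per-token prices and a stub `call_model` helper; substitute your provider's real rates and client.

```python
from agenteval.models import AgentResult

# Placeholder prices -- not real provider rates.
PRICE_IN_PER_1K = 0.00015   # $ per 1K input tokens (assumed)
PRICE_OUT_PER_1K = 0.0006   # $ per 1K output tokens (assumed)


def call_model(prompt: str) -> tuple[str, int, int]:
    # Stand-in for your real LLM/tool call; returns (text, tokens_in, tokens_out).
    return f"Echo: {prompt}", len(prompt.split()), 3


def run(input_text: str) -> AgentResult:
    output, tokens_in, tokens_out = call_model(input_text)
    cost = tokens_in / 1000 * PRICE_IN_PER_1K + tokens_out / 1000 * PRICE_OUT_PER_1K
    return AgentResult(
        output=output,
        tools_called=[],
        tokens_in=tokens_in,
        tokens_out=tokens_out,
        cost_usd=round(cost, 6),
    )
```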
Full annotated example:
```yaml
name: my-agent-tests                # Suite name (shown in reports)
agent: my_module:my_agent           # Default agent callable (module:function)

defaults:                           # Defaults applied to all cases
  grader: contains
  grader_config:
    ignore_case: true

cases:
  - name: basic-math                # Unique case name
    input: "What is 2 + 2?"         # Input passed to agent
    expected:                       # Grader-specific expected values
      output_contains: ["4"]
    grader: contains                # Override default grader
    tags: [math, basic]             # Tags for filtering (--tag math)

  - name: tool-usage
    input: "Search for weather"
    expected:
      tools_called: ["web_search"]
    grader: tool-check

  - name: quality-check
    input: "Explain gravity"
    expected:
      criteria: "Should mention Newton or Einstein, be scientifically accurate"
    grader: llm-judge
    grader_config:
      model: gpt-4o-mini            # LLM judge model
      api_base: https://api.openai.com/v1

  - name: custom-validation
    input: "Generate a JSON object"
    expected: {}
    grader: custom
    grader_config:
      function: my_graders:validate_json   # Your grader function
```

```bash
agenteval run --suite suite.yaml [--agent module:fn] [--verbose] [--tag math] [--timeout 30] [--db agenteval.db]
```

- `--suite` – Path to the YAML suite file (required)
- `--agent` – Override the agent callable from the suite
- `--verbose` / `-v` – Show per-case pass/fail details
- `--tag` – Filter cases by tag (repeatable)
- `--timeout` – Per-case timeout in seconds (default: 30)
- `--db` – SQLite database path (default: `agenteval.db`)
Exit code is 1 if any case fails.
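That makes the CLI easy to use as a pipeline gate. A minimal sketch, assuming `agenteval` is installed and on PATH:

```python
# Fail the surrounding CI step when any eval case fails (non-zero exit code).
import subprocess
import sys

proc = subprocess.run(["agenteval", "run", "--suite", "suite.yaml", "--verbose"])
if proc.returncode != 0:
    sys.exit("AgentEval suite failed; blocking this build.")
```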
```bash
agenteval list [--suite-filter name] [--limit 20] [--db agenteval.db]
```

```console
$ agenteval list --limit 5
ID            Suite            Passed  Failed  Rate  Created
--------------------------------------------------------------------------------
aeccd5e53f03  math-agent-demo  2       3       40%   2026-02-12T21:12:12
4f3e380f622c  math-agent-demo  3       2       60%   2026-02-12T21:12:12
bd4ef3a0727b  math-agent-demo  1       4       20%   2026-02-12T21:12:12
e2ca43e99852  math-agent-demo  3       2       60%   2026-02-12T21:12:11
32ed650cab6d  math-agent-demo  2       3       40%   2026-02-12T21:12:11
```
```bash
agenteval compare RUN_A RUN_B [--alpha 0.05] [--threshold 0.0] [--stats/--no-stats]
agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2   # Multi-run comparison
```

```bash
agenteval import --from agentlens --db sessions.db --output suite.yaml [--grader contains] [--limit 100]
```

Compares `result.output` exactly with `expected.output`. Config: `ignore_case: bool`.
```yaml
expected:
  output: "The answer is 42."
grader: exact
grader_config:
  ignore_case: true
```

Checks that all substrings in `expected.output_contains` appear in the output.
```yaml
expected:
  output_contains: ["Paris", "France"]
grader: contains
```

Matches `result.output` against `expected.pattern` (Python regex). Config: `flags: [IGNORECASE, DOTALL, MULTILINE]`.
```yaml
expected:
  pattern: "\\d+\\.\\d+"
grader: regex
grader_config:
  flags: [IGNORECASE]
```

Verifies expected tools were called. Config: `ordered: bool` for sequence matching.
```yaml
expected:
  tools_called: ["web_search", "calculator"]
grader: tool-check
grader_config:
  ordered: true
```

Sends the input, output, and criteria to an LLM for evaluation. Requires `OPENAI_API_KEY` or a compatible API.
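Conceptually, a judge call looks something like the sketch below (illustrative only, using the `openai` client; AgentEval's actual prompt, parsing, and scoring may differ). The YAML that follows shows how the grader is configured.

```python
# Illustrative only -- not AgentEval's internal judge implementation.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def judge(input_text: str, output: str, criteria: str, model: str = "gpt-4o-mini") -> bool:
    prompt = (
        f"Input: {input_text}\n"
        f"Agent output: {output}\n"
        f"Criteria: {criteria}\n"
        "Does the output satisfy the criteria? Answer PASS or FAIL."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return "PASS" in resp.choices[0].message.content.upper()
```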
```yaml
expected:
  criteria: "Response should be helpful, accurate, and concise"
grader: llm-judge
grader_config:
  model: gpt-4o-mini
```

Imports and calls your own grader function. Must accept `(case: EvalCase, result: AgentResult) -> GradeResult`.
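For instance, a JSON-validity grader might look roughly like the sketch below, wired up by the `grader_config` that follows. The `GradeResult` field names and the `agenteval.models` import path for `EvalCase`/`GradeResult` are assumptions here; check the actual model definitions.

```python
# my_module.py -- hypothetical module referenced by grader_config.function below.
import json

from agenteval.models import AgentResult, EvalCase, GradeResult  # import path assumed


def my_grader(case: EvalCase, result: AgentResult) -> GradeResult:
    """Pass if the agent's output parses as valid JSON."""
    try:
        json.loads(result.output)
        return GradeResult(passed=True, score=1.0)            # field names assumed
    except json.JSONDecodeError as exc:
        return GradeResult(passed=False, score=0.0, reason=str(exc))
```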
```yaml
grader: custom
grader_config:
  function: my_module:my_grader
```

Contributions welcome! This project uses:
- pytest for testing (127 tests passing)
- ruff for linting
- src layout (`src/agenteval/`)
```bash
git clone https://github.com/amitpaz1/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest
```

| Project | Description | |
|---|---|---|
| AgentLens | Observability & audit trail for AI agents | |
| Lore | Cross-agent memory and lesson sharing | |
| AgentGate | Human-in-the-loop approval gateway | |
| FormBridge | Agent-human mixed-mode forms | |
| AgentEval | Testing & evaluation framework | ⬅️ you are here |
| agentkit-mesh | Agent discovery & delegation | |
| agentkit-cli | Unified CLI orchestrator | |
| agentkit-guardrails | Reactive policy guardrails | |
MIT – see LICENSE.