
AgentEval 🧪

PyPI Tests Python 3.9+ License: MIT

Testing and evaluation framework for AI agents. Define test suites in YAML, grade agent outputs with 6 pluggable graders, track results over time, and detect regressions with statistical comparison.


Why AgentEval?

AI agents are hard to test. They're non-deterministic, they call tools, and their outputs vary between runs. Traditional unit tests don't cut it.

  • 🎯 YAML-based test suites β€” Define inputs, expected outputs, and grading criteria declaratively
  • πŸ“Š Statistical regression detection β€” Welch's t-test across multiple runs, not just pass/fail
  • πŸ”Œ 6 built-in graders β€” Exact match, contains, regex, tool-check, LLM-judge, and custom
  • πŸ”— AgentLens integration β€” Import real production sessions as test cases
  • πŸ’° Cost & latency tracking β€” Know what each eval costs in tokens and dollars
  • πŸ—„οΈ SQLite result storage β€” Every run is persisted for historical comparison

Quick Start

pip install agentevalkit

1. Define a test suite

# suite.yaml
name: my-agent-tests
agent: my_agent:run

cases:
  - name: basic-math
    input: "What is 2 + 2?"
    expected:
      output_contains: ["4"]
    grader: contains

  - name: tool-usage
    input: "Search for the weather in NYC"
    expected:
      tools_called: ["web_search"]
    grader: tool-check

  - name: format-check
    input: "List 3 colors"
    expected:
      pattern: "\\d\\.\\s+\\w+"
    grader: regex

2. Create your agent callable

# my_agent.py
from agenteval.models import AgentResult

def run(input_text: str) -> AgentResult:
    # Your agent logic here
    return AgentResult(
        output="The answer is 4.",
        tools_called=[{"name": "web_search", "args": {"query": "weather NYC"}}],
        tokens_in=12,
        tokens_out=8,
        cost_usd=0.0003,
    )

3. Run the eval

$ agenteval run --suite suite.yaml --verbose

============================================================
Suite: my-agent-tests  |  Run: c1c6493118d5
============================================================
  PASS  basic-addition (score=1.00, 150ms)
  PASS  capital-city (score=1.00, 200ms)
  PASS  quantum-summary (score=1.00, 350ms)
  PASS  tool-usage (score=1.00, 280ms)
  PASS  list-format (score=1.00, 120ms)

Total: 5  Passed: 5  Failed: 0  Pass rate: 100%
Cost: $0.0023  Avg latency: 220ms

Features

🎯 6 Built-in Graders

Grader       What it checks           Expected fields
exact        Exact string match       output
contains     Substring presence       output_contains: [list]
regex        Pattern matching         pattern
tool-check   Tools were called        tools_called: [list]
llm-judge    LLM evaluates quality    criteria (free-form)
custom       Your own function        grader_config: {function: "mod:fn"}

📊 Statistical Comparison

Compare runs with Welch's t-test to detect statistically significant regressions:

$ agenteval compare c1c6493118d5,d17a2dce0222 vs 4ee7e40601e3,ba5b0dde212b

============================================================================
Comparing: c1c6493118d5,d17a2dce0222 vs 4ee7e40601e3,ba5b0dde212b
Alpha: 0.05  Regression threshold: 0.0
============================================================================

Case                          Base   Target     Diff   p-value  Sig Status
----------------------------------------------------------------------------
  basic-addition             1.000    1.000   +0.000         -
  capital-city               1.000    0.500   -0.500    0.4533
  quantum-summary            1.000    0.500   -0.500    0.4533
  tool-usage                 1.000    0.000   -1.000    0.0000    * ▼ regressed
  list-format                1.000    0.500   -0.500    0.4533

Summary: 0 improved, 1 regressed, 4 unchanged

⚠ 1 regression(s) detected!

Run the same suite multiple times and compare the groups: agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2. The statistics use scipy when available and fall back to a pure-Python implementation otherwise.
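
For intuition, the sketch below shows the kind of test this performs on per-case scores gathered from two groups of runs. It illustrates Welch's t-test itself, not AgentEval's internal code, and the score lists are made-up examples.

# Illustrative only: the statistic behind `agenteval compare`, not the library's implementation.
from scipy import stats

# Scores for one case, collected across several runs of each group.
baseline_scores = [1.0, 0.9, 1.0, 0.95]    # group A, e.g. runs on the old prompt
candidate_scores = [0.6, 0.5, 0.55, 0.6]   # group B, e.g. runs on the new prompt

# Welch's t-test: a two-sample t-test that does not assume equal variances.
t_stat, p_value = stats.ttest_ind(baseline_scores, candidate_scores, equal_var=False)

alpha = 0.05
mean_diff = sum(candidate_scores) / len(candidate_scores) - sum(baseline_scores) / len(baseline_scores)
regressed = mean_diff < 0 and p_value < alpha
print(f"diff={mean_diff:+.3f}  p={p_value:.4f}  regressed={regressed}")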

🔗 AgentLens Integration

Import real agent sessions from AgentLens as test suites:

agenteval import --from agentlens --db sessions.db --output suite.yaml --grader contains
# Imported 42 cases → suite.yaml

Turn production traffic into regression tests - no manual test writing needed.

💰 Cost & Latency Tracking

Every eval tracks tokens and cost. Your agent callable returns AgentResult with tokens_in, tokens_out, and cost_usd, and AgentEval aggregates them per run.
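
A rough sketch of that aggregation is below. The AgentResult fields are the ones documented above; the aggregation itself is illustrative, not AgentEval's internal code.

# Illustrative per-run aggregation over AgentResult values.
from agenteval.models import AgentResult

results = [
    AgentResult(output="The answer is 4.", tools_called=[], tokens_in=12, tokens_out=8, cost_usd=0.0003),
    AgentResult(output="Paris is the capital of France.", tools_called=[], tokens_in=15, tokens_out=9, cost_usd=0.0004),
]

total_cost = sum(r.cost_usd for r in results)
total_tokens = sum(r.tokens_in + r.tokens_out for r in results)
print(f"Cost: ${total_cost:.4f}  Tokens: {total_tokens}")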


YAML Suite Format

Full annotated example:

name: my-agent-tests           # Suite name (shown in reports)
agent: my_module:my_agent      # Default agent callable (module:function)

defaults:                       # Defaults applied to all cases
  grader: contains
  grader_config:
    ignore_case: true

cases:
  - name: basic-math            # Unique case name
    input: "What is 2 + 2?"     # Input passed to agent
    expected:                    # Grader-specific expected values
      output_contains: ["4"]
    grader: contains             # Override default grader
    tags: [math, basic]          # Tags for filtering (--tag math)

  - name: tool-usage
    input: "Search for weather"
    expected:
      tools_called: ["web_search"]
    grader: tool-check

  - name: quality-check
    input: "Explain gravity"
    expected:
      criteria: "Should mention Newton or Einstein, be scientifically accurate"
    grader: llm-judge
    grader_config:
      model: gpt-4o-mini         # LLM judge model
      api_base: https://api.openai.com/v1

  - name: custom-validation
    input: "Generate a JSON object"
    expected: {}
    grader: custom
    grader_config:
      function: my_graders:validate_json  # Your grader function

CLI Reference

agenteval run

agenteval run --suite suite.yaml [--agent module:fn] [--verbose] [--tag math] [--timeout 30] [--db agenteval.db]
  • --suite - Path to YAML suite file (required)
  • --agent - Override the agent callable from the suite
  • --verbose / -v - Show per-case pass/fail details
  • --tag - Filter cases by tag (repeatable)
  • --timeout - Per-case timeout in seconds (default: 30)
  • --db - SQLite database path (default: agenteval.db)

Exit code is 1 if any case fails.

agenteval list

agenteval list [--suite-filter name] [--limit 20] [--db agenteval.db]
$ agenteval list --limit 5

ID             Suite                Passed   Failed   Rate     Created
--------------------------------------------------------------------------------
aeccd5e53f03   math-agent-demo      2        3        40%      2026-02-12T21:12:12
4f3e380f622c   math-agent-demo      3        2        60%      2026-02-12T21:12:12
bd4ef3a0727b   math-agent-demo      1        4        20%      2026-02-12T21:12:12
e2ca43e99852   math-agent-demo      3        2        60%      2026-02-12T21:12:11
32ed650cab6d   math-agent-demo      2        3        40%      2026-02-12T21:12:11

agenteval compare

agenteval compare RUN_A RUN_B [--alpha 0.05] [--threshold 0.0] [--stats/--no-stats]
agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2   # Multi-run comparison

agenteval import

agenteval import --from agentlens --db sessions.db --output suite.yaml [--grader contains] [--limit 100]

Grader Reference

exact

Compares result.output exactly with expected.output. Config: ignore_case: bool.

expected:
  output: "The answer is 42."
grader: exact
grader_config:
  ignore_case: true

contains

Checks that all substrings in expected.output_contains appear in the output.

expected:
  output_contains: ["Paris", "France"]
grader: contains

regex

Matches result.output against expected.pattern (Python regex). Config: flags: [IGNORECASE, DOTALL, MULTILINE].

expected:
  pattern: "\\d+\\.\\d+"
grader: regex
grader_config:
  flags: [IGNORECASE]

tool-check

Verifies expected tools were called. Config: ordered: bool for sequence matching.

expected:
  tools_called: ["web_search", "calculator"]
grader: tool-check
grader_config:
  ordered: true
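
To make the ordered flag concrete, the sketch below contrasts unordered membership with in-order subsequence matching. The logic is illustrative, not the grader's exact implementation.

# Illustrative contrast between unordered and ordered tool matching.
def tools_match(expected: list[str], called: list[str], ordered: bool = False) -> bool:
    if not ordered:
        # Unordered: every expected tool appears somewhere in the calls.
        return all(tool in called for tool in expected)
    # Ordered: expected tools appear as a subsequence of the calls, in order.
    remaining = iter(called)
    return all(tool in remaining for tool in expected)

calls = ["web_search", "unit_convert", "calculator"]
print(tools_match(["web_search", "calculator"], calls, ordered=True))   # True: in order
print(tools_match(["calculator", "web_search"], calls, ordered=True))   # False: out of order
print(tools_match(["calculator", "web_search"], calls, ordered=False))  # True: both present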

llm-judge

Sends the input, output, and criteria to an LLM for evaluation. Requires OPENAI_API_KEY or any OpenAI-compatible API (point api_base at it in grader_config).

expected:
  criteria: "Response should be helpful, accurate, and concise"
grader: llm-judge
grader_config:
  model: gpt-4o-mini
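
For orientation, here is roughly what an LLM-judge check looks like against an OpenAI-compatible endpoint. The prompt wording and pass/fail parsing below are illustrative assumptions, not AgentEval's actual judge prompt.

# Illustrative LLM-judge call; not AgentEval's internal prompt or parsing.
import os
from openai import OpenAI

def judge(input_text: str, output_text: str, criteria: str, model: str = "gpt-4o-mini") -> bool:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # base_url can point at any compatible API
    prompt = (
        f"Input: {input_text}\n"
        f"Agent output: {output_text}\n"
        f"Criteria: {criteria}\n"
        "Does the output satisfy the criteria? Reply with PASS or FAIL and one sentence of reasoning."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("PASS")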

custom

Imports and calls your own grader function. Must accept (case: EvalCase, result: AgentResult) -> GradeResult.

grader: custom
grader_config:
  function: my_module:my_grader
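
A minimal custom grader might look like the sketch below, which checks that the agent returned valid JSON. The GradeResult constructor used here (field names like passed, score, reason) and the import paths for EvalCase and GradeResult are assumptions; check agenteval.models for the real definitions.

# my_graders.py - minimal custom grader sketch.
# NOTE: GradeResult's fields (passed, score, reason) and the import locations of
# EvalCase/GradeResult are assumptions; consult agenteval.models for the real API.
import json

from agenteval.models import AgentResult, EvalCase, GradeResult

def validate_json(case: EvalCase, result: AgentResult) -> GradeResult:
    try:
        json.loads(result.output)
        return GradeResult(passed=True, score=1.0, reason="output parses as JSON")
    except (json.JSONDecodeError, TypeError) as exc:
        return GradeResult(passed=False, score=0.0, reason=f"invalid JSON: {exc}")

Reference it from the suite as function: my_graders:validate_json, as in the YAML example above.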

Contributing

Contributions welcome! This project uses:

  • pytest for testing (127 tests passing)
  • ruff for linting
  • src layout (src/agenteval/)
git clone https://github.com/amitpaz1/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest

🧰 AgentKit Ecosystem

Project              Description
AgentLens            Observability & audit trail for AI agents
Lore                 Cross-agent memory and lesson sharing
AgentGate            Human-in-the-loop approval gateway
FormBridge           Agent-human mixed-mode forms
AgentEval            Testing & evaluation framework ⬅️ you are here
agentkit-mesh        Agent discovery & delegation
agentkit-cli         Unified CLI orchestrator
agentkit-guardrails  Reactive policy guardrails

License

MIT - see LICENSE.
