md-evals


Evaluate AI skills with scientific rigor. Compare prompts with and without injected context using A/B testing, multiple LLM providers, and production-grade evaluation techniques.

Lightweight CLI tool for evaluating AI skills (SKILL.md) with Control vs Treatment testing using LiteLLM.

Inspired by LangChain skills-benchmarks.

📚 Full Documentation | Quick Start | GitHub Models Guide | Examples

Why md-evals?

Building AI applications that work reliably requires scientific validation. md-evals makes it easy:

| Challenge | Solution |
|---|---|
| 🤔 "Does my skill actually help?" | A/B test Control vs Treatment automatically |
| 💰 "Can't afford to evaluate with expensive APIs?" | Use free GitHub Models (Claude, GPT-4, DeepSeek) |
| 📊 "How do I know if my results are real?" | Hybrid regex + LLM-as-judge evaluation |
| 🔄 "Evaluating 100+ test cases manually is tedious" | Parallel workers, beautiful terminal output, JSON/Markdown export |
| ✅ "How do I prevent bad skills from merging?" | Built-in linter (400-line limit, best practices) |
| 🏗️ "Will this integrate with my CI/CD?" | Simple YAML config, exit codes for automation |

Features

  • A/B Testing: Compare Control (no skill) vs Treatment (with skill) prompts side-by-side
  • 🎯 Multiple Treatments: Run wildcards like LCC_* to test different skill variations in one go
  • 🧠 Hybrid Evaluation: Combine regex pattern matching + LLM-as-a-judge for flexible validation
  • 🚀 Multiple LLM Providers: GitHub Models (free!), OpenAI, Anthropic, LiteLLM, and more
  • 📋 Linter: Enforce 400-line limit, quality checks, and best practices for SKILL.md
  • 📊 Rich Output: Beautiful terminal tables with pass rates, comparisons, and statistics
  • 💾 Export: JSON, Markdown, or table format for reporting and analysis
  • Parallel Execution: Run multiple tests concurrently for faster feedback
  • 🎉 GitHub Models Support: Use free/low-cost models (Claude 3.5, GPT-4, DeepSeek, Grok)
  • 🔬 Deterministic Graders: File, command, and state graders for side-effect evaluation
  • 🔄 Three-Phase Pipeline: Structure → Analyze → Generate sequential evaluation
  • 📜 Contract Assertions: Define output contracts and validate A/B variants against them
  • 🏗️ Workspace Runner: Isolated temp workspaces for reproducible task evaluation

Installation

Using uv (Recommended)

# Clone the repository
git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install with uv (fastest)
uv sync

# Activate virtual environment
source .venv/bin/activate

Using pip

git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install dependencies
pip install -e .

Requirements: Python 3.12+

Quick Start

1. Initialize your evaluation

md-evals init

This creates:

  • eval.yaml - Your evaluation config
  • SKILL.md - Template for your AI skill

2. Run evaluation

md-evals run

3. Check your skill

md-evals lint        # Validate SKILL.md
md-evals list        # List treatments and tests

⏱️ Complete example in 2 minutes

# 1. Create evaluation
md-evals init

# 2. Set your token, then preflight auth (gh auth login works as a fallback)
export GITHUB_TOKEN="github_pat_..."
md-evals smoke --provider github-models

# 3. Run with GitHub Models (free!)
md-evals run --provider github-models --model claude-3.5-sonnet --config eval.yaml

# 4. View results
# → Beautiful table with Control vs Treatment comparison
# → Pass rates and statistics

🎉 GitHub Models: Free LLM Evaluation

Evaluate your skills completely free using GitHub's Models API (public preview):

Setup (One-time)

# Preferred: set GITHUB_TOKEN directly
export GITHUB_TOKEN="github_pat_..."

# Fallback for users already logged in with GitHub CLI
gh auth login

# Verify auth preflight before first run
md-evals smoke --provider github-models --config examples/eval_with_github_models.yaml

Run Evaluation with Free Models

# Use Claude 3.5 Sonnet (200k context, free!)
md-evals run --config eval.yaml --provider github-models --model claude-3.5-sonnet

# Or use GPT-4o
md-evals run --config eval.yaml --provider github-models --model gpt-4o

# Or use DeepSeek R1 (fastest)
md-evals run --config eval.yaml --provider github-models --model deepseek-r1

Available Models

| Model | Context | Best For | Cost |
|---|---|---|---|
| claude-3.5-sonnet | 200k | Reasoning, complex tasks | 🟢 Free |
| gpt-4o | 128k | General-purpose, balanced | 🟢 Free |
| deepseek-r1 | 64k | Speed, cost efficiency | 🟢 Free |
| grok-3 | 128k | Latest, edge cases | 🟢 Free |

Rate Limits: 15 requests/min (public preview) · Full Guide →

Configuration

Create eval.yaml to define your evaluation. Here's a complete example:

name: "My AI Skill Evaluation"
version: "1.0"
description: "Evaluate skill effectiveness with Control vs Treatment"

defaults:
  model: "claude-3.5-sonnet"
  provider: "github-models"  # Free! (or: openai, anthropic, etc.)
  temperature: 0.7
  max_tokens: 500

treatments:
  CONTROL:
    description: "Baseline: No skill injected"
    skill_path: null
  
  WITH_SKILL:
    description: "Treatment: With skill injected"
    skill_path: "./SKILL.md"
  
  WITH_SKILL_V2:
    description: "Alternative skill variant"
    skill_path: "./SKILL_V2.md"

tests:
  - name: "test_basic_greeting"
    prompt: "Greet {name} and ask how they're doing."
    variables:
      name: "Alice"
    evaluators:
      - type: "regex"
        name: "has_greeting"
        pattern: "(hello|hi|greetings)"
      - type: "llm"
        name: "is_friendly"
        criteria: "Does the response feel warm and friendly?"
  
  - name: "test_complex_reasoning"
    prompt: "Explain {concept} to a {audience}."
    variables:
      concept: "quantum computing"
      audience: "5-year-old child"
    evaluators:
      - type: "llm"
        name: "is_age_appropriate"
        criteria: "Is the explanation suitable for a 5-year-old?"

Key Sections

| Section | Purpose |
|---|---|
| `defaults` | LLM model, provider, temperature, token limits |
| `treatments` | Different skill configurations to compare |
| `tests` | Test cases with prompts, variables, and evaluators |

Evaluators

  • type: regex - Pattern matching (fast, deterministic)
  • type: llm - LLM-as-judge (flexible, intelligent)
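As an illustration, the verdict of a regex evaluator such as `has_greeting` from the config above comes down to a pattern search over the response text. This is a minimal standalone sketch of that idea, not md-evals' actual implementation (flags, scoring, and match semantics may differ):

```python
import re

def regex_verdict(pattern: str, response: str) -> bool:
    """Sketch of a regex evaluator: pass when the response matches the pattern.

    Assumes a case-insensitive substring search; the real md-evals
    evaluator may use different flags or match semantics.
    """
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

# Pattern from the eval.yaml example above
print(regex_verdict(r"(hello|hi|greetings)", "Hi Alice! How are you doing?"))  # True
print(regex_verdict(r"(hello|hi|greetings)", "Good day, Alice."))              # False
```

Because the check is deterministic, it is cheap to run on every test case before falling back to the slower LLM-as-judge evaluators.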

Commands

| Command | Purpose |
|---|---|
| `md-evals init` | 🚀 Scaffold eval.yaml and SKILL.md templates |
| `md-evals run` | ▶️ Run evaluations (Control vs Treatment) |
| `md-evals run --treatment WITH_SKILL` | 🎯 Run a specific treatment |
| `md-evals lint` | ✅ Validate SKILL.md (400-line limit, best practices) |
| `md-evals list` | 📋 List available treatments and tests |
| `md-evals list-models` | 🤖 List available models per provider |
| `md-evals smoke --provider github-models --config eval.yaml` | 🧪 Local preflight (provider, config, auth) |

Common Workflows

# Evaluate with default provider
md-evals run

# Use specific provider and model
md-evals run --provider github-models --model claude-3.5-sonnet

# Run only specific treatment
md-evals run --treatment WITH_SKILL

# Export results as JSON
md-evals run --output json > results.json

# Run with 4 parallel workers
md-evals run -n 4

# Repeat each test 5 times (for statistical significance)
md-evals run --count 5

# Export to Markdown report
md-evals run --output markdown > report.md

# Validate before running
md-evals lint
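When comparing pass rates from repeated runs (`--count`), a two-proportion z-test is one common way to judge whether a Control vs Treatment difference is real rather than noise. This standalone sketch is not part of md-evals, and the counts below are made-up illustration values:

```python
from math import sqrt, erf

def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test: returns (z statistic, two-sided p-value)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    p_pool = (pass_a + pass_b) / (n_a + n_b)            # pooled pass rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (expressed via erf)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: Control passed 60/100 tests, Treatment passed 75/100
z, p = two_proportion_z(60, 100, 75, 100)
print(f"z={z:.2f}, p={p:.3f}")
```

A small p-value (conventionally below 0.05) suggests the treatment's higher pass rate is unlikely to be chance; with few repetitions per test, the difference needs to be large before it clears that bar.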

Full Options Reference

run

  • -c, --config FILE - Config file (default: eval.yaml)
  • -t, --treatment TREATMENT - Run specific treatment(s)
  • -m, --model MODEL - Override model
  • -p, --provider PROVIDER - Provider: github-models, openai, anthropic, etc.
  • -n WORKERS - Parallel workers (default: 1)
  • --count N - Repeat tests N times for statistical validation
  • -o, --output FORMAT - Output format: table (default), json, markdown
  • --no-lint - Skip SKILL.md linting
  • --debug - Enable debug logging

list-models

  • -p, --provider PROVIDER - Filter by provider
  • -v, --verbose - Show metadata (temperature ranges, costs, rate limits)

Deterministic Graders

Beyond regex and LLM-as-judge evaluators, md-evals includes deterministic graders that check side effects of agent task execution (files created, commands run, workspace state) rather than LLM output text.

All graders implement the Grader protocol (md_evals.graders.base) and return EvaluatorResult, so they integrate seamlessly with the existing reporter and pipeline infrastructure.
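Conceptually, a grader is any object with a name and a `grade` method that inspects a workspace and returns a result. The sketch below is a standalone approximation of that shape for illustration; the actual `Grader` protocol and `EvaluatorResult` in `md_evals.graders.base` may carry additional fields:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EvaluatorResult:
    # Approximation of md-evals' result type: a name, a verdict, a reason.
    name: str
    passed: bool
    reason: str = ""

class NonEmptyWorkspaceGrader:
    """Toy grader: passes when the workspace contains at least one file."""

    def __init__(self, name: str):
        self.name = name

    def grade(self, workspace: Path) -> EvaluatorResult:
        files = [p for p in workspace.rglob("*") if p.is_file()]
        return EvaluatorResult(
            name=self.name,
            passed=bool(files),
            reason=f"{len(files)} file(s) found",
        )
```

Because every grader returns the same result type, custom graders plug into the reporter and pipeline the same way the built-in ones do.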

File Graders

| Grader | Purpose |
|---|---|
| `FileExistsGrader` | Assert a file exists (or does not exist) in the workspace |
| `FileContentGrader` | Assert file content matches a regex pattern or exact string |
| `FileSizeGrader` | Assert file size is within a min/max byte range |

Command Grader

CommandGrader runs a shell command inside the workspace and asserts on exit code and optionally stdout content. Useful for verifying that generated code compiles, tests pass, or scripts produce expected output.

from md_evals.graders import CommandGrader

grader = CommandGrader(
    name="tests_pass",
    command="python -m pytest tests/",
    expected_exit_code=0,
    expected_output="passed",
    timeout=30,
)

State Grader

StateGrader compares workspace file-system state before and after task execution. It tracks created, deleted, and modified files using modification time snapshots.

from md_evals.graders import StateGrader

grader = StateGrader(
    name="check_state",
    expected_created=["output.json"],
    expected_deleted=["temp.txt"],
    expected_modified=["config.yaml"],
)
# Call grader.snapshot(workspace) before task, grader.grade(workspace) after
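The snapshot-and-diff mechanism described above can be sketched in a few lines of standalone Python. This illustrates the approach (record modification times, then classify changes), not md-evals' implementation:

```python
from pathlib import Path

def snapshot(workspace: Path) -> dict[str, float]:
    """Map each file's workspace-relative path to its modification time."""
    return {str(p.relative_to(workspace)): p.stat().st_mtime
            for p in workspace.rglob("*") if p.is_file()}

def diff(before: dict[str, float], after: dict[str, float]) -> dict[str, list[str]]:
    """Classify files as created, deleted, or modified between two snapshots."""
    return {
        "created": sorted(after.keys() - before.keys()),
        "deleted": sorted(before.keys() - after.keys()),
        "modified": sorted(p for p in before.keys() & after.keys()
                           if before[p] != after[p]),
    }
```

One caveat of mtime-based diffing: on file systems with coarse timestamp granularity, a file rewritten within the same tick can appear unmodified, so content hashing is sometimes used instead.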

Three-Phase Evaluation Pipeline

The ThreePhaseEvaluator (md_evals.three_phase) orchestrates evaluation in three sequential phases with fail-fast behavior:

  1. Structure — Validate input/output format (JSON valid? fields present? types correct?)
  2. Analyze — Evaluate quality of analysis (keyword coverage, section coverage, minimum length)
  3. Generate — Evaluate final output quality (pattern matching, constraint checking)

If a required phase fails, subsequent phases are skipped. Each phase has configurable weight for scoring.

Phase Graders

| Phase | Grader | Purpose |
|---|---|---|
| Structure | `JSONValidGrader` | Validate JSON format (file or string mode) |
| Structure | `RequiredFieldsGrader` | Check required fields exist (dot-notation for nesting) |
| Structure | `FieldTypeGrader` | Validate field types (str, int, float, bool, list, dict) |
| Analyze | `KeywordCoverageGrader` | Check keyword/concept coverage with configurable threshold |
| Analyze | `SectionCoverageGrader` | Check for expected sections/headings via regex patterns |
| Analyze | `MinLengthGrader` | Enforce minimum word count and/or character count |
| Generate | `OutputMatchGrader` | Match regex patterns (AND logic, with negate option) |
| Generate | `ConstraintGrader` | Enforce max words, max chars, and forbidden patterns |

Example

from md_evals.three_phase import ThreePhaseEvaluator, PhaseConfig
from md_evals.graders import JSONValidGrader, RequiredFieldsGrader, KeywordCoverageGrader

evaluator = ThreePhaseEvaluator(
    structure=PhaseConfig(
        graders=[
            JSONValidGrader(name="valid_json", path="output.json"),
            RequiredFieldsGrader(name="has_fields", path="output.json",
                                 required_fields=["name", "metadata.version"]),
        ],
        weight=0.3,
        required=True,
    ),
    analyze=PhaseConfig(
        graders=[KeywordCoverageGrader(name="covers_topics",
                                        path="output.json",
                                        keywords=["architecture", "testing"],
                                        pass_threshold=0.8)],
        weight=0.4,
    ),
    generate=PhaseConfig(graders=[], weight=0.3),
)
result = evaluator.evaluate(workspace_path)
# result.passed, result.overall_score, result.failed_phase

Contract-Based Assertions

Define structural contracts for outputs and validate them deterministically using ContractAssertionGrader. The ABContractGrader extends this for A/B testing — both variants must satisfy the same contract while producing different content.

OutputContract

from md_evals.graders import OutputContract, ContractAssertionGrader, ABContractGrader

contract = OutputContract(
    required_sections=[r"^## Purpose", r"^## Implementation"],
    format_rules=[r"```python"],
    forbidden_patterns=[r"TODO", r"FIXME"],
    min_words=50,
    max_words=2000,
)

# Single output validation
grader = ContractAssertionGrader(
    name="contract_check",
    contract=contract,
    path="output.md",        # file mode
    # content="...",         # or content mode
)

# A/B contract validation
ab_grader = ABContractGrader(
    name="ab_contract",
    contract=contract,
    variant_a="Control output...",
    variant_b="Treatment output...",
)

Workspace Runner

WorkspaceRunner (md_evals.workspace) manages the full lifecycle for deterministic evaluation in isolated temporary directories:

  1. Create temporary workspace
  2. Set up files (SetupFile with path and content)
  3. Snapshot state (for StateGrader baselines)
  4. Execute the task command (with timeout)
  5. Apply all graders
  6. Cleanup

Example

from md_evals.workspace import WorkspaceRunner, WorkspaceConfig, SetupFile
from md_evals.graders import FileExistsGrader, CommandGrader

config = WorkspaceConfig(
    name="test_code_generation",
    setup_files=[
        SetupFile(path="requirements.txt", content="pytest\n"),
        SetupFile(path="src/main.py", content="print('hello')"),
    ],
    task_command="python src/main.py > output.txt",
    graders=[
        FileExistsGrader(name="output_created", path="output.txt"),
        CommandGrader(name="syntax_ok", command="python -m py_compile src/main.py"),
    ],
    task_timeout=60,
)

runner = WorkspaceRunner()
result = runner.run(config)
# result.passed, result.grader_results, result.task_exit_code

Development

Setup

# Install with dev dependencies
uv sync --extra dev

# Activate virtual environment
source .venv/bin/activate

Testing

md-evals has a comprehensive test suite with 94.95% code coverage and 321 passing tests.

Quick Start

# Run all tests
pytest

# Run tests in parallel (73% faster)
pytest -n 4

# View coverage report
pytest --cov=md_evals --cov-report=html
open htmlcov/index.html

Test Documentation

Complete testing guides for different audiences:

| Guide | Audience | Purpose |
|---|---|---|
| TESTING.md | Everyone | How to run tests, markers, parallel execution |
| TEST_DEVELOPMENT_GUIDE.md | Developers | Writing new tests, fixtures, mocking strategies |
| TEST_ARCHITECTURE.md | Tech Leads | Test organization, fixture hierarchy, isolation patterns |
| TEST_CI_INTEGRATION.md | DevOps/CI Engineers | CI/CD setup, Docker, reporting, multiple platforms |
| TEST_QUICK_REFERENCE.md | All | Command cheat sheet, one-liners, common patterns |
| TEST_COVERAGE_ANALYSIS.md | Maintainers | Coverage gaps, improvement roadmap, module analysis |

Common Testing Tasks

# Run only unit tests (fast feedback)
pytest -m unit

# Run only integration tests
pytest -m integration

# Run specific test file
pytest tests/test_github_models_provider.py -v

# Debug a specific test
pytest tests/test_engine.py::TestExecutionEngine::test_run_basic -vvv --pdb

# Run tests that match pattern
pytest -k "github_models"

# Skip slow tests (faster local development)
pytest -m "not slow"

# Generate all reports
pytest -n 4 \
  --cov=md_evals \
  --cov-report=html \
  --cov-report=xml \
  --cov-report=json \
  --junit-xml=test-results.xml

Test Coverage

  • Overall: 94.95% (above the 90% production standard)
  • Critical modules: >95% (engine, evaluators, config)
  • Test count: 321 tests (unit, integration, E2E, performance)
  • Execution time: 6.63s parallel / 22.09s serial

Test Structure

tests/
├── conftest.py                    # Shared fixtures and config
├── test_cli.py                    # CLI command tests (100+ tests)
├── test_engine.py                 # Core evaluation engine
├── test_evaluator.py              # Regex & LLM evaluators
├── test_github_models_provider.py # Provider tests (43 tests)
├── test_e2e_workflow.py          # End-to-end workflow tests
├── test_linter.py                 # SKILL.md validation
├── test_reporter.py               # Report generation
└── ... (10+ test files total)

Performance

| Configuration | Time | Speedup |
|---|---|---|
| Serial | 22.09s | baseline |
| Parallel (4 workers) | 6.63s | 73% faster |
| Unit tests only | ~5s | 78% faster |
| Fast tests (no slow) | ~10s | 55% faster |

For more details, see TESTING.md.

Project Structure

md_evals/
├── cli.py                    # Command-line interface
├── engine.py                 # Evaluation engine (A/B testing)
├── llm.py                    # LLM provider interface
├── config.py                 # YAML config parsing
├── three_phase.py            # Three-phase evaluation pipeline
├── workspace.py              # WorkspaceRunner for isolated evaluation
├── providers/                # LLM provider implementations
│   ├── github_models.py     # GitHub Models (free!)
│   ├── openai_provider.py
│   ├── anthropic_provider.py
│   └── litellm_provider.py
├── evaluators/               # Evaluation strategies
│   ├── regex_evaluator.py
│   └── llm_evaluator.py
├── graders/                  # Deterministic graders
│   ├── base.py              # Grader protocol
│   ├── file_graders.py      # FileExists, FileContent, FileSize
│   ├── command_grader.py    # CommandGrader (shell commands)
│   ├── state_grader.py      # StateGrader (file-system diffs)
│   ├── structure_grader.py  # JSONValid, RequiredFields, FieldType
│   ├── analysis_grader.py   # KeywordCoverage, SectionCoverage, MinLength
│   ├── generation_grader.py # OutputMatch, ConstraintGrader
│   └── contract_grader.py   # OutputContract, ContractAssertion, ABContract
└── pipeline/                 # Plugin evaluation pipeline
    ├── pipeline.py
    ├── runner.py
    └── ...

tests/
├── test_engine.py
├── test_github_models_provider.py  # 43 tests
├── test_provider_registry.py       # 11 tests
└── ...

Community & Support

📖 Documentation

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Fork → Branch → Pull Request workflow
  • Code style guidelines (Ruff, 100 char lines)
  • Testing requirements (>80% coverage)
  • Conventional Commits format

📋 Community

📝 License

MIT
