Evaluate AI skills with scientific rigor. Compare prompts with and without injected context using A/B testing, multiple LLM providers, and production-grade evaluation techniques.
Lightweight CLI tool for evaluating AI skills (SKILL.md) with Control vs Treatment testing using LiteLLM.
Inspired by LangChain skills-benchmarks.
📚 Full Documentation | Quick Start | GitHub Models Guide | Examples
Building AI applications that work reliably requires scientific validation. md-evals makes it easy:
| Challenge | Solution |
|---|---|
| 🤔 "Does my skill actually help?" | A/B test Control vs Treatment automatically |
| 💰 "Can't afford to evaluate with expensive APIs?" | Use free GitHub Models (Claude, GPT-4, DeepSeek) |
| 📊 "How do I know if my results are real?" | Hybrid regex + LLM-as-judge evaluation |
| 🔄 "Evaluating 100+ test cases manually is tedious" | Parallel workers, beautiful terminal output, JSON/Markdown export |
| ✅ "How do I prevent bad skills from merging?" | Built-in linter (400-line limit, best practices) |
| 🏗️ "Will this integrate with my CI/CD?" | Simple YAML config, exit codes for automation |
- ✨ A/B Testing: Compare Control (no skill) vs Treatment (with skill) prompts side-by-side
- 🎯 Multiple Treatments: Run wildcards like `LCC_*` to test different skill variations in one go
- 🧠 Hybrid Evaluation: Combine regex pattern matching + LLM-as-a-judge for flexible validation
- 🚀 Multiple LLM Providers: GitHub Models (free!), OpenAI, Anthropic, LiteLLM, and more
- 📋 Linter: Enforce 400-line limit, quality checks, and best practices for SKILL.md
- 📊 Rich Output: Beautiful terminal tables with pass rates, comparisons, and statistics
- 💾 Export: JSON, Markdown, or table format for reporting and analysis
- ⚡ Parallel Execution: Run multiple tests concurrently for faster feedback
- 🎉 GitHub Models Support: Use free/low-cost models (Claude 3.5, GPT-4, DeepSeek, Grok)
- 🔬 Deterministic Graders: File, command, and state graders for side-effect evaluation
- 🔄 Three-Phase Pipeline: Structure → Analyze → Generate sequential evaluation
- 📜 Contract Assertions: Define output contracts and validate A/B variants against them
- 🏗️ Workspace Runner: Isolated temp workspaces for reproducible task evaluation
```bash
# Clone the repository
git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install with uv (fastest)
uv sync

# Activate virtual environment
source .venv/bin/activate
```

Or with pip:

```bash
git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install dependencies
pip install -e .
```

Requirements: Python 3.12+
```bash
md-evals init
```

This creates:

- `eval.yaml` - Your evaluation config
- `SKILL.md` - Template for your AI skill

```bash
md-evals run
md-evals lint  # Validate SKILL.md
md-evals list  # List treatments and tests
```

```bash
# 1. Create evaluation
md-evals init

# 2. Preflight auth (env var first, gh login fallback)
md-evals smoke --provider github-models

# 3. Run with GitHub Models (free!)
export GITHUB_TOKEN="github_pat_..."
md-evals run --provider github-models --model claude-3.5-sonnet --config eval.yaml

# 4. View results
# → Beautiful table with Control vs Treatment comparison
# → Pass rates and statistics
```

Evaluate your skills completely free using GitHub's Models API (public preview):
```bash
# Preferred: set GITHUB_TOKEN directly
export GITHUB_TOKEN="github_pat_..."

# Fallback for users already logged in with GitHub CLI
gh auth login

# Verify auth preflight before first run
md-evals smoke --provider github-models --config examples/eval_with_github_models.yaml
```

```bash
# Use Claude 3.5 Sonnet (200k context, free!)
md-evals run --config eval.yaml --provider github-models --model claude-3.5-sonnet

# Or use GPT-4o
md-evals run --config eval.yaml --provider github-models --model gpt-4o

# Or use DeepSeek R1 (fastest)
md-evals run --config eval.yaml --provider github-models --model deepseek-r1
```

| Model | Context | Best For | Cost |
|---|---|---|---|
| `claude-3.5-sonnet` | 200k | Reasoning, complex tasks | 🟢 Free |
| `gpt-4o` | 128k | General-purpose, balanced | 🟢 Free |
| `deepseek-r1` | 64k | Speed, cost efficiency | 🟢 Free |
| `grok-3` | 128k | Latest, edge cases | 🟢 Free |
Rate Limits: 15 requests/min (public preview) · Full Guide →
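When running large suites against the preview endpoint, a small client-side throttle helps avoid hitting that limit. This is an illustrative sketch (not part of md-evals); the idea is to call `acquire()` before each model request:

```python
import time
from collections import deque

class RateLimiter:
    """Client-side throttle: at most `max_requests` calls per rolling window."""

    def __init__(self, max_requests: int = 15, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Drop timestamps that have aged out of the rolling window
            while self.timestamps and now - self.timestamps[0] >= self.window:
                self.timestamps.popleft()
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(now)
                return
            # Wait until the oldest request leaves the window
            time.sleep(self.window - (now - self.timestamps[0]))

limiter = RateLimiter()
# Call limiter.acquire() before each model request
```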
Create `eval.yaml` to define your evaluation. Here's a complete example:
```yaml
name: "My AI Skill Evaluation"
version: "1.0"
description: "Evaluate skill effectiveness with Control vs Treatment"

defaults:
  model: "claude-3.5-sonnet"
  provider: "github-models"  # Free! (or: openai, anthropic, etc.)
  temperature: 0.7
  max_tokens: 500

treatments:
  CONTROL:
    description: "Baseline: No skill injected"
    skill_path: null
  WITH_SKILL:
    description: "Treatment: With skill injected"
    skill_path: "./SKILL.md"
  WITH_SKILL_V2:
    description: "Alternative skill variant"
    skill_path: "./SKILL_V2.md"

tests:
  - name: "test_basic_greeting"
    prompt: "Greet {name} and ask how they're doing."
    variables:
      name: "Alice"
    evaluators:
      - type: "regex"
        name: "has_greeting"
        pattern: "(hello|hi|greetings)"
      - type: "llm"
        name: "is_friendly"
        criteria: "Does the response feel warm and friendly?"

  - name: "test_complex_reasoning"
    prompt: "Explain {concept} to a {audience}."
    variables:
      concept: "quantum computing"
      audience: "5-year-old child"
    evaluators:
      - type: "llm"
        name: "is_age_appropriate"
        criteria: "Is the explanation suitable for a 5-year-old?"
```

| Section | Purpose |
|---|---|
| `defaults` | LLM model, provider, temperature, token limits |
| `treatments` | Different skill configurations to compare |
| `tests` | Test cases with prompts, variables, and evaluators |
- `type: regex` - Pattern matching (fast, deterministic)
- `type: llm` - LLM-as-judge (flexible, intelligent)
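Conceptually, the two evaluator types reduce to a deterministic pattern check and a delegated model judgment. A minimal sketch, for intuition only; `EvalResult`, `ask_llm`, and the PASS/FAIL convention here are assumptions for illustration, not md-evals APIs:

```python
import re
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    passed: bool

def regex_evaluator(name: str, pattern: str, response: str) -> EvalResult:
    # Fast and deterministic: pass if the pattern appears anywhere (case-insensitive)
    return EvalResult(name, re.search(pattern, response, re.IGNORECASE) is not None)

def llm_judge_evaluator(name: str, criteria: str, response: str, ask_llm) -> EvalResult:
    # Flexible: delegate the judgment to a model; `ask_llm` stands in for a provider call
    verdict = ask_llm(f"Criteria: {criteria}\nResponse: {response}\nAnswer PASS or FAIL.")
    return EvalResult(name, verdict.strip().upper().startswith("PASS"))

response = "Hello Alice! How are you doing?"
results = [
    regex_evaluator("has_greeting", r"(hello|hi|greetings)", response),
    llm_judge_evaluator("is_friendly", "Does the response feel warm and friendly?",
                        response, ask_llm=lambda prompt: "PASS"),
]
assert all(r.passed for r in results)  # hybrid: every evaluator must pass
```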
| Command | Purpose |
|---|---|
| `md-evals init` | 🚀 Scaffold `eval.yaml` and `SKILL.md` templates |
| `md-evals run` | ▶️ Run the full evaluation (all treatments and tests) |
| `md-evals run --treatment WITH_SKILL` | 🎯 Run a specific treatment |
| `md-evals lint` | ✅ Validate SKILL.md (400-line limit, best practices) |
| `md-evals list` | 📋 List available treatments and tests |
| `md-evals list-models` | 🤖 List available models per provider |
| `md-evals smoke --provider github-models --config eval.yaml` | 🧪 Local preflight (provider, config, auth) |
```bash
# Evaluate with default provider
md-evals run

# Use specific provider and model
md-evals run --provider github-models --model claude-3.5-sonnet

# Run only specific treatment
md-evals run --treatment WITH_SKILL

# Export results as JSON
md-evals run --output json > results.json

# Run with 4 parallel workers
md-evals run -n 4

# Repeat each test 5 times (for statistical significance)
md-evals run --count 5

# Export to Markdown report
md-evals run --output markdown > report.md

# Validate before running
md-evals lint
```

Options for `md-evals run`:

- `-c, --config FILE` - Config file (default: `eval.yaml`)
- `-t, --treatment TREATMENT` - Run specific treatment(s)
- `-m, --model MODEL` - Override model
- `-p, --provider PROVIDER` - Provider: `github-models`, `openai`, `anthropic`, etc.
- `-n WORKERS` - Parallel workers (default: 1)
- `--count N` - Repeat tests N times for statistical validation
- `-o, --output FORMAT` - Output format: `table` (default), `json`, `markdown`
- `--no-lint` - Skip SKILL.md linting
- `--debug` - Enable debug logging

Options for `md-evals list-models`:

- `-p, --provider PROVIDER` - Filter by provider
- `-v, --verbose` - Show metadata (temperature ranges, costs, rate limits)
Beyond regex and LLM-as-judge evaluators, md-evals includes deterministic graders that check side effects of agent task execution (files created, commands run, workspace state) rather than LLM output text.
All graders implement the `Grader` protocol (`md_evals.graders.base`) and return `EvaluatorResult`, so they integrate seamlessly with the existing reporter and pipeline infrastructure.
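As a sketch of what that protocol shape might look like (the field names on `EvaluatorResult` here are assumptions for illustration; the authoritative definitions live in `md_evals.graders.base`):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol

@dataclass
class EvaluatorResult:
    # Assumed minimal shape; the real dataclass may carry more fields
    name: str
    passed: bool
    detail: str = ""

class Grader(Protocol):
    name: str

    def grade(self, workspace: Path) -> EvaluatorResult: ...

# Any object with a `name` and a `grade(workspace)` method satisfies the protocol
class AlwaysPassGrader:
    name = "always_pass"

    def grade(self, workspace: Path) -> EvaluatorResult:
        return EvaluatorResult(name=self.name, passed=True)

result = AlwaysPassGrader().grade(Path("."))
assert result.passed
```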
| Grader | Purpose |
|---|---|
| `FileExistsGrader` | Assert a file exists (or does not exist) in the workspace |
| `FileContentGrader` | Assert file content matches a regex pattern or exact string |
| `FileSizeGrader` | Assert file size is within a min/max byte range |
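The check behind a grader like `FileExistsGrader` is deliberately simple; a self-contained illustration of the idea (not the library's actual code):

```python
import tempfile
from pathlib import Path

def check_file_exists(workspace: Path, rel_path: str, should_exist: bool = True) -> bool:
    # The essence of a FileExistsGrader-style assertion
    return (workspace / rel_path).exists() == should_exist

with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "output.txt").write_text("done")
    assert check_file_exists(ws, "output.txt")                    # file was created
    assert check_file_exists(ws, "temp.bak", should_exist=False)  # and no leftovers
```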
`CommandGrader` runs a shell command inside the workspace and asserts on exit code and optionally stdout content. Useful for verifying that generated code compiles, tests pass, or scripts produce expected output.
```python
from md_evals.graders import CommandGrader

grader = CommandGrader(
    name="tests_pass",
    command="python -m pytest tests/",
    expected_exit_code=0,
    expected_output="passed",
    timeout=30,
)
```

`StateGrader` compares workspace file-system state before and after task execution. It tracks created, deleted, and modified files using modification-time snapshots.

```python
from md_evals.graders import StateGrader

grader = StateGrader(
    name="check_state",
    expected_created=["output.json"],
    expected_deleted=["temp.txt"],
    expected_modified=["config.yaml"],
)

# Call grader.snapshot(workspace) before the task, grader.grade(workspace) after
```

The `ThreePhaseEvaluator` (`md_evals.three_phase`) orchestrates evaluation in three sequential phases with fail-fast behavior:
- Structure — Validate input/output format (JSON valid? fields present? types correct?)
- Analyze — Evaluate quality of analysis (keyword coverage, section coverage, minimum length)
- Generate — Evaluate final output quality (pattern matching, constraint checking)
If a required phase fails, subsequent phases are skipped. Each phase has configurable weight for scoring.
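The fail-fast weighting can be sketched as follows (illustrative only; the real `ThreePhaseEvaluator` runs each phase's graders, but the skip-on-required-failure and weighting idea is the same):

```python
def run_phases(phases):
    # phases: (name, weight, required, passed) tuples; in the real pipeline
    # `passed` would come from running that phase's graders.
    score = 0.0
    for name, weight, required, passed in phases:
        if passed:
            score += weight
        elif required:
            # Fail fast: a required phase failed, skip the remaining phases
            return {"passed": False, "score": score, "failed_phase": name}
    return {"passed": True, "score": score, "failed_phase": None}

# Structure fails, so Analyze and Generate never run
result = run_phases([
    ("structure", 0.3, True, False),
    ("analyze", 0.4, False, True),
    ("generate", 0.3, False, True),
])
assert result["failed_phase"] == "structure" and not result["passed"]
```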
| Phase | Grader | Purpose |
|---|---|---|
| Structure | `JSONValidGrader` | Validate JSON format (file or string mode) |
| Structure | `RequiredFieldsGrader` | Check required fields exist (dot-notation for nesting) |
| Structure | `FieldTypeGrader` | Validate field types (str, int, float, bool, list, dict) |
| Analyze | `KeywordCoverageGrader` | Check keyword/concept coverage with configurable threshold |
| Analyze | `SectionCoverageGrader` | Check for expected sections/headings via regex patterns |
| Analyze | `MinLengthGrader` | Enforce minimum word count and/or character count |
| Generate | `OutputMatchGrader` | Match regex patterns (AND logic, with negate option) |
| Generate | `ConstraintGrader` | Enforce max words, max chars, and forbidden patterns |
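For intuition, the thresholded coverage idea behind `KeywordCoverageGrader` looks roughly like this (a sketch, not the library implementation):

```python
def keyword_coverage(text: str, keywords: list[str], pass_threshold: float = 0.8):
    # Fraction of expected keywords that appear in the text (case-insensitive)
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    coverage = hits / len(keywords) if keywords else 1.0
    return coverage, coverage >= pass_threshold

coverage, passed = keyword_coverage(
    "We cover architecture and testing strategy.",
    ["architecture", "testing", "deployment"],
    pass_threshold=0.6,
)
assert passed  # 2 of 3 keywords found, coverage ~0.67 >= 0.6
```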
```python
from md_evals.three_phase import ThreePhaseEvaluator, PhaseConfig
from md_evals.graders import JSONValidGrader, RequiredFieldsGrader, KeywordCoverageGrader

evaluator = ThreePhaseEvaluator(
    structure=PhaseConfig(
        graders=[
            JSONValidGrader(name="valid_json", path="output.json"),
            RequiredFieldsGrader(
                name="has_fields",
                path="output.json",
                required_fields=["name", "metadata.version"],
            ),
        ],
        weight=0.3,
        required=True,
    ),
    analyze=PhaseConfig(
        graders=[
            KeywordCoverageGrader(
                name="covers_topics",
                path="output.json",
                keywords=["architecture", "testing"],
                pass_threshold=0.8,
            ),
        ],
        weight=0.4,
    ),
    generate=PhaseConfig(graders=[], weight=0.3),
)

result = evaluator.evaluate(workspace_path)
# result.passed, result.overall_score, result.failed_phase
```

Define structural contracts for outputs and validate them deterministically using `ContractAssertionGrader`. The `ABContractGrader` extends this for A/B testing — both variants must satisfy the same contract while producing different content.
````python
from md_evals.graders import OutputContract, ContractAssertionGrader, ABContractGrader

contract = OutputContract(
    required_sections=[r"^## Purpose", r"^## Implementation"],
    format_rules=[r"```python"],
    forbidden_patterns=[r"TODO", r"FIXME"],
    min_words=50,
    max_words=2000,
)

# Single output validation
grader = ContractAssertionGrader(
    name="contract_check",
    contract=contract,
    path="output.md",   # file mode
    # content="...",    # or content mode
)

# A/B contract validation
ab_grader = ABContractGrader(
    name="ab_contract",
    contract=contract,
    variant_a="Control output...",
    variant_b="Treatment output...",
)
````

`WorkspaceRunner` (`md_evals.workspace`) manages the full lifecycle for deterministic evaluation in isolated temporary directories:

- Create a temporary workspace
- Set up files (`SetupFile` with path and content)
- Snapshot state (for `StateGrader` baselines)
- Execute the task command (with timeout)
- Apply all graders
- Cleanup
```python
from md_evals.workspace import WorkspaceRunner, WorkspaceConfig, SetupFile
from md_evals.graders import FileExistsGrader, CommandGrader

config = WorkspaceConfig(
    name="test_code_generation",
    setup_files=[
        SetupFile(path="requirements.txt", content="pytest\n"),
        SetupFile(path="src/main.py", content="print('hello')"),
    ],
    task_command="python src/main.py > output.txt",
    graders=[
        FileExistsGrader(name="output_created", path="output.txt"),
        CommandGrader(name="syntax_ok", command="python -m py_compile src/main.py"),
    ],
    task_timeout=60,
)

runner = WorkspaceRunner()
result = runner.run(config)
# result.passed, result.grader_results, result.task_exit_code
```

```bash
# Install with dev dependencies
uv sync --extra dev

# Activate virtual environment
source .venv/bin/activate
```

md-evals has a comprehensive test suite with 94.95% code coverage and 321 passing tests.
```bash
# Run all tests
pytest

# Run tests in parallel (73% faster)
pytest -n 4

# View coverage report
pytest --cov=md_evals --cov-report=html
open htmlcov/index.html
```

Complete testing guides for different audiences:
| Guide | Audience | Purpose |
|---|---|---|
| TESTING.md | Everyone | How to run tests, markers, parallel execution |
| TEST_DEVELOPMENT_GUIDE.md | Developers | Writing new tests, fixtures, mocking strategies |
| TEST_ARCHITECTURE.md | Tech Leads | Test organization, fixture hierarchy, isolation patterns |
| TEST_CI_INTEGRATION.md | DevOps/CI Engineers | CI/CD setup, Docker, reporting, multiple platforms |
| TEST_QUICK_REFERENCE.md | All | Command cheat sheet, one-liners, common patterns |
| TEST_COVERAGE_ANALYSIS.md | Maintainers | Coverage gaps, improvement roadmap, module analysis |
```bash
# Run only unit tests (fast feedback)
pytest -m unit

# Run only integration tests
pytest -m integration

# Run specific test file
pytest tests/test_github_models_provider.py -v

# Debug a specific test
pytest tests/test_engine.py::TestExecutionEngine::test_run_basic -vvv --pdb

# Run tests that match pattern
pytest -k "github_models"

# Skip slow tests (faster local development)
pytest -m "not slow"

# Generate all reports
pytest -n 4 \
    --cov=md_evals \
    --cov-report=html \
    --cov-report=xml \
    --cov-report=json \
    --junit-xml=test-results.xml
```

- Overall: 94.95% (production standard: 90%)
- Critical modules: >95% (engine, evaluators, config)
- Test count: 321 tests (unit, integration, E2E, performance)
- Execution time: 6.63s parallel / 22.09s serial
```
tests/
├── conftest.py                       # Shared fixtures and config
├── test_cli.py                       # CLI command tests (100+ tests)
├── test_engine.py                    # Core evaluation engine
├── test_evaluator.py                 # Regex & LLM evaluators
├── test_github_models_provider.py    # Provider tests (43 tests)
├── test_e2e_workflow.py              # End-to-end workflow tests
├── test_linter.py                    # SKILL.md validation
├── test_reporter.py                  # Report generation
└── ... (10+ test files total)
```
| Configuration | Time | Speedup |
|---|---|---|
| Serial | 22.09s | — |
| Parallel (4 workers) | 6.63s | 73% |
| Unit tests only | ~5s | 78% |
| Fast tests (no slow) | ~10s | 55% |
For more details, see TESTING.md.
```
md_evals/
├── cli.py                     # Command-line interface
├── engine.py                  # Evaluation engine (A/B testing)
├── llm.py                     # LLM provider interface
├── config.py                  # YAML config parsing
├── three_phase.py             # Three-phase evaluation pipeline
├── workspace.py               # WorkspaceRunner for isolated evaluation
├── providers/                 # LLM provider implementations
│   ├── github_models.py       # GitHub Models (free!)
│   ├── openai_provider.py
│   ├── anthropic_provider.py
│   └── litellm_provider.py
├── evaluators/                # Evaluation strategies
│   ├── regex_evaluator.py
│   └── llm_evaluator.py
├── graders/                   # Deterministic graders
│   ├── base.py                # Grader protocol
│   ├── file_graders.py        # FileExists, FileContent, FileSize
│   ├── command_grader.py      # CommandGrader (shell commands)
│   ├── state_grader.py        # StateGrader (file-system diffs)
│   ├── structure_grader.py    # JSONValid, RequiredFields, FieldType
│   ├── analysis_grader.py     # KeywordCoverage, SectionCoverage, MinLength
│   ├── generation_grader.py   # OutputMatch, ConstraintGrader
│   └── contract_grader.py     # OutputContract, ContractAssertion, ABContract
└── pipeline/                  # Plugin evaluation pipeline
    ├── pipeline.py
    ├── runner.py
    └── ...

tests/
├── test_engine.py
├── test_github_models_provider.py    # 43 tests
├── test_provider_registry.py         # 11 tests
└── ...
```
- Full Guide - Installation, tutorials, API reference
- GitHub Models Setup - Free LLM evaluation guide
- Examples - Real-world usage examples
We welcome contributions! Please see CONTRIBUTING.md for:
- Fork → Branch → Pull Request workflow
- Code style guidelines (Ruff, 100 char lines)
- Testing requirements (>80% coverage)
- Conventional Commits format
- CODE_OF_CONDUCT.md - Our community standards
- SECURITY.md - Vulnerability disclosure process
- Issues - Report bugs or request features
- Discussions - Ask questions and share ideas
MIT