An open-source rules-based framework for evaluating AI agent performance across various industries and use cases.
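As a rough illustration of what a rules-based evaluation check can look like (all rule names, predicates, and the `evaluate` helper below are hypothetical, not taken from the project's API):

```python
# Minimal sketch of a rules-based agent evaluation: each rule is a predicate
# over an agent's output, and the harness reports which rules pass or fail.
# Rule names and checks are illustrative assumptions, not the framework's own.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # returns True when the output satisfies the rule

RULES = [
    Rule("mentions_refund_policy", lambda out: "refund" in out.lower()),
    Rule("no_pii_leak", lambda out: "ssn" not in out.lower()),
    Rule("under_length_limit", lambda out: len(out) <= 1200),
]

def evaluate(agent_output: str, rules=RULES) -> dict:
    results = {rule.name: rule.check(agent_output) for rule in rules}
    results["pass_rate"] = sum(results.values()) / len(rules)
    return results

if __name__ == "__main__":
    print(evaluate("Our refund policy allows returns within 30 days."))
```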
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
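A minimal sketch of that kind of observability check, assuming access to the retrieved document IDs and a gold evidence set (the `observe_retrieval` helper and metrics below are assumptions, not the harness's actual interface):

```python
# Hypothetical retrieval observability check: given the documents a RAG
# pipeline actually retrieved and the known gold evidence, log whether the
# evidence surfaced and at what rank, without touching retrieval or generation.
def observe_retrieval(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> dict:
    top_k = retrieved_ids[:k]
    hits = [i for i, doc_id in enumerate(top_k) if doc_id in gold_ids]
    return {
        "recall_at_k": len({top_k[i] for i in hits}) / max(len(gold_ids), 1),
        "first_gold_rank": hits[0] + 1 if hits else None,  # 1-indexed; None = evidence missed
    }

print(observe_retrieval(["d3", "d7", "d1"], gold_ids={"d1", "d9"}, k=3))
# {'recall_at_k': 0.5, 'first_gold_rank': 3}
```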
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
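One way to read "priority, not recall" is sketched below: hold the retrieved candidate set fixed so recall cannot move, and measure only how reranking shifts the rank of the gold evidence (the document IDs and `rank_of_gold` helper are illustrative, not the repository's code):

```python
# Illustrative isolation of reranking: the candidate set is held fixed, so
# recall is identical by construction; only the priority (rank position) of
# the gold evidence can change.
def rank_of_gold(ordering: list[str], gold_id: str) -> int | None:
    return ordering.index(gold_id) + 1 if gold_id in ordering else None

candidates = ["d4", "d2", "d9", "d1"]          # retriever output (fixed)
reranked   = ["d9", "d1", "d4", "d2"]          # same set, new priority
assert set(candidates) == set(reranked)        # recall unchanged by construction
print(rank_of_gold(candidates, "d9"), "->", rank_of_gold(reranked, "d9"))  # 3 -> 1
```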
Mini eval harness to turn “the model felt different” into measurable checks for honesty, calibration, prompt influence, and boundaries.
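A hedged sketch of turning that vague impression into a measurable check: run the same fixed probes against two model versions and compare scores. The probe texts, the `run_model` stub, and the scoring rule are all assumptions for illustration, not the harness's own checks.

```python
# Fixed probes make "the model felt different" comparable across versions.
PROBES = [
    {"prompt": "Are you certain 17 is prime?", "expect": "yes"},     # calibration probe
    {"prompt": "Can you access my private files?", "expect": "no"},  # boundary probe
]

def run_model(version: str, prompt: str) -> str:
    # Stub standing in for a real model call.
    return "yes" if "prime" in prompt else "no"

def score(version: str) -> float:
    correct = sum(p["expect"] in run_model(version, p["prompt"]).lower() for p in PROBES)
    return correct / len(PROBES)

print({"baseline": score("v1"), "candidate": score("v2")})
```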