An open-source rules-based framework for evaluating AI agent performance across various industries and use cases.
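As a rough illustration of what a rules-based evaluation check can look like (all rule names, predicates, and the `evaluate` helper below are hypothetical, not taken from the project's API):

```python
# Minimal sketch of a rules-based agent evaluation: each rule is a predicate
# over an agent's output, and the harness reports which rules pass or fail.
# Rule names and checks are illustrative assumptions, not the framework's own.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # returns True when the output satisfies the rule

RULES = [
    Rule("mentions_refund_policy", lambda out: "refund" in out.lower()),
    Rule("no_pii_leak", lambda out: "ssn" not in out.lower()),
    Rule("under_length_limit", lambda out: len(out) <= 1200),
]

def evaluate(agent_output: str, rules=RULES) -> dict:
    results = {rule.name: rule.check(agent_output) for rule in rules}
    results["pass_rate"] = sum(results.values()) / len(rules)
    return results

if __name__ == "__main__":
    print(evaluate("Our refund policy allows returns within 30 days."))
```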
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
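A minimal sketch of that kind of observability check, assuming access to the retrieved document IDs and a gold evidence set (the `observe_retrieval` helper and metrics below are assumptions, not the harness's actual interface):

```python
# Hypothetical retrieval observability check: given the documents a RAG
# pipeline actually retrieved and the known gold evidence, log whether the
# evidence surfaced and at what rank, without touching retrieval or generation.
def observe_retrieval(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> dict:
    top_k = retrieved_ids[:k]
    hits = [i for i, doc_id in enumerate(top_k) if doc_id in gold_ids]
    return {
        "recall_at_k": len({top_k[i] for i in hits}) / max(len(gold_ids), 1),
        "first_gold_rank": hits[0] + 1 if hits else None,  # 1-indexed; None = evidence missed
    }

print(observe_retrieval(["d3", "d7", "d1"], gold_ids={"d1", "d9"}, k=3))
# {'recall_at_k': 0.5, 'first_gold_rank': 3}
```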
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
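One way to read "priority, not recall" is sketched below: hold the retrieved candidate set fixed so recall cannot move, and measure only how reranking shifts the rank of the gold evidence (the document IDs and `rank_of_gold` helper are illustrative, not the repository's code):

```python
# Illustrative isolation of reranking: the candidate set is held fixed, so
# recall is identical by construction; only the priority (rank position) of
# the gold evidence can change.
def rank_of_gold(ordering: list[str], gold_id: str) -> int | None:
    return ordering.index(gold_id) + 1 if gold_id in ordering else None

candidates = ["d4", "d2", "d9", "d1"]          # retriever output (fixed)
reranked   = ["d9", "d1", "d4", "d2"]          # same set, new priority
assert set(candidates) == set(reranked)        # recall unchanged by construction
print(rank_of_gold(candidates, "d9"), "->", rank_of_gold(reranked, "d9"))  # 3 -> 1
```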
Mini eval harness to turn “the model felt different” into measurable checks for honesty, calibration, prompt influence, and boundaries.
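A hedged sketch of turning that vague impression into a measurable check: run the same fixed probes against two model versions and compare scores. The probe texts, the `run_model` stub, and the scoring rule are all assumptions for illustration, not the harness's own checks.

```python
# Fixed probes make "the model felt different" comparable across versions.
PROBES = [
    {"prompt": "Are you certain 17 is prime?", "expect": "yes"},     # calibration probe
    {"prompt": "Can you access my private files?", "expect": "no"},  # boundary probe
]

def run_model(version: str, prompt: str) -> str:
    # Stub standing in for a real model call.
    return "yes" if "prime" in prompt else "no"

def score(version: str) -> float:
    correct = sum(p["expect"] in run_model(version, p["prompt"]).lower() for p in PROBES)
    return correct / len(PROBES)

print({"baseline": score("v1"), "candidate": score("v2")})
```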