A config-driven evaluation framework for agentic systems and GenAI applications, built on the Microsoft Foundry SDK. Get started in minutes with YAML-based experiment configuration.
Whether you're evaluating RAG applications, multi-agent systems, or custom GenAI workflows, this framework reduces boilerplate and accelerates iteration.
- Quick Setup: Config-driven YAML files; swap datasets, models, or metrics instantly
- Plug-and-Play: Modular architecture for custom data loaders, evaluators, and pipeline stages
- AI Foundry Evaluators: Built-in RAG metrics (Relevance, Coherence, Groundedness) and Agentic metrics (Task Adherence, Tool Call Accuracy)
- Custom Evaluators: Create domain-specific metrics with simple Python classes
- LLM-as-Judge: Build custom AI judges using prompty templates for flexible scoring
- Experimentation Pipelines: Combine data loading, inference, and evaluation in configurable YAML pipelines
flowchart LR
A[Dataset] --> B[Inference]
B --> C[Evaluation]
C --> D[Results]
This framework provides two core capabilities for rapid AI evaluation:
- Simplified Evaluation SDK Integration: Easily add both built-in Microsoft Foundry evaluators and custom metrics with minimal code
- Pipeline-Based Architecture: Connect your experiments, inference modules, and data loaders through a configurable pipeline defined in YAML
Agentic-Evaluations/
├── src/
│   ├── agent_evaluation/                        # Core evaluation engine
│   │   └── agentic_ops/                         # Runner, client, base evaluator
│   └── evaluations/
│       └── offline/
│           ├── agentic_evaluation/              # Agentic metrics (tool call accuracy, recall@k)
│           ├── ai_judge_evaluation_custom/      # LLM-as-Judge with prompty templates
│           ├── genai_evaluation_foundry/        # Built-in RAG evaluators (Relevance, Coherence)
│           ├── pipeline_experiment_evaluation/  # Full pipeline (data → inference → eval)
│           └── utils/                           # Shared constants and utilities
├── assets/                                      # Architecture diagrams
├── .env.template                                # Environment variable template
├── requirements.txt                             # Python dependencies
└── README.md
Each evaluation folder follows the same layout: datasets/ for input data, evaluator/ for evaluation logic, and an experiment.yaml for configuration. Evaluation outputs are written to the shared src/evaluations/offline/reports/ directory.
- Folder Structure
- Getting Started
- Samples
- Visualization Dashboard
- Configuration Guide
- Creating New Evaluations
- Evaluators Reference
- Pipeline Architecture
- References
- Python 3.11+ and Git
- Azure CLI installed and authenticated
- Microsoft Foundry project with GPT-4o deployment
# 1. Clone and install
git clone https://github.com/Azure-Samples/Agentic-Evaluations.git
cd Agentic-Evaluations
python -m venv .venv
.venv\Scripts\activate # Windows PowerShell
# source .venv/bin/activate # Linux/macOS
pip install -r requirements.txt
# 2. Azure login
az login
az account set --subscription "<your-subscription-id>"
# 3. Configure environment
cp .env.template .env
# Edit .env with your Microsoft Foundry credentials:
# EVAL_AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
# EVAL_AZURE_OPENAI_MODEL=<your-deployment-name>
# EVAL_AZURE_OPENAI_VERSION=2024-12-01-preview
# 4. Run a sample evaluation
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/genai_evaluation_foundry/experiment.yaml
Results: src/evaluations/offline/reports/{run_id}_{eval_dir_name}.json
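Before the first run it can help to confirm that the three settings from step 3 are actually visible to the process. The following is a minimal sketch, not part of the framework: the helper name `missing_env_vars` is hypothetical, and it only inspects the process environment (it does not read `.env` itself, so it assumes the values have been exported or loaded already).

```python
import os

# Variable names are taken from the .env template described in step 3.
REQUIRED_VARS = (
    "EVAL_AZURE_OPENAI_ENDPOINT",
    "EVAL_AZURE_OPENAI_MODEL",
    "EVAL_AZURE_OPENAI_VERSION",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All evaluation settings are present.")
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to test in isolation.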
Standard GenAI/RAG: Built-in evaluators (Relevance, Coherence, Fluency)
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/genai_evaluation_foundry/experiment.yaml
Agentic Systems: Agent invocation accuracy, recall@k, hallucination detection
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/agentic_evaluation/experiment.yaml
Custom AI Judge: LLM-as-Judge with prompty templates
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/ai_judge_evaluation_custom/experiment.yaml
Full Pipeline: Data loading → Inference → Evaluation
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/pipeline_experiment_evaluation/experiment.yaml
All evaluation runs produce a JSON report in src/evaluations/offline/reports/ using the naming pattern {run_id}_{eval_dir_name}.json. The Agentic Evaluation Dashboard is a Streamlit app that reads these reports and renders them as interactive visualizations - no manual parsing required.
python -m streamlit run src/evaluations/offline/reports/dashboard.py
Key capabilities:
- Overview page: aggregate metric gauges and multi-run summary tables for every evaluation type
- Run detail page: pass/fail rates, agent routing analysis, per-row score breakdown, and reasoning drill-downs
- Run comparison page: metric trend charts across multiple runs of the same evaluation
For full usage instructions, gauge scale conventions, and how to extend display names for custom evaluators, see the Dashboard README.
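If you want to peek at a report from a script rather than the dashboard, something along these lines may be enough. It is a sketch under stated assumptions: the helper names `latest_report` and `load_report` are hypothetical, "latest" is judged by file modification time, and the report schema is deliberately not assumed, so only the top-level shape is printed.

```python
import json
from pathlib import Path

# Reports directory as described above.
REPORTS_DIR = Path("src/evaluations/offline/reports")


def latest_report(reports_dir=REPORTS_DIR):
    """Return the most recently modified *.json report, or None if there is none."""
    reports = sorted(Path(reports_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    return reports[-1] if reports else None


def load_report(path):
    """Parse a report file; nothing about its schema is assumed here."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    path = latest_report()
    if path is None:
        print("No reports found in", REPORTS_DIR)
    else:
        report = load_report(path)
        # Print top-level keys for a JSON object, otherwise just the value's type.
        shape = sorted(report) if isinstance(report, dict) else type(report).__name__
        print(path.name, "->", shape)
```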
app_name: Agentic-Evals
experiment_name: My_Evaluation
evaluation:
  run_local: True                # Local execution (recommended)
  input_path: datasets/
  input_file: my_data.jsonl
  output_path: src/evaluations/offline/reports/
  evaluators:                    # Evaluators to run
    relevance: "relevance_evaluator"
    coherence: "coherence_evaluator"
  evaluator_config:              # Map dataset fields to evaluator inputs
    relevance:
      column_mapping:
        query: "${data.query}"
        response: "${data.response}"
pipeline:                        # Pipeline stages
  - base_path: evaluator
    module: eval_main.eval_main
    config_key: evaluation
Key Points:
- ${data.<field>} syntax maps JSONL dataset fields to evaluator parameters
- Evaluator keys become column names in results
- See Built-in Evaluators for parameter requirements
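To make the `${data.<field>}` mapping concrete, here is an illustrative sketch of how such placeholders could be resolved against one JSONL row. This is not the SDK's or the framework's implementation; the function name `resolve_column_mapping` and its error handling are assumptions made for the example.

```python
import json
import re

# Matches the "${data.<field>}" placeholders used under column_mapping.
_PLACEHOLDER = re.compile(r"^\$\{data\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_column_mapping(column_mapping, row):
    """Build evaluator keyword arguments from one dataset row."""
    resolved = {}
    for param, template in column_mapping.items():
        match = _PLACEHOLDER.match(template)
        if match is None:
            raise ValueError(f"Unsupported mapping for {param!r}: {template!r}")
        field = match.group(1)
        if field not in row:
            raise KeyError(f"Dataset row has no field {field!r} (needed for {param!r})")
        resolved[param] = row[field]
    return resolved


mapping = {"query": "${data.query}", "response": "${data.response}"}
row = json.loads(
    '{"query": "What is the weather?", "response": "It is sunny.", "context": "Weather data..."}'
)
print(resolve_column_mapping(mapping, row))
# {'query': 'What is the weather?', 'response': 'It is sunny.'}
```

Fields that are present in the row but not mapped (here, `context`) are simply not passed to the evaluator.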
cp -r src/evaluations/offline/genai_evaluation_foundry src/evaluations/offline/my_evaluation
Create a JSONL file in datasets/:
{"query": "What is the weather?", "response": "It's sunny and 72°F.", "context": "Weather data..."}
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator
class EvaluatorFactory:
    EVALUATOR_FACTORIES = {
        "relevance_evaluator": RelevanceEvaluator,
        "coherence_evaluator": CoherenceEvaluator,
    }
Update experiment.yaml with your evaluators and column mappings, then:
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/my_evaluation/experiment.yaml
# evaluator/evaluator_repo/my_evaluator.py
class MyCustomEvaluator:
    def __call__(self, query, response, **kwargs):
        score = self.calculate_score(query, response)
        return {"my_metric": score}
Register in eval_factory.py and add to your experiment.yaml.
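The custom evaluator skeleton above leaves `calculate_score` undefined. As one possible way to fill it in, the sketch below uses a deliberately simple, hypothetical metric (the share of query words that also appear in the response). The class name `KeywordOverlapEvaluator` and the metric itself are illustrative assumptions, not something the framework ships.

```python
class KeywordOverlapEvaluator:
    """Hypothetical metric: share of query words that also appear in the response."""

    def calculate_score(self, query, response):
        query_words = set(query.lower().split())
        response_words = set(response.lower().split())
        if not query_words:
            return 0.0  # Avoid dividing by zero on an empty query.
        return len(query_words & response_words) / len(query_words)

    def __call__(self, query, response, **kwargs):
        # Any extra mapped columns arrive via **kwargs and are ignored here.
        return {"keyword_overlap": self.calculate_score(query, response)}


evaluator = KeywordOverlapEvaluator()
print(evaluator(query="reset my thermostat", response="I will reset the thermostat now"))
```

Returning a dictionary keyed by metric name mirrors the `{"my_metric": score}` shape used in the skeleton.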
For AI Foundry's built-in agentic and RAG evaluators, see the official documentation:
Microsoft Foundry Evaluator Reference
The framework supports flexible pipeline configurations. Choose the pattern that fits your workflow:
Use when you already have model responses and want to evaluate them.
flowchart LR
A[JSONL Dataset] --> B[Evaluation Module]
B --> C[Results JSON]
Use for end-to-end testing with your agent or model.
flowchart LR
A[Input Queries] --> B[Inference Module]
B --> C[Evaluation Module]
C --> D[Results JSON]
Use for production workflows with external data sources.
flowchart LR
A[Azure Blob] --> B[Data Loader]
B --> C[Preprocessor]
C --> D[Inference]
D --> E[Evaluation]
E --> F[Results]
pipeline:
  - base_path: data_loader       # Stage 1: Load data
    module: loader.load_data
    config_key: data_config
  - base_path: inference         # Stage 2: Run inference
    module: agent.run_inference
    config_key: inference_config
  - base_path: evaluator         # Stage 3: Evaluate results
    module: eval_main.eval_main
    config_key: evaluation
Each pipeline stage is independently configurable: add, remove, or reorder stages as needed.
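The idea of ordered, independently configured stages can be sketched in a few lines of Python. This is an illustration only and not the framework's runner: the in-memory `STAGES` registry, the `(payload, config)` stage signature, and the stand-in stage bodies are all assumptions, whereas the real framework resolves `base_path` and `module` from the YAML.

```python
def load_data(payload, config):
    """Stage 1 stand-in: produce rows (a real loader would read from storage)."""
    return [{"query": q} for q in config["queries"]]


def run_inference(payload, config):
    """Stage 2 stand-in: attach a canned response to each row."""
    return [{**row, "response": config["prefix"] + row["query"]} for row in payload]


def eval_main(payload, config):
    """Stage 3 stand-in: pass a row if its response meets a minimum length."""
    return [{**row, "passed": len(row["response"]) >= config["min_length"]} for row in payload]


# Hypothetical registry keyed by the "module" value of each pipeline entry.
STAGES = {
    "loader.load_data": load_data,
    "agent.run_inference": run_inference,
    "eval_main.eval_main": eval_main,
}


def run_pipeline(config):
    """Run stages in listed order, passing each stage's output to the next."""
    payload = None
    for stage in config["pipeline"]:
        stage_fn = STAGES[stage["module"]]
        payload = stage_fn(payload, config[stage["config_key"]])
    return payload


config = {
    "data_config": {"queries": ["hello"]},
    "inference_config": {"prefix": "echo: "},
    "evaluation": {"min_length": 5},
    "pipeline": [
        {"base_path": "data_loader", "module": "loader.load_data", "config_key": "data_config"},
        {"base_path": "inference", "module": "agent.run_inference", "config_key": "inference_config"},
        {"base_path": "evaluator", "module": "eval_main.eval_main", "config_key": "evaluation"},
    ],
}
print(run_pipeline(config))
# [{'query': 'hello', 'response': 'echo: hello', 'passed': True}]
```

Because each stage only sees the previous stage's output and its own config section, dropping or reordering entries in `pipeline` changes the flow without touching the stage code.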
Important: Running evaluations with Azure OpenAI models incurs costs based on token usage. The LLM-as-Judge evaluators call the model for each row in your dataset, so larger datasets will result in higher costs. Monitor your Azure subscription spending regularly and set up Azure Cost Management alerts. See Azure OpenAI pricing for details.
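For a rough sense of scale before a large run, the arithmetic can be sketched as follows. Every number here is a placeholder rather than a real price or measured token count, the helper name `estimate_judge_cost` is hypothetical, and the single blended per-token rate is a simplification (actual pricing typically distinguishes input and output tokens). It assumes one model call per row per LLM-based evaluator.

```python
def estimate_judge_cost(rows, judge_evaluators, tokens_per_call, usd_per_1k_tokens):
    """Rough estimate assuming one model call per row per LLM-based evaluator."""
    calls = rows * judge_evaluators
    total_tokens = calls * tokens_per_call
    return calls, total_tokens, total_tokens / 1000 * usd_per_1k_tokens


# Placeholder inputs only; substitute measured token usage and current pricing.
calls, tokens, cost = estimate_judge_cost(
    rows=500, judge_evaluators=3, tokens_per_call=1200, usd_per_1k_tokens=0.01
)
print(f"{calls} calls, {tokens} tokens, about ${cost:.2f}")
```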
When you are done experimenting, delete any Azure resources you created to avoid unnecessary charges:
az group delete --name <your-resource-group> --yes --no-wait
All sample datasets included in this repository are fully synthetic. They use fictional entities (Northwind Health, Contoso) and simulated agent interactions (smart-home device controls, weather lookups). No real customer data, personally identifiable information, or production telemetry is included in any dataset.
MIT License - see LICENSE for details.
This project welcomes contributions. See CONTRIBUTING.md for guidelines.
This project has adopted the Microsoft Open Source Code of Conduct.
