Azure-Samples/Agentic-Evaluations

Evaluation Framework using Microsoft Foundry

License: MIT Python 3.12+ Microsoft Foundry

A config-driven evaluation framework for agentic systems and GenAI applications, built on the Microsoft Foundry SDK. Get started in minutes with YAML-based experiment configuration.

Whether you're evaluating RAG applications, multi-agent systems, or custom GenAI workflows, this framework reduces boilerplate and accelerates iteration.

Key Features

  • 🚀 Quick Setup: Config-driven YAML files; swap datasets, models, or metrics instantly
  • 🔌 Plug-and-Play: Modular architecture for custom data loaders, evaluators, and pipeline stages
  • 📊 AI Foundry Evaluators: Built-in RAG metrics (Relevance, Coherence, Groundedness) and Agentic metrics (Task Adherence, Tool Call Accuracy)
  • 🎯 Custom Evaluators: Create domain-specific metrics with simple Python classes
  • ⚖️ LLM-as-Judge: Build custom AI judges using prompty templates for flexible scoring
  • 🔗 Experimentation Pipelines: Combine data loading, inference, and evaluation in configurable YAML pipelines
flowchart LR
    A[📄 Dataset] --> B[🤖 Inference]
    B --> C[⚙️ Evaluation]
    C --> D[📊 Results]

About This Framework

This framework provides two core capabilities for rapid AI evaluation:

  1. Simplified Evaluation SDK Integration: Easily add both built-in Microsoft Foundry evaluators and custom metrics with minimal code
  2. Pipeline-Based Architecture: Connect your experiments, inference modules, and data loaders through a configurable pipeline defined in YAML

Evaluation Pipeline


Folder Structure

Agentic-Evaluations/
├── src/
│   ├── agent_evaluation/           # Core evaluation engine
│   │   └── agentic_ops/            # Runner, client, base evaluator
│   └── evaluations/
│       └── offline/
│           ├── agentic_evaluation/          # Agentic metrics (tool call accuracy, recall@k)
│           ├── ai_judge_evaluation_custom/  # LLM-as-Judge with prompty templates
│           ├── genai_evaluation_foundry/    # Built-in RAG evaluators (Relevance, Coherence)
│           ├── pipeline_experiment_evaluation/ # Full pipeline (data → inference → eval)
│           └── utils/                       # Shared constants and utilities
├── assets/                         # Architecture diagrams
├── .env.template                   # Environment variable template
├── requirements.txt                # Python dependencies
└── README.md

Each evaluation folder follows the same layout: datasets/ for input data, evaluator/ for evaluation logic, and an experiment.yaml for configuration. Evaluation outputs are written to the shared src/evaluations/offline/reports/ directory.
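
Before running a new evaluation, the layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of the framework: the helper name `check_eval_folder` is hypothetical, and it assumes only the three entries named above (datasets/, evaluator/, experiment.yaml).

```python
from pathlib import Path

# Entries every evaluation folder is expected to contain, per the layout above.
REQUIRED_DIRS = ("datasets", "evaluator")
REQUIRED_FILES = ("experiment.yaml",)


def check_eval_folder(folder):
    """Return a list of missing entries (empty if the layout is complete).

    Hypothetical helper for illustration; not part of the framework.
    """
    root = Path(folder)
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing
```

Calling `check_eval_folder("src/evaluations/offline/my_evaluation")` returns an empty list when the folder is complete, or the names of whatever is missing.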



Getting Started

Prerequisites

  • Python 3.11+ and Git
  • Azure CLI installed and authenticated
  • Microsoft Foundry project with GPT-4o deployment

Quick Start

# 1. Clone and install
git clone https://github.com/Azure-Samples/Agentic-Evaluations.git
cd Agentic-Evaluations
python -m venv .venv
.venv\Scripts\activate  # Windows PowerShell
# source .venv/bin/activate  # Linux/macOS
pip install -r requirements.txt

# 2. Azure login
az login
az account set --subscription "<your-subscription-id>"

# 3. Configure environment
cp .env.template .env
# Edit .env with your Microsoft Foundry credentials:
#   EVAL_AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
#   EVAL_AZURE_OPENAI_MODEL=<your-deployment-name>
#   EVAL_AZURE_OPENAI_VERSION=2024-12-01-preview

# 4. Run a sample evaluation
python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/genai_evaluation_foundry/experiment.yaml

Results: src/evaluations/offline/reports/{run_id}_{eval_dir_name}.json
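
To inspect a report outside the dashboard, something like the following could locate the newest file for one evaluation. It is a sketch under stated assumptions: the helper `latest_report` is hypothetical, it relies only on the {run_id}_{eval_dir_name}.json naming pattern, and it makes no assumption about the JSON schema inside the report.

```python
import json
from pathlib import Path

REPORTS_DIR = Path("src/evaluations/offline/reports")


def latest_report(eval_dir_name, reports_dir=REPORTS_DIR):
    """Return (path, parsed JSON) for the newest report of one evaluation.

    Matches files named {run_id}_{eval_dir_name}.json and picks the most
    recently modified one; returns None if no report exists yet.
    Hypothetical helper, not part of the framework.
    """
    candidates = sorted(
        Path(reports_dir).glob(f"*_{eval_dir_name}.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None
    newest = candidates[-1]
    with newest.open(encoding="utf-8") as fh:
        return newest, json.load(fh)
```

For the sample run above, `latest_report("genai_evaluation_foundry")` would return the most recent report path together with its parsed contents.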


Samples

Standard GenAI/RAG: Built-in evaluators (Relevance, Coherence, Fluency)

python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/genai_evaluation_foundry/experiment.yaml

Agentic Systems: Agent invocation accuracy, recall@k, hallucination detection

python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/agentic_evaluation/experiment.yaml

Custom AI Judge: LLM-as-Judge with prompty templates

python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/ai_judge_evaluation_custom/experiment.yaml

Full Pipeline: Data loading → Inference → Evaluation

python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/pipeline_experiment_evaluation/experiment.yaml

Visualization Dashboard

All evaluation runs produce a JSON report in src/evaluations/offline/reports/ using the naming pattern {run_id}_{eval_dir_name}.json. The Agentic Evaluation Dashboard is a Streamlit app that reads these reports and renders them as interactive visualizations - no manual parsing required.

python -m streamlit run src/evaluations/offline/reports/dashboard.py

Key capabilities:

  • Overview page: aggregate metric gauges and multi-run summary tables for every evaluation type
  • Run detail page: pass/fail rates, agent routing analysis, per-row score breakdown, and reasoning drill-downs
  • Run comparison page: metric trend charts across multiple runs of the same evaluation

For full usage instructions, gauge scale conventions, and how to extend display names for custom evaluators, see the Dashboard README.


Configuration Guide

experiment.yaml Structure

app_name: Agentic-Evals
experiment_name: My_Evaluation

evaluation:
  run_local: True                    # Local execution (recommended)
  input_path: datasets/
  input_file: my_data.jsonl
  output_path: src/evaluations/offline/reports/
  
  evaluators:                        # Evaluators to run
    relevance: "relevance_evaluator"
    coherence: "coherence_evaluator"
  
  evaluator_config:                  # Map dataset fields to evaluator inputs
    relevance:
      column_mapping:
        query: "${data.query}"
        response: "${data.response}"

pipeline:                            # Pipeline stages
  - base_path: evaluator
    module: eval_main.eval_main
    config_key: evaluation

Key Points:

  • ${data.<field>} syntax maps JSONL dataset fields to evaluator parameters
  • Evaluator keys become column names in results
  • See Built-in Evaluators for parameter requirements
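
To make the `${data.<field>}` mapping concrete, the sketch below resolves a `column_mapping` against one dataset row. It illustrates the idea only: `resolve_column_mapping` is a hypothetical helper, and the framework and SDK may resolve placeholders differently.

```python
import json
import re

# Matches placeholders of the form ${data.<field>}
_PLACEHOLDER = re.compile(r"^\$\{data\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_column_mapping(column_mapping, row):
    """Build evaluator keyword arguments from one dataset row.

    Hypothetical helper for illustration; not the framework's own resolver.
    """
    kwargs = {}
    for param, template in column_mapping.items():
        match = _PLACEHOLDER.match(template)
        if match is None:
            raise ValueError(f"Unsupported mapping for {param!r}: {template!r}")
        field = match.group(1)
        if field not in row:
            raise KeyError(f"Dataset row has no field {field!r}")
        kwargs[param] = row[field]
    return kwargs


mapping = {"query": "${data.query}", "response": "${data.response}"}
row = json.loads(
    '{"query": "What is the weather?", "response": "Sunny.", "context": "Weather data..."}'
)
evaluator_inputs = resolve_column_mapping(mapping, row)
# evaluator_inputs == {"query": "What is the weather?", "response": "Sunny."}
```

Only the mapped fields reach the evaluator; the unmapped `context` field is ignored.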

Creating New Evaluations

1. Copy a sample and prepare your dataset

cp -r src/evaluations/offline/genai_evaluation_foundry src/evaluations/offline/my_evaluation

Create a JSONL file in datasets/:

{"query": "What is the weather?", "response": "It's sunny and 72°F.", "context": "Weather data..."}
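
Malformed rows are easier to catch before an evaluation run than during one. The following sketch reads a JSONL file and checks each row for required fields. The helper name `load_jsonl` and the `REQUIRED_FIELDS` set are assumptions; adjust the set to whatever fields your column mappings reference.

```python
import json

# Assumption: the fields your evaluators' column mappings refer to.
REQUIRED_FIELDS = {"query", "response"}


def load_jsonl(path, required=REQUIRED_FIELDS):
    """Read a JSONL file and check that every row has the required fields."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)  # raises json.JSONDecodeError on bad JSON
            missing = set(required) - set(row)
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            rows.append(row)
    return rows
```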

2. Register evaluators in eval_factory.py

from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator

class EvaluatorFactory:
    EVALUATOR_FACTORIES = {
        "relevance_evaluator": RelevanceEvaluator,
        "coherence_evaluator": CoherenceEvaluator,
    }
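
The registry above maps the string names used in experiment.yaml to evaluator classes. A rough sketch of how such a lookup could turn the `evaluators` section into instances follows. The `create` method and the stand-in `EchoLengthEvaluator` are hypothetical and exist only so the example runs without Azure credentials; the built-in evaluators would additionally need their own constructor arguments passed through `init_kwargs`.

```python
class EchoLengthEvaluator:
    """Stand-in evaluator used only for this sketch."""

    def __call__(self, *, response, **kwargs):
        return {"length": len(response)}


class EvaluatorFactory:
    # Maps the string names used in experiment.yaml to evaluator classes.
    EVALUATOR_FACTORIES = {
        "echo_length_evaluator": EchoLengthEvaluator,
    }

    @classmethod
    def create(cls, name, **init_kwargs):
        """Instantiate the evaluator registered under `name` (hypothetical method)."""
        try:
            evaluator_cls = cls.EVALUATOR_FACTORIES[name]
        except KeyError:
            known = ", ".join(sorted(cls.EVALUATOR_FACTORIES))
            raise ValueError(f"Unknown evaluator {name!r}; registered: {known}") from None
        return evaluator_cls(**init_kwargs)


# The `evaluators` section of experiment.yaml, written as a plain dict.
evaluators_cfg = {"length": "echo_length_evaluator"}
instances = {key: EvaluatorFactory.create(name) for key, name in evaluators_cfg.items()}
results = {key: ev(response="It's sunny.") for key, ev in instances.items()}
# results == {"length": {"length": 11}}
```

Failing fast on an unknown name surfaces a typo in experiment.yaml before any model calls are made.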

3. Configure and run

Update experiment.yaml with your evaluators and column mappings, then:

python -m src.agent_evaluation.agentic_ops.runner --config_file src/evaluations/offline/my_evaluation/experiment.yaml

Adding Custom Evaluators

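# evaluator/evaluator_repo/my_evaluator.py
class MyCustomEvaluator:
    def __call__(self, query, response, **kwargs):
        # Compute a score for this row and return it under a metric name
        score = self.calculate_score(query, response)
        return {"my_metric": score}

    def calculate_score(self, query, response):
        # Placeholder: replace with your domain-specific scoring logic
        raise NotImplementedError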

Register the class in eval_factory.py and add it to your experiment.yaml.
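
As a concrete instance of the pattern, here is a small evaluator that needs no model calls. It is an illustrative sketch only: the class name, the `keyword_overlap` metric, and the scoring rule (the fraction of query words that reappear in the response) are all invented for this example.

```python
class KeywordOverlapEvaluator:
    """Scores how many query words reappear in the response (0.0 to 1.0)."""

    def __call__(self, query, response, **kwargs):
        return {"keyword_overlap": self.calculate_score(query, response)}

    @staticmethod
    def _tokens(text):
        # Lowercase words with surrounding punctuation stripped.
        return {word.strip(".,!?;:\"'").lower() for word in text.split()} - {""}

    def calculate_score(self, query, response):
        query_words = self._tokens(query)
        if not query_words:
            return 0.0
        return len(query_words & self._tokens(response)) / len(query_words)


evaluator = KeywordOverlapEvaluator()
result = evaluator(query="What is the weather?", response="The weather is sunny.")
# result == {"keyword_overlap": 0.75}
```

Three of the four query words ("is", "the", "weather") appear in the response, giving 0.75.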


Evaluators Reference

For the built-in Microsoft Foundry evaluators for agentic and RAG scenarios, see the official documentation:

📖 Microsoft Foundry Evaluator Reference


Pipeline Architecture

The framework supports flexible pipeline configurations. Choose the pattern that fits your workflow:

Pipeline 1: Evaluation Only

Use when you already have model responses and want to evaluate them.

flowchart LR
    A[📄 JSONL Dataset] --> B[⚙️ Evaluation Module]
    B --> C[📊 Results JSON]

Pipeline 2: Inference + Evaluation

Use for end-to-end testing with your agent or model.

flowchart LR
    A[📝 Input Queries] --> B[🤖 Inference Module]
    B --> C[⚙️ Evaluation Module]
    C --> D[📊 Results JSON]

Pipeline 3: Full Pipeline (Data Loading + Inference + Evaluation)

Use for production workflows with external data sources.

flowchart LR
    A[☁️ Azure Blob] --> B[📥 Data Loader]
    B --> C[🔄 Preprocessor]
    C --> D[🤖 Inference]
    D --> E[⚙️ Evaluation]
    E --> F[📊 Results]

Pipeline Configuration in YAML

pipeline:
  - base_path: data_loader      # Stage 1: Load data
    module: loader.load_data
    config_key: data_config

  - base_path: inference        # Stage 2: Run inference
    module: agent.run_inference
    config_key: inference_config

  - base_path: evaluator        # Stage 3: Evaluate results
    module: eval_main.eval_main
    config_key: evaluation

Each pipeline stage is independently configurable: add, remove, or reorder stages as needed.
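
The sketch below shows one way a runner could interpret a `pipeline` list like the one above. It is not the framework's runner, and it rests on three labeled assumptions: the last dotted component of `module` is the function name, `base_path` is an importable package, and each stage function takes `(data, stage_config)` and returns the data for the next stage. The demo registers in-memory modules under hypothetical names so that it runs without any files on disk.

```python
import importlib
import sys
import types


def resolve_stage(base_path, module):
    """Import `<base_path>.<module path>` and return the named function.

    Assumption: "eval_main.eval_main" means function eval_main in module eval_main.
    """
    module_path, func_name = module.rsplit(".", 1)
    imported = importlib.import_module(f"{base_path}.{module_path}")
    return getattr(imported, func_name)


def run_pipeline(config):
    """Run each stage in order, passing the previous stage's output along."""
    data = None
    for stage in config["pipeline"]:
        func = resolve_stage(stage["base_path"], stage["module"])
        data = func(data, config.get(stage["config_key"], {}))
    return data


def _register(package, module, **functions):
    """Create an in-memory module so the demo needs no files on disk."""
    sys.modules.setdefault(package, types.ModuleType(package))
    mod = types.ModuleType(f"{package}.{module}")
    mod.__dict__.update(functions)
    sys.modules[mod.__name__] = mod


# Hypothetical demo stages standing in for a data loader and an evaluator.
_register("demo_loader", "loader", load_data=lambda data, cfg: list(cfg["rows"]))
_register("demo_evaluator", "eval_main", eval_main=lambda data, cfg: {"rows": len(data)})

demo_config = {
    "data_config": {"rows": ["q1", "q2", "q3"]},
    "evaluation": {},
    "pipeline": [
        {"base_path": "demo_loader", "module": "loader.load_data", "config_key": "data_config"},
        {"base_path": "demo_evaluator", "module": "eval_main.eval_main", "config_key": "evaluation"},
    ],
}
summary = run_pipeline(demo_config)
# summary == {"rows": 3}
```

Because each stage is looked up by name at run time, reordering or removing entries in the list changes the pipeline without touching the runner.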




Cost Considerations

Important: Running evaluations with Azure OpenAI models incurs costs based on token usage. The LLM-as-Judge evaluators call the model for each row in your dataset, so larger datasets will result in higher costs. Monitor your Azure subscription spending regularly and set up Azure Cost Management alerts. See Azure OpenAI pricing for details.
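
A back-of-the-envelope estimate can help size a run before launching it. The sketch below assumes one model call per row per LLM-based evaluator, which may undercount evaluators that make several calls. Every number in the example is a placeholder, not a real price or a measured token count; substitute the current rates and your own measurements.

```python
def estimate_judge_cost(rows, judge_evaluators, tokens_in, tokens_out,
                        price_in_per_1k, price_out_per_1k):
    """Rough cost estimate, assuming one model call per row per evaluator.

    Returns (number of calls, estimated cost in the currency of the prices).
    """
    calls = rows * judge_evaluators
    cost_per_call = (tokens_in / 1000 * price_in_per_1k
                     + tokens_out / 1000 * price_out_per_1k)
    return calls, calls * cost_per_call


# Placeholder numbers only; look up current prices for your deployment.
calls, cost = estimate_judge_cost(
    rows=500, judge_evaluators=3,
    tokens_in=1000, tokens_out=200,
    price_in_per_1k=0.01, price_out_per_1k=0.02,
)
print(f"{calls} calls, estimated cost {cost:.2f}")
```

With these placeholder inputs the estimate is 1500 calls at 0.014 per call, roughly 21.00 in total; doubling the dataset size doubles the figure.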

Clean Up Resources

When you are done experimenting, delete any Azure resources you created to avoid unnecessary charges:

az group delete --name <your-resource-group> --yes --no-wait

Data Provenance

All sample datasets included in this repository are fully synthetic. They use fictional entities (Northwind Health, Contoso) and simulated agent interactions (smart-home device controls, weather lookups). No real customer data, personally identifiable information, or production telemetry is included in any dataset.


License

MIT License - see LICENSE for details.

Contributing

This project welcomes contributions. See CONTRIBUTING.md for guidelines.

This project has adopted the Microsoft Open Source Code of Conduct.
