Milestone: AI Dev Kit Test Framework
Overview
Build a comprehensive test and evaluation framework for AI Dev Kit skills and tools, with an orchestrating skill that ties everything together and enables continuous improvement.
Vision
A complete testing pyramid that validates skills at multiple levels, from basic unit tests to sophisticated LLM-based evaluations, all integrated into the development workflow.
Test Levels
| Level | Type | What It Tests | Tool |
|-------|------|---------------|------|
| 1 | Unit Tests | Individual functions, classes | pytest |
| 2 | Integration Tests | Complete workflows, API interactions | pytest + Databricks |
| 3 | Static Skill Eval | Skill quality without execution | LLM Judge |
| 4 | Thinking Eval | LLM reasoning during execution | LLM Judge |
| 5 | Output Eval | Final artifacts vs. source of truth | LLM Judge |
| 6 | Self-Improvement Loop | Iterative fixes based on eval feedback | Orchestration Skill |
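To make levels 1 and 2 concrete, a unit test for a skill helper might look like the sketch below. Both the helper (`normalize_skill_name`) and the test are hypothetical placeholders, not part of the AI Dev Kit:

```python
# Hypothetical unit under test: a helper that normalizes skill names.
def normalize_skill_name(raw: str) -> str:
    """Lowercase, trim, and hyphenate a skill name (illustrative only)."""
    return "-".join(raw.strip().lower().split())


# Level 1 (pytest-style) unit test for the helper above.
def test_normalize_skill_name():
    assert normalize_skill_name("  Test Framework ") == "test-framework"
    assert normalize_skill_name("unit") == "unit"
```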
Architecture
```
┌───────────────────────────────────────────────────────────────────────────┐
│                           Test Framework Skill                            │
│       (Orchestrates all test types, saves results, suggests fixes)        │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│ ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│ │   Unit    │  │Integration│  │  Static   │  │ Thinking  │  │  Output   │ │
│ │   Tests   │  │   Tests   │  │   Eval    │  │   Eval    │  │   Eval    │ │
│ └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘ │
│       │              │              │              │              │       │
│       └──────────────┴──────────────┬──────────────┴──────────────┘       │
│                                     │                                     │
│                            ┌────────▼────────┐                            │
│                            │     MLflow      │                            │
│                            │     Metrics     │                            │
│                            └─────────────────┘                            │
└───────────────────────────────────────────────────────────────────────────┘
```
Workflow
On-Demand PR Testing
- PR is opened/updated
- Test framework runs all relevant tests
- Results saved to MLflow with branch/commit tags
- Compare against previous runs (main branch baseline)
- LLM grades each test and provides summary
- Self-improvement suggestions generated
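The baseline comparison step can be sketched as a pure function over two metric dictionaries (a PR run vs. the main-branch baseline). The function name and report shape are assumptions for illustration:

```python
# Hypothetical comparison of a PR run's metrics against the main-branch baseline.
def compare_runs(baseline: dict, pr: dict, tolerance: float = 0.0) -> dict:
    """Return per-metric deltas, flagging regressions (PR below baseline)."""
    report = {}
    for name, base_value in baseline.items():
        pr_value = pr.get(name)
        if pr_value is None:
            continue  # metric missing from the PR run; skip rather than guess
        delta = pr_value - base_value
        report[name] = {"delta": delta, "regression": delta < -tolerance}
    return report
```

For example, a PR whose `pass_rate` drops from 0.95 to 0.90 would be flagged as a regression, which the LLM grading step could then explain in its summary.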
Chained Evaluation Flow
For complex evaluations (thinking + output), tests follow a consistent pattern:
- Data Generation: Execute data gen skill with test instructions
- Downstream Task: Execute dependent skill (pipeline, dashboard, etc.)
- Evaluate Both: Assess thinking and output at each step
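The three-step pattern above can be sketched as a small driver where each step consumes the previous step's output and a judge scores every step. All names here are illustrative placeholders, not real AI Dev Kit APIs:

```python
# Sketch of the chained evaluation pattern: data gen feeds the downstream
# task, and a judge assesses every step along the way.
def run_chained_eval(steps, judge):
    """steps: ordered callables (e.g. data gen, then pipeline build);
    judge(step_name, output) scores each step's thinking and output."""
    results, output = [], None
    for step in steps:
        output = step(output)                         # run the next stage
        results.append(judge(step.__name__, output))  # evaluate this stage
    return results
```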
MLflow Integration
Each test run logs:
- Run metadata: branch, commit, timestamp, PR number
- Metrics: pass/fail counts, scores per criteria, execution times
- Artifacts: full traces, comparison reports, recommendations
Enables:
- Historical trend analysis
- PR vs main comparison
- Regression detection
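A minimal sketch of this logging schema is below. The tag and metric names (`branch`, `tests_passed`, etc.) are assumptions about the standardized schema, not a finalized spec; the MLflow calls (`start_run`, `set_tags`, `log_metrics`, `log_artifact`) are the standard tracking API:

```python
# Assemble the per-run payload (tag/metric names are assumptions).
def build_run_payload(branch, commit, pr_number, results):
    tags = {"branch": branch, "commit": commit, "pr_number": str(pr_number)}
    metrics = {
        "tests_passed": sum(r["passed"] for r in results),
        "tests_failed": sum(not r["passed"] for r in results),
        "duration_s": sum(r["duration_s"] for r in results),
    }
    return tags, metrics


def log_run(tags, metrics, report_path=None):
    import mlflow  # requires a configured MLflow tracking URI
    with mlflow.start_run():
        mlflow.set_tags(tags)
        mlflow.log_metrics(metrics)
        if report_path:
            mlflow.log_artifact(report_path)  # e.g. trace or comparison report
```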
Self-Improvement Loop (This Issue)
The final piece: an orchestrating skill that:
- Runs all test types
- Aggregates results
- Identifies patterns in failures
- Generates fix suggestions
- Can iteratively apply fixes and re-test
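The loop can be sketched in a few lines. All three callables are placeholders: a real version would run the test suites, call an LLM for fix suggestions, and gate `apply_fix` behind human approval:

```python
# Minimal sketch of the self-improvement loop: run, diagnose, fix, re-test.
def improvement_loop(run_tests, suggest_fix, apply_fix, max_rounds=3):
    """Returns the number of rounds needed to converge (or max_rounds)."""
    for round_num in range(max_rounds):
        failures = run_tests()
        if not failures:
            return round_num              # converged: everything passes
        apply_fix(suggest_fix(failures))  # gated on approval in practice
    return max_rounds
```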
Skill Capabilities
```yaml
skill: test-framework
commands:
  - run-all: Execute full test suite
  - run-unit: Execute unit tests only
  - run-integration: Execute integration tests only
  - run-evals: Execute all LLM evaluations
  - compare-pr: Compare current branch to main
  - suggest-fixes: Generate improvement recommendations
  - auto-fix: Apply suggestions and re-test (with approval)
```
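One way the orchestrating skill could route these commands is a simple registry; the decorator pattern and the handler bodies below are hypothetical, with real runners (pytest, LLM judges) substituted in practice:

```python
# Hypothetical command dispatch for the test-framework skill.
COMMANDS = {}

def command(name):
    """Register a function as the handler for a skill command."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("run-unit")
def run_unit():
    return "ran: unit"    # placeholder: would shell out to pytest

@command("run-evals")
def run_evals():
    return "ran: evals"   # placeholder: would invoke the LLM judges

@command("run-all")
def run_all():
    # Chains the registered suites; a full version would cover every level.
    return [COMMANDS[name]() for name in ("run-unit", "run-evals")]
```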
Deliverables
- Python tools: One script per test type (reusable, composable)
- Test Framework Skill: SKILL.md that orchestrates everything
- MLflow schema: Standardized metrics and artifacts
- Documentation: How to add tests, create expectations, interpret results
Acceptance Criteria