Milestone: AI Dev Kit Test Framework
Overview
Build a comprehensive test and evaluation framework for AI Dev Kit skills and tools, with an orchestrating skill that ties everything together and enables continuous improvement.
Vision
A complete testing pyramid that validates skills at multiple levels, from basic unit tests to sophisticated LLM-based evaluations, all integrated into the development workflow.
Test Levels
| Level | Type | What It Tests | Tool |
|-------|------|---------------|------|
| 1 | Unit Tests | Individual functions, classes | pytest |
| 2 | Integration Tests | Complete workflows, API interactions | pytest + Databricks |
| 3 | Static Skill Eval | Skill quality without execution | LLM Judge |
| 4 | Thinking Eval | LLM reasoning during execution | LLM Judge |
| 5 | Output Eval | Final artifacts vs. source of truth | LLM Judge |
| 6 | Self-Improvement Loop | Iterative fixes based on eval feedback | Orchestration Skill |
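To make levels 1 and 2 concrete, a unit test for a skill helper might look like the sketch below. Both the helper (`normalize_skill_name`) and the test are hypothetical placeholders, not part of the AI Dev Kit:

```python
# Hypothetical unit under test: a helper that normalizes skill names.
def normalize_skill_name(raw: str) -> str:
    """Lowercase, trim, and hyphenate a skill name (illustrative only)."""
    return "-".join(raw.strip().lower().split())


# Level 1 (pytest-style) unit test for the helper above.
def test_normalize_skill_name():
    assert normalize_skill_name("  Test Framework ") == "test-framework"
    assert normalize_skill_name("unit") == "unit"
```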
Architecture
```
┌───────────────────────────────────────────────────────────────────────────┐
│                           Test Framework Skill                            │
│       (Orchestrates all test types, saves results, suggests fixes)        │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│ ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│ │   Unit    │  │Integration│  │  Static   │  │ Thinking  │  │  Output   │ │
│ │   Tests   │  │   Tests   │  │   Eval    │  │   Eval    │  │   Eval    │ │
│ └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘ │
│       │              │              │              │              │       │
│       └──────────────┴──────────────┬──────────────┴──────────────┘       │
│                                     │                                     │
│                            ┌────────▼────────┐                            │
│                            │     MLflow      │                            │
│                            │     Metrics     │                            │
│                            └─────────────────┘                            │
└───────────────────────────────────────────────────────────────────────────┘
```
Workflow
On-Demand PR Testing
- PR is opened/updated
- Test framework runs all relevant tests
- Results saved to MLflow with branch/commit tags
- Compare against previous runs (main branch baseline)
- LLM grades each test and provides summary
- Self-improvement suggestions generated
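The baseline comparison step can be sketched as a pure function over two metric dictionaries (a PR run vs. the main-branch baseline). The function name and report shape are assumptions for illustration:

```python
# Hypothetical comparison of a PR run's metrics against the main-branch baseline.
def compare_runs(baseline: dict, pr: dict, tolerance: float = 0.0) -> dict:
    """Return per-metric deltas, flagging regressions (PR below baseline)."""
    report = {}
    for name, base_value in baseline.items():
        pr_value = pr.get(name)
        if pr_value is None:
            continue  # metric missing from the PR run; skip rather than guess
        delta = pr_value - base_value
        report[name] = {"delta": delta, "regression": delta < -tolerance}
    return report
```

For example, a PR whose `pass_rate` drops from 0.95 to 0.90 would be flagged as a regression, which the LLM grading step could then explain in its summary.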
Chained Evaluation Flow
For complex evaluations (thinking + output), tests follow a consistent pattern:
- Data Generation: Execute data gen skill with test instructions
- Downstream Task: Execute dependent skill (pipeline, dashboard, etc.)
- Evaluate Both: Assess thinking and output at each step
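The three-step pattern above can be sketched as a small driver where each step consumes the previous step's output and a judge scores every step. All names here are illustrative placeholders, not real AI Dev Kit APIs:

```python
# Sketch of the chained evaluation pattern: data gen feeds the downstream
# task, and a judge assesses every step along the way.
def run_chained_eval(steps, judge):
    """steps: ordered callables (e.g. data gen, then pipeline build);
    judge(step_name, output) scores each step's thinking and output."""
    results, output = [], None
    for step in steps:
        output = step(output)                         # run the next stage
        results.append(judge(step.__name__, output))  # evaluate this stage
    return results
```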
MLflow Integration
Each test run logs:
- Run metadata: branch, commit, timestamp, PR number
- Metrics: pass/fail counts, scores per criteria, execution times
- Artifacts: full traces, comparison reports, recommendations
Enables:
- Historical trend analysis
- PR vs main comparison
- Regression detection
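A minimal sketch of this logging schema is below. The tag and metric names (`branch`, `tests_passed`, etc.) are assumptions about the standardized schema, not a finalized spec; the MLflow calls (`start_run`, `set_tags`, `log_metrics`, `log_artifact`) are the standard tracking API:

```python
# Assemble the per-run payload (tag/metric names are assumptions).
def build_run_payload(branch, commit, pr_number, results):
    tags = {"branch": branch, "commit": commit, "pr_number": str(pr_number)}
    metrics = {
        "tests_passed": sum(r["passed"] for r in results),
        "tests_failed": sum(not r["passed"] for r in results),
        "duration_s": sum(r["duration_s"] for r in results),
    }
    return tags, metrics


def log_run(tags, metrics, report_path=None):
    import mlflow  # requires a configured MLflow tracking URI
    with mlflow.start_run():
        mlflow.set_tags(tags)
        mlflow.log_metrics(metrics)
        if report_path:
            mlflow.log_artifact(report_path)  # e.g. trace or comparison report
```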
Self-Improvement Loop (This Issue)
The final piece: an orchestrating skill that:
- Runs all test types
- Aggregates results
- Identifies patterns in failures
- Generates fix suggestions
- Can iteratively apply fixes and re-test
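The loop can be sketched in a few lines. All three callables are placeholders: a real version would run the test suites, call an LLM for fix suggestions, and gate `apply_fix` behind human approval:

```python
# Minimal sketch of the self-improvement loop: run, diagnose, fix, re-test.
def improvement_loop(run_tests, suggest_fix, apply_fix, max_rounds=3):
    """Returns the number of rounds needed to converge (or max_rounds)."""
    for round_num in range(max_rounds):
        failures = run_tests()
        if not failures:
            return round_num              # converged: everything passes
        apply_fix(suggest_fix(failures))  # gated on approval in practice
    return max_rounds
```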
Skill Capabilities
```yaml
skill: test-framework
commands:
  - run-all: Execute full test suite
  - run-unit: Execute unit tests only
  - run-integration: Execute integration tests only
  - run-evals: Execute all LLM evaluations
  - compare-pr: Compare current branch to main
  - suggest-fixes: Generate improvement recommendations
  - auto-fix: Apply suggestions and re-test (with approval)
```
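One way the orchestrating skill could route these commands is a simple registry; the decorator pattern and the handler bodies below are hypothetical, with real runners (pytest, LLM judges) substituted in practice:

```python
# Hypothetical command dispatch for the test-framework skill.
COMMANDS = {}

def command(name):
    """Register a function as the handler for a skill command."""
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("run-unit")
def run_unit():
    return "ran: unit"    # placeholder: would shell out to pytest

@command("run-evals")
def run_evals():
    return "ran: evals"   # placeholder: would invoke the LLM judges

@command("run-all")
def run_all():
    # Chains the registered suites; a full version would cover every level.
    return [COMMANDS[name]() for name in ("run-unit", "run-evals")]
```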
Deliverables
- Python tools: One script per test type (reusable, composable)
- Test Framework Skill: SKILL.md that orchestrates everything
- MLflow schema: Standardized metrics and artifacts
- Documentation: How to add tests, create expectations, interpret results
Acceptance Criteria