Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of Stage objects with structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.
Originally built for spindle (a Graph RAG pipeline), spindle-eval is designed to evaluate any pipeline — full end-to-end systems, individual stages, or partial subsets.
Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:
- Stage-gated evaluation — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
- Pipeline-agnostic execution — define stages with the `Stage` protocol, wire them with `StageDef`, run them with `PipelineExecutor`
- Composable configs — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
- Multiple tracking backends — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
- Structured events — thread-safe event store with duration analysis, token tracking, and error filtering
- KOS metrics — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
- Automated regression detection — CI compares metrics against baselines with bootstrap confidence intervals
- Golden dataset management — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation
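The `Stage` protocol mentioned above is deliberately small. A minimal sketch of its shape (illustrative only — see `spindle_eval.protocols` for the actual definitions) looks like this:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class StageResult:
    """Outputs a stage hands to downstream stages and metric functions."""
    outputs: dict[str, Any] = field(default_factory=dict)


@runtime_checkable
class Stage(Protocol):
    """Anything with a name and a run(inputs, cfg) -> StageResult method."""
    name: str

    def run(self, inputs: dict[str, Any], cfg: Any) -> StageResult: ...
```

Any object satisfying this shape can be wired into a `StageDef` and executed, which is what makes the framework pipeline-agnostic.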
```
┌─────────────────────────────┐
│     Hydra Configuration     │
│ (composable YAML per stage) │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│     spindle-eval runner     │
│ (discovery + orchestration) │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│      PipelineExecutor       │
│  (stage wiring, metrics,    │
│   gates, event logging)     │
└──────────────┬──────────────┘
               │
 ┌────────────┬───────────┼───────────┬────────────┐
 ▼            ▼           ▼           ▼            ▼
Stage 1     Stage 2     Stage 3    Stage N    Metric fns
(any)       (any)       (any)      (any)      (attached)
 │            │           │           │            │
 └────────────┴───────────┴───────────┴────────────┘
               │
┌──────────────▼──────────────┐
│      Tracker backends       │
├──────────┬─────────┬────────┤
▼          ▼         ▼        ▼
MLflow     File    Langfuse  No-op
(experiments) (JSON) (traces) (benchmarks)
```
```bash
pip install spindle-eval
```

`pip install spindle-eval` installs the evaluation framework (library + CLI) so you can integrate it into your own pipeline/project.

- MLflow/Langfuse are external tracking services. You only need them running if you choose those tracking backends.
- This repository includes scripts/config to run those services locally (Docker) or in GKE (see `docs/tracking_setup.md`), which is why cloning the repo is useful for infrastructure setup.
- If you use `file` or `noop` tracking, you can run spindle-eval without MLflow/Langfuse.
For co-development with a pipeline package (editable install):
```bash
pip install -e ".[dev]"
pip install -e /path/to/your-pipeline
```

```bash
# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
    preprocessing.chunk_size=256,512,1024 \
    retrieval.top_k=5,10,20
```

```python
from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker
from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution

class MyChunker:
    name = "chunking"

    def run(self, inputs, cfg):
        chunks = do_chunking(cfg)  # your chunking logic
        return StageResult(outputs={"chunks": chunks})

tracker = create_tracker("file", output_dir="./results")

stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]

result = PipelineExecutor(tracker).execute(stages, cfg)
tracker.end_run()
```

```python
from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio

stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]
```

Hydra config groups live in `spindle_eval/conf/` (packaged with the install) and compose together:
| Group | Options | Controls |
|---|---|---|
| `preprocessing` | `default`, `small_chunks`, `large_chunks` | Chunking strategy and size |
| `ontology` | `schema_first`, `schema_free`, `hybrid` | Entity/relation schema discovery |
| `extraction` | `llm`, `nlp`, `finetuned` | Triple extraction method |
| `retrieval` | `hybrid`, `local`, `global`, `drift` | Graph retrieval strategy |
| `generation` | `gpt4`, `claude`, `gemini` | LLM for answer generation |
| `evaluation` | `quick`, `full` | Number of evaluation examples |
| `sweep` | `none`, `er_threshold`, `retrieval`, `chunk_size` | Predefined sweep dimensions |
Pipeline packages can register additional config groups via Hydra's `SearchPathPlugin`. See `docs/hydra-config-conventions.md`.
- RAG answer quality (Ragas) — Faithfulness, context recall, context precision, answer correctness, answer relevancy.
- Graph quality — Connectivity, modularity, B-CUBED clustering, CEAF entity alignment, subgraph completeness.
- Extraction — Triple extraction precision, recall, and F1, with configurable stage gates.
- KOS — Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See `docs/kos-evaluation-guide.md`.
- Chunking — Boundary coherence, size distribution, evidence span coverage.
- Statistics — Bootstrap confidence intervals for all metrics, used for regression detection in CI.
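Conceptually, the regression check resamples per-example metric scores to obtain a percentile interval. A minimal sketch of a percentile bootstrap for the mean (not spindle-eval's actual implementation):

```python
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, compute each resample's mean, then sort.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

A candidate run can then be flagged as a regression when its metric mean falls below the baseline interval's lower bound.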
| Backend | Class | Use case |
|---|---|---|
| MLflow | `MLflowTracker` | Production experiment tracking |
| File | `FileTracker` | Local development, CI |
| Langfuse | Via OpenTelemetry | Trace-level debugging |
| No-op | `NoOpTracker` | Benchmarking, unit tests |
| Composite | `CompositeTracker` | Fan out to multiple backends |
```python
from spindle_eval.tracking import create_tracker

tracker = create_tracker("mlflow")
tracker = create_tracker("file", output_dir="./results")
tracker = create_tracker("noop")
```

| Guide | Audience |
|---|---|
| Spindle Developer Guide | Pipeline developers integrating with spindle-eval |
| Custom Pipeline Guide | Developers building non-spindle pipelines |
| KOS Evaluation Guide | Developers evaluating SKOS/OWL knowledge structures |
| Hydra Config Conventions | Config authors and sweep designers |
| Tracking Setup | Setting up MLflow/Langfuse (GKE or local Docker) |
| PyPI Publishing | Building and uploading releases to PyPI |
- Python 3.10+
- Pipeline package (optional — mocks are used if unavailable, controlled via `runner.allow_mock_fallback`)
```
spindle-eval/
├── src/spindle_eval/
│   ├── runner.py          # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py        # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py       # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py          # Legacy component dict → StageDef adapter
│   ├── mocks.py           # Mock Stage implementations for testing
│   ├── metrics/           # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/          # MLflow, file, noop, composite trackers
│   ├── events/            # Event store, duration/token/error analysis
│   ├── datasets/          # Golden dataset loading, KOS reference extraction
│   ├── baselines/         # Baseline runner implementations
│   ├── ci/                # Regression detection, PR report generation
│   ├── production/        # Feedback loops, staleness monitoring
│   ├── conf/              # Hydra config groups (packaged for pip install)
│   └── golden_data/       # Default evaluation datasets (JSONL)
├── docs/                  # Developer guides
├── baselines/             # Baseline metric snapshots
└── tests/
```