
spindle-eval

Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of Stage objects with structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.

Originally built for spindle (a Graph RAG pipeline), spindle-eval is designed to evaluate any pipeline — full end-to-end systems, individual stages, or partial subsets.

Why spindle-eval?

Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:

  • Stage-gated evaluation — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
  • Pipeline-agnostic execution — define stages with the Stage protocol, wire them with StageDef, run them with PipelineExecutor
  • Composable configs — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
  • Multiple tracking backends — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
  • Structured events — thread-safe event store with duration analysis, token tracking, and error filtering
  • KOS metrics — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
  • Automated regression detection — CI compares metrics against baselines with bootstrap confidence intervals
  • Golden dataset management — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation
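The stage-gating idea in the first bullet can be sketched in a few lines. This is a hypothetical illustration of the control flow only, not spindle-eval's actual PipelineExecutor: each stage runs, its metrics are scored, and a failing gate halts everything downstream.

```python
# Hypothetical sketch of stage-gated evaluation -- not spindle-eval's real API.
# stages: list of (name, run_fn, metric_fn, gate_fn) tuples, where gate_fn
# takes the metrics dict and returns True if downstream stages may run.

def run_gated(stages, state=None):
    state, results = state or {}, {}
    for name, run_fn, metric_fn, gate_fn in stages:
        state = run_fn(state)           # produce this stage's outputs
        metrics = metric_fn(state)      # score them
        results[name] = metrics
        if gate_fn is not None and not gate_fn(metrics):
            results[name]["gate_passed"] = False
            break                       # halt all downstream stages
        results[name]["gate_passed"] = True
    return results
```

A failing gate leaves later stages unevaluated entirely, which is what enforces upstream-first optimization: you cannot tune retrieval against a chunking stage that has not yet met its threshold.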

Architecture overview

                    ┌───────────────────────────────┐
                    │      Hydra Configuration      │
                    │  (composable YAML per stage)  │
                    └───────────────┬───────────────┘
                                    │
                    ┌───────────────▼───────────────┐
                    │      spindle-eval runner      │
                    │  (discovery + orchestration)  │
                    └───────────────┬───────────────┘
                                    │
                    ┌───────────────▼───────────────┐
                    │       PipelineExecutor        │
                    │    (stage wiring, metrics,    │
                    │     gates, event logging)     │
                    └───────────────┬───────────────┘
                                    │
           ┌────────────┬───────────┼───────────┬────────────┐
           ▼            ▼           ▼           ▼            ▼
       Stage 1      Stage 2     Stage 3     Stage N    Metric fns
       (any)        (any)       (any)       (any)     (attached)
           │            │           │           │            │
           └────────────┴───────────┴───────────┴────────────┘
                                    │
                    ┌───────────────▼───────────────┐
                    │       Tracker backends        │
                    ├──────────┬──────────┬─────────┤
                    ▼          ▼          ▼         ▼
                 MLflow      File     Langfuse    No-op
              (experiments) (JSON)    (traces) (benchmarks)

Installation

pip install spindle-eval

Deployment model

  • pip install spindle-eval installs the evaluation framework (library + CLI) so you can integrate it into your own pipeline/project.
  • MLflow/Langfuse are external tracking services. You only need them running if you choose those tracking backends.
  • This repository includes scripts/config to run those services locally (Docker) or in GKE (see docs/tracking_setup.md), which is why cloning the repo is useful for infrastructure setup.
  • If you use file or noop tracking, you can run spindle-eval without MLflow/Langfuse.

For co-development with a pipeline package (editable install):

pip install -e ".[dev]"
pip install -e /path/to/your-pipeline

Quick start

Full pipeline evaluation

# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
  preprocessing.chunk_size=256,512,1024 \
  retrieval.top_k=5,10,20

Evaluate a single stage

from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker
from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution

class MyChunker:
    name = "chunking"
    def run(self, inputs, cfg):
        chunks = do_chunking(cfg)  # your chunking logic here
        return StageResult(outputs={"chunks": chunks})

tracker = create_tracker("file", output_dir="./results")
stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]
result = PipelineExecutor(tracker).execute(stages, cfg)  # cfg: your composed Hydra config
tracker.end_run()

Evaluate a KOS builder

from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio

stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]

Configuration

Hydra config groups live in spindle_eval/conf/ (packaged with the install) and compose together:

Group           Options                                      Controls
preprocessing   default, small_chunks, large_chunks          Chunking strategy and size
ontology        schema_first, schema_free, hybrid            Entity/relation schema discovery
extraction      llm, nlp, finetuned                          Triple extraction method
retrieval       hybrid, local, global, drift                 Graph retrieval strategy
generation      gpt4, claude, gemini                         LLM for answer generation
evaluation      quick, full                                  Number of evaluation examples
sweep           none, er_threshold, retrieval, chunk_size    Predefined sweep dimensions

Pipeline packages can register additional config groups via Hydra's SearchPathPlugin. See docs/hydra-config-conventions.md.

Metrics

RAG quality (via Ragas)

Faithfulness, context recall, context precision, answer correctness, answer relevancy.

Graph quality

Connectivity, modularity, B-CUBED clustering, CEAF entity alignment, subgraph completeness.
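As an illustration of how a clustering metric like B-CUBED works: each item's precision is the fraction of its predicted cluster that shares its gold cluster, and its recall is the reverse; both are averaged over items. A minimal sketch (the item-to-label dict representation is an assumption here, not spindle-eval's actual input format):

```python
def bcubed(gold, pred):
    """B-CUBED precision/recall/F1 for a clustering.

    gold, pred: dicts mapping each item to a cluster label.
    Sketch for illustration; not spindle-eval's actual implementation.
    """
    items = list(gold)
    p_total = r_total = 0.0
    for i in items:
        # Items sharing i's cluster under each labeling.
        gold_cluster = {j for j in items if gold[j] == gold[i]}
        pred_cluster = {j for j in items if pred[j] == pred[i]}
        correct = len(gold_cluster & pred_cluster)
        p_total += correct / len(pred_cluster)
        r_total += correct / len(gold_cluster)
    p, r = p_total / len(items), r_total / len(items)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that B-CUBED compares cluster memberships, not labels, so any relabeling of the same partition scores identically.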

Extraction quality

Triple extraction precision, recall, and F1 — with configurable stage gates.

KOS quality

Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See docs/kos-evaluation-guide.md.
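As an illustration, the orphan-ratio idea (concepts with no broader/narrower link) reduces to a small set computation. The concept-list and edge-pair representation below is an assumption made for the sketch; spindle-eval's KOS metrics operate on actual SKOS/OWL structures:

```python
def orphan_concept_ratio(concepts, edges, roots=()):
    """Fraction of concepts with no broader/narrower link.

    Illustrative sketch, not spindle-eval's implementation.
    concepts: iterable of concept IDs
    edges:    iterable of (broader, narrower) pairs
    roots:    top concepts allowed to lack a broader link
    """
    linked = {c for pair in edges for c in pair}
    linked.update(roots)
    concepts = list(concepts)
    orphans = [c for c in concepts if c not in linked]
    return len(orphans) / len(concepts) if concepts else 0.0
```

A ratio near zero means nearly every concept participates in the hierarchy; the gate example above (`< 0.3`) tolerates up to 30% unattached concepts.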

Chunk and provenance quality

Boundary coherence, size distribution, evidence span coverage.
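A size-distribution metric is essentially summary statistics over chunk lengths. A minimal sketch (character-based lengths are an assumption; the packaged metric may well measure tokens instead):

```python
from statistics import mean, pstdev

def size_distribution(chunks):
    """Summary statistics over chunk lengths, in characters.

    Illustrative sketch, not spindle-eval's packaged metric.
    """
    sizes = [len(c) for c in chunks]
    return {
        "chunk_count": len(sizes),
        "mean_size": mean(sizes),
        "stdev_size": pstdev(sizes),   # population std dev of lengths
        "min_size": min(sizes),
        "max_size": max(sizes),
    }
```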

Statistical rigor

Bootstrap confidence intervals for all metrics, used for regression detection in CI.
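The percentile bootstrap behind such intervals resamples per-example scores with replacement and reads the interval off the sorted resample means. A hedged sketch of the general technique (spindle-eval's exact CI and regression-detection procedure may differ):

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Generic sketch of the technique, not spindle-eval's implementation.
    """
    rng = random.Random(seed)
    # Mean of each resample (same size as the original, drawn with replacement).
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Under this scheme, a candidate run whose interval sits entirely below the baseline's is flagged as a regression rather than noise.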

Tracking backends

Backend     Class              Use case
MLflow      MLflowTracker      Production experiment tracking
File        FileTracker        Local development, CI
Langfuse    Via OpenTelemetry  Trace-level debugging
No-op       NoOpTracker        Benchmarking, unit tests
Composite   CompositeTracker   Fan out to multiple backends

from spindle_eval.tracking import create_tracker

tracker = create_tracker("mlflow")
tracker = create_tracker("file", output_dir="./results")
tracker = create_tracker("noop")

Documentation

Guide                      Audience
Spindle Developer Guide    Pipeline developers integrating with spindle-eval
Custom Pipeline Guide      Developers building non-spindle pipelines
KOS Evaluation Guide       Developers evaluating SKOS/OWL knowledge structures
Hydra Config Conventions   Config authors and sweep designers
Tracking Setup             Setting up MLflow/Langfuse (GKE or local Docker)
PyPI Publishing            Building and uploading releases to PyPI

Requirements

  • Python 3.10+
  • Pipeline package (optional — mocks used if unavailable, controlled via runner.allow_mock_fallback)

Project structure

spindle-eval/
├── src/spindle_eval/
│   ├── runner.py           # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py         # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py        # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py           # Legacy component dict → StageDef adapter
│   ├── mocks.py            # Mock Stage implementations for testing
│   ├── metrics/            # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/           # MLflow, file, noop, composite trackers
│   ├── events/             # Event store, duration/token/error analysis
│   ├── datasets/           # Golden dataset loading, KOS reference extraction
│   ├── baselines/          # Baseline runner implementations
│   ├── ci/                 # Regression detection, PR report generation
│   ├── production/         # Feedback loops, staleness monitoring
│   ├── conf/               # Hydra config groups (packaged for pip install)
│   └── golden_data/        # Default evaluation datasets (JSONL)
├── docs/                   # Developer guides
├── baselines/              # Baseline metric snapshots
└── tests/
