
spindle-eval

Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of Stage objects with structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.

Originally built for spindle (a Graph RAG pipeline), spindle-eval is designed to evaluate any pipeline — full end-to-end systems, individual stages, or partial subsets.

Why spindle-eval?

Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:

  • Stage-gated evaluation — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
  • Pipeline-agnostic execution — define stages with the Stage protocol, wire them with StageDef, run them with PipelineExecutor
  • Composable configs — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
  • Multiple tracking backends — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
  • Structured events — thread-safe event store with duration analysis, token tracking, and error filtering
  • KOS metrics — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
  • Automated regression detection — CI compares metrics against baselines with bootstrap confidence intervals
  • Golden dataset management — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation
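The stage-gating idea in the first bullet can be sketched in a few lines. This is a hypothetical illustration of the control flow only, not spindle-eval's actual PipelineExecutor: each stage runs, its metrics are scored, and a failing gate halts everything downstream.

```python
# Hypothetical sketch of stage-gated evaluation -- not spindle-eval's real API.
# stages: list of (name, run_fn, metric_fn, gate_fn) tuples, where gate_fn
# takes the metrics dict and returns True if downstream stages may run.

def run_gated(stages, state=None):
    state, results = state or {}, {}
    for name, run_fn, metric_fn, gate_fn in stages:
        state = run_fn(state)           # produce this stage's outputs
        metrics = metric_fn(state)      # score them
        results[name] = metrics
        if gate_fn is not None and not gate_fn(metrics):
            results[name]["gate_passed"] = False
            break                       # halt all downstream stages
        results[name]["gate_passed"] = True
    return results
```

A failing gate leaves later stages unevaluated entirely, which is what enforces upstream-first optimization: you cannot tune retrieval against a chunking stage that has not yet met its threshold.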

Architecture overview

                    ┌───────────────────────────────┐
                    │      Hydra Configuration      │
                    │  (composable YAML per stage)  │
                    └───────────────┬───────────────┘
                                    │
                    ┌───────────────▼───────────────┐
                    │      spindle-eval runner      │
                    │  (discovery + orchestration)  │
                    └───────────────┬───────────────┘
                                    │
                    ┌───────────────▼───────────────┐
                    │       PipelineExecutor        │
                    │    (stage wiring, metrics,    │
                    │     gates, event logging)     │
                    └───────────────┬───────────────┘
                                    │
           ┌────────────┬───────────┼───────────┬────────────┐
           ▼            ▼           ▼           ▼            ▼
       Stage 1      Stage 2     Stage 3     Stage N    Metric fns
       (any)        (any)       (any)       (any)     (attached)
           │            │           │           │            │
           └────────────┴───────────┴───────────┴────────────┘
                                    │
                    ┌───────────────▼───────────────┐
                    │       Tracker backends        │
                    ├──────────┬──────────┬─────────┤
                    ▼          ▼          ▼         ▼
                 MLflow      File     Langfuse    No-op
              (experiments) (JSON)    (traces) (benchmarks)

Installation

pip install spindle-eval

Deployment model

  • pip install spindle-eval installs the evaluation framework (library + CLI) so you can integrate it into your own pipeline/project.
  • MLflow/Langfuse are external tracking services. You only need them running if you choose those tracking backends.
  • This repository includes scripts/config to run those services locally (Docker) or in GKE (see docs/tracking_setup.md), which is why cloning the repo is useful for infrastructure setup.
  • If you use file or noop tracking, you can run spindle-eval without MLflow/Langfuse.

For co-development with a pipeline package (editable install):

pip install -e ".[dev]"
pip install -e /path/to/your-pipeline

Quick start

Full pipeline evaluation

# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
  preprocessing.chunk_size=256,512,1024 \
  retrieval.top_k=5,10,20

Evaluate a single stage

from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker
from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution

class MyChunker:
    name = "chunking"
    def run(self, inputs, cfg):
        chunks = do_chunking(cfg)  # your chunking logic here
        return StageResult(outputs={"chunks": chunks})

tracker = create_tracker("file", output_dir="./results")
stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]
result = PipelineExecutor(tracker).execute(stages, cfg)  # cfg: your composed Hydra config
tracker.end_run()

Evaluate a KOS builder

from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio

stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]

Configuration

Hydra config groups live in spindle_eval/conf/ (packaged with the install) and compose together:

Group           Options                                      Controls
preprocessing   default, small_chunks, large_chunks          Chunking strategy and size
ontology        schema_first, schema_free, hybrid            Entity/relation schema discovery
extraction      llm, nlp, finetuned                          Triple extraction method
retrieval       hybrid, local, global, drift                 Graph retrieval strategy
generation      gpt4, claude, gemini                         LLM for answer generation
evaluation      quick, full                                  Number of evaluation examples
sweep           none, er_threshold, retrieval, chunk_size    Predefined sweep dimensions

Pipeline packages can register additional config groups via Hydra's SearchPathPlugin. See docs/hydra-config-conventions.md.

Metrics

RAG quality (via Ragas)

Faithfulness, context recall, context precision, answer correctness, answer relevancy.

Graph quality

Connectivity, modularity, B-CUBED clustering, CEAF entity alignment, subgraph completeness.
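As an illustration of how a clustering metric like B-CUBED works: each item's precision is the fraction of its predicted cluster that shares its gold cluster, and its recall is the reverse; both are averaged over items. A minimal sketch (the item-to-label dict representation is an assumption here, not spindle-eval's actual input format):

```python
def bcubed(gold, pred):
    """B-CUBED precision/recall/F1 for a clustering.

    gold, pred: dicts mapping each item to a cluster label.
    Sketch for illustration; not spindle-eval's actual implementation.
    """
    items = list(gold)
    p_total = r_total = 0.0
    for i in items:
        # Items sharing i's cluster under each labeling.
        gold_cluster = {j for j in items if gold[j] == gold[i]}
        pred_cluster = {j for j in items if pred[j] == pred[i]}
        correct = len(gold_cluster & pred_cluster)
        p_total += correct / len(pred_cluster)
        r_total += correct / len(gold_cluster)
    p, r = p_total / len(items), r_total / len(items)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that B-CUBED compares cluster memberships, not labels, so any relabeling of the same partition scores identically.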

Extraction quality

Triple extraction precision, recall, and F1 — with configurable stage gates.

KOS quality

Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See docs/kos-evaluation-guide.md.
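As an illustration, the orphan-ratio idea (concepts with no broader/narrower link) reduces to a small set computation. The concept-list and edge-pair representation below is an assumption made for the sketch; spindle-eval's KOS metrics operate on actual SKOS/OWL structures:

```python
def orphan_concept_ratio(concepts, edges, roots=()):
    """Fraction of concepts with no broader/narrower link.

    Illustrative sketch, not spindle-eval's implementation.
    concepts: iterable of concept IDs
    edges:    iterable of (broader, narrower) pairs
    roots:    top concepts allowed to lack a broader link
    """
    linked = {c for pair in edges for c in pair}
    linked.update(roots)
    concepts = list(concepts)
    orphans = [c for c in concepts if c not in linked]
    return len(orphans) / len(concepts) if concepts else 0.0
```

A ratio near zero means nearly every concept participates in the hierarchy; the gate example above (`< 0.3`) tolerates up to 30% unattached concepts.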

Chunk and provenance quality

Boundary coherence, size distribution, evidence span coverage.
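A size-distribution metric is essentially summary statistics over chunk lengths. A minimal sketch (character-based lengths are an assumption; the packaged metric may well measure tokens instead):

```python
from statistics import mean, pstdev

def size_distribution(chunks):
    """Summary statistics over chunk lengths, in characters.

    Illustrative sketch, not spindle-eval's packaged metric.
    """
    sizes = [len(c) for c in chunks]
    return {
        "chunk_count": len(sizes),
        "mean_size": mean(sizes),
        "stdev_size": pstdev(sizes),   # population std dev of lengths
        "min_size": min(sizes),
        "max_size": max(sizes),
    }
```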

Statistical rigor

Bootstrap confidence intervals for all metrics, used for regression detection in CI.
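The percentile bootstrap behind such intervals resamples per-example scores with replacement and reads the interval off the sorted resample means. A hedged sketch of the general technique (spindle-eval's exact CI and regression-detection procedure may differ):

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Generic sketch of the technique, not spindle-eval's implementation.
    """
    rng = random.Random(seed)
    # Mean of each resample (same size as the original, drawn with replacement).
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Under this scheme, a candidate run whose interval sits entirely below the baseline's is flagged as a regression rather than noise.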

Tracking backends

Backend     Class              Use case
MLflow      MLflowTracker      Production experiment tracking
File        FileTracker        Local development, CI
Langfuse    Via OpenTelemetry  Trace-level debugging
No-op       NoOpTracker        Benchmarking, unit tests
Composite   CompositeTracker   Fan out to multiple backends

from spindle_eval.tracking import create_tracker

tracker = create_tracker("mlflow")
tracker = create_tracker("file", output_dir="./results")
tracker = create_tracker("noop")

Documentation

Guide                      Audience
Spindle Developer Guide    Pipeline developers integrating with spindle-eval
Custom Pipeline Guide      Developers building non-spindle pipelines
KOS Evaluation Guide       Developers evaluating SKOS/OWL knowledge structures
Hydra Config Conventions   Config authors and sweep designers
Tracking Setup             Setting up MLflow/Langfuse (GKE or local Docker)
PyPI Publishing            Building and uploading releases to PyPI

Requirements

  • Python 3.10+
  • Pipeline package (optional — mocks used if unavailable, controlled via runner.allow_mock_fallback)

Project structure

spindle-eval/
├── src/spindle_eval/
│   ├── runner.py           # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py         # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py        # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py           # Legacy component dict → StageDef adapter
│   ├── mocks.py            # Mock Stage implementations for testing
│   ├── metrics/            # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/           # MLflow, file, noop, composite trackers
│   ├── events/             # Event store, duration/token/error analysis
│   ├── datasets/           # Golden dataset loading, KOS reference extraction
│   ├── baselines/          # Baseline runner implementations
│   ├── ci/                 # Regression detection, PR report generation
│   ├── production/         # Feedback loops, staleness monitoring
│   ├── conf/               # Hydra config groups (packaged for pip install)
│   └── golden_data/        # Default evaluation datasets (JSONL)
├── docs/                   # Developer guides
├── baselines/              # Baseline metric snapshots
└── tests/
