Versioned extraction runs: preserve prior entities + track which backend/model produced each one #63

@JordanCoin

Problem

Re-extracting a document overwrites prior entities. There is no run log. We lose:

  • Audit trail — after a Claude-Opus re-extraction, we can't see what Claude-Haiku said last week
  • Reproducibility — no way to tell which extraction tier or which model produced a given entity
  • Regression detection — can't diff extractions across model upgrades or prompt changes
  • Dispute resolution — if a reporter challenges an entity, we can't show the provenance

Currently: Document.entities_extracted: bool — just a gate, not a log.

Not a duplicate of #60

#60 redesigns the extraction architecture (parallel extractors → LLM validates). This issue is about what we persist regardless of which architecture wins. Even after #60 lands, we still overwrite.

Proposed shape

New table — something like extraction_runs:

| column | purpose |
| --- | --- |
| id | UUID |
| document_id | FK |
| backend | llm / gliner / spacy / regex / ensemble |
| model | e.g. claude-opus-4-6, gliner-large-v2.1, en_core_web_trf |
| prompt_hash | for LLM runs: hash of the system prompt used |
| started_at / completed_at | timing |
| entity_count | quick stat |
| notes | free text (e.g. "validation kept 5, removed 2") |

Then: Entity.extraction_run_id (FK, nullable for legacy rows).
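
To make the proposed shape concrete, here is a minimal sketch using the standard-library `sqlite3` module rather than the project's real models or an Alembic migration. The column types, the `CHECK` constraint on `backend`, the use of SHA-256 for `prompt_hash`, the extra `entities` columns, and the helper names `prompt_hash` and `create_run` are all assumptions for illustration, not part of the proposal.

```python
import hashlib
import sqlite3
import uuid

# Hypothetical DDL mirroring the columns listed above; types are assumed.
SCHEMA = """
CREATE TABLE extraction_runs (
    id           TEXT PRIMARY KEY,   -- UUID stored as text
    document_id  TEXT NOT NULL,
    backend      TEXT NOT NULL
                 CHECK (backend IN ('llm', 'gliner', 'spacy', 'regex', 'ensemble')),
    model        TEXT,
    prompt_hash  TEXT,               -- only set for LLM runs
    started_at   TEXT,
    completed_at TEXT,
    entity_count INTEGER,
    notes        TEXT
);
CREATE TABLE entities (
    id                INTEGER PRIMARY KEY,
    document_id       TEXT NOT NULL,
    canonical_id      TEXT,
    text              TEXT NOT NULL,
    extraction_run_id TEXT REFERENCES extraction_runs(id)  -- NULL = legacy row
);
"""


def prompt_hash(system_prompt: str) -> str:
    """Hash the system prompt (SHA-256 is an assumed choice)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def create_run(conn, document_id, backend, model, system_prompt=None):
    """Insert a run row and return its id so entities can reference it."""
    run_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO extraction_runs "
        "(id, document_id, backend, model, prompt_hash, started_at) "
        "VALUES (?, ?, ?, ?, ?, datetime('now'))",
        (run_id, document_id, backend, model,
         prompt_hash(system_prompt) if system_prompt else None),
    )
    return run_id


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

run_id = create_run(conn, "doc-1", "llm", "claude-opus-4-6", "Extract entities.")
conn.execute(
    "INSERT INTO entities (document_id, canonical_id, text, extraction_run_id) "
    "VALUES (?, ?, ?, ?)",
    ("doc-1", "person:jane-doe", "Jane Doe", run_id),
)
```

Storing a hash instead of the full prompt keeps the row small while still making it possible to tell whether two runs used the same prompt.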

Migration / back-compat

  • Add extraction_runs table (Alembic migration)
  • Add nullable extraction_run_id on entities
  • Existing entities have extraction_run_id = NULL (legacy)
  • New extractions write a run row first, then entities reference it
  • `--force` re-extract: don't delete prior entities; just insert a new run with new entities. Dedup at query time using canonical_id.
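
The append-only `--force` behaviour and query-time dedup could look roughly like the sketch below. It again uses `sqlite3` with a trimmed-down, assumed schema; `record_run` and `latest_entities` are hypothetical helpers, the Haiku model name is a placeholder, and "the most recent run wins per canonical_id" is one possible dedup policy, not something the issue has decided.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE extraction_runs (
    id TEXT PRIMARY KEY, document_id TEXT, backend TEXT,
    model TEXT, started_at TEXT
);
CREATE TABLE entities (
    id INTEGER PRIMARY KEY, document_id TEXT, canonical_id TEXT,
    text TEXT, extraction_run_id TEXT REFERENCES extraction_runs(id)
);
""")


def record_run(conn, run_id, document_id, backend, model, started_at, entities):
    """Write the run row first, then entities that reference it.

    Nothing is deleted: a forced re-extract only appends rows.
    """
    conn.execute(
        "INSERT INTO extraction_runs VALUES (?, ?, ?, ?, ?)",
        (run_id, document_id, backend, model, started_at),
    )
    conn.executemany(
        "INSERT INTO entities (document_id, canonical_id, text, extraction_run_id) "
        "VALUES (?, ?, ?, ?)",
        [(document_id, cid, text, run_id) for cid, text in entities],
    )


def latest_entities(conn, document_id):
    """Dedup at query time: for each canonical_id keep the newest run's row.

    Legacy rows have no run, so their started_at is NULL; SQLite sorts NULL
    first in ascending order, which means any real run overrides them.
    """
    rows = conn.execute(
        """
        SELECT e.canonical_id, e.text, r.model
        FROM entities e
        LEFT JOIN extraction_runs r ON r.id = e.extraction_run_id
        WHERE e.document_id = ?
        ORDER BY r.started_at ASC, e.id ASC
        """,
        (document_id,),
    ).fetchall()
    latest = {}
    for canonical_id, text, model in rows:
        latest[canonical_id] = (text, model)  # later rows overwrite earlier ones
    return latest


# A legacy entity written before run tracking existed (extraction_run_id = NULL).
conn.execute(
    "INSERT INTO entities (document_id, canonical_id, text, extraction_run_id) "
    "VALUES ('doc-1', 'person:jane-doe', 'Jane Doe', NULL)"
)
# First tracked run, then a forced re-extraction with a different model.
record_run(conn, "run-a", "doc-1", "llm", "claude-haiku-placeholder",
           "2025-01-01T00:00:00",
           [("person:jane-doe", "J. Doe"), ("org:acme", "Acme Corp")])
record_run(conn, "run-b", "doc-1", "llm", "claude-opus-4-6",
           "2025-01-08T00:00:00",
           [("person:jane-doe", "Jane Doe")])

deduped = latest_entities(conn, "doc-1")
```

After both runs, all four entity rows are still in the table, and `deduped` resolves `person:jane-doe` to the Opus row while `org:acme` still comes from the earlier run.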

UI / CLI changes

  • `openfoia analyze runs ` — list runs for a document
  • `openfoia analyze extract --prefer-run ` — pick which run the graph/crossref uses
  • Graph view: show badge for backend/model per entity
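
As a rough sketch of the queries these commands and the graph badge might sit on top of, assuming the same simplified `sqlite3` schema: `list_runs` and `entities_for_graph` are hypothetical function names, and the `backend/model` badge format and the "legacy" label are illustrative choices only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE extraction_runs (
    id TEXT PRIMARY KEY, document_id TEXT, backend TEXT,
    model TEXT, started_at TEXT, entity_count INTEGER
);
CREATE TABLE entities (
    id INTEGER PRIMARY KEY, document_id TEXT, canonical_id TEXT,
    text TEXT, extraction_run_id TEXT REFERENCES extraction_runs(id)
);
""")
conn.executemany(
    "INSERT INTO extraction_runs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("run-a", "doc-1", "spacy", "en_core_web_trf", "2025-01-01T00:00:00", 1),
        ("run-b", "doc-1", "llm", "claude-opus-4-6", "2025-01-08T00:00:00", 1),
    ],
)
conn.executemany(
    "INSERT INTO entities (document_id, canonical_id, text, extraction_run_id) "
    "VALUES (?, ?, ?, ?)",
    [
        ("doc-1", "person:jane-doe", "Jane Doe", None),     # legacy row
        ("doc-1", "person:jane-doe", "Jane  Doe", "run-a"),
        ("doc-1", "person:jane-doe", "Jane Doe", "run-b"),
    ],
)


def list_runs(conn, document_id):
    """List runs for a document, oldest first."""
    return conn.execute(
        "SELECT id, backend, model, started_at, entity_count "
        "FROM extraction_runs WHERE document_id = ? ORDER BY started_at",
        (document_id,),
    ).fetchall()


def entities_for_graph(conn, document_id, prefer_run_id=None):
    """Return (text, badge) pairs, optionally restricted to one preferred run."""
    sql = (
        "SELECT e.text, r.backend, r.model FROM entities e "
        "LEFT JOIN extraction_runs r ON r.id = e.extraction_run_id "
        "WHERE e.document_id = ?"
    )
    params = [document_id]
    if prefer_run_id is not None:
        sql += " AND e.extraction_run_id = ?"
        params.append(prefer_run_id)
    sql += " ORDER BY e.id"
    return [
        (text, f"{backend}/{model}" if backend else "legacy")
        for text, backend, model in conn.execute(sql, params)
    ]


runs = list_runs(conn, "doc-1")
preferred = entities_for_graph(conn, "doc-1", prefer_run_id="run-b")
everything = entities_for_graph(conn, "doc-1")
```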

Out of scope for this issue

  • Versioning the documents themselves (separate concern — docs change too, e.g. re-OCR)
  • Versioning entity links — start with just entities
