Problem
Re-extracting a document overwrites prior entities. There is no run log. We lose:
- Audit trail — after a Claude-Opus re-extraction, we can't see what Claude-Haiku said last week
- Reproducibility — no way to tell which extraction tier or which model produced a given entity
- Regression detection — can't diff extractions across model upgrades or prompt changes
- Dispute resolution — if a reporter challenges an entity, we can't show the provenance
Currently the only record of extraction is `Document.entities_extracted: bool`, which is a gate, not a log.
Not a duplicate of #60
#60 redesigns the extraction architecture (parallel extractors → LLM validates). This issue is about what we persist regardless of which architecture wins. Even after #60 lands, we still overwrite.
Proposed shape
New table — something like extraction_runs:
| column | purpose |
| --- | --- |
| id | UUID |
| document_id | FK |
| backend | llm / gliner / spacy / regex / ensemble |
| model | e.g. claude-opus-4-6, gliner-large-v2.1, en_core_web_trf |
| prompt_hash | for LLM runs: hash of the system prompt used |
| started_at / completed_at | timing |
| entity_count | quick stat |
| notes | free text (e.g. "validation kept 5, removed 2") |
Then: Entity.extraction_run_id (FK, nullable for legacy rows).
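To make the shape concrete, here is a minimal sketch of the two tables using Python's stdlib `sqlite3`. It is illustrative only: the real change would be an Alembic migration against the project's models. The column types, the `CHECK` constraint on `backend`, the use of SHA-256 for `prompt_hash`, and the `entities.id` / `entities.text` columns are all assumptions, not decisions made in this issue.

```python
import hashlib
import sqlite3

# Hypothetical DDL mirroring the proposed columns; types are assumptions.
SCHEMA = """
CREATE TABLE documents (id TEXT PRIMARY KEY);

CREATE TABLE extraction_runs (
    id           TEXT PRIMARY KEY,   -- UUID stored as text
    document_id  TEXT NOT NULL REFERENCES documents(id),
    backend      TEXT NOT NULL
                 CHECK (backend IN ('llm', 'gliner', 'spacy', 'regex', 'ensemble')),
    model        TEXT NOT NULL,      -- e.g. claude-opus-4-6
    prompt_hash  TEXT,               -- NULL for non-LLM runs
    started_at   TEXT NOT NULL,
    completed_at TEXT,
    entity_count INTEGER,
    notes        TEXT
);

CREATE TABLE entities (
    id                TEXT PRIMARY KEY,
    document_id       TEXT NOT NULL REFERENCES documents(id),
    canonical_id      TEXT,
    text              TEXT NOT NULL,
    extraction_run_id TEXT REFERENCES extraction_runs(id)  -- NULL = legacy row
);
"""


def prompt_hash(system_prompt: str) -> str:
    """Stable fingerprint of the system prompt (SHA-256 is an assumed choice)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketch tables with foreign-key enforcement turned on."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

Hashing the prompt rather than storing it keeps the run row small while still letting two runs be compared for "same prompt or not"; storing the full prompt text elsewhere would be a separate decision.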
Migration / back-compat
- Add `extraction_runs` table (Alembic migration)
- Add nullable `extraction_run_id` on `entities`
- Existing entities have `extraction_run_id = NULL` (legacy)
- New extractions write a run row first, then entities reference it
- `--force` re-extract: don't delete prior entities; just insert a new run with new entities. Dedup at query time using canonical_id.
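The write path and the query-time dedup described above could look roughly like the sketch below, again using stdlib `sqlite3` with a trimmed stand-in schema so it runs on its own. The function names `record_run` and `current_entities`, the `entities.id` / `entities.text` columns, and the "most recent run wins, legacy rows lose" rule are assumptions for illustration; the issue only specifies that dedup happens at query time on `canonical_id`.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Trimmed stand-in schema so the sketch is self-contained (hypothetical columns).
SCHEMA = """
CREATE TABLE extraction_runs (
    id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL,
    backend TEXT NOT NULL,
    model TEXT NOT NULL,
    started_at TEXT NOT NULL,
    completed_at TEXT,
    entity_count INTEGER
);
CREATE TABLE entities (
    id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL,
    canonical_id TEXT NOT NULL,
    text TEXT NOT NULL,
    extraction_run_id TEXT REFERENCES extraction_runs(id)
);
"""


def record_run(conn, document_id, backend, model, entities, started_at=None):
    """Write the run row first, then its entities. Prior entities are never deleted.

    `entities` is a list of (canonical_id, text) pairs.
    """
    run_id = str(uuid.uuid4())
    started_at = started_at or datetime.now(timezone.utc).isoformat()
    rows = [(str(uuid.uuid4()), document_id, cid, text, run_id) for cid, text in entities]
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO extraction_runs (id, document_id, backend, model, started_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (run_id, document_id, backend, model, started_at),
        )
        conn.executemany(
            "INSERT INTO entities (id, document_id, canonical_id, text, extraction_run_id) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
        conn.execute(
            "UPDATE extraction_runs SET completed_at = ?, entity_count = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), len(rows), run_id),
        )
    return run_id


def current_entities(conn, document_id):
    """Return {canonical_id: (text, run_id)}, keeping the newest run's version.

    Legacy rows (run id NULL) sort last under DESC in SQLite, so they are used
    only when no recorded run produced that canonical_id.
    """
    rows = conn.execute(
        "SELECT e.canonical_id, e.text, e.extraction_run_id "
        "FROM entities e LEFT JOIN extraction_runs r ON r.id = e.extraction_run_id "
        "WHERE e.document_id = ? "
        "ORDER BY r.started_at DESC, e.id",
        (document_id,),
    )
    chosen = {}
    for canonical_id, text, run_id in rows:
        chosen.setdefault(canonical_id, (text, run_id))  # first seen = newest
    return chosen
```

Because every run's rows are kept, the same tables also support the audit and regression use cases: comparing two runs is a matter of selecting entities by `extraction_run_id` and diffing the results.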
UI / CLI changes
- `openfoia analyze runs ` — list runs for a document
- `openfoia analyze extract --prefer-run ` — pick which run the graph/crossref uses
- Graph view: show badge for backend/model per entity
Out of scope for this issue
- Versioning the documents themselves (separate concern — docs change too, e.g. re-OCR)
- Versioning entity links — start with just entities