Versioned extraction runs: preserve prior entities + track which backend/model produced each one

## Problem

Re-extracting a document overwrites prior entities. There is no run log. We lose:

- **Audit trail** — after a Claude-Opus re-extraction, we can't see what Claude-Haiku said last week
- **Reproducibility** — no way to tell *which* extraction tier or *which* model produced a given entity
- **Regression detection** — can't diff extractions across model upgrades or prompt changes
- **Dispute resolution** — if a reporter challenges an entity, we can't show the provenance

Currently: `Document.entities_extracted: bool` — just a gate, not a log.

## Not a duplicate of #60

#60 redesigns the extraction *architecture* (parallel extractors → LLM validates). This issue is about *what we persist* regardless of which architecture wins. Even after #60 lands, we still overwrite.

## Proposed shape

New table — something like `extraction_runs`:

| column | purpose |
|--------|---------|
| id | UUID |
| document_id | FK |
| backend | `llm` / `gliner` / `spacy` / `regex` / `ensemble` |
| model | e.g. `claude-opus-4-6`, `gliner-large-v2.1`, `en_core_web_trf` |
| prompt_hash | for LLM runs — hash of the system prompt used |
| started_at / completed_at | timing |
| entity_count | quick stat |
| notes | free text (e.g. "validation kept 5, removed 2") |

Then: `Entity.extraction_run_id` (FK, nullable for legacy rows).
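
To make the shape above concrete, here is a minimal sketch using the standard-library `sqlite3` module rather than the project's real ORM. The column names on `extraction_runs` come from the table; the `documents` table, the `entities` columns other than `extraction_run_id` and `canonical_id`, the `CHECK` constraint, and the choice of SHA-256 for `prompt_hash` are assumptions for illustration only.

```python
import hashlib
import sqlite3

# Hypothetical DDL: mirrors the proposed columns, not the project's actual models.
SCHEMA = """
CREATE TABLE documents (
    id TEXT PRIMARY KEY
);

CREATE TABLE extraction_runs (
    id           TEXT PRIMARY KEY,  -- UUID stored as text
    document_id  TEXT NOT NULL REFERENCES documents(id),
    backend      TEXT NOT NULL
                 CHECK (backend IN ('llm', 'gliner', 'spacy', 'regex', 'ensemble')),
    model        TEXT NOT NULL,
    prompt_hash  TEXT,              -- NULL for non-LLM runs
    started_at   TEXT NOT NULL,
    completed_at TEXT,              -- NULL while the run is in progress
    entity_count INTEGER,
    notes        TEXT
);

CREATE TABLE entities (
    id                TEXT PRIMARY KEY,
    document_id       TEXT NOT NULL REFERENCES documents(id),
    canonical_id      TEXT,
    surface_text      TEXT NOT NULL,  -- assumed column name
    extraction_run_id TEXT REFERENCES extraction_runs(id)  -- NULL = legacy row
);
"""


def prompt_hash(system_prompt: str) -> str:
    """Hash the system prompt; SHA-256 is an assumed choice, any stable hash works."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketch tables with foreign-key enforcement turned on."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

Storing a hash instead of the full prompt keeps run rows small while still letting two runs be compared for "same prompt or not"; recovering the prompt text itself would need a separate store.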

## Migration / back-compat

- Add `extraction_runs` table (Alembic migration)
- Add nullable `extraction_run_id` on `entities`
- Existing entities have `extraction_run_id = NULL` (legacy)
- New extractions write a run row first, then entities reference it
- `--force` re-extract: **don't delete** prior entities; just insert a new run with new entities. Dedup at query time using `canonical_id`.
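
The write path and the query-time dedup described in this list could look roughly like the sketch below. It uses `sqlite3` with a trimmed, hypothetical schema; `record_run`, `current_entities`, and the `surface_text` column are illustrative names, and "newest run wins, legacy rows last" is one assumed dedup policy, since the issue only says to dedup by `canonical_id`.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Trimmed, hypothetical schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE extraction_runs (
    id TEXT PRIMARY KEY, document_id TEXT NOT NULL, backend TEXT NOT NULL,
    model TEXT NOT NULL, started_at TEXT NOT NULL, completed_at TEXT,
    entity_count INTEGER
);
CREATE TABLE entities (
    id TEXT PRIMARY KEY, document_id TEXT NOT NULL, canonical_id TEXT,
    surface_text TEXT NOT NULL,
    extraction_run_id TEXT REFERENCES extraction_runs(id)
);
"""


def record_run(conn, document_id, backend, model, entities, started_at=None):
    """Write the run row first, then its entities. Never deletes prior rows.

    `entities` is a list of (canonical_id, surface_text) pairs.
    """
    run_id = str(uuid.uuid4())
    started_at = started_at or datetime.now(timezone.utc).isoformat()
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO extraction_runs (id, document_id, backend, model, started_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (run_id, document_id, backend, model, started_at),
        )
        conn.executemany(
            "INSERT INTO entities (id, document_id, canonical_id, surface_text, "
            "extraction_run_id) VALUES (?, ?, ?, ?, ?)",
            [(str(uuid.uuid4()), document_id, cid, text, run_id) for cid, text in entities],
        )
        conn.execute(
            "UPDATE extraction_runs SET completed_at = ?, entity_count = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), len(entities), run_id),
        )
    return run_id


def current_entities(conn, document_id):
    """Dedup at query time: one entity per canonical_id, newest run first.

    Legacy rows (extraction_run_id IS NULL) sort last, so they only surface
    when no run has produced that canonical_id.
    """
    rows = conn.execute(
        "SELECT e.canonical_id, e.surface_text, e.extraction_run_id "
        "FROM entities e LEFT JOIN extraction_runs r ON r.id = e.extraction_run_id "
        "WHERE e.document_id = ? "
        "ORDER BY r.started_at IS NULL, r.started_at DESC",
        (document_id,),
    )
    picked = {}
    for canonical_id, text, run_id in rows:
        picked.setdefault(canonical_id, (text, run_id))  # first seen = newest
    return picked
```

Because every run's rows stay in `entities`, the same table can later answer "what did the previous model say?" by filtering on a specific `extraction_run_id` instead of taking the newest.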

## UI / CLI changes

- `openfoia analyze runs <doc-id>`: list runs for a document
- `openfoia analyze extract --prefer-run <run-id>`: pick which run the graph/crossref uses
- Graph view: show badge for backend/model per entity
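
As a rough illustration of the behaviour behind these three items, the sketch below models run listing, `--prefer-run` resolution, and the per-entity badge as plain functions. Everything here is hypothetical: the `Run` and `Entity` dataclasses, the function names, the output format, and the fallback to the newest run when no run is preferred are assumptions, not the existing CLI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    id: str
    backend: str
    model: str
    started_at: str  # ISO 8601 in one consistent UTC format, so string order = time order
    entity_count: int


@dataclass(frozen=True)
class Entity:
    canonical_id: str
    text: str
    extraction_run_id: Optional[str]  # None = legacy row


def list_runs(runs):
    """Lines for a hypothetical `analyze runs <doc-id>` listing, newest first."""
    ordered = sorted(runs, key=lambda r: r.started_at, reverse=True)
    return [
        f"{r.id}  {r.backend}/{r.model}  {r.started_at}  {r.entity_count} entities"
        for r in ordered
    ]


def select_entities(entities, runs, prefer_run=None):
    """Entities the graph/crossref would use.

    With prefer_run set, use exactly that run; otherwise fall back to the
    newest run, or to legacy rows if the document has no runs at all.
    """
    if prefer_run is not None:
        if prefer_run not in {r.id for r in runs}:
            raise ValueError(f"unknown run id: {prefer_run}")
        chosen = prefer_run
    elif runs:
        chosen = max(runs, key=lambda r: r.started_at).id
    else:
        chosen = None  # only legacy rows exist
    return [e for e in entities if e.extraction_run_id == chosen]


def badge(entity, runs):
    """Badge text for the graph view: backend/model, or 'legacy' without a run."""
    by_id = {r.id: r for r in runs}
    run = by_id.get(entity.extraction_run_id)
    return f"{run.backend}/{run.model}" if run else "legacy"
```

Rejecting an unknown run id outright, rather than silently falling back to the newest run, keeps a typo in `--prefer-run` from quietly changing which entities feed the graph.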

## Out of scope for this issue

- Versioning the *documents themselves* (separate concern — docs change too, e.g. re-OCR)
- Versioning *entity links* — start with just entities
