Setup stable canonical document representation #33
Description
PRD: Canonical document model (DB schema + migrations)
Problem Statement
EU Fact Force needs a stable canonical representation of a document in the database so ingestion, parsing, embeddings, search, and future features (claims, graph, trends) can share one notion of “a document.” Today ingestion centers on SourceFile (storage + status) and DocumentChunk tied only to files. That does not separate logical document identity (DOI or not, national reports, uploads) from physical file artifacts, and it does not capture where metadata came from or ingestion lineage for debugging and audit.
Without this layer, later work (DOI fetch, upload paths, parsing, chunking) risks inconsistent data and expensive refactors.
Solution
Introduce a canonical Document model as the aggregate root for bibliographic and product-facing metadata, with relational polymorphism for future source types (not Django model inheritance). Keep SourceFile as the model for stored binary assets (e.g. PDF). Add IngestionRun to record each ingestion attempt (lineage). Add ParsedArtifact for the single parse output per document (no expectation of multiple parses). Persist raw provider payloads separately from normalized fields so mappings can evolve without losing source truth.
Deliver this as database models and migrations (plus minimal admin registration for visibility). Do not fully rewire the ingestion pipeline in this change; that is follow-up work.
User Stories
- As a backend developer, I want a Document table with a required non-empty title, so that every canonical record has a human-readable label even when other metadata is incomplete.
- As a backend developer, I want Document to support optional DOI and other external identifiers, so that national reports and uploads without DOIs are first-class.
- As a backend developer, I want to create a Document with metadata only before any PDF is available, so that workflows can record bibliographic data as soon as it is known.
- As a backend developer, I want SourceFile to represent only the stored file (e.g. S3 key, status), so that file lifecycle stays separate from canonical metadata.
- As a backend developer, I want Document to link to at most one SourceFile when a file exists, so that the “metadata-only” and “file attached” states are explicit.
- As a backend developer, I want deleting a SourceFile to delete the related Document when that document is tied to that file, so that storage cleanup matches the agreed cascade semantics (see implementation decisions for nullable vs attached cases).
- As a backend developer, I want an IngestionRun row per attempt, so that I can see input type, provider, status, errors, and pipeline version for support and debugging.
- As a backend developer, I want raw provider JSON stored on an appropriate model, so that I can reprocess or audit without re-fetching from external APIs.
- As a backend developer, I want exactly one ParsedArtifact per Document (enforced in schema), so that there is no ambiguity about “which parse is current.”
- As a backend developer, I want DocumentChunk to require a Document, so that retrieval and future evidence linking are anchored on the canonical document.
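The lineage and raw-payload stories can be illustrated with a small plain-Python sketch. It is not the Django model: the field names, the status values ("pending", "succeeded", "failed"), and the payload shape `{"title": ...}` are assumptions chosen for illustration. The point it shows is that the verbatim provider response is kept next to the run, so a normalized value can be re-derived later without calling the provider again.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class IngestionRun:
    """One ingestion attempt. Field names are illustrative, not the final schema."""

    document_id: int
    input_type: str              # assumed values, e.g. "doi" or "upload"
    input_identifier: str
    provider: str
    pipeline_version: str
    source_file_id: Optional[int] = None   # optional link, set only when a file exists
    status: str = "pending"      # assumed states: pending / succeeded / failed
    error_message: str = ""
    raw_provider_payload: Optional[str] = None  # verbatim response text, never edited
    started_at: datetime = field(default_factory=_now)
    finished_at: Optional[datetime] = None

    def succeed(self, raw_payload: str) -> None:
        # Store the provider response exactly as received.
        self.raw_provider_payload = raw_payload
        self.status = "succeeded"
        self.finished_at = _now()

    def fail(self, error: Exception) -> None:
        self.status = "failed"
        self.error_message = f"{type(error).__name__}: {error}"
        self.finished_at = _now()


def normalized_title(run: IngestionRun) -> str:
    """Re-derive a normalized field from the stored payload, with no re-fetch.

    Assumes a hypothetical payload shape: {"title": "<string>"}.
    """
    payload = json.loads(run.raw_provider_payload or "{}")
    return str(payload.get("title", "")).strip()
```

If the mapping from payload to Document fields changes later, only a function like `normalized_title` needs to change; the stored payload stays untouched.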
Relationship to draft research catalog model
A separate draft shared on Mattermost (shown above) sketches entities such as ResearchPaper, Author, reference tables for document type and evidence hierarchy, Theme, Keywords, and chunk-level embedding/citation concepts. That draft is not final and describes a broader research-catalog and taxonomy layer than this work.
The codebase uses Document as the canonical entity name instead of ResearchPaper. This is a deliberate choice for clarity and consistency with the ingestion roadmap; it does not change the intended role of the table as the logical “paper” or publication record.
Scope: This PRD remains limited to the ingestion spine (canonical document, stored file, lineage, raw provider payload, single parse artifact, chunks). Normalized catalog concerns—including DocType, HierarchyOfEvidence, Journal, Author, Keywords, and theme assignment—are out of scope here and are expected in follow-up work once stable document and chunk identifiers exist.
Themes: The draft shows a single topic/theme identifier per paper; the product direction is to support many themes per document when applicable (e.g. a many-to-many relationship in a later schema). This PRD does not implement theme tables or links.
Implementation Decisions
- Aggregate root: Document is the canonical entity for bibliographic/product metadata; SourceFile remains a physical artifact.
- Polymorphism: Use relational modeling (typed fields / related tables later), not multi-table Django inheritance for SourceFile subclasses.
- Identifiers: DOI is optional; support additional IDs via a structured field (e.g. JSON) for provider-specific keys (PMID, arXiv, internal ids).
- Title: Required, non-null and non-blank at the database level.
- Partial ingest: Allow Document rows with title and other allowed fields before any SourceFile is given; attachment is optional until the file exists.
- Lineage: Include IngestionRun with fields appropriate to: link to Document, optional link to SourceFile when relevant, input type, input identifier, provider, status, error message, pipeline version, timestamps.
- Raw provider payload: Store verbatim provider response (e.g. JSON) on a dedicated field or related model scoped to the run or acquisition step, not mixed into normalized Document fields as the only copy of truth.
- Parsed output: Introduce ParsedArtifact with a 1:1 relationship to Document (enforced uniqueness), reflecting no expectation of multiple parses; re-parse in the future would replace the same logical row or require a follow-up PRD.
- Chunks: DocumentChunk must have a required foreign key to Document; adjust or retain linkage to SourceFile / parse artifact as needed for migration continuity and provenance (exact FK graph to be reflected in migrations).
- Cascade deletion: deleting SourceFile deletes the associated Document when the document is defined as dependent on that file, implemented with Django on_delete and a nullable FK where metadata-only documents exist without a file.
- Constraints: Strict nullability and uniqueness rules in migrations (e.g. partial unique constraint on DOI when non-empty, if that remains the rule).
- Admin: Light registration for new models; richer admin later.
- Scope boundary: Models and migrations only for this PRD; ingestion services, views, embedding, and chunking do not need to be fully migrated in this deliverable, but migrations should remain applicable to existing deployments.
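To make the constraint decisions above concrete, here is a minimal schema sketch using Python's standard `sqlite3` module. It is not the project's Django migration: the table names, column names, and helpers (`make_db`, `is_rejected`) are hypothetical, and the real models would express the same rules through Django fields and constraints. It shows a non-blank title check, a partial unique index on DOI, a nullable one-to-one link to the stored file, one parse per document, and a required document on every chunk.

```python
import sqlite3

# Hypothetical table and column names; only the constraint shapes matter here.
SCHEMA = """
CREATE TABLE source_file (
    id     INTEGER PRIMARY KEY,
    s3_key TEXT NOT NULL
);
CREATE TABLE document (
    id             INTEGER PRIMARY KEY,
    -- trim() strips spaces only; a stricter blank check may be needed in practice
    title          TEXT NOT NULL CHECK (length(trim(title)) > 0),
    doi            TEXT,                              -- optional
    external_ids   TEXT NOT NULL DEFAULT '{}',        -- JSON text: PMID, arXiv, ...
    -- nullable: metadata-only documents have no file; UNIQUE: at most one file
    source_file_id INTEGER UNIQUE REFERENCES source_file(id) ON DELETE CASCADE
);
-- DOI must be unique only when it is present and non-empty
CREATE UNIQUE INDEX document_doi_unique
    ON document (doi) WHERE doi IS NOT NULL AND doi <> '';
CREATE TABLE parsed_artifact (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL UNIQUE REFERENCES document(id),  -- 1:1
    content     TEXT NOT NULL
);
CREATE TABLE document_chunk (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),         -- required
    text        TEXT NOT NULL
);
"""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def is_rejected(conn: sqlite3.Connection, sql: str, params=()) -> bool:
    """Return True if the statement violates a schema constraint."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False
```

With this schema, two documents without a DOI can coexist, while a second row with the same non-empty DOI, a blank title, a second parse for one document, or a chunk without a document is rejected by the database itself.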
Testing Decisions
- Good tests assert observable database behavior: constraints (unique DOI when present), required fields (title), cascade behaviour, and relationship cardinality (one parse per document), not internal implementation details of helpers.
- Modules to test: Model-level behaviour via Django’s ORM and migrations (integration-style tests in the existing test suite pattern).
- Existing ingestion tests for models, services, and pipeline runs live in the repository’s test layout; new tests should follow the same pytest + django_db patterns.
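As an illustration of testing observable database behavior, the sketch below checks the cascade rule with pytest-style test functions. It is a stand-in built on the standard `sqlite3` module with hypothetical table names; the project's tests would exercise the Django ORM under the `django_db` pattern instead. Each test only inspects which rows exist afterwards, not how the deletion is implemented.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in schema; table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE in SQLite
    conn.executescript(
        """
        CREATE TABLE source_file (id INTEGER PRIMARY KEY, s3_key TEXT NOT NULL);
        CREATE TABLE document (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            source_file_id INTEGER UNIQUE
                REFERENCES source_file(id) ON DELETE CASCADE
        );
        """
    )
    return conn


def add_file(conn: sqlite3.Connection, key: str) -> int:
    return conn.execute("INSERT INTO source_file (s3_key) VALUES (?)", (key,)).lastrowid


def document_titles(conn: sqlite3.Connection) -> list:
    return [row[0] for row in conn.execute("SELECT title FROM document ORDER BY id")]


def test_deleting_source_file_deletes_attached_document():
    conn = make_db()
    file_id = add_file(conn, "docs/a.pdf")
    conn.execute(
        "INSERT INTO document (title, source_file_id) VALUES (?, ?)",
        ("With file", file_id),
    )
    conn.execute("DELETE FROM source_file WHERE id = ?", (file_id,))
    assert document_titles(conn) == []


def test_metadata_only_document_survives_file_deletion():
    conn = make_db()
    file_id = add_file(conn, "docs/b.pdf")
    conn.execute("INSERT INTO document (title) VALUES (?)", ("Metadata only",))
    conn.execute(
        "INSERT INTO document (title, source_file_id) VALUES (?, ?)",
        ("With file", file_id),
    )
    conn.execute("DELETE FROM source_file WHERE id = ?", (file_id,))
    assert document_titles(conn) == ["Metadata only"]
```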
Out of Scope
- Rewriting run_pipeline, fetch stubs, or upload flows to use Document end-to-end.
- Changing search, embedding, or chunking algorithms or APIs.
- API contract changes for the web app.
- Research catalog tables and links (Author, Keywords, Theme, evidence hierarchy, journal normalization, many-to-many themes); see Relationship to draft research catalog model.
Further Notes
- Align naming and relationships with the internal roadmap document that describes Document, IngestionRun, raw assets, parsed artifacts, and chunks as the ingestion spine.
- If a future requirement introduces re-parsing with history, the 1:1 ParsedArtifact rule would need revisiting; current expectation is single parse per document over the tool’s lifetime.
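One possible reading of "replace the same logical row" under the 1:1 rule is an upsert keyed on the document. The sketch below uses the standard `sqlite3` module with hypothetical table names and a hypothetical `save_parse` helper; it assumes SQLite 3.24 or newer for the `ON CONFLICT ... DO UPDATE` clause. It is not the project's implementation, only an illustration of how a re-parse can overwrite content without creating a second row.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in schema; names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE parsed_artifact (
            id          INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL UNIQUE REFERENCES document(id),
            content     TEXT NOT NULL
        );
        """
    )
    return conn


def save_parse(conn: sqlite3.Connection, document_id: int, content: str) -> None:
    """Insert the parse output, or overwrite the existing row for this document."""
    conn.execute(
        """
        INSERT INTO parsed_artifact (document_id, content) VALUES (?, ?)
        ON CONFLICT (document_id) DO UPDATE SET content = excluded.content
        """,
        (document_id, content),
    )
```

Because the update happens in place, the artifact keeps its primary key, so anything that references it stays valid. Keeping parse history would need a different design, which is the case the note above defers to a follow-up.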