This document describes the post-extraction pipeline. The LLM only proposes mentions (candidate entities per chunk). Identity is resolved in Python with deterministic rules so you can debug merges without re-running the model.
Each extraction call runs on one passage. The model invents id strings so it
can wire relations and mentions inside the JSON object. Those ids are local
join keys, not stable database identifiers:
- Chunk A might emit `cheshire_cat`; chunk B might emit `cheshir_cat` or `the_cat`.
- The same story role can appear with different names (`Alice's Sister` vs `Alice sister`).
If we MERGE on LLM id, we get duplicate nodes. The loader therefore MERGEs on
canonical_id, computed in code.
We group mentions by:
- Entities: `(entity_type, normalized_merge_key(name))`
- Events: `(chapter, normalized_merge_key(name))` (chapter keeps same-named events in different chapters apart)
Then we assign:
- `char_<slug>`, `loc_<slug>`, `obj_<slug>` for entities
- `evt_<slug>_ch<n>` for events
The slug is derived from the same normalization pipeline so the merge key is obvious, testable, and grep-friendly in Neo4j Browser.
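The grouping and id assignment described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `src/kg_canonical.py`: the `PREFIX_BY_TYPE` mapping, the entity type labels, the `slugify` helper, the mention dict shape, and the simplified `normalized_merge_key` stand-in are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed mapping from entity type to id prefix; the type labels are illustrative.
PREFIX_BY_TYPE = {"Character": "char", "Location": "loc", "Object": "obj"}


def normalized_merge_key(name: str) -> str:
    """Simplified stand-in for the real normalization pipeline."""
    text = re.sub(r"[\W_]+", " ", name.lower()).strip()
    text = re.sub(r"^(?:the|a|an)\s+", "", text)  # leading article only
    return " ".join(text.split())


def slugify(merge_key: str) -> str:
    """Turn a merge key into a slug, so the slug and the merge key stay in sync."""
    return merge_key.replace(" ", "_")


def entity_canonical_id(entity_type: str, name: str) -> str:
    return f"{PREFIX_BY_TYPE[entity_type]}_{slugify(normalized_merge_key(name))}"


def event_canonical_id(chapter: int, name: str) -> str:
    # The chapter suffix keeps same-named events in different chapters apart.
    return f"evt_{slugify(normalized_merge_key(name))}_ch{chapter}"


def group_entity_mentions(mentions):
    """Group mention dicts by (entity_type, normalized_merge_key(name))."""
    groups = defaultdict(list)
    for mention in mentions:
        key = (mention["entity_type"], normalized_merge_key(mention["name"]))
        groups[key].append(mention)
    return dict(groups)


mentions = [
    {"id": "cheshire_cat", "entity_type": "Character", "name": "Cheshire Cat"},
    {"id": "the_cat", "entity_type": "Character", "name": "The Cheshire Cat"},
    {"id": "garden", "entity_type": "Location", "name": "the garden"},
]
groups = group_entity_mentions(mentions)
```

In this sketch, the two Cheshire Cat mentions land in one group even though their raw LLM ids differ, and both map to `char_cheshire_cat`.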
| Stage | Role | Module |
|---|---|---|
| A | LLM extracts JSON per chunk | src/extract_graph.py |
| B | Normalize surface strings | src/kg_normalization.py |
| C | Mentions → canonical records + remap | src/kg_canonical.py |
| D | MERGE nodes/edges in Neo4j | src/load_to_neo4j.py |
| E | Wipe DB + rerun | scripts/clear_aura_db.py, main.py, this doc |
Applied in order (tests and dedup depend on this ordering):
1. Unicode NFKC (compatibility normalization)
2. Lowercase + trim
3. Typo / alias fragment map: regex replacements with word boundaries (`TYPO_AND_ALIAS_FRAGMENTS`)
4. English possessives: `word's` → `word`, strip apostrophes
5. Strip leading articles only (`the`, `a`, `an`, …), not `the` in the middle
6. Replace non-alphanumeric runs with spaces; collapse whitespace
`Bill` vs `Bill the Lizard`: both stay separate because the inner `the` is kept, so the
strings normalize to different merge keys. To merge them you must add an explicit
rule (typo map or a future curated alias table) and accept false-merge risk if
another “Bill” appears.
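A minimal sketch of the normalization order, assuming each step is a plain string transform. The real function lives in `src/kg_normalization.py` and may differ in detail; the single `TYPO_AND_ALIAS_FRAGMENTS` entry and the exact regexes below are hypothetical examples.

```python
import re
import unicodedata

# Hypothetical sample entry; the real map in src/kg_normalization.py may differ.
TYPO_AND_ALIAS_FRAGMENTS = [
    (r"\bcheshir\b", "cheshire"),  # word boundaries avoid touching "cheshire"
]

# Assumed article list; the source list may contain more entries.
LEADING_ARTICLE = re.compile(r"^(?:the|a|an)\s+")


def normalized_merge_key(name: str) -> str:
    text = unicodedata.normalize("NFKC", name)             # 1. NFKC
    text = text.lower().strip()                             # 2. lowercase + trim
    for pattern, replacement in TYPO_AND_ALIAS_FRAGMENTS:   # 3. typo / alias map
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\b(\w+)['’]s\b", r"\1", text)           # 4. word's -> word
    text = text.replace("'", "").replace("’", "")           #    strip apostrophes
    text = LEADING_ARTICLE.sub("", text)                    # 5. leading article only
    text = re.sub(r"[\W_]+", " ", text)                     # 6. non-alphanumerics
    return " ".join(text.split())                           #    collapse whitespace
```

With this sketch, `Alice's Sister` and `Alice sister` share the key `alice sister`, while `Bill` and `Bill the Lizard` keep distinct keys because only a leading article is stripped.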
For each canonical node we store:
- `name`: display string (longest surface form among merged mentions, by default)
- `aliases`: other surface forms seen for that canonical id
- `normalized_name`: human-readable merge key for debugging
- `provenance_passage_ids`: passage ids where we saw a mention (audit trail)
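One way to picture the stored fields is a small record built from the merged mentions. The `CanonicalRecord` class, the `build_record` helper, the `(surface_name, passage_id)` input shape, and the passage id format are assumptions made for illustration; only the field names come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalRecord:
    canonical_id: str
    normalized_name: str  # merge key, kept for debugging
    name: str = ""        # display string
    aliases: list[str] = field(default_factory=list)
    provenance_passage_ids: list[str] = field(default_factory=list)


def build_record(canonical_id, normalized_name, mentions):
    """Build a record from a non-empty list of (surface_name, passage_id) pairs."""
    surfaces, passages = [], []
    for surface, passage_id in mentions:
        if surface not in surfaces:
            surfaces.append(surface)
        if passage_id not in passages:
            passages.append(passage_id)
    # Default display name: the longest surface form (first seen wins on ties).
    display = max(surfaces, key=len)
    return CanonicalRecord(
        canonical_id=canonical_id,
        normalized_name=normalized_name,
        name=display,
        aliases=[s for s in surfaces if s != display],
        provenance_passage_ids=passages,
    )


record = build_record(
    "char_cheshire_cat",
    "cheshire cat",
    [("Cheshire Cat", "p12"), ("The Cheshire Cat", "p40"), ("Cheshire Cat", "p41")],
)
```

Here `record.name` is `The Cheshire Cat`, the shorter form ends up in `aliases`, and all three passage ids are kept as the audit trail.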
- Conservative merging: similar but not identical strings may remain split (by design).
- Empty names: mentions with no `name` are isolated per raw LLM id so unrelated blank rows do not collapse into one node.
- Remap conflicts: if the same raw `id` string is reused for different canonical targets across chunks, we keep the first mapping and log a conflict (see load output).
- Event identity: scoped by chapter + name; recurring motifs across chapters stay split unless you change that policy in `kg_canonical`.
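The "first mapping wins" policy for remap conflicts might look like the sketch below. The function name `build_remap`, the tuple layout, the logger name, and the sample ids are hypothetical; the actual logic in `kg_canonical` may be organized differently.

```python
import logging

logger = logging.getLogger("kg_canonical_sketch")


def build_remap(assignments):
    """Map raw LLM ids to canonical ids, keeping the first mapping on conflict.

    assignments: iterable of (raw_id, canonical_id, passage_id) in load order.
    Returns (remap, conflicts).
    """
    remap = {}
    conflicts = []
    for raw_id, canonical_id, passage_id in assignments:
        existing = remap.setdefault(raw_id, canonical_id)
        if existing != canonical_id:
            # Same raw id reused for a different target: keep the first, report it.
            conflicts.append((raw_id, existing, canonical_id, passage_id))
            logger.warning(
                "remap conflict: raw id %r already maps to %s; ignoring %s (passage %s)",
                raw_id, existing, canonical_id, passage_id,
            )
    return remap, conflicts


remap, conflicts = build_remap([
    ("the_cat", "char_cheshire_cat", "p12"),
    ("alice", "char_alice", "p12"),
    ("the_cat", "char_dinah", "p30"),  # raw id reused for a different cat
])
```

Returning the conflicts alongside the remap keeps the decision auditable: the loader can print them in its output rather than silently dropping the second target.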
- Open `src/kg_normalization.py`.
- Edit `TYPO_AND_ALIAS_FRAGMENTS` with `(regex_pattern, replacement)` pairs.
- Re-run extract only if you want new LLM wording; normalization changes apply on load from existing JSON.
- Run diagnostics: `uv run python scripts/kg_dedup_stats.py --collisions --aliases --pairs`

1. Wipe the graph (keeps Aura instance; deletes all nodes/relationships in `NEO4J_DATABASE`): `uv run python scripts/clear_aura_db.py --yes`
2. (Re)apply constraints if the database is new or you are unsure indexes exist: run `cypher/schema.cypher` in Neo4j Browser (or `neo4j-admin` / `cypher-shell`).
3. Re-run the pipeline from the project root: `uv run python main.py --steps chunk,extract,load`. Use `--extract-limit N` for a smoke test. To reload without re-calling the LLM: `uv run python main.py --steps load` (loads passages from the book + extractions from `data/extractions/alice_extractions.json`).
4. Verify counts: `uv run python scripts/kg_dedup_stats.py --aura`

- `uv run python scripts/kg_dedup_stats.py`
- `uv run python scripts/kg_dedup_stats.py --aliases --pairs --collisions`
- `uv run python scripts/kg_dedup_stats.py --aura`