
Knowledge graph ingestion: mentions, normalization, canonical ids, Neo4j

This document describes the post-extraction pipeline. The LLM only proposes mentions (candidate entities per chunk). Identity is resolved in Python with deterministic rules so you can debug merges without re-running the model.


Why raw LLM id values are not trusted

Each extraction call runs on one passage. The model invents id strings so it can wire relations and mentions inside the JSON object. Those ids are local join keys, not stable database identifiers:

  • Chunk A might emit cheshire_cat; chunk B might emit cheshir_cat or the_cat.
  • The same story role can appear with different names (Alice's Sister vs Alice sister).

If we MERGE on LLM id, we get duplicate nodes. The loader therefore MERGEs on canonical_id, computed in code.
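The upsert can be sketched in Cypher (the label and property names here are assumptions for illustration, not verified against cypher/schema.cypher):

```cypher
// Sketch: idempotent upsert keyed on canonical_id, never on the raw LLM id.
// Label and property names are assumed for illustration.
MERGE (c:Character {canonical_id: $canonical_id})
SET c.name = $name,
    c.aliases = $aliases,
    c.normalized_name = $normalized_name
```

Because MERGE matches on canonical_id alone, re-running the loader updates the same node instead of creating duplicates.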


Why canonical ids come from normalized names (+ type)

We group mentions by:

  • Entities: (entity_type, normalized_merge_key(name))
  • Events: (chapter, normalized_merge_key(name)) (chapter keeps same-named events in different chapters apart)

Then we assign:

  • char_<slug>, loc_<slug>, obj_<slug> for entities
  • evt_<slug>_ch<n> for events

The slug is derived from the same normalization pipeline so the merge key is obvious, testable, and grep-friendly in Neo4j Browser.
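The id assignment can be sketched as follows (the slugify helper and PREFIX table are illustrative names, not the project's exact code; the real logic lives in src/kg_canonical.py):

```python
import re

def slugify(merge_key: str) -> str:
    """Turn a normalized merge key into an id-safe slug (hypothetical helper)."""
    return re.sub(r"\s+", "_", merge_key.strip())

# Illustrative type-to-prefix table matching the id shapes described above.
PREFIX = {"character": "char", "location": "loc", "object": "obj"}

def entity_canonical_id(entity_type: str, merge_key: str) -> str:
    return f"{PREFIX[entity_type]}_{slugify(merge_key)}"

def event_canonical_id(chapter: int, merge_key: str) -> str:
    # Chapter suffix keeps same-named events in different chapters apart.
    return f"evt_{slugify(merge_key)}_ch{chapter}"
```

Because the slug is just the merge key with spaces replaced by underscores, you can recover the key from any node id by eye in Neo4j Browser.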


Stage map (code locations)

| Stage | Role | Module |
| ----- | ---- | ------ |
| A | LLM extracts JSON per chunk | src/extract_graph.py |
| B | Normalize surface strings | src/kg_normalization.py |
| C | Mentions → canonical records + remap | src/kg_canonical.py |
| D | MERGE nodes/edges in Neo4j | src/load_to_neo4j.py |
| E | Wipe DB + rerun | scripts/clear_aura_db.py, main.py, this doc |

Normalization rules (kg_normalization.normalized_merge_key)

Applied in order (tests and dedup depend on this ordering):

  1. Unicode NFKC (compatibility normalization)
  2. Lowercase + trim
  3. Typo / alias fragment map: word-boundary regex replacements (TYPO_AND_ALIAS_FRAGMENTS)
  4. English possessives: word's → word; strip remaining apostrophes
  5. Strip leading articles only (the, a, an, …); articles inside a name are kept
  6. Replace non-alphanumeric runs with spaces; collapse whitespace

Bill vs Bill the Lizard: the two stay separate because the inner the is kept, so the strings normalize to different merge keys. To merge them you must add an explicit rule (typo map or a future curated alias table) and accept the false-merge risk if another “Bill” appears.
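A minimal sketch of the six ordered rules (with a one-entry illustrative fragment map; the real implementation is kg_normalization.normalized_merge_key and its TYPO_AND_ALIAS_FRAGMENTS table):

```python
import re
import unicodedata

# Illustrative fragment map; the real table is TYPO_AND_ALIAS_FRAGMENTS.
_FRAGMENTS = [(r"\bcheshir\b", "cheshire")]
_LEADING_ARTICLE = re.compile(r"^(?:the|a|an)\s+")

def normalized_merge_key(name: str) -> str:
    """Sketch of the six ordered rules; not the project's exact code."""
    s = unicodedata.normalize("NFKC", name)       # 1. NFKC
    s = s.lower().strip()                         # 2. lowercase + trim
    for pattern, repl in _FRAGMENTS:              # 3. typo/alias fragments
        s = re.sub(pattern, repl, s)
    s = re.sub(r"'s\b", "", s).replace("'", "")   # 4. possessives, apostrophes
    s = _LEADING_ARTICLE.sub("", s)               # 5. leading article only
    s = re.sub(r"[^0-9a-z]+", " ", s)             # 6. non-alnum runs -> spaces
    return re.sub(r"\s+", " ", s).strip()
```

Under this sketch, "The Cheshir Cat" and "Cheshire Cat's" both normalize to cheshire cat, while "Bill" and "Bill the Lizard" keep distinct keys because only the leading article is stripped.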


Aliases and provenance

For each canonical node we store:

  • name — display string (longest surface form among merged mentions, by default)
  • aliases — other surface forms seen for that canonical id
  • normalized_name — human-readable merge key for debugging
  • provenance_passage_ids — passage ids where we saw a mention (audit trail)
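Building that record from merged mentions might look like this sketch (the mention dict shape is an assumption based on the fields above, not the exact project schema):

```python
def build_canonical_record(canonical_id: str, mentions: list[dict]) -> dict:
    """Fold merged mentions into one node payload (sketch).

    Each mention is assumed to carry `name`, `normalized_name`, `passage_id`.
    """
    surfaces = [m["name"] for m in mentions]
    display = max(surfaces, key=len)  # longest surface form wins by default
    return {
        "canonical_id": canonical_id,
        "name": display,
        "aliases": sorted(set(surfaces) - {display}),
        "normalized_name": mentions[0]["normalized_name"],
        "provenance_passage_ids": sorted({m["passage_id"] for m in mentions}),
    }
```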

Current limitations

  • Conservative merging: similar but not identical strings may remain split (by design).
  • Empty names: mentions with no name are isolated per raw LLM id so unrelated blank rows do not collapse into one node.
  • Remap conflicts: if the same raw id string is reused for different canonical targets across chunks, we keep the first mapping and log a conflict (see load output).
  • Event identity: scoped by chapter + name; recurring motifs across chapters stay split unless you change that policy in kg_canonical.
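The first-wins policy from the remap-conflicts bullet can be sketched as (function name and pair shape are hypothetical):

```python
import logging

logger = logging.getLogger("kg_canonical")

def remap_raw_ids(pairs):
    """First-wins remap of raw LLM ids to canonical ids (policy sketch).

    `pairs` is an iterable of (raw_id, canonical_id). A later conflicting
    mapping for the same raw id is logged and dropped, keeping the first.
    """
    remap, conflicts = {}, []
    for raw_id, canonical_id in pairs:
        if raw_id in remap and remap[raw_id] != canonical_id:
            conflicts.append((raw_id, remap[raw_id], canonical_id))
            logger.warning("remap conflict: %s -> %s (kept %s)",
                           raw_id, canonical_id, remap[raw_id])
            continue
        remap[raw_id] = canonical_id
    return remap, conflicts
```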

How to extend typo and alias rules

  1. Open src/kg_normalization.py.
  2. Edit TYPO_AND_ALIAS_FRAGMENTS with (regex_pattern, replacement) pairs.
  3. Re-run extract only if you want new LLM wording; normalization changes apply on load from existing JSON.
  4. Run diagnostics:
uv run python scripts/kg_dedup_stats.py --collisions --aliases --pairs
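For step 2, each entry is a word-boundary-anchored (regex_pattern, replacement) pair. The fragments below are hypothetical examples of the shape, not the project's actual table:

```python
import re

# Hypothetical entries in the style of TYPO_AND_ALIAS_FRAGMENTS:
# word-boundary anchors keep a fix from bleeding into longer words.
FRAGMENTS = [
    (r"\bcheshir\b", "cheshire"),   # typo fix
    (r"\bwht\b", "white"),          # abbreviation fix (illustrative)
]

def apply_fragments(s: str) -> str:
    for pattern, replacement in FRAGMENTS:
        s = re.sub(pattern, replacement, s)
    return s
```

Because rules are applied in order on already-lowercased text, write patterns in lowercase and test them with the diagnostics command above before reloading.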

Safe reset and full rebuild (Stage E)

1. Wipe the graph (keeps Aura instance; deletes all nodes/relationships in NEO4J_DATABASE):

uv run python scripts/clear_aura_db.py --yes

2. (Re)apply constraints if the database is new or you are unsure indexes exist:

Run cypher/schema.cypher in Neo4j Browser (or neo4j-admin / cypher-shell).

3. Re-run the pipeline from the project root:

uv run python main.py --steps chunk,extract,load

Use --extract-limit N for a smoke test. To reload without re-calling the LLM:

uv run python main.py --steps load

(Loads passages from the book + extractions from data/extractions/alice_extractions.json.)

4. Verify counts:

uv run python scripts/kg_dedup_stats.py --aura

Diagnostics commands (summary)

uv run python scripts/kg_dedup_stats.py
uv run python scripts/kg_dedup_stats.py --aliases --pairs --collisions
uv run python scripts/kg_dedup_stats.py --aura