
Knowledge graph ingestion: mentions, normalization, canonical ids, Neo4j

This document describes the post-extraction pipeline. The LLM only proposes mentions (candidate entities per chunk). Identity is resolved in Python with deterministic rules so you can debug merges without re-running the model.


Why raw LLM id values are not trusted

Each extraction call runs on one passage. The model invents id strings so it can wire relations and mentions inside the JSON object. Those ids are local join keys, not stable database identifiers:

  • Chunk A might emit cheshire_cat; chunk B might emit cheshir_cat or the_cat.
  • The same story role can appear with different names (Alice's Sister vs Alice sister).

If we MERGE on LLM id, we get duplicate nodes. The loader therefore MERGEs on canonical_id, computed in code.
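The upsert can be sketched in Cypher (the label and property names here are assumptions for illustration, not verified against cypher/schema.cypher):

```cypher
// Sketch: idempotent upsert keyed on canonical_id, never on the raw LLM id.
// Label and property names are assumed for illustration.
MERGE (c:Character {canonical_id: $canonical_id})
SET c.name = $name,
    c.aliases = $aliases,
    c.normalized_name = $normalized_name
```

Because MERGE matches on canonical_id alone, re-running the loader updates the same node instead of creating duplicates.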


Why canonical ids come from normalized names (+ type)

We group mentions by:

  • Entities: (entity_type, normalized_merge_key(name))
  • Events: (chapter, normalized_merge_key(name)) (chapter keeps same-named events in different chapters apart)

Then we assign:

  • char_<slug>, loc_<slug>, obj_<slug> for entities
  • evt_<slug>_ch<n> for events

The slug is derived from the same normalization pipeline so the merge key is obvious, testable, and grep-friendly in Neo4j Browser.
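The id assignment can be sketched as follows (the slugify helper and PREFIX table are illustrative names, not the project's exact code; the real logic lives in src/kg_canonical.py):

```python
import re

def slugify(merge_key: str) -> str:
    """Turn a normalized merge key into an id-safe slug (hypothetical helper)."""
    return re.sub(r"\s+", "_", merge_key.strip())

# Illustrative type-to-prefix table matching the id shapes described above.
PREFIX = {"character": "char", "location": "loc", "object": "obj"}

def entity_canonical_id(entity_type: str, merge_key: str) -> str:
    return f"{PREFIX[entity_type]}_{slugify(merge_key)}"

def event_canonical_id(chapter: int, merge_key: str) -> str:
    # Chapter suffix keeps same-named events in different chapters apart.
    return f"evt_{slugify(merge_key)}_ch{chapter}"
```

Because the slug is just the merge key with spaces replaced by underscores, you can recover the key from any node id by eye in Neo4j Browser.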


Stage map (code locations)

| Stage | Role | Module |
| ----- | ---- | ------ |
| A | LLM extracts JSON per chunk | src/extract_graph.py |
| B | Normalize surface strings | src/kg_normalization.py |
| C | Mentions → canonical records + remap | src/kg_canonical.py |
| D | MERGE nodes/edges in Neo4j | src/load_to_neo4j.py |
| E | Wipe DB + rerun | scripts/clear_aura_db.py, main.py, this doc |

Normalization rules (kg_normalization.normalized_merge_key)

Applied in order (tests and dedup depend on this ordering):

  1. Unicode NFKC (compatibility normalization)
  2. Lowercase + trim
  3. Typo / alias fragment map: word-boundary regex replacements (TYPO_AND_ALIAS_FRAGMENTS)
  4. English possessives: word's → word; strip remaining apostrophes
  5. Strip leading articles only (the, a, an, …); articles inside a name are kept
  6. Replace non-alphanumeric runs with spaces; collapse whitespace

Bill vs Bill the Lizard: the two stay separate because the inner the is kept, so the strings normalize to different merge keys. To merge them you must add an explicit rule (typo map or a future curated alias table) and accept the false-merge risk if another “Bill” appears.
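A minimal sketch of the six ordered rules (with a one-entry illustrative fragment map; the real implementation is kg_normalization.normalized_merge_key and its TYPO_AND_ALIAS_FRAGMENTS table):

```python
import re
import unicodedata

# Illustrative fragment map; the real table is TYPO_AND_ALIAS_FRAGMENTS.
_FRAGMENTS = [(r"\bcheshir\b", "cheshire")]
_LEADING_ARTICLE = re.compile(r"^(?:the|a|an)\s+")

def normalized_merge_key(name: str) -> str:
    """Sketch of the six ordered rules; not the project's exact code."""
    s = unicodedata.normalize("NFKC", name)       # 1. NFKC
    s = s.lower().strip()                         # 2. lowercase + trim
    for pattern, repl in _FRAGMENTS:              # 3. typo/alias fragments
        s = re.sub(pattern, repl, s)
    s = re.sub(r"'s\b", "", s).replace("'", "")   # 4. possessives, apostrophes
    s = _LEADING_ARTICLE.sub("", s)               # 5. leading article only
    s = re.sub(r"[^0-9a-z]+", " ", s)             # 6. non-alnum runs -> spaces
    return re.sub(r"\s+", " ", s).strip()
```

Under this sketch, "The Cheshir Cat" and "Cheshire Cat's" both normalize to cheshire cat, while "Bill" and "Bill the Lizard" keep distinct keys because only the leading article is stripped.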


Aliases and provenance

For each canonical node we store:

  • name — display string (longest surface form among merged mentions, by default)
  • aliases — other surface forms seen for that canonical id
  • normalized_name — human-readable merge key for debugging
  • provenance_passage_ids — passage ids where we saw a mention (audit trail)
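Building that record from merged mentions might look like this sketch (the mention dict shape is an assumption based on the fields above, not the exact project schema):

```python
def build_canonical_record(canonical_id: str, mentions: list[dict]) -> dict:
    """Fold merged mentions into one node payload (sketch).

    Each mention is assumed to carry `name`, `normalized_name`, `passage_id`.
    """
    surfaces = [m["name"] for m in mentions]
    display = max(surfaces, key=len)  # longest surface form wins by default
    return {
        "canonical_id": canonical_id,
        "name": display,
        "aliases": sorted(set(surfaces) - {display}),
        "normalized_name": mentions[0]["normalized_name"],
        "provenance_passage_ids": sorted({m["passage_id"] for m in mentions}),
    }
```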

Current limitations

  • Conservative merging: similar but not identical strings may remain split (by design).
  • Empty names: mentions with no name are isolated per raw LLM id so unrelated blank rows do not collapse into one node.
  • Remap conflicts: if the same raw id string is reused for different canonical targets across chunks, we keep the first mapping and log a conflict (see load output).
  • Event identity: scoped by chapter + name; recurring motifs across chapters stay split unless you change that policy in kg_canonical.
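The first-wins policy from the remap-conflicts bullet can be sketched as (function name and pair shape are hypothetical):

```python
import logging

logger = logging.getLogger("kg_canonical")

def remap_raw_ids(pairs):
    """First-wins remap of raw LLM ids to canonical ids (policy sketch).

    `pairs` is an iterable of (raw_id, canonical_id). A later conflicting
    mapping for the same raw id is logged and dropped, keeping the first.
    """
    remap, conflicts = {}, []
    for raw_id, canonical_id in pairs:
        if raw_id in remap and remap[raw_id] != canonical_id:
            conflicts.append((raw_id, remap[raw_id], canonical_id))
            logger.warning("remap conflict: %s -> %s (kept %s)",
                           raw_id, canonical_id, remap[raw_id])
            continue
        remap[raw_id] = canonical_id
    return remap, conflicts
```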

How to extend typo and alias rules

  1. Open src/kg_normalization.py.
  2. Edit TYPO_AND_ALIAS_FRAGMENTS with (regex_pattern, replacement) pairs.
  3. Re-run extract only if you want new LLM wording; normalization changes apply on load from existing JSON.
  4. Run diagnostics:
uv run python scripts/kg_dedup_stats.py --collisions --aliases --pairs
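For step 2, each entry is a word-boundary-anchored (regex_pattern, replacement) pair. The fragments below are hypothetical examples of the shape, not the project's actual table:

```python
import re

# Hypothetical entries in the style of TYPO_AND_ALIAS_FRAGMENTS:
# word-boundary anchors keep a fix from bleeding into longer words.
FRAGMENTS = [
    (r"\bcheshir\b", "cheshire"),   # typo fix
    (r"\bwht\b", "white"),          # abbreviation fix (illustrative)
]

def apply_fragments(s: str) -> str:
    for pattern, replacement in FRAGMENTS:
        s = re.sub(pattern, replacement, s)
    return s
```

Because rules are applied in order on already-lowercased text, write patterns in lowercase and test them with the diagnostics command above before reloading.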

Safe reset and full rebuild (Stage E)

1. Wipe the graph (keeps Aura instance; deletes all nodes/relationships in NEO4J_DATABASE):

uv run python scripts/clear_aura_db.py --yes

2. (Re)apply constraints if the database is new or you are unsure indexes exist:

Run cypher/schema.cypher in Neo4j Browser (or neo4j-admin / cypher-shell).

3. Re-run the pipeline from the project root:

uv run python main.py --steps chunk,extract,load

Use --extract-limit N for a smoke test. To reload without re-calling the LLM:

uv run python main.py --steps load

(Loads passages from the book + extractions from data/extractions/alice_extractions.json.)

4. Verify counts:

uv run python scripts/kg_dedup_stats.py --aura

Diagnostics commands (summary)

uv run python scripts/kg_dedup_stats.py
uv run python scripts/kg_dedup_stats.py --aliases --pairs --collisions
uv run python scripts/kg_dedup_stats.py --aura