feat(entities): gate ambiguous bare first names on corroboration (#36) #39
Merged
…#36) Bare common first names (≥4 chars, e.g. "Alexandre", "Thomas", "Jean") were auto-linked anywhere they appeared in a title or content, so any occurrence attached to whichever entity carried that name — frequently the wrong colleague, and worse for accented names whose ASCII near-twins collide. A bare single first name that is *ambiguous* (claimed by 2+ entities) now links only when its entity is corroborated in the same document — by a tag, source ref, full-name match, an exact title-part, or by appearing in the frontmatter attendees. Unambiguous bare names still link as before. Ambiguity is computed by build_first_name_index(), which accent-folds names so "Jérémy"/"Jeremy" share a bucket; folding only *widens* the ambiguity set (more conservative) and never broadens a match. The indexer threads the document's attendees and a once-built name_owners index into find_entity_mentions.
What
Stops bare common first names from auto-linking to the wrong entity.
A bare single first name (≥4 chars) that is ambiguous — claimed by 2+ entities — no longer auto-links in the participant, title, or content tiers. It links only when the entity is corroborated in the same document by a stronger signal:
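- a tag (tagged) or source-ref (source_ref) match,
- a full-name or exact title-part match (participant/title/discussed),
- or the entity appearing in the document's frontmatter attendees.

Unambiguous bare first names (owned by exactly one entity) still link as before, so existing single-name linking does not regress.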
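The gating rule above can be sketched in a few lines. This is a minimal illustration, not the PR's actual code: `Candidate`, `may_link_bare_name`, the entity ids, and the shape of `name_owners` (folded first name mapped to the set of owning entity ids) are all assumptions made for the example, and the ≥4-character filter is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical: one entity that a bare first name might link to."""
    entity_id: str
    first_name: str  # assumed to be already lowercased / folded


def may_link_bare_name(
    candidate: Candidate,
    name_owners: dict[str, set[str]],
    corroborated_ids: set[str],
) -> bool:
    """Return True if the bare first name may link to this candidate.

    name_owners: folded first name -> ids of entities carrying that name.
    corroborated_ids: entities already backed by a stronger signal in the
    same document (tag, source ref, full name, title part, attendee).
    """
    owners = name_owners.get(candidate.first_name, set())
    if len(owners) < 2:
        # Unambiguous: exactly one (or no known) owner, link as before.
        return True
    # Ambiguous: link only the entity that is corroborated in this document.
    return candidate.entity_id in corroborated_ids


# Illustrative data: two entities share "thomas", one owns "soren".
name_owners = {"thomas": {"e1", "e2"}, "soren": {"e3"}}
thomas_1 = Candidate("e1", "thomas")
thomas_2 = Candidate("e2", "thomas")
soren = Candidate("e3", "soren")
```

With no corroboration, neither "thomas" candidate links; once `e1` is corroborated (for example via the attendees list), only `e1` links, while "soren" links regardless.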
Accent-folding
Ambiguity is computed by the new build_first_name_index(), which accent-folds names (Jérémy→jeremy) so an accented name and its ASCII near-twin land in the same ambiguity bucket. Folding is used only to widen the ambiguity set (making the matcher more conservative); it never broadens a match. So a bare ASCII "Jeremy" that could be either "Jérémy Cotineau" or "Jeremy Brown" is gated until one of them is corroborated. This directly addresses the issue's note that folding "should probably not, on its own, be enough to auto-link a bare first name."
Wiring
The indexer builds the name_owners index once and threads it plus the document's frontmatter attendees into find_entity_mentions.
Why
Reported by a Chief-of-Staff agent seeing common (incl. French) first names auto-matched to the wrong colleague — and, because mentions had no un-link until #35, two distinct people could end up looking like one identity.
Design note
The fix gates ambiguous bare names only — not all common names — because the existing test contract (and real usage) relies on unambiguous single names like "Soren"/"Wren" linking from content alone. Gating every bare name would mass-under-link. "Ambiguity + corroboration" is the targeted, test-consistent reading of the reported failure ("wrong colleague" / merged identity both require ≥2 candidates).
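To make the "ambiguity" half of this rule concrete, here is one way an accent-folded first-name index could be built. It is a sketch under stated assumptions, not the PR's build_first_name_index(): the function names, the `{entity_id: full_name}` input shape, the entity ids, and the surname "Holm" are hypothetical, and folding is done with NFKD decomposition from the standard library.

```python
from __future__ import annotations

import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Strip combining accents and case: 'Jérémy' -> 'jeremy'."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def build_first_name_index_sketch(entities: dict[str, str]) -> dict[str, set[str]]:
    """Map each folded first name to the ids of the entities carrying it.

    entities: hypothetical {entity_id: full_name} mapping.
    """
    owners: dict[str, set[str]] = defaultdict(set)
    for entity_id, full_name in entities.items():
        parts = full_name.split()
        if parts:
            owners[fold(parts[0])].add(entity_id)
    return dict(owners)


def ambiguous_first_names(owners: dict[str, set[str]]) -> set[str]:
    """A first name is ambiguous when two or more entities claim it."""
    return {name for name, ids in owners.items() if len(ids) >= 2}


entities = {
    "e1": "Jérémy Cotineau",
    "e2": "Jeremy Brown",
    "e3": "Soren Holm",  # surname is illustrative only
}
owners = build_first_name_index_sketch(entities)
ambiguous = ambiguous_first_names(owners)
```

Here the accented and ASCII spellings share the "jeremy" bucket, so that name is gated, while "soren" has a single owner and keeps linking from content alone. The folded key is used only to decide ambiguity, never to match text.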
Tests
- test_entities.py::TestFirstNameDisambiguation: ambiguity index (incl. accent-fold), ambiguous bare name not linked, rescued by full-name / attendee corroboration, accent-twin gating, title gating, unambiguous still links.
- test_disambiguation.py: full index_all integration. Bare ambiguous name not linked; full-name and frontmatter-attendee corroboration each rescue exactly one entity (verifies the indexer threads attendees + name_owners).

Docs: docs/entities.md §Disambiguation → "Bare ambiguous first names (#36)".

Closes #36.