feat(entities): gate ambiguous bare first names on corroboration (#36) #39

Merged
tenfourty merged 1 commit into main from feat/first-name-disambiguation
Jun 25, 2026

@tenfourty

Owner

What

Stops bare common first names from auto-linking to the wrong entity.

A bare single first name (≥4 chars) that is ambiguous — claimed by 2+ entities — no longer auto-links in the participant, title, or content tiers. It links only when the entity is corroborated in the same document by a stronger signal:

  • a tag (tagged) or source-ref (source_ref) match,
  • a multi-word / full-name match (participant / title / discussed),
  • an exact title-part match, or
  • appearing in the frontmatter attendees.

Unambiguous bare first names (owned by exactly one entity) still link as before — no regression to existing single-name linking.
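
The gating rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's code: `Candidate`, `resolve_bare_name`, and the signal labels `full_name`, `title_part`, and `attendee` are hypothetical names chosen here; only `tagged` and `source_ref` appear in the description.

```python
from dataclasses import dataclass, field

# Signals treated as stronger than a bare first name (labels are assumptions).
STRONG_SIGNALS = frozenset(
    {"tagged", "source_ref", "full_name", "title_part", "attendee"}
)


@dataclass
class Candidate:
    """An entity plus the signals observed for it in one document."""
    entity_id: str
    signals: set[str] = field(default_factory=set)


def resolve_bare_name(owners: set[str], candidates: list[Candidate]) -> list[str]:
    """Return the entity ids a bare first name may link to in one document."""
    if len(owners) <= 1:
        # Unambiguous: a single owner links from the bare name alone.
        return sorted(owners)
    # Ambiguous: keep only owners corroborated by a stronger signal.
    return sorted(
        c.entity_id
        for c in candidates
        if c.entity_id in owners and c.signals & STRONG_SIGNALS
    )


owners = {"jeremy-brown", "jeremy-cotineau"}
uncorroborated = [Candidate("jeremy-brown"), Candidate("jeremy-cotineau")]
print(resolve_bare_name(owners, uncorroborated))  # []

corroborated = [Candidate("jeremy-brown"), Candidate("jeremy-cotineau", {"attendee"})]
print(resolve_bare_name(owners, corroborated))  # ['jeremy-cotineau']
```

Returning an empty list for the uncorroborated case reflects the stated intent: when nothing disambiguates, the matcher declines to link rather than guessing.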

Accent-folding

Ambiguity is computed by the new build_first_name_index(), which accent-folds names (Jérémy → jeremy) so an accented name and its ASCII near-twin land in the same ambiguity bucket. Folding is used only to widen the ambiguity set (making the matcher more conservative); it never broadens a match. So a bare ASCII "Jeremy" that could be either "Jérémy Cotineau" or "Jeremy Brown" is gated until one of them is corroborated. This directly addresses the issue's note that folding "should probably not, on its own, be enough to auto-link a bare first name."
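
A minimal sketch of how such an index could be built, assuming entities are available as a mapping of id to full name. The real build_first_name_index() signature is not shown in this PR, so `fold`, `build_first_name_index_sketch`, `is_ambiguous`, and the entity ids below are hypothetical stand-ins. Folding uses the standard-library `unicodedata` module: decompose to NFKD, drop combining marks, then casefold.

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Strip combining accents and casefold, e.g. 'Jérémy' becomes 'jeremy'."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def build_first_name_index_sketch(entities: dict[str, str]) -> dict[str, set[str]]:
    """Map each folded first name (4+ chars) to the entity ids that own it."""
    owners: dict[str, set[str]] = defaultdict(set)
    for entity_id, full_name in entities.items():
        parts = full_name.split()
        if parts and len(parts[0]) >= 4:
            owners[fold(parts[0])].add(entity_id)
    return dict(owners)


def is_ambiguous(owners: dict[str, set[str]], bare_name: str) -> bool:
    """A bare name is ambiguous when 2+ entities share its folded form."""
    return len(owners.get(fold(bare_name), set())) >= 2


# Hypothetical entity table; "Soren Lund" is an invented unambiguous example.
entities = {
    "jeremy-cotineau": "Jérémy Cotineau",
    "jeremy-brown": "Jeremy Brown",
    "soren-lund": "Soren Lund",
}
index = build_first_name_index_sketch(entities)
print(is_ambiguous(index, "Jeremy"))  # True
print(is_ambiguous(index, "Soren"))   # False
```

The folded key is used only to decide whether a name is ambiguous. Nothing here uses it to match text against an entity, which is what keeps folding from broadening a match.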

Wiring

The indexer builds the name_owners index once and threads it plus the document's frontmatter attendees into find_entity_mentions.
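
The wiring might look roughly like the sketch below: the index is built once per run and passed, together with each document's attendees, into the per-document matcher. All function names, the document shape (`path`, `body`, `attendees`), and the surnames are assumptions made for illustration. The real find_entity_mentions and index_all signatures are not shown in this PR, and accent-folding is left out to keep the example short.

```python
from collections import defaultdict


def build_name_owners_sketch(entities: dict[str, str]) -> dict[str, set[str]]:
    """Hypothetical stand-in: lowercase first name -> owning entity ids."""
    owners: dict[str, set[str]] = defaultdict(set)
    for entity_id, full_name in entities.items():
        owners[full_name.split()[0].lower()].add(entity_id)
    return dict(owners)


def find_mentions_sketch(body, attendees, name_owners, entities):
    """Link a bare first name if unambiguous, or if its entity is an attendee."""
    linked: set[str] = set()
    words = {w.strip(".,;:!?").lower() for w in body.split()}
    for word in words:
        owners = name_owners.get(word, set())
        if len(owners) == 1:
            linked |= owners
        else:
            # Ambiguous (or unknown): require attendee corroboration.
            linked |= {e for e in owners if entities[e] in attendees}
    return sorted(linked)


def index_all_sketch(documents, entities):
    name_owners = build_name_owners_sketch(entities)  # built once, reused per document
    return {
        doc["path"]: find_mentions_sketch(
            doc["body"], doc.get("attendees", []), name_owners, entities
        )
        for doc in documents
    }


# Hypothetical data: two entities share "Anders", one owns "Wren".
entities = {"anders-berg": "Anders Berg", "anders-holm": "Anders Holm", "wren-ito": "Wren Ito"}
documents = [
    {"path": "a.md", "body": "Anders and Wren met.", "attendees": []},
    {"path": "b.md", "body": "Sync with Anders.", "attendees": ["Anders Holm"]},
]
print(index_all_sketch(documents, entities))
# {'a.md': ['wren-ito'], 'b.md': ['anders-holm']}
```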

Why

Reported by a Chief-of-Staff agent seeing common (incl. French) first names auto-matched to the wrong colleague — and, because mentions had no un-link until #35, two distinct people could end up looking like one identity.

Design note

The fix gates ambiguous bare names only — not all common names — because the existing test contract (and real usage) relies on unambiguous single names like "Soren"/"Wren" linking from content alone. Gating every bare name would mass-under-link. "Ambiguity + corroboration" is the targeted, test-consistent reading of the reported failure ("wrong colleague" / merged identity both require ≥2 candidates).

Tests

  • test_entities.py::TestFirstNameDisambiguation — ambiguity index (incl. accent-fold), ambiguous bare name not linked, rescued by full-name / attendee corroboration, accent-twin gating, title gating, unambiguous still links.
  • test_disambiguation.py — full index_all integration: bare ambiguous not linked; full-name and frontmatter-attendee corroboration each rescue exactly one entity (verifies the indexer threads attendees + name_owners).
  • Two pre-existing tests that used the ambiguous "Anders" to assert auto-linking encoded exactly the false positive this fixes. They now use unambiguous names, which preserves their threshold/substring intent under the new behaviour.

Docs: docs/entities.md §Disambiguation → "Bare ambiguous first names (#36)".

Closes #36.

…#36)

Bare common first names (≥4 chars, e.g. "Alexandre", "Thomas", "Jean") were
auto-linked anywhere they appeared in a title or content, so any occurrence
attached to whichever entity carried that name — frequently the wrong
colleague, and worse for accented names whose ASCII near-twins collide.

A bare single first name that is *ambiguous* (claimed by 2+ entities) now links
only when its entity is corroborated in the same document — by a tag, source
ref, full-name match, an exact title-part, or by appearing in the frontmatter
attendees. Unambiguous bare names still link as before. Ambiguity is computed
by build_first_name_index(), which accent-folds names so "Jérémy"/"Jeremy"
share a bucket; folding only *widens* the ambiguity set (more conservative) and
never broadens a match. The indexer threads the document's attendees and a
once-built name_owners index into find_entity_mentions.
@tenfourty merged commit 5e29f6e into main Jun 25, 2026
8 checks passed
@tenfourty deleted the feat/first-name-disambiguation branch June 25, 2026 12:23

Linked issue: feat: disambiguate bare/common first-name entity matches (incl. accent-folded collisions)
