Deduplicate overlapping entity detections #7

Merged
JuliusScheuerer merged 1 commit into main from worktree-dedup-entity-detections on Mar 25, 2026
@JuliusScheuerer
Owner

Summary

  • Add _deduplicate_overlapping() to text_handler.py — when Presidio's built-in recognizers and custom German recognizers both match the same text span (e.g., IBAN_CODE + DE_IBAN), keeps only the highest-confidence result
  • Tiebreaker for equal scores: longer span wins (more specific match)
  • All detection paths (text, PDF, API) go through detect_pii_in_text(), so this is a single-point fix

Test plan

  • make check passes (275 tests, 96.67% coverage)
  • 10 new unit tests covering: exact overlap, partial overlap, non-overlapping, equal scores, adjacent spans, three-way overlap, IBAN dedup, phone dedup, integration with detect_pii_in_text
  • text_handler.py at 100% coverage

When Presidio's built-in recognizers and custom German recognizers both
match the same text span (e.g., IBAN_CODE + DE_IBAN), the entity appeared
twice in the review panel. Add _deduplicate_overlapping() to text_handler
that keeps the highest-confidence result per character range, with span
length as tiebreaker for equal scores.
@JuliusScheuerer JuliusScheuerer merged commit 7397a6a into main Mar 25, 2026
3 checks passed
@greptile-apps
Contributor

greptile-apps Bot commented Mar 25, 2026

Greptile Summary

This PR introduces _deduplicate_overlapping() in text_handler.py to resolve the case where Presidio's built-in recognizers and custom German recognizers both fire on the same text span (e.g., IBAN_CODE + DE_IBAN). The fix is applied at the single choke-point detect_pii_in_text, so all detection paths (plain text, PDF, API) benefit automatically.

Key changes:

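  • New _deduplicate_overlapping(results): sorts candidates by (-score, -span_length), then greedily accepts each candidate that does not overlap an already accepted one. This is a greedy highest-score-first selection (similar in spirit to non-maximum suppression) rather than weighted interval scheduling proper: it does not maximize the total score of the kept spans, but it always keeps the top-scoring detection in each group of overlapping spans, which is what deduplication needs here.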
  • detect_pii_in_text pipes the score-filtered results through the new function before returning.
  • 10 new unit tests covering all meaningful edge cases (exact overlap, partial overlap, adjacent/non-overlapping, equal-score tie-break, three-way chain, integration with detect_pii_in_text).
  • CLAUDE.md referenced in custom instructions does not appear to exist in the repository, so style-guide verification could not be performed.
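
The sort-then-greedy-accept step described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the `Detection` dataclass is a hypothetical stand-in for the analyzer's result objects (only the `entity_type`, `start`, `end`, and `score` fields are assumed), the function name drops the leading underscore, and half-open character ranges are an assumption made so that adjacent spans count as non-overlapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for an analyzer result; not the project's class."""
    entity_type: str
    start: int   # inclusive character offset (assumed)
    end: int     # exclusive character offset (assumed)
    score: float


def overlaps(a: Detection, b: Detection) -> bool:
    # Half-open ranges: spans that merely touch (a.end == b.start) do not overlap.
    return a.start < b.end and b.start < a.end


def deduplicate_overlapping(results: list[Detection]) -> list[Detection]:
    if len(results) <= 1:
        return list(results)
    # Highest score first; for equal scores the longer span comes first.
    ranked = sorted(results, key=lambda r: (-r.score, -(r.end - r.start)))
    accepted: list[Detection] = []
    for candidate in ranked:
        # Each candidate is compared with every accepted span: O(n^2) overall.
        if not any(overlaps(candidate, kept) for kept in accepted):
            accepted.append(candidate)
    return accepted


detections = [
    Detection("IBAN_CODE", 10, 32, 0.85),  # built-in recognizer
    Detection("DE_IBAN", 10, 32, 0.95),    # custom recognizer, same span
    Detection("PERSON", 40, 52, 0.80),     # unrelated span
]
kept = deduplicate_overlapping(detections)
```

With this input, the lower-scoring `IBAN_CODE` match is dropped because it overlaps the already accepted `DE_IBAN` span, while `PERSON` is kept. The result comes back in score-descending order, not in document order.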

Confidence Score: 5/5

  • This PR is safe to merge; the implementation is algorithmically correct and comprehensively tested.
  • The greedy sort-by-score-then-length algorithm is the correct approach for weighted interval deduplication. All edge cases (empty, single, exact, partial, adjacent, equal-score tie-break, three-way chain, integration) are covered by tests. The behavioral change to output ordering (now score-descending instead of engine order) is inconsequential because Presidio's anonymizer sorts entities by position internally. No logic bugs, security issues, or data-loss risks were identified.
  • No files require special attention.
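
The ordering change noted above only matters to a caller that assumes document order. If such a caller existed (an assumption; the review names none), position order could be restored with a single sort after deduplication. The sketch below uses a hypothetical `Span` tuple and a hypothetical helper name, not anything from the repository.

```python
from operator import attrgetter
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical result shape, used here for illustration only."""
    entity_type: str
    start: int
    end: int
    score: float


def in_document_order(results: list[Span]) -> list[Span]:
    # Sort by start offset, then end offset, leaving the input list untouched.
    return sorted(results, key=attrgetter("start", "end"))


# Score-descending order, as returned by the deduplication step.
score_ordered = [Span("DE_IBAN", 30, 52, 0.95), Span("PERSON", 0, 12, 0.80)]
position_ordered = in_document_order(score_ordered)
```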

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/document_anonymizer/document/text_handler.py | Adds _deduplicate_overlapping() using a greedy score-then-length sort and O(n²) overlap check; wires it into detect_pii_in_text. Logic and edge-case handling are correct. |
| tests/test_document/test_text_handler.py | Ten new unit tests added covering empty list, single result, exact/partial overlap, non-overlapping, equal score tie-breaking, adjacent spans, three-way overlap, and an integration test through detect_pii_in_text. All scenarios are well-covered. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["detect_pii_in_text(engine, text)"] --> B["engine.analyze(text, language)"]
    B --> C["Filter: score >= score_threshold"]
    C --> D{"len(filtered) <= 1?"}
    D -- Yes --> G["Return filtered as-is"]
    D -- No --> E["Sort by (-score, -span_length)"]
    E --> F["Greedy accept loop\nfor each candidate:\n  if no overlap with accepted → accept"]
    F --> H["Return accepted list\n(score-desc order)"]
    G --> I["Caller: anonymize_plain_text / PDF / API"]
    H --> I

Reviews (1): Last reviewed commit: "Deduplicate overlapping entity detection..."