fix: overhaul duplicate detection scoring, add address matching, trigger after imports #22

Open
bashar-qassis wants to merge 4 commits into main from fix/duplicate-detection

@bashar-qassis
Owner

Summary

  • Fixed broken scoring formula that silently missed duplicates sharing the same email (scored 0.35, below 0.4 threshold) or phone (scored 0.25). Replaced additive scoring with max-signal + bonus approach where each signal independently qualifies: email=0.85, phone=0.75, address=0.60, name=pg_trgm similarity.
  • Added address matching on normalized line1 + postal_code (case-insensitive, trimmed).
  • Fixed email matching to be case-insensitive and filter both sides of the field pair by contact_field_type.
  • Fixed protocol matching to use LIKE 'mailto%' pattern, handling the colon inconsistency between seeded and custom-created field types.
  • Triggered duplicate detection after imports — all three import workers (MonicaApiCrawlWorker, ImportSourceWorker, ImportWorker) now enqueue DuplicateDetectionWorker on successful completion.
  • Added 20 comprehensive tests for the detection worker covering all match types, scoring, edge cases, and account isolation.
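
To make the scoring change concrete, here is a minimal Python sketch of the old additive formula next to a max-signal + bonus variant. It is not the worker's code (the project is Elixir and the real logic runs in SQL). The weights and the 0.4 threshold come from this PR; the bonus size and the rule for counting corroborating signals are assumptions, since the PR does not state them.

```python
THRESHOLD = 0.4  # candidate threshold quoted in this PR

# Per-signal scores quoted in this PR.
SIGNAL_WEIGHTS = {"email": 0.85, "phone": 0.75, "address": 0.60}

# Hypothetical: the PR does not say how large the bonus is.
BONUS_PER_EXTRA_SIGNAL = 0.05


def old_score(name_sim, email_match=False, phone_match=False):
    """Previous additive formula: name*0.4 + email*0.35 + phone*0.25."""
    return (
        name_sim * 0.4
        + (0.35 if email_match else 0.0)
        + (0.25 if phone_match else 0.0)
    )


def new_score(name_sim, email_match=False, phone_match=False, address_match=False):
    """Max-signal + bonus: the strongest signal sets the base score.

    Each additional signal at or above THRESHOLD adds a small bonus
    (assumed rule), and the result is capped at 1.0.
    """
    signals = [name_sim]  # name_sim stands in for pg_trgm similarity in [0, 1]
    if email_match:
        signals.append(SIGNAL_WEIGHTS["email"])
    if phone_match:
        signals.append(SIGNAL_WEIGHTS["phone"])
    if address_match:
        signals.append(SIGNAL_WEIGHTS["address"])

    best = max(signals)
    corroborating = max(0, sum(1 for s in signals if s >= THRESHOLD) - 1)
    return min(1.0, best + BONUS_PER_EXTRA_SIGNAL * corroborating)
```

With the old formula, a shared email and unrelated names gives `old_score(0.0, email_match=True)`, which is 0.35 and falls below the threshold. Under the sketch above, the same pair scores 0.85 from the email signal alone.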

Test plan

  • mix compile --warnings-as-errors — clean
  • mix test — 1035 tests, 0 failures
  • mix quality — format, credo, sobelow, dialyzer all pass
  • Manual: trigger Monica import → verify duplicate candidates appear afterward
  • Manual: click "Scan now" → verify contacts sharing email/phone/address are detected

…ger after imports

The duplicate detection worker had several bugs preventing it from catching
obvious duplicates:

- Scoring formula (name*0.4 + email*0.35 + phone*0.25 with threshold 0.4)
  meant contacts sharing the same email but with different names scored 0.35,
  below the threshold — silently missed.
- Email comparison was case-sensitive.
- Only one side of email/phone field pairs had its type verified.
- Address data was completely ignored.
- No import worker triggered duplicate detection after completion.

Fixes:
- Replace additive scoring with max-signal + bonus approach where each signal
  independently qualifies (email=0.85, phone=0.75, address=0.60, name=similarity)
- Add case-insensitive email matching via LOWER() fragments
- Filter both cf1 and cf2 contact_field_types in email/phone queries
- Use LIKE 'mailto%' pattern to handle protocol colon inconsistency
- Add address matching on normalized line1 + postal_code
- Enqueue DuplicateDetectionWorker after successful completion in all three
  import workers (MonicaApiCrawlWorker, ImportSourceWorker, ImportWorker)
- Add comprehensive test suite (20 tests) for the detection worker
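
The matching fixes above come down to comparing normalized keys instead of raw values. The following Python sketch illustrates that logic only. The PR implements it with SQL fragments such as LOWER() and LIKE, and the helper names here are hypothetical. Requiring both line1 and postal_code to be non-empty is also an assumption.

```python
def email_key(value):
    """Case-insensitive email comparison key (mirrors LOWER() in SQL)."""
    return value.lower()


def address_key(line1, postal_code):
    """Normalized (line1, postal_code) key: trimmed and lowercased.

    Returns None when either part is missing or blank, so incomplete
    addresses never match each other (assumed behavior).
    """
    line1 = (line1 or "").strip().lower()
    postal_code = (postal_code or "").strip().lower()
    if not line1 or not postal_code:
        return None
    return (line1, postal_code)


def is_email_protocol(protocol):
    """Prefix match equivalent to LIKE 'mailto%'.

    Accepts both 'mailto' and 'mailto:' so the colon inconsistency
    between field types does not matter.
    """
    return protocol is not None and protocol.startswith("mailto")
```

Two contacts would then count as an address match when `address_key(...)` returns the same non-None tuple for both.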
@bashar-qassis force-pushed the fix/duplicate-detection branch from 5850c3f to 38cadb8 on April 4, 2026 16:42

list_candidates now takes limit/offset opts (default 20 per page).
The LiveView loads one page at a time with a "Load more" button.
Dismiss removes the candidate from the current list without reloading.
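
A rough Python model of this paging behavior, for illustration only. The real code is a Phoenix LiveView backed by a database query. The `DuplicatesView` class and its in-memory `store` are hypothetical stand-ins, and it is an assumption that dismissing also removes the candidate from the set that later pages are drawn from.

```python
from dataclasses import dataclass, field

DEFAULT_LIMIT = 20  # default page size quoted in this PR


def list_candidates(store, limit=DEFAULT_LIMIT, offset=0):
    """Return one page of candidates (stands in for a LIMIT/OFFSET query)."""
    return store[offset:offset + limit]


@dataclass
class DuplicatesView:
    store: list                      # pending candidates, in a stable order
    shown: list = field(default_factory=list)

    def load_more(self, limit=DEFAULT_LIMIT):
        """Append the next page; return True if another page may exist."""
        page = list_candidates(self.store, limit=limit, offset=len(self.shown))
        self.shown.extend(page)
        return len(page) == limit

    def dismiss(self, candidate):
        """Drop the candidate from the store and from the visible list.

        The visible list is filtered in place of a full re-query.
        """
        self.store[:] = [c for c in self.store if c != candidate]
        self.shown = [c for c in self.shown if c != candidate]
```

Because the dismissed row leaves both the store and the visible list, using `len(self.shown)` as the next offset neither skips nor repeats a candidate in this model.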

The /contacts/duplicates route uses ContactLive.Index, not the standalone
Duplicates LiveView. Added limit/offset pagination with a "Load more" button
and optimistic dismiss (no full re-query) to match the standalone page.

Photos with the same content_hash on both contacts caused a unique
constraint violation during merge. Now deletes duplicate photos from
the non-survivor before transferring the rest, matching the pattern
used for contact_tags and activity_contacts.
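
The delete-then-transfer order can be sketched in Python as follows. This models the logic only. The dictionary-based photo records and the `merge_photos` name are hypothetical, and the merge in this PR operates on database rows.

```python
def merge_photos(survivor_photos, other_photos):
    """Split the non-survivor's photos into transferable and duplicate sets.

    A photo whose content_hash already exists on the survivor is marked
    for deletion; moving it would collide with the survivor's copy.
    Returns (photos the survivor ends up with, photos to delete).
    """
    survivor_hashes = {p["content_hash"] for p in survivor_photos}

    to_delete = [p for p in other_photos if p["content_hash"] in survivor_hashes]
    to_transfer = [p for p in other_photos if p["content_hash"] not in survivor_hashes]

    return survivor_photos + to_transfer, to_delete
```

For example, if the survivor holds a photo with hash `"a1"` and the other contact holds `"a1"` and `"b2"`, only `"b2"` is transferred and the other contact's `"a1"` copy is deleted.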

Also collapsed the merge flow from 4 steps to 3 by combining the
preview and confirm steps into a single "Review & merge" step.
From the duplicates page (contact preselected), merge is now 2 clicks
instead of 3.