Open
Labels
enhancement (New feature or request), search (Search infrastructure)
Description
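Problem
Deduplication currently relies on exact ID matching only, which has four defects:
- Cross-source ID mismatch: S2 returns a doi while arXiv returns an arxiv_id, so the same paper ends up under two keys
- Inconsistent title_hash normalization: post_init strips punctuation, but _paper_key does not
- PapersCool does not extract arxiv_id
- No fuzzy matching
Stress test: 4 sources returned 120 papers, duplicates_removed=0.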
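To illustrate the normalization and arxiv_id defects, here is a minimal sketch of one shared normalizer that both the title hash and the paper key could call, plus a small arxiv_id extractor. The names `normalize_title`, `title_hash` and `extract_arxiv_id`, the choice of SHA-1, and the example URL shape are assumptions for illustration, not the project's actual code.

```python
import hashlib
import re
import unicodedata
from typing import Optional

# New-style arXiv identifiers such as 1706.03762 or 2401.01234v2.
# Old-style IDs (e.g. hep-th/9901001) are not handled in this sketch.
_ARXIV_NEW_STYLE = re.compile(r"(\d{4}\.\d{4,5})(?:v\d+)?")


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[^\w\s]|_", " ", text)
    return " ".join(text.split())


def title_hash(title: str) -> str:
    """Hash the normalized title so every caller gets the same key."""
    return hashlib.sha1(normalize_title(title).encode("utf-8")).hexdigest()


def extract_arxiv_id(text: str) -> Optional[str]:
    """Return the first new-style arXiv ID in text, without its version suffix."""
    match = _ARXIV_NEW_STYLE.search(text)
    return match.group(1) if match else None
```

If both post_init and _paper_key went through a single function like this, titles that differ only in case or punctuation would hash to the same value.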
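Proposal
Add a new PaperDeduplicator with three matching tiers:
- Exact DOI match
- Exact arxiv_id match
- Fuzzy title match with rapidfuzz.fuzz.ratio >= 85 (fuzz.ratio returns a score from 0 to 100, so the intended 0.85 threshold corresponds to 85)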
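The three tiers could be sketched roughly as follows. This is not the final implementation: the `Paper` dataclass, the method names, and the ID-merging behaviour are assumptions, and the `difflib` fallback is only a stand-in so the sketch runs where rapidfuzz is not installed (its scores are close to, but not identical to, `fuzz.ratio`).

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

try:
    from rapidfuzz import fuzz

    def title_similarity(a: str, b: str) -> float:
        return fuzz.ratio(a, b)  # score on a 0-100 scale
except ImportError:  # stand-in when rapidfuzz is unavailable
    from difflib import SequenceMatcher

    def title_similarity(a: str, b: str) -> float:
        return 100.0 * SequenceMatcher(None, a, b).ratio()


def normalize_title(title: str) -> str:
    return " ".join(re.sub(r"[^\w\s]|_", " ", title.lower()).split())


@dataclass
class Paper:  # hypothetical minimal model, not the real domain/paper.py
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None


class PaperDeduplicator:
    def __init__(self, threshold: float = 85.0) -> None:
        self.threshold = threshold
        self.unique: List[Paper] = []
        self.duplicates_removed = 0
        self._by_doi: Dict[str, Paper] = {}
        self._by_arxiv: Dict[str, Paper] = {}

    def _find(self, paper: Paper) -> Optional[Paper]:
        # Tier 1: exact DOI match (DOIs are case-insensitive).
        if paper.doi and paper.doi.lower() in self._by_doi:
            return self._by_doi[paper.doi.lower()]
        # Tier 2: exact arxiv_id match.
        if paper.arxiv_id and paper.arxiv_id in self._by_arxiv:
            return self._by_arxiv[paper.arxiv_id]
        # Tier 3: fuzzy title match against every paper kept so far.
        title = normalize_title(paper.title)
        if title:
            for kept in self.unique:
                if title_similarity(title, normalize_title(kept.title)) >= self.threshold:
                    return kept
        return None

    def add(self, paper: Paper) -> bool:
        """Return True if the paper is new, False if it was merged."""
        match = self._find(paper)
        is_new = match is None
        if match is None:
            self.unique.append(paper)
            match = paper
        else:
            # Merge IDs so later records can match on either identifier.
            match.doi = match.doi or paper.doi
            match.arxiv_id = match.arxiv_id or paper.arxiv_id
            self.duplicates_removed += 1
        for doi in (match.doi, paper.doi):
            if doi:
                self._by_doi[doi.lower()] = match
        for arxiv_id in (match.arxiv_id, paper.arxiv_id):
            if arxiv_id:
                self._by_arxiv[arxiv_id] = match
        return is_new
```

Merging identifiers on a fuzzy hit is what lets a DOI-only record from S2 and an arxiv_id-only record from arXiv collapse into one entry. The title tier compares against every kept paper, which is quadratic but should be fine at the scale of the 120-paper stress test.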
Files affected
| File | Change |
|---|---|
| application/services/paper_search_service.py | Add PaperDeduplicator, replacing _paper_key |
| domain/paper.py | Fix _compute_title_hash normalization |
| pyproject.toml / requirements.txt | Add rapidfuzz |
| tests/unit/test_paper_dedup.py | New file |
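Acceptance criteria
- Stress-test deduplication rate > 30%
- The rapidfuzz threshold of 0.85 (85 on fuzz.ratio's 0-100 scale) causes no false merges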
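As a rough check on the first criterion, assuming the deduplication rate is defined as duplicates_removed divided by the number of papers returned (the issue does not define it), the 120-paper stress test would need more than 36 removals. The helper name `dedup_rate` is hypothetical.

```python
def dedup_rate(total_returned: int, duplicates_removed: int) -> float:
    """Fraction of returned papers that were dropped as duplicates (assumed definition)."""
    if total_returned == 0:
        return 0.0
    return duplicates_removed / total_returned


# Current behaviour from the stress test: nothing is removed.
current = dedup_rate(120, 0)

# Smallest removal count that clears the 30% bar for 120 papers.
minimum_passing = next(n for n in range(121) if dedup_rate(120, n) > 0.30)
```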