Open
Labels
enhancement (New feature or request), search (Search infrastructure)
Description
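Problem
Deduplication keeps only the best-quality copy and does not merge in complementary information from the other sources. For example:
- Semantic Scholar (S2) has citation_count but no keywords
- OpenAlex has keywords, but its abstracts are low quality (reconstructed from an inverted index)
- arXiv has the best PDF URL but no citation count
Proposal
Create paper_merger.py and define a per-field source priority: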
    FIELD_SOURCE_PRIORITY = {
        "abstract": ["europepmc", "semantic_scholar", "openalex", "arxiv"],
        "citation_count": ["openalex", "crossref", "semantic_scholar"],
        "venue": ["dblp", "openalex", "semantic_scholar", "crossref"],
        "pdf_url": ["arxiv", "openalex", "semantic_scholar"],
        "keywords": ["openalex", "hf_daily", "papers_cool"],
    }

Files involved
| File | Change |
|---|---|
| application/services/paper_merger.py | New |
| application/services/paper_search_service.py | Call merge during deduplication |
| tests/unit/test_paper_merger.py | New |
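
The merge step that the deduplication path would call could be sketched roughly as follows. This is only an illustration under stated assumptions: papers are represented as plain dicts, each copy carries a hypothetical `source` key, and the names `merge_duplicates` and `_is_empty` are invented here; the real paper model and function names in the repository may differ.

```python
from __future__ import annotations

from typing import Any

# Priority table copied from the proposal in this issue.
FIELD_SOURCE_PRIORITY = {
    "abstract": ["europepmc", "semantic_scholar", "openalex", "arxiv"],
    "citation_count": ["openalex", "crossref", "semantic_scholar"],
    "venue": ["dblp", "openalex", "semantic_scholar", "crossref"],
    "pdf_url": ["arxiv", "openalex", "semantic_scholar"],
    "keywords": ["openalex", "hf_daily", "papers_cool"],
}


def _is_empty(value: Any) -> bool:
    """Treat None, empty strings and empty lists as missing; 0 is a real value."""
    return value is None or value == "" or value == []


def merge_duplicates(copies: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge duplicate records of one paper (hypothetical helper).

    `copies` is assumed to be ordered best-first, so copies[0] is the record
    that the current deduplication logic would have kept on its own.
    """
    if not copies:
        raise ValueError("need at least one copy to merge")

    merged = dict(copies[0])

    # Index copies by their (assumed) "source" key; the first copy per source wins.
    by_source: dict[Any, dict[str, Any]] = {}
    for copy in copies:
        by_source.setdefault(copy.get("source"), copy)

    # For prioritized fields, take the value from the highest-priority
    # source that actually has one.
    for field, sources in FIELD_SOURCE_PRIORITY.items():
        for source in sources:
            record = by_source.get(source)
            if record is not None and not _is_empty(record.get(field)):
                merged[field] = record[field]
                break

    # For everything still missing, fill gaps from the remaining copies.
    for copy in copies[1:]:
        for key, value in copy.items():
            if _is_empty(merged.get(key)) and not _is_empty(value):
                merged[key] = value

    return merged
```

If no source in a field's priority list has a value, the sketch keeps whatever the best copy had and then falls back to any other copy, so a listed priority never causes data to be dropped.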
Acceptance criteria
- Field completeness after merging > 90%
- citation_count prefers OpenAlex; PDF URL prefers arXiv
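
The issue does not define how field completeness is measured. One possible reading, sketched below purely as an assumption, is the fraction of tracked fields that are non-empty, averaged over all merged papers. `TRACKED_FIELDS`, `field_completeness` and `average_completeness` are hypothetical names, and the tracked fields are assumed to be the five keys of the priority table.

```python
from __future__ import annotations

from typing import Any

# Assumption: completeness is measured over the fields in the priority table.
TRACKED_FIELDS = ("abstract", "citation_count", "venue", "pdf_url", "keywords")


def field_completeness(paper: dict[str, Any],
                       fields: tuple[str, ...] = TRACKED_FIELDS) -> float:
    """Fraction of `fields` that hold a non-empty value in `paper`."""
    # None, "" and [] count as missing; a citation_count of 0 counts as filled.
    filled = sum(1 for f in fields if paper.get(f) not in (None, "", []))
    return filled / len(fields)


def average_completeness(papers: list[dict[str, Any]]) -> float:
    """Mean completeness over a batch of merged papers (0.0 for an empty batch)."""
    if not papers:
        return 0.0
    return sum(field_completeness(p) for p in papers) / len(papers)
```

A unit test in tests/unit/test_paper_merger.py could then assert that `average_completeness` over a fixture of merged papers exceeds 0.9, if this definition matches what the criterion intends.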