Field-priority merge: keep the best metadata from multiple sources #319

@jerry609

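Problem

Deduplication currently keeps only the highest-quality copy of a paper and does not merge in the complementary information that other sources carry. For example:

  • S2 has citation_count but no keywords
  • OpenAlex has keywords, but its abstracts are low quality (rebuilt from an inverted index)
  • arXiv has the best PDF URL but no citation count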
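To make the loss concrete, here is a small illustration of "keep the best copy" deduplication. The record layout, the `filled_fields` scoring rule, and the placeholder URL are assumptions made for this sketch; the issue does not say how the current code picks the best copy.

```python
# Hypothetical records for one paper, as three sources might return it.
copies = [
    {"source": "semantic_scholar", "citation_count": 120, "keywords": None, "pdf_url": None},
    {"source": "openalex", "citation_count": 118, "keywords": ["retrieval"], "pdf_url": None},
    {"source": "arxiv", "citation_count": None, "keywords": None,
     "pdf_url": "https://example.org/paper.pdf"},
]

EMPTY = (None, "", [])


def filled_fields(record):
    """Assumed quality score: number of non-empty fields, ignoring 'source'."""
    return sum(1 for key, value in record.items() if key != "source" and value not in EMPTY)


def keep_best_copy(records):
    """Pick one copy and discard the rest, with no merging."""
    return max(records, key=filled_fields)


best = keep_best_copy(copies)

# Fields that some copy had but the surviving copy does not.
lost = {
    key
    for record in copies
    for key, value in record.items()
    if value not in EMPTY and best.get(key) in EMPTY
}
print(best["source"], lost)
```

In this toy case the OpenAlex copy wins, and the PDF URL that only arXiv supplied is dropped.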

Proposal

Create paper_merger.py and define a per-field source priority:

FIELD_SOURCE_PRIORITY = {
    "abstract":        ["europepmc", "semantic_scholar", "openalex", "arxiv"],
    "citation_count":  ["openalex", "crossref", "semantic_scholar"],
    "venue":           ["dblp", "openalex", "semantic_scholar", "crossref"],
    "pdf_url":         ["arxiv", "openalex", "semantic_scholar"],
    "keywords":        ["openalex", "hf_daily", "papers_cool"],
}
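One way the priority table could drive a merge is sketched below. The function names `merge_papers` and `is_empty`, the dict-based record shape with a `"source"` key, and the fallback to unlisted sources in input order are all assumptions for illustration, not the planned implementation.

```python
from typing import Any

FIELD_SOURCE_PRIORITY = {
    "abstract":        ["europepmc", "semantic_scholar", "openalex", "arxiv"],
    "citation_count":  ["openalex", "crossref", "semantic_scholar"],
    "venue":           ["dblp", "openalex", "semantic_scholar", "crossref"],
    "pdf_url":         ["arxiv", "openalex", "semantic_scholar"],
    "keywords":        ["openalex", "hf_daily", "papers_cool"],
}


def is_empty(value: Any) -> bool:
    """Treat None, empty strings and empty lists as missing; 0 counts as a value."""
    return value is None or (isinstance(value, (str, list)) and len(value) == 0)


def merge_papers(copies: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge duplicate copies of one paper field by field.

    Assumes each copy is a dict with a 'source' key and at most one copy
    per source (a later copy from the same source overrides an earlier one).
    """
    by_source = {copy["source"]: copy for copy in copies}
    fields = {key for copy in copies for key in copy if key != "source"}
    merged: dict[str, Any] = {}
    for field in sorted(fields):
        preferred = FIELD_SOURCE_PRIORITY.get(field, [])
        # Listed sources first, then any remaining sources in input order.
        ranked = [s for s in preferred if s in by_source]
        ranked += [c["source"] for c in copies if c["source"] not in preferred]
        merged[field] = None
        for source in ranked:
            value = by_source[source].get(field)
            if not is_empty(value):
                merged[field] = value
                break
    return merged
```

With an S2 copy carrying `citation_count=120`, an OpenAlex copy carrying `citation_count=118` and keywords, and an arXiv copy carrying only a PDF URL, this sketch would take the citation count and keywords from OpenAlex and the PDF URL from arXiv.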

Files affected

File                                            Change
application/services/paper_merger.py            New
application/services/paper_search_service.py    Call merge during deduplication
tests/unit/test_paper_merger.py                 New
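The change to the search service might look roughly like the sketch below: group duplicates, then hand each group to the merger instead of picking one copy. The names `dedup_key` and `deduplicate`, the DOI-then-title grouping rule, and passing the merge function as a parameter are assumptions for this example; the existing deduplication logic in paper_search_service.py is not shown in this issue.

```python
from collections import defaultdict
from typing import Any, Callable

Paper = dict[str, Any]


def dedup_key(paper: Paper) -> str:
    """Assumed grouping rule: normalized DOI if present, else normalized title."""
    doi = (paper.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = "".join(ch for ch in (paper.get("title") or "").lower() if ch.isalnum())
    return "title:" + title


def deduplicate(papers: list[Paper], merge: Callable[[list[Paper]], Paper]) -> list[Paper]:
    """Group duplicate copies and merge each group instead of keeping one copy."""
    groups: dict[str, list[Paper]] = defaultdict(list)
    for paper in papers:
        groups[dedup_key(paper)].append(paper)
    # Groups come out in first-seen order; singletons pass through unchanged.
    return [merge(group) if len(group) > 1 else group[0] for group in groups.values()]
```

Passing `merge` in as an argument keeps the grouping step testable on its own and lets tests/unit/test_paper_merger.py exercise the merger separately.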

Acceptance criteria

  • Field completeness after merging > 90%
  • citation_count prefers OpenAlex; PDF URL prefers arXiv
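The issue does not define "field completeness", so the following is only one possible reading: the share of tracked fields that are non-empty across all merged records. `TRACKED_FIELDS` and `field_completeness` are hypothetical names chosen for this sketch.

```python
# Assumed set of fields to measure: the ones listed in the priority table.
TRACKED_FIELDS = ("abstract", "citation_count", "venue", "pdf_url", "keywords")


def field_completeness(papers, fields=TRACKED_FIELDS):
    """Fraction of (paper, field) slots that hold a non-empty value."""
    if not papers:
        return 0.0
    filled = sum(
        1
        for paper in papers
        for field in fields
        if paper.get(field) not in (None, "", [])
    )
    return filled / (len(papers) * len(fields))


def meets_acceptance(papers, threshold=0.90):
    """True when completeness is strictly above the threshold."""
    return field_completeness(papers) > threshold
```

A unit test could run the merger over a fixed fixture of multi-source duplicates and assert `meets_acceptance(merged_papers)`, alongside direct checks that `citation_count` comes from the OpenAlex copy and `pdf_url` from the arXiv copy.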

Labels

enhancement (New feature or request); search (Search infrastructure)
