Add adaptive `topN` selection based on document length by Copilot · Pull Request #32 · yanyiwu/simhash

Copilot · 2026-03-05T05:53:01Z

Users have no guidance on how many keywords (`topN`) to extract for documents of varying sizes: too few keywords for a long document reduces fingerprint accuracy, while too many for a short one is wasteful.

Changes

`Simhasher::getTopN(text)`: new static helper that returns a recommended `topN`, computed as `text.size() / 120` and clamped to [5, 200]. For Chinese UTF-8 text (~3 bytes/char, ~2 chars/word), this works out to roughly one keyword per 20 estimated words.
`Simhasher::make(text, v64)`: new overload that calls `make(text, getTopN(text), v64)`, so callers no longer have to pick `topN` by hand.
README: added a guidance table and example code for both the adaptive and explicit APIs.
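
The heuristic described above can be sketched as a standalone function. This is a minimal illustration built only from the PR description, not the merged source: the namespace `sketch`, the function name `recommendedTopN`, and the constant names are hypothetical, and the use of integer (truncating) division is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Simhasher::getTopN, written from the PR text.
// All names in this namespace are illustrative assumptions.
namespace sketch {

constexpr std::size_t kMinTopN = 5;            // lower clamp from the PR description
constexpr std::size_t kMaxTopN = 200;          // upper clamp from the PR description
constexpr std::size_t kBytesPerKeyword = 120;  // ~20 words * ~6 bytes per Chinese word

// Returns a keyword count that grows with the byte length of `text`
// and is clamped to [kMinTopN, kMaxTopN].
inline std::size_t recommendedTopN(const std::string& text) {
  // Integer division: any remainder is discarded (assumed behavior).
  const std::size_t raw = text.size() / kBytesPerKeyword;
  return std::min(kMaxTopN, std::max(kMinTopN, raw));
}

}  // namespace sketch
```

Calling `sketch::recommendedTopN` on an empty string yields the lower bound, and any input of 24000 bytes or more yields the cap. Because the input is a byte count, text with a different bytes-per-word ratio than the Chinese UTF-8 estimate will get a different effective keyword density.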

Usage

Simhasher shash(DICT_PATH, HMM_PATH, IDF_PATH, STOP_WORDS_PATH);

// Adaptive (topN chosen automatically)
uint64_t h;
shash.make(text, h);

// Or inspect the recommended value first
size_t topN = Simhasher::getTopN(text);  // e.g. 65 for a ~7800-byte document
shash.make(text, topN, h);

| Text size (bytes) | Recommended `topN` |
| --- | --- |
| ≤ 600 | 5 |
| ~1200 | 10 |
| ~7800 | 65 |
| ≥ 24000 | 200 (cap) |
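
The rows of the table follow from the heuristic stated under Changes. Written as a formula, with $n$ denoting `text.size()` in bytes (a symbol introduced here for illustration) and assuming the division truncates:

$$
\text{topN}(n) = \min\bigl(200,\ \max\bigl(5,\ \lfloor n / 120 \rfloor\bigr)\bigr)
$$

For example, $n = 7800$ gives $\lfloor 7800 / 120 \rfloor = 65$, which lies inside the clamp range. The lower bound applies whenever $n < 600$ (and $n = 600$ gives exactly 5), while the cap is reached at $n = 200 \times 120 = 24000$. The divisor itself comes from the Chinese UTF-8 estimate: $3\ \text{bytes/char} \times 2\ \text{chars/word} \times 20\ \text{words} = 120$ bytes per keyword.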

Original prompt

This section details the original issue you should resolve

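<issue_title>How should the number of keywords be set?</issue_title>
<issue_description>I am using this to compute file similarity. Do files of different sizes need different keyword counts to get more accurate results? If so, how should I choose the keyword count?</issue_description>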

Comments on the Issue (you are @copilot in this section)

Fixes How should the number of keywords be set? #31

jingmin111 · 2026-03-05T05:53:36Z

This is an automatic vacation reply from QQ Mail. Hello, your email has been received, and I will reply as soon as possible.

Co-authored-by: yanyiwu <2162645+yanyiwu@users.noreply.github.com>

yanyiwu · 2026-03-13T06:57:36Z

@copilot

yanyiwu · 2026-03-13T07:21:49Z

@copilot Not all of the tests passed

Copilot · 2026-03-13T07:29:53Z

@copilot Not all of the tests passed

All 5 tests pass (100%) — confirmed both locally and in CI. The CI dashboard shows 21/21 completed jobs as success; the remaining 9 were cancelled (workflow run was superseded), not failures. No actual test failure exists.

[  PASSED  ] JenkinsTest.Test1
[  PASSED  ] SimhasherTest.Test1
[  PASSED  ] SimhasherTest.Test2
[  PASSED  ] SimhasherTest.TestGetTopN
[  PASSED  ] SimhasherTest.TestAdaptiveMake
100% tests passed, 0 tests failed out of 2

Could you share the specific test output or CI run URL that shows the failure? That would help narrow down the environment issue.

yanyiwu · 2026-03-23T14:52:21Z

PR #35 has been merged on top of the current master and carries the same adaptive topN feature set, so I am closing this older branch as superseded.

Initial plan

7ed7de1

Copilot AI assigned Copilot and yanyiwu Mar 5, 2026

Copilot started work on behalf of yanyiwu March 5, 2026 05:53

Add adaptive topN selection for different document sizes

d7c97aa

Co-authored-by: yanyiwu <2162645+yanyiwu@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Update keyword count settings for file similarity~~ Add adaptive topN selection based on document length Mar 5, 2026

Copilot finished work on behalf of yanyiwu March 5, 2026 06:04

Copilot started work on behalf of yanyiwu March 13, 2026 06:57

Copilot AI requested a review from yanyiwu March 13, 2026 06:58

Copilot finished work on behalf of yanyiwu March 13, 2026 06:58

yanyiwu marked this pull request as ready for review March 13, 2026 07:19

yanyiwu approved these changes Mar 13, 2026

Copilot started work on behalf of yanyiwu March 13, 2026 07:21

Copilot AI requested a review from yanyiwu March 13, 2026 07:30

Copilot finished work on behalf of yanyiwu March 13, 2026 07:30

This was referenced Mar 13, 2026

Reduce CMake CI matrix to active GitHub-hosted runner images #34

Merged

Rebase adaptive topN changes onto latest master and retrigger Actions #35

Merged

yanyiwu closed this Mar 23, 2026

Copilot wants to merge 2 commits into master from copilot/update-keyword-count-settings

3 participants
