
Add adaptive topN selection based on document length #32

Closed
Copilot wants to merge 2 commits into master from copilot/update-keyword-count-settings

Contributor

Copilot AI commented Mar 5, 2026

Users have no guidance on how many keywords (topN) to extract for documents of varying sizes — too few keywords for long documents reduces fingerprint accuracy; too many for short ones is wasteful.

Changes

  • Simhasher::getTopN(text) — new static helper that returns a recommended topN clamped to [5, 200] using the heuristic text.size() / 120. For Chinese UTF-8 (~3 bytes/char, ~2 chars/word), this scales to roughly one keyword per 20 estimated words.

  • Simhasher::make(text, v64) — new overload that calls make(text, getTopN(text), v64), removing the need to manually pick topN.

  • README — added guidance table and example code for both the adaptive and explicit APIs.
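
The heuristic described above can be sketched as a small free function. This is a minimal illustration, not the code from the PR: the name `recommendedTopN` and the constant names are hypothetical, and only the divisor 120 and the bounds [5, 200] come from the description. It assumes `text.size() / 120` is plain integer division on the byte length.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Simhasher::getTopN. The constants are taken from
// the PR description; the function name and layout are illustrative only.
inline std::size_t recommendedTopN(const std::string& text) {
  const std::size_t kBytesPerKeyword = 120;  // ~20 Chinese words at ~6 bytes each
  const std::size_t kMinTopN = 5;            // floor for very short documents
  const std::size_t kMaxTopN = 200;          // cap for very long documents

  // Integer division: the byte length, not the character count, is used.
  const std::size_t estimate = text.size() / kBytesPerKeyword;

  // Clamp the estimate to [kMinTopN, kMaxTopN].
  return std::min(kMaxTopN, std::max(kMinTopN, estimate));
}
```

Under these assumptions, the adaptive `make(text, v64)` overload would amount to forwarding a value like this as the `topN` argument of the explicit `make(text, topN, v64)`.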

Usage

Simhasher shash(DICT_PATH, HMM_PATH, IDF_PATH, STOP_WORDS_PATH);

// Adaptive (topN chosen automatically)
uint64_t h;
shash.make(text, h);

// Or inspect the recommended value first
size_t topN = Simhasher::getTopN(text);  // e.g. 65 for a ~7800-byte document
shash.make(text, topN, h);

| Text size (bytes) | Recommended topN |
| --- | --- |
| ≤ 600 | 5 |
| ~1 200 | 10 |
| ~7 800 | 65 |
| ≥ 24 000 | 200 (cap) |
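
The table rows follow from the heuristic stated under Changes. As a worked check, writing n for the text size in bytes and assuming the division is integer (floor) division, which the description does not state explicitly:

```latex
\[
  \mathrm{topN}(n) = \min\bigl(200,\ \max\bigl(5,\ \lfloor n / 120 \rfloor\bigr)\bigr)
\]
\[
  \mathrm{topN}(600) = 5, \qquad
  \mathrm{topN}(1200) = 10, \qquad
  \mathrm{topN}(7800) = 65, \qquad
  \mathrm{topN}(24000) = 200
\]
\[
  \frac{120\ \text{bytes per keyword}}
       {3\ \text{bytes per char} \times 2\ \text{chars per word}}
  = 20\ \text{words per keyword}
\]
```

The last line is the source of the "one keyword per 20 estimated words" figure for Chinese UTF-8 text; for text with a different bytes-per-word ratio the same divisor would correspond to a different keyword density.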
Original prompt

This section details the original issue you should resolve

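<issue_title>How should the number of keywords be set?</issue_title>
<issue_description>I am using this to measure file similarity. Do files of different sizes need different keyword counts for the result to be more accurate? If so, how should the keyword count be chosen?</issue_description>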

Comments on the Issue (you are @copilot in this section)




jingmin111 commented Mar 5, 2026 via email

Co-authored-by: yanyiwu <2162645+yanyiwu@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update keyword count settings for file similarity" to "Add adaptive topN selection based on document length" Mar 5, 2026
Owner

yanyiwu commented Mar 13, 2026

@copilot

Copilot AI requested a review from yanyiwu March 13, 2026 06:58
@yanyiwu yanyiwu marked this pull request as ready for review March 13, 2026 07:19
Owner

yanyiwu commented Mar 13, 2026

@copilot Not all of the tests passed

Contributor Author

Copilot AI commented Mar 13, 2026

@copilot Not all of the tests passed

All 5 tests pass (100%) — confirmed both locally and in CI. The CI dashboard shows 21/21 completed jobs as success; the remaining 9 were cancelled (workflow run was superseded), not failures. No actual test failure exists.

[  PASSED  ] JenkinsTest.Test1
[  PASSED  ] SimhasherTest.Test1
[  PASSED  ] SimhasherTest.Test2
[  PASSED  ] SimhasherTest.TestGetTopN
[  PASSED  ] SimhasherTest.TestAdaptiveMake
100% tests passed, 0 tests failed out of 2

Could you share the specific test output or CI run URL that shows the failure? That would help narrow down the environment issue.

Owner

yanyiwu commented Mar 23, 2026

PR #35 has been merged on top of the current master and carries the same adaptive topN feature set, so I am closing this older branch as superseded.

@yanyiwu yanyiwu closed this Mar 23, 2026