Add tokenizer mode support for message search #33048
Draft
shinaoka wants to merge 4 commits into element-hq:develop
Add user-facing setting to choose tokenizer mode (language-based vs N-gram) for local message search. N-gram tokenization supports CJK languages and mixed-language content. Includes confirmation dialog for mode changes (requires reindex) and automatic checkpoint recovery.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
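To illustrate why N-gram tokenization helps with CJK text, here is a minimal sketch. It is not Seshat's tokenizer: `charNgrams` and `whitespaceTokens` are hypothetical helpers, and the whitespace splitter is only a rough stand-in for a language-based tokenizer that relies on word boundaries.

```typescript
// Illustrative only: a naive character n-gram splitter, NOT Seshat's implementation.
function charNgrams(text: string, minGram = 2, maxGram = 2): string[] {
    // Array.from iterates by code point, so surrogate pairs are not split.
    const chars = Array.from(text);
    const grams: string[] = [];
    for (let start = 0; start < chars.length; start++) {
        for (let n = minGram; n <= maxGram && start + n <= chars.length; n++) {
            grams.push(chars.slice(start, start + n).join(""));
        }
    }
    return grams;
}

// Rough stand-in for a tokenizer that depends on whitespace word boundaries.
function whitespaceTokens(text: string): string[] {
    return text.split(/\s+/).filter((token) => token.length > 0);
}

// A Japanese sentence has no spaces, so it stays a single token here,
// while the bigram list contains substrings a user might search for.
const sentence = "東京都に行く";
console.log(whitespaceTokens(sentence));
console.log(charNgrams(sentence));
```

The trade-off in a sketch like this is index size: every position emits one token per gram length, so an N-gram index is typically larger than a word-based one.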
Add N-gram tokenizer support to Seshat search index initialization. Refactors initEventIndex into a testable module with dependency injection, supporting both language-based and N-gram tokenization modes. Includes automatic database recreation fallback and unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, any Seshat open failure that wasn't a ReindexError would silently delete the EventStore and recreate it, losing the user's search index on transient errors like filesystem locks or passphrase issues. Restore the original behavior: only handle ReindexError with recovery logic, and propagate all other errors to the caller (which shows an error to the user via sendError).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
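The error-handling rule described in the commits above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `InitDeps`, `initIndex`, and `recoverFromReindex` are hypothetical names, and the local `ReindexError` class only stands in for the error type Seshat raises.

```typescript
type TokenizerMode = "language" | "ngram";

// Local stand-in for Seshat's reindex error (assumption for this sketch).
class ReindexError extends Error {
    constructor(message?: string) {
        super(message);
        // Keeps instanceof working even when compiled to older targets.
        Object.setPrototypeOf(this, ReindexError.prototype);
        this.name = "ReindexError";
    }
}

// Injected dependencies, so the logic can be unit tested without a real index.
interface InitDeps<Index> {
    openIndex(mode: TokenizerMode): Promise<Index>;
    recoverFromReindex(mode: TokenizerMode): Promise<void>;
}

async function initIndex<Index>(deps: InitDeps<Index>, mode: TokenizerMode): Promise<Index> {
    try {
        return await deps.openIndex(mode);
    } catch (e) {
        // Only a ReindexError triggers recovery. Anything else (filesystem
        // locks, passphrase problems, ...) is rethrown so the caller can
        // report it instead of the index being silently deleted.
        if (!(e instanceof ReindexError)) throw e;
        await deps.recoverFromReindex(mode);
        return deps.openIndex(mode);
    }
}
```

Passing the open and recovery steps in as functions is what makes the two branches easy to exercise in a unit test with fakes.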
Seshat has released v4.2.0. Perhaps now the dependency can be updated, and this pull request can proceed to be reviewed by the team. Great appreciation from a CJK language user.
Summary
This PR ports the tokenizer-mode work to the monorepo layout and combines the previous web and desktop-side changes into a single branch.
It adds a user-facing tokenizer mode setting for local message search, passes the selected mode through the desktop layer to Seshat, and asks for confirmation before reindexing when the mode changes.
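As a rough sketch of the confirm-before-reindex flow described above (not the actual dialog code in this PR), the mode change could be modeled as below. `ModeChangeDeps`, `changeTokenizerMode`, and the result strings are hypothetical names chosen for illustration.

```typescript
type TokenizerMode = "language" | "ngram";

// Hypothetical hooks for settings storage, the confirmation dialog, and reindexing.
interface ModeChangeDeps {
    getMode(): TokenizerMode;
    setMode(mode: TokenizerMode): Promise<void>;
    confirmReindex(): Promise<boolean>;
    reindex(): Promise<void>;
}

type ModeChangeResult = "unchanged" | "cancelled" | "changed";

async function changeTokenizerMode(deps: ModeChangeDeps, next: TokenizerMode): Promise<ModeChangeResult> {
    // Selecting the current mode needs no confirmation and no reindex.
    if (deps.getMode() === next) return "unchanged";
    // Switching modes invalidates the existing index, so ask first.
    if (!(await deps.confirmReindex())) return "cancelled";
    await deps.setMode(next);
    await deps.reindex();
    return "changed";
}
```

In this sketch the setting is only written after the user confirms, so cancelling leaves both the stored mode and the existing index untouched.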
Related
Merge Status
matrix-org/seshat#150 is merged, but matrix-org/seshat#157 is still open and matrix-seshat@4.1.0 has not been released yet.
This PR should stay draft for now. Once matrix-seshat@4.1.0 is released, the dependency can be updated and this PR should be ready to merge.
Testing
pnpm --dir apps/desktop test:unit
pnpm --dir apps/web test -- --runTestsByPath test/unit-tests/async-components/dialogs/eventindex/ManageEventIndexDialog-test.tsx