Add tokenizer mode support for message search#33048

Draft
shinaoka wants to merge 4 commits into element-hq:develop from shinaoka:shinaoka/tokenizer-mode

shinaoka (Contributor) commented Apr 6, 2026

Summary

This PR ports the tokenizer-mode work to the monorepo layout and combines the previous web and desktop-side changes into a single branch.

It adds a user-facing tokenizer mode setting for local message search, passes the selected mode through the desktop layer to Seshat, and asks for confirmation before reindexing when the mode changes.
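
The confirm-before-reindex behaviour described above could be sketched roughly as follows. This is only an illustration of the decision logic, not code from this branch: `TokenizerMode`, `ModeChangeResult`, `resolveModeChange` and the `confirm` callback are all hypothetical names, and the callback is synchronous here purely to keep the sketch short (a real dialog would presumably be asynchronous).

```typescript
// Illustrative sketch only: every name here is hypothetical, not taken from the PR's diff.
type TokenizerMode = "language" | "ngram";

interface ModeChangeResult {
    mode: TokenizerMode; // mode in effect after the request is handled
    reindex: boolean; // whether the local index has to be rebuilt
}

function resolveModeChange(
    current: TokenizerMode,
    requested: TokenizerMode,
    confirm: () => boolean, // stand-in for the confirmation dialog
): ModeChangeResult {
    // Selecting the mode that is already active never prompts or reindexes.
    if (requested === current) {
        return { mode: current, reindex: false };
    }
    // A real change needs a reindex, so ask first and keep the old mode if declined.
    if (!confirm()) {
        return { mode: current, reindex: false };
    }
    return { mode: requested, reindex: true };
}
```

With this shape the prompt is only shown when the mode actually changes, and declining leaves both the setting and the existing index untouched.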

Related

Merge Status

matrix-org/seshat#150 is merged, but matrix-org/seshat#157 is still open and matrix-seshat@4.1.0 has not been released yet.

This PR should stay draft for now. Once matrix-seshat@4.1.0 is released, the dependency can be updated and this PR should be ready to merge.

Testing

  • pnpm --dir apps/desktop test:unit
  • pnpm --dir apps/web test -- --runTestsByPath test/unit-tests/async-components/dialogs/eventindex/ManageEventIndexDialog-test.tsx

shinaoka and others added 4 commits March 30, 2026 20:26
Add user-facing setting to choose tokenizer mode (language-based vs N-gram)
for local message search. N-gram tokenization supports CJK languages and
mixed-language content. Includes confirmation dialog for mode changes
(requires reindex) and automatic checkpoint recovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add N-gram tokenizer support to Seshat search index initialization.
Refactors initEventIndex into a testable module with dependency injection,
supporting both language-based and N-gram tokenization modes. Includes
automatic database recreation fallback and unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
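
For readers unfamiliar with why N-gram tokenization helps CJK search, here is a simplified illustration. It is not Seshat's tokenizer: `whitespaceTokens` and `ngramTokens` are hypothetical helpers, the bigram size is an assumption, and word-based tokenization is approximated by splitting on whitespace.

```typescript
// Simplified illustration; not Seshat's implementation. Helper names are hypothetical.

// Word-style tokenization, approximated here by splitting on whitespace.
function whitespaceTokens(text: string): string[] {
    return text.split(/\s+/).filter((t) => t.length > 0);
}

// Overlapping character n-grams (bigrams by default).
// Note: split("") works on UTF-16 code units, which is enough for BMP
// characters such as the example below but not for astral code points.
function ngramTokens(text: string, n: number = 2): string[] {
    const chars = text.split("");
    const grams: string[] = [];
    for (let i = 0; i + n <= chars.length; i++) {
        grams.push(chars.slice(i, i + n).join(""));
    }
    return grams;
}

const sentence = "今日は良い天気"; // no spaces between words
const wordTokens = whitespaceTokens(sentence); // a single token: the whole sentence
const bigrams = ngramTokens(sentence); // six overlapping two-character tokens
```

Because the sentence has no spaces, the whitespace split yields one token, so a query for 天気 has no matching token. The bigram list does contain 天気, which is why an N-gram index can find it.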

Previously, any Seshat open failure that wasn't a ReindexError would
silently delete the EventStore and recreate it, losing the user's search
index on transient errors like filesystem locks or passphrase issues.

Restore the original behavior: only handle ReindexError with recovery
logic, and propagate all other errors to the caller (which shows an
error to the user via sendError).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
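
The error-handling rule from the last commit message, combined with the dependency-injection approach mentioned earlier, might look something like the sketch below. Everything in it is an assumption made for illustration: the local `ReindexError` class is a stand-in for the real one, `IndexDeps` and `initIndex` are hypothetical names, and the calls are synchronous only for brevity.

```typescript
// Hedged sketch; names and shapes are hypothetical, not the PR's actual module.
type TokenizerMode = "language" | "ngram";

// Local stand-in for the error type the commit message refers to.
class ReindexError extends Error {
    constructor(message: string) {
        super(message);
        this.name = "ReindexError";
    }
}

// Injected dependencies, so tests can substitute fakes for the real index calls.
interface IndexDeps<T> {
    open: (mode: TokenizerMode) => T; // may throw
    recover: (mode: TokenizerMode) => T; // runs the reindex recovery path
}

function initIndex<T>(mode: TokenizerMode, deps: IndexDeps<T>): T {
    try {
        return deps.open(mode);
    } catch (e) {
        // Only the reindex case is recovered. The check is by name here so the
        // sketch does not depend on how the real error class is exported.
        if (e instanceof Error && e.name === "ReindexError") {
            return deps.recover(mode);
        }
        // Anything else (for example a lock or passphrase problem) is rethrown
        // so the caller can report it instead of the index being deleted.
        throw e;
    }
}
```

Passing `open` and `recover` in as arguments is what makes the three paths (success, reindex recovery, propagated failure) easy to cover in unit tests.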
github-actions bot added the Z-Community-PR label (Issue is solved by a community member's PR) Apr 6, 2026

aik2mlj commented Apr 14, 2026

Seshat has released v4.2.0. Perhaps now the dependency can be updated, and this pull request can proceed to be reviewed by the team. Great appreciation from a CJK language user.

t3chguy (Member) commented Apr 16, 2026

#33168
