Conversation
…diff highlights Previously splitWords only split on whitespace, so a trailing comma or prefix change caused the entire whitespace-delimited token to highlight. Now word characters (\w+), whitespace (\s+), and individual symbols are separate tokens, so only the actually changed parts get highlighted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
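The token scheme this commit describes (word runs, whitespace runs, and individual symbols) can be sketched as follows. This is my reconstruction from the commit message, not the committed code, which later switched to Unicode property classes:

```typescript
// Reconstruction of the tokenizer described above: \w+ runs, \s+ runs,
// and every remaining symbol as its own single-character token.
function splitWords(text: string): string[] {
  return text.match(/\w+|\s+|[^\w\s]/g) ?? [];
}

// A trailing comma is now its own token, so only the comma highlights:
console.log(splitWords("foo, bar")); // ["foo", ",", " ", "bar"]
```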
…oji tokenization
The splitWords regex now uses Unicode property classes (\p{L}, \p{N}) with the `gu` flags instead of \w, so accented characters (e.g. café) are kept as whole word tokens and emoji (e.g. 🎉) are treated as single code points rather than being split into surrogate pairs. Adds test cases for both scenarios.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
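The behavior this commit describes can be demonstrated with the regex from the diff below (the `u` flag makes the negated character class match whole code points, so astral-plane emoji stay intact):

```typescript
// Unicode-aware tokenizer, regex taken from the PR diff:
// \p{L} = letters, \p{N} = digits; `gu` flags enable property escapes
// and code-point (not code-unit) matching.
function splitWords(text: string): string[] {
  return text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
}

// "café" stays one word token; "🎉" is one token, not two surrogate halves.
console.log(splitWords("café 🎉!")); // ["café", " ", "🎉", "!"]
```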
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65aefeac41
```diff
 function splitWords(text: string): string[] {
-  return text.split(/(\s+)/);
+  return text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
 }
```
Cap tokenization granularity before LCS diffing
computeCharHighlights passes splitWords output into lcsIndices, which allocates a full (m+1)*(n+1) dynamic-programming table. With this regex, punctuation-heavy lines now produce one token per symbol, so a single long modified line (such as minified JSON/lockfile content) can jump from ~1 token to thousands and create tens of millions of DP cells, causing major UI stalls or memory spikes. Please keep the improved precision but add a fallback (for example, max token count or coarse tokenization) when token streams get large.
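One way to implement the suggested guard is a token-count cap with a coarse fallback. This is a sketch of the reviewer's suggestion, not code from the PR; the name `splitWordsCapped` and the threshold are illustrative:

```typescript
// Hypothetical cap: beyond this, the (m+1)*(n+1) LCS table gets too large.
const MAX_TOKENS = 1000;

function splitWordsCapped(text: string): string[] {
  // Fine-grained tokenization from the PR: words, whitespace, single symbols.
  const fine = text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
  if (fine.length <= MAX_TOKENS) return fine;
  // Coarse fallback for punctuation-heavy lines (e.g. minified JSON):
  // whitespace-delimited tokens, as before this PR.
  return text.split(/(\s+)/).filter((t) => t.length > 0);
}
```

With the cap, a minified line of thousands of symbols costs at most a few coarse tokens in the DP table, while ordinary lines keep the improved precision.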
Summary
- splitWords now tokenizes with Unicode property classes (\p{L}, \p{N}) so accented characters and emoji are tokenized correctly as individual words

Test plan
- Accented-word diff (café→thé)
- Emoji diff (🎉→🚀)

🤖 Generated with Claude Code