
fix(diff-viewer): improve inline diff tokenization for symbols, accents, and emoji#555

Merged
matt2e merged 2 commits into main from updates-highlighting-more-than-they-should on Mar 31, 2026

Conversation

@matt2e
Contributor

@matt2e matt2e commented Mar 31, 2026

Summary

  • Split tokens on symbol boundaries (not just whitespace) for finer-grained inline diff highlighting, preventing over-highlighting of unchanged content
  • Use Unicode property escapes (\p{L}, \p{N}) so accented characters and emoji are tokenized correctly as individual words
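The tokenization described above can be sketched as a standalone function. The name `splitWords` matches the function shown in the review diff, but the body here is a self-contained reproduction for illustration, not the exact source file:

```typescript
// Tokenizer sketch: word runs, whitespace runs, and individual symbols
// become separate tokens, so a trailing comma no longer drags the whole
// preceding word into the inline highlight.
function splitWords(text: string): string[] {
  // \p{L} / \p{N} match any Unicode letter/digit (so "café" stays one
  // token); the `u` flag makes the symbol class match whole code points,
  // so emoji are not split into surrogate halves.
  return text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
}

console.log(splitWords("café,")); // → ["café", ","]
console.log(splitWords("a 🎉"));  // → ["a", " ", "🎉"]
```

With a whitespace-only split, `"café,"` would be a single token and the whole word would highlight when only the comma changed.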

Test plan

  • Added test for trailing comma highlighting precision
  • Added test for accented word diff (café → thé)
  • Added test for emoji diff (🎉 → 🚀)
  • Existing tests updated and passing

🤖 Generated with Claude Code

matt2e and others added 2 commits March 31, 2026 14:33
…diff highlights

Previously splitWords only split on whitespace, so a trailing comma or
prefix change caused the entire whitespace-delimited token to highlight.
Now word characters (\w+), whitespace (\s+), and individual symbols are
separate tokens, so only the actually changed parts get highlighted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oji tokenization

The splitWords regex now uses \p{L}\p{N} with the `gu` flags instead of
\w, so accented characters (e.g. café) are kept as whole word tokens and
emoji (e.g. 🎉) are treated as single code points rather than split
surrogate pairs. Adds test cases for both scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
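The failure modes this commit fixes can be demonstrated in isolation. This is an illustrative sketch, not the project's actual before/after code: in JavaScript, `\w` is ASCII-only, and without the `u` flag a regex operates on UTF-16 code units rather than code points:

```typescript
// \w excludes accented letters, so "café" fragments into two tokens.
const asciiTokens = "café".match(/\w+|./g);
// → ["caf", "é"]

// Without `u`, "." matches one UTF-16 code unit, so an emoji (two code
// units) splits into unpaired surrogate halves.
const surrogateTokens = "🎉".match(/\w+|./g);
// → two broken halves ("\ud83c", "\udf89")

// With \p{L}\p{N} and the `gu` flags, words and emoji stay whole.
const unicodeTokens = "🎉 café".match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu);
// → ["🎉", " ", "café"]
```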
@matt2e matt2e requested review from baxen and wesbillman as code owners March 31, 2026 03:50

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65aefeac41



```diff
 function splitWords(text: string): string[] {
-  return text.split(/(\s+)/);
+  return text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
 }
```

P1: Cap tokenization granularity before LCS diffing

computeCharHighlights passes splitWords output into lcsIndices, which allocates a full (m+1)*(n+1) dynamic-programming table. With this regex, punctuation-heavy lines now produce one token per symbol, so a single long modified line (such as minified JSON/lockfile content) can jump from ~1 token to thousands and create tens of millions of DP cells, causing major UI stalls or memory spikes. Please keep the improved precision but add a fallback (for example, max token count or coarse tokenization) when token streams get large.
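One way to act on this suggestion is a capped wrapper. `splitWordsCapped` and `MAX_TOKENS` are hypothetical names sketching the reviewer's fallback idea, not code from the PR:

```typescript
// Assumed threshold: 1000 tokens per side keeps the LCS DP table under
// ~1M cells, which is cheap; tune to taste.
const MAX_TOKENS = 1000;

function splitWordsCapped(text: string): string[] {
  // Fine-grained tokenization: word runs, whitespace runs, single symbols.
  const fine = text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
  if (fine.length <= MAX_TOKENS) return fine;
  // Coarse fallback for punctuation-heavy lines (minified JSON, lockfiles):
  // a whitespace split keeps token counts, and thus the DP table, small.
  return text.split(/(\s+)/).filter((t) => t.length > 0);
}
```

Ordinary lines keep the new per-symbol precision; pathological lines degrade gracefully to the old whitespace behavior instead of stalling the UI.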


@matt2e matt2e merged commit 20e8f36 into main Mar 31, 2026
5 checks passed
@matt2e matt2e deleted the updates-highlighting-more-than-they-should branch March 31, 2026 04:02
