
fix(diff-viewer): improve inline diff tokenization for symbols, accents, and emoji#555

Merged
matt2e merged 2 commits into main from updates-highlighting-more-than-they-should on Mar 31, 2026

Conversation

@matt2e
Contributor

@matt2e matt2e commented Mar 31, 2026

Summary

  • Split tokens on symbol boundaries (not just whitespace) for finer-grained inline diff highlighting, preventing over-highlighting of unchanged content
  • Use Unicode property escapes (\p{L}, \p{N}) so accented characters and emoji are tokenized correctly as individual words
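The tokenization described above can be sketched as a standalone function. The name `splitWords` matches the function shown in the review diff, but the body here is a self-contained reproduction for illustration, not the exact source file:

```typescript
// Tokenizer sketch: word runs, whitespace runs, and individual symbols
// become separate tokens, so a trailing comma no longer drags the whole
// preceding word into the inline highlight.
function splitWords(text: string): string[] {
  // \p{L} / \p{N} match any Unicode letter/digit (so "café" stays one
  // token); the `u` flag makes the symbol class match whole code points,
  // so emoji are not split into surrogate halves.
  return text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
}

console.log(splitWords("café,")); // → ["café", ","]
console.log(splitWords("a 🎉"));  // → ["a", " ", "🎉"]
```

With a whitespace-only split, `"café,"` would be a single token and the whole word would highlight when only the comma changed.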

Test plan

  • Added test for trailing comma highlighting precision
  • Added test for accented word diff (café → thé)
  • Added test for emoji diff (🎉 → 🚀)
  • Existing tests updated and passing

🤖 Generated with Claude Code

matt2e and others added 2 commits March 31, 2026 14:33
…diff highlights

Previously splitWords only split on whitespace, so a trailing comma or
prefix change caused the entire whitespace-delimited token to highlight.
Now word characters (\w+), whitespace (\s+), and individual symbols are
separate tokens, so only the actually changed parts get highlighted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oji tokenization

The splitWords regex now uses \p{L}\p{N} with the `gu` flags instead of
\w, so accented characters (e.g. café) are kept as whole word tokens and
emoji (e.g. 🎉) are treated as single code points rather than split
surrogate pairs. Adds test cases for both scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
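The failure modes this commit fixes can be demonstrated in isolation. This is an illustrative sketch, not the project's actual before/after code: in JavaScript, `\w` is ASCII-only, and without the `u` flag a regex operates on UTF-16 code units rather than code points:

```typescript
// \w excludes accented letters, so "café" fragments into two tokens.
const asciiTokens = "café".match(/\w+|./g);
// → ["caf", "é"]

// Without `u`, "." matches one UTF-16 code unit, so an emoji (two code
// units) splits into unpaired surrogate halves.
const surrogateTokens = "🎉".match(/\w+|./g);
// → two broken halves ("\ud83c", "\udf89")

// With \p{L}\p{N} and the `gu` flags, words and emoji stay whole.
const unicodeTokens = "🎉 café".match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu);
// → ["🎉", " ", "café"]
```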
@matt2e matt2e requested review from baxen and wesbillman as code owners March 31, 2026 03:50

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65aefeac41



```diff
 function splitWords(text: string): string[] {
-  return text.split(/(\s+)/);
+  return text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
 }
```

P1: Cap tokenization granularity before LCS diffing

computeCharHighlights passes splitWords output into lcsIndices, which allocates a full (m+1)*(n+1) dynamic-programming table. With this regex, punctuation-heavy lines now produce one token per symbol, so a single long modified line (such as minified JSON/lockfile content) can jump from ~1 token to thousands and create tens of millions of DP cells, causing major UI stalls or memory spikes. Please keep the improved precision but add a fallback (for example, max token count or coarse tokenization) when token streams get large.
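One way to act on this suggestion is a capped wrapper. `splitWordsCapped` and `MAX_TOKENS` are hypothetical names sketching the reviewer's fallback idea, not code from the PR:

```typescript
// Assumed threshold: 1000 tokens per side keeps the LCS DP table under
// ~1M cells, which is cheap; tune to taste.
const MAX_TOKENS = 1000;

function splitWordsCapped(text: string): string[] {
  // Fine-grained tokenization: word runs, whitespace runs, single symbols.
  const fine = text.match(/[\p{L}\p{N}_]+|\s+|[^\s\p{L}\p{N}_]/gu) ?? [];
  if (fine.length <= MAX_TOKENS) return fine;
  // Coarse fallback for punctuation-heavy lines (minified JSON, lockfiles):
  // a whitespace split keeps token counts, and thus the DP table, small.
  return text.split(/(\s+)/).filter((t) => t.length > 0);
}
```

Ordinary lines keep the new per-symbol precision; pathological lines degrade gracefully to the old whitespace behavior instead of stalling the UI.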


@matt2e matt2e merged commit 20e8f36 into main Mar 31, 2026
5 checks passed
@matt2e matt2e deleted the updates-highlighting-more-than-they-should branch March 31, 2026 04:02
