I was looking for a faster, compatible alternative to Intl.Segmenter to unify behavior across browsers and platforms and came across this library. Our initial tests looked good (in both accuracy and speed), but I noticed a few inconsistencies in some of our test corpus data (sourced from Wikisource and Project Gutenberg).
To dig into this, I had Claude Opus generate a comparison script, which produced this overview:
── bn/bn-kabuliwala ── 4 divergences (977 / 973 / 977 segments) ──
@1174 U+0997 U+09CD U+200C U+09A1 U+09C1
intl: [গ্][ডু]
unicode-seg: [গ্ডু]
@1183 U+0997 U+09CD U+200C U+09A1 U+09C1
intl: [গ্][ডু]
unicode-seg: [গ্ডু]
@1388 U+0997 U+09CD U+200C U+09A1 U+09C1
intl: [গ্][ডু]
unicode-seg: [গ্ডু]
@1397 U+0997 U+09CD U+200C U+09A1 U+09C1
intl: [গ্][ডু]
unicode-seg: [গ্ডু]
── bn/bn-postmaster ── 6 divergences (1168 / 1162 / 1168 segments) ──
@40 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
intl: [স্ট্][মা]
unicode-seg: [স্ট্মা]
@161 U+0986 U+09AA U+09BF
intl: [আ][পি]
unicode-seg: [আপি]
@194 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
intl: [স্ট্][মা]
unicode-seg: [স্ট্মা]
@288 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
intl: [স্ট্][মা]
unicode-seg: [স্ট্মা]
@1164 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
intl: [স্ট্][মা]
unicode-seg: [স্ট্মা]
@1697 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
intl: [স্ট্][মা]
unicode-seg: [স্ট্মা]
── bn/bn-samapti ── 4 divergences (1047 / 1043 / 1047 segments) ──
@414 U+09B2 U+09CD U+200C U+099C U+09CD U+09AC
intl: [ল্][জ্ব]
unicode-seg: [ল্জ্ব]
@424 U+098F U+09AC U+0982
intl: [এ][বং]
unicode-seg: [এবং]
@436 U+09B2 U+09CD U+200C U+099B
intl: [ল্][ছ]
unicode-seg: [ল্ছ]
@443 U+0020 U+0995
intl: [ ][ক]
unicode-seg: [ ক]
── ml/ml-kazhuthayude-vakkuketta-kaalayude-katha ── 75 divergences (2169 / 2094 / 2169 segments) ──
@176 U+0020 U+0D35
intl: [ ][വ]
unicode-seg: [ വ]
@252 U+0020 U+0D15
intl: [ ][ക]
unicode-seg: [ ക]
@322 U+0020 U+0D35 U+0D43
intl: [ ][വൃ]
unicode-seg: [ വൃ]
@366 U+0020 U+0D2A U+0D41
intl: [ ][പു]
unicode-seg: [ പു]
@416 U+0020 U+0D2F
intl: [ ][യ]
unicode-seg: [ യ]
@444 U+0020 U+0D2A U+0D4B
intl: [ ][പോ]
unicode-seg: [ പോ]
@473 U+0020 U+0D1C U+0D4B
intl: [ ][ജോ]
unicode-seg: [ ജോ]
@731 U+0020 U+0D28 U+0D40
intl: [ ][നീ]
unicode-seg: [ നീ]
... and 67 more
── ml/ml-vasanavikruthi ── 43 divergences (1392 / 1349 / 1392 segments) ──
@166 U+0D0E U+0D28 U+0D4D U+0D28 U+0D3E
intl: [എ][ന്നാ]
unicode-seg: [എന്നാ]
@187 U+0D21 U+0D4D U+200C U+0D22 U+0D3F
intl: [ഡ്][ഢി]
unicode-seg: [ഡ്ഢി]
@272 U+0D07 U+0D28 U+0D3F
intl: [ഇ][നി]
unicode-seg: [ഇനി]
@368 U+0D09 U+0D26 U+0D4D U+0D2F U+0D4B
intl: [ഉ][ദ്യോ]
unicode-seg: [ഉദ്യോ]
@431 U+0020 U+0D24 U+0D3E
intl: [ ][താ]
unicode-seg: [ താ]
@487 U+0020 U+0D26 U+0D41
intl: [ ][ദു]
unicode-seg: [ ദു]
@555 U+0020 U+0D2C U+0D41
intl: [ ][ബു]
unicode-seg: [ ബു]
@641 U+0D12 U+0D30
intl: [ഒ][ര]
unicode-seg: [ഒര]
... and 35 more
── mr/mr-kathali-maitri ── 1 divergences (7278 / 7277 / 7278 segments) ──
@5990 U+0020 U+091C U+093E
intl: [ ][जा]
unicode-seg: [ जा]
── or/or-chhamana-athaguntha-ch1 ── 2 divergences (2865 / 2863 / 2865 segments) ──
@1658 U+0020 U+0B15 U+0B3E
intl: [ ][କା]
unicode-seg: [ କା]
@3208 U+0020 U+0B36 U+0B4D U+0B5F U+0B3E
intl: [ ][ଶ୍ୟା]
unicode-seg: [ ଶ୍ୟା]
SUMMARY BY LANGUAGE
═══════════════════
lang total
───────────────────
ml 118
bn 14
or 2
mr 1
───────────────────
TOTAL 135
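For anyone who wants to reproduce a comparison like the one above, here is a minimal sketch of the idea. It is not the script that produced the report: `findDivergences` and `segmentWithIntl` are hypothetical names, and the second segmenter is passed in as a plain function so the snippet runs without any dependency.

```javascript
// Collect the end offset (in UTF-16 code units) of every segment.
function boundaryOffsets(segments) {
  const offsets = new Set();
  let offset = 0;
  for (const segment of segments) {
    offset += segment.length;
    offsets.add(offset);
  }
  return offsets;
}

// Hypothetical helper: report every boundary that only one segmenter produces.
// `segmentA` and `segmentB` take a string and return an array of segment strings.
function findDivergences(text, segmentA, segmentB) {
  const a = boundaryOffsets(segmentA(text));
  const b = boundaryOffsets(segmentB(text));
  const divergences = [];
  for (const offset of a) if (!b.has(offset)) divergences.push({ offset, missingIn: 'B' });
  for (const offset of b) if (!a.has(offset)) divergences.push({ offset, missingIn: 'A' });
  return divergences.sort((x, y) => x.offset - y.offset);
}

// Reference segmenter built on the platform's Intl.Segmenter.
const intlSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const segmentWithIntl = (text) =>
  [...intlSegmenter.segment(text)].map((s) => s.segment);

// To compare against the library, pass a second function such as
//   (text) => [...graphemeSegments(text)].map((s) => s.segment)
// where graphemeSegments comes from 'unicode-segmenter/grapheme'.
```

Comparing boundary offsets rather than segment arrays keeps the diff stable even when a single missed break shifts every later segment index.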
I also ran the comparison on both Node v24 (which uses Unicode 16 data) and Node v25 (Unicode 17), and there were no changes in our test corpus snapshots.
What's interesting is that these failure cases do not show up even with the official Unicode GraphemeBreakTest data.
The following report is an attempt at a reproduction by Opus:
Bug 1: ZWNJ (U+200C) does not break GB9c conjunct formation
Symptom: Consonant + Virama + ZWNJ + Consonant is incorrectly kept as one grapheme cluster. It should break after ZWNJ into two clusters.
Affected scripts: Bengali, Malayalam, Oriya, Devanagari — any script using ZWNJ to visually suppress conjunct formation.
Root cause (grapheme.js:169-194): When ZWNJ (GCB=Extend, InCB=None) appears after a virama in a conjunct sequence, the InCB state update runs because consonant=true && catAfter===Extend. The check linker = linker || cp === 0x094D || ... doesn't match ZWNJ (it's not a virama), but linker was already true from the preceding virama. The || preserves it. ZWNJ has InCB=None and should explicitly reset the conjunct state (consonant=false, linker=false), because it marks an intentional break in conjunct formation.
Minimal reproduction:
import { graphemeSegments } from 'unicode-segmenter/grapheme';
// Devanagari KA + Virama + ZWNJ + KA
const text = '\u0915\u094D\u200C\u0915';
const segs = [...graphemeSegments(text)].map(s => s.segment);
// Got: ['क्क'] (1 cluster: wrong)
// Expected: ['क्', 'क'] (2 clusters; the invisible ZWNJ stays attached to the first one, per GB9)
Fix: Add before the virama check at line 169:
if (cp === 0x200C) { consonant = false; linker = false; }
else if (consonant && catAfter === 3) { /* existing virama check */ }
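To make the intended GB9c behavior concrete, here is a simplified model of the state tracking. It is a sketch under assumptions, not the library's code: the `IN_CB` table is hand-written and covers only the code points used here, and `gb9cJoinsLast` is a hypothetical helper that answers one question, namely whether GB9c forbids a break before the final code point.

```javascript
// Toy InCB lookup (assumption: only these code points matter for the demo;
// a real implementation derives this from the Unicode data files).
const IN_CB = new Map([
  [0x0915, 'Consonant'], // DEVANAGARI LETTER KA
  [0x094D, 'Linker'],    // DEVANAGARI SIGN VIRAMA
  [0x200D, 'Extend'],    // ZERO WIDTH JOINER
  // U+200C ZERO WIDTH NON-JOINER is intentionally absent: InCB=None
]);
const inCB = (cp) => IN_CB.get(cp) ?? 'None';

// GB9c: Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant
function gb9cJoinsLast(codePoints) {
  let consonant = false; // a conjunct-starting consonant has been seen
  let linker = false;    // a linker has been seen since that consonant
  const last = codePoints.length - 1;
  for (let i = 0; i < last; i++) {
    const kind = inCB(codePoints[i]);
    if (kind === 'Consonant') {
      consonant = true;
      linker = false;
    } else if (kind === 'Linker') {
      linker = consonant;
    } else if (kind === 'None') {
      // Anything outside the InCB classes ends the sequence, ZWNJ included.
      consonant = false;
      linker = false;
    }
    // kind === 'Extend' leaves the state untouched.
  }
  return consonant && linker && inCB(codePoints[last]) === 'Consonant';
}

console.log(gb9cJoinsLast([0x0915, 0x094D, 0x0915]));         // true
console.log(gb9cJoinsLast([0x0915, 0x094D, 0x200D, 0x0915])); // true  (ZWJ is InCB=Extend)
console.log(gb9cJoinsLast([0x0915, 0x094D, 0x200C, 0x0915])); // false (ZWNJ is InCB=None)
```

The contrast between the ZWJ and ZWNJ cases is the point: both are GCB=Extend, so only the InCB class distinguishes them.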
Bug 2: InCB state (consonant/linker) not reset after grapheme cluster break
Symptom: After a cluster ending with Consonant + Virama (which sets consonant=true, linker=true), if a break occurs (e.g., before a space, comma, or period), the next cluster incorrectly absorbs the following Indic consonant via GB9c.
Affected scripts: Malayalam (which accounts for 118 of the 135 total divergences), Bengali, Oriya, Devanagari (including the Marathi sample).
Root cause (grapheme.js:141-157): The boundary handler resets emoji, risCount, index, _catBegin, _hd — but not consonant or linker. These leak into the next cluster. When the next cluster starts with a non-Indic character (space, punctuation) followed by an Indic consonant, GB9c fires incorrectly because consonant && linker are still true.
Minimal reproduction:
import { graphemeSegments } from 'unicode-segmenter/grapheme';
// Malayalam KA + Virama + SPACE + VA
const text = '\u0D15\u0D4D\u0020\u0D35';
const segs = [...graphemeSegments(text)].map(s => s.segment);
// Got: ['ക്', ' വ'] (space joined with VA — wrong)
// Expected: ['ക്', ' ', 'വ'] (3 separate clusters)
Fix: Add in the boundary block after line 153:
consonant = false;
linker = false;
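The leak can be illustrated with a toy segmenter. This is a deliberately reduced model, not the library's implementation: it only knows Malayalam consonants and the Malayalam virama, `toySegments` and its `resetOnBreak` option are hypothetical, and non-Indic code points leave the state untouched to mimic the behavior described above.

```javascript
const MALAYALAM_VIRAMA = 0x0D4D; // GCB=Extend, InCB=Linker
// Assumption: a subset of Malayalam consonants (KA..HA) is enough for the demo.
const isMalayalamConsonant = (cp) => cp >= 0x0D15 && cp <= 0x0D39;

function toySegments(text, { resetOnBreak }) {
  const segments = [];
  let current = '';
  let consonant = false;
  let linker = false;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const isExtend = cp === MALAYALAM_VIRAMA;                              // GB9
    const isConjunct = consonant && linker && isMalayalamConsonant(cp);    // GB9c
    if (current !== '' && !isExtend && !isConjunct) {
      // Boundary: emit the finished cluster.
      segments.push(current);
      current = '';
      if (resetOnBreak) {
        consonant = false;
        linker = false;
      }
    }
    current += ch;
    // Update the conjunct state from the code point just consumed.
    if (isMalayalamConsonant(cp)) {
      consonant = true;
      linker = false;
    } else if (cp === MALAYALAM_VIRAMA) {
      linker = consonant;
    }
  }
  if (current !== '') segments.push(current);
  return segments;
}

const sample = '\u0D15\u0D4D\u0020\u0D35'; // KA + Virama + SPACE + VA
console.log(toySegments(sample, { resetOnBreak: false }).length); // 2 (space absorbs VA)
console.log(toySegments(sample, { resetOnBreak: true }).length);  // 3
```

Resetting at the boundary does not affect genuine conjuncts, because no boundary is emitted between the virama and the following consonant in that case.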
Why the official GraphemeBreakTest.txt doesn't catch these
- Bug 1: The test file has ZWNJ in pairwise combinations (KA+ZWNJ, Virama+ZWNJ, ZWNJ+KA) but never the full 4-codepoint Consonant + Virama + ZWNJ + Consonant sequence needed to exercise GB9c with an intervening ZWNJ.
- Bug 2: All test cases are 2-4 codepoints within a single cluster. A cross-cluster state leak needs at least 4 codepoints that span a cluster boundary.
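The missing sequences are easy to generate, so a small supplementary test table could cover both bugs across scripts. The sketch below is one possible shape, with hypothetical names (`buildCases`, `failingCases`); the expected cluster counts follow from GB9, GB9c and GB999, and the platform's Intl.Segmenter is used here only as a stand-in for the segmenter under test.

```javascript
// KA and virama for each affected script.
const SCRIPTS = {
  Devanagari: { ka: 0x0915, virama: 0x094D },
  Bengali:    { ka: 0x0995, virama: 0x09CD },
  Oriya:      { ka: 0x0B15, virama: 0x0B4D },
  Malayalam:  { ka: 0x0D15, virama: 0x0D4D },
};
const ZWNJ = 0x200C;
const SPACE = 0x0020;

function buildCases() {
  const cases = [];
  for (const [script, { ka, virama }] of Object.entries(SCRIPTS)) {
    // Bug 1: Consonant + Virama + ZWNJ + Consonant must be 2 clusters.
    cases.push({
      name: `${script}: ZWNJ after virama`,
      text: String.fromCodePoint(ka, virama, ZWNJ, ka),
      expected: 2,
    });
    // Bug 2: Consonant + Virama + SPACE + Consonant must be 3 clusters.
    cases.push({
      name: `${script}: space after virama`,
      text: String.fromCodePoint(ka, virama, SPACE, ka),
      expected: 3,
    });
  }
  return cases;
}

// `segment` takes a string and returns an array of segment strings.
function failingCases(segment) {
  return buildCases()
    .map((c) => ({ ...c, actual: segment(c.text).length }))
    .filter((c) => c.actual !== c.expected);
}

const referenceSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const referenceSegments = (text) =>
  [...referenceSegmenter.segment(text)].map((s) => s.segment);

console.log(failingCases(referenceSegments).map((c) => c.name));
```

Passing a function that wraps the library's `graphemeSegments` instead of `referenceSegments` would turn this into a regression check for the two fixes.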