Splitting errors in ml, bn, or and mr languages #124

@spaceemotion

I was looking for a faster, compatible alternative to Intl.Segmenter to unify behavior across browsers and platforms and came across this library. Our initial tests looked good (both in accuracy and speed), but I noticed a few inconsistencies with some of our test corpus data (sourced from wikisource and gutenberg).

For comparison, I had Claude Opus generate a script that diffs the output of Intl.Segmenter against unicode-segmenter. It returned this overview:

── bn/bn-kabuliwala ── 4 divergences (977 / 973 / 977 segments) ──
  @1174 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]
  @1183 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]
  @1388 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]
  @1397 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]

── bn/bn-postmaster ── 6 divergences (1168 / 1162 / 1168 segments) ──
  @40 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @161 U+0986 U+09AA U+09BF
    intl:        [আ][পি]
    unicode-seg: [আপি]
  @194 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @288 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @1164 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @1697 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]

── bn/bn-samapti ── 4 divergences (1047 / 1043 / 1047 segments) ──
  @414 U+09B2 U+09CD U+200C U+099C U+09CD U+09AC
    intl:        [ল্‌][জ্ব]
    unicode-seg: [ল্‌জ্ব]
  @424 U+098F U+09AC U+0982
    intl:        [এ][বং]
    unicode-seg: [এবং]
  @436 U+09B2 U+09CD U+200C U+099B
    intl:        [ল্‌][ছ]
    unicode-seg: [ল্‌ছ]
  @443 U+0020 U+0995
    intl:        [ ][ক]
    unicode-seg: [ ক]

── ml/ml-kazhuthayude-vakkuketta-kaalayude-katha ── 75 divergences (2169 / 2094 / 2169 segments) ──
  @176 U+0020 U+0D35
    intl:        [ ][വ]
    unicode-seg: [ വ]
  @252 U+0020 U+0D15
    intl:        [ ][ക]
    unicode-seg: [ ക]
  @322 U+0020 U+0D35 U+0D43
    intl:        [ ][വൃ]
    unicode-seg: [ വൃ]
  @366 U+0020 U+0D2A U+0D41
    intl:        [ ][പു]
    unicode-seg: [ പു]
  @416 U+0020 U+0D2F
    intl:        [ ][യ]
    unicode-seg: [ യ]
  @444 U+0020 U+0D2A U+0D4B
    intl:        [ ][പോ]
    unicode-seg: [ പോ]
  @473 U+0020 U+0D1C U+0D4B
    intl:        [ ][ജോ]
    unicode-seg: [ ജോ]
  @731 U+0020 U+0D28 U+0D40
    intl:        [ ][നീ]
    unicode-seg: [ നീ]
  ... and 67 more

── ml/ml-vasanavikruthi ── 43 divergences (1392 / 1349 / 1392 segments) ──
  @166 U+0D0E U+0D28 U+0D4D U+0D28 U+0D3E
    intl:        [എ][ന്നാ]
    unicode-seg: [എന്നാ]
  @187 U+0D21 U+0D4D U+200C U+0D22 U+0D3F
    intl:        [ഡ്‌][ഢി]
    unicode-seg: [ഡ്‌ഢി]
  @272 U+0D07 U+0D28 U+0D3F
    intl:        [ഇ][നി]
    unicode-seg: [ഇനി]
  @368 U+0D09 U+0D26 U+0D4D U+0D2F U+0D4B
    intl:        [ഉ][ദ്യോ]
    unicode-seg: [ഉദ്യോ]
  @431 U+0020 U+0D24 U+0D3E
    intl:        [ ][താ]
    unicode-seg: [ താ]
  @487 U+0020 U+0D26 U+0D41
    intl:        [ ][ദു]
    unicode-seg: [ ദു]
  @555 U+0020 U+0D2C U+0D41
    intl:        [ ][ബു]
    unicode-seg: [ ബു]
  @641 U+0D12 U+0D30
    intl:        [ഒ][ര]
    unicode-seg: [ഒര]
  ... and 35 more

── mr/mr-kathali-maitri ── 1 divergences (7278 / 7277 / 7278 segments) ──
  @5990 U+0020 U+091C U+093E
    intl:        [ ][जा]
    unicode-seg: [ जा]

── or/or-chhamana-athaguntha-ch1 ── 2 divergences (2865 / 2863 / 2865 segments) ──
  @1658 U+0020 U+0B15 U+0B3E
    intl:        [ ][କା]
    unicode-seg: [ କା]
  @3208 U+0020 U+0B36 U+0B4D U+0B5F U+0B3E
    intl:        [ ][ଶ୍ୟା]
    unicode-seg: [ ଶ୍ୟା]

SUMMARY BY LANGUAGE
═══════════════════
lang          total
───────────────────
ml              118
bn               14
or                2
mr                1
───────────────────
TOTAL           135
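
The script that produced this overview is not included here. As a rough sketch of how such a diff can be computed: take two segmentations of the same text, find the boundaries they share, and report every region between shared boundaries that the two split differently. The helper names (segmentWithIntl, findDivergences) and the result shape are made up for illustration. Offsets are UTF-16 offsets, which may not be what the @ values above use.

```javascript
// Illustrative sketch only; not the script that generated the overview above.

// Reference segmentation via the built-in Intl.Segmenter.
function segmentWithIntl(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// End offsets (UTF-16) of each segment, in ascending order.
function boundaries(segments) {
  const ends = [];
  let pos = 0;
  for (const seg of segments) {
    pos += seg.length;
    ends.push(pos);
  }
  return ends;
}

// Cut text[start, end) at every offset contained in boundarySet.
function splitAt(text, start, end, boundarySet) {
  const parts = [];
  let from = start;
  for (let i = start + 1; i <= end; i++) {
    if (i === end || boundarySet.has(i)) {
      parts.push(text.slice(from, i));
      from = i;
    }
  }
  return parts;
}

// Both segment arrays must concatenate back to `text`.
function findDivergences(text, referenceSegments, candidateSegments) {
  const refEnds = boundaries(referenceSegments);
  const refSet = new Set(refEnds);
  const candSet = new Set(boundaries(candidateSegments));
  const common = refEnds.filter((end) => candSet.has(end));

  const divergences = [];
  let start = 0;
  for (const end of common) {
    const reference = splitAt(text, start, end, refSet);
    const candidate = splitAt(text, start, end, candSet);
    // Between two shared boundaries, any extra boundary is a disagreement.
    if (reference.length > 1 || candidate.length > 1) {
      const codePoints = Array.from(text.slice(start, end), (ch) =>
        'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
      divergences.push({ offset: start, codePoints, reference, candidate });
    }
    start = end;
  }
  return divergences;
}
```

To compare against the library, the candidate array would come from [...graphemeSegments(text)].map(s => s.segment), as in the reproductions below, with segmentWithIntl(text) as the reference.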

I also compared this between Node v24 (which uses Unicode 16 data) and Node v25 (Unicode 17), and there have been no changes in our test corpus snapshots.

What's interesting is that the official Unicode GraphemeBreakTest data does not surface any of these failure cases.

The following report is an attempt at a reproduction by Opus:


Bug 1: ZWNJ (U+200C) does not break GB9c conjunct formation

Symptom: Consonant + Virama + ZWNJ + Consonant is incorrectly kept as one grapheme cluster. It should break after ZWNJ into two clusters.

Affected scripts: Bengali, Malayalam, Oriya, Devanagari — any script using ZWNJ to visually suppress conjunct formation.

Root cause (grapheme.js:169-194): When ZWNJ (GCB=Extend, InCB=None) appears after a virama in a conjunct sequence, the InCB state update runs because consonant=true && catAfter===Extend. The check linker = linker || cp === 0x094D || ... doesn't match ZWNJ (it's not a virama), but linker was already true from the preceding virama. The || preserves it. ZWNJ has InCB=None and should explicitly reset the conjunct state (consonant=false, linker=false), because it marks an intentional break in conjunct formation.

Minimal reproduction:

import { graphemeSegments } from 'unicode-segmenter/grapheme';
// Devanagari KA + Virama + ZWNJ + KA
const text = '\u0915\u094D\u200C\u0915';
const segs = [...graphemeSegments(text)].map(s => s.segment);
// Got:      ['क्‌क']     (1 cluster — wrong)
// Expected: ['क्‌', 'क'] (2 clusters)

Fix: Add before the virama check at line 169:

if (cp === 0x200C) { consonant = false; linker = false; }
else if (consonant && catAfter === 3) { /* existing virama check */ }
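
To make the intended behavior concrete, here is a toy model of the GB9c state tracking. It is not the code in grapheme.js: the INCB table is hand-written for the four code points involved, and gb9cJoins is a hypothetical name. It only answers whether GB9c forbids a break before each code point and ignores every other rule.

```javascript
// Toy Indic_Conjunct_Break lookup; covers only the code points used below.
const INCB = new Map([
  [0x0915, 'Consonant'], // DEVANAGARI LETTER KA
  [0x094D, 'Linker'],    // DEVANAGARI SIGN VIRAMA
  [0x200D, 'Extend'],    // ZERO WIDTH JOINER
  // U+200C ZERO WIDTH NON-JOINER is deliberately absent: InCB=None.
]);

// joins[i] is true when GB9c forbids a break before codePoints[i].
function gb9cJoins(codePoints) {
  let consonant = false; // a conjunct-forming consonant has been seen
  let linker = false;    // a virama has been seen since that consonant
  const joins = [];
  for (const cp of codePoints) {
    const incb = INCB.get(cp) ?? 'None';
    joins.push(incb === 'Consonant' && consonant && linker);
    if (incb === 'Consonant') {
      consonant = true;
      linker = false;
    } else if (incb === 'Linker') {
      if (consonant) linker = true;
    } else if (incb === 'None') {
      // Anything outside the conjunct pattern, ZWNJ included, ends it.
      consonant = false;
      linker = false;
    }
    // InCB=Extend leaves both flags unchanged.
  }
  return joins;
}

console.log(gb9cJoins([0x0915, 0x094D, 0x200D, 0x0915])); // [ false, false, false, true ]
console.log(gb9cJoins([0x0915, 0x094D, 0x200C, 0x0915])); // [ false, false, false, false ]
```

In this model ZWJ keeps the conjunct alive, while ZWNJ resets the state like any other InCB=None code point, so no special case for U+200C is needed once None is handled.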

Bug 2: InCB state (consonant/linker) not reset after grapheme cluster break

Symptom: After a cluster ending with Consonant + Virama (which sets consonant=true, linker=true), if a break occurs (e.g., before a space, comma, or period), the next cluster incorrectly absorbs the following Indic consonant via GB9c.

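Affected scripts: Malayalam, Bengali, Oriya, and Devanagari (including the Marathi sample). Malayalam accounts for 118 of the 135 total divergences, although that count also includes ZWNJ cases from Bug 1.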

Root cause (grapheme.js:141-157): The boundary handler resets emoji, risCount, index, _catBegin, _hd — but not consonant or linker. These leak into the next cluster. When the next cluster starts with a non-Indic character (space, punctuation) followed by an Indic consonant, GB9c fires incorrectly because consonant && linker are still true.

Minimal reproduction:

import { graphemeSegments } from 'unicode-segmenter/grapheme';
// Malayalam KA + Virama + SPACE + VA
const text = '\u0D15\u0D4D\u0020\u0D35';
const segs = [...graphemeSegments(text)].map(s => s.segment);
// Got:      ['ക്', ' വ']         (space joined with VA — wrong)
// Expected: ['ക്', ' ', 'വ']     (3 separate clusters)

Fix: Add in the boundary block after line 153:

consonant = false;
linker = false;
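
The leak can be reproduced in a toy segmenter that knows only two rules: no break before a virama or ZWJ (GB9) and no break inside a conjunct (GB9c). This is a simplified model under stated assumptions, not the library's implementation. The CLASS table, the segment function, and the resetOnBreak option are all hypothetical.

```javascript
// Toy classification; covers only the code points used below.
const CLASS = new Map([
  [0x0D15, 'Consonant'], // MALAYALAM LETTER KA
  [0x0D35, 'Consonant'], // MALAYALAM LETTER VA
  [0x0D4D, 'Linker'],    // MALAYALAM SIGN VIRAMA
  [0x200D, 'Extend'],    // ZERO WIDTH JOINER
]);

function segment(codePoints, { resetOnBreak }) {
  const clusters = [];
  let current = [];
  let consonant = false;
  let linker = false;
  for (const cp of codePoints) {
    const cls = CLASS.get(cp) ?? 'Other';
    const gb9 = cls === 'Linker' || cls === 'Extend';
    const gb9c = cls === 'Consonant' && consonant && linker;
    if (current.length > 0 && !(gb9 || gb9c)) {
      clusters.push(current);
      current = [];
      if (resetOnBreak) {
        // The proposed fix: conjunct state must not survive a boundary.
        consonant = false;
        linker = false;
      }
    }
    current.push(cp);
    if (cls === 'Consonant') {
      consonant = true;
      linker = false;
    } else if (cls === 'Linker' && consonant) {
      linker = true;
    }
    // 'Other' deliberately leaves the flags alone, mimicking the reported leak.
  }
  if (current.length > 0) clusters.push(current);
  return clusters.map((c) => String.fromCodePoint(...c));
}

const input = [0x0D15, 0x0D4D, 0x0020, 0x0D35]; // KA + Virama + SPACE + VA
console.log(segment(input, { resetOnBreak: false })); // [ 'ക്', ' വ' ]
console.log(segment(input, { resetOnBreak: true }));  // [ 'ക്', ' ', 'വ' ]
```

Without the reset, the flags set by KA + Virama are still true when VA arrives, so the model glues VA onto the space exactly as in the reproduction above.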

Why the official GraphemeBreakTest.txt doesn't catch these

  • Bug 1: The test file has ZWNJ in pairwise combinations (KA+ZWNJ, Virama+ZWNJ, ZWNJ+KA) but never the full 4-codepoint Consonant + Virama + ZWNJ + Consonant sequence needed to trigger GB9c with ZWNJ.
  • Bug 2: All test cases are 2-4 codepoints within a single cluster. Cross-cluster state leaks require at least 4 codepoints spanning two or more clusters.
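
Both reproductions could be added as extra regression cases in the same notation GraphemeBreakTest.txt uses, where ÷ marks a break, × marks no break, and code points are written in hex. The two lines below are not part of the official file; they are written by hand from the expected results above. parseBreakTestLine is a hypothetical helper that turns such a line into the expected clusters.

```javascript
// Hypothetical extra cases in GraphemeBreakTest.txt notation (not from the official file).
const extraCases = [
  '÷ 0915 × 094D × 200C ÷ 0915 ÷  # Bug 1: KA, VIRAMA, ZWNJ, KA',
  '÷ 0D15 × 0D4D ÷ 0020 ÷ 0D35 ÷  # Bug 2: KA, VIRAMA, SPACE, VA',
];

// Convert one test line into the list of expected grapheme clusters.
function parseBreakTestLine(line) {
  const spec = line.split('#')[0].trim(); // drop the trailing comment
  const clusters = [];
  let current = '';
  for (const token of spec.split(/\s+/)) {
    if (token === '' || token === '×') continue; // × means "no break here"
    if (token === '÷') {
      if (current !== '') clusters.push(current);
      current = '';
    } else {
      current += String.fromCodePoint(parseInt(token, 16));
    }
  }
  if (current !== '') clusters.push(current);
  return clusters;
}

for (const line of extraCases) {
  const expected = parseBreakTestLine(line);
  console.log(expected.length, expected);
}
// 2 [ 'क्‌', 'क' ]
// 3 [ 'ക്', ' ', 'വ' ]
```

A test runner would then compare each expected array with the segments produced for expected.join('').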
