Splitting errors in `ml`, `bn`, `or` and `mr` languages

I was looking for a faster, compatible alternative to `Intl.Segmenter` to unify behavior across browsers and platforms, and came across this library. Our initial tests looked good in both accuracy and speed, but I noticed a few inconsistencies in parts of our test corpus (sourced from Wikisource and Project Gutenberg).

To compare the two, I had Claude Opus generate a script, which returned this overview:
```
── bn/bn-kabuliwala ── 4 divergences (977 / 973 / 977 segments) ──
  @1174 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]
  @1183 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]
  @1388 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]
  @1397 U+0997 U+09CD U+200C U+09A1 U+09C1
    intl:        [গ্‌][ডু]
    unicode-seg: [গ্‌ডু]

── bn/bn-postmaster ── 6 divergences (1168 / 1162 / 1168 segments) ──
  @40 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @161 U+0986 U+09AA U+09BF
    intl:        [আ][পি]
    unicode-seg: [আপি]
  @194 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @288 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @1164 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]
  @1697 U+09B8 U+09CD U+099F U+09CD U+200C U+09AE U+09BE
    intl:        [স্ট্‌][মা]
    unicode-seg: [স্ট্‌মা]

── bn/bn-samapti ── 4 divergences (1047 / 1043 / 1047 segments) ──
  @414 U+09B2 U+09CD U+200C U+099C U+09CD U+09AC
    intl:        [ল্‌][জ্ব]
    unicode-seg: [ল্‌জ্ব]
  @424 U+098F U+09AC U+0982
    intl:        [এ][বং]
    unicode-seg: [এবং]
  @436 U+09B2 U+09CD U+200C U+099B
    intl:        [ল্‌][ছ]
    unicode-seg: [ল্‌ছ]
  @443 U+0020 U+0995
    intl:        [ ][ক]
    unicode-seg: [ ক]

── ml/ml-kazhuthayude-vakkuketta-kaalayude-katha ── 75 divergences (2169 / 2094 / 2169 segments) ──
  @176 U+0020 U+0D35
    intl:        [ ][വ]
    unicode-seg: [ വ]
  @252 U+0020 U+0D15
    intl:        [ ][ക]
    unicode-seg: [ ക]
  @322 U+0020 U+0D35 U+0D43
    intl:        [ ][വൃ]
    unicode-seg: [ വൃ]
  @366 U+0020 U+0D2A U+0D41
    intl:        [ ][പു]
    unicode-seg: [ പു]
  @416 U+0020 U+0D2F
    intl:        [ ][യ]
    unicode-seg: [ യ]
  @444 U+0020 U+0D2A U+0D4B
    intl:        [ ][പോ]
    unicode-seg: [ പോ]
  @473 U+0020 U+0D1C U+0D4B
    intl:        [ ][ജോ]
    unicode-seg: [ ജോ]
  @731 U+0020 U+0D28 U+0D40
    intl:        [ ][നീ]
    unicode-seg: [ നീ]
  ... and 67 more

── ml/ml-vasanavikruthi ── 43 divergences (1392 / 1349 / 1392 segments) ──
  @166 U+0D0E U+0D28 U+0D4D U+0D28 U+0D3E
    intl:        [എ][ന്നാ]
    unicode-seg: [എന്നാ]
  @187 U+0D21 U+0D4D U+200C U+0D22 U+0D3F
    intl:        [ഡ്‌][ഢി]
    unicode-seg: [ഡ്‌ഢി]
  @272 U+0D07 U+0D28 U+0D3F
    intl:        [ഇ][നി]
    unicode-seg: [ഇനി]
  @368 U+0D09 U+0D26 U+0D4D U+0D2F U+0D4B
    intl:        [ഉ][ദ്യോ]
    unicode-seg: [ഉദ്യോ]
  @431 U+0020 U+0D24 U+0D3E
    intl:        [ ][താ]
    unicode-seg: [ താ]
  @487 U+0020 U+0D26 U+0D41
    intl:        [ ][ദു]
    unicode-seg: [ ദു]
  @555 U+0020 U+0D2C U+0D41
    intl:        [ ][ബു]
    unicode-seg: [ ബു]
  @641 U+0D12 U+0D30
    intl:        [ഒ][ര]
    unicode-seg: [ഒര]
  ... and 35 more

── mr/mr-kathali-maitri ── 1 divergences (7278 / 7277 / 7278 segments) ──
  @5990 U+0020 U+091C U+093E
    intl:        [ ][जा]
    unicode-seg: [ जा]

── or/or-chhamana-athaguntha-ch1 ── 2 divergences (2865 / 2863 / 2865 segments) ──
  @1658 U+0020 U+0B15 U+0B3E
    intl:        [ ][କା]
    unicode-seg: [ କା]
  @3208 U+0020 U+0B36 U+0B4D U+0B5F U+0B3E
    intl:        [ ][ଶ୍ୟା]
    unicode-seg: [ ଶ୍ୟା]

SUMMARY BY LANGUAGE
═══════════════════
lang          total
───────────────────
ml              118
bn               14
or                2
mr                1
───────────────────
TOTAL           135
```
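The script itself is not included here. As a rough sketch of the kind of comparison involved (this is not the actual script; `findDivergences` and `toCodePoints` are hypothetical names, and offsets are assumed to be UTF-16 code units), two segmentations of the same text can be aligned like this:

```js
// Hypothetical helper: formats a string as "U+XXXX U+XXXX ...".
function toCodePoints(s) {
  return [...s]
    .map(ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');
}

// Aligns two segmentations of the SAME text (arrays of segment strings)
// and returns the regions where their boundaries disagree.
function findDivergences(segsA, segsB) {
  const divergences = [];
  let i = 0, j = 0, offset = 0; // offset counts UTF-16 code units (assumption)
  while (i < segsA.length && j < segsB.length) {
    const a = [segsA[i++]];
    const b = [segsB[j++]];
    let endA = offset + a[0].length;
    let endB = offset + b[0].length;
    // Pull segments from whichever side is behind until both ends line up again.
    while (endA !== endB) {
      const behindA = endA < endB;
      const next = behindA ? segsA[i++] : segsB[j++];
      if (next === undefined) throw new Error('segmentations cover different text');
      if (behindA) { a.push(next); endA += next.length; }
      else { b.push(next); endB += next.length; }
    }
    if (a.length > 1 || b.length > 1) {
      divergences.push({ offset, codePoints: toCodePoints(a.join('')), a, b });
    }
    offset = endA;
  }
  return divergences;
}

// Usage (assuming both segmenters were run over the same `text`):
// const a = [...new Intl.Segmenter().segment(text)].map(s => s.segment);
// const b = [...graphemeSegments(text)].map(s => s.segment);
// findDivergences(a, b);
```

Working on plain arrays of segments keeps the comparison independent of either segmenter, so the same helper can be pointed at snapshots from different Node versions.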

I also compared Node v24 (which uses Unicode 16 data) against Node v25 (Unicode 17), and our test corpus snapshots did not change between the two.

Interestingly, these failure cases do not show up when running the official Unicode `GraphemeBreakTest` data.

The following report is an attempt at a reproduction by Opus:

---

## Bug 1: ZWNJ (U+200C) does not break GB9c conjunct formation

**Symptom**: `Consonant + Virama + ZWNJ + Consonant` is incorrectly kept as one grapheme cluster. It should break after ZWNJ into two clusters.

**Affected scripts**: Bengali, Malayalam, Oriya, Devanagari, and any other script that uses ZWNJ to visually suppress conjunct formation.

**Root cause** (`grapheme.js:169-194`): When ZWNJ (GCB=Extend, InCB=None) appears after a virama in a conjunct sequence, the InCB state update runs because `consonant=true && catAfter===Extend`. The check `linker = linker || cp === 0x094D || ...` doesn't match ZWNJ (it's not a virama), but `linker` was **already** `true` from the preceding virama. The `||` preserves it. ZWNJ has `InCB=None` and should explicitly **reset** the conjunct state (`consonant=false, linker=false`), because it marks an intentional break in conjunct formation.

**Minimal reproduction**:
```js
import { graphemeSegments } from 'unicode-segmenter/grapheme';
// Devanagari KA + Virama + ZWNJ + KA
const text = '\u0915\u094D\u200C\u0915';
const segs = [...graphemeSegments(text)].map(s => s.segment);
// Got:      ['क्‌क']     (1 cluster — wrong)
// Expected: ['क्‌', 'क'] (2 clusters)
```

**Fix**: Add before the virama check at line 169:
```js
if (cp === 0x200C) { consonant = false; linker = false; }
else if (consonant && catAfter === 3) { /* existing virama check */ }
```
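To illustrate why a sticky `linker` flag produces this result, here is a toy model. It is not the library's code: the code point tables are hand-written for three characters, `modelZwnj` and `resetOnZwnj` are hypothetical names, and only GB9 (no break before Extend) and GB9c are modelled.

```js
// Toy model of GB9c state tracking. NOT the library's implementation.
function modelZwnj(text, resetOnZwnj) {
  const CONSONANTS = new Set([0x0915]); // DEVANAGARI KA (InCB=Consonant)
  const LINKERS = new Set([0x094D]);    // DEVANAGARI VIRAMA (InCB=Linker, GCB=Extend)
  const ZWNJ = 0x200C;                  // GCB=Extend, InCB=None

  const clusters = [];
  let current = '';
  let consonant = false; // an InCB=Consonant has been seen
  let linker = false;    // an InCB=Linker has been seen since that consonant
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const isConsonant = CONSONANTS.has(cp);
    const isExtend = LINKERS.has(cp) || cp === ZWNJ;
    const gb9c = isConsonant && consonant && linker;
    if (current !== '' && !isExtend && !gb9c) {
      clusters.push(current);
      current = '';
    }
    current += ch;
    if (isConsonant) { consonant = true; linker = false; }
    else if (resetOnZwnj && cp === ZWNJ) { consonant = false; linker = false; }
    // Without the reset above, ZWNJ lands here and `||` keeps linker === true.
    else if (consonant && isExtend) { linker = linker || LINKERS.has(cp); }
    else { consonant = false; linker = false; }
  }
  if (current !== '') clusters.push(current);
  return clusters;
}

const text = '\u0915\u094D\u200C\u0915'; // KA + Virama + ZWNJ + KA
console.log(modelZwnj(text, false).length); // 1 (ZWNJ is skipped like any other Extend)
console.log(modelZwnj(text, true).length);  // 2 (ZWNJ ends the conjunct)
```

A plain `KA + Virama + KA` sequence stays a single cluster in both modes, so the reset only affects sequences that actually contain ZWNJ.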

## Bug 2: InCB state (`consonant`/`linker`) not reset after grapheme cluster break

**Symptom**: After a cluster ending with `Consonant + Virama` (which sets `consonant=true, linker=true`), if a break occurs (e.g., before a space, comma, or period), the next cluster incorrectly absorbs the following Indic consonant via GB9c.

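**Affected scripts**: Malayalam, Bengali, Oriya, and Devanagari (including Marathi). Malayalam accounts for 118 of the 135 divergences in the overview above, though at least one of those (`U+0D21 U+0D4D U+200C U+0D22 U+0D3F`) is a Bug 1 ZWNJ case.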

**Root cause** (`grapheme.js:141-157`): The boundary handler resets `emoji`, `risCount`, `index`, `_catBegin`, `_hd` — but **not** `consonant` or `linker`. These leak into the next cluster. When the next cluster starts with a non-Indic character (space, punctuation) followed by an Indic consonant, GB9c fires incorrectly because `consonant && linker` are still `true`.

**Minimal reproduction**:
```js
import { graphemeSegments } from 'unicode-segmenter/grapheme';
// Malayalam KA + Virama + SPACE + VA
const text = '\u0D15\u0D4D\u0020\u0D35';
const segs = [...graphemeSegments(text)].map(s => s.segment);
// Got:      ['ക്', ' വ']         (space joined with VA — wrong)
// Expected: ['ക്', ' ', 'വ']     (3 separate clusters)
```

**Fix**: Add in the boundary block after line 153:
```js
consonant = false;
linker = false;
```
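The leak can be shown with a similar toy model. Again, this is a sketch that mimics the behavior described above, not the library's code; `modelLeak` and `resetOnBreak` are hypothetical names, and the tables cover only the three Malayalam code points from the reproduction.

```js
// Toy model of InCB state surviving a cluster boundary. NOT the library's implementation.
function modelLeak(text, resetOnBreak) {
  const CONSONANTS = new Set([0x0D15, 0x0D35]); // MALAYALAM KA, VA
  const LINKERS = new Set([0x0D4D]);            // MALAYALAM VIRAMA

  const clusters = [];
  let current = '';
  let consonant = false;
  let linker = false;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const isConsonant = CONSONANTS.has(cp);
    const isLinker = LINKERS.has(cp);
    const gb9c = isConsonant && consonant && linker;
    if (current !== '' && !isLinker && !gb9c) {
      clusters.push(current);
      current = '';
      // The proposed fix: clear the conjunct state at every boundary.
      if (resetOnBreak) { consonant = false; linker = false; }
    }
    current += ch;
    if (isConsonant) { consonant = true; linker = false; }
    else if (consonant && isLinker) { linker = true; }
    // Any other character (e.g. a space) leaves the state untouched: the modelled leak.
  }
  if (current !== '') clusters.push(current);
  return clusters;
}

const text = '\u0D15\u0D4D\u0020\u0D35'; // KA + Virama + SPACE + VA
console.log(modelLeak(text, false).length); // 2 (space absorbs VA)
console.log(modelLeak(text, true).length);  // 3
```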

## Why the official GraphemeBreakTest.txt doesn't catch these

- **Bug 1**: The test file has ZWNJ in pairwise combinations (`KA+ZWNJ`, `Virama+ZWNJ`, `ZWNJ+KA`) but never the full 4-codepoint `Consonant + Virama + ZWNJ + Consonant` sequence needed to trigger GB9c with ZWNJ.
- **Bug 2**: All test cases are 2-4 codepoints within a single cluster. Triggering a cross-cluster state leak requires at least 4 codepoints spanning more than one cluster.
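
For regression coverage, the two reproductions can be written in the same `÷` / `×` notation the official file uses. A small sketch follows (`toBreakTestLine` is a hypothetical helper; the resulting lines encode the expected segmentation claimed in this report and are not taken from the official test data):

```js
// Formats expected clusters as a line in the ÷ (break) / × (no break) notation.
function toBreakTestLine(clusters) {
  const hex = ch => ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  const body = clusters
    .map(cluster => [...cluster].map(hex).join(' × '))
    .join(' ÷ ');
  return `÷ ${body} ÷`;
}

// Bug 1: KA + Virama + ZWNJ | KA
console.log(toBreakTestLine(['\u0915\u094D\u200C', '\u0915']));
// ÷ 0915 × 094D × 200C ÷ 0915 ÷

// Bug 2: KA + Virama | SPACE | VA
console.log(toBreakTestLine(['\u0D15\u0D4D', ' ', '\u0D35']));
// ÷ 0D15 × 0D4D ÷ 0020 ÷ 0D35 ÷
```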
