782 improve docx importing with recursive text splitting to avoid excessively large cells #894

Open
Luke-Bilhorn wants to merge 10 commits into main from
782-improve-docx-importing-with-recursive-text-splitting-to-avoid-excessively-large-cells

@Luke-Bilhorn
Contributor

More we could do here; can update later.

Paragraphs in imported DOCX files are now recursively split into
translator-friendly cells. A new general-purpose text splitter
(NewSourceUploader/utils/textSplitter.ts) takes a plain string and
an ideal cell length, then recursively bisects at the boundary nearest the
midpoint, trying three tiers in order: sentence ends (L1), sub-sentence stops
such as commas, dashes, and ellipses (L2), and whitespace as a last resort
(L3). Each tier has its own length threshold (×1.1 / ×1.5 / ×2.4 of the ideal
length), and a minimum side-length guard (×0.3) prevents tiny fragments.
Multilingual punctuation is supported across Latin,
CJK, Arabic, Urdu, Devanagari, Ethiopic, and several other scripts.
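
As a rough illustration of the tiered bisection described above, here is a minimal sketch. It is not the code in textSplitter.ts: the names `TIERS`, `findSplit`, and `splitText` are hypothetical, the punctuation sets are heavily simplified, and the way the multipliers are applied (a tier is tried only when the text is longer than ideal × threshold, and a cut is rejected if either side is shorter than ideal × 0.3) is an assumption about how the description could be read.

```typescript
// Hypothetical tier table. The multipliers come from the PR description;
// how they are applied below is an assumption, not the actual implementation.
interface Tier {
  pattern: RegExp;
  threshold: number;
}

const TIERS: Tier[] = [
  // L1: sentence ends (Latin plus CJK full stop, exclamation, question mark)
  { pattern: /[.!?\u3002\uFF01\uFF1F]+\s*/, threshold: 1.1 },
  // L2: sub-sentence stops (comma, semicolon, colon, CJK commas, dashes, ellipsis)
  { pattern: /[,;:\u3001\uFF0C\u2013\u2014\u2026]+\s*/, threshold: 1.5 },
  // L3: whitespace as a last resort
  { pattern: /\s+/, threshold: 2.4 },
];
const MIN_SIDE = 0.3;

// Returns the cut offset nearest the midpoint from the first tier that
// yields an acceptable cut, or null if the text should stay whole.
function findSplit(text: string, ideal: number): number | null {
  const mid = text.length / 2;
  for (const tier of TIERS) {
    if (text.length <= ideal * tier.threshold) continue;
    const re = new RegExp(tier.pattern.source, "g");
    let best: number | null = null;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      const cut = m.index + m[0].length; // cut after the delimiter
      if (cut <= 0 || cut >= text.length) continue; // never produce an empty side
      if (Math.min(cut, text.length - cut) < ideal * MIN_SIDE) continue;
      if (best === null || Math.abs(cut - mid) < Math.abs(best - mid)) best = cut;
    }
    if (best !== null) return best;
  }
  return null;
}

// Recursively bisects until no tier offers an acceptable cut.
function splitText(text: string, ideal: number): string[] {
  const cut = findSplit(text, ideal);
  if (cut === null) return [text];
  return [
    ...splitText(text.slice(0, cut), ideal),
    ...splitText(text.slice(cut), ideal),
  ];
}
```

In this sketch every piece keeps its trailing delimiter and whitespace, so joining the pieces reproduces the input exactly.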

The DOCX importer uses this splitter via run-aware helpers that preserve
inline formatting (bold, italic, font, color, etc.) even when a split
falls mid-run. Split cells carry segmentIndex/segmentCount metadata so
the round-trip exporter can recombine translated segments in order before
writing them back to the original <w:p>, keeping the output DOCX
structurally identical to the input. Both fields are optional additions to
cell metadata.
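
To make the two ideas in this paragraph concrete, the sketch below shows one way a run list could be cut at a character offset without losing formatting, and one way split cells could be put back in order. Only `segmentIndex` and `segmentCount` are named in this PR; the `Run` and `SegmentCell` shapes, the `paragraphId` field, and both function names are hypothetical, and joining segments with no separator is an assumption.

```typescript
// Hypothetical shapes for illustration only.
interface Run {
  text: string;
  bold?: boolean;
  italic?: boolean;
}

interface SegmentCell {
  paragraphId: string; // assumed key tying a cell to its original <w:p>
  segmentIndex: number;
  segmentCount: number;
  text: string;
}

// Cuts a run list at a character offset. A run that straddles the offset is
// divided in two, and both halves keep the original run's formatting.
function splitRunsAt(runs: Run[], offset: number): [Run[], Run[]] {
  const left: Run[] = [];
  const right: Run[] = [];
  let pos = 0;
  for (const run of runs) {
    const end = pos + run.text.length;
    if (end <= offset) {
      left.push(run);
    } else if (pos >= offset) {
      right.push(run);
    } else {
      const cut = offset - pos;
      left.push({ ...run, text: run.text.slice(0, cut) });
      right.push({ ...run, text: run.text.slice(cut) });
    }
    pos = end;
  }
  return [left, right];
}

// Groups cells by paragraph and joins each group's text in segmentIndex order.
function recombineSegments(cells: SegmentCell[]): Record<string, string> {
  const groups: Record<string, SegmentCell[]> = {};
  for (const cell of cells) {
    (groups[cell.paragraphId] = groups[cell.paragraphId] || []).push(cell);
  }
  const out: Record<string, string> = {};
  for (const id of Object.keys(groups)) {
    const ordered = (groups[id] || []).slice().sort((a, b) => a.segmentIndex - b.segmentIndex);
    out[id] = ordered.map((c) => c.text).join("");
  }
  return out;
}
```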

The ideal cell length defaults to 160 characters and is user-adjustable
via a collapsible Advanced Settings panel in the DOCX import UI.

Adds an `advancedSettings` prop slot to UnifiedImporterForm, rendered as a
collapsible panel below the file-selection card, and uses it from the DOCX
importer to let the user override the ideal cell length used by the recursive
paragraph splitter. Defaults to 160 characters.

Ports the intent of the old experiment-layout "Made Ideal Segment Length
Button visible" commit onto the new UnifiedImporterForm-based DOCX form.

Made-with: Cursor
Luke-Bilhorn force-pushed the 782-improve-docx-importing-with-recursive-text-splitting-to-avoid-excessively-large-cells branch from 473140b to f375caf on April 21, 2026 04:13
Luke-Bilhorn and others added 6 commits April 29, 2026 14:49
…xt-splitting-to-avoid-excessively-large-cells
…xt-splitting-to-avoid-excessively-large-cells
…xt-splitting-to-avoid-excessively-large-cells
When a BCP-47 locale is supplied (e.g. derived from <w:lang> on DOCX
runs), the recursive splitter now sources sentence (L1) and word (L3)
split candidates from Intl.Segmenter, falling back to the existing
regex when the locale is missing or unsupported. Sub-sentence stops
(L2) remain regex-only because Intl has no clause granularity.
Thresholds, midpoint preference, and min-side guard are unchanged.
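
The locale-aware path with a regex fallback could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `makeSegmenter`, `sentenceCandidates`, and the `SENTENCE_END` regex are hypothetical, the regex is a simplified stand-in for the existing one, and the shape given to `SegmenterHandle` here is a guess. `Intl.Segmenter` is reached through a loosely typed lookup so the file still compiles when the TypeScript lib lacks its typings.

```typescript
// Opaque handle: only the one method this sketch needs is declared.
interface SegmenterHandle {
  segment(input: string): Iterable<{ segment: string; index: number }>;
}

// Returns a segmenter for the locale, or null when the locale is missing,
// malformed, or not supported by the runtime.
function makeSegmenter(
  locale: string | undefined,
  granularity: "sentence" | "word"
): SegmenterHandle | null {
  const ctor = (Intl as any).Segmenter;
  if (!locale || typeof ctor !== "function") return null;
  try {
    // supportedLocalesOf throws RangeError on a malformed tag and returns
    // an empty array when the runtime has no data for the locale.
    if (ctor.supportedLocalesOf(locale).length === 0) return null;
    return new ctor(locale, { granularity }) as SegmenterHandle;
  } catch {
    return null;
  }
}

// Simplified stand-in for the existing sentence regex (an assumption).
const SENTENCE_END = /[.!?]+\s+/;

// Returns interior offsets at which the text may be cut between sentences.
function sentenceCandidates(text: string, locale?: string): number[] {
  const cuts: number[] = [];
  const segmenter = makeSegmenter(locale, "sentence");
  if (segmenter) {
    for (const s of segmenter.segment(text)) {
      // Every segment start after the first is a sentence boundary.
      if (s.index > 0) cuts.push(s.index);
    }
    return cuts;
  }
  const re = new RegExp(SENTENCE_END.source, "g");
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const cut = m.index + m[0].length;
    if (cut < text.length) cuts.push(cut);
  }
  return cuts;
}
```

Word (L3) candidates could be gathered the same way with `granularity: "word"`. Sub-sentence (L2) stops would stay on the regex path, as the commit message notes.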

The DOCX importer picks each paragraph's dominant run-level lang
(weighted by content length) and passes it through, giving correct
word boundaries for scripts without space-separated words (Thai,
Khmer, Lao, Myanmar, CJK) and smarter sentence detection.

Uses an opaque SegmenterHandle type rather than Intl.Segmenter
directly, so the file compiles under both the webview tsconfig
(Vite) and the root tsconfig (webpack/ts-loader) which lacks
ES2022.Intl.

Replaces the previous w:lang sniffing in the DOCX importer with the
project's source language tag (from metadata.json), threaded through
the wizard context. This gives Intl.Segmenter a stable, predictable
locale for L1/L3 splitting that does not depend on whether the source
DOCX (e.g. Google Docs exports) carries w:lang attributes.

- Provider now reads metadata.json and includes sourceLanguageTag in
  the projectInventory message.
- WizardContext / WizardState carry sourceLanguageTag through to
  importer components.
- DocxImporterForm reads it from wizardContext and passes it as a
  locale option to parseFile.
- parseFile / createCellsFromDocx accept the locale and forward it to
  splitTextIntoRanges; pickParagraphLocale (the w:lang tally helper)
  is removed.
…xt-splitting-to-avoid-excessively-large-cells

Development

Successfully merging this pull request may close these issues.

Improve docx importing with recursive text splitting to avoid excessively large cells