When importing a .docx file, block styles (e.g., paragraphs) provide the main cell divisions. Some paragraphs can be very long, however, making validation and drafting unwieldy and unclear, diluting our main value proposition.
Therefore, we can implement recursive text splitting that keeps subdividing a cell until it is at or below a target cell length.
First, split on paragraph boundaries. If a resulting cell is still longer than the target cell length, split it on newlines, then on sentence-ending punctuation (., !, ?), then on minor stops (such as commas and semicolons), then on whitespace (or some variation of this order of precedence).
This will ensure we extract usefully sized validation units.
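The splitting order described above could be sketched roughly as follows. This is a minimal illustration, not a settled design: the function names `split_into_cells` and `_split`, the `SEPARATORS` list, the choice of `,`, `;`, `:` as minor stops, and measuring length in characters are all assumptions. It takes paragraphs that have already been extracted from the .docx file as plain strings, keeps each paragraph as its own cell when it fits, and only descends to a finer separator for pieces that are still too long. Adjacent small pieces are packed back together so that cells do not collapse into single words.

```python
import re

# Hypothetical separator hierarchy, coarsest first. Each pattern matches the
# gap *after* which a cut is allowed, so punctuation stays with its text.
SEPARATORS = [
    r"\n+",              # line breaks inside a paragraph
    r"(?<=[.!?])\s+",    # whitespace after sentence-ending punctuation
    r"(?<=[,;:])\s+",    # whitespace after minor stops (assumed set)
    r"\s+",              # any whitespace
]


def _split(text, target_len, level):
    """Split text into stripped pieces of at most target_len characters where possible."""
    stripped = text.strip()
    if len(stripped) <= target_len or level >= len(SEPARATORS):
        # Short enough, or nothing finer to split on (e.g. one very long word).
        return [stripped] if stripped else []

    # Cut after every separator match, slicing the original text so no
    # characters are lost.
    pieces, start = [], 0
    for match in re.finditer(SEPARATORS[level], text):
        pieces.append(text[start:match.end()])
        start = match.end()
    pieces.append(text[start:])

    cells, buf = [], ""
    for piece in pieces:
        if len((buf + piece).strip()) <= target_len:
            buf += piece  # greedily pack small neighbours into one cell
            continue
        if buf.strip():
            cells.append(buf.strip())
        buf = ""
        if len(piece.strip()) <= target_len:
            buf = piece
        else:
            # Still too long: retry this piece with the next, finer separator.
            cells.extend(_split(piece, target_len, level + 1))
    if buf.strip():
        cells.append(buf.strip())
    return cells


def split_into_cells(paragraphs, target_len):
    """Keep paragraph boundaries as cell boundaries; subdivide only long paragraphs."""
    cells = []
    for paragraph in paragraphs:
        cells.extend(_split(paragraph, target_len, 0))
    return cells


if __name__ == "__main__":
    demo = [
        "Short one.",
        "First sentence is here. Second sentence, with a clause, is longer! Third?",
    ]
    for cell in split_into_cells(demo, target_len=30):
        print(repr(cell))
```

A cell can still exceed the target when a single unbroken token is longer than the limit; the sketch leaves such tokens intact rather than cutting mid-word, which is one of several reasonable choices.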