Improve docx importing with recursive text splitting to avoid excessively large cells #782

@ryderwishart

Description

When importing a .docx file, block-level styles (e.g., paragraphs) determine the main cell divisions. Some paragraphs are very long, however, which makes validation and drafting unwieldy and unclear, and dilutes our main value proposition.

Therefore, we can implement recursive text splitting that keeps subdividing a cell until it is no longer than a target cell length (or cannot be split any further).

Basically, split on paragraph breaks first. If a resulting piece is still longer than the target cell length, split it by newlines, then by sentence-ending punctuation (., !, ?), then by minor stops (e.g., commas and semicolons), then by whitespace (or some variation of this pecking order).

This will ensure we extract usefully sized validation units.
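
The splitting order described above could be sketched roughly as follows. This is a minimal illustration, not the importer's actual code: the function name `split_recursive`, the `SEPARATORS` list, the regex patterns, and the default `target_len` of 200 characters are all assumptions made for the example. Python is used only for brevity.

```python
import re

# Hypothetical separator hierarchy, coarsest first.
# The exact patterns are assumptions, not taken from the importer.
SEPARATORS = [
    r"\n{2,}",          # paragraph breaks (blank lines)
    r"\n",              # single newlines
    r"(?<=[.!?])\s+",   # whitespace following sentence-ending punctuation
    r"(?<=[,;:])\s+",   # whitespace following minor stops
    r"\s+",             # any remaining whitespace
]


def split_recursive(text, target_len=200, level=0):
    """Split text into cells, descending the separator hierarchy
    only for pieces that are still longer than target_len."""
    text = text.strip()
    if not text:
        return []
    # Short enough, or no finer separator left to try: keep as one cell.
    if len(text) <= target_len or level >= len(SEPARATORS):
        return [text]
    cells = []
    for piece in re.split(SEPARATORS[level], text):
        cells.extend(split_recursive(piece, target_len, level + 1))
    return cells


sample = "Short paragraph.\n\nThis is a long sentence, with a minor stop. Another one!"
for cell in split_recursive(sample, target_len=30):
    print(repr(cell))
```

Note that this sketch splits at every occurrence of a separator once a piece is too long, so it can produce cells much shorter than the target. Merging adjacent short pieces back together up to the target length would be one possible refinement. A single token longer than the target (with no whitespace) is left as one oversized cell.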
