Replace hand-rolled sentence splitter with LangChain's RecursiveCharacterTextSplitter by alexkroman · Pull Request #244 · AssemblyAI/cli

alexkroman · 2026-06-18T17:47:40Z

Summary

Replaced the custom sentence-splitting and chunking logic in aai_cli/tts/text.py with LangChain's RecursiveCharacterTextSplitter, simplifying the implementation while maintaining the same behavior and guarantees.

Changes

Removed custom splitting logic: Deleted split_sentences() and _bounded() functions that manually parsed sentence terminators and sliced oversized text.
Adopted LangChain's RecursiveCharacterTextSplitter: Configured with a separator hierarchy ([". ", "! ", "? ", "\n\n", "\n", " ", ""]) that prefers sentence boundaries and falls back to word/character splits only when necessary, with keep_separator="end" to preserve punctuation for prosody.
Simplified chunk_text(): Reduced from ~30 lines of greedy packing logic to a 6-line wrapper around the splitter.
Updated tests: Removed tests for the deleted split_sentences() function; kept and refined tests for chunk_text() to verify the same invariants (no mid-sentence breaks unless oversized, no text loss, budget compliance).
Added dependency: langchain-text-splitters>=1.0.0 to pyproject.toml.
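
As a rough sketch of what such a wrapper might look like (this is not the PR's actual code): the separator list and keep_separator="end" are taken from the description above, while the max_chars parameter name, chunk_overlap=0, and the lazy import are assumptions made for illustration.

```python
# Separator hierarchy quoted from the PR description.
SEPARATORS = [". ", "! ", "? ", "\n\n", "\n", " ", ""]


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars characters (illustrative sketch)."""
    # Imported lazily (an assumption, not from the PR) so that loading this
    # module does not itself require langchain-text-splitters.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        separators=SEPARATORS,
        chunk_size=max_chars,   # per-frame character budget
        chunk_overlap=0,        # assumption: chunks must not repeat text
        keep_separator="end",   # terminator stays with the preceding chunk
    )
    return splitter.split_text(text)
```

The splitter's default chunk_overlap is larger than a typical small budget, so it is set to 0 explicitly here. Whitespace stripping is left at the library default in this sketch; the PR description does not say how the real implementation handles it.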

Implementation Details

The new approach delegates recursive splitting to a widely used library instead of maintaining custom parsing logic. The separator list sets this order of preference:

Sentence terminators followed by spaces are the preferred break point
Paragraph and line breaks are secondary
Word boundaries are tertiary
Character-level splits only occur for blobs with no natural boundaries (e.g., PDFs with no punctuation)

The keep_separator="end" parameter keeps each terminator with the preceding chunk, preserving the punctuation the TTS model needs for prosody. All existing test cases pass with the new implementation, which indicates that behavior is preserved for the cases the suite covers.
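
The fallback order can be illustrated with a small pure-Python stand-in. This is a simplified sketch of the idea, not LangChain's implementation: the split_recursive name and its budget parameter are hypothetical, and unlike the library it never strips whitespace, so the chunks rejoin to the exact input.

```python
SEPARATORS = [". ", "! ", "? ", "\n\n", "\n", " ", ""]


def split_recursive(text: str, separators: list[str], budget: int) -> list[str]:
    """Try separators in order; recurse into any piece that is still too long."""
    if len(text) <= budget:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width character slices.
        return [text[i:i + budget] for i in range(0, len(text), budget)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return split_recursive(text, rest, budget)

    # Split on the separator and keep it at the end of the preceding piece.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > budget:
            # Flush what is packed so far, then fall back to finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_recursive(piece, rest, budget))
        elif len(current) + len(piece) <= budget:
            current += piece  # greedily pack pieces into the current chunk
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_recursive("Pi is 3.14. Nice.", SEPARATORS, 12))
# ['Pi is 3.14. ', 'Nice.']
```

The period inside 3.14 is not followed by a space, so it never matches the ". " separator and the number stays intact.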

https://claude.ai/code/session_01He2iJSWkEB5U3ZhxyAFnRc

Replace the hand-rolled sentence scanner + greedy packer in tts/text.py with LangChain's RecursiveCharacterTextSplitter, configured with sentence-terminator separators so chunks stay sentence-aligned and within the per-frame char budget. Behavior is preserved across the existing cases (sentence packing, mid-number periods, oversized-blob slicing, lossless rejoin); the split_sentences helper is dropped since chunk_text was its only caller. Adds the langchain-text-splitters dependency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01He2iJSWkEB5U3ZhxyAFnRc

alexkroman enabled auto-merge June 18, 2026 17:47

alexkroman force-pushed the claude/zen-feynman-q64cxo branch from cd9749a to 7ba1839 Compare June 18, 2026 18:43

alexkroman added this pull request to the merge queue Jun 18, 2026

Merged via the queue into main with commit 26b9f24 Jun 18, 2026
20 checks passed

alexkroman deleted the claude/zen-feynman-q64cxo branch June 18, 2026 18:57

alexkroman merged 1 commit into main from claude/zen-feynman-q64cxo
