Replace hand-rolled sentence splitter with LangChain's RecursiveCharacterTextSplitter #244

Merged: alexkroman merged 1 commit into main from claude/zen-feynman-q64cxo on Jun 18, 2026

@alexkroman (Collaborator)

Summary

Replaced the custom sentence-splitting and chunking logic in aai_cli/tts/text.py with LangChain's RecursiveCharacterTextSplitter, simplifying the implementation while maintaining the same behavior and guarantees.

Changes

  • Removed custom splitting logic: Deleted the split_sentences() and _bounded() functions, which manually parsed sentence terminators and sliced oversized text.
  • Adopted LangChain's RecursiveCharacterTextSplitter: Configured with a separator hierarchy ([". ", "! ", "? ", "\n\n", "\n", " ", ""]) that prefers sentence boundaries and falls back to word/character splits only when necessary, with keep_separator="end" to preserve punctuation for prosody.
  • Simplified chunk_text(): Reduced from ~30 lines of greedy packing logic to a 6-line wrapper around the splitter.
  • Updated tests: Removed tests for the deleted split_sentences() function; kept and refined tests for chunk_text() to verify the same invariants (no mid-sentence breaks unless oversized, no text loss, budget compliance).
  • Added dependency: langchain-text-splitters>=1.0.0 to pyproject.toml.
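
As a rough illustration of the wrapper described above, a minimal sketch might look like the following. The separator list and keep_separator="end" come from this PR description; the MAX_CHARS value, the function signature, and chunk_overlap=0 are assumptions, since the actual contents of aai_cli/tts/text.py are not shown here.

```python
# Separator hierarchy quoted from the PR description, most preferred first.
SEPARATORS = [". ", "! ", "? ", "\n\n", "\n", " ", ""]

# Hypothetical per-frame character budget; the real value is not given in the PR.
MAX_CHARS = 200


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Imported lazily so this sketch loads even where the dependency
    # (langchain-text-splitters) is not installed.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        separators=SEPARATORS,
        keep_separator="end",  # terminators stay with the preceding chunk
        chunk_size=max_chars,
        chunk_overlap=0,  # assumption: TTS chunks should not repeat text
    )
    return splitter.split_text(text)
```

Note that the splitter strips leading and trailing whitespace from chunks by default (strip_whitespace=True), so a lossless-rejoin check would probably need to compare modulo whitespace or disable that option. Which of these the merged code does is not stated in the PR.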

Implementation Details

The new approach delegates recursive splitting to a widely used library instead of maintaining custom parsing logic. The separator list sets the following order of preference:

  • Sentence terminators followed by a space are the preferred break points
  • Paragraph and line breaks are secondary
  • Word boundaries are tertiary
  • Character-level splits occur only for blobs with no natural boundaries (e.g., text extracted from PDFs with no punctuation)

The keep_separator="end" parameter keeps each terminator with the preceding chunk, preserving the punctuation the TTS model needs for prosody. The retained chunk_text() tests pass with the new implementation, which supports behavioral equivalence for the cases they cover.
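
To make the separator hierarchy and the keep-at-end behavior concrete, here is a simplified, standard-library-only illustration. It is not LangChain's implementation and differs from it in detail (for example, it never strips whitespace); the names split_keep_end and recursive_chunks are hypothetical. It only sketches the idea of trying coarse separators first and falling back to finer ones for oversized pieces.

```python
# Separator hierarchy from the PR description, most preferred first.
SEPARATORS = [". ", "! ", "? ", "\n\n", "\n", " ", ""]


def split_keep_end(text: str, sep: str) -> list[str]:
    """Split on sep, leaving each separator attached to the piece before it."""
    if sep == "":
        return list(text)  # last resort: individual characters
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def recursive_chunks(text: str, budget: int, separators: list[str]) -> list[str]:
    """Greedily pack pieces into chunks of at most budget characters."""
    if len(text) <= budget:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in split_keep_end(text, sep):
        if len(piece) > budget:
            # Flush what we have, then retry this piece with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_chunks(piece, budget, finer))
        elif len(current) + len(piece) <= budget:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Pi is 3.14 today. It works! Really? Yes."
chunks = recursive_chunks(text, 20, SEPARATORS)
print(chunks)
# ['Pi is 3.14 today. ', 'It works! ', 'Really? Yes.']
```

The period inside "3.14" is not a break point because it is not followed by a space, and joining the chunks reproduces the input exactly, which mirrors the mid-number and lossless-rejoin invariants the tests are said to check.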

https://claude.ai/code/session_01He2iJSWkEB5U3ZhxyAFnRc

alexkroman enabled auto-merge June 18, 2026 17:47

Replace the hand-rolled sentence scanner + greedy packer in tts/text.py
with LangChain's RecursiveCharacterTextSplitter, configured with
sentence-terminator separators so chunks stay sentence-aligned and within
the per-frame char budget. Behavior is preserved across the existing cases
(sentence packing, mid-number periods, oversized-blob slicing, lossless
rejoin); the split_sentences helper is dropped since chunk_text was its only
caller. Adds the langchain-text-splitters dependency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01He2iJSWkEB5U3ZhxyAFnRc

alexkroman force-pushed the claude/zen-feynman-q64cxo branch from cd9749a to 7ba1839 June 18, 2026 18:43
alexkroman added this pull request to the merge queue Jun 18, 2026
Merged via the queue into main with commit 26b9f24 Jun 18, 2026
20 checks passed
alexkroman deleted the claude/zen-feynman-q64cxo branch June 18, 2026 18:57