richard950825-sys/Text_Splitter

novel-splitter (Step 9: TTS export)

A minimal Python tool for converting Chinese novel TXT files into segmented JSONL.

Current status: Step 9 TTS script/roles export.

Requirements

  • Python >= 3.10

Install (virtual environment)

python -m venv .venv
.venv\Scripts\activate
pip install -e .

The activation command above is for Windows. On macOS or Linux, run source .venv/bin/activate instead.

For tests:

pip install -e ".[dev]"
pytest -q

Run (CLI)

python -m novel_splitter.cli --in INPUT.txt --out OUTPUT.jsonl

Output (Step 9 structure + attribution + registry + aggregation + TTS)

The CLI writes JSONL segments in source order:

  • type is "dialogue" for content inside Chinese quotes, otherwise "narration"
  • text is the exact substring from the raw input (quotes excluded for dialogue)
  • span is [start, end) character offsets into the raw input
  • seg_id increments as seg_000001, seg_000002, ...
  • doc_id defaults to input file name without extension
  • chapter_id / paragraph_id are filled by default using simple rules; with --no-structure they stay null
  • speaker is filled for dialogue segments using simple rules; unmatched dialogue gets the UNKNOWN speaker (char_id="UNKNOWN", name=null, confidence=0.0, method="sieve")
  • NAME extraction supports names of 1 to 8 Chinese characters, with common honorific suffixes
  • Speaker reuse: within a paragraph, dialogue without its own attribution takes the most recently attributed speaker (confidence=0.6)
  • For attributed dialogue, speaker.char_id is mapped to stable IDs: CHAR_0001, CHAR_0002, ...
  • meta is {} unless an unmatched quote forces fallback narration
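
The field contract above can be checked mechanically. The following is a minimal sketch, not part of novel_splitter: the helper name check_segments is hypothetical, and it assumes that span offsets index Python str characters (code points) and that each JSONL line carries the fields listed above.

```python
import json


def check_segments(raw: str, jsonl_text: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    lines = [ln for ln in jsonl_text.splitlines() if ln.strip()]
    for index, line in enumerate(lines, start=1):
        seg = json.loads(line)
        start, end = seg["span"]
        # span is half-open, so raw[start:end] must reproduce text exactly.
        if raw[start:end] != seg["text"]:
            problems.append(f"{seg['seg_id']}: text does not match span")
        # seg_id is expected to count up as seg_000001, seg_000002, ...
        if seg["seg_id"] != f"seg_{index:06d}":
            problems.append(f"line {index}: unexpected seg_id {seg['seg_id']}")
        if seg["type"] not in ("dialogue", "narration"):
            problems.append(f"{seg['seg_id']}: unknown type {seg['type']}")
    return problems
```

Running this against the raw input and the generated JSONL is a cheap regression check after changing any splitting rule.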

Supported quotes: “ ” (required), ‘ ’, 「 」, 『 』 (optional). Attribution rules apply only to dialogue in “ ”. The SAY_VERBS list is centralized in constants for easy extension.
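
To make the dialogue/narration split concrete, here is a rough sketch for the “ ” pair only. It is an illustration under assumptions, not the tool's actual code: the function name split_quotes is hypothetical, quote characters are left out of every segment, and an unmatched opening quote simply turns the remainder into narration (the real tool also records something in meta, which is not reproduced here).

```python
def split_quotes(raw: str) -> list[dict]:
    """Split raw text into narration and dialogue segments around “ ” pairs."""
    segments = []

    def emit(kind: str, start: int, end: int) -> None:
        # Skip empty spans so that seg_id numbering stays dense.
        if start < end:
            segments.append({
                "seg_id": f"seg_{len(segments) + 1:06d}",
                "type": kind,
                "text": raw[start:end],
                "span": [start, end],
            })

    pos = 0
    while pos < len(raw):
        open_at = raw.find("“", pos)
        close_at = raw.find("”", open_at + 1) if open_at != -1 else -1
        if open_at == -1 or close_at == -1:
            # No further complete quote pair: treat the rest as narration.
            emit("narration", pos, len(raw))
            break
        emit("narration", pos, open_at)
        emit("dialogue", open_at + 1, close_at)  # quotes excluded
        pos = close_at + 1
    return segments
```

Because every segment stores its span, text can always be re-derived from the raw input, which keeps the output verifiable.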

characters.json

The CLI also writes a character registry file:

  • Default path: <doc_id>__characters.json next to the JSONL output
  • Override with --chars-out

dialogues_by_character.json

The CLI writes an aggregated dialogue export:

  • Default path: <doc_id>__dialogues_by_character.json next to the JSONL output
  • Override with --dialogues-out
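
The aggregation can be thought of as grouping dialogue segments by speaker.char_id while preserving source order. This is a sketch under assumptions: the function name group_dialogue_by_character and the per-entry fields (seg_id, text) are illustrative, and the real file layout may differ.

```python
from collections import defaultdict


def group_dialogue_by_character(segments: list[dict]) -> dict[str, list[dict]]:
    """Group dialogue segments by speaker char_id, keeping source order."""
    grouped = defaultdict(list)
    for seg in segments:
        if seg["type"] != "dialogue":
            continue  # narration is not part of the per-character export
        char_id = seg["speaker"]["char_id"]
        grouped[char_id].append({"seg_id": seg["seg_id"], "text": seg["text"]})
    return dict(grouped)
```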

TTS exports

  • Script: <doc_id>__script.txt (override with --script-out)
  • Roles directory: <doc_id>__roles/ (override with --roles-dir, disable with --no-roles)
    • NARRATOR.txt for narration
    • CHAR_0001.txt etc. for each character
    • UNKNOWN.txt for unassigned dialogue
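
The default output names all derive from doc_id. The sketch below shows one plausible derivation using pathlib. The function default_outputs is hypothetical, and it assumes that the script and roles outputs sit next to the JSONL file like the two JSON exports do, which the list above does not state explicitly.

```python
from pathlib import Path


def default_outputs(input_path: str, jsonl_out: str) -> dict[str, Path]:
    """Derive default export paths from the input name and JSONL location."""
    doc_id = Path(input_path).stem      # input file name without extension
    out_dir = Path(jsonl_out).parent    # exports sit next to the JSONL output
    return {
        "characters": out_dir / f"{doc_id}__characters.json",
        "dialogues": out_dir / f"{doc_id}__dialogues_by_character.json",
        "script": out_dir / f"{doc_id}__script.txt",
        "roles": out_dir / f"{doc_id}__roles",
    }
```

For example, an input of books/novel.txt with output out/novel.jsonl would yield out/novel__characters.json under these assumptions.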

Structure flags

  • --with-structure (default): fill chapter_id/paragraph_id
  • --no-structure: leave chapter_id/paragraph_id as null

Chapter detection rule: line matches ^第[一二三四五六七八九十百千0-9]+章.*$
Paragraphs are based on raw \n line splits (empty lines count).
