novel-splitter (Step 9: TTS export)

A minimal Python tool for converting Chinese novel TXT files into segmented JSONL.

Current status: Step 9 TTS script/roles export.

Requirements

Python >= 3.10

Install (virtual environment)

python -m venv .venv
.venv\Scripts\activate
pip install -e .

For tests:

pip install -e .[dev]
pytest -q

Run (CLI)

python -m novel_splitter.cli --in INPUT.txt --out OUTPUT.jsonl

Output (Step 9 structure + attribution + registry + aggregation + TTS)

The CLI writes JSONL segments in source order:

type is "dialogue" for content inside Chinese quotes, otherwise "narration"
text is the exact substring from the raw input (quotes excluded for dialogue)
span is [start, end) character offsets into the raw input
seg_id increments as seg_000001, seg_000002, ...
doc_id defaults to input file name without extension
chapter_id / paragraph_id are filled by default using simple rules; with --no-structure they are null
speaker is filled for dialogue segments using simple rules; unmatched dialogue falls back to UNKNOWN (char_id="UNKNOWN", name=null, confidence=0.0, method="sieve")
NAME extraction supports names of 1 to 8 Chinese characters, with common honorific suffixes
Speaker reuse: within the same paragraph, dialogue reuses the most recently attributed speaker (confidence=0.6)
Dialogue speaker.char_id is mapped to stable IDs: CHAR_0001, CHAR_0002, ...
meta is {} unless an unmatched quote forces fallback narration
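
The type, text, span and seg_id fields described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name split_segments is hypothetical, only the “ ” pair is handled, and an unmatched opening quote simply falls back to narration for the rest of the text.

```python
OPEN_Q, CLOSE_Q = "“", "”"

def split_segments(raw: str, doc_id: str = "doc") -> list[dict]:
    """Split raw text into narration/dialogue segments with [start, end) spans."""
    segments: list[dict] = []

    def emit(kind: str, start: int, end: int) -> None:
        if start >= end:
            return  # skip empty pieces
        segments.append({
            "seg_id": f"seg_{len(segments) + 1:06d}",
            "doc_id": doc_id,
            "type": kind,
            "text": raw[start:end],  # exact substring of the raw input
            "span": [start, end],
        })

    pos = 0
    while pos < len(raw):
        open_at = raw.find(OPEN_Q, pos)
        close_at = raw.find(CLOSE_Q, open_at + 1) if open_at != -1 else -1
        if open_at == -1 or close_at == -1:
            # No complete quote pair left: treat the remainder as narration.
            emit("narration", pos, len(raw))
            break
        emit("narration", pos, open_at)
        emit("dialogue", open_at + 1, close_at)  # quote characters excluded
        pos = close_at + 1
    return segments
```

Because every text is raw[start:end], a consumer can always check a segment against the original file by slicing with its span.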

Supported quotes: “ ” (required), ‘ ’, 「」, 『』 (optional). Attribution rules only use “ ”. SAY_VERBS are centralized in constants for easy extension.

characters.json

The CLI also writes a character registry file:

Default path: <doc_id>__characters.json next to the JSONL output
Override with --chars-out
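
The default path and the stable CHAR_0001-style IDs could be produced along these lines. This is a sketch under assumptions: default_chars_path and CharacterRegistry are hypothetical names, IDs are handed out in order of first appearance, and the actual schema of characters.json is not shown here.

```python
from pathlib import Path

def default_chars_path(out_jsonl: str, doc_id: str) -> Path:
    """Place <doc_id>__characters.json next to the JSONL output."""
    return Path(out_jsonl).parent / f"{doc_id}__characters.json"

class CharacterRegistry:
    """Map speaker names to stable IDs in order of first appearance."""

    def __init__(self) -> None:
        self._ids: dict[str, str] = {}

    def get_id(self, name: str) -> str:
        if name not in self._ids:
            self._ids[name] = f"CHAR_{len(self._ids) + 1:04d}"
        return self._ids[name]
```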

dialogues_by_character.json

The CLI writes an aggregated dialogue export:

Default path: <doc_id>__dialogues_by_character.json next to the JSONL output
Override with --dialogues-out

TTS exports

Script: <doc_id>__script.txt (override with --script-out)
Roles directory: <doc_id>__roles/ (override with --roles-dir, disable with --no-roles)
- NARRATOR.txt for narration
- CHAR_0001.txt etc. for each character
- UNKNOWN.txt for unassigned dialogue
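
The roles layout above amounts to grouping segment text by role. A minimal sketch, assuming segments shaped like the JSONL records described earlier (the function name group_lines_by_role is hypothetical, and the real exporter's line format may differ):

```python
from collections import defaultdict

def group_lines_by_role(segments: list[dict]) -> dict[str, list[str]]:
    """Group segment text by role file name, preserving source order."""
    roles: dict[str, list[str]] = defaultdict(list)
    for seg in segments:
        if seg["type"] == "narration":
            role = "NARRATOR"
        else:
            speaker = seg.get("speaker") or {}
            # Dialogue without a resolved speaker goes to UNKNOWN.
            role = speaker.get("char_id") or "UNKNOWN"
        roles[f"{role}.txt"].append(seg["text"])
    return dict(roles)
```

Each key of the returned mapping corresponds to one file in the roles directory, and each value holds that file's lines.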

Structure flags

--with-structure (default): fill chapter_id/paragraph_id
--no-structure: leave chapter_id/paragraph_id as null

Chapter detection rule: line matches ^第[一二三四五六七八九十百千0-9]+章.*$
Paragraphs are raw lines split on \n; empty lines are counted too.
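
The two rules can be sketched together as follows. The regular expression is the one stated above; the function assign_structure and its integer indices are assumptions for illustration, since the exact format of chapter_id and paragraph_id is not specified here.

```python
import re

CHAPTER_RE = re.compile(r"^第[一二三四五六七八九十百千0-9]+章.*$")

def assign_structure(raw: str) -> list[tuple[int, int, str]]:
    """Return (chapter_index, paragraph_index, line) for every raw line.

    chapter_index is 0 before the first chapter heading; paragraph_index
    counts every line produced by splitting on "\n", including empty ones.
    """
    rows = []
    chapter = 0
    for paragraph, line in enumerate(raw.split("\n")):
        if CHAPTER_RE.match(line):
            chapter += 1  # a heading line starts a new chapter
        rows.append((chapter, paragraph, line))
    return rows
```

Headings that do not follow the 第…章 form, such as 序章, do not match the pattern and therefore do not start a new chapter.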
