A minimal Python tool for converting Chinese novel TXT files into segmented JSONL.
Current status: Step 9 TTS script/roles export.
- Python >= 3.10
python -m venv .venv
.venv\Scripts\activate
pip install -e .
For tests:
pip install -e .[dev]
pytest -q
Usage:
python -m novel_splitter.cli --in INPUT.txt --out OUTPUT.jsonl
The CLI writes JSONL segments in source order:
- type is "dialogue" for content inside Chinese quotes, otherwise "narration"
- text is the exact substring from the raw input (quotes excluded for dialogue)
- span is [start, end) character offsets into the raw input
- seg_id increments as seg_000001, seg_000002, ...
- doc_id defaults to the input file name without extension
- chapter_id/paragraph_id are filled by default using simple rules (null with --no-structure)
- speaker is filled for dialogue segments using simple rules; unmatched dialogue uses UNKNOWN (char_id="UNKNOWN", name=null, confidence=0.0, method="sieve")
  - NAME extraction supports 1 to 8 Chinese characters with common honorific suffixes
  - Speaker reuse in the same paragraph reuses the most recent attributed speaker (confidence=0.6)
  - Dialogue speaker.char_id is mapped to stable IDs: CHAR_0001, CHAR_0002, ...
- meta is {} unless an unmatched quote forces fallback narration
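The relationship between text and span can be verified with a few lines of Python. This is a minimal sketch, assuming each JSONL line is an object carrying the seg_id, text, and span fields described above; check_spans is a hypothetical helper, not part of the tool.

```python
import json


def check_spans(raw: str, jsonl_lines: list[str]) -> list[str]:
    """Return the seg_ids whose text does not equal raw[start:end]."""
    bad = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        seg = json.loads(line)
        start, end = seg["span"]  # half-open [start, end) character offsets
        if raw[start:end] != seg["text"]:
            bad.append(seg["seg_id"])
    return bad
```

An empty result means every segment can be recovered from the raw input by slicing, which is what "exact substring" implies.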
Supported quotes: “ ” (required), ‘ ’, 「 」, 『 』 (optional).
Attribution rules only use “ ”. SAY_VERBS are centralized in constants for easy extension.
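The dialogue/narration split for “ ” quotes can be illustrated as follows. This is a sketch under stated assumptions, not the tool's actual implementation: split_segments is a hypothetical name, only the required “ ” pair is handled, and an unmatched opening quote simply turns the remaining text into narration without recording anything in meta.

```python
def split_segments(raw: str) -> list[dict]:
    """Split raw text into narration and “ ” dialogue segments, in source order."""
    segments = []

    def emit(kind: str, start: int, end: int) -> None:
        if start < end:  # skip empty segments
            segments.append({
                "seg_id": f"seg_{len(segments) + 1:06d}",
                "type": kind,
                "text": raw[start:end],
                "span": [start, end],
            })

    pos = 0
    while pos < len(raw):
        open_idx = raw.find("“", pos)
        close_idx = raw.find("”", open_idx + 1) if open_idx != -1 else -1
        if close_idx == -1:
            # No complete quote pair left: the rest is narration (assumed fallback).
            emit("narration", pos, len(raw))
            break
        emit("narration", pos, open_idx)
        emit("dialogue", open_idx + 1, close_idx)  # quote marks excluded
        pos = close_idx + 1
    return segments
```

Because every segment is produced by slicing raw, text and span stay consistent by construction.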
The CLI also writes a character registry file:
- Default path: <doc_id>__characters.json next to the JSONL output
- Override with --chars-out
The CLI writes an aggregated dialogue export:
- Default path: <doc_id>__dialogues_by_character.json next to the JSONL output
- Override with --dialogues-out
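A comparable aggregation can be rebuilt from the JSONL output. This is a minimal sketch, assuming dialogue segments carry speaker.char_id as described earlier; group_dialogues is a hypothetical helper, and the layout of the tool's own export file may differ.

```python
import json
from collections import defaultdict


def group_dialogues(jsonl_lines: list[str]) -> dict[str, list[str]]:
    """Map each speaker char_id to its dialogue texts, in source order."""
    grouped = defaultdict(list)
    for line in jsonl_lines:
        if not line.strip():
            continue
        seg = json.loads(line)
        if seg.get("type") != "dialogue":
            continue  # narration is not part of the dialogue export
        speaker = seg.get("speaker") or {}
        grouped[speaker.get("char_id", "UNKNOWN")].append(seg["text"])
    return dict(grouped)
```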
The CLI also writes the TTS script and roles export:
- Script: <doc_id>__script.txt (override with --script-out)
- Roles directory: <doc_id>__roles/ (override with --roles-dir, disable with --no-roles)
  - NARRATOR.txt for narration
  - CHAR_0001.txt etc. for each character
  - UNKNOWN.txt for unassigned dialogue
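The default output locations can be summarized in code. This is a sketch only: default_export_paths is a hypothetical helper, and placing the script and roles directory next to the JSONL output is an assumption (the text states that location only for the two JSON files).

```python
from pathlib import Path


def default_export_paths(in_path: str, out_path: str) -> dict[str, Path]:
    """Derive default sidecar paths from the input and JSONL output paths."""
    doc_id = Path(in_path).stem        # input file name without extension
    out_dir = Path(out_path).parent    # directory holding the JSONL output
    return {
        "chars": out_dir / f"{doc_id}__characters.json",
        "dialogues": out_dir / f"{doc_id}__dialogues_by_character.json",
        "script": out_dir / f"{doc_id}__script.txt",   # location assumed
        "roles": out_dir / f"{doc_id}__roles",         # location assumed
    }
```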
- --with-structure (default): fill chapter_id/paragraph_id
- --no-structure: leave chapter_id/paragraph_id as null
Chapter detection rule: line matches ^第[一二三四五六七八九十百千0-9]+章.*$
Paragraphs are based on raw \n line splits (empty lines count).