ArnoXiang/PyReadAlongEPUB

PyReadAlongEPUB

Automates the tedious process of enabling read-along (media overlays) in EPUB 3.0 files. Takes an EPUB 2.0 or 3.0 file, generates a narration script, optionally generates TTS audio or accepts an existing recording, wraps text in span tags, aligns audio to text at the word or sentence level, creates SMIL files, and updates all metadata -- then repacks it into a valid read-along EPUB. Can also strip all read-along functionality to prepare an EPUB for translation.

Requirements

  • Python 3.10+ (standard library only, except for audio alignment, TTS, and the GUI, which need the packages below)
  • TTS narration (pick one, or bring your own audio):
    • Edge TTS (free, no API key): pip install edge-tts
    • ElevenLabs (highest quality): pip install elevenlabs + API key
    • OpenAI TTS: pip install openai + API key
  • WhisperX (for automatic word-level audio alignment):
    pip install whisperx torch torchaudio
    
  • OR Audacity label exports if you prefer manual timing
  • Optional: mutagen (pip install mutagen) or ffprobe (from FFmpeg) for audio duration detection
  • GUI: pip install PyQt6 (for the graphical interface)

GUI

A PyQt6 graphical interface is available for students and users who prefer not to use the command line.

pip install PyQt6
python readalong_gui.py

The GUI is organized into 6 tabs:

Tab | What it does
1. Prepare | Select an EPUB, set pages to exclude, unpack, and generate a narration script ({name}_script.html + .txt) in the working folder
2. Wrap | Choose word or sentence level and wrap text in span tags
3. Audio | Provide audio via TTS generation or an existing recording. TTS mode: pick an engine, browse 320+ voices in a dropdown, enter API key if needed. Existing mode: browse for a narration file. Both modes support optional background music.
4. Align | Two-step flow: Step A runs WhisperX to auto-align and exports a single combined Audacity label file for QA. Step B imports the corrected labels and generates SMIL files.
5. Finalize | Pick a highlight color with live preview, optionally replace the cover image, set the output filename, and pack the final EPUB. Includes a "Run Full Pipeline" button for one-click processing.
Restore | Strip all read-along functionality from an EPUB to prepare it for translation. Optionally remove audio files.

Features:

  • Drag & drop EPUB files onto the window
  • Auto-restore previous session when reloading an EPUB (exclude list, level, audio map, labels)
  • Narration script generation (HTML with formatting for dubbing talent + plain TXT for TTS)
  • Edge TTS voice dropdown with all 320+ voices (grouped by language, English first)
  • ElevenLabs voice loading via "Load Voices" button after entering API key
  • OpenAI voices pre-populated (alloy, echo, fable, nova, onyx, shimmer, etc.)
  • API key fields shown only when needed (hidden for free Edge TTS)
  • Color picker for the overlay highlight with live preview
  • Real-time log output showing each step's progress
  • WhisperX -> Audacity QA workflow with a single combined label file (one file to open in Audacity, contains all words for the entire book)
  • Replacement cover image picker for adding a play-button-equipped cover
  • Tappable cover automatically enabled (tap the cover image to start read-along)
  • "Run Full Pipeline" button to configure all tabs then run everything in one click
  • Dark theme matching the karaoke generator GUI

Folder layout

The tool keeps the unpacked EPUB directory clean -- all working files are stored in a sibling _working folder so external tools can safely zip the EPUB folder without picking up debris.

MyBook.epub                  (original)
MyBook/                      (unpacked EPUB -- only valid EPUB content)
MyBook_working/              (sibling working folder)
  MyBook_script.html           (narration script with formatting for dubbing talent)
  MyBook_script.txt            (plain text for TTS engines)
  audio_map.json               (page-to-audio mappings)
  audacity_labels/             (WhisperX export for QA in Audacity)
  .readalong_state.json        (pipeline state)

Quick Start (CLI)

Full pipeline with TTS narration (no audio needed!)

python readalong.py auto MyBook.epub \
  --narrate --engine edge \
  --exclude copyright.xhtml \
  --overlaycolor yellow

Sentence-level highlighting (for older readers)

python readalong.py auto MyBook.epub \
  --narrate --engine edge \
  --level sentence \
  --exclude cover.xhtml copyright.xhtml

Full pipeline with existing audio

python readalong.py auto MyBook.epub \
  --narration narration.mp3 \
  --backgroundmusic soundtrack.mp3 \
  --exclude copyright.xhtml \
  --overlaycolor yellow

Strip read-along for translation

python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_stripped.epub

Step-by-step pipeline

# 1. Unpack the EPUB and generate narration script
python readalong.py prepare MyBook.epub --exclude copyright.xhtml

# 2. Wrap all visible words in <span> tags with unique IDs
python readalong.py wrap MyBook --exclude copyright.xhtml

# 3. Generate narration audio from the text (or skip if you have audio)
python readalong.py narrate MyBook --engine edge

# 4. Generate word-level timestamps
python readalong.py align MyBook --method whisperx

# 5. Generate SMIL files from the timestamps
python readalong.py smil MyBook

# 6. Update the OPF, HTML namespaces, and CSS
python readalong.py finalize MyBook --overlaycolor yellow

# 7. Repack into an EPUB file
python readalong.py pack MyBook

Commands

prepare -- Unpack and generate script

python readalong.py prepare <epub_file> [options]

Unpacks the EPUB and analyzes its structure. Generates a narration script and audio map in the _working folder. If the EPUB already has audio, use --narration and --backgroundmusic to inject additional audio assets.

Option | Description
-o, --output | Output directory (default: same name as EPUB without extension)
--narration FILE | Narration audio file to copy into the EPUB and register in the OPF manifest
--backgroundmusic FILE | Background music file to inject
--exclude | XHTML files to exclude from the narration script

What it does:

  • Extracts the EPUB ZIP into a directory
  • Locates the OPF file via META-INF/container.xml
  • If --narration is provided: copies the file to OEBPS/audio/ and registers it in the OPF <manifest> with the correct media type
  • If --backgroundmusic is provided: copies the file, registers it, and injects an iBooks-compatible <audio> element into every XHTML <body>
  • Reports all XHTML files with word counts and whether they are already wrapped
  • Creates {name}_working/ folder with:
    • {name}_script.html -- formatted narration script preserving bold, italic, and headings from the EPUB. Designed to be opened in a browser and shared with dubbing talent.
    • {name}_script.txt -- plain text version for TTS engines
    • audio_map.json -- page-to-audio mappings (auto-detected matches pre-filled, empty values for pages without a match)

Supported audio formats: .mp3, .m4a, .mp4, .ogg, .oga, .wav, .webm, .aac, .flac
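
The OPF lookup described above can be sketched with the standard library. This is a minimal illustration, not the tool's actual code: the function name find_opf_path is hypothetical, and it assumes a standard container.xml that uses the OCF container namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the OCF container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub_path):
    """Return the OPF path (relative to the EPUB root) named in container.xml.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(epub_path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    # The first <rootfile> element points at the package document (the OPF).
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]
```

For a typical EPUB this returns something like OEBPS/content.opf; the directory containing that file is where the manifest-relative paths resolve.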

wrap -- Wrap words or sentences in span tags

python readalong.py wrap <epub_dir> [--exclude ...] [--level word|sentence]

Wraps content in each XHTML file with <span> tags that the SMIL files reference for highlighting.

Option | Description
--exclude | XHTML filenames to skip (e.g., copyright.xhtml)
--level | word (default) or sentence

Word level (default): Wraps every visible word with <span id="WN">word</span>, where N is a running number. Best for early readers / children's books where each word lights up individually.

  • Parses each XHTML body by splitting on HTML tags
  • Skips content inside <script>, <style>, and <audio> elements
  • Preserves all existing HTML structure (classes, attributes, nesting)
  • IDs reset to W1 on each page

Sentence level (--level sentence): Wraps the content of each <p> element with <span id="SN">...</span>, where N is a running number. Best for older readers where the entire sentence highlights as it is read.

  • Each <p> tag becomes one highlightable unit
  • Paragraphs with no visible text (e.g., image-only) are skipped
  • IDs reset to S1 on each page
  • WhisperX still runs at the word level internally, then timestamps are grouped into sentence boundaries

Example transformations:

<!-- Word level (default) -->
<p>The quick brown fox</p>
<!-- becomes -->
<p><span id="W1">The</span> <span id="W2">quick</span> <span id="W3">brown</span> <span id="W4">fox</span></p>

<!-- Sentence level (--level sentence) -->
<p>My name is Nate the Great.</p>
<p>I am a detective.</p>
<!-- becomes -->
<p><span id="S1">My name is Nate the Great.</span></p>
<p><span id="S2">I am a detective.</span></p>
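
As a rough sketch of the word-level transformation, the snippet below splits the markup on tags and wraps each whitespace-delimited word in the text between them. It is a simplified assumption of the approach, not the tool's implementation: the name wrap_words is hypothetical, and the sketch does not skip <script>, <style>, or <audio> content or handle comments that contain ">".

```python
import re

# Capturing group keeps the tags in the split result so they can be re-emitted.
TAG_SPLIT = re.compile(r"(<[^>]+>)")
WORD = re.compile(r"\S+")


def wrap_words(body_html):
    """Wrap every whitespace-delimited word outside of tags in a numbered span.

    Hypothetical helper; IDs start at W1 for each call (one call per page).
    """
    counter = 0

    def wrap(match):
        nonlocal counter
        counter += 1
        return f'<span id="W{counter}">{match.group(0)}</span>'

    parts = []
    for chunk in TAG_SPLIT.split(body_html):
        if chunk.startswith("<"):
            parts.append(chunk)  # a tag: pass through untouched
        else:
            parts.append(WORD.sub(wrap, chunk))  # text: wrap each word
    return "".join(parts)
```

Calling wrap_words("<p>The quick brown fox</p>") reproduces the word-level example above. Because tags are passed through verbatim, existing classes, attributes, and nesting are preserved.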

narrate -- Generate narration audio from text

python readalong.py narrate <epub_dir> [options]

Generates narration audio directly from the XHTML text using a text-to-speech engine. This replaces the need for a recording studio or a pre-recorded narration file.

Option | Description
--engine | TTS engine: edge (default), elevenlabs, or openai
--voice | Voice name (engine-specific, see below)
--per-page | Generate one audio file per page instead of one for the whole book

Available engines:

Engine | Quality | Cost | Setup
edge | Good (Microsoft neural voices) | Free, no API key | pip install edge-tts
elevenlabs | Excellent (best for audiobooks) | Paid (free tier: 10k chars/mo) | pip install elevenlabs + ELEVENLABS_API_KEY env var
openai | Very good | Paid (~$15/1M chars HD) | pip install openai + OPENAI_API_KEY env var

Voice examples:

  • Edge: en-US-AndrewMultilingualNeural (default), en-US-JennyNeural, en-GB-SoniaNeural
    • List all voices: edge-tts --list-voices
  • ElevenLabs: Rachel (default), Bella, Antoni, Josh
  • OpenAI: alloy (default), nova, shimmer, echo, fable, onyx

What it does:

  • Extracts text from each wrapped XHTML page (respecting --exclude from the wrap step)
  • Single-audio mode (default): Concatenates all page text with paragraph breaks and generates one narration.mp3 file
  • Per-page mode (--per-page): Generates one audio file per XHTML page, named to match the XHTML file for automatic audio matching
  • Registers the generated audio in the OPF manifest
  • Updates audio_map.json with the new mappings
  • Skips generation if the output file already exists (delete to regenerate)

align -- Generate word-level timestamps

python readalong.py align <epub_dir> [options]

Generates precise start/end timestamps for every word using either WhisperX (automatic) or Audacity label files (manual).

Option | Description
--method | whisperx (default) or audacity
--labels-dir DIR | Directory containing Audacity .txt label exports (required for audacity method)
--audio-map | Manual audio mapping overrides on the command line: page1.xhtml=audio/file1.m4a page2.xhtml=audio/file2.m4a
--file-audio-map FILE | JSON file with page-to-audio mappings (the audio_map.json generated by prepare is designed for this)

WhisperX mode (default):

The tool automatically detects whether the EPUB uses per-page audio or single-audio (one narration file covers the entire book).

  • Per-page mode: Runs WhisperX independently on each audio file matched to its XHTML page
  • Single-audio mode: Runs WhisperX once on the entire narration file, then splits the resulting word timestamps across pages based on word counts and spine order
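
The single-audio split can be pictured as slicing one long list of word timestamps by each page's word count. The sketch below assumes the counts line up exactly with the recognized words; the function name and data shapes are illustrative assumptions, and the real tool has to handle mismatches with best-effort alignment.

```python
def split_by_word_count(words, pages):
    """Distribute book-wide word timestamps across pages in spine order.

    words: list of (start_seconds, end_seconds) tuples for the whole narration.
    pages: list of (page_name, word_count) tuples in spine order.
    Hypothetical helper; assumes the counts match the recognized words exactly.
    """
    result = {}
    cursor = 0
    for page, count in pages:
        # Each page takes the next `count` timestamps from the running list.
        result[page] = words[cursor:cursor + count]
        cursor += count
    return result
```

If a page ends up with fewer timestamps than words, that is a sign of a word count mismatch (see Troubleshooting).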

Audio file matching rules (in priority order):

  1. --file-audio-map / --audio-map: Manual overrides take highest priority
  2. Basename match: foxandgrapes.xhtml looks for foxandgrapes.m4a, foxandgrapes.mp3, etc.
  3. Designated narration: If --narration was used in the prepare step, that file is the fallback
  4. Single narration heuristic: If only one non-soundtrack audio file exists, it is used

WhisperX uses the large-v2 model and auto-detects CUDA vs CPU.

Audacity mode:

python readalong.py align MyBook --method audacity --labels-dir ./labels/

Expects one .txt file per XHTML page in the labels directory, named to match the XHTML file (e.g., foxandgrapes.txt for foxandgrapes.xhtml). Each file should be an Audacity label export with tab-separated columns: start_time\tend_time\tlabel.
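
To make the label format concrete, here is a minimal parser for the three tab-separated columns described above. The function name is hypothetical; skipping lines that begin with a backslash is a precaution for exports that carry extra frequency-range rows, which is an assumption about the input rather than documented tool behavior.

```python
def parse_audacity_labels(text):
    """Parse Audacity label text into (start, end, label) tuples.

    Hypothetical helper for illustration only.
    """
    labels = []
    for line in text.splitlines():
        # Skip blank lines and rows starting with a backslash (assumed to be
        # auxiliary rows rather than start/end/label entries).
        if not line.strip() or line.startswith("\\"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a start/end/label row
        labels.append((float(parts[0]), float(parts[1]), parts[2]))
    return labels
```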

Recommended QA workflow (GUI):

The GUI's two-step Align flow makes WhisperX QA much easier:

  1. Step A runs WhisperX and exports a single combined label file ({book}_labels.txt) covering all pages. Each label uses the format page:word_id (e.g., cover:W1, p04:W6) so the entire book is in one timeline.
  2. Open the narration audio in Audacity, then File > Import > Labels and select the combined file. All words for the entire book appear on one timeline.
  3. Drag label boundaries to fix any timing issues, especially around compound words (like "hook-and-ladder") which WhisperX often splits incorrectly.
  4. File > Export Other > Labels -- save back to the same combined file.
  5. Step B in the GUI reads the combined file, splits it back into per-page timestamps, and generates SMIL files.

This is much faster than opening one label file per page. After fixing the labels, you only need to re-run Step B and Finalize -- no need to re-run the full pipeline.
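
The "split back into per-page timestamps" part of Step B can be sketched as grouping labels by the page prefix in page:word_id. This is an illustrative assumption about the data shapes, not the GUI's code; group_labels_by_page is a hypothetical name.

```python
from collections import defaultdict


def group_labels_by_page(labels):
    """Group combined labels by page.

    labels: iterable of (start, end, "page:word_id") tuples, e.g. "p04:W6".
    Returns {page: [(word_id, start, end), ...]} in input order.
    Hypothetical helper for illustration only.
    """
    pages = defaultdict(list)
    for start, end, label in labels:
        page, sep, word_id = label.partition(":")
        if not sep:
            continue  # label is not in page:word_id form
        pages[page].append((word_id, start, end))
    return dict(pages)
```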

smil -- Generate SMIL files

python readalong.py smil <epub_dir>

Creates one SMIL file per XHTML page from the timestamps generated by align. SMIL files are placed alongside their XHTML counterparts with correct relative audio paths.
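
For orientation, the sketch below builds a minimal EPUB 3 Media Overlay document from a list of span timings. The function name, the par ID scheme, and the time formatting are illustrative assumptions; only the overall <smil>/<body>/<par> structure with <text> and <audio> children follows the Media Overlays format.

```python
from xml.sax.saxutils import quoteattr


def build_smil(xhtml_name, audio_src, timings):
    """Build a SMIL document string for one XHTML page.

    timings: list of (span_id, start_seconds, end_seconds) tuples.
    audio_src: audio path relative to the SMIL file.
    Hypothetical helper for illustration only.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" '
        'xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">',
        "  <body>",
    ]
    for n, (span_id, start, end) in enumerate(timings, start=1):
        # One <par> pairs a text fragment with its audio clip.
        lines.append(f'    <par id="par{n}">')
        lines.append(f'      <text src={quoteattr(xhtml_name + "#" + span_id)}/>')
        lines.append(
            f"      <audio src={quoteattr(audio_src)} "
            f'clipBegin="{start:.3f}s" clipEnd="{end:.3f}s"/>'
        )
        lines.append("    </par>")
    lines += ["  </body>", "</smil>"]
    return "\n".join(lines) + "\n"
```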

finalize -- Update OPF, HTML namespaces, CSS, and cover

python readalong.py finalize <epub_dir> [--overlaycolor COLOR] [--cover IMAGE]

Applies all remaining read-along metadata changes to make the EPUB spec-compliant.

Option | Description
--overlaycolor | CSS color for the word highlight background (default: yellow). Accepts any valid CSS color: named colors, hex (#00ccff), rgb(), etc.
--cover | Replacement cover image file (e.g., a cover with a play button overlay). The original cover filename is preserved so manifest references stay valid.

What it does:

  1. Updates content.opf: bumps version to 3.0, adds duration/active-class metadata, registers SMIL files, adds media-overlay attributes to XHTML items
  2. Updates HTML namespaces (xmlns:ibooks, xmlns:epub) on all non-excluded XHTML files
  3. Adds media overlay CSS (.-epub-media-overlay-active) to the main stylesheet
  4. Replaces the cover image if --cover is provided
  5. Makes the cover tappable to start read-along by adding ibooks:readaloud="startstop" to the cover <img> tag
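
The duration and active-class metadata from step 1 might look like the output of the sketch below. The helper names and manifest item IDs are hypothetical; the h:mm:ss.mmm layout mirrors the 0:00:00.000 form mentioned under Troubleshooting, and the media:duration / media:active-class property names come from the EPUB 3 Media Overlays vocabulary.

```python
def format_clock(seconds):
    """Format seconds as h:mm:ss.mmm (e.g. 65.25 becomes 0:01:05.250)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def overlay_meta(smil_durations, active_class="-epub-media-overlay-active"):
    """Return OPF <meta> lines for media overlays.

    smil_durations: {smil_manifest_item_id: duration_in_seconds}.
    Hypothetical helper; item IDs are whatever the manifest uses.
    """
    lines = [
        # One duration per SMIL item, tied to it with refines.
        f'<meta property="media:duration" refines="#{item_id}">{format_clock(secs)}</meta>'
        for item_id, secs in smil_durations.items()
    ]
    # Total duration for the whole publication (no refines attribute).
    total = sum(smil_durations.values())
    lines.append(f'<meta property="media:duration">{format_clock(total)}</meta>')
    # Class name (without the leading dot) that readers apply to the active element.
    lines.append(f'<meta property="media:active-class">{active_class}</meta>')
    return lines
```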

pack -- Repack into EPUB

python readalong.py pack <epub_dir> [-o output.epub]

Repacks the directory into a valid .epub ZIP file. Writes mimetype first and uncompressed (per EPUB spec). Backs up existing output to .epub.bak.
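
The "mimetype first and uncompressed" rule is easy to get wrong with generic ZIP tools, so here is a minimal sketch of it using the standard library. pack_epub is a hypothetical name, the sketch omits the .epub.bak backup, and it assumes the output path is outside the directory being packed.

```python
import zipfile
from pathlib import Path


def pack_epub(epub_dir, output_path):
    """Zip an unpacked EPUB directory with mimetype first and stored.

    Hypothetical helper for illustration only.
    """
    root = Path(epub_dir)
    with zipfile.ZipFile(output_path, "w") as zf:
        # The mimetype entry must be the first file and must not be compressed.
        zf.write(root / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(root.rglob("*")):
            arcname = path.relative_to(root).as_posix()
            if path.is_file() and arcname != "mimetype":
                zf.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```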

strip -- Remove read-along functionality

python readalong.py strip <epub_dir> [--remove-audio]

Completely removes all read-along artifacts from an EPUB to prepare it for translation or re-recording. This is the inverse of the entire pipeline.

Option | Description
--remove-audio | Also delete audio files from OEBPS/audio/

What it removes:

Layer | Artifacts removed
SMIL | All .smil files deleted
OPF | media-overlay attributes, SMIL manifest entries, media:duration and media:active-class metadata
XHTML | Word spans (<span id="W1"> unwrapped), sentence spans (<span id="S1"> unwrapped), <audio epub:type="ibooks:soundtrack"> tags, play/stop buttons (<p ibooks:readaloud="...">), xmlns:ibooks namespace
CSS | .-epub-media-overlay-active rule, .-media-overlay-active rule, play/stop button styles (#raplay, #rastop, #rass), .-ibooks-media-overlay-enabled rules
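
As an illustration of the XHTML layer only, the sketch below unwraps spans whose id is W or S followed by digits. It is a simplified assumption, not the tool's implementation: unwrap_spans is a hypothetical name, and a regex like this mis-pairs tags when a wrapped sentence itself contains other <span> elements, so a real implementation would need a proper parser for that case.

```python
import re

# Matches only spans of the exact form <span id="W12"> or <span id="S3">.
SPAN = re.compile(r'<span id="[WS]\d+">(.*?)</span>', re.DOTALL)


def unwrap_spans(xhtml):
    """Remove read-along spans, keeping their inner content.

    Hypothetical helper; assumes wrapped content holds no nested <span>.
    """
    return SPAN.sub(r"\1", xhtml)
```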

After stripping, use pack to create a clean EPUB:

python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_for_translation.epub

auto -- Full pipeline

python readalong.py auto <epub_file> [all options from other commands]

Runs the full pipeline in sequence: prepare, wrap, (narrate), align, smil, finalize, pack. The output EPUB is named <original>_readalong.epub. The narrate step only runs when --narrate is specified.

Option | Description
--narration FILE | Existing narration audio file to inject
--backgroundmusic FILE | Background music file to inject
--narrate | Generate narration via TTS instead of providing --narration
--engine | TTS engine for --narrate: edge (default), elevenlabs, openai
--voice | TTS voice name (engine-specific)
--per-page | Generate one audio per page instead of one for the whole book
--level | word (default) or sentence -- highlight granularity
--exclude | XHTML files to skip for wrapping
--method | whisperx or audacity
--labels-dir DIR | Audacity labels directory
--audio-map | Manual page-to-audio mappings on the command line
--file-audio-map FILE | JSON file with page-to-audio mappings
--overlaycolor COLOR | Highlight color (default: yellow)
--cover IMAGE | Replacement cover image (e.g., a cover with a play button)

How It Works

The read-along EPUB standard

Read-along EPUBs use EPUB 3 Media Overlays, which synchronize text with audio at whatever granularity the SMIL files define (word or sentence level in this tool). The key components are:

  • SMIL files: One per XHTML page. Each <par> element pairs a word or sentence (via its span ID in the XHTML) with an audio clip (start and end time in the narration audio).
  • content.opf metadata: Declares the SMIL files, links them to XHTML pages via media-overlay attributes, and specifies the CSS class applied to the active word or sentence.
  • CSS highlight class: The .-epub-media-overlay-active class determines how the currently spoken word or sentence looks (color, background, etc.).
  • HTML namespaces: The ibooks and epub namespaces enable reader-specific features like soundtrack playback and read-along buttons.

Audio matching logic

When the tool needs to find the audio file for a given XHTML page, it checks in this order:

  1. Manual overrides (--file-audio-map and --audio-map): Checked first. CLI values override JSON values.
  2. Basename match: foxandgrapes.xhtml looks for foxandgrapes.m4a, foxandgrapes.mp3, etc.
  3. Designated narration: If --narration was used in the prepare step, that file is the fallback.
  4. Single narration heuristic: If only one non-soundtrack audio file exists, it is used.
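
The priority order above can be sketched as a single lookup function. The function name, parameters, and the way soundtracks are identified are assumptions made for illustration; the tool's own detection logic may differ.

```python
from pathlib import PurePosixPath

AUDIO_EXTS = (".mp3", ".m4a", ".mp4", ".ogg", ".oga", ".wav", ".webm", ".aac", ".flac")


def resolve_audio(page, audio_files, json_map=None, cli_map=None,
                  designated=None, soundtracks=()):
    """Pick the audio file for one XHTML page, or None if nothing matches.

    Hypothetical helper; all parameter names are illustrative.
    """
    # 1. Manual overrides: CLI values win over JSON values; empty values are ignored.
    overrides = {**(json_map or {}), **(cli_map or {})}
    if overrides.get(page):
        return overrides[page]
    # 2. Basename match: foxandgrapes.xhtml -> foxandgrapes.m4a, .mp3, etc.
    stem = PurePosixPath(page).stem
    for audio in audio_files:
        candidate = PurePosixPath(audio)
        if candidate.stem == stem and candidate.suffix.lower() in AUDIO_EXTS:
            return audio
    # 3. Narration file designated during prepare.
    if designated:
        return designated
    # 4. Exactly one non-soundtrack audio file in the book.
    narration = [a for a in audio_files if a not in soundtracks]
    return narration[0] if len(narration) == 1 else None
```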

Single-audio vs. per-page audio

  • Per-page audio (like BabysOwnAesop): Each XHTML page has its own audio file. WhisperX runs independently on each.
  • Single audio (common for new projects): One audio file covers the entire book. WhisperX processes it once, then timestamps are split across pages in spine order.

Typical Workflow

New project (start to finish)

  1. Start with a plain EPUB
  2. Run the pipeline:
# Fully automated -- no audio files needed!
python readalong.py auto MyBook.epub \
  --narrate --engine edge \
  --exclude copyright.xhtml \
  --overlaycolor lightblue
  3. Open the output EPUB in Apple Books, Thorium, or another EPUB 3 reader and test
  4. If something is off, fix and re-run individual steps (state is preserved between runs)

Preparing for translation

  1. Strip read-along from the source-language EPUB:
python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_stripped.epub
  2. Translate the text in the stripped EPUB
  3. Record or generate localized narration
  4. Run the pipeline on the translated EPUB to add read-along back

GUI workflow

  1. Launch python readalong_gui.py
  2. Drag & drop your EPUB onto the window
  3. Configure settings on the relevant tabs (exclude files, word/sentence level, TTS voice, overlay color)
  4. Click "Run Full Pipeline" on the Finalize tab -- or run each step individually
  5. Use the Restore tab to strip read-along when preparing for translation

Troubleshooting

  • WhisperX word count mismatch: The tool prints a warning and uses best-effort sequential alignment. WhisperX often splits compound words like "hook-and-ladder" into separate tokens; the tool tries to merge them back, but you may need to fix some boundaries in Audacity.
  • Drift/timing issues with WhisperX: WhisperX gets you 80-90% of the way there but its word boundaries lean slightly early and accumulate drift through the book. Use the GUI's two-step Align flow: WhisperX exports a single combined label file that you can open in Audacity, fix in one session, and re-import via Step B.
  • WhisperX ImportError for lightning: Install lightning: pip install lightning
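  • WhisperX RuntimeError: Attempting to deserialize object on a CUDA device: WhisperX ships a model checkpoint saved on CUDA. Re-save it with its tensors mapped to CPU:
    import torch
    # Placeholder path: point this at the whisperx package in your site-packages
    path = '<your_python_site_packages>/whisperx/assets/pytorch_model.bin'
    # Load with every tensor mapped to CPU, then overwrite the file in place
    checkpoint = torch.load(path, map_location='cpu', weights_only=False)
    torch.save(checkpoint, path)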
  • Duration shows 0:00:00.000: Install mutagen (pip install mutagen) or ffprobe (from FFmpeg) for audio duration detection.
  • EPUB validation errors: Upload to an EPUB validator to identify issues. Common problems include missing manifest entries or malformed XHTML.
  • Read-along doesn't play: Ensure the EPUB reader supports Media Overlays (Apple Books, Thorium). Verify that media-overlay attributes in the OPF point to the correct SMIL IDs.
  • Cover doesn't start playback when tapped: The finalize step automatically adds ibooks:readaloud="startstop" to the cover image. Verify the cover XHTML has this attribute on the <img> tag inside <div id="cover">.
  • Re-running a step: The .readalong_state.json file in the _working folder preserves state, so you can re-run any step without starting over. After fixing labels in Audacity, you only need to re-run Step B (Import Labels & Generate SMIL) and Finalize -- not the entire pipeline.
  • EPUB folder has extra files: Working files (scripts, labels, state) are stored in the _working sibling folder, never inside the EPUB directory. The EPUB folder can be safely zipped by any external tool.
