PyReadAlongEPUB

Automates the tedious process of enabling read-along (media overlays) in EPUB 3.0 files. Takes an EPUB 2.0 or 3.0 file, generates a narration script, optionally generates TTS audio or accepts an existing recording, wraps text in span tags, aligns audio to text at the word or sentence level, creates SMIL files, and updates all metadata -- then repacks it into a valid read-along EPUB. Can also strip all read-along functionality to prepare an EPUB for translation.

Requirements

Python 3.10+ (uses standard library only for everything except audio alignment and TTS)
TTS narration (pick one, or bring your own audio):
- Edge TTS (free, no API key): pip install edge-tts
- ElevenLabs (highest quality): pip install elevenlabs + API key
- OpenAI TTS: pip install openai + API key
WhisperX (for automatic word-level audio alignment):
```
pip install whisperx torch torchaudio
```
OR Audacity label exports if you prefer manual timing
Optional: mutagen (pip install mutagen) or ffprobe (from FFmpeg) for audio duration detection
GUI: pip install PyQt6 (for the graphical interface)

GUI

A PyQt6 graphical interface is available for students and users who prefer not to use the command line.

```
pip install PyQt6
python readalong_gui.py
```

The GUI is organized into 6 tabs:

Tab	What it does
1. Prepare	Select an EPUB, set pages to exclude, unpack, and generate a narration script (`{name}_script.html` + `.txt`) in the working folder
2. Wrap	Choose word or sentence level and wrap text in span tags
3. Audio	Provide audio via TTS generation or an existing recording. TTS mode: pick an engine, browse 320+ voices in a dropdown, enter API key if needed. Existing mode: browse for a narration file. Both modes support optional background music.
4. Align	Two-step flow: Step A runs WhisperX to auto-align and exports a single combined Audacity label file for QA. Step B imports the corrected labels and generates SMIL files.
5. Finalize	Pick a highlight color with live preview, optionally replace the cover image, set the output filename, and pack the final EPUB. Includes a "Run Full Pipeline" button for one-click processing.
Restore	Strip all read-along functionality from an EPUB to prepare it for translation. Optionally remove audio files.

Features:

Drag & drop EPUB files onto the window
Auto-restore previous session when reloading an EPUB (exclude list, level, audio map, labels)
Narration script generation (HTML with formatting for dubbing talent + plain TXT for TTS)
Edge TTS voice dropdown with all 320+ voices (grouped by language, English first)
ElevenLabs voice loading via "Load Voices" button after entering API key
OpenAI voices pre-populated (alloy, echo, fable, nova, onyx, shimmer, etc.)
API key fields shown only when needed (hidden for free Edge TTS)
Color picker for the overlay highlight with live preview
Real-time log output showing each step's progress
WhisperX -> Audacity QA workflow with a single combined label file (one file to open in Audacity, contains all words for the entire book)
Replacement cover image picker for adding a play-button-equipped cover
Tappable cover automatically enabled (tap the cover image to start read-along)
"Run Full Pipeline" button to configure all tabs then run everything in one click
Dark theme matching the karaoke generator GUI

Folder layout

The tool keeps the unpacked EPUB directory clean -- all working files are stored in a sibling _working folder so external tools can safely zip the EPUB folder without picking up debris.

```
MyBook.epub                  (original)
MyBook/                      (unpacked EPUB -- only valid EPUB content)
MyBook_working/              (sibling working folder)
  MyBook_script.html           (narration script with formatting for dubbing talent)
  MyBook_script.txt            (plain text for TTS engines)
  audio_map.json               (page-to-audio mappings)
  audacity_labels/             (WhisperX export for QA in Audacity)
  .readalong_state.json        (pipeline state)
```
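
As a rough illustration of the naming convention above, the sibling folder can be derived from either the EPUB file or the unpacked directory. The helper name `working_dir_for` is hypothetical and is not taken from `readalong.py`.

```python
from pathlib import Path


def working_dir_for(epub_path: Path) -> Path:
    """Return the sibling `<name>_working` folder for an EPUB file or unpacked folder.

    Hypothetical helper: it only mirrors the layout shown above.
    """
    # Drop the .epub extension if we were given the original file.
    base = epub_path.with_suffix("") if epub_path.suffix.lower() == ".epub" else epub_path
    # Keep the same parent directory so the folder sits next to the EPUB.
    return base.with_name(base.name + "_working")


print(working_dir_for(Path("books/MyBook.epub")))
print(working_dir_for(Path("books/MyBook")))
```

Both calls resolve to the same folder, which is what lets later steps find the state written by `prepare`.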

Quick Start (CLI)

Full pipeline with TTS narration (no audio needed!)

```
python readalong.py auto MyBook.epub \
  --narrate --engine edge \
  --exclude copyright.xhtml \
  --overlaycolor yellow
```

Sentence-level highlighting (for older readers)

```
python readalong.py auto MyBook.epub \
  --narrate --engine edge \
  --level sentence \
  --exclude cover.xhtml copyright.xhtml
```

Full pipeline with existing audio

```
python readalong.py auto MyBook.epub \
  --narration narration.mp3 \
  --backgroundmusic soundtrack.mp3 \
  --exclude copyright.xhtml \
  --overlaycolor yellow
```

Strip read-along for translation

```
python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_stripped.epub
```

Step-by-step pipeline

```
# 1. Unpack the EPUB and generate narration script
python readalong.py prepare MyBook.epub --exclude copyright.xhtml

# 2. Wrap all visible words in <span> tags with unique IDs
python readalong.py wrap MyBook --exclude copyright.xhtml

# 3. Generate narration audio from the text (or skip if you have audio)
python readalong.py narrate MyBook --engine edge

# 4. Generate word-level timestamps
python readalong.py align MyBook --method whisperx

# 5. Generate SMIL files from the timestamps
python readalong.py smil MyBook

# 6. Update the OPF, HTML namespaces, and CSS
python readalong.py finalize MyBook --overlaycolor yellow

# 7. Repack into an EPUB file
python readalong.py pack MyBook
```

Commands

`prepare` -- Unpack and generate script

```
python readalong.py prepare <epub_file> [options]
```

Unpacks the EPUB and analyzes its structure. Generates a narration script and audio map in the _working folder. If the EPUB already has audio, use --narration and --backgroundmusic to inject additional audio assets.

Option	Description
`-o`, `--output`	Output directory (default: same name as EPUB without extension)
`--narration FILE`	Narration audio file to copy into the EPUB and register in the OPF manifest
`--backgroundmusic FILE`	Background music file to inject
`--exclude`	XHTML files to exclude from the narration script

What it does:

Extracts the EPUB ZIP into a directory
Locates the OPF file via META-INF/container.xml
If --narration is provided: copies the file to OEBPS/audio/, registers it in the OPF <manifest> with the correct media type
If --backgroundmusic is provided: copies the file, registers it, and injects an iBooks-compatible <audio> element into every XHTML <body>
Reports all XHTML files with word counts and whether they are already wrapped
Creates {name}_working/ folder with:
- {name}_script.html -- formatted narration script preserving bold, italic, and headings from the EPUB. Designed to be opened in a browser and shared with dubbing talent.
- {name}_script.txt -- plain text version for TTS engines
- audio_map.json -- page-to-audio mappings (auto-detected matches pre-filled, empty values for pages without a match)

Supported audio formats: .mp3, .m4a, .mp4, .ogg, .oga, .wav, .webm, .aac, .flac
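
The OPF lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: `find_opf_path` is a hypothetical name, and the in-memory archive only contains what the lookup needs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the EPUB container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub: zipfile.ZipFile) -> str:
    """Return the OPF path declared in META-INF/container.xml (hypothetical helper)."""
    root = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or "full-path" not in rootfile.attrib:
        raise ValueError("container.xml does not declare a rootfile")
    return rootfile.attrib["full-path"]


# Tiny in-memory archive, just enough to exercise the lookup.
_container = (
    '<?xml version="1.0"?>'
    '<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles></container>'
)
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as zf:
    zf.writestr("META-INF/container.xml", _container)
with zipfile.ZipFile(_buf) as zf:
    opf_path = find_opf_path(zf)
print(opf_path)
```

Reading the path from `container.xml` rather than assuming `OEBPS/content.opf` matters because EPUBs are free to place the package file elsewhere.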

`wrap` -- Wrap words or sentences in span tags

```
python readalong.py wrap <epub_dir> [--exclude ...] [--level word|sentence]
```

Wraps content in each XHTML file with <span> tags that the SMIL files reference for highlighting.

Option	Description
`--exclude`	XHTML filenames to skip (e.g., `copyright.xhtml`)
`--level`	`word` (default) or `sentence`

Word level (default): Wraps every visible word with <span id="WN">word</span>. Best for early readers / children's books where each word lights up individually.

Parses each XHTML body by splitting on HTML tags
Skips content inside <script>, <style>, and <audio> elements
Preserves all existing HTML structure (classes, attributes, nesting)
IDs reset to W1 on each page

Sentence level (--level sentence): Wraps the content of each <p> element with <span id="SN">...</span>. Best for older readers where the entire sentence highlights as it is read.

Each <p> tag becomes one highlightable unit
Paragraphs with no visible text (e.g., image-only) are skipped
IDs reset to S1 on each page
WhisperX still runs at the word level internally, then timestamps are grouped into sentence boundaries

Example transformations:

```
<!-- Word level (default) -->
<p>The quick brown fox</p>
<!-- becomes -->
<p><span id="W1">The</span> <span id="W2">quick</span> <span id="W3">brown</span> <span id="W4">fox</span></p>

<!-- Sentence level (--level sentence) -->
<p>My name is Nate the Great.</p>
<p>I am a detective.</p>
<!-- becomes -->
<p><span id="S1">My name is Nate the Great.</span></p>
<p><span id="S2">I am a detective.</span></p>
```
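
The word-level transformation can be approximated by splitting the body on tags and wrapping only the text between them. The sketch below is an assumption about the general approach, not the code in `readalong.py`; `wrap_words` is a hypothetical name, and it does not handle comments or attribute values that contain `>`.

```python
import re

SKIP_TAGS = {"script", "style", "audio"}  # content inside these is left alone
TAG_SPLIT = re.compile(r"(<[^>]+>)")      # capture group keeps the tags in the split


def wrap_words(body: str) -> str:
    """Wrap each visible word in <span id="WN">, leaving all tags untouched."""
    out, counter, skip_depth = [], 0, 0
    for piece in TAG_SPLIT.split(body):
        if piece.startswith("<"):
            m = re.match(r"<\s*(/?)\s*([A-Za-z0-9:]+)", piece)
            if m and m.group(2).lower() in SKIP_TAGS and not piece.endswith("/>"):
                # Track whether we are inside a skipped element.
                skip_depth = max(0, skip_depth + (-1 if m.group(1) else 1))
            out.append(piece)
        elif skip_depth > 0:
            out.append(piece)
        else:
            # Split on whitespace but keep it, so spacing is preserved exactly.
            for token in re.split(r"(\s+)", piece):
                if token and not token.isspace():
                    counter += 1
                    token = f'<span id="W{counter}">{token}</span>'
                out.append(token)
    return "".join(out)


wrapped = wrap_words("<p>The quick brown fox</p>")
print(wrapped)
```

Calling `wrap_words` once per page gives the per-page reset to `W1`, since the counter lives inside the function.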

`narrate` -- Generate narration audio from text

```
python readalong.py narrate <epub_dir> [options]
```

Generates narration audio directly from the XHTML text using a text-to-speech engine. This replaces the need for a recording studio or a pre-recorded narration file.

Option	Description
`--engine`	TTS engine: `edge` (default), `elevenlabs`, or `openai`
`--voice`	Voice name (engine-specific, see below)
`--per-page`	Generate one audio file per page instead of one for the whole book

Available engines:

Engine	Quality	Cost	Setup
edge	Good (Microsoft neural voices)	Free, no API key	`pip install edge-tts`
elevenlabs	Excellent (best for audiobooks)	Paid (free tier: 10k chars/mo)	`pip install elevenlabs` + `ELEVENLABS_API_KEY` env var
openai	Very good	Paid (~$15/1M chars HD)	`pip install openai` + `OPENAI_API_KEY` env var

Voice examples:

Edge: en-US-AndrewMultilingualNeural (default), en-US-JennyNeural, en-GB-SoniaNeural
- List all voices: edge-tts --list-voices
ElevenLabs: Rachel (default), Bella, Antoni, Josh
OpenAI: alloy (default), nova, shimmer, echo, fable, onyx

What it does:

Extracts text from each wrapped XHTML page (respecting --exclude from the wrap step)
Single-audio mode (default): Concatenates all page text with paragraph breaks and generates one narration.mp3 file
Per-page mode (--per-page): Generates one audio file per XHTML page, named to match the XHTML file for automatic audio matching
Registers the generated audio in the OPF manifest
Updates audio_map.json with the new mappings
Skips generation if the output file already exists (delete to regenerate)

`align` -- Generate word-level timestamps

```
python readalong.py align <epub_dir> [options]
```

Generates precise start/end timestamps for every word using either WhisperX (automatic) or Audacity label files (manual).

Option	Description
`--method`	`whisperx` (default) or `audacity`
`--labels-dir DIR`	Directory containing Audacity `.txt` label exports (required for `audacity` method)
`--audio-map`	Manual audio mapping overrides on the command line: `page1.xhtml=audio/file1.m4a page2.xhtml=audio/file2.m4a`
`--file-audio-map FILE`	JSON file with page-to-audio mappings (the `audio_map.json` generated by `prepare` is designed for this)

WhisperX mode (default):

The tool automatically detects whether the EPUB uses per-page audio or single-audio (one narration file covers the entire book).

Per-page mode: Runs WhisperX independently on each audio file matched to its XHTML page
Single-audio mode: Runs WhisperX once on the entire narration file, then splits the resulting word timestamps across pages based on word counts and spine order

Audio file matching rules (in priority order):

--file-audio-map / --audio-map: Manual overrides take highest priority
Basename match: foxandgrapes.xhtml looks for foxandgrapes.m4a, foxandgrapes.mp3, etc.
Designated narration: If --narration was used in the prepare step, that file is the fallback
Single narration heuristic: If only one non-soundtrack audio file exists, it is used

WhisperX uses the large-v2 model and auto-detects CUDA vs CPU.

Audacity mode:

```
python readalong.py align MyBook --method audacity --labels-dir ./labels/
```

Expects one .txt file per XHTML page in the labels directory, named to match the XHTML file (e.g., foxandgrapes.txt for foxandgrapes.xhtml). Each file should be an Audacity label export with tab-separated columns: start_time\tend_time\tlabel.
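
For reference, a label export in this format can be read with a few lines of Python. This is an illustrative parser under the assumption that each usable row has exactly the three tab-separated columns described above; `parse_audacity_labels` and `Label` are hypothetical names.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str     # label text, e.g. a word ID


def parse_audacity_labels(content: str) -> list[Label]:
    """Parse start_time<TAB>end_time<TAB>label rows, skipping anything else."""
    labels = []
    for line in content.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        try:
            start, end = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # rows whose first two columns are not numeric
        labels.append(Label(start, end, parts[2].strip()))
    return labels


sample = "0.000000\t0.420000\tW1\n0.420000\t0.910000\tW2\n"
labels = parse_audacity_labels(sample)
print(labels)
```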

Recommended QA workflow (GUI):

The GUI's two-step Align flow makes WhisperX QA much easier:

Step A runs WhisperX and exports a single combined label file ({book}_labels.txt) covering all pages. Each label uses the format page:word_id (e.g., cover:W1, p04:W6) so the entire book is in one timeline.
Open the narration audio in Audacity, then File > Import > Labels and select the combined file. All words for the entire book appear on one timeline.
Drag label boundaries to fix any timing issues, especially around compound words (like "hook-and-ladder") which WhisperX often splits incorrectly.
File > Export Other > Labels -- save back to the same combined file.
Step B in the GUI reads the combined file, splits it back into per-page timestamps, and generates SMIL files.

This is much faster than opening one label file per page. After fixing the labels, you only need to re-run Step B and Finalize -- no need to re-run the full pipeline.
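
Splitting the combined file back into per-page timestamps comes down to partitioning each label on its first colon. The sketch below assumes the `page:word_id` format described in Step A and already-parsed `(start, end, label)` rows; the function name is hypothetical.

```python
from collections import defaultdict


def split_combined_labels(rows):
    """Group (start, end, 'page:word_id') rows into {page: [(word_id, start, end), ...]}."""
    pages = defaultdict(list)
    for start, end, label in rows:
        page, sep, word_id = label.partition(":")
        if not sep:
            continue  # not in page:word_id form, ignore
        pages[page].append((word_id, start, end))
    return dict(pages)


rows = [(0.0, 0.4, "cover:W1"), (0.4, 0.9, "cover:W2"), (1.2, 1.6, "p04:W1")]
per_page = split_combined_labels(rows)
print(per_page)
```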

`smil` -- Generate SMIL files

```
python readalong.py smil <epub_dir>
```

Creates one SMIL file per XHTML page from the timestamps generated by align. SMIL files are placed alongside their XHTML counterparts with correct relative audio paths.
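
To make the output concrete, here is a minimal sketch of what generating one SMIL document from a page's timestamps might look like. The element layout follows the EPUB 3 Media Overlays structure (`<par>` containing `<text>` and `<audio>`); the function name, the `parN` ID scheme, and the seconds-based clip values are assumptions, not details taken from the tool.

```python
from xml.sax.saxutils import quoteattr


def build_smil(xhtml_name, audio_src, timings):
    """Build a SMIL document; timings is [(span_id, start_sec, end_sec), ...]."""
    pars = []
    for n, (span_id, start, end) in enumerate(timings, start=1):
        # quoteattr escapes the value and adds the surrounding quotes.
        text_src = quoteattr(f"{xhtml_name}#{span_id}")
        pars.append(
            f'    <par id="par{n}">\n'
            f"      <text src={text_src}/>\n"
            f'      <audio src={quoteattr(audio_src)} '
            f'clipBegin="{start:.3f}s" clipEnd="{end:.3f}s"/>\n'
            f"    </par>\n"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">\n'
        "  <body>\n" + "".join(pars) + "  </body>\n</smil>\n"
    )


smil = build_smil("p04.xhtml", "audio/narration.mp3", [("W1", 0.0, 0.42), ("W2", 0.42, 0.91)])
print(smil)
```

Because the SMIL file sits next to its XHTML page, both `src` values are relative to that shared directory.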

`finalize` -- Update OPF, HTML namespaces, CSS, and cover

```
python readalong.py finalize <epub_dir> [--overlaycolor COLOR] [--cover IMAGE]
```

Applies all remaining read-along metadata changes to make the EPUB spec-compliant.

Option	Description
`--overlaycolor`	CSS color for the word highlight background (default: `yellow`). Accepts any valid CSS color: named colors, hex (`#00ccff`), `rgb()`, etc.
`--cover`	Replacement cover image file (e.g., a cover with a play button overlay). The original cover filename is preserved so manifest references stay valid.

What it does:

Updates content.opf: bumps version to 3.0, adds duration/active-class metadata, registers SMIL files, adds media-overlay attributes to XHTML items
Updates HTML namespaces (xmlns:ibooks, xmlns:epub) on all non-excluded XHTML files
Adds media overlay CSS (.-epub-media-overlay-active) to the main stylesheet
Replaces the cover image if --cover is provided
Makes the cover tappable to start read-along by adding ibooks:readaloud="startstop" to the cover <img> tag
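
The duration metadata written to the OPF is a clock value. As a small illustration, the helper below formats a number of seconds in the `H:MM:SS.mmm` shape that appears in the Troubleshooting section (`0:00:00.000`); the function name is hypothetical, and the tool's own formatting may differ in detail.

```python
def format_clock(seconds: float) -> str:
    """Format seconds as H:MM:SS.mmm (hypothetical helper)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


print(format_clock(75.5))
```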

`pack` -- Repack into EPUB

```
python readalong.py pack <epub_dir> [-o output.epub]
```

Repacks the directory into a valid .epub ZIP file. Writes mimetype first and uncompressed (per EPUB spec). Backs up existing output to .epub.bak.
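
The mimetype rule is the part of packing that generic zip tools most often get wrong. A minimal sketch with the standard `zipfile` module follows; `pack_epub` is a hypothetical name, the `.epub.bak` backup is not reproduced here, and the output file is assumed to live outside the source directory.

```python
import tempfile
import zipfile
from pathlib import Path


def pack_epub(src_dir: Path, out_file: Path) -> None:
    """Zip src_dir into out_file with mimetype first and stored uncompressed."""
    with zipfile.ZipFile(out_file, "w") as zf:
        # The mimetype entry must be the first entry and must not be compressed.
        zf.write(src_dir / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(src_dir.rglob("*")):
            rel = path.relative_to(src_dir).as_posix()
            if path.is_file() and rel != "mimetype":
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)


# Demo on a throwaway directory with just enough content to pack.
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "MyBook"
    (book / "META-INF").mkdir(parents=True)
    (book / "mimetype").write_text("application/epub+zip")
    (book / "META-INF" / "container.xml").write_text("<container/>")
    out = Path(tmp) / "MyBook.epub"
    pack_epub(book, out)
    with zipfile.ZipFile(out) as zf:
        first = zf.infolist()[0]
print(first.filename, first.compress_type == zipfile.ZIP_STORED)
```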

`strip` -- Remove read-along functionality

```
python readalong.py strip <epub_dir> [--remove-audio]
```

Completely removes all read-along artifacts from an EPUB to prepare it for translation or re-recording. This is the inverse of the entire pipeline.

Option	Description
`--remove-audio`	Also delete audio files from `OEBPS/audio/`

What it removes:

Layer	Artifacts removed
SMIL	All `.smil` files deleted
OPF	`media-overlay` attributes, SMIL manifest entries, `media:duration` and `media:active-class` metadata
XHTML	Word spans (`<span id="W1">` unwrapped), sentence spans (`<span id="S1">` unwrapped), `<audio epub:type="ibooks:soundtrack">` tags, play/stop buttons (`<p ibooks:readaloud="...">`) , `xmlns:ibooks` namespace
CSS	`.-epub-media-overlay-active` rule, `.-media-overlay-active` rule, play/stop button styles (`#raplay`, `#rastop`, `#rass`), `.-ibooks-media-overlay-enabled` rules

After stripping, use pack to create a clean EPUB:

```
python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_for_translation.epub
```
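
The XHTML part of stripping can be pictured as the inverse of `wrap`. The sketch below unwraps only spans whose `id` looks like `W<number>` or `S<number>`, which is an assumption based on the ID scheme described earlier; it is regex-based, so a read-along span that itself contains another `<span>` would need a real parser instead. `unwrap_spans` is a hypothetical name.

```python
import re

# Matches only the spans the wrap step would have produced (id="W12", id="S3", ...).
SPAN_RE = re.compile(r'<span id="[WS]\d+">(.*?)</span>', re.DOTALL)


def unwrap_spans(xhtml: str) -> str:
    """Remove read-along spans but keep their content."""
    return SPAN_RE.sub(r"\1", xhtml)


print(unwrap_spans('<p><span id="W1">The</span> <span id="W2">fox</span></p>'))
```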

`auto` -- Full pipeline

```
python readalong.py auto <epub_file> [all options from other commands]
```

Runs the full pipeline in sequence: prepare, wrap, (narrate), align, smil, finalize, pack. The output EPUB is named <original>_readalong.epub. The narrate step only runs when --narrate is specified.

Option	Description
`--narration FILE`	Existing narration audio file to inject
`--backgroundmusic FILE`	Background music file to inject
`--narrate`	Generate narration via TTS instead of providing `--narration`
`--engine`	TTS engine for `--narrate`: `edge` (default), `elevenlabs`, `openai`
`--voice`	TTS voice name (engine-specific)
`--per-page`	Generate one audio per page instead of one for the whole book
`--level`	`word` (default) or `sentence` -- highlight granularity
`--exclude`	XHTML files to skip for wrapping
`--method`	`whisperx` or `audacity`
`--labels-dir DIR`	Audacity labels directory
`--audio-map`	Manual page-to-audio mappings on the command line
`--file-audio-map FILE`	JSON file with page-to-audio mappings
`--overlaycolor COLOR`	Highlight color (default: `yellow`)
`--cover IMAGE`	Replacement cover image (e.g., a cover with a play button)

How It Works

The read-along EPUB standard

Read-along EPUBs use EPUB 3 Media Overlays, which synchronize text with audio at the word or sentence level. The key components are:

SMIL files: One per XHTML page. Each <par> element pairs a word or sentence (via its span ID in the XHTML) with an audio clip (start and end time in the narration audio).
content.opf metadata: Declares the SMIL files, links them to XHTML pages via media-overlay attributes, and specifies the CSS class applied to the active word.
CSS highlight class: The .-epub-media-overlay-active class determines how the currently-spoken word looks (color, background, etc.).
HTML namespaces: The ibooks and epub namespaces enable reader-specific features like soundtrack playback and read-along buttons.

Audio matching logic

When the tool needs to find the audio file for a given XHTML page, it checks in this order:

Manual overrides (--file-audio-map and --audio-map): Checked first. CLI values override JSON values.
Basename match: foxandgrapes.xhtml looks for foxandgrapes.m4a, foxandgrapes.mp3, etc.
Designated narration: If --narration was used in the prepare step, that file is the fallback.
Single narration heuristic: If only one non-soundtrack audio file exists, it is used.
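
The priority order above can be expressed as a short lookup function. This is a sketch of the described rules only: the function name, its parameters, and the way soundtrack files are identified are all assumptions made for illustration.

```python
from pathlib import Path

AUDIO_EXTS = (".mp3", ".m4a", ".mp4", ".ogg", ".oga", ".wav", ".webm", ".aac", ".flac")


def resolve_audio(page, audio_files, json_map=None, cli_map=None,
                  narration=None, soundtracks=()):
    """Return the audio path for an XHTML page, or None if nothing matches."""
    # 1. Manual overrides; CLI values replace JSON values for the same page.
    overrides = {**(json_map or {}), **(cli_map or {})}
    if overrides.get(page):  # empty values in audio_map.json count as "no override"
        return overrides[page]
    # 2. Basename match: foxandgrapes.xhtml -> foxandgrapes.m4a, .mp3, ...
    stem = Path(page).stem
    for f in audio_files:
        p = Path(f)
        if p.stem == stem and p.suffix.lower() in AUDIO_EXTS:
            return f
    # 3. Narration file designated during prepare.
    if narration:
        return narration
    # 4. Exactly one non-soundtrack audio file.
    candidates = [f for f in audio_files if f not in soundtracks]
    return candidates[0] if len(candidates) == 1 else None


files = ["audio/foxandgrapes.m4a", "audio/soundtrack.mp3"]
print(resolve_audio("foxandgrapes.xhtml", files))
```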

Single-audio vs. per-page audio

Per-page audio (like BabysOwnAesop): Each XHTML page has its own audio file. WhisperX runs independently on each.
Single audio (common for new projects): One audio file covers the entire book. WhisperX processes it once, then timestamps are split across pages in spine order.
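
In single-audio mode, the split across pages can be pictured as slicing one long list of word timestamps by each page's word count. The sketch below assumes the counts line up exactly with the transcribed words; the tool's handling of mismatches is not reproduced, and the names are hypothetical.

```python
def split_by_page(words, page_word_counts):
    """Slice book-wide word timestamps into per-page lists.

    words: [(start, end), ...] for the whole narration, in reading order.
    page_word_counts: [(page_name, word_count), ...] in spine order.
    """
    result, cursor = {}, 0
    for page, count in page_word_counts:
        result[page] = words[cursor:cursor + count]
        cursor += count
    return result


words = [(0.0, 0.4), (0.4, 0.9), (1.2, 1.6)]
pages = split_by_page(words, [("cover.xhtml", 2), ("p01.xhtml", 1)])
print(pages)
```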

Typical Workflow

New project (start to finish)

Start with a plain EPUB
Run the pipeline:

```
# Fully automated -- no audio files needed!
python readalong.py auto MyBook.epub \
  --narrate --engine edge \
  --exclude copyright.xhtml \
  --overlaycolor lightblue
```

Open the output EPUB in Apple Books, Thorium, or another EPUB 3 reader and test
If something is off, fix and re-run individual steps (state is preserved between runs)

Preparing for translation

Strip read-along from the source-language EPUB:

```
python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_stripped.epub
```

Translate the text in the stripped EPUB
Record or generate localized narration
Run the pipeline on the translated EPUB to add read-along back

GUI workflow

Launch python readalong_gui.py
Drag & drop your EPUB onto the window
Configure settings on any of the tabs (exclude files, word/sentence level, TTS voice, overlay color)
Click "Run Full Pipeline" on the Finalize tab -- or run each step individually
Use the Restore tab to strip read-along when preparing for translation

Troubleshooting

WhisperX word count mismatch: The tool prints a warning and uses best-effort sequential alignment. WhisperX often splits compound words like "hook-and-ladder" into separate tokens; the tool tries to merge them back, but you may need to fix some boundaries in Audacity.
Drift/timing issues with WhisperX: WhisperX gets you 80-90% of the way there but its word boundaries lean slightly early and accumulate drift through the book. Use the GUI's two-step Align flow: WhisperX exports a single combined label file that you can open in Audacity, fix in one session, and re-import via Step B.
WhisperX ImportError for lightning: Install lightning: pip install lightning

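WhisperX RuntimeError: Attempting to deserialize object on a CUDA device: The model checkpoint that ships with WhisperX was saved on a CUDA device. Re-save it with its tensors mapped to the CPU:

```
import torch
# Replace the placeholder with your Python site-packages directory.
path = '<your_python_site_packages>/whisperx/assets/pytorch_model.bin'
# map_location='cpu' loads the CUDA-saved tensors onto the CPU.
checkpoint = torch.load(path, map_location='cpu', weights_only=False)
# Overwrite the checkpoint in place so later loads work without a GPU.
torch.save(checkpoint, path)
```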

Duration shows 0:00:00.000: Install mutagen (pip install mutagen) or ffprobe (from FFmpeg) for audio duration detection.
EPUB validation errors: Run the EPUB through a validator such as EPUBCheck to identify issues. Common problems include missing manifest entries or malformed XHTML.
Read-along doesn't play: Ensure the EPUB reader supports Media Overlays (Apple Books, Thorium). Verify that media-overlay attributes in the OPF point to the correct SMIL IDs.
Cover doesn't start playback when tapped: The finalize step automatically adds ibooks:readaloud="startstop" to the cover image. Verify the cover XHTML has this attribute on the <img> tag inside <div id="cover">.
Re-running a step: The .readalong_state.json file in the _working folder preserves state, so you can re-run any step without starting over. After fixing labels in Audacity, you only need to re-run Step B (Import Labels & Generate SMIL) and Finalize -- not the entire pipeline.
EPUB folder has extra files: Working files (scripts, labels, state) are stored in the _working sibling folder, never inside the EPUB directory. The EPUB folder can be safely zipped by any external tool.
