Automates the tedious process of enabling read-along (media overlays) in EPUB 3.0 files. Takes an EPUB 2.0 or 3.0 file, generates a narration script, optionally generates TTS audio or accepts an existing recording, wraps text in span tags, aligns audio to text at the word or sentence level, creates SMIL files, and updates all metadata -- then repacks it into a valid read-along EPUB. Can also strip all read-along functionality to prepare an EPUB for translation.
- Python 3.10+ (uses standard library only for everything except audio alignment and TTS)
- TTS narration (pick one, or bring your own audio):
  - Edge TTS (free, no API key): `pip install edge-tts`
  - ElevenLabs (highest quality): `pip install elevenlabs` + API key
  - OpenAI TTS: `pip install openai` + API key
- WhisperX (for automatic word-level audio alignment): `pip install whisperx torch torchaudio`
  - OR Audacity label exports if you prefer manual timing
- Optional: `mutagen` (`pip install mutagen`) or `ffprobe` (from FFmpeg) for audio duration detection
- GUI: `pip install PyQt6` (for the graphical interface)
A PyQt6 graphical interface is available for students and users who prefer not to use the command line.
pip install PyQt6
python readalong_gui.py

The GUI is organized into 6 tabs:
| Tab | What it does |
|---|---|
| 1. Prepare | Select an EPUB, set pages to exclude, unpack, and generate a narration script ({name}_script.html + .txt) in the working folder |
| 2. Wrap | Choose word or sentence level and wrap text in span tags |
| 3. Audio | Provide audio via TTS generation or an existing recording. TTS mode: pick an engine, browse 320+ voices in a dropdown, enter API key if needed. Existing mode: browse for a narration file. Both modes support optional background music. |
| 4. Align | Two-step flow: Step A runs WhisperX to auto-align and exports a single combined Audacity label file for QA. Step B imports the corrected labels and generates SMIL files. |
| 5. Finalize | Pick a highlight color with live preview, optionally replace the cover image, set the output filename, and pack the final EPUB. Includes a "Run Full Pipeline" button for one-click processing. |
| Restore | Strip all read-along functionality from an EPUB to prepare it for translation. Optionally remove audio files. |
Features:
- Drag & drop EPUB files onto the window
- Auto-restore previous session when reloading an EPUB (exclude list, level, audio map, labels)
- Narration script generation (HTML with formatting for dubbing talent + plain TXT for TTS)
- Edge TTS voice dropdown with all 320+ voices (grouped by language, English first)
- ElevenLabs voice loading via "Load Voices" button after entering API key
- OpenAI voices pre-populated (alloy, echo, fable, nova, onyx, shimmer, etc.)
- API key fields shown only when needed (hidden for free Edge TTS)
- Color picker for the overlay highlight with live preview
- Real-time log output showing each step's progress
- WhisperX -> Audacity QA workflow with a single combined label file (one file to open in Audacity, contains all words for the entire book)
- Replacement cover image picker for adding a play-button-equipped cover
- Tappable cover automatically enabled (tap the cover image to start read-along)
- "Run Full Pipeline" button to configure all tabs then run everything in one click
- Dark theme matching the karaoke generator GUI
The tool keeps the unpacked EPUB directory clean -- all working files are stored in a sibling _working folder so external tools can safely zip the EPUB folder without picking up debris.
MyBook.epub                (original)
MyBook/                    (unpacked EPUB -- only valid EPUB content)
MyBook_working/            (sibling working folder)
  MyBook_script.html       (narration script with formatting for dubbing talent)
  MyBook_script.txt        (plain text for TTS engines)
  audio_map.json           (page-to-audio mappings)
  audacity_labels/         (WhisperX export for QA in Audacity)
  .readalong_state.json    (pipeline state)
python readalong.py auto MyBook.epub \
--narrate --engine edge \
--exclude copyright.xhtml \
--overlaycolor yellow

python readalong.py auto MyBook.epub \
--narrate --engine edge \
--level sentence \
--exclude cover.xhtml copyright.xhtml

python readalong.py auto MyBook.epub \
--narration narration.mp3 \
--backgroundmusic soundtrack.mp3 \
--exclude copyright.xhtml \
--overlaycolor yellow

python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_stripped.epub

# 1. Unpack the EPUB and generate narration script
python readalong.py prepare MyBook.epub --exclude copyright.xhtml
# 2. Wrap all visible words in <span> tags with unique IDs
python readalong.py wrap MyBook --exclude copyright.xhtml
# 3. Generate narration audio from the text (or skip if you have audio)
python readalong.py narrate MyBook --engine edge
# 4. Generate word-level timestamps
python readalong.py align MyBook --method whisperx
# 5. Generate SMIL files from the timestamps
python readalong.py smil MyBook
# 6. Update the OPF, HTML namespaces, and CSS
python readalong.py finalize MyBook --overlaycolor yellow
# 7. Repack into an EPUB file
python readalong.py pack MyBook

python readalong.py prepare <epub_file> [options]
Unpacks the EPUB and analyzes its structure. Generates a narration script and audio map in the _working folder. If the EPUB already has audio, use --narration and --backgroundmusic to inject additional audio assets.
| Option | Description |
|---|---|
| `-o`, `--output` | Output directory (default: same name as EPUB without extension) |
| `--narration FILE` | Narration audio file to copy into the EPUB and register in the OPF manifest |
| `--backgroundmusic FILE` | Background music file to inject |
| `--exclude` | XHTML files to exclude from the narration script |
What it does:
- Extracts the EPUB ZIP into a directory
- Locates the OPF file via `META-INF/container.xml`
- If `--narration` is provided: copies the file to `OEBPS/audio/`, registers it in the OPF `<manifest>` with the correct media type
- If `--backgroundmusic` is provided: copies the file, registers it, and injects an iBooks-compatible `<audio>` element into every XHTML `<body>`
- Reports all XHTML files with word counts and whether they are already wrapped
- Creates `{name}_working/` folder with:
  - `{name}_script.html` -- formatted narration script preserving bold, italic, and headings from the EPUB. Designed to be opened in a browser and shared with dubbing talent.
  - `{name}_script.txt` -- plain text version for TTS engines
  - `audio_map.json` -- page-to-audio mappings (auto-detected matches pre-filled, empty values for pages without a match)
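
The OPF lookup step can be sketched as follows. This is a minimal illustration using only the standard library, not the tool's actual code; the function names `find_opf_path` and `find_opf_in_epub` are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by META-INF/container.xml in EPUB files.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml: bytes) -> str:
    """Return the OPF path declared by the first <rootfile> element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


def find_opf_in_epub(epub_path: str) -> str:
    """Read container.xml straight from the EPUB ZIP (hypothetical helper)."""
    with zipfile.ZipFile(epub_path) as zf:
        return find_opf_path(zf.read("META-INF/container.xml"))
```

The returned path is relative to the EPUB root, so for a typical layout it would look like `OEBPS/content.opf`.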
Supported audio formats: .mp3, .m4a, .mp4, .ogg, .oga, .wav, .webm, .aac, .flac
python readalong.py wrap <epub_dir> [--exclude ...] [--level word|sentence]
Wraps content in each XHTML file with <span> tags that the SMIL files reference for highlighting.
| Option | Description |
|---|---|
| `--exclude` | XHTML filenames to skip (e.g., `copyright.xhtml`) |
| `--level` | `word` (default) or `sentence` |
Word level (default): Wraps every visible word with <span id="WN">word</span>. Best for early readers / children's books where each word lights up individually.
- Parses each XHTML body by splitting on HTML tags
- Skips content inside `<script>`, `<style>`, and `<audio>` elements
- Preserves all existing HTML structure (classes, attributes, nesting)
- IDs reset to W1 on each page
Sentence level (--level sentence): Wraps the content of each <p> element with <span id="SN">...</span>. Best for older readers where the entire sentence highlights as it is read.
- Each `<p>` tag becomes one highlightable unit
- Paragraphs with no visible text (e.g., image-only) are skipped
- IDs reset to S1 on each page
- WhisperX still runs at the word level internally, then timestamps are grouped into sentence boundaries
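
The word-level approach (split on tags, skip `<script>`/`<style>`/`<audio>` content, number words from W1) can be sketched roughly as below. This is an illustrative approximation, not the tool's implementation: `wrap_words` is a hypothetical name, and the sketch assumes well-formed markup with no `>` characters inside comments or attribute values.

```python
import re

TAG_SPLIT = re.compile(r"(<[^>]+>)")          # keeps the tags in the split result
TAG_NAME = re.compile(r"</?\s*([A-Za-z0-9:]+)")
SKIP_TAGS = {"script", "style", "audio"}


def wrap_words(body_html: str) -> str:
    """Wrap each visible word in <span id="WN">, leaving tags untouched."""
    counter = 0
    skipping = False  # True while inside a <script>, <style>, or <audio> element

    def wrap(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f'<span id="W{counter}">{match.group(0)}</span>'

    out = []
    for piece in TAG_SPLIT.split(body_html):
        if piece.startswith("<"):
            name_match = TAG_NAME.match(piece)
            name = name_match.group(1).lower() if name_match else ""
            if name in SKIP_TAGS and not piece.endswith("/>"):
                skipping = not piece.startswith("</")
            out.append(piece)
        elif skipping:
            out.append(piece)                      # leave script/style/audio text alone
        else:
            out.append(re.sub(r"\S+", wrap, piece))  # wrap each run of non-whitespace
    return "".join(out)
```

Because the counter lives inside the function, calling `wrap_words` once per page gives the per-page reset to W1 described above.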
Example transformations:
<!-- Word level (default) -->
<p>The quick brown fox</p>
<!-- becomes -->
<p><span id="W1">The</span> <span id="W2">quick</span> <span id="W3">brown</span> <span id="W4">fox</span></p>
<!-- Sentence level (--level sentence) -->
<p>My name is Nate the Great.</p>
<p>I am a detective.</p>
<!-- becomes -->
<p><span id="S1">My name is Nate the Great.</span></p>
<p><span id="S2">I am a detective.</span></p>

python readalong.py narrate <epub_dir> [options]
Generates narration audio directly from the XHTML text using a text-to-speech engine. This replaces the need for a recording studio or a pre-recorded narration file.
| Option | Description |
|---|---|
| `--engine` | TTS engine: `edge` (default), `elevenlabs`, or `openai` |
| `--voice` | Voice name (engine-specific, see below) |
| `--per-page` | Generate one audio file per page instead of one for the whole book |
Available engines:
| Engine | Quality | Cost | Setup |
|---|---|---|---|
| edge | Good (Microsoft neural voices) | Free, no API key | pip install edge-tts |
| elevenlabs | Excellent (best for audiobooks) | Paid (free tier: 10k chars/mo) | pip install elevenlabs + ELEVENLABS_API_KEY env var |
| openai | Very good | Paid (~$15/1M chars HD) | pip install openai + OPENAI_API_KEY env var |
Voice examples:
- Edge: `en-US-AndrewMultilingualNeural` (default), `en-US-JennyNeural`, `en-GB-SoniaNeural`
  - List all voices: `edge-tts --list-voices`
- ElevenLabs: `Rachel` (default), `Bella`, `Antoni`, `Josh`
- OpenAI: `alloy` (default), `nova`, `shimmer`, `echo`, `fable`, `onyx`
What it does:
- Extracts text from each wrapped XHTML page (respecting `--exclude` from the wrap step)
- Single-audio mode (default): Concatenates all page text with paragraph breaks and generates one `narration.mp3` file
- Per-page mode (`--per-page`): Generates one audio file per XHTML page, named to match the XHTML file for automatic audio matching
- Registers the generated audio in the OPF manifest
- Updates `audio_map.json` with the new mappings
- Skips generation if the output file already exists (delete to regenerate)
python readalong.py align <epub_dir> [options]
Generates precise start/end timestamps for every word using either WhisperX (automatic) or Audacity label files (manual).
| Option | Description |
|---|---|
| `--method` | `whisperx` (default) or `audacity` |
| `--labels-dir DIR` | Directory containing Audacity `.txt` label exports (required for `audacity` method) |
| `--audio-map` | Manual audio mapping overrides on the command line: `page1.xhtml=audio/file1.m4a page2.xhtml=audio/file2.m4a` |
| `--file-audio-map FILE` | JSON file with page-to-audio mappings (the `audio_map.json` generated by `prepare` is designed for this) |
WhisperX mode (default):
The tool automatically detects whether the EPUB uses per-page audio or single-audio (one narration file covers the entire book).
- Per-page mode: Runs WhisperX independently on each audio file matched to its XHTML page
- Single-audio mode: Runs WhisperX once on the entire narration file, then splits the resulting word timestamps across pages based on word counts and spine order
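
For single-audio mode, the split across pages can be pictured as slicing one flat list of word timestamps by each page's word count, in spine order. The sketch below assumes the counts match the recognized words exactly (the real tool falls back to best-effort alignment on a mismatch); `split_timestamps` and its argument shapes are hypothetical.

```python
from typing import Dict, List, Tuple

Timestamp = Tuple[float, float]  # (start_seconds, end_seconds) for one word


def split_timestamps(
    timestamps: List[Timestamp],
    page_word_counts: List[Tuple[str, int]],
) -> Dict[str, List[Timestamp]]:
    """Assign consecutive word timestamps to pages, in spine order.

    page_word_counts is a list of (page_name, wrapped_word_count) pairs,
    already sorted in spine order.
    """
    result: Dict[str, List[Timestamp]] = {}
    cursor = 0
    for page, count in page_word_counts:
        result[page] = timestamps[cursor:cursor + count]
        cursor += count
    return result
```

If the narration has fewer recognized words than the book has spans, later pages simply receive shorter (or empty) lists in this sketch.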
Audio file matching rules (in priority order):
- `--file-audio-map` / `--audio-map`: Manual overrides take highest priority
- Basename match: `foxandgrapes.xhtml` looks for `foxandgrapes.m4a`, `foxandgrapes.mp3`, etc.
- Designated narration: If `--narration` was used in the prepare step, that file is the fallback
- Single narration heuristic: If only one non-soundtrack audio file exists, it is used
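
The matching rules above amount to a fall-through lookup. The following sketch shows one way that priority order could be expressed; `resolve_audio` and its parameters are hypothetical and not taken from the tool's source.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional

AUDIO_EXTS = (".mp3", ".m4a", ".mp4", ".ogg", ".oga", ".wav", ".webm", ".aac", ".flac")


def resolve_audio(
    page: str,
    audio_files: List[str],
    overrides: Optional[Dict[str, str]] = None,
    narration: Optional[str] = None,
    soundtrack: Optional[str] = None,
) -> Optional[str]:
    """Pick the audio file for one XHTML page, or None if nothing matches."""
    # 1. Manual overrides win.
    if overrides and page in overrides:
        return overrides[page]
    # 2. Same basename with a supported audio extension.
    stem = PurePosixPath(page).stem
    for candidate in audio_files:
        path = PurePosixPath(candidate)
        if path.stem == stem and path.suffix.lower() in AUDIO_EXTS:
            return candidate
    # 3. Narration file designated during prepare.
    if narration:
        return narration
    # 4. Exactly one non-soundtrack audio file.
    remaining = [f for f in audio_files if f != soundtrack]
    if len(remaining) == 1:
        return remaining[0]
    return None
```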
WhisperX uses the large-v2 model and auto-detects CUDA vs CPU.
Audacity mode:
python readalong.py align MyBook --method audacity --labels-dir ./labels/

Expects one `.txt` file per XHTML page in the labels directory, named to match the XHTML file (e.g., `foxandgrapes.txt` for `foxandgrapes.xhtml`). Each file should be an Audacity label export with tab-separated columns: `start_time\tend_time\tlabel`.
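
Reading a label export in that three-column format takes only a few lines. The sketch below is an illustration rather than the tool's parser; `parse_audacity_labels` is a hypothetical name, and skipping lines that start with a backslash assumes they are the optional frequency-range lines Audacity can add for spectral selections.

```python
from typing import List, Tuple

Label = Tuple[float, float, str]  # (start_seconds, end_seconds, label_text)


def parse_audacity_labels(text: str) -> List[Label]:
    """Parse tab-separated 'start<TAB>end<TAB>label' lines."""
    labels: List[Label] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("\\"):
            continue  # blank line or (assumed) spectral-selection line
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a complete label row
        labels.append((float(parts[0]), float(parts[1]), parts[2].strip()))
    return labels
```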
Recommended QA workflow (GUI):
The GUI's two-step Align flow makes WhisperX QA much easier:
- Step A runs WhisperX and exports a single combined label file (`{book}_labels.txt`) covering all pages. Each label uses the format `page:word_id` (e.g., `cover:W1`, `p04:W6`) so the entire book is in one timeline.
- Open the narration audio in Audacity, then File > Import > Labels and select the combined file. All words for the entire book appear on one timeline.
- Drag label boundaries to fix any timing issues, especially around compound words (like "hook-and-ladder") which WhisperX often splits incorrectly.
- File > Export Other > Labels -- save back to the same combined file.
- Step B in the GUI reads the combined file, splits it back into per-page timestamps, and generates SMIL files.
This is much faster than opening one label file per page. After fixing the labels, you only need to re-run Step B and Finalize -- no need to re-run the full pipeline.
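
Splitting the combined file back into per-page timestamps is essentially a group-by on the `page:word_id` prefix. A rough sketch, assuming labels have already been parsed into `(start, end, label)` tuples; the function name and the output shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_combined_labels(
    labels: List[Tuple[float, float, str]],
) -> Dict[str, List[Tuple[str, float, float]]]:
    """Group 'page:word_id' labels by page, keeping timeline order."""
    pages: Dict[str, List[Tuple[str, float, float]]] = defaultdict(list)
    for start, end, label in labels:
        page, sep, word_id = label.rpartition(":")
        if not sep or not page:
            continue  # label does not follow the page:word_id format
        pages[page].append((word_id, start, end))
    return dict(pages)
```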
python readalong.py smil <epub_dir>
Creates one SMIL file per XHTML page from the timestamps generated by align. SMIL files are placed alongside their XHTML counterparts with correct relative audio paths.
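
To make the SMIL structure concrete, here is a minimal sketch that renders one file from a list of `(span_id, start, end)` timings. It follows the general EPUB 3 Media Overlays shape (`<par>` with a `<text>` and an `<audio>` child) but is not the tool's generator; `build_smil` and the `parN` ID scheme are assumptions.

```python
from typing import List, Tuple
from xml.sax.saxutils import quoteattr


def build_smil(
    xhtml_name: str,
    audio_src: str,
    timings: List[Tuple[str, float, float]],
) -> str:
    """Render a Media Overlays document for one XHTML page."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">',
        "  <body>",
    ]
    for n, (span_id, start, end) in enumerate(timings, 1):
        text_src = quoteattr(f"{xhtml_name}#{span_id}")  # fragment points at the span
        lines += [
            f'    <par id="par{n}">',
            f"      <text src={text_src}/>",
            f"      <audio src={quoteattr(audio_src)} "
            f'clipBegin="{start:.3f}s" clipEnd="{end:.3f}s"/>',
            "    </par>",
        ]
    lines += ["  </body>", "</smil>"]
    return "\n".join(lines) + "\n"
```

The `audio_src` argument is expected to already be relative to the SMIL file's location, for example `audio/narration.mp3` when the SMIL sits next to the XHTML in the EPUB content folder.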
python readalong.py finalize <epub_dir> [--overlaycolor COLOR] [--cover IMAGE]
Applies all remaining read-along metadata changes to make the EPUB spec-compliant.
| Option | Description |
|---|---|
| `--overlaycolor` | CSS color for the word highlight background (default: `yellow`). Accepts any valid CSS color: named colors, hex (`#00ccff`), `rgb()`, etc. |
| `--cover` | Replacement cover image file (e.g., a cover with a play button overlay). The original cover filename is preserved so manifest references stay valid. |
What it does:
- Updates `content.opf`: bumps version to 3.0, adds duration/active-class metadata, registers SMIL files, adds `media-overlay` attributes to XHTML items
- Updates HTML namespaces (`xmlns:ibooks`, `xmlns:epub`) on all non-excluded XHTML files
- Adds media overlay CSS (`.-epub-media-overlay-active`) to the main stylesheet
- Replaces the cover image if `--cover` is provided
- Makes the cover tappable to start read-along by adding `ibooks:readaloud="startstop"` to the cover `<img>` tag
python readalong.py pack <epub_dir> [-o output.epub]
Repacks the directory into a valid .epub ZIP file. Writes mimetype first and uncompressed (per EPUB spec). Backs up existing output to .epub.bak.
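
The "mimetype first and uncompressed" requirement is easy to get wrong with a generic zip tool. A minimal standard-library sketch of the idea follows; `pack_epub` is a hypothetical name, and the `.epub.bak` backup step is left out.

```python
import os
import zipfile


def pack_epub(epub_dir: str, output_path: str) -> None:
    """Zip epub_dir into output_path with mimetype as the first, stored entry."""
    with zipfile.ZipFile(output_path, "w") as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.write(
            os.path.join(epub_dir, "mimetype"),
            "mimetype",
            compress_type=zipfile.ZIP_STORED,
        )
        for folder, _dirs, files in os.walk(epub_dir):
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                arcname = os.path.relpath(full_path, epub_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                zf.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Write the output outside `epub_dir`, otherwise the archive being created would be picked up by the directory walk.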
python readalong.py strip <epub_dir> [--remove-audio]
Completely removes all read-along artifacts from an EPUB to prepare it for translation or re-recording. This is the inverse of the entire pipeline.
| Option | Description |
|---|---|
| `--remove-audio` | Also delete audio files from `OEBPS/audio/` |
What it removes:
| Layer | Artifacts removed |
|---|---|
| SMIL | All .smil files deleted |
| OPF | media-overlay attributes, SMIL manifest entries, media:duration and media:active-class metadata |
| XHTML | Word spans (`<span id="W1">` unwrapped), sentence spans (`<span id="S1">` unwrapped), `<audio epub:type="ibooks:soundtrack">` tags, play/stop buttons (`<p ibooks:readaloud="...">`), `xmlns:ibooks` namespace |
| CSS | .-epub-media-overlay-active rule, .-media-overlay-active rule, play/stop button styles (#raplay, #rastop, #rass), .-ibooks-media-overlay-enabled rules |
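
As an illustration of the XHTML layer, unwrapping word spans can be done with a single substitution, because a word span contains only text. This sketch covers word spans only; sentence spans may contain nested inline markup and would need a real parser. `unwrap_word_spans` is a hypothetical name, and the exact attribute layout (`id` as the only attribute, double-quoted) is an assumption.

```python
import re

# Matches <span id="W12">word</span> where the span has no other attributes.
WORD_SPAN = re.compile(r'<span id="W\d+">(.*?)</span>', re.DOTALL)


def unwrap_word_spans(xhtml: str) -> str:
    """Replace each word span with its inner text, leaving other markup alone."""
    return WORD_SPAN.sub(r"\1", xhtml)
```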
After stripping, use pack to create a clean EPUB:
python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_for_translation.epub

python readalong.py auto <epub_file> [all options from other commands]
Runs the full pipeline in sequence: prepare, wrap, (narrate), align, smil, finalize, pack. The output EPUB is named <original>_readalong.epub. The narrate step only runs when --narrate is specified.
| Option | Description |
|---|---|
| `--narration FILE` | Existing narration audio file to inject |
| `--backgroundmusic FILE` | Background music file to inject |
| `--narrate` | Generate narration via TTS instead of providing `--narration` |
| `--engine` | TTS engine for `--narrate`: `edge` (default), `elevenlabs`, `openai` |
| `--voice` | TTS voice name (engine-specific) |
| `--per-page` | Generate one audio per page instead of one for the whole book |
| `--level` | `word` (default) or `sentence` -- highlight granularity |
| `--exclude` | XHTML files to skip for wrapping |
| `--method` | `whisperx` or `audacity` |
| `--labels-dir DIR` | Audacity labels directory |
| `--audio-map` | Manual page-to-audio mappings on the command line |
| `--file-audio-map FILE` | JSON file with page-to-audio mappings |
| `--overlaycolor COLOR` | Highlight color (default: `yellow`) |
| `--cover IMAGE` | Replacement cover image (e.g., a cover with a play button) |
Read-along EPUBs use EPUB 3 Media Overlays, which synchronize text with audio at the word or sentence level. The key components are:
- SMIL files: One per XHTML page. Each `<par>` element pairs a word or sentence (via its span ID in the XHTML) with an audio clip (start and end time in the narration audio).
- `content.opf` metadata: Declares the SMIL files, links them to XHTML pages via `media-overlay` attributes, and specifies the CSS class applied to the active word.
- CSS highlight class: The `.-epub-media-overlay-active` class determines how the currently-spoken word looks (color, background, etc.).
- HTML namespaces: The `ibooks` and `epub` namespaces enable reader-specific features like soundtrack playback and read-along buttons.
When the tool needs to find the audio file for a given XHTML page, it checks in this order:
- Manual overrides (`--file-audio-map` and `--audio-map`): Checked first. CLI values override JSON values.
- Basename match: `foxandgrapes.xhtml` looks for `foxandgrapes.m4a`, `foxandgrapes.mp3`, etc.
- Designated narration: If `--narration` was used in the prepare step, that file is the fallback.
- Single narration heuristic: If only one non-soundtrack audio file exists, it is used.
- Per-page audio (like BabysOwnAesop): Each XHTML page has its own audio file. WhisperX runs independently on each.
- Single audio (common for new projects): One audio file covers the entire book. WhisperX processes it once, then timestamps are split across pages in spine order.
- Start with a plain EPUB
- Run the pipeline:
# Fully automated -- no audio files needed!
python readalong.py auto MyBook.epub \
--narrate --engine edge \
--exclude copyright.xhtml \
--overlaycolor lightblue

- Open the output EPUB in Apple Books, Thorium, or another EPUB 3 reader and test
- If something is off, fix and re-run individual steps (state is preserved between runs)
- Strip read-along from the source-language EPUB:
python readalong.py strip MyBook
python readalong.py pack MyBook -o MyBook_stripped.epub

- Translate the text in the stripped EPUB
- Record or generate localized narration
- Run the pipeline on the translated EPUB to add read-along back
- Launch `python readalong_gui.py`
- Drag & drop your EPUB onto the window
- Configure settings across any tabs (exclude files, word/sentence level, TTS voice, overlay color)
- Click "Run Full Pipeline" on the Finalize tab -- or run each step individually
- Use the Restore tab to strip read-along when preparing for translation
- WhisperX word count mismatch: The tool prints a warning and uses best-effort sequential alignment. WhisperX often splits compound words like "hook-and-ladder" into separate tokens; the tool tries to merge them back, but you may need to fix some boundaries in Audacity.
- Drift/timing issues with WhisperX: WhisperX gets you 80-90% of the way there but its word boundaries lean slightly early and accumulate drift through the book. Use the GUI's two-step Align flow: WhisperX exports a single combined label file that you can open in Audacity, fix in one session, and re-import via Step B.
- WhisperX `ImportError` for `lightning`: Install lightning: `pip install lightning`
- WhisperX `RuntimeError: Attempting to deserialize object on a CUDA device`: WhisperX ships a model checkpoint saved on CUDA. Re-save it so it loads on CPU:

      import torch
      path = '<your_python_site_packages>/whisperx/assets/pytorch_model.bin'
      checkpoint = torch.load(path, map_location='cpu', weights_only=False)
      torch.save(checkpoint, path)

- Duration shows 0:00:00.000: Install `mutagen` (`pip install mutagen`) or `ffprobe` (from FFmpeg) for audio duration detection.
- EPUB validation errors: Upload to an EPUB validator to identify issues. Common problems include missing manifest entries or malformed XHTML.
- Read-along doesn't play: Ensure the EPUB reader supports Media Overlays (Apple Books, Thorium). Verify that `media-overlay` attributes in the OPF point to the correct SMIL IDs.
- Cover doesn't start playback when tapped: The `finalize` step automatically adds `ibooks:readaloud="startstop"` to the cover image. Verify the cover XHTML has this attribute on the `<img>` tag inside `<div id="cover">`.
- Re-running a step: The `.readalong_state.json` file in the `_working` folder preserves state, so you can re-run any step without starting over. After fixing labels in Audacity, you only need to re-run Step B (Import Labels & Generate SMIL) and Finalize -- not the entire pipeline.
- EPUB folder has extra files: Working files (scripts, labels, state) are stored in the `_working` sibling folder, never inside the EPUB directory. The EPUB folder can be safely zipped by any external tool.