SpeechifyPDF seamlessly transforms your PDF documents into natural, high-quality spoken audio. Perfect for commuters, multi-taskers, and auditory learners, our app accurately extracts text and leverages advanced Text-to-Speech technology to bring your files to life. Experience a more accessible, convenient way to consume written content anywhere.
A single-app PDF reader powered by FastAPI and the Kokoro TTS model, with real-time word-level highlighting as it reads aloud. Built for CPU-only systems with per-sentence streaming for responsive playback.
- High-quality TTS: Synthesizes speech from PDFs using Kokoro, a lightweight 82M-parameter model
- Per-word highlighting: Each word lights up as the model speaks it
- CPU-optimized: Streams one sentence at a time; prefetches ahead for gap-free playback
- Modern dark UI: Sleek design with glassmorphism, ambient glows, and smooth animations
- Voice & speed controls: Multiple voices across US and UK English; adjustable playback speed (0.5× to 1.75×)
- Reflowed text: PDFs are extracted and reflowed for easy reading
- Single app: FastAPI serves both the API and the frontend (one process, one port)
No prior experience needed. The whole thing runs locally: no API keys, no cloud, no payment.
- A computer running Linux, macOS, or Windows with about 8 GB of RAM. Any modern CPU works; no graphics card is required.
- Miniconda (a free tool that manages Python versions for you). If you don't have it, download and install it from https://www.anaconda.com/download/success (pick "Miniconda"). After installing, close and reopen your terminal so the conda command becomes available.
  - Quick check: type conda --version and press Enter. If it prints a version number, you're good.
- An internet connection for the first run only: the voice model (~350 MB) is downloaded automatically the first time you read a PDF, then cached for later runs.
Why Miniconda? SpeechifyPDF needs Python 3.10 to 3.12 (the Kokoro voice model does not support Python 3.13 yet). Conda installs the right Python version in an isolated "environment" so it never interferes with anything else on your machine.
cd SpeechifyPDF
Copy-paste these four lines one at a time:
conda create -n speechifyPDF python=3.12 -y # make an isolated Python 3.12
conda activate speechifyPDF # switch into it
pip install torch --index-url https://download.pytorch.org/whl/cpu # CPU version of PyTorch
pip install -r backend/requirements.txt # the rest of the dependencies
This may take a few minutes. It only has to be done once.
On Linux/macOS:
./run.sh
(If you get a "permission denied" message, run bash run.sh instead.)
On Windows: double-click run.bat, or in a terminal run:
run.bat
When it's ready you'll see a line like:
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
Go to http://127.0.0.1:8000. Drag a PDF onto the page (or click to choose one), then press the Space bar, or click any sentence, to start listening. Each word lights up as it's read.
First read is slower. The very first sentence triggers a one-time model download and warm-up, so it can take 10-30 seconds. After that, sentences are synthesized in a few seconds and cached for instant replay.
Go back to the terminal and press Ctrl + C.
You only do Step 2 once. To run the app again later:
cd SpeechifyPDF
conda activate speechifyPDF # run.sh does this for you, so this line is optional
./run.sh
Prefer to run it manually instead of the script?
conda activate speechifyPDF
cd backend
uvicorn app:app --host 127.0.0.1 --port 8000
# then visit http://127.0.0.1:8000
- PDF Extraction (pdf_utils.py): Extracts text, filters non-printable characters, and splits it into sentences (~280 chars max for responsiveness).
- TTS (tts.py): Uses Kokoro's KPipeline to synthesize each sentence, returning per-word timings (start_ts, end_ts).
- API (app.py): Serves endpoints:
  - POST /api/upload: parses the PDF, returns the document structure
  - GET /api/voices: lists available TTS voices
  - GET /api/tts/{doc_id}/{sentence_id}: returns audio (base64 WAV) + word timings
- Static files: FastAPI serves the frontend from backend/static/.
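As a rough illustration of the sentence-splitting step, here is a minimal sketch. The function name split_sentences, the regex boundary rule, and the word-boundary fallback are assumptions made for illustration, not the project's actual code; only the ~280-character cap comes from the description above.

```python
import re

MAX_CHARS = 280  # approximate per-sentence cap mentioned in the README


def split_sentences(text: str, max_chars: int = MAX_CHARS) -> list:
    """Split text into sentences, then break over-long ones at word boundaries.

    Hypothetical helper: the real pdf_utils.py may use different rules.
    """
    # Collapse runs of whitespace left over from PDF extraction.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []

    # Naive boundary rule: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks = []
    for sentence in sentences:
        current = ""
        for word in sentence.split(" "):
            candidate = f"{current} {word}" if current else word
            if current and len(candidate) > max_chars:
                # Adding this word would exceed the cap: close the chunk.
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

A single word longer than the cap is kept whole rather than cut mid-word, so the cap is approximate in this sketch too.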
- Upload: Drag-and-drop or click to upload a PDF. Document text appears immediately.
- Playback: Click a sentence or press Space to start reading. Click any word to seek within that sentence.
- Highlighting: A requestAnimationFrame (rAF) loop maps <audio>.currentTime to the active word, updating the highlight in real time.
- Prefetch: While a sentence plays, the next is fetched in the background for gap-free playback.
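The highlighting loop itself runs in the browser (app.js), but the lookup it performs on each frame can be sketched in Python. This is an illustration only: the function name active_word_index is hypothetical, and it assumes timings are (start_ts, end_ts) pairs in seconds, sorted by start time.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def active_word_index(
    timings: List[Tuple[float, float]], current_time: float
) -> Optional[int]:
    """Return the index of the word being spoken at current_time, or None.

    timings: (start_ts, end_ts) pairs sorted by start_ts (assumed, in seconds).
    """
    starts = [start for start, _ in timings]
    # Last word whose start is <= current_time.
    i = bisect_right(starts, current_time) - 1
    if i < 0:
        return None  # playback has not reached the first word yet
    _, end = timings[i]
    # Between words (or after the last one) nothing is highlighted.
    return i if current_time < end else None
```

Binary search keeps the per-frame cost small even for long sentences; a linear scan would also work at this scale.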
SpeechifyPDF/
├── backend/
│   ├── app.py             # FastAPI server + endpoints + static file serving
│   ├── pdf_utils.py       # PDF extraction & sentence splitting
│   ├── tts.py             # Kokoro pipeline wrapper
│   ├── requirements.txt
│   └── static/
│       ├── index.html     # Landing page + reader UI
│       ├── styles.css     # Dark theme styling
│       └── app.js         # Upload, playback, word highlighting logic
├── run.sh                 # Linux/macOS launcher
├── run.bat                # Windows launcher
└── README.md
- Single app: No separate frontend build step; FastAPI serves static HTML/CSS/JS directly alongside the API.
- Per-sentence streaming: Synthesis is CPU-bound; generating one sentence at a time keeps the UI responsive.
- Reflowed text: PDFs are extracted as plain text, avoiding layout complexity. Paragraphs and sentences are preserved.
- Token-based alignment: Kokoro tokens are grouped into display words (e.g., "dog" + "." → "dog.") using whitespace metadata.
- Single audio element: A shared <audio> element cycles through sentences; the rAF loop syncs highlighting to currentTime.
- Caching: Generated audio is cached in-memory (LRU on max documents), making replays instant.
- Keyboard: Space = play/pause | arrow keys = prev/next sentence
- Click to seek: Click any sentence to jump there; click a word to seek within the active sentence
- Auto-scroll: The current word auto-scrolls into view (center of screen)
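The token-based alignment described in the design notes above (grouping tokens such as "dog" and "." into the display word "dog.") might look roughly like the sketch below. The Token and Word classes are stand-ins defined here for illustration; the field names text, whitespace, start_ts, and end_ts are assumed to mirror what the Kokoro pipeline returns, and the grouping rule (a token with trailing whitespace closes the word) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Stand-in for a TTS token (hypothetical shape)."""
    text: str
    whitespace: str            # trailing whitespace, "" if none
    start_ts: Optional[float]  # seconds; may be missing for punctuation
    end_ts: Optional[float]


@dataclass
class Word:
    text: str
    start_ts: Optional[float]
    end_ts: Optional[float]


def _merge(parts: List[Token]) -> Word:
    # A display word spans from the earliest start to the latest end.
    starts = [p.start_ts for p in parts if p.start_ts is not None]
    ends = [p.end_ts for p in parts if p.end_ts is not None]
    return Word(
        text="".join(p.text for p in parts),
        start_ts=min(starts) if starts else None,
        end_ts=max(ends) if ends else None,
    )


def group_tokens(tokens: List[Token]) -> List[Word]:
    """Group consecutive tokens into display words."""
    words: List[Word] = []
    parts: List[Token] = []
    for tok in tokens:
        parts.append(tok)
        if tok.whitespace:  # trailing whitespace ends the current word
            words.append(_merge(parts))
            parts = []
    if parts:  # flush the final word (no trailing whitespace)
        words.append(_merge(parts))
    return words
```

Punctuation tokens without timestamps inherit the timing of the word they attach to, which keeps the highlight on "dog." for the full duration of "dog".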
- CPU performance: On slower CPUs, synthesis may take 5-10 seconds per sentence. Prefetching happens in a background thread.
- Kokoro models: Available voices include US & UK English female/male, Brazilian Portuguese, Spanish, French, Hindi, Italian, Japanese, Mandarin.
- Language codes: 'a' (US), 'b' (UK), 'e' (Spanish), 'f' (French), 'h' (Hindi), 'i' (Italian), 'j' (Japanese), 'p' (Portuguese), 'z' (Mandarin).
- Memory: The model + torch weights stay in RAM. On constrained systems, restart between long sessions.
- Audio quality: 24 kHz mono, 16-bit PCM WAV. Playback quality depends on your system audio.
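To make the audio format concrete, the sketch below packs float samples into a 24 kHz mono 16-bit PCM WAV and base64-encodes it, similar in spirit to what the TTS endpoint returns. The helper name encode_wav_base64 is hypothetical, and it assumes samples are floats in the range -1.0 to 1.0; the project's own encoding code may differ.

```python
import base64
import io
import struct
import wave

SAMPLE_RATE = 24_000  # 24 kHz, as stated in the notes above


def encode_wav_base64(samples, sample_rate: int = SAMPLE_RATE) -> str:
    """Encode float samples (assumed in [-1.0, 1.0]) as a base64 WAV string."""
    # Clip each sample, then scale to a signed 16-bit integer.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)  # little-endian int16

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

At this format, one second of audio is 48,000 bytes of PCM data before base64 encoding, which adds roughly a third on top.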
conda: command not found - Miniconda isn't installed yet, or the terminal was opened before installing it. Install it from https://www.anaconda.com/download/success, then close and reopen your terminal.
./run.sh: Permission denied - Run it as bash run.sh instead, or make it executable once with chmod +x run.sh.
Install fails mentioning Python 3.13 / "no matching distribution" - Kokoro requires Python 3.10 to 3.12. Make sure you created the environment with python=3.12 (Step 2) and that the speechifyPDF environment is active (conda activate speechifyPDF).
Address already in use / port 8000 busy - Another program (or a previous run) is using port 8000. Stop the old one, or start on a different port: uvicorn app:app --port 8001 and visit http://127.0.0.1:8001.
"No extractable text found" - The PDF is a scanned image (just pictures of pages), so there's no real text to read. SpeechifyPDF reads text-based PDFs; it does not do OCR. Try a PDF where you can select/copy the text.
Nothing plays / no sound - Check your system volume and that the browser tab isn't muted. The first sentence also takes longer (one-time model download).
Slow synthesis - On a CPU, the first sentence can take 10-30 seconds (model warm-up) and later sentences a few seconds each. This is normal. Closing other heavy apps frees up RAM/CPU and speeds it up.
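For the port 8000 conflict described above, a small script can check whether a port is free before starting the server. This is a minimal sketch using only the standard library; the helper names port_is_free and first_free_port are hypothetical and not part of SpeechifyPDF.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False  # typically "Address already in use"
    return True


def first_free_port(start: int = 8000, attempts: int = 20) -> Optional[int]:
    """Return the first free port at or after start, or None if none found."""
    for port in range(start, min(start + attempts, 65536)):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("No free port found near 8000")
    else:
        print(f"Try: uvicorn app:app --host 127.0.0.1 --port {port}")
```

The check is a snapshot: another program could still claim the port between the check and the server start, so treat the result as a hint.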
"espeak-ng library: ..." β Informational: the bundled espeak-ng library is loaded for grapheme-to-phoneme conversion. No action needed unless it reports a failure.
This app wraps Kokoro (Apache 2.0) and uses FastAPI and other open-source libraries. See their respective licenses.