Professional offline text-to-speech studio with voice cloning capabilities
Transform books, documents, and any text into natural-sounding audiobooks using state-of-the-art XTTS-v2 technology. After the one-time model download, everything runs locally: no internet connection is required and your text never leaves your machine.
- Key Features
- Installation
- Usage Guide
- Architecture
- Voice Cloning Guide
- Smart Text Processing
- Configuration & Settings
- Troubleshooting
- What Makes This Interesting
- License
- Acknowledgments
- Automatic Text Cleaning: Removes emojis, fixes problematic characters, normalizes line endings
- Multi-format Support: Load TXT, MD, and PDF files with automatic preprocessing
- Intelligent Character Replacement: Converts symbols to speech-friendly text (™ → "trademark", € → "euros")
- Line Ending Normalization: Handles CRLF/LF issues automatically
- Voice Cloning: Clone any voice from a WAV sample using XTTS-v2
- Streaming Playback: Hear results immediately as synthesis progresses
- Professional Quality: State-of-the-art neural TTS with natural prosody
- Export Options: Save as WAV or MP3 with configurable quality
- Standard Shortcuts: Ctrl+A (select all), Ctrl+F (find/replace), Ctrl+Z/Y (undo/redo)
- Find & Replace: Advanced search with regex support
- Smart Preprocessing: One-click text cleaning for optimal TTS results
- Undo Support: Restore original text after cleaning operations
- Streaming Architecture: Start hearing audio within seconds, not minutes
- Chunked Processing: Efficiently handle large documents (books, reports)
- Cross-platform: Works on Windows, macOS, and Linux
- Dual Interface: Both GUI and CLI for different workflows
- Professional UX: Progress tracking, status updates, and error handling
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.8-3.11 | 3.12+ not yet supported by TTS dependencies |
| FFmpeg | Latest | Required for audio processing |
| GPU | CUDA-compatible | Optional, but 5-10x faster than CPU |
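
The requirements in the table can be checked before installing anything else. The following is a minimal sketch, not part of the project: the function name `check_environment` and the report keys are assumptions made for illustration, and the CUDA check only works once PyTorch is installed.

```python
import shutil
import sys


def check_environment():
    """Return a dict describing whether each requirement from the table is met."""
    report = {}

    # Python 3.8 through 3.11 is the supported range per the table.
    report["python_ok"] = (3, 8) <= sys.version_info[:2] <= (3, 11)

    # FFmpeg must be discoverable on PATH; None means it was not found.
    report["ffmpeg_path"] = shutil.which("ffmpeg")

    # The GPU is optional. PyTorch may not be installed yet, so import lazily.
    try:
        import torch
        report["cuda_available"] = bool(torch.cuda.is_available())
    except (ImportError, OSError):
        report["cuda_available"] = None  # unknown until dependencies are installed

    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A value of `None` for `cuda_available` only means PyTorch could not be imported, not that the GPU is missing.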
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html

1. Clone and setup:
git clone https://github.com/yourusername/local-tts-studio.git
cd local-tts-studio
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Launch GUI:
python run.py

3. Start creating: Load a document, optionally load a voice sample, and click "Speak"!
First run: The app will automatically download the XTTS-v2 model (~1.8GB) on first use.
Launch the application:
python run.py

Workflow:
Load Content → Load Voice (optional) → Generate → Save Audio
- Load Content: File → Open Text/PDF (automatically cleaned)
- Optional Voice: Voice → Load Voice Sample (for cloning)
- Generate: Click "Speak" (streaming playback starts immediately)
- Save: Click "Save Audio" to export WAV/MP3
Text Editor Features:
- Ctrl+A: Select all text
- Ctrl+F: Find and replace with regex support
- Edit → Clean Text for TTS: Manual text cleaning
- Edit → Undo Text Cleaning: Restore original text
Basic conversion:
python tts_cli.py convert input.txt output.wav

With voice cloning:
python tts_cli.py convert book.pdf audiobook.mp3 --voice voice_sample.wav

CLI Options:
| Flag | Description | Default |
|---|---|---|
| --voice, -v | WAV file for voice cloning | Built-in voice |
| --chunk-size, -c | Max characters per chunk | 200 |
| --gpu | Enable GPU acceleration | Auto-detect |
| --verbose | Enable detailed logging | False |
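
For readers who want to see how these flags might be wired up, here is a hedged `argparse` sketch. It is not the actual `tts_cli.py`: the function name `build_parser` and the choice to model "Auto-detect" as `None` are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical wiring)."""
    parser = argparse.ArgumentParser(prog="tts_cli.py")
    sub = parser.add_subparsers(dest="command", required=True)

    convert = sub.add_parser("convert", help="Convert a document to audio")
    convert.add_argument("input", help="TXT, MD, or PDF file")
    convert.add_argument("output", help="Target .wav or .mp3 file")
    convert.add_argument("--voice", "-v", default=None,
                         help="WAV file for voice cloning")
    convert.add_argument("--chunk-size", "-c", type=int, default=200,
                         help="Max characters per chunk")
    # None means "auto-detect"; passing --gpu forces it to True (assumption).
    convert.add_argument("--gpu", action="store_true", default=None,
                         help="Enable GPU acceleration")
    convert.add_argument("--verbose", action="store_true",
                         help="Enable detailed logging")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["convert", "book.pdf", "audiobook.mp3", "--voice", "voice_sample.wav", "-c", "150"]
    )
    print(args.voice, args.chunk_size, args.gpu)  # voice_sample.wav 150 None
```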
# Convert a PDF to MP3 with custom voice
python tts_cli.py convert manual.pdf manual_audio.mp3 --voice john.wav --gpu
# Convert markdown to WAV
python tts_cli.py convert README.md readme_audio.wav
# Process large book with optimized chunks
python tts_cli.py convert large_book.txt book.mp3 --chunk-size 150

Built with professional software engineering practices:
local-tts-studio/
├── src/
│   ├── core/
│   │   └── tts_engine.py               # TTS synthesis engine
│   ├── gui/
│   │   ├── main_window.py              # Main application GUI
│   │   └── text_editor_enhancements.py # Advanced text editing
│   ├── utils/
│   │   ├── text_processing.py          # File loading and chunking
│   │   ├── text_preprocessing.py       # Smart text cleaning pipeline
│   │   └── audio_utils.py              # Streaming audio playback
│   └── config/
│       └── settings.py                 # Configuration management
├── tests/                              # Unit tests
├── run.py                              # GUI entry point
├── tts_cli.py                          # CLI interface
└── requirements.txt                    # Python dependencies
- Modular Architecture: Separate concerns for maintainability
- Streaming Processing: Real-time audio generation and playback
- Professional UX: Standard shortcuts, progress feedback, error handling
- Intelligent Preprocessing: Automatic text optimization for TTS
- Cross-Platform: Works on Windows, macOS, Linux
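
The streaming idea in the list above (audio starts playing while later chunks are still being synthesized) can be illustrated with a producer/consumer queue. This is a generic sketch of that pattern, not the project's code: `stream_chunks`, `synthesize`, and `play` are hypothetical names, and the demo uses stand-in callables instead of a real TTS engine.

```python
import queue
import threading


def stream_chunks(chunks, synthesize, play):
    """Synthesize chunks in a background thread while this thread plays them."""
    audio_queue = queue.Queue(maxsize=4)  # small buffer keeps memory bounded
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for chunk in chunks:
                audio_queue.put(synthesize(chunk))
        finally:
            # Always signal completion, even if synthesis raised.
            audio_queue.put(done)

    threading.Thread(target=producer, daemon=True).start()

    # Consume in order; playback of chunk N overlaps synthesis of chunk N+1.
    while True:
        item = audio_queue.get()
        if item is done:
            break
        play(item)


if __name__ == "__main__":
    played = []
    stream_chunks(["Hello.", "World."], synthesize=str.upper, play=played.append)
    print(played)  # ['HELLO.', 'WORLD.']
```

The bounded queue means synthesis pauses when it gets too far ahead of playback, which keeps memory use flat for long documents.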
- Quality Recording: Use a clear WAV file (10-30 seconds)
- Clean Audio: Minimal background noise, consistent volume
- Natural Speech: Include varied intonation and speech patterns
- Technical Specs: 22050 Hz sample rate or higher recommended
- Avoid monotone: Include questions, statements, excitement
- Multiple sentences: Better than single words or phrases
- Clear articulation: Avoid mumbling or unclear speech
- Consistent quality: Same microphone/environment if possible
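
The measurable parts of this guidance (10-30 seconds, 22050 Hz or higher) can be checked with Python's standard `wave` module. The helper below is an illustrative sketch; the name `inspect_voice_sample` is an assumption, and it only handles uncompressed PCM WAV files.

```python
import wave


def inspect_voice_sample(path):
    """Return (duration_seconds, sample_rate, warnings) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / float(sample_rate)

    warnings = []
    if not 10 <= duration <= 30:
        warnings.append(f"duration {duration:.1f}s is outside the 10-30s range")
    if sample_rate < 22050:
        warnings.append(f"sample rate {sample_rate} Hz is below 22050 Hz")
    return duration, sample_rate, warnings
```

An empty warnings list means the file meets the length and sample-rate guidance; noise level and intonation still need a listen.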
Local TTS Studio automatically handles problematic text elements:
- Emojis: Removed completely
- Special Characters: ™ € £ → "trademark euros pounds"
- Smart Quotes: “ ” ‘ ’ → Regular quotes
- Line Endings: CRLF/LF → Normalized
- URLs/Emails: Replaced with "web link" / "email address"
- Abbreviations: "Dr." → "Doctor", "e.g." → "for example"
- Edit → Clean Text for TTS: Apply cleaning to current text
- Edit → Undo Text Cleaning: Restore original text
- Auto-clean on load: Files automatically processed when opened
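
To make the cleaning rules concrete, here is a rough sketch of such a pipeline using only the standard library. It is not the project's `text_preprocessing.py`: the function name `clean_for_tts`, the lookup tables, and the regular expressions are assumptions, and emoji removal is approximated by dropping characters in Unicode category "So".

```python
import re
import unicodedata

# Illustrative lookup tables; the real pipeline may cover many more cases.
SYMBOLS = {"\u2122": " trademark ", "\u20ac": " euros ", "\u00a3": " pounds "}
QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example"}

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_for_tts(text):
    """Apply the documented cleaning steps in a fixed order."""
    # Normalize CRLF and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Replace URLs and email addresses with spoken placeholders.
    text = URL_RE.sub("web link", text)
    text = EMAIL_RE.sub("email address", text)

    # Symbols must be replaced before the emoji filter, because the
    # trademark sign shares Unicode category "So" with most emoji.
    for source, target in {**QUOTES, **SYMBOLS, **ABBREVIATIONS}.items():
        text = text.replace(source, target)

    # Approximate emoji removal: drop "Symbol, other" characters plus the
    # variation selector and zero-width joiner used in emoji sequences.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "So" and ch not in "\ufe0f\u200d"
    )

    # Collapse runs of spaces and trim whitespace around line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()
```

The order matters: URL and email replacement happens before symbol substitution so that addresses are matched intact.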
Settings are saved automatically to ~/.local-tts-studio/config.json:
- TTS model preferences
- Audio output quality settings
- UI customization options
- Default voice cloning settings
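
A JSON settings file like this is typically read by merging stored values over built-in defaults. The sketch below shows that pattern under stated assumptions: the key names in `DEFAULTS` and the function names are hypothetical and do not reflect the actual schema used by `settings.py`.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".local-tts-studio" / "config.json"

# Hypothetical defaults; the real key names may differ.
DEFAULTS = {"chunk_size": 200, "use_gpu": None, "last_voice_sample": None}


def load_config(path=CONFIG_PATH):
    """Return defaults overlaid with whatever is stored on disk."""
    config = dict(DEFAULTS)
    try:
        config.update(json.loads(Path(path).read_text(encoding="utf-8")))
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # missing or corrupt file: fall back to defaults
    return config


def save_config(config, path=CONFIG_PATH):
    """Write the settings as pretty-printed JSON, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```

Merging over defaults means a config file written by an older version still loads after new settings are introduced.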
GPU not detected:
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

Audio device issues (WSL/headless):
- Application automatically detects and handles missing audio devices
- Audio generation still works - just save files for playback elsewhere
- Status bar shows "simulating playback" when no audio device available
Memory issues with large texts:
# Reduce chunk size for large documents
python tts_cli.py convert large_book.txt output.wav --chunk-size 150

Emoji/special characters still appearing:
- Text is automatically cleaned before synthesis
- Use "Edit → Clean Text for TTS" for manual cleaning
- Check logs for preprocessing details
- GPU recommended: 5-10x faster than CPU
- Smaller chunks: Better for memory-constrained systems
- Audio device: Real playback gives a better experience than simulated playback
- Voice samples: 10-30 seconds optimal for cloning quality
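
The chunk-size tip refers to splitting text into pieces no longer than a character limit. One plausible way to do that, preferring sentence boundaries, is sketched below. This is an assumption about the general approach, not the project's actual chunker, and `chunk_text` is a hypothetical name.

```python
import re


def chunk_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""

    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        # This can cut mid-word; a fuller version would split on spaces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue

        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence

    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Smaller limits produce more, shorter synthesis calls, which lowers peak memory at the cost of more chunk boundaries in the audio.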
- Truly Offline: No API keys, no internet required after setup
- Professional Quality: XTTS-v2 rivals commercial services
- Voice Cloning: Clone any voice from a short sample
- Smart Processing: Handles real-world text automatically
- Streaming Playback: Immediate results, not batch processing
- Production Ready: Professional architecture and error handling
- Content Creators: Turn scripts into professional narration
- Accessibility: Convert documents for audio consumption
- Language Learning: Hear text pronunciation in multiple languages
- Audiobook Creation: Transform books into professional audiobooks
- Privacy-Focused: All processing stays on your machine
This codebase is licensed under GPL 3.0 - see LICENSE file for details.
Important: This application uses the XTTS-v2 model weights and other third-party libraries, each with their own licensing terms:
- XTTS-v2 Model Weights: Licensed under their original terms by Coqui AI. This codebase does not modify, redistribute, or claim any rights over the model weights themselves.
- Coqui TTS Library: Licensed under Mozilla Public License 2.0
- Other Dependencies: See individual package licenses
License Clarification: While this application code is GPL 3.0, we make no claims about changing or affecting the licensing of any model weights, trained models, or third-party libraries used by this application. All third-party components retain their original licensing terms.
Core Technologies: