
cunicopia-dev/local-tts-studio


🎙️ Local TTS Studio

🚀 Professional offline text-to-speech studio with voice cloning capabilities

Transform books, documents, and any text into natural-sounding audiobooks using state-of-the-art XTTS-v2 technology. No internet required, complete privacy, professional results.


✨ Key Features

🎯 Smart Text Processing

  • Automatic Text Cleaning: Removes emojis, fixes problematic characters, normalizes line endings
  • Multi-format Support: Load TXT, MD, and PDF files with automatic preprocessing
  • Intelligent Character Replacement: Converts symbols to speech-friendly text (™ → "trademark", € → "euros")
  • Line Ending Normalization: Handles CRLF/LF issues automatically

🎀 Advanced Audio Generation

  • Voice Cloning: Clone any voice from a WAV sample using XTTS-v2
  • Streaming Playback: Hear results immediately as synthesis progresses
  • Professional Quality: State-of-the-art neural TTS with natural prosody
  • Export Options: Save as WAV or MP3 with configurable quality
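For readers curious what voice cloning looks like at the library level, here is a minimal sketch that calls the Coqui TTS Python API directly. It is not taken from this project's `tts_engine.py`: the helper name `clone_voice_to_file`, its parameters, and the file names in the usage note are illustrative assumptions.

```python
from pathlib import Path


def clone_voice_to_file(text, speaker_wav, out_path, language="en", use_gpu=False):
    """Synthesize `text` in the voice of `speaker_wav` (hypothetical helper)."""
    speaker = Path(speaker_wav)
    if not speaker.is_file():
        raise FileNotFoundError(f"Voice sample not found: {speaker}")

    # Imported lazily so the file check above works even without Coqui TTS installed.
    from TTS.api import TTS

    # The first call downloads the XTTS-v2 weights if they are not cached yet.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    if use_gpu:
        tts.to("cuda")

    tts.tts_to_file(
        text=text,
        speaker_wav=str(speaker),
        language=language,
        file_path=str(out_path),
    )
    return Path(out_path)
```

A call such as `clone_voice_to_file("Hello there.", "voice_sample.wav", "hello.wav")` would write a WAV file; MP3 export needs a separate conversion step, for example through FFmpeg.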

💻 Text Editor

  • Standard Shortcuts: Ctrl+A (select all), Ctrl+F (find/replace), Ctrl+Z/Y (undo/redo)
  • Find & Replace: Advanced search with regex support
  • Smart Preprocessing: One-click text cleaning for optimal TTS results
  • Undo Support: Restore original text after cleaning operations

🚀 Performance & Usability

  • Streaming Architecture: Start hearing audio within seconds, not minutes
  • Chunked Processing: Efficiently handle large documents (books, reports)
  • Cross-platform: Works on Windows, macOS, and Linux
  • Dual Interface: Both GUI and CLI for different workflows
  • Professional UX: Progress tracking, status updates, and error handling
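The streaming idea above can be sketched as a small producer/consumer pipeline: one thread synthesizes chunks while the calling thread plays whatever is already finished. This is only an illustration of the pattern, not the project's actual implementation; `stream_chunks`, `synthesize`, and `play` are hypothetical names, and the two callables stand in for the real TTS engine and audio player.

```python
import queue
import threading


def stream_chunks(chunks, synthesize, play, max_buffered=4):
    """Synthesize chunks in a worker thread while this thread plays them."""
    buffer = queue.Queue(maxsize=max_buffered)  # bounded, so synthesis cannot run far ahead
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for chunk in chunks:
                buffer.put(synthesize(chunk))
        finally:
            buffer.put(done)  # always unblock the consumer, even if synthesis fails

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    played = 0
    while True:
        audio = buffer.get()
        if audio is done:
            break
        play(audio)  # playback of chunk N overlaps synthesis of chunk N+1
        played += 1
    worker.join()
    return played
```

With stand-in callables, `stream_chunks(["a", "b"], synthesize=str.upper, play=print)` plays the chunks in order and returns the number played. The bounded queue keeps memory use flat on long documents.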

🚀 Installation

📋 Prerequisites

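| Requirement | Version         | Notes                                       |
|-------------|-----------------|---------------------------------------------|
| Python      | 3.8-3.11        | 3.12+ not yet supported by TTS dependencies |
| FFmpeg      | Latest          | Required for audio processing               |
| GPU         | CUDA-compatible | Optional, but 5-10x faster than CPU         |

Install FFmpeg:
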
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

⚡ Quick Start

1. Clone and setup:

git clone https://github.com/cunicopia-dev/local-tts-studio.git
cd local-tts-studio
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Launch GUI:

python run.py

3. Start creating: Load a document, optionally load a voice sample, and click "Speak"!

💡 First run: The app automatically downloads the XTTS-v2 model (~1.8 GB) on first use.

🎯 Usage Guide

GUI Mode (Recommended)

Launch the application:

python run.py

🔄 Workflow:

📄 Load Content → 🎤 Load Voice (optional) → ▶️ Generate → 💾 Save Audio

  1. 📂 Load Content: File → Open Text/PDF (automatically cleaned)
  2. 🎤 Optional Voice: Voice → Load Voice Sample (for cloning)
  3. ▶️ Generate: Click "Speak" (streaming playback starts immediately)
  4. 💾 Save: Click "Save Audio" to export WAV/MP3

Text Editor Features:

  • Ctrl+A: Select all text
  • Ctrl+F: Find and replace with regex support
  • Edit → Clean Text for TTS: Manual text cleaning
  • Edit → Undo Text Cleaning: Restore original text

💻 Command-Line Mode

Basic conversion:

python tts_cli.py convert input.txt output.wav

With voice cloning:

python tts_cli.py convert book.pdf audiobook.mp3 --voice voice_sample.wav

⚙️ CLI Options:

| Flag             | Description                | Default        |
|------------------|----------------------------|----------------|
| --voice, -v      | WAV file for voice cloning | Built-in voice |
| --chunk-size, -c | Max characters per chunk   | 200            |
| --gpu            | Enable GPU acceleration    | Auto-detect    |
| --verbose        | Enable detailed logging    | False          |
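To make the `--chunk-size` option more concrete, the sketch below shows one plausible way to split text into chunks of at most a given number of characters while preferring sentence boundaries. It is an assumption about the general technique, not a copy of `text_processing.py`; the function name `split_into_chunks` is hypothetical.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most `max_chars` characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A sentence longer than the limit is hard-split so no chunk exceeds it.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks lower peak memory use and shorten the wait before the first audio is ready; larger chunks give the model more context per call.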

📚 Examples

# 📖 Convert a PDF to MP3 with custom voice
python tts_cli.py convert manual.pdf manual_audio.mp3 --voice john.wav --gpu

# 📝 Convert markdown to WAV
python tts_cli.py convert README.md readme_audio.wav

# 📕 Process large book with optimized chunks
python tts_cli.py convert large_book.txt book.mp3 --chunk-size 150

🏗️ Architecture

The code is split into small modules with separate responsibilities:

local-tts-studio/
├── src/
│   ├── core/
│   │   └── tts_engine.py                # TTS synthesis engine
│   ├── gui/
│   │   ├── main_window.py               # Main application GUI
│   │   └── text_editor_enhancements.py  # Advanced text editing
│   ├── utils/
│   │   ├── text_processing.py           # File loading and chunking
│   │   ├── text_preprocessing.py        # Smart text cleaning pipeline
│   │   └── audio_utils.py               # Streaming audio playback
│   └── config/
│       └── settings.py                  # Configuration management
├── tests/                               # Unit tests
├── run.py                               # GUI entry point
├── tts_cli.py                           # CLI interface
└── requirements.txt                     # Python dependencies

Key Design Principles

  • Modular Architecture: Separate concerns for maintainability
  • Streaming Processing: Real-time audio generation and playback
  • Professional UX: Standard shortcuts, progress feedback, error handling
  • Intelligent Preprocessing: Automatic text optimization for TTS
  • Cross-Platform: Works on Windows, macOS, Linux

🎡 Voice Cloning Guide

Getting Great Results

  1. Quality Recording: Use a clear WAV file (10-30 seconds)
  2. Clean Audio: Minimal background noise, consistent volume
  3. Natural Speech: Include varied intonation and speech patterns
  4. Technical Specs: 22050 Hz sample rate or higher recommended

Tips for Success

  • Avoid monotone: Include questions, statements, excitement
  • Multiple sentences: Better than single words or phrases
  • Clear articulation: Avoid mumbling or unclear speech
  • Consistent quality: Same microphone/environment if possible
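The measurable parts of this guidance (10-30 seconds, 22050 Hz or higher) can be checked before synthesis. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `check_voice_sample` and its warning messages are assumptions for illustration and are not part of the application.

```python
import wave


def check_voice_sample(path, min_seconds=10.0, max_seconds=30.0, min_rate=22050):
    """Return a list of warnings for a WAV voice sample (empty list means OK)."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)

    warnings = []
    if duration < min_seconds:
        warnings.append(
            f"sample is {duration:.1f}s; at least {min_seconds:.0f}s recommended"
        )
    elif duration > max_seconds:
        warnings.append(
            f"sample is {duration:.1f}s; at most {max_seconds:.0f}s recommended"
        )
    if rate < min_rate:
        warnings.append(
            f"sample rate is {rate} Hz; {min_rate} Hz or higher recommended"
        )
    return warnings
```

Background noise and monotone delivery cannot be detected this way, so listening to the sample is still necessary.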

🛠️ Smart Text Processing

Local TTS Studio automatically handles problematic text elements:

Automatic Cleaning

  • Emojis: 😀🎉🚀 → Removed completely
  • Special Characters: ™ € £ → "trademark", "euros", "pounds"
  • Smart Quotes: “ ” ‘ ’ → Regular quotes
  • Line Endings: CRLF/LF → Normalized
  • URLs/Emails: → "web link" / "email address"
  • Abbreviations: "Dr." → "Doctor", "e.g." → "for example"
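The rules listed above can be approximated with a short, standard-library-only function. This is a simplified sketch, not the project's `text_preprocessing.py`: the name `clean_for_tts`, the lookup tables, and the emoji ranges are assumptions, and the tables cover only the examples given in the list.

```python
import re

# Illustrative tables; a real pipeline would cover far more cases.
SYMBOL_WORDS = {"™": " trademark", "€": " euros", "£": " pounds"}
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example"}
# Covers the main emoji blocks plus the variation selector, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_for_tts(text):
    """Apply the cleaning rules in a fixed order and return speech-friendly text."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = URL_RE.sub("web link", text)  # URLs first, since they may contain "@"
    text = EMAIL_RE.sub("email address", text)
    text = EMOJI_RE.sub("", text)
    text = text.translate(QUOTE_MAP)
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[ \t]+", " ", text)  # collapse gaps left by removals
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    return text.strip()
```

For example, `clean_for_tts("Dr. Smith™ paid 5€ 🚀")` returns `"Doctor Smith trademark paid 5 euros"`. Note that the simple URL pattern also swallows punctuation that directly follows a link.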

Manual Control

  • Edit → Clean Text for TTS: Apply cleaning to current text
  • Edit → Undo Text Cleaning: Restore original text
  • Auto-clean on load: Files automatically processed when opened

⚙️ Configuration & Settings

Settings are saved automatically to ~/.local-tts-studio/config.json, including:

  • TTS model preferences
  • Audio output quality settings
  • UI customization options
  • Default voice cloning settings
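As a rough illustration of how a JSON settings file at that path can be read and written, consider the sketch below. The key names in `DEFAULTS` and the helper names `load_settings` and `save_settings` are hypothetical; the real schema lives in `src/config/settings.py` and may differ.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".local-tts-studio" / "config.json"

# Hypothetical keys, chosen only to mirror the categories listed above.
DEFAULTS = {
    "model": "xtts_v2",
    "chunk_size": 200,
    "use_gpu": None,  # None means auto-detect
    "voice_sample": None,
}


def load_settings(path=CONFIG_PATH):
    """Return the defaults overlaid with whatever the config file contains."""
    settings = dict(DEFAULTS)
    try:
        settings.update(json.loads(Path(path).read_text(encoding="utf-8")))
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # missing or corrupt file: fall back to the defaults
    return settings


def save_settings(settings, path=CONFIG_PATH):
    """Write settings as indented JSON, creating the directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```

Overlaying the file on top of defaults means a config written by an older version keeps working when new keys are introduced.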

🚨 Troubleshooting

Common Issues

GPU not detected:

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
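The same check can be wrapped in a small helper that falls back to the CPU when CUDA or PyTorch itself is unavailable. This is a sketch of the "auto-detect" idea only; `pick_device` is a hypothetical name and the application's own detection logic may differ.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no GPU path is possible
    return "cuda" if torch.cuda.is_available() else "cpu"
```

If this returns "cpu" on a machine with an NVIDIA GPU, the usual causes are a CPU-only PyTorch build or a driver that does not match the installed CUDA version.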

Audio device issues (WSL/headless):

  • The application detects missing audio devices automatically and keeps running
  • Audio generation still works; save the files and play them elsewhere
  • The status bar shows "simulating playback" when no audio device is available

Memory issues with large texts:

# Reduce chunk size for large documents (the default is 200)
python tts_cli.py convert large_book.txt output.wav --chunk-size 100

Emoji/special characters still appearing:

  • Text is automatically cleaned before synthesis
  • Use "Edit → Clean Text for TTS" for manual cleaning
  • Check logs for preprocessing details

Performance Tips

  • GPU recommended: 5-10x faster than CPU
  • Smaller chunks: Better for memory-constrained systems
  • Audio device: Without an audio device, playback is only simulated; use a machine with working audio output to hear results live
  • Voice samples: 10-30 seconds optimal for cloning quality

🔗 What Makes This Interesting

vs. Other TTS Solutions

  • Truly Offline: No API keys, no internet required after setup
  • Professional Quality: XTTS-v2 rivals commercial services
  • Voice Cloning: Clone any voice from a short sample
  • Smart Processing: Handles real-world text automatically
  • Streaming Playback: Immediate results, not batch processing
  • Production Ready: Professional architecture and error handling

Perfect For

  • Content Creators: Turn scripts into professional narration
  • Accessibility: Convert documents for audio consumption
  • Language Learning: Hear text pronunciation in multiple languages
  • Audiobook Creation: Transform books into professional audiobooks
  • Privacy-Focused: All processing stays on your machine

📄 License

Application Code

This codebase is licensed under GPL 3.0 - see LICENSE file for details.

Model Weights & Third-Party Components

Important: This application uses the XTTS-v2 model weights and other third-party libraries, each with their own licensing terms:

  • XTTS-v2 Model Weights: Licensed under their original terms by Coqui AI. This codebase does not modify, redistribute, or claim any rights over the model weights themselves.
  • Coqui TTS Library: Licensed under Mozilla Public License 2.0
  • Other Dependencies: See individual package licenses

License Clarification: While this application code is GPL 3.0, we make no claims about changing or affecting the licensing of any model weights, trained models, or third-party libraries used by this application. All third-party components retain their original licensing terms.

🙏 Acknowledgments

Core Technologies:

