C++ implementation of StyleTTS2 text-to-speech synthesis using ONNX Runtime. This project provides a fast and efficient way to run StyleTTS2 models for high-quality voice cloning and speech synthesis with both CPU and GPU support.
It can be used in Windows and Unity applications developed in C++ or C# via a DLL wrapper (tested; I'll publish the C# and Unity integration in a separate repo later).
- Features
- Requirements
- Installation
- Quick Start
- Usage
- Project Structure
- Performance
- Contributing
- License
- Acknowledgments
- High-Quality Speech Synthesis: Generate natural-sounding speech using StyleTTS2 models
- Voice Cloning: Support for custom voice styles using reference audio
- ONNX Runtime: Optimized inference using ONNX Runtime for maximum performance
- CPU/CUDA Support: Runs on both CPU and CUDA-enabled GPUs
- Real-time Performance: Efficient C++ implementation for real-time synthesis
- WAV Output: Direct WAV file generation at a 24kHz sample rate
- Multi-language Support: Powered by the eSpeak-NG phonemizer
- Lightweight: No Python runtime dependencies needed for inference; uses only ~1GB of VRAM on your GPU
- OS: Linux, Windows
- CPU: x86_64 processor
- GPU (optional): NVIDIA GPU with CUDA support for GPU acceleration
| Dependency | Version | Purpose |
|---|---|---|
| CMake | 3.10+ | Build system |
| C++ Compiler | GCC 7+, Clang 5+, MSVC 2017+ | C++17 support |
| ONNX Runtime | 1.17.0+ | Model inference engine |
| eSpeak-NG | Latest | Text-to-phoneme conversion |
For convenience, you can use the pre-built ONNX Runtime and eSpeak-NG binaries along with the pre-trained ONNX models:
Download Pre-trained Models:
- Get the exported ONNX models from my Hugging Face repository
- Place them in the `trained_models/` directory
Note: You can also export your own fine-tuned models using the StyleTTS2-Lite repository.
If you prefer to build the dependencies from source for your specific environment, see the official repositories: ONNX Runtime GitHub and eSpeak-NG GitHub.
- Clone the repository:

```bash
git clone https://github.com/DDATT/StyleTTS2-onnx-cpp.git
cd StyleTTS2-onnx-cpp
```

- Create the build directory:

```bash
mkdir build
cd build
```

- Configure and build:

```bash
cmake ..
make
```

- Verify the installation:

```bash
./styleTTS2
```

Tip: If you installed dependencies in custom locations, update the paths in CMakeLists.txt:

```cmake
find_path(ONNX_RUNTIME_SESSION_INCLUDE_DIRS onnxruntime_cxx_api.h HINTS /your_onnxruntime_path/include/)
find_library(ONNX_RUNTIME_LIB onnxruntime HINTS /your_onnxruntime_path/lib/)
```

UNDER CONSTRUCTION
After building the project, follow these steps to synthesize your first audio:
- Download the models from Hugging Face
- Prepare reference audio (see below)
- Run synthesis with `./styleTTS2`

Output will be saved as `test.wav` in the current directory.
Download and place the following ONNX models in the `trained_models/` directory:

| Model File | Description | Size |
|---|---|---|
| `plbert_simp.onnx` | PL-BERT phoneme encoder | ~23MB |
| `bert_encoder.onnx` | BERT text encoder | ~2MB |
| `final_simp.onnx` | Main synthesis model | ~305MB |
| `style_encoder_simp.onnx` | Style encoder (for reference audio) | ~55MB |
| `predictor_encoder_simp.onnx` | Predictor encoder (for reference audio) | ~55MB |
Your directory structure should look like:
```
trained_models/
├── plbert_simp.onnx
├── bert_encoder.onnx
├── final_simp.onnx
├── style_encoder_simp.onnx
└── predictor_encoder_simp.onnx
```
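The engine expects all five models to be present before initialization. As a quick sanity check, a minimal sketch like the one below (illustration only, not part of this project's API; it assumes nothing beyond the standard `<filesystem>` library and the file names listed above) can catch a missing or misplaced download early:

```cpp
#include <array>
#include <filesystem>
#include <iostream>
#include <string>

// Minimal sketch: verify that the expected ONNX models exist in trained_models/
// before starting the engine. File names are taken from the table above.
int main() {
    const std::array<std::string, 5> required = {
        "plbert_simp.onnx", "bert_encoder.onnx", "final_simp.onnx",
        "style_encoder_simp.onnx", "predictor_encoder_simp.onnx"};

    bool ok = true;
    for (const auto& name : required) {
        const std::filesystem::path p = std::filesystem::path("trained_models") / name;
        if (!std::filesystem::exists(p)) {
            std::cerr << "Missing model: " << p << '\n';
            ok = false;
        }
    }
    std::cout << (ok ? "All models found." : "Some models are missing.") << '\n';
    return ok ? 0 : 1;
}
```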
To clone a specific voice, you need to generate style embeddings from reference audio:
- Prepare your reference audio:
  - Place a WAV file (24kHz recommended) in the project root
  - Update the filename in `Assets/get_reference_audio.py`
- Install Python dependencies:

```bash
pip install numpy soundfile torch torchaudio librosa onnxruntime
```

- Generate embeddings:

```bash
cd Assets
python get_reference_audio.py
```

This will generate:
  - `ref_s.bin` - Style embedding (speaker characteristics)
  - `ref_p.bin` - Predictor embedding (prosody patterns)

- Move embeddings to the model directory:

```bash
mv ref_s.bin ref_p.bin ../trained_models/
```

Tips for best results:
- Use clean audio with minimal background noise
- 5-15 seconds of speech is usually sufficient
- The speaker should speak naturally and clearly
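If you want to sanity-check the generated embeddings from C++, the sketch below is one way to do it. This is an illustration only: it assumes `ref_s.bin` and `ref_p.bin` are raw little-endian float32 arrays with no header (i.e. plain NumPy buffers written by the Python script), which you should verify against `get_reference_audio.py` before relying on it.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// Sketch: read a style embedding file as a raw float32 buffer and report its size.
// Assumption (not verified against this repo): ref_s.bin / ref_p.bin are plain
// float32 arrays with no header.
int main() {
    std::ifstream in("trained_models/ref_s.bin", std::ios::binary | std::ios::ate);
    if (!in) {
        std::cerr << "Could not open ref_s.bin\n";
        return 1;
    }
    const std::streamsize bytes = in.tellg();
    in.seekg(0, std::ios::beg);

    std::vector<float> embedding(static_cast<size_t>(bytes) / sizeof(float));
    in.read(reinterpret_cast<char*>(embedding.data()), bytes);

    std::cout << "ref_s.bin: " << embedding.size() << " float32 values ("
              << bytes << " bytes)\n";
    return 0;
}
```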
```bash
cd build
./styleTTS2
```

What happens:
- Loads all ONNX models and initializes the TTS engine
- Loads reference style embeddings
- Processes input text through the phonemizer
- Synthesizes speech
- Saves output to `test.wav`
- Displays performance metrics
Expected output:

```
Initializing StyleTTS2...
Initialization took: 2.5 seconds
Synthesizing speech...

=== Results ===
Initialization time: 2.5 seconds
Inference time: 0.8 seconds
Total runtime: 3.3 seconds
Audio duration: 10.2 seconds

Synthesis completed successfully!
```
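In this example the audio is generated much faster than real time: the real-time factor (RTF) is inference time divided by audio duration, 0.8 / 10.2 ≈ 0.08. If you want to measure this yourself, the sketch below shows one way to time a synthesis call with `std::chrono`. The helper name `timedSynthesis` is hypothetical; it only uses the `synthesize` call shown in the usage example that follows, and it assumes 24kHz mono int16 output as in the WAV-writing example below.

```cpp
#include "styletts2.h"

#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Sketch: measure inference time and real-time factor around a synthesize() call.
// Assumes `tts` is an already-initialized StyleTTS2 instance (see usage example below).
void timedSynthesis(StyleTTS2& tts, const std::string& text) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<int16_t> audio = tts.synthesize(text, 1.0f);
    const auto end = std::chrono::steady_clock::now();

    const double inferenceSec = std::chrono::duration<double>(end - start).count();
    const double audioSec = static_cast<double>(audio.size()) / 24000.0;  // 24kHz mono

    std::cout << "Inference time: " << inferenceSec << " s\n"
              << "Audio duration: " << audioSec << " s\n"
              << "Real-time factor: " << inferenceSec / audioSec << "\n";
}
```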
Here's how to integrate StyleTTS2 into your own C++ application:

```cpp
#include "styletts2.h"
#include "wavfile.hpp"

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    try {
        // Initialize StyleTTS2
        // Parameters: (modelDir, espeakDataDir, useCuda)
        StyleTTS2 tts("trained_models", "espeak-ng/share/espeak-ng-data", false);

        // Load custom voice style
        tts.LoadStyle("trained_models/ref_s.bin", "trained_models/ref_p.bin");

        // Synthesize speech
        std::string text = "Hello, this is a test of speech synthesis.";
        float speed = 1.0f; // 1.0 = normal speed, 0.5 = slower, 2.0 = faster
        std::vector<int16_t> audio = tts.synthesize(text, speed);

        // Save to WAV file
        std::ofstream audioFile("output.wav", std::ios::binary);
        writeWavHeader(24000, 2, 1, (int32_t)audio.size(), audioFile);
        audioFile.write((const char*)audio.data(), sizeof(int16_t) * audio.size());
        audioFile.close();

        std::cout << "Audio saved to output.wav" << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```
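Since model loading dominates startup time, it is usually worth constructing the engine once and reusing it for every utterance. The sketch below illustrates that pattern using only the calls from the example above; the helper name `synthesizeAll` and the output file names are illustrative, not part of this project's API.

```cpp
#include "styletts2.h"
#include "wavfile.hpp"

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch: reuse one StyleTTS2 instance for several utterances so the ONNX models
// are loaded only once. Uses only the calls shown in the example above.
void synthesizeAll(StyleTTS2& tts, const std::vector<std::string>& sentences) {
    for (size_t i = 0; i < sentences.size(); ++i) {
        std::vector<int16_t> audio = tts.synthesize(sentences[i], 1.0f);

        std::ofstream out("utterance_" + std::to_string(i) + ".wav", std::ios::binary);
        writeWavHeader(24000, 2, 1, (int32_t)audio.size(), out);
        out.write((const char*)audio.data(), sizeof(int16_t) * audio.size());
    }
}
```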
```
StyleTTS2-onnx-cpp/
├── CMakeLists.txt                    # Build configuration
├── main.cpp                          # Main application entry point
├── README.md                         # This documentation
├── LICENSE.txt                       # MIT License
│
├── Assets/
│   └── get_reference_audio.py        # Python script to generate style embeddings
│
├── include/                          # Header files
│   ├── phonemize.h                   # Phonemization interface (eSpeak-NG wrapper)
│   ├── styletts2.h                   # StyleTTS2 main class interface
│   └── wavfile.hpp                   # WAV file I/O utilities
│
├── src/                              # Source files
│   ├── phonemize.cpp                 # Phonemization implementation
│   └── styletts2.cpp                 # StyleTTS2 core implementation
│
└── trained_models/                   # ONNX models (create this directory)
    ├── plbert_simp.onnx              # Download from Hugging Face
    ├── bert_encoder.onnx             # Download from Hugging Face
    ├── final_simp.onnx               # Download from Hugging Face
    ├── style_encoder_simp.onnx       # Download from Hugging Face
    ├── predictor_encoder_simp.onnx   # Download from Hugging Face
    ├── ref_s.bin                     # Generated from reference audio
    └── ref_p.bin                     # Generated from reference audio
```
- GPU acceleration: generates audio for a ~100-word input in under 1 second on an RTX 5060 Ti
- Memory usage: ~1GB of VRAM for models plus runtime on a CUDA GPU
Contributions are welcome! Feel free to report bugs and issues, or suggest new features and improvements.
If you find this project useful, please consider giving it a star!
- Follow C++17 standards
- Add comments for complex logic
- Test your changes thoroughly
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE.txt file for details.
- StyleTTS2 - Original implementation (MIT License)
- ONNX Runtime - Inference engine (MIT License)
- eSpeak-NG - Phonemizer (GPL v3)
Special thanks to:
- yl4579 - Original StyleTTS2 implementation and research
- dangtr0408 and thewh1teagle - StyleTTS2-Lite for model export utilities
- Microsoft - ONNX Runtime high-performance inference engine
- eSpeak-NG Team - eSpeak-NG text-to-phoneme conversion
- GitHub Issues: Report bugs or request features
- Models: Hugging Face Repository