C++ implementation of StyleTTS2 text-to-speech synthesis using ONNX Runtime. This project provides a fast and efficient way to run StyleTTS2 models for high-quality voice cloning and speech synthesis with both CPU and GPU support.
It can be used in Windows and Unity applications developed in C++ or C# via a DLL wrapper (tested; I'll publish the C# and Unity integration in a separate repo later).
- Features
- Requirements
- Installation
- Quick Start
- Usage
- Project Structure
- Performance
- Contributing
- License
- Acknowledgments
- High-Quality Speech Synthesis: Generate natural-sounding speech using StyleTTS2 models
- Voice Cloning: Support for custom voice styles using reference audio
- ONNX Runtime: Optimized inference using ONNX Runtime for maximum performance
- CPU/CUDA Support: Runs on both CPU and CUDA-enabled GPUs
- Real-time Performance: Efficient C++ implementation for real-time synthesis
- WAV Output: Direct WAV file generation at a 24kHz sample rate
- Multi-language Support: Powered by the eSpeak-NG phonemizer
- Lightweight: No Python runtime dependencies needed for inference; uses only ~1GB of VRAM on your GPU
- OS: Linux, Windows
- CPU: x86_64 processor
- GPU (optional): NVIDIA GPU with CUDA support for GPU acceleration
| Dependency | Version | Purpose |
|---|---|---|
| CMake | 3.10+ | Build system |
| C++ Compiler | GCC 7+, Clang 5+, MSVC 2017+ | C++17 support |
| ONNX Runtime | 1.17.0+ | Model inference engine |
| eSpeak-NG | Latest | Text-to-phoneme conversion |
For convenience, you can use the pre-built ONNX Runtime and eSpeak-NG binaries along with the pre-trained ONNX models:
Download Pre-trained Models:
- Get the exported ONNX models from my Hugging Face repository
- Place them in the `trained_models/` directory
Note: You can also export your own fine-tuned models using the StyleTTS2-Lite repository.
If you prefer to build the dependencies from source for your specific environment, see the official repositories: ONNX Runtime GitHub and eSpeak-NG GitHub.
- Clone the repository:

```bash
git clone https://github.com/DDATT/StyleTTS2-onnx-cpp.git
cd StyleTTS2-onnx-cpp
```

- Create the build directory:

```bash
mkdir build
cd build
```

- Configure and build:

```bash
cmake ..
make
```

- Verify the installation:

```bash
./styleTTS2
```

Tip: If you installed dependencies in custom locations, update the paths in CMakeLists.txt:

```cmake
find_path(ONNX_RUNTIME_SESSION_INCLUDE_DIRS onnxruntime_cxx_api.h HINTS /your_onnxruntime_path/include/)
find_library(ONNX_RUNTIME_LIB onnxruntime HINTS /your_onnxruntime_path/lib/)
```

UNDER CONSTRUCTION
After building the project, follow these steps to synthesize your first audio:
- Download the models from Hugging Face
- Prepare reference audio (see below)
- Run synthesis with `./styleTTS2`

Output will be saved as `test.wav` in the current directory.
Download and place the following ONNX models in the `trained_models/` directory:

| Model File | Description | Size |
|---|---|---|
| `plbert_simp.onnx` | PL-BERT phoneme encoder | ~23MB |
| `bert_encoder.onnx` | BERT text encoder | ~2MB |
| `final_simp.onnx` | Main synthesis model | ~305MB |
| `style_encoder_simp.onnx` | Style encoder (for reference audio) | ~55MB |
| `predictor_encoder_simp.onnx` | Predictor encoder (for reference audio) | ~55MB |
Your directory structure should look like:
```
trained_models/
├── plbert_simp.onnx
├── bert_encoder.onnx
├── final_simp.onnx
├── style_encoder_simp.onnx
└── predictor_encoder_simp.onnx
```
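The engine expects all five models to be present before initialization. As a quick sanity check, a minimal sketch like the one below (illustration only, not part of this project's API; it assumes nothing beyond the standard `<filesystem>` library and the file names listed above) can catch a missing or misplaced download early:

```cpp
#include <array>
#include <filesystem>
#include <iostream>
#include <string>

// Minimal sketch: verify that the expected ONNX models exist in trained_models/
// before starting the engine. File names are taken from the table above.
int main() {
    const std::array<std::string, 5> required = {
        "plbert_simp.onnx", "bert_encoder.onnx", "final_simp.onnx",
        "style_encoder_simp.onnx", "predictor_encoder_simp.onnx"};

    bool ok = true;
    for (const auto& name : required) {
        const std::filesystem::path p = std::filesystem::path("trained_models") / name;
        if (!std::filesystem::exists(p)) {
            std::cerr << "Missing model: " << p << '\n';
            ok = false;
        }
    }
    std::cout << (ok ? "All models found." : "Some models are missing.") << '\n';
    return ok ? 0 : 1;
}
```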
To clone a specific voice, you need to generate style embeddings from reference audio:
- Prepare your reference audio:
  - Place a WAV file (24kHz recommended) in the project root
  - Update the filename in `Assets/get_reference_audio.py`
- Install Python dependencies:

```bash
pip install numpy soundfile torch torchaudio librosa onnxruntime
```

- Generate embeddings:

```bash
cd Assets
python get_reference_audio.py
```

This will generate:
  - `ref_s.bin` - Style embedding (speaker characteristics)
  - `ref_p.bin` - Predictor embedding (prosody patterns)

- Move embeddings to the model directory:

```bash
mv ref_s.bin ref_p.bin ../trained_models/
```

Tips for best results:
- Use clean audio with minimal background noise
- 5-15 seconds of speech is usually sufficient
- The speaker should speak naturally and clearly
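If you want to sanity-check the generated embeddings from C++, the sketch below is one way to do it. This is an illustration only: it assumes `ref_s.bin` and `ref_p.bin` are raw little-endian float32 arrays with no header (i.e. plain NumPy buffers written by the Python script), which you should verify against `get_reference_audio.py` before relying on it.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// Sketch: read a style embedding file as a raw float32 buffer and report its size.
// Assumption (not verified against this repo): ref_s.bin / ref_p.bin are plain
// float32 arrays with no header.
int main() {
    std::ifstream in("trained_models/ref_s.bin", std::ios::binary | std::ios::ate);
    if (!in) {
        std::cerr << "Could not open ref_s.bin\n";
        return 1;
    }
    const std::streamsize bytes = in.tellg();
    in.seekg(0, std::ios::beg);

    std::vector<float> embedding(static_cast<size_t>(bytes) / sizeof(float));
    in.read(reinterpret_cast<char*>(embedding.data()), bytes);

    std::cout << "ref_s.bin: " << embedding.size() << " float32 values ("
              << bytes << " bytes)\n";
    return 0;
}
```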
```bash
cd build
./styleTTS2
```

What happens:
- Loads all ONNX models and initializes the TTS engine
- Loads reference style embeddings
- Processes input text through the phonemizer
- Synthesizes speech
- Saves output to `test.wav`
- Displays performance metrics
Expected output:

```
Initializing StyleTTS2...
Initialization took: 2.5 seconds
Synthesizing speech...

=== Results ===
Initialization time: 2.5 seconds
Inference time: 0.8 seconds
Total runtime: 3.3 seconds
Audio duration: 10.2 seconds

Synthesis completed successfully!
```
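In this example the audio is generated much faster than real time: the real-time factor (RTF) is inference time divided by audio duration, 0.8 / 10.2 ≈ 0.08. If you want to measure this yourself, the sketch below shows one way to time a synthesis call with `std::chrono`. The helper name `timedSynthesis` is hypothetical; it only uses the `synthesize` call shown in the usage example that follows, and it assumes 24kHz mono int16 output as in the WAV-writing example below.

```cpp
#include "styletts2.h"

#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Sketch: measure inference time and real-time factor around a synthesize() call.
// Assumes `tts` is an already-initialized StyleTTS2 instance (see usage example below).
void timedSynthesis(StyleTTS2& tts, const std::string& text) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<int16_t> audio = tts.synthesize(text, 1.0f);
    const auto end = std::chrono::steady_clock::now();

    const double inferenceSec = std::chrono::duration<double>(end - start).count();
    const double audioSec = static_cast<double>(audio.size()) / 24000.0;  // 24kHz mono

    std::cout << "Inference time: " << inferenceSec << " s\n"
              << "Audio duration: " << audioSec << " s\n"
              << "Real-time factor: " << inferenceSec / audioSec << "\n";
}
```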
Here's how to integrate StyleTTS2 into your own C++ application:

```cpp
#include "styletts2.h"
#include "wavfile.hpp"

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    try {
        // Initialize StyleTTS2
        // Parameters: (modelDir, espeakDataDir, useCuda)
        StyleTTS2 tts("trained_models", "espeak-ng/share/espeak-ng-data", false);

        // Load custom voice style
        tts.LoadStyle("trained_models/ref_s.bin", "trained_models/ref_p.bin");

        // Synthesize speech
        std::string text = "Hello, this is a test of speech synthesis.";
        float speed = 1.0f; // 1.0 = normal speed, 0.5 = slower, 2.0 = faster
        std::vector<int16_t> audio = tts.synthesize(text, speed);

        // Save to WAV file
        std::ofstream audioFile("output.wav", std::ios::binary);
        writeWavHeader(24000, 2, 1, (int32_t)audio.size(), audioFile);
        audioFile.write((const char*)audio.data(), sizeof(int16_t) * audio.size());
        audioFile.close();

        std::cout << "Audio saved to output.wav" << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```
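Since model loading dominates startup time, it is usually worth constructing the engine once and reusing it for every utterance. The sketch below illustrates that pattern using only the calls from the example above; the helper name `synthesizeAll` and the output file names are illustrative, not part of this project's API.

```cpp
#include "styletts2.h"
#include "wavfile.hpp"

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch: reuse one StyleTTS2 instance for several utterances so the ONNX models
// are loaded only once. Uses only the calls shown in the example above.
void synthesizeAll(StyleTTS2& tts, const std::vector<std::string>& sentences) {
    for (size_t i = 0; i < sentences.size(); ++i) {
        std::vector<int16_t> audio = tts.synthesize(sentences[i], 1.0f);

        std::ofstream out("utterance_" + std::to_string(i) + ".wav", std::ios::binary);
        writeWavHeader(24000, 2, 1, (int32_t)audio.size(), out);
        out.write((const char*)audio.data(), sizeof(int16_t) * audio.size());
    }
}
```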
```
StyleTTS2-onnx-cpp/
├── CMakeLists.txt                    # Build configuration
├── main.cpp                          # Main application entry point
├── README.md                         # This documentation
├── LICENSE.txt                       # MIT License
│
├── Assets/
│   └── get_reference_audio.py        # Python script to generate style embeddings
│
├── include/                          # Header files
│   ├── phonemize.h                   # Phonemization interface (eSpeak-NG wrapper)
│   ├── styletts2.h                   # StyleTTS2 main class interface
│   └── wavfile.hpp                   # WAV file I/O utilities
│
├── src/                              # Source files
│   ├── phonemize.cpp                 # Phonemization implementation
│   └── styletts2.cpp                 # StyleTTS2 core implementation
│
└── trained_models/                   # ONNX models (create this directory)
    ├── plbert_simp.onnx              # Download from Hugging Face
    ├── bert_encoder.onnx             # Download from Hugging Face
    ├── final_simp.onnx               # Download from Hugging Face
    ├── style_encoder_simp.onnx       # Download from Hugging Face
    ├── predictor_encoder_simp.onnx   # Download from Hugging Face
    ├── ref_s.bin                     # Generated from reference audio
    └── ref_p.bin                     # Generated from reference audio
```
- GPU acceleration: generates audio for a ~100-word input in under 1 second on an RTX 5060 Ti
- Memory usage: ~1GB of VRAM for models plus runtime on a CUDA GPU
Contributions are welcome! Feel free to report bugs and issues, or suggest new features and improvements.
If you find this project useful, please consider giving it a star!
- Follow C++17 standards
- Add comments for complex logic
- Test your changes thoroughly
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE.txt file for details.
- StyleTTS2 - Original implementation (MIT License)
- ONNX Runtime - Inference engine (MIT License)
- eSpeak-NG - Phonemizer (GPL v3)
Special thanks to:
- yl4579 - Original StyleTTS2 implementation and research
- dangtr0408 and thewh1teagle - StyleTTS2-Lite for model export utilities
- Microsoft - ONNX Runtime high-performance inference engine
- eSpeak-NG Team - eSpeak-NG text-to-phoneme conversion
- GitHub Issues: Report bugs or request features
- Models: Hugging Face Repository