StyleTTS2 ONNX C++

License: MIT · C++17 · CMake

C++ implementation of StyleTTS2 text-to-speech synthesis using ONNX Runtime. This project provides a fast and efficient way to run StyleTTS2 models for high-quality voice cloning and speech synthesis with both CPU and GPU support.

It can also be used from Windows and Unity applications written in C++ or C# via a DLL wrapper (tested; a C# and Unity integration will be published in a separate repo later). A minimal wrapper sketch follows the integration example below.

📑 Table of Contents

  • ✨ Features
  • 📋 Requirements
  • 🔧 Installation
  • 🚀 Quick Start
  • 📖 Usage
  • 📁 Project Structure
  • Performance Notes
  • 🤝 Contributing
  • 📄 License
  • 🙏 Acknowledgments
  • 📞 Contact & Support

✨ Features

  • πŸŽ™οΈ High-Quality Speech Synthesis: Generate natural-sounding speech using StyleTTS2 models
  • 🎭 Voice Cloning: Support for custom voice styles using reference audio
  • ⚑ ONNX Runtime: Optimized inference using ONNX Runtime for maximum performance
  • πŸ–₯️ CPU/CUDA Support: Runs on both CPU and CUDA-enabled GPUs
  • πŸš€ Real-time Performance: Efficient C++ implementation for real-time synthesis
  • πŸ”Š WAV Output: Direct WAV file generation at 24kHz sample rate
  • 🌍 Multi-language Support: Powered by eSpeak-NG phonemizer
  • πŸ“¦ Lightweight: No Python runtime dependencies needed for inference, Only used ~1GB VRAM on your GPU

📋 Requirements

System Requirements

  • OS: Linux, Windows
  • CPU: x86_64 processor
  • GPU (optional): NVIDIA GPU with CUDA support for GPU acceleration

Software Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| CMake | 3.10+ | Build system |
| C++ Compiler | C++17 compatible | GCC 7+, Clang 5+, MSVC 2017+ |
| ONNX Runtime | 1.17.0+ | Model inference engine |
| eSpeak-NG | Latest | Text-to-phoneme conversion |

🔧 Installation

Option 1: Using Pre-built Dependencies (Recommended)

For convenience, you can use the pre-built ONNX Runtime and eSpeak-NG binaries along with the pre-trained ONNX models:

📦 Download Pre-trained Models:

Note: You can also export your own fine-tuned models using the StyleTTS2-Lite repository.

Option 2: Building Dependencies from Source

If you prefer to build the dependencies from source for your specific environment, see the official repositories: ONNX Runtime GitHub and eSpeak-NG GitHub.

Build the Project (For Linux Users)

  1. Clone the repository:

git clone https://github.com/DDATT/StyleTTS2-onnx-cpp.git
cd StyleTTS2-onnx-cpp

  2. Create the build directory:

mkdir build
cd build

  3. Configure and build:

cmake ..
make

  4. Verify the installation:

./styleTTS2

💡 Tip: If you installed dependencies in custom locations, update the paths in CMakeLists.txt:

find_path(ONNX_RUNTIME_SESSION_INCLUDE_DIRS onnxruntime_cxx_api.h HINTS /your_onnxruntime_path/include/)
find_library(ONNX_RUNTIME_LIB onnxruntime HINTS /your_onnxruntime_path/lib/)

Build the Project (For Windows Users)

UNDER CONSTRUCTION

🚀 Quick Start

After building the project, follow these steps to synthesize your first audio:

  1. Download the models from Hugging Face
  2. Prepare reference audio (see below)
  3. Run synthesis with ./styleTTS2

Output will be saved as test.wav in the current directory.

📖 Usage

Step 1: Prepare Models

Download and place the following ONNX models in the trained_models/ directory:

| Model File | Description | Size |
| --- | --- | --- |
| plbert_simp.onnx | PL-BERT phoneme encoder | ~23 MB |
| bert_encoder.onnx | BERT text encoder | ~2 MB |
| final_simp.onnx | Main synthesis model | ~305 MB |
| style_encoder_simp.onnx | Style encoder (for reference) | ~55 MB |
| predictor_encoder_simp.onnx | Predictor encoder (for reference) | ~55 MB |

Your directory structure should look like:

trained_models/
├── plbert_simp.onnx
├── bert_encoder.onnx
├── final_simp.onnx
├── style_encoder_simp.onnx
└── predictor_encoder_simp.onnx

Step 2: Generate Reference Style Embeddings

To clone a specific voice, you need to generate style embeddings from reference audio:

  1. Prepare your reference audio:

    • Place a WAV file (24 kHz recommended) in the project root
    • Update the filename in Assets/get_reference_audio.py

  2. Install the Python dependencies:

pip install numpy soundfile torch torchaudio librosa onnxruntime

  3. Generate the embeddings:

cd Assets
python get_reference_audio.py

This will generate:

  • ref_s.bin - Style embedding (speaker characteristics)
  • ref_p.bin - Predictor embedding (prosody patterns)

  4. Move the embeddings to the model directory (a quick sanity-check sketch for these files follows the tips below):

mv ref_s.bin ref_p.bin ../trained_models/

💡 Tips for best results:

  • Use clean audio with minimal background noise
  • 5-15 seconds of speech is usually sufficient
  • The speaker should speak naturally and clearly
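If you want to verify the embeddings before running synthesis, you can inspect the files directly. The sketch below is a minimal, hypothetical check that assumes ref_s.bin and ref_p.bin are raw little-endian float32 arrays (as numpy.ndarray.tofile() would produce); confirm the actual layout against Assets/get_reference_audio.py and the LoadStyle implementation.

// check_embeddings.cpp -- hypothetical sanity check for a generated .bin file.
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("trained_models/ref_s.bin", std::ios::binary | std::ios::ate);
    if (!in) { std::fprintf(stderr, "cannot open ref_s.bin\n"); return 1; }

    const std::streamsize bytes = in.tellg();  // file size in bytes
    in.seekg(0, std::ios::beg);

    // Interpret the file as a flat float32 array (assumed format).
    std::vector<float> embedding(static_cast<size_t>(bytes) / sizeof(float));
    in.read(reinterpret_cast<char*>(embedding.data()),
            static_cast<std::streamsize>(embedding.size() * sizeof(float)));

    // A plausible embedding is a small vector of finite values.
    std::printf("ref_s.bin: %zu floats, first value = %f\n",
                embedding.size(), embedding.empty() ? 0.0f : embedding[0]);
    return 0;
}

An unexpectedly large float count or wildly out-of-range values usually means the script wrote a different format than assumed here.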

Step 3: Run Synthesis

cd build
./styleTTS2

What happens:

  1. βš™οΈ Loads all ONNX models and initializes the TTS engine
  2. 🎯 Loads reference style embeddings
  3. πŸ“ Processes input text through phonemizer
  4. 🎡 Synthesizes speech
  5. πŸ’Ύ Saves output to test.wav
  6. πŸ“Š Displays performance metrics

Expected output:

Initializing StyleTTS2...
Initialization took: 2.5 seconds
Synthesizing speech...

=== Results ===
Initialization time: 2.5 seconds
Inference time: 0.8 seconds
Total runtime: 3.3 seconds
Audio duration: 10.2 seconds

Synthesis completed successfully!
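For reference, these numbers correspond to a real-time factor of 0.8 s / 10.2 s ≈ 0.08, i.e. synthesis runs roughly 12× faster than playback on this configuration.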

Custom Integration Example

Here's how to integrate StyleTTS2 into your own C++ application:

#include "styletts2.h"
#include "wavfile.hpp"
#include <fstream>

int main() {
    try {
        // Initialize StyleTTS2
        // Parameters: (modelDir, espeakDataDir, useCuda)
        StyleTTS2 tts("trained_models", "espeak-ng/share/espeak-ng-data", false);
        
        // Load custom voice style
        tts.LoadStyle("trained_models/ref_s.bin", "trained_models/ref_p.bin");
        
        // Synthesize speech
        std::string text = "Hello, this is a test of speech synthesis.";
        float speed = 1.0f;  // 1.0 = normal speed, 0.5 = slower, 2.0 = faster
        std::vector<int16_t> audio = tts.synthesize(text, speed);
        
        // Save to WAV file
        std::ofstream audioFile("output.wav", std::ios::binary);
        writeWavHeader(24000, 2, 1, (int32_t)audio.size(), audioFile);
        audioFile.write((const char*)audio.data(), sizeof(int16_t) * audio.size());
        audioFile.close();
        
        std::cout << "Audio saved to output.wav" << std::endl;
        
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    
    return 0;
}
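As mentioned above, the same class can be exposed to C# and Unity through a thin C-style DLL wrapper. The sketch below is hypothetical: the exported names (tts_create, tts_load_style, tts_synthesize, tts_destroy) are illustrative and assume the StyleTTS2 API from the example above, not the actual exports of the upcoming C#/Unity repo.

// styletts2_capi.cpp -- hypothetical C wrapper for P/Invoke from C#/Unity.
#include "styletts2.h"
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

#if defined(_WIN32)
  #define TTS_API extern "C" __declspec(dllexport)
#else
  #define TTS_API extern "C"
#endif

TTS_API void* tts_create(const char* modelDir, const char* espeakDataDir, int useCuda) {
    try {
        return new StyleTTS2(modelDir, espeakDataDir, useCuda != 0);
    } catch (...) {
        return nullptr;  // caller checks for IntPtr.Zero on the C# side
    }
}

TTS_API void tts_load_style(void* handle, const char* refS, const char* refP) {
    static_cast<StyleTTS2*>(handle)->LoadStyle(refS, refP);
}

TTS_API int tts_synthesize(void* handle, const char* text, float speed,
                           int16_t* outSamples, int maxSamples) {
    auto* tts = static_cast<StyleTTS2*>(handle);
    std::vector<int16_t> audio = tts->synthesize(text, speed);
    // Copy at most maxSamples into the caller-provided buffer.
    int n = static_cast<int>(std::min<size_t>(audio.size(),
                                              static_cast<size_t>(maxSamples)));
    std::memcpy(outSamples, audio.data(), n * sizeof(int16_t));
    return n;  // number of 16-bit samples written
}

TTS_API void tts_destroy(void* handle) {
    delete static_cast<StyleTTS2*>(handle);
}

On the C# side, each export would map to a [DllImport] declaration, with the caller passing a pre-allocated short[] buffer to tts_synthesize.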

πŸ“ Project Structure

StyleTTS2-onnx-cpp/
├── 📄 CMakeLists.txt              # Build configuration
├── 📄 main.cpp                    # Main application entry point
├── 📄 README.md                   # This documentation
├── 📄 LICENSE.txt                 # MIT License
│
├── 📂 Assets/
│   └── get_reference_audio.py    # Python script to generate style embeddings
│
├── 📂 include/                    # Header files
│   ├── phonemize.h               # Phonemization interface (eSpeak-NG wrapper)
│   ├── styletts2.h               # StyleTTS2 main class interface
│   └── wavfile.hpp               # WAV file I/O utilities
│
├── 📂 src/                        # Source files
│   ├── phonemize.cpp             # Phonemization implementation
│   └── styletts2.cpp             # StyleTTS2 core implementation
│
└── 📂 trained_models/             # ONNX models (create this directory)
    ├── plbert_simp.onnx          # Download from Hugging Face
    ├── bert_encoder.onnx         # Download from Hugging Face
    ├── final_simp.onnx           # Download from Hugging Face
    ├── style_encoder_simp.onnx   # Download from Hugging Face
    ├── predictor_encoder_simp.onnx  # Download from Hugging Face
    ├── ref_s.bin                 # Generated from reference audio
    └── ref_p.bin                 # Generated from reference audio

Performance Notes

  • GPU acceleration: a ~100-word input synthesizes in under 1 second on an RTX 5060 Ti (see the timing sketch below)
  • Memory usage: ~1 GB of VRAM for models plus runtime on a CUDA GPU
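To reproduce these numbers in your own build, you can time the synthesize call directly. A minimal sketch using std::chrono and the API from the integration example above; the real-time factor computed here is this sketch's own metric, not necessarily how the demo binary reports its numbers:

// Hypothetical timing harness around synthesize().
#include "styletts2.h"
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // useCuda = true assumes a CUDA-capable GPU and a CUDA build of ONNX Runtime.
    StyleTTS2 tts("trained_models", "espeak-ng/share/espeak-ng-data", true);
    tts.LoadStyle("trained_models/ref_s.bin", "trained_models/ref_p.bin");

    auto t0 = std::chrono::steady_clock::now();
    std::vector<int16_t> audio = tts.synthesize("This is a benchmark sentence.", 1.0f);
    auto t1 = std::chrono::steady_clock::now();

    double inferenceSec = std::chrono::duration<double>(t1 - t0).count();
    double audioSec = audio.size() / 24000.0;  // mono 16-bit samples at 24 kHz

    std::cout << "Inference: " << inferenceSec << " s, audio: " << audioSec
              << " s, RTF: " << inferenceSec / audioSec << std::endl;
    return 0;
}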

🤝 Contributing

Contributions are welcome! Feel free to report bugs and issues, or suggest new features and improvements.

If you find this project useful, please consider giving it a star! ⭐

Development Guidelines

  • Follow C++17 standards
  • Add comments for complex logic
  • Test your changes thoroughly
  • Update documentation as needed

📄 License

This project is licensed under the MIT License - see the LICENSE.txt file for details.

Third-Party Licenses

πŸ™ Acknowledgments

Special thanks to:

📞 Contact & Support


Made with ❤️ by DDATT

No packages published