- Tokenization of music score conditions and singing waveforms.
- Multi-stream language model token prediction.
- Conditional flow matching-based mel-spectrogram generation.
- A mel-to-wave vocoder.
This repo contains scripts for stage3.
- For stage1, please follow the ACE-Opencpop Recipe. For stage2, please follow the instructions in the ESPnet-Speechlm branch. You may also refer to the local fork for these two stages.
- For stage4, please follow the HiFi-GAN training in ParallelWaveGAN, or refer to the local fork.
We use a conditional flow matching model that transforms source Gaussian noise into the target mel spectrogram, conditioned on the codec tokens predicted by the SLM.
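To make the idea concrete, here is a minimal NumPy sketch of conditional flow matching, assuming the straight-line (optimal-transport) probability path used in Matcha-TTS. It is not the code in flow.py: the function names, the `sigma_min` value, the tensor shapes, and the `vector_field(x_t, t, cond)` callable are all hypothetical and stand in for whatever network and config this repo actually uses.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small constant; not taken from this repo's config


def cfm_training_pair(x1, t, rng, sigma_min=SIGMA_MIN):
    """Sample a point on the straight path from Gaussian noise to the target mel.

    x1: target mel spectrogram, shape (batch, n_mels, frames)
    t:  times in [0, 1], shape (batch,)
    Returns (x_t, u_t), where u_t is the regression target for the vector field.
    """
    x0 = rng.standard_normal(x1.shape)          # source Gaussian noise
    t = t.reshape(-1, 1, 1)                     # broadcast over mel bins and frames
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0           # d x_t / d t along this path
    return x_t, u_t


def cfm_loss(vector_field, x1, cond, rng):
    """Mean squared error between the predicted and target vector fields."""
    t = rng.uniform(size=x1.shape[0])
    x_t, u_t = cfm_training_pair(x1, t, rng)
    v = vector_field(x_t, t, cond)              # cond: e.g. embedded codec tokens
    return float(np.mean((v - u_t) ** 2))


def euler_sample(vector_field, cond, shape, rng, n_steps=10):
    """Generate a mel spectrogram by integrating the ODE from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * vector_field(x, t, cond)   # one explicit Euler step
    return x
```

In training, `vector_field` would be the neural network and `cond` the representation of the SLM-predicted codec tokens; at inference, `euler_sample` turns noise into a mel spectrogram that is then passed to the stage4 vocoder.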
Please modify the paths and config directly in the main entry script, flow.py, then run:
python flow.py
- Update stage1 and stage2 processing scripts to an ESPnet local fork.
- Update stage4 processing scripts to a ParallelWaveGAN local fork.
We thank INSPIREMUSIC and Matcha-TTS for releasing their code. Our work is also based on OpusLM and ESPnet-Codec.