# SimulMEGA

The official code repository for *SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation*. Paper | Demo

## 📌 Overview

SimulMEGA (Simultaneous Generation by Mixture-of-Experts GAting) is an unsupervised policy learning framework for simultaneous speech translation (SimulST) that enables real-time, low-latency cross-lingual communication. By integrating prefix-based training with a lightweight Mixture-of-Experts (MoE) refiner, SimulMEGA learns optimal read/write decisions implicitly—without any inference-time overhead or architectural overhaul. Built on standard Transformer backbones (e.g., Whisper, CosyVoice 2), SimulMEGA requires only minimal modifications and supports both speech-to-text (S2TT) and text-to-speech (TTS) streaming within a unified framework.

## ✨ Highlights

**Unsupervised Policy Learning**

MoE routers learn when to read input or write output by balancing prefix and global context—no human-annotated policies needed.

**Zero Inference Overhead**

The MoE refiner is training-only; inference uses the original model architecture, preserving speed and compatibility.

**Unified Streaming Framework**

Same core design works for both SimulST (S2TT) and streaming TTS—enabling full simultaneous speech-to-speech translation (S2ST).

**Plug-and-Play Compatibility**

Easily adapts to existing models like Whisper (for S2TT) or CosyVoice 2 (for TTS) via lightweight fine-tuning.

## 💡 Simultaneous Speech to Text

### Prepare Pretrained Model

Convert the Whisper model to the SimulMEGA format; you may use the script in utils/convert_whisper.py:

```bash
python split_whisper.py \
--whisper_ckpt ./whisper-medium.pt \
--output_ckpt ./distil_medium_ast.pt \
--model_size medium \
--n_audio_layer_shared 20 \
--n_audio_layer_ast 4 \
--n_text_layer_ast 12
```
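The flags above split Whisper-medium's 24 encoder layers into 20 shared layers plus 4 translation-specific (AST) layers, and keep 12 decoder layers for the AST branch. A minimal sketch of that partitioning (illustrative only: the real script also copies weights, and which decoder layers are kept is an assumption here):

```python
# Illustrative layer partition for Whisper-medium (24 encoder / 24 decoder
# layers). This only computes which layer indices land in which branch.
def split_layers(n_audio_total, n_shared, n_ast, n_text_total, n_text_ast):
    assert n_shared + n_ast == n_audio_total
    assert n_text_ast <= n_text_total
    shared = list(range(n_shared))                    # shared encoder layers
    ast_audio = list(range(n_shared, n_audio_total))  # AST encoder layers
    ast_text = list(range(n_text_ast))                # AST decoder layers
    return shared, ast_audio, ast_text                # (first-n kept: assumed)

shared, ast_audio, ast_text = split_layers(24, 20, 4, 24, 12)
```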

### Prepare Data & Config

The data is in the Hugging Face Audio format, with extra translation labels, e.g., `trans_zh`, `trans_en`. Fill in the pretrained model path and data path in the config file, e.g., `exp_spec/simulmegastt/simulmegastt.yaml`.
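For reference, one record in this layout might look like the following sketch. The `trans_*` field names come from the text above; the `audio` structure follows the standard Hugging Face `Audio` column, and everything else (file name, sampling rate) is a placeholder, not the repo's exact schema:

```python
# One hypothetical training record in Hugging Face Audio format with
# extra translation labels.
record = {
    "audio": {
        "path": "sample_0001.wav",       # placeholder file name
        "array": [0.0] * 16000,          # 1 s of silence, placeholder samples
        "sampling_rate": 16000,          # assumed rate, not from the repo
    },
    "trans_en": "This is the English translation.",
    "trans_zh": "这是中文翻译。",
}
```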

### Start Training

Run the following command to start training:

```bash
python train.py --config simulmegastt
```

### Inference

Here is an example script for inference:

```bash
python infer_stt.py \
  --wav_path /path/to/audio.wav \
  --ckpt_path /path/to/model.ckpt \
  --tgt_lang en \
  --simul_threshold 0.5 \
  --label "This is the reference transcription."
```
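To make the role of `--simul_threshold` concrete, here is a toy READ/WRITE loop: the model reads audio chunks and writes output only while a (hypothetical) router confidence clears the threshold. The scoring function and confidence decay below are placeholders for illustration, not the actual logic in `infer_stt.py`:

```python
def simul_policy(chunks, router_score, threshold=0.5):
    """Toy READ/WRITE loop: ingest audio chunks, emit output tokens
    while the (hypothetical) router confidence clears the threshold."""
    actions = []
    for i, chunk in enumerate(chunks):
        actions.append("READ")          # consume the next audio chunk
        score = router_score(i, chunk)  # confidence that writing is safe
        while score >= threshold:
            actions.append("WRITE")     # emit one output token
            score -= 0.25               # placeholder confidence decay
    return actions

# Fake router: confidence grows as more audio context arrives.
actions = simul_policy([b"", b"", b""], lambda i, c: 0.25 * (i + 1))
```

With this fake router the loop defers output early (low confidence) and emits more tokens as context accumulates, which is the general shape of a simultaneous policy.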

## 💡 Simultaneous Text to Speech

### Prepare Pretrained Model

Download the pretrained CosyVoice 2 model from here

### Prepare Data & Config

The data is in the standard Hugging Face Audio format. Fill in the pretrained model path and data path in the config file, e.g., `exp_spec/simulmegatts/simulmegatts.yaml`.

### Start Training

Run the following command to start training:

```bash
python train.py --config simulmegatts
```

### Inference

Here is an example script for inference:

```bash
python tts_streaming.py \
  --ckpt_path ./simulmegalm_ckpt.pt \
  --pretrain_path ./CosyVoice2-0.5B \
  --prompt_wav ./reference.wav \
  --prompt_text "这是参考音频的文本。" \
  --output_dir ./generated_audio \
  --threshold 0.3 \
  --minratio 0.0 \
  --device 0
```
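Streaming synthesis produces audio chunk by chunk rather than all at once. A minimal sketch of collecting such chunks into a wav file under `--output_dir` (the file name, mono int16 PCM layout, and 24 kHz rate are assumptions for illustration, not taken from `tts_streaming.py`):

```python
import os
import wave

def save_stream(chunks, output_dir, sr=24000):
    """Append streamed int16 PCM chunks into one wav file as they arrive.
    sr=24000 and the file name are assumptions, not the repo's choices."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "stream_0.wav")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # int16
        f.setframerate(sr)
        for chunk in chunks:        # each chunk: raw bytes of int16 samples
            f.writeframes(chunk)    # written incrementally, chunk by chunk
    return path
```

Writing frames as each chunk arrives is what lets playback begin before the full utterance has been generated.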

## 📚 Acknowledgments

The models are built upon the excellent work of Whisper and CosyVoice; much of the code is borrowed from CosyVoice.
