The official code repository for SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation. Paper | Demo

SimulMEGA (Simultaneous Generation by Mixture-of-Experts GAting) is an unsupervised policy learning framework for simultaneous speech translation (SimulST) that enables real-time, low-latency cross-lingual communication. By integrating prefix-based training with a lightweight Mixture-of-Experts (MoE) refiner, SimulMEGA learns optimal read/write decisions implicitly—without any inference-time overhead or architectural overhaul. Built on standard Transformer backbones (e.g., Whisper, CosyVoice 2), SimulMEGA requires only minimal modifications and supports both speech-to-text (S2TT) and text-to-speech (TTS) streaming within a unified framework.
- MoE routers learn when to read input or write output by balancing prefix and global context; no human-annotated policies are needed (a minimal illustrative sketch follows this list).
- The MoE refiner is training-only; inference uses the original model architecture, preserving speed and compatibility.
- The same core design works for both SimulST (S2TT) and streaming TTS, enabling full simultaneous speech-to-speech translation (S2ST).
- Easily adapts to existing models such as Whisper (for S2TT) or CosyVoice 2 (for TTS) via lightweight fine-tuning.
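The routing idea can be sketched in a few lines of PyTorch. The snippet below is a minimal, illustrative gate that pools the encoder states of the audio prefix read so far and emits a write probability; the class name, dimensions, and threshold are assumptions for illustration only and do not reproduce the actual SimulMEGA modules.

```python
# Minimal illustrative sketch (NOT the actual SimulMEGA implementation):
# a lightweight gate that turns the prefix encoder states seen so far
# into a read/write probability.
import torch
import torch.nn as nn

class ReadWriteGate(nn.Module):
    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        # Small scorer over the pooled prefix representation.
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prefix_states: torch.Tensor) -> torch.Tensor:
        # prefix_states: (batch, prefix_len, hidden_dim) encoder outputs
        # for the audio read so far; the sigmoid score acts as P(write).
        pooled = prefix_states.mean(dim=1)
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)

gate = ReadWriteGate()
p_write = gate(torch.randn(2, 37, 512))  # e.g. 37 frames read so far
should_write = p_write > 0.5             # otherwise keep reading input
```

In spirit, a threshold on such a score (compare the `--simul_threshold` flag in the inference example below) decides whether to emit the next target token or read more audio.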
To convert a Whisper model to the SimulMEGA format, you can use the script in utils/convert_whisper.py:
```bash
python split_whisper.py \
--whisper_ckpt ./whisper-medium.pt \
--output_ckpt ./distil_medium_ast.pt \
--model_size medium \
--n_audio_layer_shared 20 \
--n_audio_layer_ast 4 \
--n_text_layer_ast 12
```
The data is in the standard Hugging Face Audio format, with extra translation label columns, e.g., trans_zh, trans_en. Fill in the pretrained model path and data path in the config file, e.g., exp_spec/simulmegastt/simulmegastt.yaml.
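For reference, one way to sanity-check such a dataset with the Hugging Face `datasets` library is sketched below; the data directory is a placeholder, and only the `trans_zh`/`trans_en` column names come from the description above.

```python
# Sketch: inspect a Hugging Face Audio-format dataset with translation columns.
# The data_dir below is a placeholder for your own dataset location.
from datasets import load_dataset, Audio

ds = load_dataset("audiofolder", data_dir="/path/to/s2tt_data", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = ds[0]
print(sample["audio"]["array"].shape)  # decoded waveform
print(sample.get("trans_zh"))          # Chinese translation label
print(sample.get("trans_en"))          # English translation label
```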
Run the following command to start training:

```bash
python train.py --config simulmegastt
```

Here is an example script for inference:

```bash
python infer_stt.py \
--wav_path /path/to/audio.wav \
--ckpt_path /path/to/model.ckpt \
--tgt_lang en \
--simul_threshold 0.5 \
--label "This is the reference transcription."
```

Download the pretrained CosyVoice 2 model from here.
The data is in the standard Hugging Face Audio format. Fill in the pretrained model path and data path in the config file, e.g., exp_spec/simulmegatts/simulmegatts.yaml.
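If you prefer to set these paths programmatically, a sketch like the following works; note that the keys `pretrain_path` and `data_path` are assumptions about the YAML schema and should be checked against the shipped config file.

```python
# Sketch: point the experiment config at the pretrained model and dataset.
# The keys "pretrain_path" and "data_path" are assumed, not verified names.
import yaml

cfg_file = "exp_spec/simulmegatts/simulmegatts.yaml"
with open(cfg_file) as f:
    cfg = yaml.safe_load(f)

cfg["pretrain_path"] = "./CosyVoice2-0.5B"  # downloaded CosyVoice 2 checkpoint
cfg["data_path"] = "/path/to/tts_data"      # Hugging Face Audio-format dataset

with open(cfg_file, "w") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True)
```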
Run the following command to start training:

```bash
python train.py --config simulmegatts
```

Here is an example script for inference:

```bash
python tts_streaming.py \
--ckpt_path ./simulmegalm_ckpt.pt \
--pretrain_path ./CosyVoice2-0.5B \
--prompt_wav ./reference.wav \
--prompt_text "这是参考音频的文本。" \
--output_dir ./generated_audio \
--threshold 0.3 \
--minratio 0.0 \
--device 0
```

The models are built upon the excellent work of Whisper and CosyVoice, and borrow a substantial amount of code from CosyVoice.