The official code repository for SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation. Paper | Demo

SimulMEGA (Simultaneous Generation by Mixture-of-Experts GAting) is an unsupervised policy learning framework for simultaneous speech translation (SimulST) that enables real-time, low-latency cross-lingual communication. By integrating prefix-based training with a lightweight Mixture-of-Experts (MoE) refiner, SimulMEGA learns optimal read/write decisions implicitly—without any inference-time overhead or architectural overhaul. Built on standard Transformer backbones (e.g., Whisper, CosyVoice 2), SimulMEGA requires only minimal modifications and supports both speech-to-text (S2TT) and text-to-speech (TTS) streaming within a unified framework.
- MoE routers learn when to read input or write output by balancing prefix and global context; no human-annotated policies are needed (a minimal illustrative sketch follows this list).
- The MoE refiner is training-only; inference uses the original model architecture, preserving speed and compatibility.
- The same core design works for both SimulST (S2TT) and streaming TTS, enabling full simultaneous speech-to-speech translation (S2ST).
- Easily adapts to existing models such as Whisper (for S2TT) or CosyVoice 2 (for TTS) via lightweight fine-tuning.
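The routing idea can be sketched in a few lines of PyTorch. The snippet below is a minimal, illustrative gate that pools the encoder states of the audio prefix read so far and emits a write probability; the class name, dimensions, and threshold are assumptions for illustration only and do not reproduce the actual SimulMEGA modules.

```python
# Minimal illustrative sketch (NOT the actual SimulMEGA implementation):
# a lightweight gate that turns the prefix encoder states seen so far
# into a read/write probability.
import torch
import torch.nn as nn

class ReadWriteGate(nn.Module):
    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        # Small scorer over the pooled prefix representation.
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prefix_states: torch.Tensor) -> torch.Tensor:
        # prefix_states: (batch, prefix_len, hidden_dim) encoder outputs
        # for the audio read so far; the sigmoid score acts as P(write).
        pooled = prefix_states.mean(dim=1)
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)

gate = ReadWriteGate()
p_write = gate(torch.randn(2, 37, 512))  # e.g. 37 frames read so far
should_write = p_write > 0.5             # otherwise keep reading input
```

In spirit, a threshold on such a score (compare the `--simul_threshold` flag in the inference example below) decides whether to emit the next target token or read more audio.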
To convert a Whisper model to the SimulMEGA format, you can use the script in utils/convert_whisper.py:
```bash
python split_whisper.py \
--whisper_ckpt ./whisper-medium.pt \
--output_ckpt ./distil_medium_ast.pt \
--model_size medium \
--n_audio_layer_shared 20 \
--n_audio_layer_ast 4 \
--n_text_layer_ast 12
```
The data is in the standard Hugging Face Audio format, with extra translation label columns, e.g., trans_zh, trans_en. Fill in the pretrained model path and data path in the config file, e.g., exp_spec/simulmegastt/simulmegastt.yaml.
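For reference, one way to sanity-check such a dataset with the Hugging Face `datasets` library is sketched below; the data directory is a placeholder, and only the `trans_zh`/`trans_en` column names come from the description above.

```python
# Sketch: inspect a Hugging Face Audio-format dataset with translation columns.
# The data_dir below is a placeholder for your own dataset location.
from datasets import load_dataset, Audio

ds = load_dataset("audiofolder", data_dir="/path/to/s2tt_data", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = ds[0]
print(sample["audio"]["array"].shape)  # decoded waveform
print(sample.get("trans_zh"))          # Chinese translation label
print(sample.get("trans_en"))          # English translation label
```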
Run the following command to start training:

```bash
python train.py --config simulmegastt
```

Here is an example script for inference:

```bash
python infer_stt.py \
--wav_path /path/to/audio.wav \
--ckpt_path /path/to/model.ckpt \
--tgt_lang en \
--simul_threshold 0.5 \
--label "This is the reference transcription."
```

Download the pretrained CosyVoice 2 model from here.
The data is in the standard Hugging Face Audio format. Fill in the pretrained model path and data path in the config file, e.g., exp_spec/simulmegatts/simulmegatts.yaml.
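If you prefer to set these paths programmatically, a sketch like the following works; note that the keys `pretrain_path` and `data_path` are assumptions about the YAML schema and should be checked against the shipped config file.

```python
# Sketch: point the experiment config at the pretrained model and dataset.
# The keys "pretrain_path" and "data_path" are assumed, not verified names.
import yaml

cfg_file = "exp_spec/simulmegatts/simulmegatts.yaml"
with open(cfg_file) as f:
    cfg = yaml.safe_load(f)

cfg["pretrain_path"] = "./CosyVoice2-0.5B"  # downloaded CosyVoice 2 checkpoint
cfg["data_path"] = "/path/to/tts_data"      # Hugging Face Audio-format dataset

with open(cfg_file, "w") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True)
```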
Run the following command to start training:

```bash
python train.py --config simulmegatts
```

Here is an example script for inference:

```bash
python tts_streaming.py \
--ckpt_path ./simulmegalm_ckpt.pt \
--pretrain_path ./CosyVoice2-0.5B \
--prompt_wav ./reference.wav \
--prompt_text "这是参考音频的文本。" \
--output_dir ./generated_audio \
--threshold 0.3 \
--minratio 0.0 \
--device 0
```

The models are built upon the excellent work of Whisper and CosyVoice, and borrow a substantial amount of code from CosyVoice.