Bug: CosyVoice3 produces garbled speech on torch >= 2.7 (required by RTX 50-series)
Environment
- GPU: NVIDIA RTX 5060 (Blackwell sm_120)
- OS: Windows 11
- torch: 2.11.0+cu128
- transformers: 5.6.2
- CosyVoice commit: ace7c47 (latest main)
- Model: Fun-CosyVoice3-0.5B-2512 (official weights from modelscope)
Steps to reproduce
import torchaudio
from cosyvoice.cli.cosyvoice import AutoModel

model = AutoModel(model_dir="pretrained_models/Fun-CosyVoice3-0.5B-2512", fp16=True)
# Arguments: target text, prompt text, prompt audio
for i, j in enumerate(model.inference_zero_shot(
    "八百标兵奔北坡,北坡炮兵并排跑。",
    "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。",
    "./asset/zero_shot_prompt.wav",
    stream=False,
)):
    torchaudio.save(f"out_{i}.wav", j["tts_speech"], model.sample_rate)
Output is unintelligible noise. When transcribed back with FunASR-Nano, the result is gibberish (Japanese kana, repeated syllables).
Root cause
CosyVoice3 LM inference (cosyvoice/llm/llm.py:459) uses forward_one_step(), which calls Qwen2ForCausalLM.forward() from the transformers library. In transformers 5.x, Qwen2Attention.forward() dispatches attention through ALL_ATTENTION_FUNCTIONS.get_interface() instead of calling scaled_dot_product_attention directly as in 4.x. The RMSNorm, KV cache, and rotary embedding APIs also changed.
These API-level differences cause minute numerical drift. CosyVoice3 generates speech tokens autoregressively, so each step conditions on all previous ones; after 100+ steps, the accumulated drift collapses the output into noise.
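One way to test the drift hypothesis is to dump the generated speech token IDs from a known-good environment and from the failing one, with sampling seeded identically, and look for the first step where they differ. The following is a minimal sketch; first_divergence is a hypothetical helper, not part of CosyVoice, and how the token IDs are captured from the LM is left out.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, the length of the shorter one is returned.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical token dumps from two environments (made-up values).
reference_tokens = [101, 57, 3302, 18, 944]
failing_tokens = [101, 57, 3302, 21, 7]
print(first_divergence(reference_tokens, failing_tokens))
```

A divergence within the first few steps would point to a functional mismatch rather than slowly accumulated drift, while a late divergence would be consistent with the drift explanation above.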
Why CosyVoice2 works
I verified that CosyVoice2 runs correctly in the same environment. The reason is architectural: CosyVoice2 uses cosyvoice/transformer/encoder.py:forward_chunk(), a hand-written chunk-based transformer that calls torch.nn.functional.scaled_dot_product_attention directly and bypasses the transformers Qwen2 wrappers entirely.
Why this matters for RTX 50-series users
RTX 50-series GPUs require torch >= 2.7 (sm_120 support). The official CosyVoice3 requirements.txt pins torch 2.3.1 (no sm_120). Users are forced onto torch >= 2.7 + transformers >= 5.x, creating a hard incompatibility that does not exist for 40-series and older GPUs.
Related issues
Suggested fix
- Backport CosyVoice3 inference to use forward_chunk() like CosyVoice2, or
- Make CosyVoice3LM.inference() compatible with transformers 5.x Qwen2 API, or
- Document minimum transformers version constraint.
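For the documentation option, a startup check could at least warn users before they generate noise. This is a rough sketch, not CosyVoice code: check_transformers and major_version are hypothetical helpers, and the default bound of 5 is an assumption taken from this report rather than a verified boundary.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '5.6.2'."""
    return int(version.split(".", 1)[0])


def check_transformers(version: str, first_untested_major: int = 5) -> Optional[str]:
    """Return a warning string if `version` is at or above the untested major."""
    if major_version(version) >= first_untested_major:
        return (
            f"transformers {version} is untested with CosyVoice3; "
            f"versions below {first_untested_major}.0 are expected to work"
        )
    return None


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        installed = None
    if installed is not None:
        warning = check_transformers(installed)
        if warning:
            print("WARNING:", warning)
```

A warning rather than a hard error seems safer here, since RTX 50-series users may have no way to install an older combination of packages.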