Bug: CosyVoice3 produces garbled speech on torch >= 2.7 (required by RTX 50-series)

## Bug: CosyVoice3 produces garbled speech on torch >= 2.7 (required by RTX 50-series)

### Environment
- GPU: NVIDIA RTX 5060 (Blackwell sm_120)
- OS: Windows 11
- torch: 2.11.0+cu128
- transformers: 5.6.2
- CosyVoice commit: ace7c47 (latest main)
- Model: Fun-CosyVoice3-0.5B-2512 (official weights from modelscope)

### Steps to reproduce

```python
import torchaudio  # needed for torchaudio.save below

from cosyvoice.cli.cosyvoice import AutoModel

model = AutoModel(model_dir="pretrained_models/Fun-CosyVoice3-0.5B-2512", fp16=True)

# Zero-shot synthesis: target text, prompt text, prompt audio.
for i, j in enumerate(model.inference_zero_shot(
    "八百标兵奔北坡，北坡炮兵并排跑。",
    "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。",
    "./asset/zero_shot_prompt.wav",
    stream=False
)):
    # Each yielded item carries the synthesized waveform under "tts_speech".
    torchaudio.save(f"out_{i}.wav", j["tts_speech"], model.sample_rate)
```

Output is unintelligible noise. When transcribed back with FunASR-Nano, the result is gibberish (Japanese kana, repeated syllables).
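
To put a number on "gibberish", the round-trip check above can be scored as a character error rate (CER) between the input text and the ASR transcript. The snippet below is a minimal, standard-library sketch. The helper names (`edit_distance`, `strip_punct`, `cer`) are my own and not part of CosyVoice or FunASR, and the transcript has to come from your own ASR run.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def strip_punct(text: str) -> str:
    """Keep only letters and digits so punctuation does not affect the score."""
    return "".join(ch for ch in text if ch.isalnum())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against `reference`."""
    ref, hyp = strip_punct(reference), strip_punct(hypothesis)
    if not ref:
        raise ValueError("reference text is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "八百标兵奔北坡，北坡炮兵并排跑。"
    # Replace with the transcript produced by your ASR model for out_0.wav.
    transcript = reference
    print(f"CER = {cer(reference, transcript):.2f}")
```

A CER near 0 means the audio is intelligible. A CER close to or above 1 is consistent with the collapsed output described here.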

### Root cause

CosyVoice3 LM inference (cosyvoice/llm/llm.py:459) uses `forward_one_step()`, which calls `Qwen2ForCausalLM.forward()` from the transformers library. In transformers 5.x, `Qwen2Attention.forward()` dispatches attention through `ALL_ATTENTION_FUNCTIONS.get_interface()`, which replaced the direct `scaled_dot_product_attention` call used in 4.x. The RMSNorm, KV cache, and rotary embedding APIs also changed.

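These API-level differences cause minute numerical drift. CosyVoice3 generates speech tokens **autoregressively**, so each token is conditioned on all the tokens before it and any per-step error is fed back into the next step. After 100+ steps, the accumulated drift collapses the output into noise.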
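
To illustrate the compounding effect, here is a toy sketch in plain Python. It is not CosyVoice3 code: the "decoder" is a logistic map chosen only because it is a simple feedback loop that is sensitive to small errors, and `VOCAB_SIZE`, `toy_decode`, and `first_divergence` are hypothetical names. A real LM may be far less sensitive than this toy. `first_divergence` is still useful on its own for comparing speech-token sequences dumped from a known-good and an affected environment.

```python
from typing import List, Optional, Sequence

VOCAB_SIZE = 16  # toy vocabulary, unrelated to CosyVoice3's speech tokens


def toy_decode(x0: float, steps: int, eps: float = 0.0) -> List[int]:
    """Toy feedback loop: each state depends on the previous one.

    `eps` is a tiny error added at every step, standing in for numerical
    differences between two library versions.
    """
    x = x0
    tokens = []
    for _ in range(steps):
        x = 3.9 * x * (1.0 - x) + eps       # next state from previous state
        x = min(max(x, 0.0), 1.0 - 1e-12)   # keep the state inside [0, 1)
        tokens.append(int(x * VOCAB_SIZE))  # quantize the state to a "token"
    return tokens


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing position, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one sequence is a strict prefix of the other
    return None


if __name__ == "__main__":
    reference = toy_decode(0.3, steps=200)
    drifted = toy_decode(0.3, steps=200, eps=1e-9)
    idx = first_divergence(reference, drifted)
    print("first differing step:", idx)
    if idx is not None:
        print("reference:", reference[idx:idx + 8])
        print("drifted:  ", drifted[idx:idx + 8])
```

The two runs agree for the first steps and then separate completely, even though the injected error is only 1e-9 per step.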

### Why CosyVoice2 works

I verified that CosyVoice2 runs successfully in the same environment. The reason is architectural: CosyVoice2 uses `cosyvoice/transformer/encoder.py:forward_chunk()`, a hand-written chunk-based transformer that calls `torch.nn.functional.scaled_dot_product_attention` directly and bypasses the transformers Qwen2 wrappers entirely.

### Why this matters for RTX 50-series users

RTX 50-series GPUs require torch >= 2.7 for sm_120 support, but the official CosyVoice3 requirements.txt pins torch 2.3.1, which has no sm_120 support. Users are therefore forced onto torch >= 2.7 together with transformers >= 5.x, a hard incompatibility that does not exist on 40-series and older GPUs.
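
For anyone checking whether they are in the same situation, the sketch below reads the installed versions with the standard library only, so it does not import torch or transformers. The thresholds encode the claims made in this report (torch < 2.7 lacks sm_120, and transformers 5.x is the configuration where output was garbled). They are not an official compatibility matrix, and `installed_version`, `major_minor`, and `diagnose` are names I made up.

```python
import re
from importlib import metadata
from typing import List, Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '2.11.0+cu128' -> (2, 11); a missing minor part counts as 0."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def diagnose(torch_version: Optional[str],
             transformers_version: Optional[str]) -> List[str]:
    """Return warnings based on the thresholds described in this report."""
    warnings = []
    if torch_version is None:
        warnings.append("torch is not installed")
    elif major_minor(torch_version) < (2, 7):
        warnings.append(
            f"torch {torch_version}: older than 2.7, no sm_120 support per this report")
    if transformers_version is None:
        warnings.append("transformers is not installed")
    elif major_minor(transformers_version)[0] >= 5:
        warnings.append(
            f"transformers {transformers_version}: 5.x, the configuration in which "
            "CosyVoice3 output was garbled in this report")
    return warnings


if __name__ == "__main__":
    for line in diagnose(installed_version("torch"),
                         installed_version("transformers")):
        print("WARNING:", line)
```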

### Related issues
- #1692 — same symptoms (closed, no response)
- #1741 — same symptoms (closed, no response)
- #1318 — CosyVoice2 on 50-series (torch too old, fixed by upgrading to 2.8+)

### Suggested fix
1. Backport CosyVoice3 inference to use `forward_chunk()` like CosyVoice2, or
2. Make `CosyVoice3LM.inference()` compatible with transformers 5.x Qwen2 API, or
3. Document the supported transformers version range.


Bug: CosyVoice3 produces garbled speech on torch >= 2.7 (required by RTX 50-series) #1886