
Bug: CosyVoice3 produces garbled speech on torch >= 2.7 (required by RTX 50-series) #1886

@TOMUIV

Environment

  • GPU: NVIDIA RTX 5060 (Blackwell sm_120)
  • OS: Windows 11
  • torch: 2.11.0+cu128
  • transformers: 5.6.2
  • CosyVoice commit: ace7c47 (latest main)
  • Model: Fun-CosyVoice3-0.5B-2512 (official weights from modelscope)

Steps to reproduce

import torchaudio
from cosyvoice.cli.cosyvoice import AutoModel

# Load the official CosyVoice3 weights in half precision.
model = AutoModel(model_dir="pretrained_models/Fun-CosyVoice3-0.5B-2512", fp16=True)

# Zero-shot synthesis: target text, prompt text, prompt audio.
for i, j in enumerate(model.inference_zero_shot(
    "八百标兵奔北坡,北坡炮兵并排跑。",
    "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。",
    "./asset/zero_shot_prompt.wav",
    stream=False
)):
    torchaudio.save(f"out_{i}.wav", j["tts_speech"], model.sample_rate)

Output is unintelligible noise. When transcribed back with FunASR-Nano, the result is gibberish (Japanese kana, repeated syllables).
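To put a number on "gibberish", the round-trip transcript can be scored against the input text with a character error rate (CER). The following is a minimal sketch in plain Python; the helper names are hypothetical, and the transcript is assumed to come from whatever ASR step you already run (no FunASR calls are shown here).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (can exceed 1.0)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For a fair comparison, strip punctuation from both strings first. A healthy run should score close to 0, while garbled audio would be expected to score near or above 1.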

Root cause

CosyVoice3 LM inference (cosyvoice/llm/llm.py:459) uses forward_one_step(), which calls Qwen2ForCausalLM.forward() through the transformers library. In transformers >= 5.x, Qwen2Attention.forward() dispatches through ALL_ATTENTION_FUNCTIONS.get_interface(), which replaced the direct scaled_dot_product_attention call from 4.x. The RMSNorm, KV cache, and rotary embedding APIs also changed.

These API-level differences cause minute numerical drift. Because CosyVoice3 generates speech tokens autoregressively, every step conditions on the tokens produced before it, so after 100+ steps the accumulated drift collapses the output into noise.
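One way to test the drift hypothesis would be to dump the generated speech-token IDs from a working environment and from the failing one, then locate where they first disagree. The sketch below assumes both runs use the same seed and sampling settings (otherwise the sequences differ for unrelated reasons). The helper names are hypothetical, and how the tokens are captured is left open because the issue does not describe a hook for it.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for idx, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return idx
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def agreement_ratio(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions (over the longer sequence) where tokens match."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest
```

If drift is the cause, the two sequences should agree for an initial stretch and then diverge for good. A mismatch at index 0 would instead point to a difference that is present from the very first step.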

Why CosyVoice2 works

I verified that CosyVoice2 runs successfully in the same environment. The reason is architectural: CosyVoice2 uses cosyvoice/transformer/encoder.py:forward_chunk(), a hand-written chunk-based transformer that calls torch.nn.functional.scaled_dot_product_attention directly and bypasses the transformers Qwen2 wrappers entirely.

Why this matters for RTX 50-series users

RTX 50-series GPUs require torch >= 2.7 (sm_120 support). The official CosyVoice3 requirements.txt pins torch 2.3.1 (no sm_120). Users are forced onto torch >= 2.7 + transformers >= 5.x, creating a hard incompatibility that does not exist for 40-series and older GPUs.

Related issues

Suggested fix

  1. Backport CosyVoice3 inference to use forward_chunk() like CosyVoice2, or
  2. Make CosyVoice3LM.inference() compatible with transformers 5.x Qwen2 API, or
  3. Document the supported transformers version range (the garbling appears on >= 5.x, so this would be an upper bound).
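Until one of these lands, a startup check could at least warn about the combination reported here. This is a rough sketch and not part of CosyVoice: the function names are hypothetical, and the thresholds (torch 2.7 for sm_120, transformers 5 for the garbling) are taken from this report rather than verified independently.

```python
import re
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.11.0+cu128' -> (2, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_environment(torch_version: str, transformers_version: str) -> List[str]:
    """Return warnings for the version combination described in this issue."""
    warnings = []
    if parse_version(torch_version) < (2, 7):
        warnings.append("torch < 2.7: no sm_120 support for RTX 50-series")
    if parse_version(transformers_version) >= (5,):
        warnings.append("transformers >= 5: CosyVoice3 output reported garbled")
    return warnings
```

In practice the two strings could come from `importlib.metadata.version("torch")` and `importlib.metadata.version("transformers")`, with each returned warning printed before the model is loaded.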
