
feat(tts): Add batched ICL generation to improve speed for Qwen3-TTS#644

Draft
HFrost0 wants to merge 2 commits into Blaizzy:main from HFrost0:feat_qwen3tts_batch_icl2

Conversation


@HFrost0 HFrost0 commented Apr 10, 2026

Context

This PR addresses slow generation speeds when doing voice cloning (ICL) on long texts for Qwen3-TTS.

Previously, handling long text meant choosing between two bad workarounds:

  1. Passing everything at once: the context often grew too long, causing OOMs or model hallucinations.
  2. Splitting the text externally: running `generate` sequentially for each chunk avoided OOMs, but was extremely slow because the expensive reference-audio encoding and speaker embeddings had to be recomputed for every chunk.
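The cost of the second workaround can be seen in a toy sketch (pure Python with stand-in names, not the real mlx-audio API): every sequential call re-runs the same reference encoding.

```python
# Hypothetical sketch of the old external-splitting workaround.
# All names here are illustrative stand-ins, not the real API.
calls = {"ref_encodes": 0}

def encode_reference(ref_audio):
    # Stand-in for the expensive reference-audio encoding step.
    calls["ref_encodes"] += 1
    return f"ref-features({ref_audio})"

def generate_one(text, ref_audio):
    # Each sequential call re-encodes the same reference audio.
    ref = encode_reference(ref_audio)
    return f"audio[{text}|{ref}]"

chunks = "segment one\nsegment two\nsegment three".split("\n")
outputs = [generate_one(c, "ref.wav") for c in chunks]

# The identical reference audio was encoded once per chunk.
print(calls["ref_encodes"])  # → 3
```

With batching, `encode_reference` would run once regardless of the number of chunks.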

Description

This PR introduces batched ICL generation to fix these issues natively:

  • Batched Parallel Generation:
    The text is now split internally via split_pattern and the segments are generated in parallel. The reference audio features (ICLSharedEmbeddings) are computed exactly once and shared across the batch using left-padding. Each segment's decode follows the same ref-code-prepend + proportional-trim approach as single-segment ICL, preserving acoustic quality. This removes the redundant per-chunk re-encoding overhead and yields a large speedup (see the benchmark below).
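The left-padding idea above can be sketched minimally with plain Python lists standing in for MLX arrays (the helper name is illustrative, not from the PR):

```python
# Minimal left-padding sketch. Variable-length segments are padded on
# the LEFT so that all sequences end at the same position, where new
# tokens are appended during batched generation.
PAD_ID = 0

def left_pad_batch(segments, pad_id=PAD_ID):
    """Left-pad token-id lists to a common length and build a mask
    marking real (1) vs padded (0) positions."""
    max_len = max(len(s) for s in segments)
    padded, mask = [], []
    for s in segments:
        n_pad = max_len - len(s)
        padded.append([pad_id] * n_pad + s)
        mask.append([0] * n_pad + [1] * len(s))
    return padded, mask

segments = [[5, 6, 7], [8, 9], [1, 2, 3, 4]]
batch, attn_mask = left_pad_batch(segments)
print(batch[1])      # → [0, 0, 8, 9]
print(attn_mask[1])  # → [0, 0, 1, 1]
```

Left-padding (rather than right-padding) keeps the generation frontier aligned across the batch, so the shared reference embeddings can be prepended uniformly.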

Changes in the codebase

  • Extracted reference audio processing into _prepare_icl_shared_context and created ICLSharedEmbeddings to manage shared parameters.
  • Implemented _batch_generate_icl for executing parallel batched generations.
  • Updated generate() to split text and route multi-segment ICL to the batch path, while preserving single-segment streaming support.
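The routing described above might look roughly like this (a hedged sketch: function names and signatures are assumptions modeled on the description, not the PR's actual code):

```python
# Illustrative routing sketch for generate(): split text, batch
# multi-segment ICL, keep single-segment streaming. Not the real API.
import warnings

def _batch_generate_icl(segments):
    # Stand-in for the parallel batched path.
    return [f"audio[{s}]" for s in segments]

def _generate_icl(segment, stream=False):
    # Stand-in for the original single-segment path.
    return f"audio[{segment}]"

def generate(text, split_pattern="\n", stream=False, icl=True):
    segments = [s for s in text.split(split_pattern) if s.strip()]
    if icl and len(segments) > 1:
        if stream:
            warnings.warn("streaming is not supported for multi-segment ICL")
        return _batch_generate_icl(segments)  # one result per segment
    # Single segment: preserve the existing (streaming-capable) path.
    return [_generate_icl(segments[0] if segments else text, stream=stream)]

print(generate("hello\nworld"))  # → ['audio[hello]', 'audio[world]']
print(generate("only one"))      # → ['audio[only one]']
```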

Changes outside the codebase

  • None

Additional information - Benchmark

A quick speed test on an M3 Max doing voice cloning for an 8-segment long text (~225 Chinese characters):

| Generation Mode | Latency |
| --- | --- |
| Sequential (Old) | ~47.5 s |
| Batched (This PR) | ~12.4 s |

Speedup: 3.8× (results are M3 Max specific; other Apple Silicon may vary)

⚠️ Behavioral Change — Feedback Requested

On main, the ICL path ignores split_pattern entirely — it always passes the full text to _generate_icl as a single segment. This PR changes that: ICL now respects split_pattern (default "\n"), which means:

  1. Multi-segment output: If the input text contains \n, it will be split and processed in batch. Users who previously received a single GenerationResult will now receive N results. Downstream code that assumes a single result may need updating.
  2. Streaming silently degrades: When stream=True is combined with multi-segment ICL, streaming is not supported. A warning is emitted, but the user gets non-streaming batch output instead of the real-time chunks they might expect.
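For point 1, downstream callers that assumed a single result could absorb the change with a small normalization shim (an illustrative helper, not part of this PR):

```python
# Normalize the return shape: accept either a single result (old
# behavior on main) or a list of N results (this PR). The string
# values stand in for GenerationResult objects.
def normalize(results):
    """Return a list of results regardless of input shape."""
    return results if isinstance(results, list) else [results]

old_style = "audio-A"                # single result, as on main
new_style = ["audio-A", "audio-B"]   # N results, as in this PR

print(normalize(old_style))  # → ['audio-A']
print(normalize(new_style))  # → ['audio-A', 'audio-B']
```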

Options to discuss:

  • (A) Keep current behavior (this PR): ICL always respects split_pattern. Simple, consistent, and fast — but a breaking change for users with newlines in text.
  • (B) Only split when explicitly requested: Default to no splitting for ICL (e.g. treat split_pattern as non-applicable for ICL unless some flag is set). Fully backward compatible but requires an API change to opt into batching.
  • (C) Fallback for streaming: When stream=True with multiple segments, fall back to sequential _generate_icl per segment (preserves streaming, loses batch speedup). Batch only when stream=False.
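Option (C) could be sketched as follows (hypothetical names, not the PR's code): stream each segment sequentially when requested, batch otherwise.

```python
# Sketch of option (C): sequential per-segment fallback when streaming
# is requested, batched path when it is not. Names are illustrative.
def generate_icl_segments(segments, stream=False):
    if stream:
        # Preserve streaming: yield each segment's result in order
        # (loses the batch speedup).
        def streamed():
            for seg in segments:
                yield f"audio[{seg}]"
        return streamed()
    # stream=False: process all segments in one batched pass.
    return [f"audio[{seg}]" for seg in segments]

segs = ["hello", "world"]
print(list(generate_icl_segments(segs, stream=True)))  # → ['audio[hello]', 'audio[world]']
print(generate_icl_segments(segs, stream=False))       # → ['audio[hello]', 'audio[world]']
```

Both paths produce the same results; only latency characteristics differ.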

Happy to adjust based on your preference.

Checklist

  • Tests added/updated
  • Documentation updated
  • Issue referenced (e.g., "Closes #...")

@HFrost0 HFrost0 changed the title feat(tts): Add batched ICL generation to improve speed and align decoding logic with official Qwen3-TTS feat(tts): Add batched ICL generation to improve speed for Qwen3-TTS Apr 10, 2026
@HFrost0 HFrost0 marked this pull request as draft April 11, 2026 03:13
