feat(tts): Add batched ICL generation to improve speed for Qwen3-TTS #644
Draft
HFrost0 wants to merge 2 commits into Blaizzy:main from
Conversation
…arallel reference audio encoding and batched input embedding construction.
Context
This PR addresses slow generation speeds when doing voice cloning (ICL) on long texts for Qwen3-TTS.
Previously, handling long text meant choosing between two bad workarounds: calling `generate` sequentially for each chunk avoided OOMs, but was extremely slow because the heavily-parameterized reference audio and speaker embeddings had to be recalculated for every chunk.

Description
This PR introduces batched ICL generation to fix these issues natively:
The text is now split internally via `split_pattern` and evaluated in parallel. The reference audio features (`ICLSharedEmbeddings`) are computed exactly once and shared across the batch using left-padding. Each segment's decode follows the same ref-code-prepend + proportional-trim approach as single-segment ICL, ensuring consistent acoustic quality. This completely removes the redundant decoding overhead and provides a massive speedup.

Changes in the codebase
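The shared-reference, left-padded batch construction can be sketched roughly as follows. This is a minimal illustration with invented pad id and helper names, not the PR's actual code:

```python
# Hypothetical sketch of left-padded batch construction: each segment's
# token sequence is prepended with pad tokens so all rows share one length,
# and the shared reference (ICL) tokens are built once and reused across
# the batch instead of being recomputed per segment.

PAD = 0  # hypothetical pad token id


def left_pad_batch(segments: list[list[int]], pad: int = PAD):
    """Left-pad variable-length token lists into a rectangular batch."""
    max_len = max(len(s) for s in segments)
    batch, mask = [], []
    for s in segments:
        n_pad = max_len - len(s)
        batch.append([pad] * n_pad + s)          # pad on the LEFT
        mask.append([0] * n_pad + [1] * len(s))  # 1 marks real tokens
    return batch, mask


def build_icl_inputs(ref_tokens: list[int], segments: list[list[int]]):
    """Prepend the shared reference tokens once per row, then left-pad."""
    rows = [ref_tokens + seg for seg in segments]  # ref built once, reused
    return left_pad_batch(rows)


batch, mask = build_icl_inputs([9, 9], [[1, 2, 3], [4, 5]])
print(batch)  # → [[9, 9, 1, 2, 3], [0, 9, 9, 4, 5]]
```

Left-padding (rather than right-padding) keeps every row's last real token aligned at the end of the sequence, which is the convenient layout for autoregressive decoding.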
- Added `_prepare_icl_shared_context` and created `ICLSharedEmbeddings` to manage shared parameters.
- Added `_batch_generate_icl` for executing parallel batched generations.
- Updated `generate()` to split text and route multi-segment ICL to the batch path, while preserving single-segment streaming support.

Changes outside the codebase
Additional information - Benchmark
A quick speed test on an M3 Max doing voice cloning for an 8-segment long text (~225 Chinese characters):
Speedup: 3.8× (results are M3 Max specific; other Apple Silicon may vary)
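The speedup comes from amortizing the reference encoding across segments. A toy cost model (timings invented for illustration, not measured from this PR) shows the shape of the effect:

```python
# Toy cost model (illustrative only) of why batching helps: sequential ICL
# re-encodes the reference audio for every chunk, batched ICL encodes once.
REF_COST = 7.0   # hypothetical seconds to encode the reference audio
SEG_COST = 1.0   # hypothetical seconds to decode one text segment


def sequential_time(n_segments: int) -> float:
    # reference re-encoded for every segment
    return n_segments * (REF_COST + SEG_COST)


def batched_time(n_segments: int) -> float:
    # reference encoded once, shared across the batch
    return REF_COST + n_segments * SEG_COST


n = 8  # same segment count as the benchmark above
print(round(sequential_time(n) / batched_time(n), 1))  # → 4.3
```

With these invented costs the ratio is 4.3×; as the segment count grows, it approaches (REF_COST + SEG_COST) / SEG_COST, so the heavier the reference encoding relative to per-segment decoding, the larger the win.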
On `main`, the ICL path ignores `split_pattern` entirely: it always passes the full text to `_generate_icl` as a single segment. This PR changes that: ICL now respects `split_pattern` (default `"\n"`), which means:

- If the input text contains `\n`, it will be split and processed in batch.
- Users who previously received a single `GenerationResult` will now receive N results. Downstream code that assumes a single result may need updating.
- If `stream=True` is combined with multi-segment ICL, streaming is not supported. A warning is emitted, but the user gets non-streaming batch output instead of the real-time chunks they might expect.

Options to discuss:
1. Keep the new behavior: always split on `split_pattern`. Simple, consistent, and fast, but a breaking change for users with newlines in text.
2. Make batching opt-in (treat `split_pattern` as non-applicable for ICL unless some flag is set). Fully backward compatible but requires an API change to opt into batching.
3. When `stream=True` with multiple segments, fall back to sequential `_generate_icl` per segment (preserves streaming, loses batch speedup). Batch only when `stream=False`.

Happy to adjust based on your preference.
Checklist