feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages) #641
acul3 wants to merge 6 commits into Blaizzy:main
cfeb6fd to 9423076
|
Am I doing something wrong? The 4bit model sounds broken. Audios converted with |
|
@acul3 Can you add a README for the model to document how to use it and what the supported repo ids are? See the other models for examples. Otherwise this looks good to me. |
|
Please also run |
|
@lucasnewman, did you hear the samples I uploaded? Something isn't right. |
|
@gianpaj Thanks for reporting! I've pushed a fix that addresses the quality issue. Root cause: This shifted all token positions by 1, causing the LM to produce degraded output. Can you try again? |
|
@lucasnewman Sure, let me add it. |
|
cmd1_000.wav using same config @gianpaj |
|
I tested the unmerged feat/add-voxcpm2 branch locally on Apple Silicon and What I tested:
What works:
What still does not work well:
What I tried that did not solve it:
Why I think this matters:
My current hypothesis:
|
|
Hi @krmao, do you have a sample with a prompt that I can test? I will try to match it against the original repo.
@acul3 Hi, yes — I verified this with Whisper, not only by listening. Here is one clean anonymized repro I tested locally on Apple Silicon using the PR
Validation method:
Expected:
Actual Whisper result from my local run:
Whisper also detected the language as en for this sample. So the issue I’m reporting is text fidelity, not only subjective audio quality. |
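One way to put a number on text fidelity, as opposed to subjective audio quality, is a word error rate between the expected text and the Whisper transcript. The following is a minimal sketch, assuming the transcript has already been produced separately; `word_error_rate` is a hypothetical helper and the strings in the usage lines are made up, not taken from the run above.

```python
def word_error_rate(expected: str, actual: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = expected.lower().split()
    hyp = actual.lower().split()
    if not ref:
        raise ValueError("expected text must contain at least one word")

    # prev[j] = edit distance between the words of ref seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Illustrative strings only (not from the reported run).
identical = word_error_rate("the quick brown fox", "the quick brown fox")
one_missing = word_error_rate("the quick brown fox", "the quick fox")
```

A value near 0 means the transcript matches the input text; a high value on an otherwise clean-sounding sample points at a text-conditioning problem rather than a vocoder problem.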
|
Unable to generate Chinese audio. The audio produced is meaningless noise with no relation to the text. Several other languages were also tested (Japanese, Korean, English) and worked correctly, except that the --speed parameter has no effect. |
Add MLX implementation of OpenBMB's VoxCPM2 with support for:
- Zero-shot TTS, voice design, voice cloning, and continuation modes
- 48kHz studio-quality audio via asymmetric AudioVAE (16kHz encode / 48kHz decode)
- Sample-rate conditioned decoder with SampleRateConditionLayer
- MiniCPM4 backbone with kv_channels and no_rope support
- VoxCPMLocDiTV2 diffusion transformer with multi-token mu
- Fusion concat projection for residual LM input
- Configurable warmup patches to eliminate onset artifacts
- CLI integration via --instruct (voice design), --ref_audio (cloning)
- 4-bit/8-bit quantization support (1.17x realtime on Apple Silicon)
- 15 unit tests in test_models.py

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Use tokenize+convert_tokens_to_ids instead of encode() to match PyTorch behavior (encode() adds BOS token that shifts all positions)
- Reduce warmup_patches to 1 when instruct is provided
- Enforce minimum cfg_value=2.0 for CLI compatibility

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
The BOS token fix resolved the onset artifacts, so warmup patches are no longer needed by default. Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
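The position shift behind the BOS token fix can be illustrated with a toy tokenizer. This is only a sketch: `ToyTokenizer` is a hypothetical stand-in with a three-entry vocabulary, not the VoxCPM2 tokenizer, and it simply mimics an `encode()` that prepends BOS by default.

```python
class ToyTokenizer:
    """Hypothetical stand-in; mimics a tokenizer whose encode() prepends BOS."""

    bos_token_id = 1

    def __init__(self):
        self.vocab = {"<s>": 1, "hello": 2, "world": 3}

    def tokenize(self, text):
        # Whitespace split is enough for the illustration.
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[token] for token in tokens]

    def encode(self, text):
        # The extra leading id is what shifts every later position by one.
        return [self.bos_token_id] + self.convert_tokens_to_ids(self.tokenize(text))


tok = ToyTokenizer()
with_bos = tok.encode("hello world")
without_bos = tok.convert_tokens_to_ids(tok.tokenize("hello world"))

print(with_bos)     # [1, 2, 3]
print(without_bos)  # [2, 3]
```

With the BOS id in front, the token that the reference implementation sees at position 0 lands at position 1, which matches the symptom described in the fix.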
- Add model README with usage docs, supported repo IDs, parameters, and architecture overview
- Run pre-commit (black + isort) to fix formatting

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Add decode_chunk_size to AudioVAE (encoder=640, decoder=1920)
- Use decode_chunk_size for continuation audio trimming
- Make VAD silence trimming optional (default off, matching upstream)
- Add input type validation for text parameter

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
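The two chunk sizes line up with the asymmetric 16kHz encode / 48kHz decode design. The arithmetic below is a sketch that assumes both values are counted in samples per chunk; `trim_to_chunk_multiple` is a hypothetical helper showing one plausible reading of "trimming", not the code in this PR.

```python
ENCODE_SR = 16_000     # encoder input sample rate (from the PR description)
DECODE_SR = 48_000     # decoder output sample rate (from the PR description)
ENCODER_CHUNK = 640    # encoder chunk size (from the commit message)
DECODER_CHUNK = 1920   # decoder chunk size (from the commit message)


def chunk_duration_ms(chunk_size: int, sample_rate: int) -> float:
    """Duration of one chunk in milliseconds, assuming chunk_size is in samples."""
    return 1000 * chunk_size / sample_rate


def trim_to_chunk_multiple(num_samples: int, chunk_size: int) -> int:
    """Hypothetical helper: largest multiple of chunk_size that fits in num_samples."""
    return (num_samples // chunk_size) * chunk_size


encoder_ms = chunk_duration_ms(ENCODER_CHUNK, ENCODE_SR)
decoder_ms = chunk_duration_ms(DECODER_CHUNK, DECODE_SR)
print(encoder_ms, decoder_ms)                 # 40.0 40.0
print(DECODER_CHUNK // ENCODER_CHUNK)         # 3, same as 48000 // 16000
print(trim_to_chunk_multiple(50_000, DECODER_CHUNK))  # 49920
```

Under that assumption both sides cover the same 40 ms of audio per chunk, so trimming 48kHz audio with the encoder's 640 would cut at the wrong boundaries.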
- Fix test_voice_design_prefix to mock tokenize+convert_tokens_to_ids instead of encode (matching the BOS token fix)
- Add VoxCPM2 to TTS model comparison table in docs

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
|
Hi @acul3 — nice work getting VoxCPM2 into mlx-audio! I've been independently porting VoxCPM2 to Swift/MLX and spent significant time comparing against the official voxcpm pip package (v2.0.2) source code. I found several issues that may explain the quality problems reported here. Happy to help contribute fixes: Confirmed issues from official source (voxcpm/model/voxcpm2.py, modules/locdit/unified_cfm.py, modules/minicpm4/model.py):
Would you like me to submit fixes for any of these, or would you prefer I open a companion PR? I have all of these verified against the official source with line references. |
|
Hi @xocialize, thanks for reporting. I am open to either option: you can submit fixes for any of these in this branch, or open a Companion PR. Just let me know so I can test. Thank you. |
|
Hi @acul3 — thanks for the open offer! Quick context on where my findings come from: I ported VoxCPM2 to Swift/MLX in parallel against the official Line numbers below reference:
Bugs that affect output quality
1.
|
@xocialize Do you want to post a new PR with your changes included? I think it would be quicker than going back and forth here. |
|
Thanks @lucasnewman — yes, happy to. I'll scope a Companion PR covering items 1-4 (the surgical conditioning-path fixes: Would you prefer it targeting |
|
@xocialize Do you have an earlier patch file yet? We can test it. |
|
@krmao — here are the line-level changes for items 1-3 against Important caveat: these diffs are translated from my Swift reference where I've validated the fixes against real Item 1:
|
|
@lucasnewman following up on this — to clarify before I open the Companion PR, would you prefer it to target |
|
@xocialize A standalone PR against main with all changes included is ideal for getting this in, thanks! |
|
@lucasnewman @acul3 — circling back: pulled the current PR head locally and ran zero-shot, voice design, and reference-audio cloning end-to-end against The fixes from my Apr 17 comment (items 1-7) are all in — Nice work @acul3 — the integration looks ready to merge from where I'm sitting. The standalone PR I offered isn't needed since you've already folded everything in. Happy to post sample WAVs from the three modes if useful for the merge discussion. |
Context
VoxCPM2 is OpenBMB's latest 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, voice design, and 30-language support. This PR adds a full MLX implementation.
Description
Complete MLX port of openbmb/VoxCPM2, building on the existing VoxCPM v1 implementation but as a separate module due to substantial architectural differences.
Key architecture changes from v1:
- SampleRateConditionLayer for asymmetric 16kHz encode / 48kHz decode
- fusion_concat_proj replaces element-wise addition for residual LM input
- VoxCPMLocDiTV2 with multi-token mu ((B, 2*H) → 2 start tokens)
- kv_channels and no_rope support for residual LM

Generation Modes
All 5 modes from the original VoxCPM2 repo are supported:
- Zero-shot: --text "Hello"
- Voice design: --text "Hello" --instruct "A warm female voice"
- Voice cloning: --text "Hello" --ref_audio speaker.wav
- Continuation: prompt_text + prompt_audio + text
- Combined: ref_audio + prompt_text + prompt_audio + text

Continuation & Combined modes are designed for long-form speech generation (audiobooks, podcasts): each chunk picks up naturally from the previous one while keeping the voice consistent.
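The input combinations listed above can be told apart from which arguments are present. The function below is a minimal sketch of that dispatch; `generation_mode`, the returned labels, and the precedence order are illustrative assumptions, not the PR's actual code path.

```python
from typing import Optional


def generation_mode(
    text: str,
    instruct: Optional[str] = None,
    ref_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    prompt_audio: Optional[str] = None,
) -> str:
    """Hypothetical helper: name the mode implied by the supplied inputs."""
    if not isinstance(text, str) or not text:
        raise ValueError("text must be a non-empty string")
    # Assumption: a continuation prompt needs both its transcript and its audio.
    if (prompt_text is None) != (prompt_audio is None):
        raise ValueError("prompt_text and prompt_audio must be given together")

    has_prompt = prompt_text is not None
    if has_prompt and ref_audio is not None:
        return "combined"
    if has_prompt:
        return "continuation"
    if ref_audio is not None:
        return "voice_cloning"
    if instruct is not None:
        return "voice_design"
    return "zero_shot"


print(generation_mode("Hello"))                                   # zero_shot
print(generation_mode("Hello", instruct="A warm female voice"))   # voice_design
print(generation_mode("Hello", ref_audio="speaker.wav"))          # voice_cloning
```

Checking the prompt pair first keeps the two long-form modes from being mistaken for plain cloning when `ref_audio` is also supplied.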
Changes in the codebase
New files (mlx_audio/tts/models/voxcpm2/):
- config.py
- minicpm.py (kv_channels + no_rope)
- encoder.py
- dit.py (mean_mode)
- audio_vae.py
- voxcpm2.py
- __init__.py

Modified:
- mlx_audio/tts/utils.py: added "voxcpm2": "voxcpm2" to MODEL_REMAPPING

Tests: 15 unit tests added to mlx_audio/tts/tests/test_models.py covering config, registration, AudioVAE, MiniCPM, DiT, and full Model.

Quick examples
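As a minimal sketch of how the CLI modes might be invoked, the snippet below assembles command lines without running them. The `--text`, `--instruct`, and `--ref_audio` flags and the `mlx-community/VoxCPM2-*` repo IDs come from this PR; the `python -m mlx_audio.tts.generate` entry point and the `--model` flag are assumptions about the surrounding package, and `build_command` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def build_command(
    repo_id: str,
    text: str,
    instruct: Optional[str] = None,
    ref_audio: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argument list for the TTS CLI."""
    # Assumption: the CLI is exposed as `python -m mlx_audio.tts.generate`.
    cmd = ["python", "-m", "mlx_audio.tts.generate", "--model", repo_id, "--text", text]
    if instruct is not None:
        cmd += ["--instruct", instruct]    # voice design
    if ref_audio is not None:
        cmd += ["--ref_audio", ref_audio]  # voice cloning
    return cmd


zero_shot = build_command("mlx-community/VoxCPM2-4bit", "Hello")
voice_design = build_command(
    "mlx-community/VoxCPM2-8bit", "Hello", instruct="A warm female voice"
)
cloning = build_command("mlx-community/VoxCPM2-bf16", "Hello", ref_audio="speaker.wav")

# Print shell-safe command lines for copy and paste.
for cmd in (zero_shot, voice_design, cloning):
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; building a list rather than a string avoids quoting problems with instructions that contain spaces.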
Performance (Apple Silicon)
Changes outside the codebase
mlx-community/VoxCPM2-{bf16,8bit,4bit}
Checklist