
feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)#641

Open
acul3 wants to merge 6 commits into Blaizzy:main from acul3:feat/add-voxcpm2

Conversation


acul3 commented Apr 8, 2026

Context

VoxCPM2 is OpenBMB's latest 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, voice design, and 30-language support. This PR adds a full MLX implementation.

Description

Complete MLX port of openbmb/VoxCPM2, building on the existing VoxCPM v1 implementation but as a separate module due to substantial architectural differences.

Key architecture changes from v1:

  • AudioVAE V2 with SampleRateConditionLayer for asymmetric 16kHz encode / 48kHz decode
  • fusion_concat_proj replaces element-wise addition for residual LM input
  • VoxCPMLocDiTV2 with multi-token mu ((B, 2*H) → 2 start tokens) — see the sketch after this list
  • MiniCPM backbone with kv_channels and no_rope support for residual LM
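
For illustration, a minimal sketch of the fusion change (stand-in names and shapes per the description above, not the PR's exact code):

```python
# Illustrative only: lm_h / res_h stand in for the lm_to_dit_proj and
# res_to_dit_proj outputs; identifiers are not the PR's.
import mlx.core as mx

B, H = 1, 1024
lm_h = mx.zeros((B, H))   # projected hidden state from the main LM
res_h = mx.zeros((B, H))  # projected hidden state from the residual LM

mu_v1 = lm_h + res_h                            # v1: element-wise sum -> one token (B, H)
mu_v2 = mx.concatenate([lm_h, res_h], axis=-1)  # v2: fusion concat -> (B, 2*H)
mu_tokens = mu_v2.reshape(B, 2, H)              # two start tokens for VoxCPMLocDiTV2
```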

Generation Modes

All 5 modes from the original VoxCPM2 repo are supported:

| Mode | Description | Usage |
| --- | --- | --- |
| Zero-shot | Random voice, text only | `--text "Hello"` |
| Voice design | Create voice from text description | `--text "Hello" --instruct "A warm female voice"` |
| Reference cloning | Clone voice from audio sample | `--text "Hello" --ref_audio speaker.wav` |
| Continuation | Continue from previous audio (seamless transitions for long-form) | `prompt_text + prompt_audio + text` |
| Combined | Reference voice + continuation (clone voice AND continue from prompt) | `ref_audio + prompt_text + prompt_audio + text` |

Continuation & Combined modes are designed for long-form speech generation (audiobooks, podcasts) — each chunk picks up naturally from the previous one while keeping the voice consistent.

Changes in the codebase

New files (mlx_audio/tts/models/voxcpm2/):

| File | Lines | Description |
| --- | --- | --- |
| `config.py` | 128 | ModelArgs with v2 defaults, AudioVAEV2Config |
| `minicpm.py` | 255 | MiniCPM backbone + kv_channels + no_rope |
| `encoder.py` | 35 | VoxCPMLocEnc (local feature encoder) |
| `dit.py` | 183 | VoxCPMLocDiTV2 + UnifiedCFM with mean_mode |
| `audio_vae.py` | 617 | AudioVAEV2 with SR conditioning + sanitize |
| `voxcpm2.py` | 640 | Main Model with all gen modes + CLI compat |
| `__init__.py` | 4 | Exports |

Modified: mlx_audio/tts/utils.py — added "voxcpm2": "voxcpm2" to MODEL_REMAPPING

Tests: 15 unit tests added to mlx_audio/tts/tests/test_models.py covering config, registration, AudioVAE, MiniCPM, DiT, and full Model.

Quick examples

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/VoxCPM2-8bit")

# Zero-shot
for result in model.generate(text="Hello world"):
    print(result.audio_duration)

# Voice design
for result in model.generate(text="Hello", instruct="A young woman, gentle voice"):
    print(result.audio_duration)

# Voice cloning
for result in model.generate(text="Hello", ref_audio="speaker.wav"):
    print(result.audio_duration)

# Continuation (long-form)
for result in model.generate(
    text=" and this continues seamlessly.",
    prompt_text="First sentence here",
    prompt_audio="first.wav",
):
    print(result.audio_duration)
```

CLI:

```bash
python -m mlx_audio.tts.generate \
  --model mlx-community/VoxCPM2-8bit \
  --text "Hello world" \
  --instruct "A young woman, gentle voice" \
  --verbose
```

Performance (Apple Silicon)

| Variant | Size | RTF (7ts) |
| --- | --- | --- |
| bf16 | 4.96 GB | 0.48x |
| 8-bit | 3.23 GB | 0.85x |
| 4-bit | 2.30 GB | 0.90x |
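
For context on the metric: judging from the verbose logs in this thread, the reported RTF works out to audio duration divided by processing time, so higher is faster:

```python
# Inferred from the verbose output below (e.g. 1.12 s of audio in 0.72 s):
audio_duration = 1.12   # seconds of audio produced
processing_time = 0.72  # wall-clock seconds spent generating
rtf = audio_duration / processing_time
print(f"{rtf:.2f}x")    # 1.56x -- values above 1.0x are faster than real time
```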

Changes outside the codebase

  • Converted weights uploaded to HuggingFace: mlx-community/VoxCPM2-{bf16,8bit,4bit}

Checklist

  • All 5 generation modes tested with real weights
  • Voice design tested with various descriptions
  • Voice cloning tested with reference audio
  • Continuation & combined modes tested
  • CLI integration tested (zero-shot, --instruct, --ref_audio)
  • bfloat16 weights verified (float16 causes artifacts)
  • Quantization tested (4-bit, 8-bit)
  • Unit tests (15 tests in test_models.py)

acul3 force-pushed the feat/add-voxcpm2 branch 2 times, most recently from cfeb6fd to 9423076 on April 8, 2026 03:00
acul3 changed the title from "Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" to "Feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" on Apr 8, 2026
acul3 changed the title from "Feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" to "feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" on Apr 8, 2026

gianpaj commented Apr 8, 2026

Am I doing something wrong? The 4-bit model sounds broken.

```
❯ l ~/.lmstudio/models/mlx-community/VoxCPM2-4bit
.rw-r--r--@ 5.3k gianpaj  8 Apr 23:01 config.json
.rw-r--r--@ 2.3G gianpaj  8 Apr 23:04 model.safetensors
.rw-r--r--@ 1.6k gianpaj  8 Apr 23:01 special_tokens_map.json
.rw-r--r--@ 522k gianpaj  8 Apr 23:01 test_en.wav
.rw-r--r--@ 614k gianpaj  8 Apr 23:01 test_id.wav
.rw-r--r--@ 3.7M gianpaj  8 Apr 23:01 tokenizer.json
.rw-r--r--@ 5.0k gianpaj  8 Apr 23:01 tokenizer_config.json
❯ uv run python -m mlx_audio.tts.generate --model ~/.lmstudio/models/mlx-community/VoxCPM2-4bit --text "Hello world" \
  --verbose
You are using a model of type `voxcpm2` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Text: Hello world
Voice: None
Speed: 1.0x
Language: en
✅ Audio successfully generated and saving as: audio_000.wav
==========
Duration:              00:00:01.600
Samples/sec:           27547.6
Prompt:                3 tokens, 1.1 tokens-per-sec
Audio:                 76808 samples, 27547.6 samples-per-sec
Real-time factor:      0.57x
Processing time:       2.79s
Peak memory usage:     3.23GB
```


```
❯ cat pyproject.toml
[project]
name = "voxcpm2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "mlx-audio @ git+https://github.com/acul3/mlx-audio.git@feat/add-voxcpm2",
]
```

audio_000.mp3

```
uv run python -m mlx_audio.tts.generate --model ~/.lmstudio/models/mlx-community/VoxCPM2-4bit --text "Hello world" --instruct "A young woman, gentle voice" \
  --verbose

You are using a model of type `voxcpm2` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Instruct: A young woman, gentle voice
Text: Hello world
Voice: None
Speed: 1.0x
Language: en
✅ Audio successfully generated and saving as: audio_000.wav
==========
Duration:              00:00:02.240
Samples/sec:           30523.6
Prompt:                11 tokens, 3.1 tokens-per-sec
Audio:                 107528 samples, 30523.6 samples-per-sec
Real-time factor:      0.64x
Processing time:       3.52s
Peak memory usage:     3.24G
```

audio_000woman.mp3

```
uv run python -m mlx_audio.tts.generate --model ~/.lmstudio/models/mlx-community/VoxCPM2-4bit --text "Hello world" --instruct "A young woman, gentle voice" --voice "female_1" \
  --verbose
```

audio_000female_1.mp3

Audio files converted with `ffmpeg -i audio_000.wav audio_000.mp3` to upload them on GitHub.

@lucasnewman (Collaborator)

@acul3 Can you add a README for the model to document how to use it and what the supported repo ids are? See the other models for examples. Otherwise this looks good to me.

@lucasnewman (Collaborator)

Please also run `pre-commit run --all` to fix the formatting.


gianpaj commented Apr 10, 2026

@lucasnewman, did you hear the samples I uploaded? Something isn't right.
Did you try to generate audio with one of the quantized models?


acul3 commented Apr 11, 2026

@gianpaj Thanks for reporting! I've pushed a fix that addresses the quality issue.

Root cause: tokenizer.encode() was adding a BOS token that the original PyTorch model doesn't use. This shifted all token positions by 1, causing the LM to produce degraded output. I also fixed a warmup default that could cut off the beginning of the audio.

Can you try again?
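
For reference, a minimal illustration of the difference, assuming a standard Hugging Face tokenizer (the repo id below is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/VoxCPM2")  # placeholder repo id

text = "Hello world"
with_bos = tok.encode(text)  # encode() may prepend BOS/special tokens by default
no_bos = tok.convert_tokens_to_ids(tok.tokenize(text))  # matches the PyTorch reference
# If the tokenizer adds BOS, with_bos == [bos_id] + no_bos -> every position shifts by 1
```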


acul3 commented Apr 11, 2026

@lucasnewman sure, let me add it


acul3 commented Apr 11, 2026

cmd1_000.wav
cmd2_000.wav
cmd3_000.wav

using the same config @gianpaj


krmao commented Apr 11, 2026

I tested the unmerged feat/add-voxcpm2 branch locally on Apple Silicon and
confirmed that it fixes the model registration/loading problem, but I still see
reliability issues for short Chinese TTS generation in voice design mode.

What I tested:

  • Model: mlx-community/VoxCPM2-bf16
  • Runtime: PR branch acul3/mlx-audio@feat/add-voxcpm2
  • Invocation style: non-streaming only, using the documented Python API model.generate(...)
  • Mode: instruct + text voice design
  • I also verified that the PR already includes the BOS-token fix and the
    warmup_patches=0 change

What works:

  • The model now loads successfully as a native voxcpm2 implementation
  • Generation completes without crashing
  • The output is valid audio and sounds generally speech-like

What still does not work well:

  • For short Chinese prompts, content fidelity is unstable
  • The generated speech often drifts away from the requested text
  • In some cases the output appears to switch language or produce unrelated
    content
  • This is not just an audio-quality issue; it is a text-following / content-
    retention issue

What I tried that did not solve it:

  • Using the latest PR code with the BOS-token fix
  • Using the corrected warmup_patches default
  • Using non-streaming generation only
  • Using bf16 rather than quantized variants
  • Using voice-design style prompts exactly as documented: text=..., instruct=...

Why I think this matters:

  • The loading issue appears fixed, but short-form multilingual text conditioning,
    especially Chinese short prompts, still seems unreliable in voice design mode
  • This suggests the remaining problem is not only registration/tokenization, but
    likely generation behavior or conditioning alignment in the MLX implementation

My current hypothesis:

  • The PR fixes a real bug, but there may still be a mismatch between the MLX
    implementation and the original PyTorch behavior for short-form content
    retention
  • The issue seems more visible on very short prompts than on ordinary demo
    sentences


acul3 commented Apr 11, 2026

hi @krmao, do you have a sample prompt that I can test?

I will try to match it with the original repo.


krmao commented Apr 11, 2026

> hi @krmao, do you have a sample prompt that I can test?
>
> I will try to match it with the original repo.

@acul3 Hi, yes — I verified this with Whisper, not only by listening.

Here is one clean anonymized repro I tested locally on Apple Silicon using the PR
branch, non-streaming generation:

  • text: 小雨来了 ("the light rain has come")
  • instruct: 二十多岁的年轻女性,声音清脆自然,普通话标准,轻声细语,像在安静的卧室里 (a young woman in her twenties, clear and natural voice, standard Mandarin, soft-spoken, as if in a quiet bedroom)

Validation method:

  1. generate audio with the PR branch
  2. transcribe the generated wav with Whisper (whisper-large-v3)
  3. compare transcription vs expected text

Expected:

  • 小雨来了

Actual Whisper result from my local run:

  • So yeah, okay?

Whisper also detected the language as en for this sample.

So the issue I’m reporting is text fidelity, not only subjective audio quality.
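
For reproducibility, the round-trip check I ran looks roughly like this (paraphrased; assumes the wav came from the CLI/API run above and the openai-whisper package is installed):

```python
import whisper  # openai-whisper

expected = "小雨来了"
asr = whisper.load_model("large-v3")
result = asr.transcribe("audio_000.wav")
print("detected language:", result["language"])  # came back as 'en' in my run
print("transcription:", result["text"].strip())  # compare against expected text
```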


su3 commented Apr 14, 2026

Unable to generate Chinese audio. The audio produced is meaningless noise with no relation to the text. Several other languages (Japanese, Korean, English) were also tested and worked correctly. One additional issue: the --speed parameter has no effect.

```
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-8bit \
  --text "你好,世界!" --lang_code zh --verbose
You are using a model of type `voxcpm2` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Text: 你好,世界!
Voice: None
Speed: 1.0x
Language: zh
✅ Audio successfully generated and saving as: audio_000.wav
==========
Duration:              00:00:01.120
Samples/sec:           74953.9
Prompt:                5 tokens, 7.0 tokens-per-sec
Audio:                 53768 samples, 74953.9 samples-per-sec
Real-time factor:      1.56x
Processing time:       0.72s
Peak memory usage:     4.16GB
```

Samsul Rahmadani and others added 6 commits April 16, 2026 10:51
Add MLX implementation of OpenBMB's VoxCPM2 with support for:
- Zero-shot TTS, voice design, voice cloning, and continuation modes
- 48kHz studio-quality audio via asymmetric AudioVAE (16kHz encode / 48kHz decode)
- Sample-rate conditioned decoder with SampleRateConditionLayer
- MiniCPM4 backbone with kv_channels and no_rope support
- VoxCPMLocDiTV2 diffusion transformer with multi-token mu
- Fusion concat projection for residual LM input
- Configurable warmup patches to eliminate onset artifacts
- CLI integration via --instruct (voice design), --ref_audio (cloning)
- 4-bit/8-bit quantization support (1.17x realtime on Apple Silicon)
- 15 unit tests in test_models.py

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Use tokenize+convert_tokens_to_ids instead of encode() to match
  PyTorch behavior (encode() adds BOS token that shifts all positions)
- Reduce warmup_patches to 1 when instruct is provided
- Enforce minimum cfg_value=2.0 for CLI compatibility

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
The BOS token fix resolved the onset artifacts, so warmup patches
are no longer needed by default.

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Add model README with usage docs, supported repo IDs, parameters,
  and architecture overview
- Run pre-commit (black + isort) to fix formatting

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Add decode_chunk_size to AudioVAE (encoder=640, decoder=1920)
- Use decode_chunk_size for continuation audio trimming
- Make VAD silence trimming optional (default off, matching upstream)
- Add input type validation for text parameter

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Fix test_voice_design_prefix to mock tokenize+convert_tokens_to_ids
  instead of encode (matching the BOS token fix)
- Add VoxCPM2 to TTS model comparison table in docs

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
acul3 force-pushed the feat/add-voxcpm2 branch from 66c522a to 17ac926 on April 16, 2026 03:51
@xocialize

Hi @acul3 — nice work getting VoxCPM2 into mlx-audio! I've been independently porting VoxCPM2 to Swift/MLX and spent significant time comparing against the official voxcpm pip package (v2.0.2) source code. I found several issues that may explain the quality problems reported here. Happy to help contribute fixes:

Confirmed issues from official source (voxcpm/model/voxcpm2.py, modules/locdit/unified_cfm.py, modules/minicpm4/model.py):

  1. kv_channels support — The bf16 model uses kv_channels=128 for attention head dimensions (not hidden_size // num_heads). This affects Q/K/V projection sizes for LocEnc and LocDiT.
  2. scale_emb gating — Official code: scale_emb = 1.0 when use_mup=False. The config has use_mup=False and scale_emb=12, but the ×12 multiplication should NOT be applied. See voxcpm2.py line 1003-1006.
  3. residual_lm_no_rope=True — The RALM must not apply RoPE. Config field residual_lm_no_rope is True for VoxCPM2.
  4. SampleRateConditionLayer — The AudioVAE decoder has 6 FiLM conditioning layers (sr_cond_layers) that modulate based on target sample rate. Skipping them produces degraded audio since the decoder weights were trained with them active.
  5. bf16 on Apple Silicon — Upstream PR #263 confirms bf16 causes glitched audio on MPS/Metal. Fix: cast to float32 for inference.
  6. dit_hidden is concatenation, not addition — dit_hidden = cat(lm_to_dit_proj, res_to_dit_proj) gives (B, 2*H_dit), reshaped to 2 mu prefix tokens in LocDiT. See local_dit_v2.py line 109-110.

Would you like me to submit fixes for any of these, or would you prefer I open a companion PR? I have all of these verified against the official source with line references.


acul3 commented Apr 17, 2026

hi @xocialize, thanks for reporting.

I'm open to you submitting fixes for any of these in this branch, or opening a companion PR!

Just let me know so I can test. Thank you.

@xocialize

Hi @acul3 — thanks for the open offer!

Quick context on where my findings come from: I ported VoxCPM2 to Swift/MLX in parallel against the official voxcpm pip package v2.0.2 as my reference. The bugs below surfaced as I diffed your PR against that source. I haven't written Python patches myself — my fixes are in Swift — so I'm sharing this as a bug list with receipts rather than a PR. Happy to dig into any of them if you want more detail.

Line numbers below reference:


Bugs that affect output quality

1. dit_hidden is addition, should be concatenation

voxcpm.py:381:

```python
dit_h = dit_h1 + dit_h2  # (1, H)
```

Official voxcpm/model/voxcpm2.py _inference() (~line 1049):

```python
dit_hidden_1 = self.lm_to_dit_proj(lm_hidden)  # [b, h_dit]
dit_hidden_2 = self.res_to_dit_proj(residual_hidden)  # [b, h_dit]
dit_hidden = torch.cat((dit_hidden_1, dit_hidden_2), dim=-1)
```

The official concatenates along the last axis to get (B, 2*H_dit), which the LocDiT then reshapes to two mu prefix tokens (B, 2, H_dit). Addition collapses this into a single mu token and drops half the LocDiT's input dimensionality. In my Swift port this was the single biggest quality change after fixing — sum → concat visibly improved prosody and intelligibility.

2. scale_emb is inverted

voxcpm.py:273:

```python
scale_emb = (
    self.args.lm_config.scale_emb if not self.args.lm_config.use_mup else 1.0
)
```

Official voxcpm/model/voxcpm2.py:

```python
scale_emb = self.config.lm_config.scale_emb if self.config.lm_config.use_mup else 1.0
```

The condition is flipped. The default config has use_mup=False, so scale_emb gets applied to the text embeddings here (yours) vs. 1.0 in the official. For the VoxCPM2 shipped configs scale_emb=12, so text embeddings end up ~12× too large → severely degraded output.

The same issue exists in minicpm.py lines 188-198, where scale_depth is applied unconditionally, but the official gates it on use_mup. Both should be gated consistently.

3. Stop predictor threshold is hardcoded, should be parameterized with min_len=2

voxcpm.py:400:

```python
if i > 5 and stop_flag == 1:
    break
```

Official voxcpm/model/voxcpm2.py:

```python
if i > min_len and stop_flag == 1:
    break
```

where min_len defaults to 2 (not 5). This makes short utterances (2-5 patches) unable to stop at their natural endpoint — they either keep generating until max_len or produce clipped output depending on how the badcase retry interacts. Should be a function arg with default 2.

4. Voice cloning uses VoxCPM 1.x continuation pattern, not VoxCPM2 reference-only

voxcpm.py:282-295:

```python
if ref_audio is not None and ref_text is not None:
    combined_text = ref_text + text
    input_ids = self.tokenizer.encode(combined_text)
    ...
    text_token = mx.concatenate([input_ids, text_pad_token])
```

This is the VoxCPM 1.x "continuation" layout. VoxCPM2's reference-only mode uses a different pattern via _make_ref_prefix in voxcpm/model/voxcpm2.py:

```
tokens:  [103,         zeros(refLen),  104         ]
feats:   [zero_patch,  ref_feat...,    zero_patch  ]
t_mask:  [1,           0...0,          1           ]  # 103/104 are text-masked
a_mask:  [0,           1...1,          0           ]  # ref audio is audio-masked
```

Full sequence: [103, ref_audio_patches, 104, text_tokens, 101]. Special tokens: 101=<audio_start>, 103=<ref_start>, 104=<ref_end>.

The official _generate has 4 modes selected by reference_wav_path + prompt_wav_path:

  • [text, 101] — zero-shot
  • [103, ref, 104, text, 101] — reference-only
  • [text, 101, prompt] — continuation (VoxCPM 1.x-style, what the PR currently implements)
  • [103, ref, 104, text, 101, prompt] — combined

The PR's ref_audio path should switch to the reference-only layout as its primary mode, and ideally expose the other three as additional parameters.
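
A hypothetical sketch of assembling that reference-only layout (helper name and shapes are mine, not the official _make_ref_prefix signature):

```python
import mlx.core as mx

REF_START, REF_END, AUDIO_START = 103, 104, 101  # special token ids per above

def make_ref_prefix(ref_feat: mx.array):  # ref_feat: (ref_len, patch_dim)
    ref_len, patch_dim = ref_feat.shape
    tokens = mx.array([REF_START] + [0] * ref_len + [REF_END])
    zero_patch = mx.zeros((1, patch_dim))
    feats = mx.concatenate([zero_patch, ref_feat, zero_patch], axis=0)
    text_mask = mx.array([1] + [0] * ref_len + [1])   # 103/104 attend as text
    audio_mask = mx.array([0] + [1] * ref_len + [0])  # ref patches attend as audio
    return tokens, feats, text_mask, audio_mask
```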


Missing features (affect output quality)

5. SampleRateConditionLayer and 48 kHz output

config.py:60:

```python
sample_rate: int = 44100
```

No out_sample_rate, no SampleRateConditionLayer. Official VoxCPM2 has:

  • Encoder sample rate: 16 kHz (input)
  • Output sample rate: 48 kHz (decoder upsamples 1920×)
  • Per-decoder-block SampleRateConditionLayer: an Embedding(num_buckets, channels) applied as FiLM before each decoder block, with sr_bin_boundaries=[20000, 30000, 40000] bucketizing the requested output rate into a 4-way one-hot

See official voxcpm/modules/audiovae/audio_vae_v2.py:218-265 (SampleRateConditionLayer) and the forward() that applies sr_cond_layer(x, sr_cond) before each decoder block. Without this, the decoder produces at input rate and misses the prosodic conditioning the FiLM layer provides.
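
To illustrate the pattern (my names, not the official module's), a minimal MLX sketch of sample-rate FiLM conditioning:

```python
import mlx.core as mx
import mlx.nn as nn

class SampleRateFiLM(nn.Module):
    """Bucketize the target rate, embed the bucket, apply scale/shift (FiLM)."""

    def __init__(self, channels: int, num_buckets: int = 4):
        super().__init__()
        self.emb = nn.Embedding(num_buckets, 2 * channels)

    def __call__(self, x: mx.array, sample_rate: int) -> mx.array:
        # sr_bin_boundaries = [20000, 30000, 40000] -> buckets 0..3
        bucket = sum(sample_rate > b for b in (20000, 30000, 40000))
        scale, shift = mx.split(self.emb(mx.array([bucket])), 2, axis=-1)
        return x * (1.0 + scale) + shift  # x assumed channels-last: (B, T, C)
```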

6. kv_channels not honored in LM/RALM/DiT head dim

The official MiniCPM-4 config exposes kv_channels as the head dim (128 in shipped configs), decoupled from hidden_size / num_heads. PR #641 computes head_dim inline as hidden_size / num_heads which gives the wrong value for DiT (hidden=1024, heads=16 → 64 ≠ 128 shipped). This produces the right shape but wrong semantics if a user ever loads a config where kv_channels ≠ hidden_size / num_heads.

7. RALM no_rope not honored

Official config.py has residual_lm_no_rope: bool = True. The RALM (residual LM) must disable RoPE. I don't see a no_rope flag plumbed through minicpm.py in the PR. If RoPE is being applied to RALM, the residual branch is seeing an extra positional encoding signal it wasn't trained on.
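
A sketch of how both items 6 and 7 could be plumbed through an attention module (illustrative, not the PR's code):

```python
import mlx.nn as nn

class SketchAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int,
                 kv_channels: int | None = None, no_rope: bool = False):
        super().__init__()
        # head_dim honors kv_channels when set (128 in shipped configs),
        # falling back to hidden_size // num_heads otherwise
        self.head_dim = kv_channels or hidden_size // num_heads
        self.no_rope = no_rope  # True for the RALM (residual_lm_no_rope=True)
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.rope = nn.RoPE(self.head_dim)

    def position_encode(self, q, offset: int = 0):
        # The RALM branch must skip RoPE entirely
        return q if self.no_rope else self.rope(q, offset=offset)
```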


What's correct (for completeness)

Just to be clear, a lot of your port is right. For example:

  • Euler step direction: dit.py:175 x - dt * dphi_dt with t_span = linspace(1, 0) — ✓ matches official exactly
  • CFG-zero-star formula: dit.py:170 — ✓ matches
  • Sway sampling with coef=1.0: dit.py:193 — ✓ matches
  • zero_init_steps formula: dit.py:117 max(1, int(len(t_span) * 0.04)) — ✓ matches
  • mean_mode=False → dt zeroed for uncond: dit.py:140 — ✓ matches (and your FIXED comment is accurate)

So the flow-matching core is solid. It's the conditioning path (embeddings, cloning, sample-rate conditioning, stop predictor) that has the divergence.


Reference

  • Official source I'm comparing against: pip install voxcpm==2.0.2 (voxcpm/model/voxcpm2.py, voxcpm/modules/locdit/unified_cfm.py, voxcpm/modules/locdit/local_dit_v2.py, voxcpm/modules/minicpm4/model.py, voxcpm/modules/audiovae/audio_vae_v2.py)
  • A Swift/MLX port applying these fixes achieves intelligible English speech (including voice design and reference-audio cloning) against the real mlx-community/VoxCPM2-bf16 weights on M4 Pro, passing end-to-end generation tests through the full AudioVAE.

I'm happy to go deeper on any single item, test candidate patches against my Swift reference for parity, or open a companion PR with one or two targeted fixes if that helps you make progress without overloading this PR. Whatever works for your merge workflow.

@lucasnewman (Collaborator)

@xocialize Do you want to post a new PR with your changes included? I think it would be quicker than going back and forth here.

@xocialize

Thanks @lucasnewman — yes, happy to. I'll scope a Companion PR covering items 1-4 (the surgical conditioning-path fixes: dit_hidden concat, scale_emb flip, stop predictor min_len, and VoxCPM2-style _make_ref_prefix for voice cloning). Items 5-7 (SampleRateConditionLayer + 48 kHz, kv_channels, RALM no_rope) are bigger architectural additions — I'd suggest splitting those into a follow-up PR if this one lands cleanly.

Would you prefer it targeting Blaizzy/mlx-audio:main directly, or as a fork-and-PR against @acul3's feat/add-voxcpm2 branch?


krmao commented Apr 24, 2026

@xocialize do you have an early patch file yet? We can test.

@xocialize

@krmao — here are the line-level changes for items 1-3 against voxcpm.py in PR #641. Item 4 (voice cloning layout) is a bigger restructure that'll land with the formal PR, so I'm leaving it out here.

Important caveat: these diffs are translated from my Swift reference where I've validated the fixes against real mlx-community/VoxCPM2-bf16 output. I haven't yet rerun them through the Python MLX port myself — that's part of the formal PR validation pass. So treat this as "best translation from a working Swift implementation," not "verified Python patch." If anything looks wrong when you apply it, please flag and I'll iterate.

Item 1: dit_hidden addition → concatenation

mlx_audio/tts/models/voxcpm/voxcpm.py:381

```diff
             # DiT
             dit_h1 = self.lm_to_dit_proj(lm_hidden)
             dit_h2 = self.res_to_dit_proj(residual_hidden)
-            dit_h = dit_h1 + dit_h2  # (1, H)
+            dit_h = mx.concatenate([dit_h1, dit_h2], axis=-1)  # (1, 2*H_dit)
```

Downstream feat_decoder.sample(mu=dit_h, ...) should already accept the wider input — LocDiT reshapes mu to (B, -1, H_dit) to get the two prefix tokens. If it errors on shape, that's a sign the LocDiT's mu reshape was assuming a single token; check dit.py for mu.reshape(B, -1, H) or view(B, -1, H).

Item 2: scale_emb condition flipped

mlx_audio/tts/models/voxcpm/voxcpm.py:273

```diff
         # scale_emb
         scale_emb = (
-            self.args.lm_config.scale_emb if not self.args.lm_config.use_mup else 1.0
+            self.args.lm_config.scale_emb if self.args.lm_config.use_mup else 1.0
         )
```

Same fix in mlx_audio/tts/models/voxcpm/minicpm.py for scale_depth (lines ~188-198) — the official gates scale_depth on use_mup too. The current code applies it unconditionally; should be:

```diff
-        x = r + h * (self.scale_depth / math.sqrt(self.num_hidden_layers))
+        if self.use_mup:
+            x = r + h * (self.scale_depth / math.sqrt(self.num_hidden_layers))
+        else:
+            x = r + h
```

(in both places where the scale_depth line currently appears)

Item 3: stop predictor min_len parameter

mlx_audio/tts/models/voxcpm/voxcpm.py:256 (generate signature):

```diff
     def generate(
         self,
         text: str,
         max_tokens: int = 4096,
+        min_len: int = 2,
         ref_text: Optional[str] = None,
         ref_audio: Optional[str] = None,
         inference_timesteps: int = 10,
         cfg_value: float = 2.0,
         **kwargs,
     ):
```

mlx_audio/tts/models/voxcpm/voxcpm.py:400:

```diff
             stop_logits = self.stop_head(nn.silu(self.stop_proj(lm_hidden)))
             stop_flag = mx.argmax(stop_logits, axis=-1).item()
-            if i > 5 and stop_flag == 1:
+            if i > min_len and stop_flag == 1:
                 break
```

If you apply 1+2 together, you should hear the biggest perceptual improvement — those are the two that compound (text embedding scale × DiT input dimensionality). Item 3 mostly fixes premature truncation on short utterances. Curious to hear what your Whisper-based validation says.

@xocialize

@lucasnewman following up on this — to clarify before I open the Companion PR, would you prefer it to target Blaizzy/mlx-audio:main directly, or as a fork-and-PR against @acul3's feat/add-voxcpm2 branch? Either is fine on my end; just want to be sure I'm landing it where it's most useful for your merge workflow.

@lucasnewman (Collaborator)

@xocialize A standalone PR against main with all changes included is ideal for getting this in, thanks!

@xocialize

@lucasnewman @acul3 — circling back: pulled the current PR head locally and ran zero-shot, voice design, and reference-audio cloning end-to-end against mlx-community/VoxCPM2-bf16. All three modes produce intelligible 48 kHz output with appropriate voice characteristics — verified the voice-design preset against its description, and confirmed reference cloning produces output recognizably similar to the reference speaker on a same-language English clip.

The fixes from my Apr 17 comment (items 1-7) are all in — dit_h concatenation, scale_emb / scale_depth MuP gating, min_tokens parameterization, _make_ref_prefix for VoxCPM2-style reference layout, SampleRateConditionLayer in the AudioVAE decoder, kv_channels honoring, and RALM no_rope. They landed cleanly during the rename to voxcpm2/ (commits c7c79b8 through 6405793).

Nice work @acul3 — the integration looks ready to merge from where I'm sitting. The standalone PR I offered isn't needed since you've already folded everything in.

Happy to post sample WAVs from the three modes if useful for the merge discussion.
