
feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)#641

Open
acul3 wants to merge 6 commits into Blaizzy:main from acul3:feat/add-voxcpm2

Conversation


acul3 commented Apr 8, 2026

Context

VoxCPM2 is OpenBMB's latest 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, voice design, and 30-language support. This PR adds a full MLX implementation.

Description

Complete MLX port of openbmb/VoxCPM2, building on the existing VoxCPM v1 implementation but as a separate module due to substantial architectural differences.

Key architecture changes from v1:

  • AudioVAE V2 with SampleRateConditionLayer for asymmetric 16kHz encode / 48kHz decode
  • fusion_concat_proj replaces element-wise addition for residual LM input
  • VoxCPMLocDiTV2 with multi-token mu ((B, 2*H) → 2 start tokens) — see the sketch after this list
  • MiniCPM backbone with kv_channels and no_rope support for residual LM
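
For illustration, a minimal sketch of the fusion change (stand-in names and shapes per the description above, not the PR's exact code):

```python
# Illustrative only: lm_h / res_h stand in for the lm_to_dit_proj and
# res_to_dit_proj outputs; identifiers are not the PR's.
import mlx.core as mx

B, H = 1, 1024
lm_h = mx.zeros((B, H))   # projected hidden state from the main LM
res_h = mx.zeros((B, H))  # projected hidden state from the residual LM

mu_v1 = lm_h + res_h                            # v1: element-wise sum -> one token (B, H)
mu_v2 = mx.concatenate([lm_h, res_h], axis=-1)  # v2: fusion concat -> (B, 2*H)
mu_tokens = mu_v2.reshape(B, 2, H)              # two start tokens for VoxCPMLocDiTV2
```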

Generation Modes

All 5 modes from the original VoxCPM2 repo are supported:

| Mode | Description | Usage |
| --- | --- | --- |
| Zero-shot | Random voice, text only | `--text "Hello"` |
| Voice design | Create voice from text description | `--text "Hello" --instruct "A warm female voice"` |
| Reference cloning | Clone voice from audio sample | `--text "Hello" --ref_audio speaker.wav` |
| Continuation | Continue from previous audio (seamless transitions for long-form) | `prompt_text + prompt_audio + text` |
| Combined | Reference voice + continuation (clone voice AND continue from prompt) | `ref_audio + prompt_text + prompt_audio + text` |

Continuation & Combined modes are designed for long-form speech generation (audiobooks, podcasts) — each chunk picks up naturally from the previous one while keeping the voice consistent.

Changes in the codebase

New files (mlx_audio/tts/models/voxcpm2/):

| File | Lines | Description |
| --- | --- | --- |
| `config.py` | 128 | ModelArgs with v2 defaults, AudioVAEV2Config |
| `minicpm.py` | 255 | MiniCPM backbone + kv_channels + no_rope |
| `encoder.py` | 35 | VoxCPMLocEnc (local feature encoder) |
| `dit.py` | 183 | VoxCPMLocDiTV2 + UnifiedCFM with mean_mode |
| `audio_vae.py` | 617 | AudioVAEV2 with SR conditioning + sanitize |
| `voxcpm2.py` | 640 | Main Model with all gen modes + CLI compat |
| `__init__.py` | 4 | Exports |

Modified: mlx_audio/tts/utils.py — added "voxcpm2": "voxcpm2" to MODEL_REMAPPING

Tests: 15 unit tests added to mlx_audio/tts/tests/test_models.py covering config, registration, AudioVAE, MiniCPM, DiT, and full Model.

Quick examples

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/VoxCPM2-8bit")

# Zero-shot
for result in model.generate(text="Hello world"):
    print(result.audio_duration)

# Voice design
for result in model.generate(text="Hello", instruct="A young woman, gentle voice"):
    print(result.audio_duration)

# Voice cloning
for result in model.generate(text="Hello", ref_audio="speaker.wav"):
    print(result.audio_duration)

# Continuation (long-form)
for result in model.generate(
    text=" and this continues seamlessly.",
    prompt_text="First sentence here",
    prompt_audio="first.wav",
):
    print(result.audio_duration)
```

CLI:

```bash
python -m mlx_audio.tts.generate \
  --model mlx-community/VoxCPM2-8bit \
  --text "Hello world" \
  --instruct "A young woman, gentle voice" \
  --verbose
```

Performance (Apple Silicon)

| Variant | Size | RTF (7ts) |
| --- | --- | --- |
| bf16 | 4.96 GB | 0.48x |
| 8-bit | 3.23 GB | 0.85x |
| 4-bit | 2.30 GB | 0.90x |
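
For context on the metric: judging from the verbose logs in this thread, the reported RTF works out to audio duration divided by processing time, so higher is faster:

```python
# Inferred from the verbose output below (e.g. 1.12 s of audio in 0.72 s):
audio_duration = 1.12   # seconds of audio produced
processing_time = 0.72  # wall-clock seconds spent generating
rtf = audio_duration / processing_time
print(f"{rtf:.2f}x")    # 1.56x -- values above 1.0x are faster than real time
```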

Changes outside the codebase

  • Converted weights uploaded to HuggingFace: mlx-community/VoxCPM2-{bf16,8bit,4bit}

Checklist

  • All 5 generation modes tested with real weights
  • Voice design tested with various descriptions
  • Voice cloning tested with reference audio
  • Continuation & combined modes tested
  • CLI integration tested (zero-shot, --instruct, --ref_audio)
  • bfloat16 weights verified (float16 causes artifacts)
  • Quantization tested (4-bit, 8-bit)
  • Unit tests (15 tests in test_models.py)

acul3 force-pushed the feat/add-voxcpm2 branch 2 times, most recently from cfeb6fd to 9423076 on April 8, 2026 03:00
acul3 changed the title from "Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" to "Feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" on Apr 8, 2026
acul3 changed the title from "Feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" to "feat: Add VoxCPM2 TTS model (2B params, 48kHz, 30 languages)" on Apr 8, 2026

gianpaj commented Apr 8, 2026

Am I doing something wrong? The 4-bit model sounds broken.

```
❯ l ~/.lmstudio/models/mlx-community/VoxCPM2-4bit
.rw-r--r--@ 5.3k gianpaj  8 Apr 23:01 config.json
.rw-r--r--@ 2.3G gianpaj  8 Apr 23:04 model.safetensors
.rw-r--r--@ 1.6k gianpaj  8 Apr 23:01 special_tokens_map.json
.rw-r--r--@ 522k gianpaj  8 Apr 23:01 test_en.wav
.rw-r--r--@ 614k gianpaj  8 Apr 23:01 test_id.wav
.rw-r--r--@ 3.7M gianpaj  8 Apr 23:01 tokenizer.json
.rw-r--r--@ 5.0k gianpaj  8 Apr 23:01 tokenizer_config.json
❯ uv run python -m mlx_audio.tts.generate --model ~/.lmstudio/models/mlx-community/VoxCPM2-4bit --text "Hello world" \
  --verbose
You are using a model of type `voxcpm2` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Text: Hello world
Voice: None
Speed: 1.0x
Language: en
✅ Audio successfully generated and saving as: audio_000.wav
==========
Duration:              00:00:01.600
Samples/sec:           27547.6
Prompt:                3 tokens, 1.1 tokens-per-sec
Audio:                 76808 samples, 27547.6 samples-per-sec
Real-time factor:      0.57x
Processing time:       2.79s
Peak memory usage:     3.23GB
```


```
❯ cat pyproject.toml
[project]
name = "voxcpm2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "mlx-audio @ git+https://github.com/acul3/mlx-audio.git@feat/add-voxcpm2",
]
```

audio_000.mp3

```
uv run python -m mlx_audio.tts.generate --model ~/.lmstudio/models/mlx-community/VoxCPM2-4bit --text "Hello world" --instruct "A young woman, gentle voice" \
  --verbose

You are using a model of type `voxcpm2` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Instruct: A young woman, gentle voice
Text: Hello world
Voice: None
Speed: 1.0x
Language: en
✅ Audio successfully generated and saving as: audio_000.wav
==========
Duration:              00:00:02.240
Samples/sec:           30523.6
Prompt:                11 tokens, 3.1 tokens-per-sec
Audio:                 107528 samples, 30523.6 samples-per-sec
Real-time factor:      0.64x
Processing time:       3.52s
Peak memory usage:     3.24G
```

audio_000woman.mp3

```
uv run python -m mlx_audio.tts.generate --model ~/.lmstudio/models/mlx-community/VoxCPM2-4bit --text "Hello world" --instruct "A young woman, gentle voice" --voice "female_1" \
  --verbose
```

audio_000female_1.mp3

Audio files converted with `ffmpeg -i audio_000.wav audio_000.mp3` to upload them on GitHub.

@lucasnewman (Collaborator)

@acul3 Can you add a README for the model to document how to use it and what the supported repo ids are? See the other models for examples. Otherwise this looks good to me.

@lucasnewman (Collaborator)

Please also run `pre-commit run --all` to fix the formatting.


gianpaj commented Apr 10, 2026

@lucasnewman, did you hear the samples I uploaded? Something isn't right.
Did you try to generate audio with one of the quantized models?


acul3 commented Apr 11, 2026

@gianpaj Thanks for reporting! I've pushed a fix that addresses the quality issue.

Root cause: tokenizer.encode() was adding a BOS token that the original PyTorch model doesn't use. This shifted all token positions by 1, causing the LM to produce degraded output. I also fixed a warmup default that could cut off the beginning of the audio.

Can you try again?
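
For reference, a minimal illustration of the difference, assuming a standard Hugging Face tokenizer (the repo id below is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/VoxCPM2")  # placeholder repo id

text = "Hello world"
with_bos = tok.encode(text)  # encode() may prepend BOS/special tokens by default
no_bos = tok.convert_tokens_to_ids(tok.tokenize(text))  # matches the PyTorch reference
# If the tokenizer adds BOS, with_bos == [bos_id] + no_bos -> every position shifts by 1
```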


acul3 commented Apr 11, 2026

@lucasnewman sure, let me add it


acul3 commented Apr 11, 2026

cmd1_000.wav
cmd2_000.wav
cmd3_000.wav

using the same config @gianpaj


krmao commented Apr 11, 2026

I tested the unmerged feat/add-voxcpm2 branch locally on Apple Silicon and
confirmed that it fixes the model registration/loading problem, but I still see
reliability issues for short Chinese TTS generation in voice design mode.

What I tested:

  • Model: mlx-community/VoxCPM2-bf16
  • Runtime: PR branch acul3/mlx-audio@feat/add-voxcpm2
  • Invocation style: non-streaming only, using the documented Python API model.generate(...)
  • Mode: instruct + text voice design
  • I also verified that the PR already includes the BOS-token fix and the
    warmup_patches=0 change

What works:

  • The model now loads successfully as a native voxcpm2 implementation
  • Generation completes without crashing
  • The output is valid audio and sounds generally speech-like

What still does not work well:

  • For short Chinese prompts, content fidelity is unstable
  • The generated speech often drifts away from the requested text
  • In some cases the output appears to switch language or produce unrelated
    content
  • This is not just an audio-quality issue; it is a text-following / content-
    retention issue

What I tried that did not solve it:

  • Using the latest PR code with the BOS-token fix
  • Using the corrected warmup_patches default
  • Using non-streaming generation only
  • Using bf16 rather than quantized variants
  • Using voice-design style prompts exactly as documented: text=..., instruct=...

Why I think this matters:

  • The loading issue appears fixed, but short-form multilingual text conditioning,
    especially Chinese short prompts, still seems unreliable in voice design mode
  • This suggests the remaining problem is not only registration/tokenization, but
    likely generation behavior or conditioning alignment in the MLX implementation

My current hypothesis:

  • The PR fixes a real bug, but there may still be a mismatch between the MLX
    implementation and the original PyTorch behavior for short-form content
    retention
  • The issue seems more visible on very short prompts than on ordinary demo
    sentences


acul3 commented Apr 11, 2026

hi @krmao, do you have a sample prompt that I can test?

I will try to match it with the original repo.


krmao commented Apr 11, 2026

> hi @krmao, do you have a sample prompt that I can test?
>
> I will try to match it with the original repo.

@acul3 Hi, yes — I verified this with Whisper, not only by listening.

Here is one clean anonymized repro I tested locally on Apple Silicon using the PR
branch, non-streaming generation:

  • text: 小雨来了 ("the light rain has come")
  • instruct: 二十多岁的年轻女性,声音清脆自然,普通话标准,轻声细语,像在安静的卧室里 (a young woman in her twenties, clear and natural voice, standard Mandarin, soft-spoken, as if in a quiet bedroom)

Validation method:

  1. generate audio with the PR branch
  2. transcribe the generated wav with Whisper (whisper-large-v3)
  3. compare transcription vs expected text

Expected:

  • 小雨来了

Actual Whisper result from my local run:

  • So yeah, okay?

Whisper also detected the language as en for this sample.

So the issue I’m reporting is text fidelity, not only subjective audio quality.
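
For reproducibility, the round-trip check I ran looks roughly like this (paraphrased; assumes the wav came from the CLI/API run above and the openai-whisper package is installed):

```python
import whisper  # openai-whisper

expected = "小雨来了"
asr = whisper.load_model("large-v3")
result = asr.transcribe("audio_000.wav")
print("detected language:", result["language"])  # came back as 'en' in my run
print("transcription:", result["text"].strip())  # compare against expected text
```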


su3 commented Apr 14, 2026

Unable to generate Chinese audio. The audio produced is meaningless noise with no relation to the text. Several other languages (Japanese, Korean, English) were also tested and worked correctly. One additional issue: the --speed parameter has no effect.

```
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-8bit \
  --text "你好,世界!" --lang_code zh --verbose
You are using a model of type `voxcpm2` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Text: 你好,世界!
Voice: None
Speed: 1.0x
Language: zh
✅ Audio successfully generated and saving as: audio_000.wav
==========
Duration:              00:00:01.120
Samples/sec:           74953.9
Prompt:                5 tokens, 7.0 tokens-per-sec
Audio:                 53768 samples, 74953.9 samples-per-sec
Real-time factor:      1.56x
Processing time:       0.72s
Peak memory usage:     4.16GB
```

Samsul Rahmadani and others added 6 commits April 16, 2026 10:51
Add MLX implementation of OpenBMB's VoxCPM2 with support for:
- Zero-shot TTS, voice design, voice cloning, and continuation modes
- 48kHz studio-quality audio via asymmetric AudioVAE (16kHz encode / 48kHz decode)
- Sample-rate conditioned decoder with SampleRateConditionLayer
- MiniCPM4 backbone with kv_channels and no_rope support
- VoxCPMLocDiTV2 diffusion transformer with multi-token mu
- Fusion concat projection for residual LM input
- Configurable warmup patches to eliminate onset artifacts
- CLI integration via --instruct (voice design), --ref_audio (cloning)
- 4-bit/8-bit quantization support (1.17x realtime on Apple Silicon)
- 15 unit tests in test_models.py

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Use tokenize+convert_tokens_to_ids instead of encode() to match
  PyTorch behavior (encode() adds BOS token that shifts all positions)
- Reduce warmup_patches to 1 when instruct is provided
- Enforce minimum cfg_value=2.0 for CLI compatibility

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
The BOS token fix resolved the onset artifacts, so warmup patches
are no longer needed by default.

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Add model README with usage docs, supported repo IDs, parameters,
  and architecture overview
- Run pre-commit (black + isort) to fix formatting

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Add decode_chunk_size to AudioVAE (encoder=640, decoder=1920)
- Use decode_chunk_size for continuation audio trimming
- Make VAD silence trimming optional (default off, matching upstream)
- Add input type validation for text parameter

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
- Fix test_voice_design_prefix to mock tokenize+convert_tokens_to_ids
  instead of encode (matching the BOS token fix)
- Add VoxCPM2 to TTS model comparison table in docs

Co-Authored-By: Samsul Rahmadani <samsulrahmadani@users.noreply.github.com>
acul3 force-pushed the feat/add-voxcpm2 branch from 66c522a to 17ac926 on April 16, 2026 03:51
@xocialize

Hi @acul3 — nice work getting VoxCPM2 into mlx-audio! I've been independently porting VoxCPM2 to Swift/MLX and spent significant time comparing against the official voxcpm pip package (v2.0.2) source code. I found several issues that may explain the quality problems reported here. Happy to help contribute fixes:

Confirmed issues from official source (voxcpm/model/voxcpm2.py, modules/locdit/unified_cfm.py, modules/minicpm4/model.py):

  1. kv_channels support — The bf16 model uses kv_channels=128 for attention head dimensions (not hidden_size // num_heads). This affects Q/K/V projection sizes for LocEnc and LocDiT.
  2. scale_emb gating — Official code: scale_emb = 1.0 when use_mup=False. The config has use_mup=False and scale_emb=12, but the ×12 multiplication should NOT be applied. See voxcpm2.py line 1003-1006.
  3. residual_lm_no_rope=True — The RALM must not apply RoPE. Config field residual_lm_no_rope is True for VoxCPM2.
  4. SampleRateConditionLayer — The AudioVAE decoder has 6 FiLM conditioning layers (sr_cond_layers) that modulate based on target sample rate. Skipping them produces degraded audio since the decoder weights were trained with them active.
  5. bf16 on Apple Silicon — Upstream PR #263 confirms bf16 causes glitched audio on MPS/Metal. Fix: cast to float32 for inference.
  6. dit_hidden is concatenation, not addition — dit_hidden = cat(lm_to_dit_proj, res_to_dit_proj) gives (B, 2*H_dit), reshaped to 2 mu prefix tokens in LocDiT. See local_dit_v2.py line 109-110.

Would you like me to submit fixes for any of these, or would you prefer I open a companion PR? I have all of these verified against the official source with line references.


acul3 commented Apr 17, 2026

hi @xocialize, thanks for reporting.

I'm open to you submitting fixes for any of these in this branch, or opening a companion PR!

Just let me know so I can test. Thank you.

@xocialize

Hi @acul3 — thanks for the open offer!

Quick context on where my findings come from: I ported VoxCPM2 to Swift/MLX in parallel against the official voxcpm pip package v2.0.2 as my reference. The bugs below surfaced as I diffed your PR against that source. I haven't written Python patches myself — my fixes are in Swift — so I'm sharing this as a bug list with receipts rather than a PR. Happy to dig into any of them if you want more detail.

Line numbers below reference:


Bugs that affect output quality

1. dit_hidden is addition, should be concatenation

voxcpm.py:381:

```python
dit_h = dit_h1 + dit_h2  # (1, H)
```

Official voxcpm/model/voxcpm2.py _inference() (~line 1049):

```python
dit_hidden_1 = self.lm_to_dit_proj(lm_hidden)  # [b, h_dit]
dit_hidden_2 = self.res_to_dit_proj(residual_hidden)  # [b, h_dit]
dit_hidden = torch.cat((dit_hidden_1, dit_hidden_2), dim=-1)
```

The official concatenates along the last axis to get (B, 2*H_dit), which the LocDiT then reshapes to two mu prefix tokens (B, 2, H_dit). Addition collapses this into a single mu token and drops half the LocDiT's input dimensionality. In my Swift port this was the single biggest quality change after fixing — sum → concat visibly improved prosody and intelligibility.

2. scale_emb is inverted

voxcpm.py:273:

```python
scale_emb = (
    self.args.lm_config.scale_emb if not self.args.lm_config.use_mup else 1.0
)
```

Official voxcpm/model/voxcpm2.py:

```python
scale_emb = self.config.lm_config.scale_emb if self.config.lm_config.use_mup else 1.0
```

The condition is flipped. The default config has use_mup=False, so scale_emb gets applied to the text embeddings here (yours) vs. 1.0 in the official. For the VoxCPM2 shipped configs scale_emb=12, so text embeddings end up ~12× too large → severely degraded output.

The same issue exists in minicpm.py lines 188-198, where scale_depth is applied unconditionally, but the official gates it on use_mup. Both should be gated consistently.

3. Stop predictor threshold is hardcoded, should be parameterized with min_len=2

voxcpm.py:400:

```python
if i > 5 and stop_flag == 1:
    break
```

Official voxcpm/model/voxcpm2.py:

```python
if i > min_len and stop_flag == 1:
    break
```

where min_len defaults to 2 (not 5). This makes short utterances (2-5 patches) unable to stop at their natural endpoint — they either keep generating until max_len or produce clipped output depending on how the badcase retry interacts. Should be a function arg with default 2.

4. Voice cloning uses VoxCPM 1.x continuation pattern, not VoxCPM2 reference-only

voxcpm.py:282-295:

```python
if ref_audio is not None and ref_text is not None:
    combined_text = ref_text + text
    input_ids = self.tokenizer.encode(combined_text)
    ...
    text_token = mx.concatenate([input_ids, text_pad_token])
```

This is the VoxCPM 1.x "continuation" layout. VoxCPM2's reference-only mode uses a different pattern via _make_ref_prefix in voxcpm/model/voxcpm2.py:

```
tokens:  [103,         zeros(refLen),  104         ]
feats:   [zero_patch,  ref_feat...,    zero_patch  ]
t_mask:  [1,           0...0,          1           ]  # 103/104 are text-masked
a_mask:  [0,           1...1,          0           ]  # ref audio is audio-masked
```

Full sequence: [103, ref_audio_patches, 104, text_tokens, 101]. Special tokens: 101=<audio_start>, 103=<ref_start>, 104=<ref_end>.

The official _generate has 4 modes selected by reference_wav_path + prompt_wav_path:

  • [text, 101] — zero-shot
  • [103, ref, 104, text, 101] — reference-only
  • [text, 101, prompt] — continuation (VoxCPM 1.x-style, what the PR currently implements)
  • [103, ref, 104, text, 101, prompt] — combined

The PR's ref_audio path should switch to the reference-only layout as its primary mode, and ideally expose the other three as additional parameters.
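
A hypothetical sketch of assembling that reference-only layout (helper name and shapes are mine, not the official _make_ref_prefix signature):

```python
import mlx.core as mx

REF_START, REF_END, AUDIO_START = 103, 104, 101  # special token ids per above

def make_ref_prefix(ref_feat: mx.array):  # ref_feat: (ref_len, patch_dim)
    ref_len, patch_dim = ref_feat.shape
    tokens = mx.array([REF_START] + [0] * ref_len + [REF_END])
    zero_patch = mx.zeros((1, patch_dim))
    feats = mx.concatenate([zero_patch, ref_feat, zero_patch], axis=0)
    text_mask = mx.array([1] + [0] * ref_len + [1])   # 103/104 attend as text
    audio_mask = mx.array([0] + [1] * ref_len + [0])  # ref patches attend as audio
    return tokens, feats, text_mask, audio_mask
```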


Missing features (affect output quality)

5. SampleRateConditionLayer and 48 kHz output

config.py:60:

```python
sample_rate: int = 44100
```

No out_sample_rate, no SampleRateConditionLayer. Official VoxCPM2 has:

  • Encoder sample rate: 16 kHz (input)
  • Output sample rate: 48 kHz (decoder upsamples 1920×)
  • Per-decoder-block SampleRateConditionLayer: an Embedding(num_buckets, channels) applied as FiLM before each decoder block, with sr_bin_boundaries=[20000, 30000, 40000] bucketizing the requested output rate into a 4-way one-hot

See official voxcpm/modules/audiovae/audio_vae_v2.py:218-265 (SampleRateConditionLayer) and the forward() that applies sr_cond_layer(x, sr_cond) before each decoder block. Without this, the decoder produces at input rate and misses the prosodic conditioning the FiLM layer provides.
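
To illustrate the pattern (my names, not the official module's), a minimal MLX sketch of sample-rate FiLM conditioning:

```python
import mlx.core as mx
import mlx.nn as nn

class SampleRateFiLM(nn.Module):
    """Bucketize the target rate, embed the bucket, apply scale/shift (FiLM)."""

    def __init__(self, channels: int, num_buckets: int = 4):
        super().__init__()
        self.emb = nn.Embedding(num_buckets, 2 * channels)

    def __call__(self, x: mx.array, sample_rate: int) -> mx.array:
        # sr_bin_boundaries = [20000, 30000, 40000] -> buckets 0..3
        bucket = sum(sample_rate > b for b in (20000, 30000, 40000))
        scale, shift = mx.split(self.emb(mx.array([bucket])), 2, axis=-1)
        return x * (1.0 + scale) + shift  # x assumed channels-last: (B, T, C)
```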

6. kv_channels not honored in LM/RALM/DiT head dim

The official MiniCPM-4 config exposes kv_channels as the head dim (128 in shipped configs), decoupled from hidden_size / num_heads. PR #641 computes head_dim inline as hidden_size / num_heads which gives the wrong value for DiT (hidden=1024, heads=16 → 64 ≠ 128 shipped). This produces the right shape but wrong semantics if a user ever loads a config where kv_channels ≠ hidden_size / num_heads.

7. RALM no_rope not honored

Official config.py has residual_lm_no_rope: bool = True. The RALM (residual LM) must disable RoPE. I don't see a no_rope flag plumbed through minicpm.py in the PR. If RoPE is being applied to RALM, the residual branch is seeing an extra positional encoding signal it wasn't trained on.
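
A sketch of how both items 6 and 7 could be plumbed through an attention module (illustrative, not the PR's code):

```python
import mlx.nn as nn

class SketchAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int,
                 kv_channels: int | None = None, no_rope: bool = False):
        super().__init__()
        # head_dim honors kv_channels when set (128 in shipped configs),
        # falling back to hidden_size // num_heads otherwise
        self.head_dim = kv_channels or hidden_size // num_heads
        self.no_rope = no_rope  # True for the RALM (residual_lm_no_rope=True)
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.rope = nn.RoPE(self.head_dim)

    def position_encode(self, q, offset: int = 0):
        # The RALM branch must skip RoPE entirely
        return q if self.no_rope else self.rope(q, offset=offset)
```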


What's correct (for completeness)

Just to be clear, a lot of your port is right. For example:

  • Euler step direction: dit.py:175 x - dt * dphi_dt with t_span = linspace(1, 0) — ✓ matches official exactly
  • CFG-zero-star formula: dit.py:170 — ✓ matches
  • Sway sampling with coef=1.0: dit.py:193 — ✓ matches
  • zero_init_steps formula: dit.py:117 max(1, int(len(t_span) * 0.04)) — ✓ matches
  • mean_mode=False → dt zeroed for uncond: dit.py:140 — ✓ matches (and your FIXED comment is accurate)

So the flow-matching core is solid. It's the conditioning path (embeddings, cloning, sample-rate conditioning, stop predictor) that has the divergence.


Reference

  • Official source I'm comparing against: pip install voxcpm==2.0.2 (voxcpm/model/voxcpm2.py, voxcpm/modules/locdit/unified_cfm.py, voxcpm/modules/locdit/local_dit_v2.py, voxcpm/modules/minicpm4/model.py, voxcpm/modules/audiovae/audio_vae_v2.py)
  • A Swift/MLX port applying these fixes achieves intelligible English speech (including voice design and reference-audio cloning) against the real mlx-community/VoxCPM2-bf16 weights on M4 Pro, passing end-to-end generation tests through the full AudioVAE.

I'm happy to go deeper on any single item, test candidate patches against my Swift reference for parity, or open a companion PR with one or two targeted fixes if that helps you make progress without overloading this PR. Whatever works for your merge workflow.

@lucasnewman (Collaborator)

@xocialize Do you want to post a new PR with your changes included? I think it would be quicker than going back and forth here.

@xocialize

Thanks @lucasnewman — yes, happy to. I'll scope a Companion PR covering items 1-4 (the surgical conditioning-path fixes: dit_hidden concat, scale_emb flip, stop predictor min_len, and VoxCPM2-style _make_ref_prefix for voice cloning). Items 5-7 (SampleRateConditionLayer + 48 kHz, kv_channels, RALM no_rope) are bigger architectural additions — I'd suggest splitting those into a follow-up PR if this one lands cleanly.

Would you prefer it targeting Blaizzy/mlx-audio:main directly, or as a fork-and-PR against @acul3's feat/add-voxcpm2 branch?


krmao commented Apr 24, 2026

@xocialize do you have an early patch file yet? We can test.

@xocialize

@krmao — here are the line-level changes for items 1-3 against voxcpm.py in PR #641. Item 4 (voice cloning layout) is a bigger restructure that'll land with the formal PR, so I'm leaving it out here.

Important caveat: these diffs are translated from my Swift reference where I've validated the fixes against real mlx-community/VoxCPM2-bf16 output. I haven't yet rerun them through the Python MLX port myself — that's part of the formal PR validation pass. So treat this as "best translation from a working Swift implementation," not "verified Python patch." If anything looks wrong when you apply it, please flag and I'll iterate.

Item 1: dit_hidden addition → concatenation

mlx_audio/tts/models/voxcpm/voxcpm.py:381

```diff
             # DiT
             dit_h1 = self.lm_to_dit_proj(lm_hidden)
             dit_h2 = self.res_to_dit_proj(residual_hidden)
-            dit_h = dit_h1 + dit_h2  # (1, H)
+            dit_h = mx.concatenate([dit_h1, dit_h2], axis=-1)  # (1, 2*H_dit)
```

Downstream feat_decoder.sample(mu=dit_h, ...) should already accept the wider input — LocDiT reshapes mu to (B, -1, H_dit) to get the two prefix tokens. If it errors on shape, that's a sign the LocDiT's mu reshape was assuming a single token; check dit.py for mu.reshape(B, -1, H) or view(B, -1, H).

Item 2: scale_emb condition flipped

mlx_audio/tts/models/voxcpm/voxcpm.py:273

```diff
         # scale_emb
         scale_emb = (
-            self.args.lm_config.scale_emb if not self.args.lm_config.use_mup else 1.0
+            self.args.lm_config.scale_emb if self.args.lm_config.use_mup else 1.0
         )
```

Same fix in mlx_audio/tts/models/voxcpm/minicpm.py for scale_depth (lines ~188-198) — the official gates scale_depth on use_mup too. The current code applies it unconditionally; should be:

```diff
-        x = r + h * (self.scale_depth / math.sqrt(self.num_hidden_layers))
+        if self.use_mup:
+            x = r + h * (self.scale_depth / math.sqrt(self.num_hidden_layers))
+        else:
+            x = r + h
```

(in both places where the scale_depth line currently appears)

Item 3: stop predictor min_len parameter

mlx_audio/tts/models/voxcpm/voxcpm.py:256 (generate signature):

```diff
     def generate(
         self,
         text: str,
         max_tokens: int = 4096,
+        min_len: int = 2,
         ref_text: Optional[str] = None,
         ref_audio: Optional[str] = None,
         inference_timesteps: int = 10,
         cfg_value: float = 2.0,
         **kwargs,
     ):
```

mlx_audio/tts/models/voxcpm/voxcpm.py:400:

```diff
             stop_logits = self.stop_head(nn.silu(self.stop_proj(lm_hidden)))
             stop_flag = mx.argmax(stop_logits, axis=-1).item()
-            if i > 5 and stop_flag == 1:
+            if i > min_len and stop_flag == 1:
                 break
```

If you apply 1+2 together, you should hear the biggest perceptual improvement — those are the two that compound (text embedding scale × DiT input dimensionality). Item 3 mostly fixes premature truncation on short utterances. Curious to hear what your Whisper-based validation says.

@xocialize

@lucasnewman following up on this — to clarify before I open the Companion PR, would you prefer it to target Blaizzy/mlx-audio:main directly, or as a fork-and-PR against @acul3's feat/add-voxcpm2 branch? Either is fine on my end; just want to be sure I'm landing it where it's most useful for your merge workflow.

@lucasnewman (Collaborator)

@xocialize A standalone PR against main with all changes included is ideal for getting this in, thanks!

@xocialize

@lucasnewman @acul3 — circling back: pulled the current PR head locally and ran zero-shot, voice design, and reference-audio cloning end-to-end against mlx-community/VoxCPM2-bf16. All three modes produce intelligible 48 kHz output with appropriate voice characteristics — verified the voice-design preset against its description, and confirmed reference cloning produces output recognizably similar to the reference speaker on a same-language English clip.

The fixes from my Apr 17 comment (items 1-7) are all in — dit_h concatenation, scale_emb / scale_depth MuP gating, min_tokens parameterization, _make_ref_prefix for VoxCPM2-style reference layout, SampleRateConditionLayer in the AudioVAE decoder, kv_channels honoring, and RALM no_rope. They landed cleanly during the rename to voxcpm2/ (commits c7c79b8 through 6405793).

Nice work @acul3 — the integration looks ready to merge from where I'm sitting. The standalone PR I offered isn't needed since you've already folded everything in.

Happy to post sample WAVs from the three modes if useful for the merge discussion.
