
Optimize ACE-Step: NLC VAE, compiled decode, LoRA support#498

Open
fspecii wants to merge 1 commit into Blaizzy:pc/add-ace from fspecii:pc/add-ace

Conversation


@fspecii fspecii commented Feb 15, 2026

Summary

  • VAE rewrite to NLC format: Replace WeightNormConv1d/WeightNormConvTranspose1d with nn.Conv1d and FastConvTranspose1d. Weight-norm parameters (weight_g, weight_v) are fused into regular weights at load time via sanitize(), eliminating per-forward-pass normalization overhead.
  • Compiled VAE decode: mx.compile(model.vae.decode) with auto-conversion to mlx_weights.safetensors on first load (skips PT→MLX conversion on subsequent runs).
  • mx.fast.scaled_dot_product_attention: Replaces manual attention (matmul → mask → softmax → matmul) with MLX's fused kernel.
  • Simplified turbo diffusion: Single-pass inference without CFG/APG (turbo model was distilled without guidance). Removes ~80 lines of unused guidance code.
  • LoRA adapter support: load_lora() / unload_lora() with weight fusion (W + scale * (alpha/r) * B @ A), base weight backup/restore for hot-swapping adapters.
  • Quantized 5Hz LM variants: Added 0.6B-8bit and 0.6B-4bit model IDs for lower-memory language model inference.
  • Music metadata: bpm, keyscale, timesignature parameters forwarded to prompt formatting.
  • Model loading: custom_loading class attribute + acestep remapping for clean integration with mlx_audio.utils.base_load_model.
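
The weight-norm fusion described above is just precomputing W = g * v / ||v|| once at load time. A minimal NumPy sketch of that math (the function name `fuse_weight_norm` is illustrative, not the PR's actual `sanitize()` implementation):

```python
import numpy as np

def fuse_weight_norm(weight_g, weight_v):
    """Fuse weight-norm parameters into a plain conv weight: W = g * v / ||v||.

    The norm is taken per output channel (axis 0) over all remaining axes,
    matching PyTorch's weight_norm with dim=0. After fusion, the layer can
    use a regular nn.Conv1d weight with no per-forward normalization.
    """
    out_channels = weight_v.shape[0]
    # Per-output-channel L2 norm of v.
    norm = np.linalg.norm(weight_v.reshape(out_channels, -1), axis=1)
    # Broadcast g and the norm back over the kernel / in-channel axes.
    shape = (out_channels,) + (1,) * (weight_v.ndim - 1)
    return weight_g.reshape(shape) * weight_v / norm.reshape(shape)
```

A useful sanity check: the per-channel norm of the fused weight equals |g|, since v / ||v|| is unit-norm per channel.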
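
For context on the attention change: the manual path being replaced is the standard matmul → mask → softmax → matmul chain, which MLX's fused `mx.fast.scaled_dot_product_attention` kernel computes in one call. A reference sketch of that chain in NumPy:

```python
import numpy as np

def manual_attention(q, k, v, mask=None):
    """Reference attention: softmax(q @ k^T / sqrt(d) + mask) @ v.

    This is the unfused chain the PR replaces; a fused kernel avoids
    materializing the full (seq, seq) score matrix in separate passes.
    """
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    if mask is not None:
        scores = scores + mask
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```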
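
The LoRA hot-swap scheme can be sketched as follows; this is an illustrative NumPy mock of the fuse/backup/restore cycle, not the PR's actual `load_lora()` / `unload_lora()` code:

```python
import numpy as np

class LoraFusable:
    """Minimal sketch of fuse-on-load LoRA with hot-swapping.

    load_lora fuses W' = W + scale * (alpha / r) * B @ A into the base
    weight and keeps a backup of W, so unload_lora can restore the base
    weight before a different adapter is fused in.
    """

    def __init__(self, weight):
        self.weight = weight
        self._backup = None

    def load_lora(self, lora_a, lora_b, alpha, scale=1.0):
        r = lora_a.shape[0]  # LoRA rank: A is (r, in), B is (out, r)
        self._backup = self.weight.copy()
        self.weight = self.weight + scale * (alpha / r) * (lora_b @ lora_a)

    def unload_lora(self):
        if self._backup is not None:
            self.weight = self._backup
            self._backup = None
```

Fusing at load time keeps inference cost identical to the base model, at the price of a one-time weight copy per adapter swap.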

Test plan

  • Generate 30s instrumental with --model ACE-Step/ACE-Step1.5 — verified output WAV
  • Generate 30s track with lyrics and vocal_language param
  • Generate with bpm/keyscale/timesignature metadata params
  • Verify mlx_audio.tts.load() API path works with custom_loading
  • Generate 60s track — verified correct duration and stereo output
  • Verify other TTS models (Kokoro, Spark) unaffected by utils changes
  • Test with LoRA adapter loading/unloading
  • Test quantized LM variants (0.6B-8bit, 0.6B-4bit)
  • Test audio-to-audio tasks (cover, extract, complete)

- Rewrite VAE to native NLC format with nn.Conv1d and FastConvTranspose1d,
  fusing weight-norm (g*v/||v||) at load time instead of every forward pass
- Replace manual attention with mx.fast.scaled_dot_product_attention
- Simplify turbo diffusion to single-pass (no CFG/APG) matching upstream behavior
- Add compiled VAE decode (mx.compile) with auto-conversion to mlx_weights.safetensors
- Add LoRA adapter support (load/unload with weight fusion)
- Add quantized 5Hz LM variants (0.6B-8bit, 0.6B-4bit)
- Add music metadata params (bpm, keyscale, timesignature)
- Add acestep model remapping and custom_loading support in utils
@lucasnewman
Collaborator

I don't see the weight-norm conv changes; did those get excluded? Also, it looks like you removed CFG entirely. Was it not needed even for existing models? If you could explain the intent and goal of the changes rather than just the mechanical pieces, that would be helpful context.

@Blaizzy
Owner

Blaizzy commented Feb 23, 2026

@lucasnewman

This PR will be closed in favour of #499.

I will add @fspecii as a contributor there.

