
Support for Zyphra/ZAYA1-base #1261

Open

kyr0 wants to merge 2 commits into ml-explore:main from kyr0:feat/zaya-support

Conversation

kyr0 commented May 9, 2026

I found this novel architecture quite interesting, so I did a clean-room implementation of support for it after reading and debugging the vLLM patch pushed two days ago.

This implementation works well with full precision, 8-bit quantization, and 4-bit quantization on a MacBook Air M4 (24 GB).

Stats:

  • 8-bit quant: ~20 t/s generated @ 9.1 GB VRAM (weights) + KV cache
  • 4-bit quant: ~30 t/s generated @ 4.99 GB VRAM (weights) + KV cache

I also covered testing with manual runs of mlx_lm.benchmark, mlx_lm.convert, and mlx_lm.server.

Novel features added:

  • Compressed Convolutional Attention (CCA)
  • Residual Scaling
  • Odd/Even Layers and Router1 technique
  • Quantized expert layer switching (SwiGLU)
    ...all as described in the technical report. (A rough sketch of the residual-scaling idea follows below.)
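
For context, here is a minimal sketch of what the residual-scaling idea typically looks like in MLX: a learned per-layer scale applied to a sub-block's output before it is added back to the residual stream. The class name, attribute names, and the stand-in sub-block are illustrative assumptions, not the actual code in this PR or in the ZAYA1 technical report.

import mlx.core as mx
import mlx.nn as nn

class ScaledResidualBlock(nn.Module):
    """Illustrative only: residual connection with a learned scale on the branch output."""

    def __init__(self, dims: int):
        super().__init__()
        self.norm = nn.RMSNorm(dims)
        self.branch = nn.Linear(dims, dims)      # stand-in for the real attention/MoE sub-block
        self.residual_scale = mx.ones((dims,))   # learned scale, initialized to 1

    def __call__(self, x: mx.array) -> mx.array:
        # y = x + s * f(norm(x)): scale the branch output before the residual add
        return x + self.residual_scale * self.branch(self.norm(x))

# Quick shape check on a (batch, seq, dims) activation.
block = ScaledResidualBlock(dims=64)
y = block(mx.random.normal((1, 8, 64)))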

The converted/quantized models are available on my HF account (https://huggingface.co/kyr0):

  • https://huggingface.co/kyr0/zaya1-base-8b-MLX
  • https://huggingface.co/kyr0/zaya1-base-8b-8bit-MLX
  • https://huggingface.co/kyr0/zaya1-base-8b-4bit-MLX

Enjoy!

Transparency: I'm not affiliated with Zyphra.


Repro for convert/quants:

mlx_lm.convert \
  --hf-path Zyphra/ZAYA1-8B \
  --mlx-path "./zaya1-base-8b-MLX" \
  --dtype bfloat16
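
A quick way to sanity-check the converted model from Python is mlx_lm's load/generate API. This is just a sketch, assuming the local output path from the convert step above:

from mlx_lm import load, generate

model, tokenizer = load("./zaya1-base-8b-MLX")
print(generate(model, tokenizer, prompt="Solve x+2=7. Answer only.", max_tokens=32))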

Quantization

Tested with 8 and 4 bits, group size 64. Lower bit widths produce garbage output.

A quick test with AWQ quantization led to OOM in all cases, and dynamic quant as well. I'm too GPU-poor on my Mac, guys... does anyone have a Mac Pro?

mlx_lm.convert \
    --hf-path Zyphra/ZAYA1-8B \
    --mlx-path "./zaya1-base-8b-8bit-MLX" \
    --dtype bfloat16 \
    -q --q-bits 8 --q-group-size 64

Server + Test

mlx_lm.server \
  --model "./zaya1-base-8b-8bit-MLX" \
  --host 127.0.0.1 \
  --port 8080 \
  --temp 0.0 \
  --top-p 1.0 \
  --max-tokens 8192 \
  --prefill-step-size 512 \
  --prompt-cache-size 0

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "./zaya1-base-8b-8bit-MLX",
    "messages": [{"role": "user", "content": "Solve x+2=7. Answer only."}],
    "temperature": 0,
    "max_tokens": 1024
  }'
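
Since mlx_lm.server exposes an OpenAI-compatible endpoint, the same request can also be sent with the official openai Python client instead of curl. A minimal sketch (the api_key is a dummy value; the local server does not check it):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="./zaya1-base-8b-8bit-MLX",
    messages=[{"role": "user", "content": "Solve x+2=7. Answer only."}],
    temperature=0,
    max_tokens=1024,
)
print(resp.choices[0].message.content)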

kyr0 added 2 commits May 9, 2026 03:38
…ttention (CCA) and Residual Scaling as well as quantized expert layer switching in SwiGLU scenarios, and general convert/quantization support for ZAYA.
…cing an empty self.cache list; so we correctly return True in this case generically
@beamivalice

I'm testing this model with your 8-bit version from HF. It's good, despite the warning below.

Prompt: 32 tokens, 91.910 tokens-per-sec
Generation: 1024 tokens, 30.248 tokens-per-sec
Peak memory: 9.628 GB
M1 Max 32 cores

[transformers] You are using a model of type zaya to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.

kyr0 (Author) commented May 11, 2026

@beamivalice Right? I like the quality as well. They also released a 78B preview model the other day, as well as an 8B VL model, which @Blaizzy added support for yesterday in Blaizzy/mlx-vlm#1159.

I don't have enough VRAM, but maybe someone with a Mac Pro could convert/quantize the big model based on this PR. The mlx_lm.convert script is all one needs, and as far as I understand it should work without any issues.

Now that we have a second implementation from @Blaizzy, I think it would make sense to discuss the pros and cons of both and agree on the best version. I think I shaped this implementation well for this repo, and it seems to do exactly what it's supposed to do. But if there's anything I can do here to help, I'm all ears :) @0k1, could you please take a look and check whether my work here is fine?

kyr0 (Author) commented May 11, 2026

One idea that comes to mind is whether we want to implement kwargs-based pass-through of a new parameter to control CCA and MoE details. The strength of this architecture is that you can get much better quality out of the model by investing more compute time. Having the compute-time factor controllable per request would be really nice. We might also map reasoning_effort to pre-defined factor values, but full control would be pretty cool, I guess.
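
To make that concrete, here is a purely hypothetical sketch of how a per-request compute factor could be resolved from either an explicit parameter or a reasoning_effort preset. The name compute_factor and the preset values are made up for illustration; nothing like this exists in mlx-lm today:

# Hypothetical only: parameter names and preset values are invented for illustration.
EFFORT_TO_FACTOR = {"low": 0.5, "medium": 1.0, "high": 2.0}

def resolve_compute_factor(request: dict) -> float:
    # An explicit per-request factor wins; otherwise map reasoning_effort; default to 1.0.
    if "compute_factor" in request:
        return float(request["compute_factor"])
    return EFFORT_TO_FACTOR.get(request.get("reasoning_effort", "medium"), 1.0)

print(resolve_compute_factor({"reasoning_effort": "high"}))  # 2.0
print(resolve_compute_factor({"compute_factor": 1.5}))       # 1.5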
