
Support for Zyphra/ZAYA1-base #1261

Open

kyr0 wants to merge 2 commits into ml-explore:main from kyr0:feat/zaya-support

Conversation

kyr0 commented May 9, 2026

I found this novel architecture quite interesting, so I did a clean-room implementation of support for it after reading and debugging the vLLM patch pushed two days ago.

This implementation works well with full precision, 8-bit quantization, and 4-bit quantization on a MacBook Air M4 (24 GB).

Stats:

  • 8-bit quant: ~20 t/s generated @ 9.1 GB VRAM (weights) + KV cache
  • 4-bit quant: ~30 t/s generated @ 4.99 GB VRAM (weights) + KV cache

I also covered testing with manual runs of mlx_lm.benchmark, mlx_lm.convert, and mlx_lm.server.

Novel features added:

  • Compressed Convolutional Attention (CCA)
  • Residual Scaling
  • Odd/Even Layers and Router1 technique
  • Quantized expert layer switching (SwiGLU)
    ...all as described in the technical report. (A rough sketch of the residual-scaling idea follows below.)
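
For context, here is a minimal sketch of what the residual-scaling idea typically looks like in MLX: a learned per-layer scale applied to a sub-block's output before it is added back to the residual stream. The class name, attribute names, and the stand-in sub-block are illustrative assumptions, not the actual code in this PR or in the ZAYA1 technical report.

import mlx.core as mx
import mlx.nn as nn

class ScaledResidualBlock(nn.Module):
    """Illustrative only: residual connection with a learned scale on the branch output."""

    def __init__(self, dims: int):
        super().__init__()
        self.norm = nn.RMSNorm(dims)
        self.branch = nn.Linear(dims, dims)      # stand-in for the real attention/MoE sub-block
        self.residual_scale = mx.ones((dims,))   # learned scale, initialized to 1

    def __call__(self, x: mx.array) -> mx.array:
        # y = x + s * f(norm(x)): scale the branch output before the residual add
        return x + self.residual_scale * self.branch(self.norm(x))

# Quick shape check on a (batch, seq, dims) activation.
block = ScaledResidualBlock(dims=64)
y = block(mx.random.normal((1, 8, 64)))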

The converted/quantized models are available on my HF account (https://huggingface.co/kyr0):

  • https://huggingface.co/kyr0/zaya1-base-8b-MLX
  • https://huggingface.co/kyr0/zaya1-base-8b-8bit-MLX
  • https://huggingface.co/kyr0/zaya1-base-8b-4bit-MLX

Enjoy!

Transparency: I'm not affiliated with Zyphra.


Repro for convert/quants:

mlx_lm.convert \
  --hf-path Zyphra/ZAYA1-8B \
  --mlx-path "./zaya1-base-8b-MLX" \
  --dtype bfloat16
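
A quick way to sanity-check the converted model from Python is mlx_lm's load/generate API. This is just a sketch, assuming the local output path from the convert step above:

from mlx_lm import load, generate

model, tokenizer = load("./zaya1-base-8b-MLX")
print(generate(model, tokenizer, prompt="Solve x+2=7. Answer only.", max_tokens=32))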

Quantization

Tested with 8 and 4 bits, group size 64. Lower bit widths produce garbage output.

A quick test with AWQ quantization led to OOM in all cases, and dynamic quant as well. I'm too GPU-poor on my Mac, guys... does anyone have a Mac Pro?

mlx_lm.convert \
    --hf-path Zyphra/ZAYA1-8B \
    --mlx-path "./zaya1-base-8b-8bit-MLX" \
    --dtype bfloat16 \
    -q --q-bits 8 --q-group-size 64

Server + Test

mlx_lm.server \
  --model "./zaya1-base-8b-8bit-MLX" \
  --host 127.0.0.1 \
  --port 8080 \
  --temp 0.0 \
  --top-p 1.0 \
  --max-tokens 8192 \
  --prefill-step-size 512 \
  --prompt-cache-size 0

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "./zaya1-base-8b-8bit-MLX",
    "messages": [{"role": "user", "content": "Solve x+2=7. Answer only."}],
    "temperature": 0,
    "max_tokens": 1024
  }'
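
Since mlx_lm.server exposes an OpenAI-compatible endpoint, the same request can also be sent with the official openai Python client instead of curl. A minimal sketch (the api_key is a dummy value; the local server does not check it):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="./zaya1-base-8b-8bit-MLX",
    messages=[{"role": "user", "content": "Solve x+2=7. Answer only."}],
    temperature=0,
    max_tokens=1024,
)
print(resp.choices[0].message.content)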

kyr0 added 2 commits May 9, 2026 03:38
…ttention (CCA) and Residual Scaling as well as quantized expert layer switching in SwiGLU scenarios, and general convert/quantization support for ZAYA.
…cing an empty self.cache list; so we correctly return True in this case generically
@beamivalice

I'm testing this model with your 8-bit version from HF. It's good, despite the warning below.

Prompt: 32 tokens, 91.910 tokens-per-sec
Generation: 1024 tokens, 30.248 tokens-per-sec
Peak memory: 9.628 GB
M1 Max 32 cores

[transformers] You are using a model of type zaya to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.

kyr0 (Author) commented May 11, 2026

@beamivalice Right? I like the quality as well. They also released a 78B preview model the other day, as well as an 8B VL model, which @Blaizzy added support for yesterday in Blaizzy/mlx-vlm#1159.

I don't have enough VRAM, but maybe someone with a Mac Pro could convert/quantize the big model based on this PR. The mlx_lm.convert script is all one needs, and as far as I understand it should work without any issues.

Now that we have a second implementation from @Blaizzy, I think it would make sense to discuss the pros and cons of both and agree on the best version. I think I shaped this implementation well for this repo, and it seems to do exactly what it's supposed to do. But if there's anything I can do here to help, I'm all ears :) @0k1, could you please take a look and check whether my work here is fine?

kyr0 (Author) commented May 11, 2026

One idea that comes to mind is whether we want to implement kwargs-based pass-through of a new parameter to control CCA and MoE details. The strength of this architecture is that you can get much better quality out of the model by investing more compute time. Having the compute-time factor controllable per request would be really nice. We might also map reasoning_effort to pre-defined factor values, but full control would be pretty cool, I guess.
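
To make that concrete, here is a purely hypothetical sketch of how a per-request compute factor could be resolved from either an explicit parameter or a reasoning_effort preset. The name compute_factor and the preset values are made up for illustration; nothing like this exists in mlx-lm today:

# Hypothetical only: parameter names and preset values are invented for illustration.
EFFORT_TO_FACTOR = {"low": 0.5, "medium": 1.0, "high": 2.0}

def resolve_compute_factor(request: dict) -> float:
    # An explicit per-request factor wins; otherwise map reasoning_effort; default to 1.0.
    if "compute_factor" in request:
        return float(request["compute_factor"])
    return EFFORT_TO_FACTOR.get(request.get("reasoning_effort", "medium"), 1.0)

print(resolve_compute_factor({"reasoning_effort": "high"}))  # 2.0
print(resolve_compute_factor({"compute_factor": 1.5}))       # 1.5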
