Support for Zyphra/ZAYA1-base #1261
…Compressed Convolutional Attention (CCA) and Residual Scaling, as well as quantized expert layer switching in SwiGLU scenarios, and general convert/quantization support for ZAYA1.
…cing an empty self.cache list, so we correctly return True in this case generically.
I'm testing this model with your 8-bit version on HF. It's good, despite some warnings below. Prompt: 32 tokens, 91.910 tokens-per-sec
@beamivalice Right? I like the quality as well. They also released a 78B preview model the other day, and an 8B VL model, which @Blaizzy implemented support for yesterday in Blaizzy/mlx-vlm#1159. I don't have enough VRAM; maybe someone with a Mac Pro could convert/quantize the big model based on this PR. Now that we have a second implementation from @Blaizzy, I think it would make sense to discuss the pros and cons of both and agree on the best version. I think I shaped this implementation well for this repo, and it seems to do exactly what it's supposed to do. But if there's anything I can do here to help, I'm all ears :) Maybe @0k1 could take a look and check whether my work here is fine?
One idea that comes to mind is whether we want to implement a kwargs-based pass-through of a new parameter to control CCA and MoE details. The strength of this architecture is that you can get much better quality out of the model by investing more compute time, so having that compute-time factor controllable per request would be really nice. We might have …
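To make the idea concrete, here is a minimal hypothetical sketch of such a pass-through; the kwarg name (`num_experts_per_tok`) and the routing details are illustrative assumptions, not what this PR implements:

```python
# Hypothetical sketch only, not part of this PR: forwarding a per-request
# kwarg down to the MoE block so callers can trade extra compute for quality.
import mlx.core as mx
import mlx.nn as nn


class MoEBlock(nn.Module):
    def __init__(self, dim: int, num_experts: int, default_top_k: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.default_top_k = default_top_k

    def __call__(self, x, num_experts_per_tok=None):
        # A per-request override falls back to the converted config's value.
        k = num_experts_per_tok or self.default_top_k
        scores = mx.softmax(self.router(x), axis=-1)
        # A real implementation would dispatch x to the top-k experts here;
        # returning the top-k routing weights keeps the sketch short.
        return mx.topk(scores, k, axis=-1)


block = MoEBlock(dim=64, num_experts=8, default_top_k=2)
x = mx.random.normal((1, 4, 64))
print(block(x).shape)                         # default compute budget
print(block(x, num_experts_per_tok=4).shape)  # spend more compute per token
```

The server would only need to thread such a kwarg from the request body through `generate` down to the blocks that understand it.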
I found this novel architecture quite interesting, so I clean-room implemented support for it after reading and debugging the vLLM patch pushed two days ago.
This implementation works well with full precision, 8-bit quantization, and 4-bit quantization on a MacBook Air M4 (24 GB).
I covered testing with manual tests of `mlx_lm.benchmark`, `mlx_lm.convert` and `mlx_lm.server` as well.

Novel features added: Compressed Convolutional Attention (CCA), Residual Scaling, and quantized expert layer switching in SwiGLU scenarios, as described in the technical report.
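For readers unfamiliar with residual scaling, here is a generic, heavily hedged sketch of the idea; the LayerScale-style form and the `ScaledResidual` name are illustrative assumptions, not ZAYA1's exact formulation (see the technical report for that):

```python
# Generic residual-scaling sketch, only to illustrate the concept:
# a learned scale on the sublayer branch before it rejoins the residual stream.
import mlx.core as mx
import mlx.nn as nn


class ScaledResidual(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = mx.ones((dim,))  # learned scale on the sublayer branch

    def __call__(self, x, sublayer_out):
        # Scale the branch output before adding it back to the residual stream.
        return x + self.alpha * sublayer_out
```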
The converted/quantized models are available on my HF account: https://huggingface.co/kyr0
e.g. https://huggingface.co/kyr0/zaya1-base-8b-MLX
https://huggingface.co/kyr0/zaya1-base-8b-8bit-MLX
https://huggingface.co/kyr0/zaya1-base-8b-4bit-MLX
Enjoy!
Transparency: I'm not affiliated with Zyphra.
Repro for convert/quants:
```
mlx_lm.convert \
  --hf-path Zyphra/ZAYA1-8B \
  --mlx-path "./zaya1-base-8b-MLX" \
  --dtype bfloat16
```

Quantization
Tested with 8 and 4 bits, group size 64. Lower bit-widths lead to garbage results.
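The quantized variants can also be produced from Python; a minimal sketch assuming mlx_lm's documented `convert()` API (the CLI equivalent adds `-q --q-bits 8 --q-group-size 64` to the command above):

```python
# Sketch: producing the 8-bit variant via mlx_lm's Python convert() API.
from mlx_lm import convert

convert(
    "Zyphra/ZAYA1-8B",
    mlx_path="./zaya1-base-8b-8bit-MLX",
    quantize=True,
    q_bits=8,        # 4 also works; lower bit-widths degrade badly
    q_group_size=64,
    dtype="bfloat16",
)
```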
A quick test with AWQ quantization led to OOM in all cases, as did dynamic quant. I'm too GPU-poor on my Mac, guys... Does anyone have a Mac Pro?
Server + Test
```
mlx_lm.server \
  --model "./zaya1-base-8b-8bit-MLX" \
  --host 127.0.0.1 \
  --port 8080 \
  --temp 0.0 \
  --top-p 1.0 \
  --max-tokens 8192 \
  --prefill-step-size 512 \
  --prompt-cache-size 0
```
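And a quick smoke test against the server's OpenAI-compatible chat endpoint; the prompt is just an example, and host/port match the command above:

```python
# Smoke test: send one chat request to the running mlx_lm.server instance.
import json
import urllib.request

payload = {
    "model": "./zaya1-base-8b-8bit-MLX",
    "messages": [{"role": "user", "content": "Hello, ZAYA1!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```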