-
Notifications
You must be signed in to change notification settings - Fork 227
Open
Description
Describe the bug
I am consistently getting a zsh: killed (SIGKILL by macOS Jetsam) when trying to run the Qwen3.5-397B-A17B model, even after successfully repacking the experts to 2-bit.
It seems that on the M2 architecture (which lacks hardware Dynamic Caching present in M3/M4), the strict wired_memory limits for the Metal GPU buffers are being hit, causing Jetsam to instantly kill the process to prevent a kernel panic.
Hardware & OS
- Device: Mac Studio (Apple M2 Max)
- Unified Memory (RAM): 32 GB
- OS: macOS
Steps to reproduce
- Downloaded
mlx-community/Qwen3.5-397B-A17B-4bit. - Repacked experts to 2-bit using
python repack_experts_2bit.py(Completed successfully). - Ran the inference engine:
./infer --serve 8000 --2bit --k 1
Debugging & Logs
I wrote a background script to monitor the exact RSS memory footprint just before the crash:
./infer --serve 8000 --2bit --k 1 &
PID=$!
while kill -0 $PID 2>/dev/null; do
ps -o rss= -p $PID | awk '{print "Memory: " $1/1024 " MB"}'
sleep 0.1
done
Output:
[metal] Device: Apple M2 Max
[metal] Shader compile: 1 ms
[metal] GPU attention buffers: 15 KV caches (16.8 MB each), scores buf 134.2 MB
[metal] Delta-net GPU buffers: 45 layers (195.4 MB state + 0.2 MB scratch)
[metal] Inference pipelines ready (multi-expert[8] + shared buffers allocated)
=== Qwen3.5-397B-A17B Metal Inference Engine ===
...
Quant: 2-bit experts (3932160 bytes each)
...
[weights] mmap'd 5.52 GB from model_weights.bin
[metal] Weight file wrapped as Metal buffer (5.52 GB)
...
Memory: 22834.4 MB
Memory: 23249.4 MB
Memory: 23535.3 MB
Memory: 23746.4 MB
Memory: 0 MB
[1] + killed ./infer --serve 8000 --2bit --k 1Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels