# Flash-MoE: Running Massive MoE Models on a Laptop

> **[Read the paper](paper/flash_moe.pdf)** — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine for **Qwen3.5 Mixture-of-Experts** models on Apple Silicon. Runs models from 35B to 397B parameters on machines with as little as 24GB RAM, streaming expert weights from SSD through a custom Metal compute pipeline.

No Python runtime. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders. Model architecture is auto-detected from HuggingFace `config.json` — switch models with a single `--model` flag.

## Compatible Models

Any **Qwen3.5 MoE** model with MLX quantization (`model_type: qwen3_5_moe`) is supported. Use the model manager to discover and download compatible models:

| Model | Params | Active | Quant | Disk | Min RAM |
|-------|--------|--------|-------|------|---------|
| Qwen3.5-35B-A3B | 35B | 3B | 4-bit | ~18GB | 24GB |
| Qwen3.5-35B-A3B | 35B | 3B | 8-bit | ~35GB | 48GB |
| Qwen3.5-122B-A10B | 122B | 10B | 4-bit | ~65GB | 48GB |
| Qwen3.5-397B-A17B | 397B | 17B | 4-bit | ~209GB | 48GB |
| Qwen3.5-397B-A17B | 397B | 17B | 6-bit | ~280GB | 64GB |
| Qwen3.5-397B-A17B | 397B | 17B | 8-bit | ~397GB | 96GB |

The engine auto-detects architecture, dimensions, expert counts, quantization, and layer types from `config.json`. No recompilation needed.
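A minimal Python sketch of what that auto-detection involves (the engine's actual parser is C; the field names below follow common HuggingFace conventions and are assumptions, not confirmed keys from the real `config.json`):

```python
import json

def detect_model(config_path):
    """Read the fields the engine needs from a HuggingFace config.json.

    Field names follow the usual HuggingFace MoE conventions and are
    assumptions -- the real C parser may read different keys.
    """
    with open(config_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") != "qwen3_5_moe":
        raise ValueError(f"unsupported model_type: {cfg.get('model_type')}")
    return {
        "layers": cfg["num_hidden_layers"],
        "hidden_dim": cfg["hidden_size"],
        "num_experts": cfg["num_experts"],
        "experts_per_token": cfg["num_experts_per_tok"],
    }
```

Because everything is read at startup, switching between the 35B and 397B checkpoints is just a different `--model` path.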

## Results

![Progress](progress.png)

Results below are for Qwen3.5-397B-A17B on MacBook Pro M3 Max (48GB):

| Configuration | tok/s | Quality | Notes |
|--------------|-------|---------|-------|
| 4-bit experts, FMA kernel | **4.36** | Excellent | Current best. Full tool calling. 209GB on disk. |

## Architecture

Qwen3.5 MoE models use a hybrid attention architecture with GatedDeltaNet (linear attention) and standard full attention layers, each containing a Mixture-of-Experts MLP. Model dimensions, expert counts, and layer types vary per model and are read from `config.json` at startup. For example, the 397B model has 60 layers (45 linear + 15 full), 512 experts (K=4 active), hidden dim 4096; the 35B model has 40 layers (30 linear + 10 full), 256 experts (K=8 active), hidden dim 2048.
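Both layer counts are consistent with a regular one-full-attention-layer-in-four interleave, sketched below. The interleave pattern itself is an assumption for illustration; the engine reads the actual per-layer types from `config.json`:

```python
def layer_types(num_layers, full_attn_every=4):
    """Hypothetical interleave: one full-attention layer every N layers.

    Illustrative only -- the real per-layer pattern comes from config.json,
    not a hard-coded rule.
    """
    return ["full" if (i + 1) % full_attn_every == 0 else "linear"
            for i in range(num_layers)]
```

With `num_layers=60` this yields the 397B split (45 linear + 15 full); with `num_layers=40` it yields the 35B split (30 linear + 10 full).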

### Key Techniques


On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
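The serial schedule can be written as a plain per-layer loop. This is an illustrative Python sketch of the ordering only (the engine does this in C with Metal command buffers; names here are hypothetical):

```python
def decode_layer_schedule(num_layers):
    """Serial per-layer schedule: GPU attention, then SSD expert read,
    then GPU MoE -- never overlapped, since SSD DMA and GPU dequant
    contend for the same memory controller."""
    trace = []
    for layer in range(num_layers):
        trace.append(("gpu", "attention", layer))     # attention / delta-net kernels
        trace.append(("ssd", "read_experts", layer))  # fetch the K routed experts
        trace.append(("gpu", "moe", layer))           # dequant + expert MLP
    return trace
```

The point of the trace structure: no `ssd` step ever runs concurrently with a `gpu` step, which is the hardware-optimal arrangement described above.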

## Model Manager

The model manager helps you find, download, and validate compatible models:

```bash
# List local models and search HuggingFace for compatible ones
python model_manager.py

# Search HuggingFace only
python model_manager.py --search

# List local models only
python model_manager.py --local

# Download a specific model
python model_manager.py --download mlx-community/Qwen3.5-35B-A3B-4bit

# Check if a local model is compatible
python model_manager.py --check /path/to/model
```
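A sketch of what the `--check` step might verify, based on the compatibility rule above (`model_type: qwen3_5_moe` plus MLX quantization metadata). The real `model_manager.py` may apply more checks:

```python
import json
import os

def is_compatible(model_dir):
    """Rough compatibility check: config.json exists, declares the
    qwen3_5_moe model type, and carries an MLX quantization block.
    A sketch only -- the real script may also validate tensor names,
    quant bit-width, etc."""
    cfg_path = os.path.join(model_dir, "config.json")
    if not os.path.exists(cfg_path):
        return False, "no config.json"
    with open(cfg_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") != "qwen3_5_moe":
        return False, "model_type is %r" % cfg.get("model_type")
    if "quantization" not in cfg:
        return False, "no MLX quantization block"
    return True, "ok"
```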

After downloading, prepare the model for inference:

```bash
# 1. Pack expert weights into per-expert files
python repack_experts.py --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit

# 2. Extract non-expert weights into a single binary
python metal_infer/extract_weights.py --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit

# 3. Run inference
cd metal_infer && ./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt "Hello" --tokens 20
```
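Conceptually, step 1 splits the monolithic safetensors experts into one file per (layer, expert) pair, so the engine can later read exactly the routed experts and nothing else. A sketch with a hypothetical naming scheme (the real `repack_experts.py` layout and file names may differ):

```python
import os

def pack_experts(out_dir, expert_blobs):
    """Write one file per (layer, expert) pair.

    `expert_blobs` maps (layer_idx, expert_idx) -> packed quantized bytes.
    The l{layer}_e{expert}.bin naming is hypothetical, for illustration.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for (layer, expert), blob in sorted(expert_blobs.items()):
        path = os.path.join(out_dir, "l%02d_e%03d.bin" % (layer, expert))
        with open(path, "wb") as f:
            f.write(blob)
        paths.append(path)
    return paths
```

Per-expert files are what makes SSD streaming cheap: a token that routes to K experts in a layer touches only K small files, and the OS page cache keeps the hot ones resident.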

## Quick Start

```bash
cd metal_infer
make

# Run with a specific model (auto-detects architecture from config.json)
./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit \
--prompt "Explain quantum computing" --tokens 100

# Or set FLASH_MOE_MODEL to avoid passing --model every time
export FLASH_MOE_MODEL=~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster but breaks tool calling)
```
## Project Structure

```
model_manager.py # Model discovery, download, and compatibility checking
repack_experts.py # 4-bit expert packing from safetensors
progress.py # Results visualization (Q2/Q4 tracks)
results.tsv # Experiment log (58 experiments)

metal_infer/
infer.m # Complete inference engine (~7500 lines)
# - ModelConfig struct + config.json parser
# - Runtime model auto-detection
# - Metal compute pipeline
shaders.metal # Metal compute kernels (~1200 lines)
chat.m # Interactive chat TUI with tool calling
tokenizer.h # C BPE tokenizer (single-header, 449 lines)
extract_weights.py # Creates model_weights.bin from safetensors
repack_experts_2bit.py # 4-bit → 2-bit expert requantization
train_predictor.py # Expert routing prediction analysis
model_weights.bin # Non-expert weights (model-specific, mmap'd)
model_weights.json # Tensor manifest
vocab.bin # Vocabulary for token decoding
tokenizer.bin # Pre-exported BPE tokenizer data

```

## What We Tried (and What Worked)
## Safety

This is a primary development machine. The engine explicitly controls memory:
- Non-expert weights: model-dependent (e.g., 5.5GB for 397B, ~1.5GB for 35B, mmap'd read-only)
- Metal scratch buffers: ~200MB
- Total engine footprint: ~6GB for the 397B model on a 48GB machine, leaving ~42GB for OS + page cache
- Expert data streams from SSD on demand — no full model load required
- No custom caches. Trust the OS page cache for expert LRU.
- Minimum RAM: 24GB (35B-A3B 4-bit), 48GB (397B-A17B 4-bit)
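The "mmap'd read-only" point, expressed in Python terms (the engine uses C `mmap` with read-only protection; this sketch shows the same idea):

```python
import mmap

def map_weights(path):
    """Map a weights file read-only.

    Pages load lazily on first touch and the OS evicts them under memory
    pressure -- which is why no custom cache is needed and there is no
    OOM risk from the mapping itself.
    """
    f = open(path, "rb")  # kept open for the lifetime of the mapping
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Reads through the mapping behave like ordinary byte access; writes raise an error, so the checkpoint on disk can never be corrupted by the engine.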