# Flash-MoE: Running Massive MoE Models on a Laptop

> **[Read the paper](paper/flash_moe.pdf)** — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine for **Qwen3.5 Mixture-of-Experts** models on Apple Silicon. Runs models from 35B to 397B parameters on machines with as little as 24GB RAM, streaming expert weights from SSD through a custom Metal compute pipeline.

No Python runtime. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders. Model architecture is auto-detected from HuggingFace `config.json` — switch models with a single `--model` flag.

## Compatible Models

Any **Qwen3.5 MoE** model with MLX quantization (`model_type: qwen3_5_moe`) is supported. Use the model manager to discover and download compatible models:

| Model | Params | Active | Quant | Disk | Min RAM |
|-------|--------|--------|-------|------|---------|
| Qwen3.5-35B-A3B | 35B | 3B | 4-bit | ~18GB | 24GB |
| Qwen3.5-35B-A3B | 35B | 3B | 8-bit | ~35GB | 48GB |
| Qwen3.5-122B-A10B | 122B | 10B | 4-bit | ~65GB | 48GB |
| Qwen3.5-397B-A17B | 397B | 17B | 4-bit | ~209GB | 48GB |
| Qwen3.5-397B-A17B | 397B | 17B | 6-bit | ~280GB | 64GB |
| Qwen3.5-397B-A17B | 397B | 17B | 8-bit | ~397GB | 96GB |

The engine auto-detects architecture, dimensions, expert counts, quantization, and layer types from `config.json`. No recompilation needed.
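A minimal Python sketch of what that auto-detection involves (the engine's actual parser is C; the field names below follow common HuggingFace conventions and are assumptions, not confirmed keys from the real `config.json`):

```python
import json

def detect_model(config_path):
    """Read the fields the engine needs from a HuggingFace config.json.

    Field names follow the usual HuggingFace MoE conventions and are
    assumptions -- the real C parser may read different keys.
    """
    with open(config_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") != "qwen3_5_moe":
        raise ValueError(f"unsupported model_type: {cfg.get('model_type')}")
    return {
        "layers": cfg["num_hidden_layers"],
        "hidden_dim": cfg["hidden_size"],
        "num_experts": cfg["num_experts"],
        "experts_per_token": cfg["num_experts_per_tok"],
    }
```

Because everything is read at startup, switching between the 35B and 397B checkpoints is just a different `--model` path.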

## Results

![Progress](progress.png)

Results below are for Qwen3.5-397B-A17B on MacBook Pro M3 Max (48GB):

| Configuration | tok/s | Quality | Notes |
|--------------|-------|---------|-------|
| 4-bit experts, FMA kernel | **4.36** | Excellent | Current best. Full tool calling. 209GB on disk. |

## Architecture

Qwen3.5 MoE models use a hybrid attention architecture with GatedDeltaNet (linear attention) and standard full attention layers, each containing a Mixture-of-Experts MLP. Model dimensions, expert counts, and layer types vary per model and are read from `config.json` at startup. For example, the 397B model has 60 layers (45 linear + 15 full), 512 experts (K=4 active), hidden dim 4096; the 35B model has 40 layers (30 linear + 10 full), 256 experts (K=8 active), hidden dim 2048.
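Both layer counts are consistent with a regular one-full-attention-layer-in-four interleave, sketched below. The interleave pattern itself is an assumption for illustration; the engine reads the actual per-layer types from `config.json`:

```python
def layer_types(num_layers, full_attn_every=4):
    """Hypothetical interleave: one full-attention layer every N layers.

    Illustrative only -- the real per-layer pattern comes from config.json,
    not a hard-coded rule.
    """
    return ["full" if (i + 1) % full_attn_every == 0 else "linear"
            for i in range(num_layers)]
```

With `num_layers=60` this yields the 397B split (45 linear + 15 full); with `num_layers=40` it yields the 35B split (30 linear + 10 full).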

### Key Techniques


On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
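The serial schedule can be written as a plain per-layer loop. This is an illustrative Python sketch of the ordering only (the engine does this in C with Metal command buffers; names here are hypothetical):

```python
def decode_layer_schedule(num_layers):
    """Serial per-layer schedule: GPU attention, then SSD expert read,
    then GPU MoE -- never overlapped, since SSD DMA and GPU dequant
    contend for the same memory controller."""
    trace = []
    for layer in range(num_layers):
        trace.append(("gpu", "attention", layer))     # attention / delta-net kernels
        trace.append(("ssd", "read_experts", layer))  # fetch the K routed experts
        trace.append(("gpu", "moe", layer))           # dequant + expert MLP
    return trace
```

The point of the trace structure: no `ssd` step ever runs concurrently with a `gpu` step, which is the hardware-optimal arrangement described above.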

## Model Manager

The model manager helps you find, download, and validate compatible models:

```bash
# List local models and search HuggingFace for compatible ones
python model_manager.py

# Search HuggingFace only
python model_manager.py --search

# List local models only
python model_manager.py --local

# Download a specific model
python model_manager.py --download mlx-community/Qwen3.5-35B-A3B-4bit

# Check if a local model is compatible
python model_manager.py --check /path/to/model
```
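A sketch of what the `--check` step might verify, based on the compatibility rule above (`model_type: qwen3_5_moe` plus MLX quantization metadata). The real `model_manager.py` may apply more checks:

```python
import json
import os

def is_compatible(model_dir):
    """Rough compatibility check: config.json exists, declares the
    qwen3_5_moe model type, and carries an MLX quantization block.
    A sketch only -- the real script may also validate tensor names,
    quant bit-width, etc."""
    cfg_path = os.path.join(model_dir, "config.json")
    if not os.path.exists(cfg_path):
        return False, "no config.json"
    with open(cfg_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") != "qwen3_5_moe":
        return False, "model_type is %r" % cfg.get("model_type")
    if "quantization" not in cfg:
        return False, "no MLX quantization block"
    return True, "ok"
```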

After downloading, prepare the model for inference:

```bash
# 1. Pack expert weights into per-expert files
python repack_experts.py --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit

# 2. Extract non-expert weights into a single binary
python metal_infer/extract_weights.py --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit

# 3. Run inference
cd metal_infer && ./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt "Hello" --tokens 20
```
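Conceptually, step 1 splits the monolithic safetensors experts into one file per (layer, expert) pair, so the engine can later read exactly the routed experts and nothing else. A sketch with a hypothetical naming scheme (the real `repack_experts.py` layout and file names may differ):

```python
import os

def pack_experts(out_dir, expert_blobs):
    """Write one file per (layer, expert) pair.

    `expert_blobs` maps (layer_idx, expert_idx) -> packed quantized bytes.
    The l{layer}_e{expert}.bin naming is hypothetical, for illustration.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for (layer, expert), blob in sorted(expert_blobs.items()):
        path = os.path.join(out_dir, "l%02d_e%03d.bin" % (layer, expert))
        with open(path, "wb") as f:
            f.write(blob)
        paths.append(path)
    return paths
```

Per-expert files are what makes SSD streaming cheap: a token that routes to K experts in a layer touches only K small files, and the OS page cache keeps the hot ones resident.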

## Quick Start

```bash
cd metal_infer
make

# Run with a specific model (auto-detects architecture from config.json)
./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit \
--prompt "Explain quantum computing" --tokens 100

# Or set FLASH_MOE_MODEL to avoid passing --model every time
export FLASH_MOE_MODEL=~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster but breaks tool calling)
```
## Project Structure

```
model_manager.py # Model discovery, download, and compatibility checking
repack_experts.py # 4-bit expert packing from safetensors
progress.py # Results visualization (Q2/Q4 tracks)
results.tsv # Experiment log (58 experiments)

metal_infer/
infer.m # Complete inference engine (~7500 lines)
# - ModelConfig struct + config.json parser
# - Runtime model auto-detection
# - Metal compute pipeline
shaders.metal # Metal compute kernels (~1200 lines)
chat.m # Interactive chat TUI with tool calling
tokenizer.h # C BPE tokenizer (single-header, 449 lines)
extract_weights.py # Creates model_weights.bin from safetensors
repack_experts_2bit.py # 4-bit → 2-bit expert requantization
train_predictor.py # Expert routing prediction analysis
model_weights.bin # Non-expert weights (model-specific, mmap'd)
model_weights.json # Tensor manifest
vocab.bin # Vocabulary for token decoding
tokenizer.bin # Pre-exported BPE tokenizer data

```

## What We Tried (and What Worked)
## Safety

This is a primary development machine. The engine explicitly controls memory:
- Non-expert weights: model-dependent (e.g., 5.5GB for 397B, ~1.5GB for 35B, mmap'd read-only)
- Metal scratch buffers: ~200MB
- Total engine footprint: ~6GB for the 397B model on a 48GB machine, leaving ~42GB for OS + page cache
- Expert data streams from SSD on demand — no full model load required
- No custom caches. Trust the OS page cache for expert LRU.
- Minimum RAM: 24GB (35B-A3B 4-bit), 48GB (397B-A17B 4-bit)
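The "mmap'd read-only" point, expressed in Python terms (the engine uses C `mmap` with read-only protection; this sketch shows the same idea):

```python
import mmap

def map_weights(path):
    """Map a weights file read-only.

    Pages load lazily on first touch and the OS evicts them under memory
    pressure -- which is why no custom cache is needed and there is no
    OOM risk from the mapping itself.
    """
    f = open(path, "rb")  # kept open for the lifetime of the mapping
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Reads through the mapping behave like ordinary byte access; writes raise an error, so the checkpoint on disk can never be corrupted by the engine.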