feat: runtime model config from HuggingFace config.json #3
Open
Alexintosh wants to merge 7 commits into danveloper:main from
Conversation
Spec for replacing ~40 hardcoded #define model constants with a runtime ModelConfig struct populated from HuggingFace config.json, enabling model switching via --model flag without recompilation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing arrays (g_lz4_index, g_pred_experts, g_pred_count, stack VLAs), full_attn_interval fallback, thread safety invariant, MODEL_PATH_DEFAULT handling, MAX_BATCH_SLOTS coupling note, and clarify chat.m needs zero changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ModelConfig struct, compute_expert_offsets(), and load_model_config() that parses HuggingFace config.json + tokenizer.json via NSJSONSerialization. Old #defines still present. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove ~54 model-specific #define constants and replace ~960 occurrences with cfg.* runtime struct fields. Convert 13 static/stack arrays to dynamic allocation. Parse config.json + tokenizer.json at startup via NSJSONSerialization. Expert byte offsets computed from model dimensions and quantization params. Switching models now requires only --model flag, no recompilation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
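The expert-offset computation mentioned in this commit can be sketched as follows. This is an illustrative model, not the PR's actual code: it assumes MLX-style 4-bit affine quantization where each quantized matrix stores packed integer weights plus one fp16 scale and one fp16 bias per quantization group, and a standard MoE expert with gate/up/down projections. Function names, exact layout, and parameter order are assumptions.

```python
# Hypothetical sketch of deriving per-expert byte offsets from model
# dimensions + quantization params (bits, group_size), as compute_expert_offsets()
# in infer.m is described as doing. Layout assumptions: packed integer weights
# followed by fp16 scales and fp16 biases, one of each per group.
def quantized_matrix_bytes(rows: int, cols: int, bits: int = 4, group_size: int = 64) -> int:
    packed = rows * cols * bits // 8          # packed low-bit weights
    groups = rows * (cols // group_size)      # one (scale, bias) pair per group
    return packed + 2 * groups * 2            # scales + biases, 2 bytes (fp16) each

def expert_stride_bytes(hidden: int, intermediate: int, bits: int = 4, group_size: int = 64) -> int:
    # gate_proj and up_proj have shape (intermediate, hidden);
    # down_proj has shape (hidden, intermediate).
    return (2 * quantized_matrix_bytes(intermediate, hidden, bits, group_size)
            + quantized_matrix_bytes(hidden, intermediate, bits, group_size))
```

With a stride like this, the offset of expert *k* in a layer's weight blob is simply `k * expert_stride_bytes(...)`, which is why the offsets can be computed at startup instead of baked in as constants.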
Generalize file header comment to describe multi-model support. Update startup banner from hardcoded model name to "Flash-MoE" with dynamic config path display. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e models

Lists local HF-cached models with compatibility check, searches HuggingFace for compatible Qwen3.5 MoE models (35B-A3B, 122B-A10B, 397B-A17B) with MLX quantization, and supports downloading via huggingface-cli or huggingface_hub.

Usage:
python model_manager.py                # list local + remote
python model_manager.py --local        # local only
python model_manager.py --search       # remote only
python model_manager.py --download <repo>
python model_manager.py --check <path>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
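A compatibility check of the kind `--check <path>` performs could look like the sketch below. This is a hypothetical reconstruction, not the script's actual logic: the field names follow standard HuggingFace `config.json` conventions, but which fields and values the real script accepts is an assumption.

```python
import json
import pathlib

# Hypothetical sketch of a model-directory compatibility check: a directory is
# "compatible" if its config.json exists, declares MoE expert counts, and
# carries quantization metadata. Accepted field names are assumptions based on
# common HuggingFace / MLX config.json conventions.
def check_model(path: str) -> bool:
    cfg_file = pathlib.Path(path) / "config.json"
    if not cfg_file.is_file():
        return False
    cfg = json.loads(cfg_file.read_text())
    has_moe = "num_experts" in cfg or "num_local_experts" in cfg
    has_quant = "quantization" in cfg or "quantization_config" in cfg
    return has_moe and has_quant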
Add compatible models table, model manager usage instructions, updated quick start with --model flag and FLASH_MOE_MODEL env var, revised project structure, and generalized architecture description. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Replaces ~40 hardcoded `#define` model constants with a runtime `ModelConfig` struct populated from HuggingFace `config.json` at startup via NSJSONSerialization. Switch between Qwen3.5 models (35B, 122B, 397B) with just `--model <path>`; no recompilation needed.
- Adds a `model_manager.py` utility to list local compatible models, search HuggingFace for MLX-quantized Qwen3.5 MoE models, download them, and validate compatibility.
- Documents `--model` flag usage and the `FLASH_MOE_MODEL` env var.

What changed in `infer.m`

- `ModelConfig` struct + `load_model_config()` parse `config.json` (architecture, quantization, layer types, RoPE, EOS tokens) and `tokenizer.json` (think tokens)
- `compute_expert_offsets()` derives all expert byte offsets from dimensions + quantization params
- `alloc_tracking_arrays()` dynamically allocates all tracking arrays (expert freq, cache state, predictions, layer cache) previously sized by compile-time constants
- `#define` references replaced with `cfg.*` fields via helper macros (`FREQ()`, `CACHE_SEEN()`, `PRED_EXPERT()`, etc.)
- `MetalCtx` buffer arrays converted from fixed-size to dynamically allocated (`__strong` ARC pointers)

Test plan

- `cd metal_infer && make`
- `./infer --model ~/.cache/huggingface/hub/models--mlx-community--Qwen3.5-35B-A3B-4bit --prompt "What is 2+2?" --tokens 20`
- `python model_manager.py --local` to list cached models
- `python model_manager.py --search` to find remote models
- `FLASH_MOE_MODEL` env var as default model path

🤖 Generated with Claude Code
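The model-path precedence implied above (`--model` flag, then the `FLASH_MOE_MODEL` env var, then a built-in default) can be sketched like this. The function name and the `"MODEL_PATH_DEFAULT"` placeholder are illustrative, not the project's actual identifiers.

```python
import os

# Hypothetical sketch of the model-path resolution order described in this PR:
# explicit --model flag wins, then the FLASH_MOE_MODEL environment variable,
# then a compiled-in default. "MODEL_PATH_DEFAULT" is a placeholder value.
def resolve_model_path(cli_model, default="MODEL_PATH_DEFAULT"):
    if cli_model:
        return cli_model
    return os.environ.get("FLASH_MOE_MODEL", default)
```

This keeps the env var a convenience default only; an explicit flag always overrides it, matching the usage shown in the test plan.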