30 commits
797a0ca
feat: CUDA/NVIDIA port — Qwen3.5-397B on single RTX 4090 at 2.45 tok/s
ssubbotin Mar 22, 2026
2da9b19
feat: HTTP server with OpenAI-compatible SSE streaming API
ssubbotin Mar 22, 2026
a62eaaf
feat: tool calling support (OpenAI function calling format)
ssubbotin Mar 22, 2026
0cc7d51
fix: stop after tool call, filter special tokens from output
ssubbotin Mar 22, 2026
8d8fac4
docs: HTTP server, tool calling, Claude Code integration, RAM require…
ssubbotin Mar 22, 2026
44a38bd
feat: native Anthropic Messages API (/v1/messages)
ssubbotin Mar 22, 2026
9e724a5
docs: native Anthropic API, no litellm proxy needed for Claude Code
ssubbotin Mar 22, 2026
27fe398
feat: persistent conversation state + correct special token IDs
ssubbotin Mar 22, 2026
4578b4c
docs: system prompt caching, persistent state, dual API endpoints
ssubbotin Mar 22, 2026
c673735
feat: multi-turn session persistence
ssubbotin Mar 22, 2026
4b225ca
feat: --timing per-layer phase breakdown
ssubbotin Mar 22, 2026
eae84b1
perf: default to pread + page cache instead of GDS
ssubbotin Mar 22, 2026
404522d
docs: page cache > GDS discovery, updated benchmarks to 2.52 tok/s
ssubbotin Mar 22, 2026
8fef508
feat: VRAM expert cache — 3.55 tok/s (+43% over baseline)
ssubbotin Mar 22, 2026
fb512f3
perf: frequency-weighted LRU + vec4 FMA kernel — 5.35 tok/s (+118%)
ssubbotin Mar 28, 2026
d260cca
docs: multi-hardware benchmarks (RTX 4090/3060/2080Ti), 5.35 tok/s
ssubbotin Mar 28, 2026
0b2ac4a
docs: paper revision — expanded related work, multi-hardware benchmarks
ssubbotin Mar 28, 2026
44652b6
paper: R4 profiling (1290 tokens), S1 W sensitivity, S3 working set c…
ssubbotin Mar 28, 2026
ae43327
feat: multi-model support via compile-time config
ssubbotin Mar 28, 2026
c2a0e94
feat: GGML Q4_K dequant kernel — native GGUF format support
ssubbotin Mar 28, 2026
181de41
feat: universal repack — auto-detect model dimensions from index
ssubbotin Mar 29, 2026
880fd6c
paper: R1 llama.cpp comparison — 100x+ faster on same hardware
ssubbotin Mar 29, 2026
fc6ec97
paper: R2 review fixes — std dev measurements, limitations, citations
ssubbotin Mar 29, 2026
24d4fe6
paper: single-column layout, remove all prose em dashes
ssubbotin Mar 29, 2026
45ac3b5
feat: GGUF pipeline + TMLR/arXiv paper formats
ssubbotin Mar 29, 2026
23c5585
fix: GGUF F32-to-bf16 conversion for dt_bias/conv1d/norms
ssubbotin Mar 29, 2026
71cbe3a
fix: GGUF mixed quant types + Q6_K kernel rewrite
ssubbotin Mar 29, 2026
5394961
fix: verified Q5_K/Q6_K kernels correct, remove debug code
ssubbotin Mar 29, 2026
6e2925d
debug: per-layer hidden state dumps for GGUF comparison
ssubbotin Mar 29, 2026
c7a173b
fix: L2 norm for Q/K + correct decay computation for GGUF
ssubbotin Mar 29, 2026
100 changes: 100 additions & 0 deletions CLAUDE.md
@@ -1,3 +1,103 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

**NOTE:** README.md is a symlink to this file. Keep content useful for both Claude Code and GitHub readers.

## Build & Run

Two backends: `metal_infer/` for Apple Silicon, `cuda_infer/` for NVIDIA GPUs.

### CUDA backend (NVIDIA GPUs)

See [`cuda_infer/README.md`](cuda_infer/README.md) for full documentation. Quick start:

```bash
cd cuda_infer
make # requires CUDA 12.8+ and libcufile
./infer --prompt "Hello" --tokens 20
```

### Metal backend (Apple Silicon)

All binaries are built from `metal_infer/`. Metal shaders compile at runtime (no offline metal compiler needed).

```bash
cd metal_infer
make # builds metal_infer (benchmark) + infer (inference engine)
make chat # builds chat TUI client (separate target)
make clean # remove build artifacts
```

### Inference engine (`infer`)

```bash
./infer --prompt "Hello" --tokens 50 # basic generation
./infer --prompt "Hello" --tokens 50 --2bit # 2-bit mode (faster, breaks JSON)
./infer --prompt "Hello" --tokens 20 --timing # per-layer timing breakdown
./infer --serve 8080 # HTTP server (OpenAI-compatible API)
./infer --prompt "Hello" --tokens 20 --freq # expert frequency tracking
./infer --prompt "Hello" --tokens 20 --cache-telemetry # cold vs eviction miss analysis
```
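With `--serve` running, the API can be exercised with plain curl. A minimal sketch — the `/v1/chat/completions` path and the `flash-moe` model name follow OpenAI conventions and are assumptions, not confirmed by this section:

```bash
# Assumes ./infer --serve 8080 is already running.
# Endpoint path and model name are assumptions based on the OpenAI API convention.
curl -sN --max-time 30 http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "flash-moe", "stream": true,
       "messages": [{"role": "user", "content": "Hello"}]}' \
  || echo "server not reachable (start it with: ./infer --serve 8080)"
```

The `-N` flag disables curl's output buffering so SSE chunks print as they arrive.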

### Chat TUI (`chat`)

Thin HTTP/SSE client that connects to the inference server. Sessions persist to `~/.flash-moe/sessions/<id>.jsonl`.

```bash
./chat # connect to default port
./chat --port 8000 # specify server port
./chat --show-think # show thinking tokens
./chat --resume <session_id> # resume previous session
```

### Custom system prompt

Place a file at `~/.flash-moe/system.md` to override the default system prompt used by the serve mode.
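For example (the prompt text below is just a placeholder):

```bash
# Create the override file; the serve mode picks it up in place of the default prompt.
mkdir -p ~/.flash-moe
cat > ~/.flash-moe/system.md <<'EOF'
You are a concise assistant. Answer in plain text.
EOF
```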

### MoE benchmark (`metal_infer`)

```bash
make run # single expert forward pass
make verify # Metal vs CPU reference verification
make bench # benchmark single expert (10 iterations)
make moe # full MoE forward pass (K experts, single layer)
make full # full 60-layer forward pass (K=4)
make fullbench # benchmark full 60-layer forward (3 iterations)
```

## Code Architecture

Three Objective-C files, one Metal shader file, one C header — no frameworks, no dependencies beyond Apple system libraries.

- **`infer.m`** (~7000 lines) — The entire inference engine in one file: model loading, Metal pipeline setup, the full 60-layer forward pass, tokenization, sampling, HTTP server (OpenAI-compatible SSE), tool calling, and KV cache management. This is the core of the project.
- **`shaders.metal`** (~1200 lines) — All Metal compute kernels: 4-bit/2-bit dequant matvec (multiple optimization levels), SwiGLU, RMS norm, attention (Q@K^T, softmax, scores@V), RoPE, MoE combine+residual.
- **`chat.m`** — Thin HTTP/SSE client with linenoise line editing. Connects to the `--serve` mode of `infer`. No model logic.
- **`main.m`** — Standalone MoE benchmark. Tests expert forward pass in isolation, verifies Metal vs CPU.
- **`tokenizer.h`** — Single-header C BPE tokenizer (449 lines).

### Key design constraints

- **Single-file engine**: All inference logic lives in `infer.m`. This is intentional — the entire forward pass, server, and tool calling live in one file for simplicity.
- **No custom caching**: Expert data relies entirely on the OS page cache ("Trust the OS"). Every custom cache we tried was slower.
- **Serial GPU→SSD→GPU pipeline**: On Apple Silicon unified memory, SSD DMA and GPU compute share the memory controller. Overlapping them causes GPU latency spikes. The serial pipeline is hardware-optimal.
- **Metal shaders compile at runtime** via `MTLDevice newLibraryWithSource:`. No offline `.metallib` needed (though `make metallib` exists as an option).

### Per-layer pipeline (3 command buffers)

```
CMD3(prev) → CMD1: attention projections + delta-net [GPU]
→ CPU: flush results
→ CMD2: o_proj + norm + routing + shared [GPU]
→ CPU: softmax + topK routing
→ I/O: parallel pread K=4 experts [SSD]
→ CMD3: expert forward + combine + norm [GPU, DEFERRED]
```

CMD3 is submitted without waiting (deferred). The GPU serializes CMD3(N-1) then CMD1(N) via queue ordering.

---

# Flash-MoE: Running a 397B Parameter Model on a Laptop

> **[Read the paper](paper/flash_moe.pdf)** — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.
156 changes: 156 additions & 0 deletions bench_q4k.cu
@@ -0,0 +1,156 @@
/*
* bench_q4k.cu — Compare MLX affine 4-bit vs GGML Q4_K kernel performance
*
* Build:
* nvcc -O2 -o bench_q4k bench_q4k.cu -lpthread
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// kernels.cuh expects GROUP_SIZE to be defined before it is included
#define GROUP_SIZE 64
#include "kernels.cuh"

#define CHECK_CUDA(call) do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); exit(1); \
} \
} while(0)

static inline float bf16_to_f32_h(uint16_t bf16) {
uint32_t tmp = (uint32_t)bf16 << 16;
float f; memcpy(&f, &tmp, sizeof(f)); return f;
}

int main() {
// Test dimensions matching expert projections
struct { int out_dim; int in_dim; const char *name; } tests[] = {
{1024, 4096, "gate/up_proj"},
{4096, 1024, "down_proj"},
{512, 4096, "routing"},
{248320, 4096, "lm_head"},
};

cudaDeviceProp prop;
CHECK_CUDA(cudaGetDeviceProperties(&prop, 0));
printf("GPU: %s, %d SMs, %.0f GB/s\n", prop.name,
prop.multiProcessorCount,
prop.memoryBusWidth / 8.0 * prop.memoryClockRate * 2.0 / 1e6);

int iters = 200;

for (int t = 0; t < 4; t++) {
int out_dim = tests[t].out_dim;
int in_dim = tests[t].in_dim;
printf("\n=== %s [%d, %d] ===\n", tests[t].name, out_dim, in_dim);

// Allocate input vector
float *h_x = (float *)malloc(in_dim * sizeof(float));
for (int i = 0; i < in_dim; i++) h_x[i] = (float)(rand() % 1000) / 1000.0f - 0.5f;
float *d_x, *d_out;
CHECK_CUDA(cudaMalloc(&d_x, in_dim * sizeof(float)));
CHECK_CUDA(cudaMalloc(&d_out, out_dim * sizeof(float)));
CHECK_CUDA(cudaMemcpy(d_x, h_x, in_dim * sizeof(float), cudaMemcpyHostToDevice));

// ---- MLX format ----
uint32_t packed_cols = in_dim / 8;
uint32_t num_groups = in_dim / GROUP_SIZE;
size_t mlx_w_sz = out_dim * packed_cols * sizeof(uint32_t);
size_t mlx_s_sz = out_dim * num_groups * sizeof(uint16_t);

uint32_t *h_W = (uint32_t *)malloc(mlx_w_sz);
uint16_t *h_S = (uint16_t *)malloc(mlx_s_sz);
uint16_t *h_B = (uint16_t *)malloc(mlx_s_sz);
for (size_t i = 0; i < out_dim * packed_cols; i++) h_W[i] = rand();
for (size_t i = 0; i < out_dim * num_groups; i++) {
float sv = 0.01f; uint32_t tmp; memcpy(&tmp, &sv, 4); h_S[i] = tmp >> 16;
float bv = -0.5f; memcpy(&tmp, &bv, 4); h_B[i] = tmp >> 16;
}

uint32_t *d_W; uint16_t *d_S, *d_B;
CHECK_CUDA(cudaMalloc(&d_W, mlx_w_sz));
CHECK_CUDA(cudaMalloc(&d_S, mlx_s_sz));
CHECK_CUDA(cudaMalloc(&d_B, mlx_s_sz));
CHECK_CUDA(cudaMemcpy(d_W, h_W, mlx_w_sz, cudaMemcpyHostToDevice));
CHECK_CUDA(cudaMemcpy(d_S, h_S, mlx_s_sz, cudaMemcpyHostToDevice));
CHECK_CUDA(cudaMemcpy(d_B, h_B, mlx_s_sz, cudaMemcpyHostToDevice));

// Warmup
launch_dequant_matvec(d_W, d_S, d_B, d_x, d_out, out_dim, in_dim);
CHECK_CUDA(cudaDeviceSynchronize());

cudaEvent_t start, stop;
CHECK_CUDA(cudaEventCreate(&start));
CHECK_CUDA(cudaEventCreate(&stop));

CHECK_CUDA(cudaEventRecord(start));
for (int i = 0; i < iters; i++)
launch_dequant_matvec(d_W, d_S, d_B, d_x, d_out, out_dim, in_dim);
CHECK_CUDA(cudaEventRecord(stop));
CHECK_CUDA(cudaEventSynchronize(stop));
float mlx_ms;
CHECK_CUDA(cudaEventElapsedTime(&mlx_ms, start, stop));
mlx_ms /= iters;

size_t mlx_total = mlx_w_sz + mlx_s_sz * 2;
printf(" MLX affine 4-bit: %.3f ms (%.1f GB/s, data=%.1f MB)\n",
mlx_ms, mlx_total / (mlx_ms / 1000.0) / 1e9, mlx_total / 1e6);

// ---- Q4_K format ----
uint32_t blocks_per_row = in_dim / QK_K;
size_t q4k_row_sz = blocks_per_row * Q4_K_BLOCK_SIZE;
size_t q4k_total = (size_t)out_dim * q4k_row_sz;

uint8_t *h_Q4K = (uint8_t *)malloc(q4k_total);
// Fill with synthetic Q4_K data
for (size_t row = 0; row < (size_t)out_dim; row++) {
for (uint32_t bi = 0; bi < blocks_per_row; bi++) {
uint8_t *block = h_Q4K + row * q4k_row_sz + bi * Q4_K_BLOCK_SIZE;
__half d_val = __float2half(0.01f);
__half dmin_val = __float2half(0.005f);
memcpy(block, &d_val, 2);
memcpy(block + 2, &dmin_val, 2);
for (int i = 0; i < 12; i++) block[4 + i] = rand() & 0x3F;
for (int i = 0; i < 128; i++) block[16 + i] = rand();
}
}

uint8_t *d_Q4K;
CHECK_CUDA(cudaMalloc(&d_Q4K, q4k_total));
CHECK_CUDA(cudaMemcpy(d_Q4K, h_Q4K, q4k_total, cudaMemcpyHostToDevice));

// Warmup
launch_dequant_matvec_q4k(d_Q4K, d_x, d_out, out_dim, in_dim);
CHECK_CUDA(cudaDeviceSynchronize());

CHECK_CUDA(cudaEventRecord(start));
for (int i = 0; i < iters; i++)
launch_dequant_matvec_q4k(d_Q4K, d_x, d_out, out_dim, in_dim);
CHECK_CUDA(cudaEventRecord(stop));
CHECK_CUDA(cudaEventSynchronize(stop));
float q4k_ms;
CHECK_CUDA(cudaEventElapsedTime(&q4k_ms, start, stop));
q4k_ms /= iters;

printf(" GGML Q4_K: %.3f ms (%.1f GB/s, data=%.1f MB)\n",
q4k_ms, q4k_total / (q4k_ms / 1000.0) / 1e9, q4k_total / 1e6);

float ratio = q4k_ms / mlx_ms;
printf(" Ratio Q4_K/MLX: %.2fx %s\n", ratio,
ratio < 1.05 ? "(comparable)" : ratio < 1.2 ? "(slightly slower)" : "(slower)");

CHECK_CUDA(cudaFree(d_W)); CHECK_CUDA(cudaFree(d_S)); CHECK_CUDA(cudaFree(d_B));
CHECK_CUDA(cudaFree(d_Q4K));
free(h_W); free(h_S); free(h_B); free(h_Q4K);
CHECK_CUDA(cudaEventDestroy(start)); CHECK_CUDA(cudaEventDestroy(stop));
CHECK_CUDA(cudaFree(d_x)); CHECK_CUDA(cudaFree(d_out));
free(h_x);
}

return 0;
}