perf: turbo VEC flash attention — +9% decode on CUDA via autoresearch #53

Open

signalnine wants to merge 153 commits into TheTom:feature/turboquant-kv-cache from signalnine:pr/fattn-vec-turbo-opts

Conversation


signalnine commented Apr 4, 2026

Summary

+8-9% turbo3 decode throughput on CUDA + auto-fix for Qwen symmetric turbo3.

Speed (RTX 5090, clean A/B on upstream HEAD)

| Model | Upstream (t/s) | PR (t/s) | Speedup |
| --- | --- | --- | --- |
| Qwen2.5-7B turbo3 (auto-asym) | 248.7 | 272.1 | +9.4% |
| Mistral-7B turbo3 symmetric | 239.4 | 260.4 | +8.8% |

Quality: zero regression

  • PPL: Qwen +1.7%, Mistral +2.1% (vs fp16, unchanged from upstream)
  • NIAH: 5/5 on both models

Changes

  1. FA optimizations (fattn-vec.cuh): nthreads_KQ=1, nthreads_V/=8, warp-shuffle KQ reduction, skipped sparse-V check, precomputed V centroids, __expf softmax, __launch_bounds__ occupancy 2. Found via autoresearch (59 experiments overnight).
  2. Shmem KQ LUT (fattn-vec.cuh): precompute Q×centroid once in shared memory, eliminating the per-element multiply in the KQ inner loop (see the sketch after this list). Ported from PR #36, "perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT" (credit: Madreag/turbo3-cuda).
  3. Auto-asymmetric GQA fix (llama-kv-cache.cpp): auto-upgrade K→q8_0 when GQA ≥ 6:1. Fixes Qwen symmetric turbo3 (PPL 2887→7.98). See #54, "Symmetric turbo3 KV catastrophic on Qwen2.5 (GQA 7:1) — auto-asymmetric fix proposed".
  4. HIP build fix (turbo-quant.cuh): suppress -Wnodiscard warnings.
  5. Autoresearch framework (scripts/autoresearch/): automated CUDA kernel optimization harness.
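For context on item 2, here is an illustrative CUDA sketch of the shared-memory KQ LUT idea. This is not the fattn-vec.cuh code; the kernel shape, the names, and the 8-entry turbo3 codebook are assumptions made for the example.

```cuda
// Illustrative sketch only — not the actual fattn-vec.cuh kernel.
#include <cstdint>

#define DK          128   // head dimension (assumption)
#define N_CENTROIDS 8     // 3-bit turbo3 codebook (assumption)

__global__ void kq_scores_lut(const float   * __restrict__ q,          // query, length DK
                              const float   * __restrict__ centroids,  // codebook values
                              const uint8_t * __restrict__ k_idx,      // [n_kv][DK] centroid indices
                              const float   * __restrict__ k_norm,     // [n_kv] per-block K scales
                              float         * __restrict__ kq,         // [n_kv] output scores
                              int                          n_kv) {
    // Precompute q[d] * centroid[c] once per query (DK * 8 products)
    // instead of once per (query, cached token) pair.
    __shared__ float lut[DK][N_CENTROIDS];
    for (int i = threadIdx.x; i < DK * N_CENTROIDS; i += blockDim.x) {
        lut[i / N_CENTROIDS][i % N_CENTROIDS] = q[i / N_CENTROIDS] * centroids[i % N_CENTROIDS];
    }
    __syncthreads();

    // Each thread scores one cached K vector: the hot loop is pure lookups.
    for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < n_kv; t += gridDim.x * blockDim.x) {
        float s = 0.0f;
        for (int d = 0; d < DK; ++d) {
            s += lut[d][k_idx[t * DK + d]];
        }
        kq[t] = s * k_norm[t];
    }
}
```

The point is that the Q×centroid products are computed DK×8 times per query rather than once per cached token, which removes the multiply from the KQ inner loop entirely.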

Test plan

  • Clean A/B speed (+9% Qwen, +9% Mistral)
  • PPL zero regression (wikitext-2, 10 chunks)
  • NIAH 5/5 (Kamradt varied filler, both models)
  • Auto-asymmetric: Qwen PPL 2887→7.98
  • Mistral unaffected by auto-asymmetric (GQA 4:1)
  • HIP -Werror build fix
  • Ampere (sm_86) testing

TheTom and others added 30 commits April 2, 2026 13:07
New types: GGML_TYPE_TURBO3_0 (3-bit) and GGML_TYPE_TURBO4_0 (4-bit)
Implements PolarQuant + QJL compression per the ICLR 2026 paper.

Block size = 128 (matching head_dim for optimal rotation Gaussianization)
turbo3: 52 bytes per 128 values = 3.25 bits/value (4.9× vs fp16)
turbo4: 68 bytes per 128 values = 4.25 bits/value (3.8× vs fp16)

Status:
- ✅ Type definitions in ggml.h
- ✅ Block structures in ggml-common.h
- ✅ Quantize/dequantize C implementation in ggml-turbo-quant.c
- ✅ Registered in ggml.c type traits
- ✅ Added to kv_cache_types in arg.cpp
- ✅ Builds successfully
- ✅ Shows in --help output
- ❌ Metal SET_ROWS kernel not implemented (blocks GPU inference)
- ❌ Needs Metal dequantize kernels for attention computation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added Metal shader implementations:
- quantize_turbo3_0 / quantize_turbo4_0 (per-block quantization)
- dequantize_turbo3_0 / dequantize_turbo4_0 (type4x4 and type4 variants)
- kernel_set_rows_turbo template (128-element block size)
- Flash attention instantiations for all dk/dv variants

Added TURBO3_0/TURBO4_0 to Metal device SET_ROWS validation.

Builds successfully. Testing with Qwen 3.5 35B-A3B MoE on M5 Max.

Note: Initial version uses simplified quantization (no rotation matrix)
for Metal compatibility. Full rotation requires custom kernel with extra
buffer bindings — tracked for follow-up.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Embedded pre-computed 128×128 rotation and QJL matrices (256KB constant
memory) directly in the Metal shader. Both quantize and dequantize now
perform the full TurboQuant algorithm:

Quantize: normalize → rotate → codebook → inverse rotate → residual → QJL
Dequantize: codebook → inverse rotate → QJL correction → rescale

Previous version (no rotation) produced garbage. This should produce
meaningful output since the rotation Gaussianizes the KV distribution.

Note: dequantize does full 128-element rotation per chunk (8× work).
Optimization possible with caching or restructured kernel in follow-up.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eTom#21

- Inlined turbo-matrices.h directly into ggml-metal.metal (256KB)
  to fix JIT compilation failure with #include
- Added C round-trip test (test-turbo-quant.c):
  turbo3 cosine=0.906, turbo4 cosine=0.966 — matches Python prototype
- Metal library loads successfully ("loaded in 5.9 sec")
- Model runs on Metal but output quality needs debugging
  (Metal quantize/dequantize may have a bug vs the working C version)

C round-trip PROVES the algorithm works in C. Metal shader needs
debugging — likely an issue with the dequantize chunk addressing
or the large constant arrays in thread-local memory.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23

Codex review found:
1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail)
2. thread static is risky/non-portable in MSL

Fixed: removed thread static caching, using plain thread locals.
Speed unchanged (2.4 tok/s) — the static caching wasn't actually working
on Metal. True optimization needs architectural change in flash attention
kernel to dequantize once per block, not per chunk.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#26

Massive reduction in constant memory and compute:
- 256KB of dense matrices → 512 bytes of sign arrays
- O(d²) = 16,384 ops → O(d log d) = 896 ops per rotation
- Metal shader file: 1.5MB → 432KB

Speed: still 2.4 tok/s. WHT reduced per-rotation cost but the
bottleneck is redundant calls (8-32× per block from flash attention).
The dequantize function is called per 4/16-element chunk, each time
doing the full 128-element WHT. Need to modify the flash attention
kernel to dequantize once per block.

Quality: WHT+signs gives BETTER quality than dense QR on real KV
tensors (cosine 0.94 vs 0.79 at 2-bit). Sub-Gaussian distribution
(kurtosis 1.53) means fewer outliers hitting extreme centroids.

Reviewed by Codex: WHT butterfly correct, inverse order verified,
QJL correction matches reference C implementation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23

Root cause analysis: 8-32× redundant full-block dequantize per block
from flash attention template. Four approaches documented with expected
speedups and risk levels.

Plan: D (reduce overhead) → A/B (eliminate redundant calls)
Target: 2.4 tok/s → 20-40 tok/s

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om#23

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23

No-op dequant test: even returning all zeros from dequantize, turbo3
runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is
NOT in the attention dequantize path.

New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The
Metal quantize_turbo3_0 function does 3 WHT rotations per KV write,
totaling ~3200 ops per block × 224 blocks per token.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation
to fail at runtime. The model silently fell back to CPU for ALL ops.
ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU.

After inlining the header:
- MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal)
- MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement)

Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×)

This is the SAME bug we hit with turbo-matrices.h earlier.
Rule: NEVER use #include in ggml-metal.metal — always inline.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23

Previous 2.4 tok/s was CPU fallback. Real Metal numbers:
MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×)
Qwopus: 5.3 tok/s gen (3.3× slower than q8_0)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#27

Full investigation log with all tests, results, and the root cause.
Upstream TurboQuant activity tracked in TheTom#27.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om#28

Key findings from Dejan.ai, unixsysdev, and mudler:
1. QJL naively added back destroys quality (cosine 0.69)
2. Pre-rotate queries eliminates rotation from dequant path
3. WHT abandoned by everyone — dense QR or no rotation preferred
4. unixsysdev gets -0.8% speed loss with fused CUDA kernel
5. We're the only Metal implementation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in) TheTom#23

Removing WHT rotation from dequant (quality broken, speed test only):
  gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0)
  prompt: 67.3 → 162.6 tok/s

Confirms pre-rotate-queries would deliver ~49 tok/s.
Remaining gap (49 vs 85) is block size + QJL overhead.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speed ceiling confirmed: stripping rotation from dequant gives 49.1 tok/s
(vs 10.7 with rotation, vs 85.5 q8_0 baseline).

Implementation plan: store rotation matrix in KV cache, apply to Q in
graph builder, strip from Metal dequant. 6 files to modify.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m#23

Instead of inverse-rotating every K during dequant, rotate Q once
before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>.
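For reference, the identity is just the adjoint property of the rotation:

$$\langle q,\; R^{\top} c_{\mathrm{idx}} \rangle \;=\; q^{\top} R^{\top} c_{\mathrm{idx}} \;=\; (Rq)^{\top} c_{\mathrm{idx}} \;=\; \langle Rq,\; c_{\mathrm{idx}} \rangle,$$

so rotating q once before attention reproduces the same scores as inverse-rotating every dequantized K.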

Changes:
- Store rotation matrix (R^T) in KV cache, filled after buffer clear
- Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute
- Strip turbo_rotate_inverse from Metal dequant
- Dynamic cast to access rotation from mctx

Results:
- MoE gen: 10.7 → 51.4 tok/s (4.8× speedup)
- MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup)
- Now at 60% of q8_0 speed with 4.9× compression
- Model produces coherent output

Codex review: fixed buffer clear ordering (was zeroing rotation after init).
Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#23

Full investigation log documenting every test, every dead end, and every
breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries.

Key lessons: no #include in Metal, no-op testing, pre-rotate-queries,
buffer clear ordering, codex+roast catch real bugs.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%)
MSE-only better on 99.3% of vectors including p1 tails.

3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[].
No QJL stage in quantize or dequant.
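A minimal sketch of the 3-bit index split described above (illustrative C; the byte layout and helper names are assumptions, not the exact ggml-common.h structs):

```c
#include <stdint.h>

#define QK 128  // elements per rotation group; qs[] needs QK/4 bytes, signs[] QK/8 bytes

// Pack a 3-bit centroid index: lower 2 bits into qs[], upper bit into signs[].
// Assumes qs[]/signs[] start zeroed.
static inline void pack_idx3(uint8_t *qs, uint8_t *signs, int i, uint8_t idx /* 0..7 */) {
    qs[i / 4]    |= (uint8_t)((idx & 0x3) << (2 * (i % 4)));   // 4 values per qs byte
    signs[i / 8] |= (uint8_t)(((idx >> 2) & 0x1) << (i % 8));  // 8 values per signs byte
}

// Reconstruct the 3-bit index before the centroid lookup in dequant.
static inline uint8_t unpack_idx3(const uint8_t *qs, const uint8_t *signs, int i) {
    const uint8_t lo = (qs[i / 4] >> (2 * (i % 4))) & 0x3;
    const uint8_t hi = (signs[i / 8] >> (i % 8)) & 0x1;
    return (uint8_t)((hi << 2) | lo);
}
```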

Results:
- MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%)
- MoE prompt: 160 → 200 tok/s (90% of q8_0)
- Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%)
- Qwopus prompt: 67 → 83 tok/s (100% of q8_0!)

Codex verified: bit packing correct, quantize/dequant consistent.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speed ceiling without Q rotation: 61.3 tok/s (vs 62.2 with it).
The 128×128 ggml_mul_mat adds <1% overhead on Metal.

Remaining gap is structural (block size + dequant complexity).
Final: MoE 62.2 tok/s (73%), Qwopus 15.5 tok/s (88%).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diagnostic benchmark proves the 26% gap is entirely from block size 128.
q4_0 (block 32, 4-bit quantization) runs at 84.2 tok/s = identical to q8_0.

Next: turbo3 with block size 32.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed QK_TURBO3 from 128 to 32 (storage block size).
Rotation still operates on 128-element groups (QK_TURBO3_GROUP=128).
SET_ROWS kernel processes 4 blocks per rotation group.
Flash attention nl_k changed from 32 to 8 (matching q4_0).

Block struct: 14 bytes per 32 values = 3.5 bits/val → 4.6× compression.

Results:
- MoE gen: 62.2 → 77.7 tok/s (91% of q8_0 at 85.5)
- MoE prompt: 200 → 218.5 tok/s (98% of q8_0)
- Qwopus gen: 15.5 → 17.0 tok/s (97% of q8_0 at 17.6)
- Qwopus prompt: 83 → 89.5 tok/s (108% of q8_0 — FASTER)

Target was 75+ tok/s. Exceeded.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed TheTom#3 (TURBO_D). TheTom#1 and TheTom#2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Tom#30

Perplexity benchmarking reveals catastrophic quality failure:
- f16: 6.121, q8_0: 6.111, q4_0: 6.142
- turbo3: 165.6 (27× worse)

Speed benchmarks were meaningless — fast garbage.
Root cause investigation needed before any quality claims.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. V cache returns rotated-space values (cosine=0.02 vs correct 0.987)
2. dynamic_cast to llama_kv_cache_context fails for MoE models
   (uses llama_memory_hybrid_context, not kv_cache_context)
   → Q rotation and V inverse rotation NEVER executed

Fix: store rotation tensors in llm_graph_context, not KV cache.
Or access through hybrid memory interface.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heTom#31

Block 128: PPL=165.6 (same as block 32)
Disabled Q rotation: PPL=165.6 (same)
Root cause: dynamic_cast fails for MoE hybrid memory context.
Q rotation and V inverse rotation never execute.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eTom#31 TheTom#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128
2. mctx dynamic_cast failed for MoE hybrid memory

FIX: put inverse WHT rotation back in dequantize_full_block.
This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16:     6.121
- q8_0:    6.111
- q4_0:    6.142
- turbo3:  6.194 (+1.2% vs f16) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented
to work with GQA head layout and hybrid memory types.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quality confirmed: PPL 6.194 (+1.4% vs q8_0)
Speed: 10.7 tok/s (inverse rotation in dequant, no pre-rotate-queries)
Previous speed claims (51-77 tok/s) were invalid — measured garbage output speed.

Key lessons documented for future reference.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
signalnine changed the title from "perf: turbo VEC flash attention — +7% decode on CUDA" to "perf: turbo VEC flash attention — +15% decode on CUDA" on Apr 4, 2026

signalnine commented Apr 4, 2026

Updated: +15.3% decode with shmem LUT

Correction: The 15.3% number was from runs at different thermal states. Clean A/B on same session shows +9% total (see final results below). The shmem LUT adds ~2% on top of the autoresearch optimizations.

signalnine (Author) commented:

Final Results — Clean A/B on upstream HEAD (bc05a68)

All benchmarks back-to-back, same GPU thermal state, llama-bench -r 3.

Speed (RTX 5090, tg128, -p 0 -n 128)

| Model | Config | Upstream (t/s) | PR #53 (t/s) | Speedup |
| --- | --- | --- | --- | --- |
| Qwen2.5-7B Q4_0 | turbo3 (auto→q8_0-K) | 248.7 | 272.1 | +9.4% |
| Qwen2.5-7B Q4_0 | turbo3 (forced sym) | 248.7 | 267.3 | +7.5% |
| Mistral-7B Q4_0 | turbo3 symmetric | 239.4 | 260.4 | +8.8% |

PPL (wikitext-2-raw, 10 chunks)

| Model | Config | PPL | Δ vs fp16 baseline |
| --- | --- | --- | --- |
| Qwen2.5-7B | fp16 baseline | 7.85 | — |
| Qwen2.5-7B | turbo3 (auto-asymmetric) | 7.98 | +1.7% |
| Mistral-7B | fp16 baseline | 8.39 | — |
| Mistral-7B | turbo3 symmetric | 8.57 | +2.1% |

NIAH (Kamradt varied filler, 5 depths)

| Model | Config | Score |
| --- | --- | --- |
| Qwen2.5-7B | turbo3 (auto-asymmetric) | 5/5 |
| Mistral-7B | turbo3 symmetric | 5/5 |

What's in this PR

  1. FA kernel optimizations (fattn-vec.cuh): nthreads_KQ=1, nthreads_V/=8, warp shuffle KQ, sparse V skip, precomputed V centroids, __expf, occupancy 2
  2. Shared-memory KQ LUT (fattn-vec.cuh): precompute Q×centroid in shmem, eliminate per-element multiply in KQ inner loop
  3. Auto-asymmetric GQA fix (llama-kv-cache.cpp): auto-upgrade K to q8_0 when GQA ≥ 6:1 — fixes Qwen symmetric turbo3 (PPL 2887→7.98); a sketch of the check follows this list
  4. HIP build fix (turbo-quant.cuh): suppress -Wnodiscard on unchecked cudaMemcpy calls
  5. Autoresearch framework (scripts/autoresearch/): automated kernel optimization harness
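Item 3 in one picture — a hedged sketch of the auto-upgrade check. Function and field names are hypothetical; the real change lives in llama-kv-cache.cpp.

```cpp
#include "ggml.h"  // for ggml_type / GGML_TYPE_* (sketch assumes the turbo3 enum from this branch)

// Sketch only, not the actual diff. Idea: when many query heads share one KV
// head (GQA >= 6:1), K quantization error is amplified across all of them, so
// keep V in turbo3 but silently upgrade K to q8_0.
static ggml_type pick_k_cache_type(ggml_type requested_k, int n_head, int n_head_kv) {
    const int gqa_ratio = n_head_kv > 0 ? n_head / n_head_kv : 1;

    if (requested_k == GGML_TYPE_TURBO3_0 && gqa_ratio >= 6) {
        // e.g. Qwen2.5-7B (28 heads / 4 KV heads = 7:1): symmetric turbo3 gives
        // PPL ~2887, asymmetric q8_0-K / turbo3-V gives ~7.98.
        return GGML_TYPE_Q8_0;
    }
    return requested_k;  // Mistral-7B (GQA 4:1) and friends are unaffected
}
```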

signalnine marked this pull request as ready for review on April 4, 2026 at 20:31
signalnine changed the title from "perf: turbo VEC flash attention — +15% decode on CUDA" to "perf: turbo VEC flash attention — +9% decode on CUDA" on Apr 4, 2026
signalnine changed the title from "perf: turbo VEC flash attention — +9% decode on CUDA" to "perf: turbo VEC flash attention — +9% decode on CUDA via autoresearch" on Apr 4, 2026
…AM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7× LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer — halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT — eliminates constant-memory serialization
   on divergent lane access (+89% from this single change; see the sketch after this list)
3. Half2 arithmetic + strided block processing — 2× arithmetic density
4. Vectorized 128-bit loads — uint32×4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode — zero-memory centroid lookup
6. NWARPS 8→4
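Item 2 was the single biggest win, so here is a hedged sketch of the idea. The kernel, names, and codebook size are hypothetical, not the actual mul_mat_vec code: the point is that __constant__ memory serializes when lanes read different addresses, while shared memory handles divergent indices at full banked bandwidth.

```cuda
#include <cstdint>

#define N_CENTROIDS 16   // 4-bit codebook (assumption)

__constant__ float c_centroids[N_CENTROIDS];

// One block per output row; y must be zeroed before launch (atomicAdd used as
// a stand-in for a proper warp/block reduction to keep the sketch short).
__global__ void dot_tq4_sketch(const uint8_t * __restrict__ w_idx,  // centroid index per weight
                               const float   * __restrict__ x,      // activations
                               float         * __restrict__ y,      // one output per row
                               int                          ncols) {
    // Copy the centroid table into shared memory once per block so the
    // divergent per-lane index below hits shmem, not constant memory.
    __shared__ float s_centroids[N_CENTROIDS];
    if (threadIdx.x < N_CENTROIDS) {
        s_centroids[threadIdx.x] = c_centroids[threadIdx.x];
    }
    __syncthreads();

    float sum = 0.0f;
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) {
        sum += s_centroids[w_idx[blockIdx.x * ncols + c]] * x[c];
    }
    atomicAdd(&y[blockIdx.x], sum);
}
```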

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  ← this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)
seanrasch commented:

Heads up — PR doesn't build standalone against feature/turboquant-kv-cache HEAD (bc05a6803). Looks like a squash dropped the GGML_TYPE_TQ4_0 enum declaration.

ggml/include/ggml.h still ends at GGML_TYPE_TQ2_0 = 35, but ggml/src/ggml-cpu/ops.cpp references GGML_TYPE_TQ4_0 in 7 switch cases (lines 683, 1136, 1268, 4359, 4638, 4864, 5600), and ggml/src/ggml-turbo-quant.c uses TQ4_0_CENTROIDS. Build error on RTX 3080 Ti / SM 8.6 / GCC 12.4:

ggml/src/ggml-cpu/ops.cpp:683:14: error: 'GGML_TYPE_TQ4_0' was not declared in this scope; did you mean 'GGML_TYPE_TQ2_0'?

Either the enum addition needs to come back, or the switch cases / centroid table belong on a branch that has it (PR #45 doesn't add it either, so it's not a missing dependency on #45).

Happy to benchmark the VEC FA + SMEM LUT optimizations on Ampere once it builds — interested in whether the +15% decode claim holds up at SM 8.6 (our PR #36 test was a wash on Ampere, so concrete numbers would be useful).

signalnine (Author) commented:

Good catch — the 7 GGML_TYPE_TQ4_0 switch cases in ops.cpp leaked from our experimental weight compression branch. TQ4_0 was a prototype format (WHT + uniform q4_0, see optimization log) that never shipped. Removed in bb3a6eb.

The TQ4_0_CENTROIDS table in turbo-quant.c is upstream — that's the Lloyd-Max centroid table name used internally by TQ4_1S quantize/dequant (confusing name, but it's TheTom's code, not ours).

Should build clean now. Would love Ampere numbers — our +9% was Blackwell-only. The shmem LUT and nthreads tuning may behave differently on SM 8.6 with its smaller L1 and different warp scheduler.

signalnine and others added 14 commits April 6, 2026 10:56
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
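For reference, a warp-cooperative 32-point WHT looks roughly like this (illustrative sketch, not the committed kernel; it must be called by all 32 lanes of a warp, each holding one element):

```cuda
// Unnormalized 32-element Walsh-Hadamard transform across one warp: 5 butterfly
// stages, partners exchanged via __shfl_xor_sync, no shared memory.
__device__ __forceinline__ float wht32_warp(float v) {
    const unsigned lane = threadIdx.x & 31;
#pragma unroll
    for (int mask = 1; mask < 32; mask <<= 1) {
        const float partner = __shfl_xor_sync(0xffffffffu, v, mask);
        v = (lane & mask) ? (partner - v) : (v + partner);
    }
    return v;
}
```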

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part — the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
- Multi-token dp4a kernel for ne[1]≤8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1S→fp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
Enhance Metal operations for TQ weights and concurrency handling
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention — needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention — previously crashed during warmup, now runs at 22 t/s.

Fixes TheTom#58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S; AMD falls through to the scalar path (see the sketch after this list).
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
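The dispatch rule itself is a single predicate. A hedged sketch (helper name hypothetical; GGML_CUDA_CC_IS_AMD is the macro referenced above, and the GGML_TYPE_TQ4_1S spelling is an assumption):

```cpp
// Sketch of the dispatch check: the dp4a int8 path only pays off on NVIDIA
// Turing+; RDNA routes to the scalar half-precision kernel (same as TQ3_1S).
static bool tq4_1s_use_dp4a(const int cc, const ggml_type type) {
    return !GGML_CUDA_CC_IS_AMD(cc) && type == GGML_TYPE_TQ4_1S;
}
```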

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required — existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
perf: TQ4_1S native kernel 3.5× faster — 240 t/s, less VRAM than q8_0 conversion
Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 (experimental)
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers (see the sketch after this list)
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly
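For readers unfamiliar with the m/l/o bookkeeping named above, a minimal scalar sketch (generic C++, not the Metal source; one value dimension shown):

```cpp
#include <math.h>

// Online softmax: fold one (score, value) pair at a time into a running max m,
// normalizer l, and weighted output o — no second pass over the scores.
struct OnlineSoftmax {
    float m = -INFINITY;  // running max of scores
    float l = 0.0f;       // running sum of exp(score - m)
    float o = 0.0f;       // running softmax-weighted value sum

    void update(float score, float value) {
        const float m_new = fmaxf(m, score);
        const float corr  = expf(m - m_new);   // rescale old state to the new max
        const float p     = expf(score - m_new);
        l = l * corr + p;
        o = o * corr + p * value;
        m = m_new;
    }

    float result() const { return o / l; }
};
```

Pass 2 applies the same correction once more, rescaling each block's partial (m, l, o) state against the global max/sum before merging.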

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom force-pushed the feature/turboquant-kv-cache branch from 10cb187 to 0d6b38a on April 8, 2026 at 23:49
TheTom and others added 2 commits April 9, 2026 08:20
feat: AMD Instinct MI300X + MI355X (gfx942/gfx950) ROCm support
Optimizations found via automated kernel optimization (33 experiments):
- nthreads_KQ=1 + nthreads_V/=8 for better occupancy
- Warp shuffle KQ scores (eliminates shared memory for reduction)
- Precomputed scaled V centroids per block
- __expf fast-math softmax
- __launch_bounds__ occupancy 2
- Shmem KQ LUT: precompute Q×centroid in shared memory

Also includes:
- Auto-asymmetric KV: detect GQA ratio ≥6:1, upgrade K to q8_0
  (fixes catastrophic PPL on Qwen2.5 symmetric turbo3)
- HIP -Wnodiscard fix: (void) casts on cudaMemcpyToSymbol/FromSymbol
signalnine force-pushed the pr/fattn-vec-turbo-opts branch from bb3a6eb to 348fb77 on April 9, 2026 at 17:07
signalnine (Author) commented:

RTX 3090 (Ampere, sm_86) Validation

Tested on RunPod RTX 3090 24 GB, CUDA 12.4. Qwen2.5-7B-Instruct Q4_K_M.

Speed

| KV Config | pp512 | pp2K | pp8K | pp32K | tg128 | vs f16 tg |
| --- | --- | --- | --- | --- | --- | --- |
| f16 | 5,635 | 5,582 | 4,987 | 3,456 | 147.8 | — |
| q4_0 | 5,488 | 5,461 | 4,884 | 3,391 | 141.4 | -4.3% |
| turbo3 | 5,380 | 5,317 | 4,752 | 3,320 | 139.9 | -5.3% |
| turbo2 | 5,383 | 5,327 | 4,769 | 3,338 | 142.3 | -3.7% |

Quality (PPL @2k, 10 chunks)

| KV Config | PPL | vs f16 |
| --- | --- | --- |
| f16 | 5.754 | — |
| q4_0 | 9,714 | catastrophic (GQA 7:1) |
| turbo3 | 5.762 | +0.1% |
| turbo2 | 5.870 | +2.0% |

KLD (ctx=2048, 5 chunks)

| KV Config | Mean KLD |
| --- | --- |
| q4_0 | 7.584 (broken) |
| turbo3 | 0.024 |

Key findings

  • Auto-asymmetric is critical on Ampere: q4_0 KV (4-bit K+V) is catastrophically broken on Qwen's 7:1 GQA — PPL 9,714. turbo3 with auto-asymmetric (q8_0 K, turbo3 V) gives PPL 5.762, within 0.1% of f16.
  • Consistent overhead across architectures: ~5% prefill and decode overhead matches RTX 5090 (Blackwell) results.
  • turbo2 is viable on Ampere: PPL 5.870 (+2%), faster decode than turbo3 (142 vs 140 t/s).

TheTom force-pushed the feature/turboquant-kv-cache branch from 45f8a06 to 1073622 on April 16, 2026 at 01:14