feat: HIP/ROCm support for turbo3/turbo2 (7900 XTX) #31
Merged
TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache on Mar 30, 2026
Conversation
…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and the turbo2/turbo3 flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support both 3-arg and 4-arg calls (CUDA allows defaulting width to warpSize; the HIP macros required 4 args)
- Add __ballot_sync -> __ballot with a uint32_t cast (HIP returns a 64-bit mask on wave64 platforms; the turbo code expects 32 bits)

HIP CMakeLists:
- Add the turbo3 and turbo2 flash attention template instances (same files as the CUDA CMakeLists; they were missing from the HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs the F16 baseline of 5.16). Earlier runs showed catastrophic PPL ~15000 due to a CPU quantize stub bug (fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Owner
Tested on M5 Max 128GB and M2 Pro 32GB (Metal). These are HIP-only changes with zero shared code, so no Metal regressions were expected; confirmed none.

M5 Max (Qwen3.5-35B-A3B Q8_0):

M2 Pro (Qwen2.5-7B Q4_K_M, asymmetric; the correct config for Q4_K_M models):

Clean on both platforms. Nice minimal PR: the variadic shuffle macros and the D>=576 exclusion are both sensible. Thanks for the clean split from PR #5; this is exactly the right approach. Merging.
shtaylor pushed a commit to shtaylor/llama-cpp-turboquant that referenced this pull request on Mar 30, 2026
…heTom#31

Block 128: PPL=165.6 (same as block 32)
Disabled Q rotation: PPL=165.6 (same)

Root cause: the dynamic_cast fails for the MoE hybrid memory context, so the Q rotation and V inverse rotation never execute.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shtaylor pushed a commit to shtaylor/llama-cpp-turboquant that referenced this pull request on Mar 30, 2026
…eTom#31 TheTom#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads) while the rotation matrix has ne[0]=128
2. the mctx dynamic_cast failed for the MoE hybrid memory context

FIX: put the inverse WHT rotation back in dequantize_full_block. This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16: 6.121
- q8_0: 6.111
- q4_0: 6.142
- turbo3: 6.194 (+1.2% vs q8_0) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented to work with the GQA head layout and hybrid memory types.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request on Apr 6, 2026
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Status updates for experiments TheTom#31, TheTom#39, TheTom#42, TheTom#45, TheTom#49, TheTom#50, TheTom#51

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request on Apr 9, 2026
On Gemma 4 26B-A4B (Ampere), the inverse-FWHT decode K dequant for K=turbo2
produces values that are correct in isolation but trigger degenerate
single-token output (e.g. <thought> repeat, 0000 repeat) when paired with
V types in {turbo3, turbo4, q8_0, f16}. The same K=turbo2 inv-FWHT path
works fine for V in {turbo2, turbo2_tcq, turbo3_tcq}. The native VEC turbo
path also works for the failing combos. Root cause is still undiagnosed
after deep instrumentation: K and V f16 buffers contain correct values,
strides are correct, the FA kernel template selection matches the working
configurations, and L2 norms / value distributions look healthy across
all 30+ FA calls. Yet the model output collapses on Gemma 4 globals.
The PREFILL path uses the rotated-domain dequant kernel (k_turbo2_dequant_f16,
no inverse FWHT) plus a Q rotation to keep K and Q in the same Hadamard
basis. That path works for every K/V combination. This commit conditionally
mirrors the prefill K dequant + Q rotation in the decode path for the
specific K=turbo2 ↔ V={turbo3,turbo4,q8_0,f16} cases, and symmetrically for
K=turbo3 ↔ V=turbo2. Same-type and TCQ-side configurations are unchanged.
Fixes (Gemma 4 26B-A4B, dorei RTX 3090):
K=turbo2 V=turbo3 degenerate `<thought>` repeat → coherent factorial code
K=turbo3 V=turbo2 degenerate `0000` repeat → coherent factorial code
K=turbo2 V=turbo4 degenerate `<start_of_turn>` → coherent factorial code
K=turbo2 V=q8_0 degenerate `<div>` echo → coherent factorial code
Still broken (different code paths, separate root causes):
K=q8_0 V=turbo2 `<|channel>` repeat — K=q8_0 has no rotated-domain
decode option (Q8_0 is naturally in original domain)
K=turbo2 V=f16 crash in llama_decode (separate dispatch ASSERT)
K=f16 V=turbo2 crash in llama_decode (separate dispatch ASSERT)
Verification:
- Qwen3.5-27B PPL (8 chunks @ 2K, wikitext-2, RTX 3090): bit-identical to
master baseline across f16/q8_0/turbo3/turbo4/turbo3_tcq/turbo2/turbo2_tcq
(5.8048 / 5.8385 / 5.8501 / 5.8579 / 5.8017 / 6.0786 / 6.0051).
- Gemma 4 26B-A4B same-type configs (f16, q8_0, turbo2..turbo4, turbo*_tcq):
all coherent.
- Gemma 4 26B-A4B mixed K/V matrix: 19/21 working configs unchanged,
4 previously-failing K=t2/t3 mixed configs newly fixed, K=q8_0+V=t2 still
broken (separate bug).
The conditional dispatch is gated on K type AND V type so it only diverts
the empirically-failing combinations. Same-type configs and configs that
work with the inv-FWHT path are untouched.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
greatyingzi pushed a commit to greatyingzi/buun-llama-cpp that referenced this pull request on Apr 15, 2026
The previous commit optimized turbo4 decode by skipping the expensive inv-FWHT butterfly on K dequant, using rotated-domain dequant with Q pre-rotation instead. This extends the same optimization to all turbo quantization types:

- turbo2_0: switched from k_turbo2_dequant_f16_inv_fwht to k_turbo2_dequant_f16
- turbo3_0: switched from k_turbo3_dequant_f16_inv_fwht to k_turbo3_dequant_f16
- turbo3_tcq: switched from k_turbo3_tcq_dequant_f16_inv_fwht to rotated-domain
- turbo2_tcq: switched from k_turbo2_tcq_dequant_f16_inv_fwht to rotated-domain

All turbo K types now always use rotated-domain dequant in decode, with Q pre-rotated via FWHT to compensate (cheap: ~1 FWHT group per head vs N FWHT groups per KV row). The Bug TheTom#31 conditional workaround is removed, since rotated-domain is now the default for all types.

Also adds shared-memory codebook (_cb) variants for the turbo2_0 and turbo3_0 VEC path (K dot product and V dequant) to match the turbo4 and TCQ patterns, and adds turbo4 tile-path loading in the MMA/f16 kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
HIP/ROCm porting for the turbo3/turbo2 warp-cooperative kernels. Split from PR #5 per review feedback.
Single commit, minimal surface area:
- Vendor header (hip.h): added cudaMemcpyToSymbol/FromSymbol mappings; fixed __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support 3-arg calls (CUDA defaults width to warpSize); added __ballot_sync -> __ballot with a uint32_t cast.

Test Results (AMD 7900 XTX, ROCm 7.1)
turbo3 runs at ~98% of F16 speed. Mistral-Small (head_dim=128) confirmed working.
What's NOT in this PR
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>