TriAttention V3 hybrid recipe: two fixes for Qwen3.5 NIAH failure#75
Open
CG-8663 wants to merge 1 commit into TheTom:main from
Conversation
Two targeted fixes for NIAH failure on Qwen3.5 hybrid (Mamba+Attention):

1. Scale eviction budget by attention fraction. 27B has 16/64 attention layers, so each KV token is 4x more critical. Fix: `effective_budget = 1 - (1 - raw_budget) * attention_fraction`. 90% retention becomes 97.5% on hybrid (evict 2.5% instead of 10%).
2. Fix the frequency count for partial RoPE. Qwen3.5 rotates only 64/256 head dims, but current scoring averages 32 bins of signal with 96 bins of noise. Fix: `freq_count = n_rot/2` (32 frequencies, not 128).

Includes a step-by-step validation recipe and TQBridge integration analysis.

Co-Authored-By: James Tervit, Founder Chronara Group <info@chronara.io>
user-23xyz added a commit to user-23xyz/turboquant_plus that referenced this pull request on Apr 10, 2026
…kernels

First implementation of fused quantized KV attention on Apple Silicon Metal. Reads packed 3-bit K/V directly inside the attention dot product; the decompressed FP16 tensors never touch device memory.

Results: 82% per-layer memory reduction, 0.99x baseline speed, 300K NIAH on a 16GB M4 Mini. Per-head adaptive sparse attention with tile-level early exit. Interacts with TriAttention V3 (stacks: eviction × compression).

Includes: methodology, 6 Metal kernels, NIAH results, bug reports (MLX grid semantics), and hybrid budget scaling interaction with PR TheTom#75.

Code: github.com/user-23xyz/forgeattention (MIT)
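The packed-3-bit read path can be illustrated with a toy pack/unpack pair in Python. This is not the Metal kernel's actual layout (which is not shown in this thread); it only demonstrates bit-contiguous 3-bit codes, which is what a fused kernel would decode per tile inside the dot product instead of materializing FP16:

```python
def pack3(codes):
    """Pack 3-bit codes (0..7) bit-contiguously into bytes, LSB-first."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits |= (c & 0b111) << nbits
        nbits += 3
        while nbits >= 8:          # flush full bytes
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:                      # flush remaining partial byte
        out.append(bits & 0xFF)
    return bytes(out)

def unpack3(buf, n):
    """Read n 3-bit codes back from the packed buffer."""
    bits, nbits, out = 0, 0, []
    it = iter(buf)
    for _ in range(n):
        while nbits < 3:           # refill from the byte stream
            bits |= next(it) << nbits
            nbits += 8
        out.append(bits & 0b111)
        bits >>= 3
        nbits -= 3
    return out

codes = [5, 0, 7, 3, 1, 6, 2, 4]
assert unpack3(pack3(codes), len(codes)) == codes   # 8 codes fit in 3 bytes
```

The point of the fused approach is that `unpack3` happens in registers during the attention score computation, so the 82% memory reduction holds end to end.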
Response to the open question in triattention-v3.md Section 5 — "failure modes to share or a recipe that fixes the hybrid case."
Two Fixes
Fix 1: Scale eviction budget by attention fraction
The root cause: Qwen3.5-27B has 16/64 attention layers. At 90% retention, you are removing 10% of tokens from a model where each attention token does 4x the work. That is equivalent to 40% effective eviction on a full transformer.
Formula: `effective_budget = 1 - (1 - raw_budget) * attention_fraction`
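A minimal sketch of the budget scaling in Python (the function name is illustrative; the actual V3 API is not shown here):

```python
def effective_budget(raw_budget: float, attention_fraction: float) -> float:
    """Scale an eviction budget for a hybrid model.

    raw_budget: fraction of tokens retained on a full transformer (e.g. 0.90).
    attention_fraction: share of layers that are attention (16/64 for Qwen3.5-27B).
    Shrinks the evicted share by the attention fraction, since each KV token
    in the remaining attention layers does proportionally more work.
    """
    return 1 - (1 - raw_budget) * attention_fraction

# Qwen3.5-27B: 16 of 64 layers are attention.
print(effective_budget(0.90, 16 / 64))  # 0.975 -> evict 2.5% instead of 10%
```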
Fix 2: Partial RoPE frequency count
Qwen3.5 rotates only 64/256 head dimensions. The scoring loop iterates over head_dim/2 = 128 frequency bins, but 96 of those contribute zero signal (no rotation = no position encoding = no trig score difference). The score averages 32 bins of signal with 96 bins of noise.
Fix: iterate only over n_rot/2 frequencies.
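A sketch of the change, assuming a simplified per-token trig scoring loop (the names and the placeholder score here are illustrative, not the V3 code):

```python
import math

def position_score(pos: int, head_dim: int, n_rot: int, base: float = 10000.0) -> float:
    """Average a per-frequency trig score over only the rotated dims.

    With partial RoPE, dims >= n_rot are never rotated, so their frequency
    bins carry no positional signal; averaging over head_dim // 2 = 128 bins
    dilutes 32 informative bins (n_rot // 2) with 96 bins of noise.
    """
    freq_count = n_rot // 2                       # the fix: 32, not head_dim // 2
    total = 0.0
    for i in range(freq_count):
        theta = pos * base ** (-2 * i / n_rot)    # standard RoPE frequency schedule
        total += abs(math.cos(theta))             # placeholder trig score
    return total / freq_count

score = position_score(pos=1000, head_dim=256, n_rot=64)
```

With the old `head_dim // 2` loop bound, three quarters of the average would come from unrotated bins where the score is position-independent.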
Why These Should Work
Validation Recipe
Included in the document — step-by-step commands for testing each fix independently and stacked, including TurboQuant+ integration.
TQBridge Integration
With both fixes, TriAttention + TurboQuant on reasoning workloads gives ~23x combined compression, i.e. ~2.2KB per token over the wire. That is viable over WiFi for distributed inference.
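The wire-size arithmetic follows directly from the two stated figures (the ~23x factor and the 2.2KB number come from this comment; the baseline below is just their product, not an independently measured value):

```python
# Back out the implied uncompressed KV footprint from the stated figures.
compressed_kb_per_token = 2.2   # stated wire size with both fixes applied
combined_compression = 23.0     # stated TriAttention x TurboQuant factor

baseline_kb_per_token = compressed_kb_per_token * combined_compression
print(round(baseline_kb_per_token, 1))  # 50.6 KB of FP16 KV per token implied
```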
Full analysis: docs/papers/triattention-hybrid-recipe.md