
TriAttention V3 hybrid recipe: two fixes for Qwen3.5 NIAH failure#75

Open
CG-8663 wants to merge 1 commit into TheTom:main from CG-8663:fix/hybrid-triattention-recipe

Conversation


@CG-8663 CG-8663 commented Apr 10, 2026

Response to the open question in triattention-v3.md Section 5 — "failure modes to share or a recipe that fixes the hybrid case."

Two Fixes

Fix 1: Scale eviction budget by attention fraction

The root cause: Qwen3.5-27B has only 16 attention layers out of 64. At 90% retention you are removing 10% of KV tokens from a model where each attention layer carries 4x the load of its full-transformer counterpart, so the damage is roughly equivalent to 40% effective eviction on a standard architecture.

Formula:

effective_budget = 1.0 - (1.0 - raw_budget) * attention_fraction

Qwen2.5-7B:  attention_fraction = 32/32 = 1.0 → 90% (unchanged)
Qwen3.5-27B: attention_fraction = 16/64 = 0.25 → 97.5% (evict 2.5%)
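A minimal C sketch of the scaling (the function name and the direct parameters are illustrative, not an existing llama.cpp API):

```c
#include <assert.h>
#include <math.h>

// Scale the raw retention budget by the fraction of attention layers.
// On a hybrid model, eviction only touches the attention layers' KV cache,
// so the raw eviction rate (1 - raw_budget) is shrunk by attention_fraction.
static float effective_budget(float raw_budget, int n_attn_layers, int n_layers) {
    float attention_fraction = (float) n_attn_layers / (float) n_layers;
    return 1.0f - (1.0f - raw_budget) * attention_fraction;
}
```

A full transformer (attention_fraction = 1.0) is unchanged; Qwen3.5-27B at a raw 90% budget lands at 97.5% retention, i.e. evict 2.5% instead of 10%.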

Fix 2: Partial RoPE frequency count

Qwen3.5 rotates only 64/256 head dimensions. The scoring loop iterates over head_dim/2 = 128 frequency bins, but 96 of those contribute zero signal (no rotation = no position encoding = no trig score difference). The score averages 32 bins of signal with 96 bins of noise.

Fix: iterate only over n_rot/2 frequencies.

int freq_count = model->hparams.n_rot / 2;  // 32 for Qwen3.5, 64 for standard
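The loop-bound change can be sketched as follows (the `hparams` struct mirrors llama.cpp field naming; the helper itself is illustrative):

```c
#include <assert.h>

// Minimal stand-in for the relevant model hyperparameters.
struct hparams {
    int n_embd_head;  // head dimension: 256 for Qwen3.5
    int n_rot;        // rotated dimensions: 64 for Qwen3.5 (partial RoPE)
};

// Number of RoPE frequency bins the eviction scoring should iterate over.
// Before the fix: n_embd_head / 2 = 128 bins, 96 of them unrotated noise.
// After the fix:  n_rot / 2 = 32 bins, all carrying position signal.
static int scoring_freq_count(const struct hparams *hp) {
    return hp->n_rot / 2;
}
```

For a standard full-RoPE model, `n_rot == n_embd_head`, so the fix is a no-op there.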

Why These Should Work

  • Fix 1 explains every observation: PPL fine (Mamba dominates), NIAH fails at mid/end (attention is sparse), start passes (prefix protection saves it)
  • Fix 2 is a 4x signal-to-noise improvement on the scoring that chooses which tokens to evict
  • Both fixes are derived from the model architecture, not tuning
  • Neither changes the scoring formula — V3 trig scoring is correct, it just needs the right inputs

Validation Recipe

Included in the document — step-by-step commands for testing each fix independently and stacked, including TurboQuant+ integration.

TQBridge Integration

With both fixes: TriAttention + TurboQuant on reasoning workloads = ~23x combined compression = 2.2KB per token over the wire. Viable over WiFi for distributed inference.
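As a back-of-envelope check on WiFi viability (the 2.2KB/token figure is from above; the 50 tokens/s decode rate is an assumed, illustrative number):

```c
#include <assert.h>

// Wire bandwidth needed to stream compressed KV state during decode.
// kb_per_token: compressed KV bytes shipped per generated token (KB).
// tokens_per_sec: assumed decode rate, not a measured figure.
static double wire_mbits_per_sec(double kb_per_token, double tokens_per_sec) {
    return kb_per_token * tokens_per_sec * 8.0 / 1000.0;  // KB/s -> Mbit/s
}
```

At 2.2KB/token and 50 tokens/s this is under 1 Mbit/s, orders of magnitude below typical WiFi throughput.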

Full analysis: docs/papers/triattention-hybrid-recipe.md

Two targeted fixes for NIAH failure on Qwen3.5 hybrid (Mamba+Attention):

1. Scale eviction budget by attention fraction
   27B has 16/64 attention layers → each KV token is 4x more critical
   Fix: effective_budget = 1 - (1 - raw_budget) * attention_fraction
   90% retention → 97.5% on hybrid (evict 2.5% instead of 10%)

2. Fix frequency count for partial RoPE
   Qwen3.5 rotates only 64/256 head dims
   Current scoring averages 32 bins of signal with 96 bins of noise
   Fix: freq_count = n_rot/2 (32 frequencies, not 128)

Includes step-by-step validation recipe and TQBridge integration analysis.

Co-Authored-By: James Tervit, Founder Chronara Group <info@chronara.io>
user-23xyz added a commit to user-23xyz/turboquant_plus that referenced this pull request Apr 10, 2026
…kernels

First implementation of fused quantized KV attention on Apple Silicon Metal.
Reads packed 3-bit K/V directly inside the attention dot product — the
decompressed FP16 tensors never touch device memory.

Results: 82% per-layer memory reduction, 0.99x baseline speed, 300K NIAH
on 16GB M4 Mini. Per-head adaptive sparse attention with tile-level early
exit. Interacts with TriAttention V3 (stacks: eviction × compression).

Includes: methodology, 6 Metal kernels, NIAH results, bug reports
(MLX grid semantics), and hybrid budget scaling interaction with PR TheTom#75.

Code: github.com/user-23xyz/forgeattention (MIT)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
