
TriAttention V3 hybrid recipe: two fixes for Qwen3.5 NIAH failure#75

Open
CG-8663 wants to merge 1 commit into TheTom:main from CG-8663:fix/hybrid-triattention-recipe

Conversation


@CG-8663 CG-8663 commented Apr 10, 2026

Response to the open question in triattention-v3.md Section 5 — "failure modes to share or a recipe that fixes the hybrid case."

Two Fixes

Fix 1: Scale eviction budget by attention fraction

The root cause: Qwen3.5-27B has only 16 attention layers out of 64. At 90% retention you are removing 10% of KV tokens from a model where each attention layer carries 4x the load of its full-transformer counterpart, so the damage is roughly equivalent to 40% effective eviction on a standard architecture.

Formula:

effective_budget = 1.0 - (1.0 - raw_budget) * attention_fraction

Qwen2.5-7B:  attention_fraction = 32/32 = 1.0 → 90% (unchanged)
Qwen3.5-27B: attention_fraction = 16/64 = 0.25 → 97.5% (evict 2.5%)
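A minimal C sketch of the scaling (the function name and the direct parameters are illustrative, not an existing llama.cpp API):

```c
#include <assert.h>
#include <math.h>

// Scale the raw retention budget by the fraction of attention layers.
// On a hybrid model, eviction only touches the attention layers' KV cache,
// so the raw eviction rate (1 - raw_budget) is shrunk by attention_fraction.
static float effective_budget(float raw_budget, int n_attn_layers, int n_layers) {
    float attention_fraction = (float) n_attn_layers / (float) n_layers;
    return 1.0f - (1.0f - raw_budget) * attention_fraction;
}
```

A full transformer (attention_fraction = 1.0) is unchanged; Qwen3.5-27B at a raw 90% budget lands at 97.5% retention, i.e. evict 2.5% instead of 10%.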

Fix 2: Partial RoPE frequency count

Qwen3.5 rotates only 64/256 head dimensions. The scoring loop iterates over head_dim/2 = 128 frequency bins, but 96 of those contribute zero signal (no rotation = no position encoding = no trig score difference). The score averages 32 bins of signal with 96 bins of noise.

Fix: iterate only over n_rot/2 frequencies.

int freq_count = model->hparams.n_rot / 2;  // 32 for Qwen3.5, 64 for standard
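The loop-bound change can be sketched as follows (the `hparams` struct mirrors llama.cpp field naming; the helper itself is illustrative):

```c
#include <assert.h>

// Minimal stand-in for the relevant model hyperparameters.
struct hparams {
    int n_embd_head;  // head dimension: 256 for Qwen3.5
    int n_rot;        // rotated dimensions: 64 for Qwen3.5 (partial RoPE)
};

// Number of RoPE frequency bins the eviction scoring should iterate over.
// Before the fix: n_embd_head / 2 = 128 bins, 96 of them unrotated noise.
// After the fix:  n_rot / 2 = 32 bins, all carrying position signal.
static int scoring_freq_count(const struct hparams *hp) {
    return hp->n_rot / 2;
}
```

For a standard full-RoPE model, `n_rot == n_embd_head`, so the fix is a no-op there.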

Why These Should Work

  • Fix 1 explains every observation: PPL fine (Mamba dominates), NIAH fails at mid/end (attention is sparse), start passes (prefix protection saves it)
  • Fix 2 is a 4x signal-to-noise improvement on the scoring that chooses which tokens to evict
  • Both fixes are derived from the model architecture, not tuning
  • Neither changes the scoring formula — V3 trig scoring is correct, it just needs the right inputs

Validation Recipe

Included in the document — step-by-step commands for testing each fix independently and stacked, including TurboQuant+ integration.

TQBridge Integration

With both fixes: TriAttention + TurboQuant on reasoning workloads = ~23x combined compression = 2.2KB per token over the wire. Viable over WiFi for distributed inference.
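As a back-of-envelope check on WiFi viability (the 2.2KB/token figure is from above; the 50 tokens/s decode rate is an assumed, illustrative number):

```c
#include <assert.h>

// Wire bandwidth needed to stream compressed KV state during decode.
// kb_per_token: compressed KV bytes shipped per generated token (KB).
// tokens_per_sec: assumed decode rate, not a measured figure.
static double wire_mbits_per_sec(double kb_per_token, double tokens_per_sec) {
    return kb_per_token * tokens_per_sec * 8.0 / 1000.0;  // KB/s -> Mbit/s
}
```

At 2.2KB/token and 50 tokens/s this is under 1 Mbit/s, orders of magnitude below typical WiFi throughput.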

Full analysis: docs/papers/triattention-hybrid-recipe.md

Two targeted fixes for NIAH failure on Qwen3.5 hybrid (Mamba+Attention):

1. Scale eviction budget by attention fraction
   27B has 16/64 attention layers → each KV token is 4x more critical
   Fix: effective_budget = 1 - (1 - raw_budget) * attention_fraction
   90% retention → 97.5% on hybrid (evict 2.5% instead of 10%)

2. Fix frequency count for partial RoPE
   Qwen3.5 rotates only 64/256 head dims
   Current scoring averages 32 bins of signal with 96 bins of noise
   Fix: freq_count = n_rot/2 (32 frequencies, not 128)

Includes step-by-step validation recipe and TQBridge integration analysis.

Co-Authored-By: James Tervit, Founder Chronara Group <info@chronara.io>
user-23xyz added a commit to user-23xyz/turboquant_plus that referenced this pull request Apr 10, 2026
…kernels

First implementation of fused quantized KV attention on Apple Silicon Metal.
Reads packed 3-bit K/V directly inside the attention dot product — the
decompressed FP16 tensors never touch device memory.

Results: 82% per-layer memory reduction, 0.99x baseline speed, 300K NIAH
on 16GB M4 Mini. Per-head adaptive sparse attention with tile-level early
exit. Interacts with TriAttention V3 (stacks: eviction × compression).

Includes: methodology, 6 Metal kernels, NIAH results, bug reports
(MLX grid semantics), and hybrid budget scaling interaction with PR TheTom#75.

Code: github.com/user-23xyz/forgeattention (MIT)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
