
Optimize reduction stage of dot product of q4_K/q5_K to q8_K on AVX2 #22181

Open
nariox wants to merge 1 commit into ggml-org:master from nariox:no-vphaddw-on-avx2

Conversation

@nariox

@nariox nariox commented Apr 20, 2026

Overview

This PR optimizes the reduction stage of the dot product kernels for AVX2, specifically targeting the logic used in the Q4_K/Q8_K and Q5_K/Q8_K dot products (i.e., ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q5_K_q8_K).

The primary change is replacing the SSSE3 horizontal-add instruction _mm_hadd_epi16 (VPHADDW) with an AVX2 "equivalent" approach built from _mm256_madd_epi16 and specialized shuffles. When running a large model on my CPU (qwen3.6-35b-a3b with q4_K weights and a q4_0 KV cache), I noticed that ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q5_K_q8_K were taking a substantial amount of time on my processor (7.45% and 4.66% respectively), and the perf top annotations showed vphaddw accounting for a big chunk of those functions as well (9.01% in the q4_K one). Although other functions consumed more time, this seemed like low-hanging fruit, since AVX2 offers better-performing alternatives for this kind of reduction.

Key Technical Changes:

  • Elimination of VPHADDW: Horizontal instructions are poorly pipelined on many x86 architectures (Intel and AMD), requiring multiple micro-ops and causing execution stalls.
  • Vertical Pairwise Summation: Uses a multiply-by-one strategy with _mm256_madd_epi16 to perform the pairwise sums on the fully pipelined vector multiply units, instead of the poorly pipelined horizontal-add path.
  • Improved Reduction Path: Restructured the final summation using _mm_shuffle_ps and _mm_shuffle_epi32. This reduces port pressure and allows for better Instruction Level Parallelism (ILP).

Performance Impact:

With this patch, end-to-end inference speed on my machine improved modestly, from 14.5-15.2 t/s to 15.8-16.2 t/s.

Additional information

  • Architecture: While tested primarily on Zen 4 (Ryzen 5 7600), these improvements are fundamentally more efficient for all AVX2-capable processors (Haswell through modern architectures) because they move work from specialized horizontal units to the general-purpose SIMD execution pipeline.
  • Accuracy: The logic should maintain bit-level parity with the original implementation (barring some odd microcode inaccuracy or IEEE 754 non-compliance).

Requirements

I have read and agree with the contributing guidelines

AI usage disclosure: YES. Although I have generally kept up with theoretical advances in SIMD instruction sets, I have not done much low-level SIMD programming myself since 2013 (CUDA and SSE2). I asked an LLM to help me generate a snippet for these instructions, but I inspected the functions to make sure they were bit-level equal to the original and test-ran the LLM afterwards.

@nariox nariox requested a review from ggerganov as a code owner April 20, 2026 19:10
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 20, 2026
