
Optimize reduction stage of dot product of q4_K/q5_K to q8_K on AVX2 #22181

Open
nariox wants to merge 1 commit into ggml-org:master from nariox:no-vphaddw-on-avx2

Conversation

@nariox

@nariox nariox commented Apr 20, 2026

Overview

This PR optimizes the reduction stage of the dot product kernels for AVX2, specifically targeting the logic used in the Q4_K/Q8_K and Q5_K/Q8_K dot products (i.e., ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q5_K_q8_K).

The primary change is replacing the SSSE3 horizontal-add instruction _mm_hadd_epi16 (VPHADDW) with an AVX2 "equivalent" approach built from _mm256_madd_epi16 and specialized shuffles. When running a large model on my CPU (qwen3.6-35b-a3b with q4_K weights and a q4_0 KV cache), I noticed that ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q5_K_q8_K were taking a substantial amount of time on my processor (7.45% and 4.66% respectively), and the perf top annotations showed vphaddw accounting for a big chunk of those functions as well (9.01% in the q4_K one). Although other functions consumed more time, this seemed like low-hanging fruit, since AVX2 offers better-performing alternatives for this kind of reduction.

Key Technical Changes:

  • Elimination of VPHADDW: Horizontal instructions are poorly pipelined on many x86 architectures (Intel and AMD), requiring multiple micro-ops and causing execution stalls.
  • Vertical Pairwise Summation: Uses a multiply-by-one strategy with _mm256_madd_epi16 to perform the pairwise sums on the fully pipelined vector multiply units, instead of the poorly pipelined horizontal-add path.
  • Improved Reduction Path: Restructured the final summation using _mm_shuffle_ps and _mm_shuffle_epi32. This reduces port pressure and allows for better Instruction Level Parallelism (ILP).

Performance Impact:

With this patch, end-to-end inference speed on my machine improved modestly, from 14.5-15.2 t/s to 15.8-16.2 t/s.

Additional information

  • Architecture: While tested primarily on Zen 4 (Ryzen 5 7600), these improvements are fundamentally more efficient for all AVX2-capable processors (Haswell through modern architectures) because they move work from specialized horizontal units to the general-purpose SIMD execution pipeline.
  • Accuracy: The logic should maintain bit-level parity with the original implementation (barring some odd microcode inaccuracy or IEEE 754 non-compliance).

Requirements

I have read and agree with the contributing guidelines

AI usage disclosure: YES. Although I have generally kept up with theoretical advances in SIMD instruction sets, I have not done much low-level SIMD programming myself since 2013 (CUDA and SSE2). I asked an LLM to help me generate a snippet for these instructions, but I inspected the functions to make sure they were bit-level equal to the original and test-ran the LLM afterwards.

@nariox nariox requested a review from ggerganov as a code owner April 20, 2026 19:10
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 20, 2026
