
ggml : vectorize Q6_K unpack on WASM SIMD128 (strict, deterministic)#22134

Open
Simlowker wants to merge 1 commit intoggml-org:masterfrom
Simlowker:q6k-wasm-vectorize

Conversation

@Simlowker

Summary

Vectorize the 6-bit weight unpacking phase of ggml_vec_dot_q6_K_q8_K on the
WASM SIMD128 code path in ggml-cpu/arch/wasm/quants.c. PR #11453 (Jan 2025)
vectorized the Q4/Q5/Q8 WASM paths but left Q6_K's ql/qh unpacking as a
scalar loop. This PR closes that remaining scalar region.

Motivation

For models quantized in Q6_K running on WASM SIMD128 environments — including
(but not limited to) deterministic/fuelled runtimes like Internet Computer
canisters, Wasmtime with --wasm-features simd, WasmEdge, and WebLLM/MLC —
the Q6_K dot-product is a hot inner loop. Its Phase 2 (dot + scaling) was
already SIMD. Phase 1 (decode ql[64] + qh[32] → int8 aux8[256]) was 256
scalar stores per block with per-byte bit manipulation.

In resource-bounded environments (per-call instruction quotas, deterministic
metering), reducing the Phase 1 instruction count has a direct effect on
measurable throughput. The optimization also removes a dependency on
compiler auto-vectorization for a consistent code path across LLVM versions.

Approach

Process 16 output lanes at once using strict (non-relaxed) WASM SIMD128
intrinsics. For each j-iteration (128 decoded weights), the loop now
runs 2 × 16-lane chunks instead of 32 × 4 scalar stores.

Per chunk:

v128_t q4_lo_src = wasm_v128_load(q4 + chunk);       // q4[chunk..chunk+15]
v128_t q4_hi_src = wasm_v128_load(q4 + chunk + 32);  // q4[chunk+32..chunk+47]
v128_t qh_src    = wasm_v128_load(qh + chunk);       // qh[chunk..chunk+15]

// Low nibbles of q4 via mask (0x0F)
v128_t q4_lo_nib = wasm_v128_and(q4_lo_src, mask_0F);
v128_t q4_hi_nib = wasm_v128_and(q4_hi_src, mask_0F);

// High nibbles of q4 via unsigned 8-bit shift right
v128_t q4_lo_hnib = wasm_u8x16_shr(q4_lo_src, 4);
v128_t q4_hi_hnib = wasm_u8x16_shr(q4_hi_src, 4);

// Extract 4 × 2-bit groups from qh
v128_t qh_b01 = wasm_v128_and(qh_src, mask_03);
v128_t qh_b23 = wasm_v128_and(wasm_u8x16_shr(qh_src, 2), mask_03);
v128_t qh_b45 = wasm_v128_and(wasm_u8x16_shr(qh_src, 4), mask_03);
v128_t qh_b67 = wasm_u8x16_shr(qh_src, 6);  // only 2 bits remain

// Merge: shift qh bits into upper nibble, OR with q4 nibbles, subtract bias 32
v128_t out_0  = wasm_i8x16_sub(wasm_v128_or(q4_lo_nib,  wasm_i8x16_shl(qh_b01, 4)), bias_32);
v128_t out_32 = wasm_i8x16_sub(wasm_v128_or(q4_hi_nib,  wasm_i8x16_shl(qh_b23, 4)), bias_32);
v128_t out_64 = wasm_i8x16_sub(wasm_v128_or(q4_lo_hnib, wasm_i8x16_shl(qh_b45, 4)), bias_32);
v128_t out_96 = wasm_i8x16_sub(wasm_v128_or(q4_hi_hnib, wasm_i8x16_shl(qh_b67, 4)), bias_32);

wasm_v128_store(a + chunk +  0, out_0);
wasm_v128_store(a + chunk + 32, out_32);
wasm_v128_store(a + chunk + 64, out_64);
wasm_v128_store(a + chunk + 96, out_96);

Determinism

No relaxed SIMD ops (wasm_*_relaxed_*, i32x4.relaxed_dot_i8x16_i7x16,
f32x4.relaxed_madd, etc.) are used. All intrinsics employed
(v128_load/store, i8x16_splat, v128_and/or, u8x16_shr, i8x16_shl,
i8x16_sub) have fully specified semantics in the WASM SIMD128 spec and
produce bit-identical output across conforming implementations.

This matters for environments that require deterministic compute across
replicas (consensus-based VMs, reproducible research pipelines, debugging
deterministic replays).

Microbench results

Measured with Emscripten -O3 -msimd128, Node.js v24, N=4 runs:

  Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
  -------------------------|---------|--------|-------|----------
  Baseline (scalar unpack) |  357.81 |  22.91 | 7.98% | reference
  Patched  (vectorized)    |  349.85 |  23.43 | 2.53% | identical

Speedup: +2.3% mean. Per-run variance drops 3× (CV 7.98% → 2.53%) because
the vectorized path has fewer branches and more predictable cycle counts.

The modest mean speedup reflects that LLVM -O3 already extracts a
non-trivial fraction of the SIMD parallelism from the scalar loop via its
auto-vectorizer. The explicit SIMD code path:

  1. Guarantees SIMD codegen independent of compiler version / flags.
  2. Reduces run-to-run variance (useful for deterministic metering and
    reproducibility audits).
  3. Provides a stable baseline for further kernel-level tuning.

Bit-exactness regression test

The microbench harness (matmul-bench/q6k_vectorize_bench.c in the
external project) generates deterministic Q6_K and Q8_K blocks
(xorshift32 seeded to 42), runs both variants, and compares the float
result. All 8 runs produced result=56754044928.000000 identically,
demonstrating bit-exact equivalence.

Testing

  • Microbench compiles with emcc -O3 -msimd128
  • Bit-exact output vs scalar baseline (seed=42, 16 blocks × 256 elements)
  • No measurable regression on repeated runs
  • Compiles into downstream canister build (icpp-pro 5.3.1, wasi-sdk 25.0,
    target wasm32-wasi, with -msimd128)

Related work

This patch came out of running Q6_K-quantized models inside the Internet
Computer's WASM runtime, where matmul dominates the per-call instruction
budget. The change is small and self-contained and independently useful
to anyone running Q6_K on WASM SIMD128.

Files changed

Only ggml/src/ggml-cpu/arch/wasm/quants.c — specifically the
ggml_vec_dot_q6_K_q8_K function's #if defined __wasm_simd128__ branch,
Phase 1 loop (lines 1115-1130 in the pre-PR state).

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.

Checklist

  • Fork the latest version of the upstream repository and create a PR from that fork
  • Make only the changes described above (single-function, single-arch scope)
  • Run the test suite — ggml standalone tests; llama-perplexity on a
    Q6_K-quantized model was not run locally (no Q6_K GGUFs on hand), so
    that check relies on maintainer CI
  • Bit-exact verification via microbench

Vectorize the 6-bit weight unpacking phase of ggml_vec_dot_q6_K_q8_K
on the WASM SIMD128 code path. PR ggml-org#11453 (Jan 2025) vectorized the
Q4/Q5/Q8 WASM paths but left Q6_K's ql/qh unpacking as a scalar loop
running 256 stores per block with per-byte bit manipulation. This PR
closes that remaining scalar region.

Approach: process 16 output lanes at once using strict (non-relaxed)
WASM SIMD128 intrinsics. For each j-iteration (128 decoded weights),
the loop now runs 2 × 16-lane chunks instead of 32 × 4 scalar stores.
All intrinsics used (v128_load/store, i8x16_splat, v128_and/or,
u8x16_shr, i8x16_shl, i8x16_sub) have fully specified semantics in
the WASM SIMD128 spec — no relaxed_simd ops — so output is bit-exact
identical across conforming implementations. Important for runtimes
that require deterministic compute (consensus-based VMs, fuelled
runtimes, reproducible-research pipelines).

Microbench (Emscripten -O3 -msimd128, Node.js v24, N=4 runs):

  Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
  -------------------------|---------|--------|-------|----------
  Baseline (scalar unpack) |  357.81 |  22.91 | 7.98% | reference
  Patched  (vectorized)    |  349.85 |  23.43 | 2.53% | identical

Speedup +2.3% mean. Per-run variance drops 3× (CV 7.98 → 2.53)
because the vectorized path has fewer branches and more predictable
cycle counts. The modest mean speedup reflects that LLVM -O3 already
extracts a non-trivial fraction of SIMD parallelism from the scalar
loop via auto-vectorization; the explicit SIMD path guarantees SIMD
codegen independent of compiler version, reduces variance, and
provides a stable baseline for further tuning.

Bit-exactness: microbench seeds Q6_K and Q8_K blocks deterministically
(xorshift32, seed=42) and compares the float result. All 8 runs
produced result=56754044928.000000 identically across baseline and
patched paths.

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.
@Simlowker Simlowker requested a review from ggerganov as a code owner April 19, 2026 21:47
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 19, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Apr 19, 2026

Hi @Simlowker, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
