
ggml : vectorize Q6_K unpack on WASM SIMD128 (strict, deterministic)#22134

Open
Simlowker wants to merge 1 commit intoggml-org:masterfrom
Simlowker:q6k-wasm-vectorize

Conversation

@Simlowker

Summary

Vectorize the 6-bit weight unpacking phase of ggml_vec_dot_q6_K_q8_K on the
WASM SIMD128 code path in ggml-cpu/arch/wasm/quants.c. PR #11453 (Jan 2025)
vectorized the Q4/Q5/Q8 WASM paths but left Q6_K's ql/qh unpacking as a
scalar loop. This PR closes that remaining scalar region.

Motivation

For models quantized in Q6_K running on WASM SIMD128 environments — including
(but not limited to) deterministic/fuelled runtimes like Internet Computer
canisters, Wasmtime with --wasm-features simd, WasmEdge, and WebLLM/MLC —
the Q6_K dot-product is a hot inner loop. Its Phase 2 (dot + scaling) was
already SIMD. Phase 1 (decode ql[64] + qh[32] → int8 aux8[256]) was 256
scalar stores per block with per-byte bit manipulation.

In resource-bounded environments (per-call instruction quotas, deterministic
metering), reducing the Phase 1 instruction count has a direct effect on
measurable throughput. The optimization also removes a dependency on
compiler auto-vectorization for a consistent code path across LLVM versions.

Approach

Process 16 output lanes at once using strict (non-relaxed) WASM SIMD128
intrinsics. For each j-iteration (128 decoded weights), the loop now
runs 2 × 16-lane chunks instead of 32 × 4 scalar stores.

Per chunk:

v128_t q4_lo_src = wasm_v128_load(q4 + chunk);       // q4[chunk..chunk+15]
v128_t q4_hi_src = wasm_v128_load(q4 + chunk + 32);  // q4[chunk+32..chunk+47]
v128_t qh_src    = wasm_v128_load(qh + chunk);       // qh[chunk..chunk+15]

// Low nibbles of q4 via mask (0x0F)
v128_t q4_lo_nib = wasm_v128_and(q4_lo_src, mask_0F);
v128_t q4_hi_nib = wasm_v128_and(q4_hi_src, mask_0F);

// High nibbles of q4 via unsigned 8-bit shift right
v128_t q4_lo_hnib = wasm_u8x16_shr(q4_lo_src, 4);
v128_t q4_hi_hnib = wasm_u8x16_shr(q4_hi_src, 4);

// Extract 4 × 2-bit groups from qh
v128_t qh_b01 = wasm_v128_and(qh_src, mask_03);
v128_t qh_b23 = wasm_v128_and(wasm_u8x16_shr(qh_src, 2), mask_03);
v128_t qh_b45 = wasm_v128_and(wasm_u8x16_shr(qh_src, 4), mask_03);
v128_t qh_b67 = wasm_u8x16_shr(qh_src, 6);  // only 2 bits remain

// Merge: shift qh bits into upper nibble, OR with q4 nibbles, subtract bias 32
v128_t out_0  = wasm_i8x16_sub(wasm_v128_or(q4_lo_nib,  wasm_i8x16_shl(qh_b01, 4)), bias_32);
v128_t out_32 = wasm_i8x16_sub(wasm_v128_or(q4_hi_nib,  wasm_i8x16_shl(qh_b23, 4)), bias_32);
v128_t out_64 = wasm_i8x16_sub(wasm_v128_or(q4_lo_hnib, wasm_i8x16_shl(qh_b45, 4)), bias_32);
v128_t out_96 = wasm_i8x16_sub(wasm_v128_or(q4_hi_hnib, wasm_i8x16_shl(qh_b67, 4)), bias_32);

wasm_v128_store(a + chunk +  0, out_0);
wasm_v128_store(a + chunk + 32, out_32);
wasm_v128_store(a + chunk + 64, out_64);
wasm_v128_store(a + chunk + 96, out_96);

Determinism

No relaxed SIMD ops (wasm_*_relaxed_*, i32x4.relaxed_dot_i8x16_i7x16,
f32x4.relaxed_madd, etc.) are used. All intrinsics employed
(v128_load/store, i8x16_splat, v128_and/or, u8x16_shr, i8x16_shl,
i8x16_sub) have fully specified semantics in the WASM SIMD128 spec and
produce bit-identical output across conforming implementations.

This matters for environments that require deterministic compute across
replicas (consensus-based VMs, reproducible research pipelines, debugging
deterministic replays).

Microbench results

Measured with Emscripten -O3 -msimd128, Node.js v24, N=4 runs:

  Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
  -------------------------|---------|--------|-------|----------
  Baseline (scalar unpack) |  357.81 |  22.91 | 7.98% | reference
  Patched  (vectorized)    |  349.85 |  23.43 | 2.53% | identical

Speedup: +2.3% mean. Per-run variance drops 3× (CV 7.98% → 2.53%) because
the vectorized path has fewer branches and more predictable cycle counts.

The modest mean speedup reflects that LLVM -O3 already extracts a
non-trivial fraction of the SIMD parallelism from the scalar loop via its
auto-vectorizer. The explicit SIMD code path:

  1. Guarantees SIMD codegen independent of compiler version / flags.
  2. Reduces run-to-run variance (useful for deterministic metering and
    reproducibility audits).
  3. Provides a stable baseline for further kernel-level tuning.

Bit-exactness regression test

The microbench harness (matmul-bench/q6k_vectorize_bench.c in the
external project) generates deterministic Q6_K and Q8_K blocks
(xorshift32 seeded to 42), runs both variants, and compares the float
result. All 8 runs produced result=56754044928.000000 identically,
demonstrating bit-exact equivalence.

Testing

  • Microbench compiles with emcc -O3 -msimd128
  • Bit-exact output vs scalar baseline (seed=42, 16 blocks × 256 elements)
  • No measurable regression on repeated runs
  • Compiles into downstream canister build (icpp-pro 5.3.1, wasi-sdk 25.0,
    target wasm32-wasi, with -msimd128)

Related work

This patch came out of running Q6_K-quantized models inside the Internet
Computer's WASM runtime, where matmul dominates the per-call instruction
budget. The change is small and self-contained and independently useful
to anyone running Q6_K on WASM SIMD128.

Files changed

Only ggml/src/ggml-cpu/arch/wasm/quants.c — specifically the
ggml_vec_dot_q6_K_q8_K function's #if defined __wasm_simd128__ branch,
Phase 1 loop (lines 1115-1130 in the pre-PR state).

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.

Checklist

  • Fork the latest version of the upstream repository and create a PR from that fork
  • Make only the changes described above (single-function, single-arch scope)
  • Run the test suite — ggml standalone tests; llama-perplexity on a
    Q6_K-quantized model was not run locally (no Q6_K GGUFs on hand), so
    that check relies on maintainer CI
  • Bit-exact verification via microbench

Vectorize the 6-bit weight unpacking phase of ggml_vec_dot_q6_K_q8_K
on the WASM SIMD128 code path. PR ggml-org#11453 (Jan 2025) vectorized the
Q4/Q5/Q8 WASM paths but left Q6_K's ql/qh unpacking as a scalar loop
running 256 stores per block with per-byte bit manipulation. This PR
closes that remaining scalar region.

Approach: process 16 output lanes at once using strict (non-relaxed)
WASM SIMD128 intrinsics. For each j-iteration (128 decoded weights),
the loop now runs 2 × 16-lane chunks instead of 32 × 4 scalar stores.
All intrinsics used (v128_load/store, i8x16_splat, v128_and/or,
u8x16_shr, i8x16_shl, i8x16_sub) have fully specified semantics in
the WASM SIMD128 spec — no relaxed_simd ops — so output is bit-exact
identical across conforming implementations. Important for runtimes
that require deterministic compute (consensus-based VMs, fuelled
runtimes, reproducible-research pipelines).

Microbench (Emscripten -O3 -msimd128, Node.js v24, N=4 runs):

  Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
  -------------------------|---------|--------|-------|----------
  Baseline (scalar unpack) |  357.81 |  22.91 | 7.98% | reference
  Patched  (vectorized)    |  349.85 |  23.43 | 2.53% | identical

Speedup +2.3% mean. Per-run variance drops 3× (CV 7.98 → 2.53)
because the vectorized path has fewer branches and more predictable
cycle counts. The modest mean speedup reflects that LLVM -O3 already
extracts a non-trivial fraction of SIMD parallelism from the scalar
loop via auto-vectorization; the explicit SIMD path guarantees SIMD
codegen independent of compiler version, reduces variance, and
provides a stable baseline for further tuning.

Bit-exactness: microbench seeds Q6_K and Q8_K blocks deterministically
(xorshift32, seed=42) and compares the float result. All 8 runs
produced result=56754044928.000000 identically across baseline and
patched paths.

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.
@Simlowker Simlowker requested a review from ggerganov as a code owner April 19, 2026 21:47
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 19, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Apr 19, 2026

Hi @Simlowker, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
