ggml : vectorize Q6_K unpack on WASM SIMD128 (strict, deterministic) #22134
Open
Simlowker wants to merge 1 commit into ggml-org:master
Conversation
Hi @Simlowker, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Summary
Vectorize the 6-bit weight unpacking phase of `ggml_vec_dot_q6_K_q8_K` on the WASM SIMD128 code path in `ggml-cpu/arch/wasm/quants.c`. PR #11453 (Jan 2025) vectorized the Q4/Q5/Q8 WASM paths but left Q6_K's `ql`/`qh` unpacking as a scalar loop. This PR closes that remaining scalar region.
Motivation
For models quantized in Q6_K running in WASM SIMD128 environments (including, but not limited to, deterministic/fuelled runtimes such as Internet Computer canisters, Wasmtime with `--wasm-features simd`, WasmEdge, and WebLLM/MLC), the Q6_K dot-product is a hot inner loop. Its Phase 2 (dot + scaling) was already SIMD. Phase 1 (decoding `ql[64]` + `qh[32]` → `int8 aux8[256]`) was 256 scalar stores per block with per-byte bit manipulation.
In resource-bounded environments (per-call instruction quotas, deterministic
metering), reducing the Phase 1 instruction count has a direct effect on
measurable throughput. The optimization also removes a dependency on
compiler auto-vectorization for a consistent code path across LLVM versions.
Approach
Process 16 output lanes at once using strict (non-relaxed) WASM SIMD128 intrinsics. For each `j`-iteration (128 decoded weights), the loop now runs 2 × 16-lane chunks instead of 32 × 4 scalar stores.
Per chunk: load 16 `ql` bytes and extract the 4-bit field (`v128_and`/`u8x16_shr`), load the matching `qh` bytes and shift the 2-bit field into position (`u8x16_shr`, `v128_and`, `i8x16_shl`), OR the two fields together, subtract the splatted bias of 32 (`i8x16_sub`), and store 16 `int8` lanes (`v128_store`).
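For reference, a minimal scalar sketch of the Phase 1 decode that the chunks above vectorize, paraphrased from ggml's generic (non-SIMD) Q6_K path; the function name and array names here are illustrative, not the PR's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for the Q6_K Phase 1 decode: each output weight
 * combines a 4-bit field from ql with a 2-bit field from qh, then
 * subtracts the Q6_K bias of 32. One call decodes one 128-weight half
 * (one j-iteration in the dot product). */
static void q6k_unpack_half_scalar(const uint8_t *ql,   /* 64 bytes */
                                   const uint8_t *qh,   /* 32 bytes */
                                   int8_t       *out) { /* 128 weights */
    for (int l = 0; l < 32; ++l) {
        out[l +  0] = (int8_t)(((ql[l +  0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32);
        out[l + 32] = (int8_t)(((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32);
        out[l + 64] = (int8_t)(((ql[l +  0] >>  4) | (((qh[l] >> 4) & 3) << 4)) - 32);
        out[l + 96] = (int8_t)(((ql[l + 32] >>  4) | (((qh[l] >> 6) & 3) << 4)) - 32);
    }
}
```

The vectorized path computes the same lanes 16 at a time: the masks, shifts, OR, and bias subtraction map directly onto `v128_and`, `u8x16_shr`/`i8x16_shl`, `v128_or`, and `i8x16_sub`.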
Determinism
No relaxed SIMD ops (`wasm_*_relaxed_*`, `i32x4.relaxed_dot_i8x16_i7x16`, `f32x4.relaxed_madd`, etc.) are used. All intrinsics employed (`v128_load`/`v128_store`, `i8x16_splat`, `v128_and`/`v128_or`, `u8x16_shr`, `i8x16_shl`, `i8x16_sub`) have fully specified semantics in the WASM SIMD128 spec and produce bit-exact identical output across conforming implementations.
This matters for environments that require deterministic compute across
replicas (consensus-based VMs, reproducible research pipelines, debugging
deterministic replays).
Microbench results
Measured with Emscripten `-O3 -msimd128`, Node.js v24, N=4 runs:

Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
-------------------------|---------|--------|-------|----------
Baseline (scalar unpack) | 357.81  | 22.91  | 7.98% | reference
Patched (vectorized)     | 349.85  | 23.43  | 2.53% | identical

Speedup: +2.3% mean. Per-run variance drops 3× (CV 7.98% → 2.53%) because the vectorized path has fewer branches and more predictable cycle counts.
The modest mean speedup reflects that LLVM `-O3` already extracts a non-trivial fraction of the SIMD parallelism from the scalar loop via its auto-vectorizer. The explicit SIMD code path guarantees SIMD codegen independent of compiler version, reduces run-to-run variance, and provides a stable baseline for further tuning (and for reproducibility audits).
Bit-exactness regression test
The microbench harness (`matmul-bench/q6k_vectorize_bench.c`, in an external project) generates deterministic Q6_K and Q8_K blocks (xorshift32 seeded to 42), runs both variants, and compares the `float` result. All 8 total runs produced `result=56754044928.000000` identically, demonstrating bit-exact equivalence.
Testing
Built with `emcc -O3 -msimd128` (target `wasm32-wasi`, with `-msimd128`).
Related work
This patch came out of running Q6_K-quantized models inside the Internet
Computer's WASM runtime, where matmul dominates the per-call instruction
budget. The change is small, self-contained, and independently useful to anyone running Q6_K on WASM SIMD128.
Files changed
Only `ggml/src/ggml-cpu/arch/wasm/quants.c`: specifically the `ggml_vec_dot_q6_K_q8_K` function's `#if defined(__wasm_simd128__)` branch, Phase 1 loop (lines 1115-1130 in the pre-PR state).
Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.
Checklist
`ggml` standalone tests; `llama-perplexity` on a Q6_K-quantized model. PR author's environment doesn't have GGUFs in Q6_K on hand for `llama-perplexity`; relying on maintainer CI for that.