From d7311be003b00818b8181ad22196a9bde88452a6 Mon Sep 17 00:00:00 2001
From: Simlowker
Date: Sun, 19 Apr 2026 22:56:31 +0200
Subject: [PATCH] ggml : vectorize Q6_K unpack on WASM SIMD128 (strict,
 deterministic)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Vectorize the 6-bit weight unpacking phase of ggml_vec_dot_q6_K_q8_K on
the WASM SIMD128 code path. PR #11453 (Jan 2025) vectorized the Q4/Q5/Q8
WASM paths but left Q6_K's ql/qh unpacking as a scalar loop running 256
stores per block with per-byte bit manipulation. This PR closes that
remaining scalar region.

Approach: process 16 output lanes at once using strict (non-relaxed)
WASM SIMD128 intrinsics. For each j-iteration (128 decoded weights), the
loop now runs 2 × 16-lane chunks instead of 32 × 4 scalar stores. All
intrinsics used (v128_load/store, i8x16_splat, v128_and/or, u8x16_shr,
i8x16_shl, i8x16_sub) have fully specified semantics in the WASM SIMD128
spec, with no relaxed_simd ops, so output is bit-exact across conforming
implementations. This matters for runtimes that require deterministic
compute (consensus-based VMs, fuelled runtimes, reproducible-research
pipelines).

Microbench (Emscripten -O3 -msimd128, Node.js v24, N=4 runs):

Variant                  | ns/iter | GFLOPS | CV    | Bit-exact
-------------------------|---------|--------|-------|----------
Baseline (scalar unpack) |  357.81 |  22.91 | 7.98% | reference
Patched (vectorized)     |  349.85 |  23.43 | 2.53% | identical

Mean speedup is +2.3%. Per-run variance drops 3× (CV 7.98 → 2.53)
because the vectorized path has fewer branches and more predictable
cycle counts. The modest mean speedup reflects that LLVM -O3 already
extracts a non-trivial fraction of SIMD parallelism from the scalar loop
via auto-vectorization; the explicit SIMD path guarantees SIMD codegen
independent of compiler version, reduces variance, and provides a stable
baseline for further tuning.
Bit-exactness: microbench seeds Q6_K and Q8_K blocks deterministically
(xorshift32, seed=42) and compares the float result. All 8 runs produced
result=56754044928.000000 identically across baseline and patched paths.

Non-WASM backends (AVX2, NEON, RVV, generic scalar fallback) are
unchanged.
---
 ggml/src/ggml-cpu/arch/wasm/quants.c | 43 ++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/ggml/src/ggml-cpu/arch/wasm/quants.c b/ggml/src/ggml-cpu/arch/wasm/quants.c
index 648c6fcaba7..4cbf43fc9a5 100644
--- a/ggml/src/ggml-cpu/arch/wasm/quants.c
+++ b/ggml/src/ggml-cpu/arch/wasm/quants.c
@@ -1137,17 +1137,48 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     int32_t aux32[8] __attribute__((aligned(16))) = {0};
     float sums[8] __attribute__((aligned(16))) = {0};
 
+    // Vectorized unpack constants (strict SIMD only, deterministic)
+    const v128_t mask_0F = wasm_i8x16_splat(0x0F);
+    const v128_t mask_03 = wasm_i8x16_splat(0x03);
+    const v128_t bias_32 = wasm_i8x16_splat(32);
+
     for (int i = 0; i < nb; ++i) {
-        // Unpack 6-bit quantized data into aux8 (unchanged)
+        // Unpack 6-bit quantized data into aux8 (vectorized)
         const uint8_t * GGML_RESTRICT q4 = x[i].ql;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         int8_t * a = aux8;
         for (int j = 0; j < QK_K; j += 128) {
-            for (int l = 0; l < 32; ++l) {
-                a[l +  0] = (int8_t)((q4[l +  0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
-                a[l + 32] = (int8_t)((q4[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
-                a[l + 64] = (int8_t)((q4[l +  0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
-                a[l + 96] = (int8_t)((q4[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;
+            // Two 16-lane chunks per j-iteration (covers l = 0..31 of the
+            // scalar version). Each chunk decodes 64 output weights.
+            for (int chunk = 0; chunk < 32; chunk += 16) {
+                const v128_t q4_lo_src = wasm_v128_load(q4 + chunk);
+                const v128_t q4_hi_src = wasm_v128_load(q4 + chunk + 32);
+                const v128_t qh_src    = wasm_v128_load(qh + chunk);
+
+                const v128_t q4_lo_nib  = wasm_v128_and(q4_lo_src, mask_0F);
+                const v128_t q4_hi_nib  = wasm_v128_and(q4_hi_src, mask_0F);
+                const v128_t q4_lo_hnib = wasm_u8x16_shr(q4_lo_src, 4);
+                const v128_t q4_hi_hnib = wasm_u8x16_shr(q4_hi_src, 4);
+
+                const v128_t qh_b01 = wasm_v128_and(qh_src, mask_03);
+                const v128_t qh_b23 = wasm_v128_and(wasm_u8x16_shr(qh_src, 2), mask_03);
+                const v128_t qh_b45 = wasm_v128_and(wasm_u8x16_shr(qh_src, 4), mask_03);
+                const v128_t qh_b67 = wasm_u8x16_shr(qh_src, 6); // only 2 bits left
+
+                const v128_t qh_b01_sh = wasm_i8x16_shl(qh_b01, 4);
+                const v128_t qh_b23_sh = wasm_i8x16_shl(qh_b23, 4);
+                const v128_t qh_b45_sh = wasm_i8x16_shl(qh_b45, 4);
+                const v128_t qh_b67_sh = wasm_i8x16_shl(qh_b67, 4);
+
+                const v128_t out_0  = wasm_i8x16_sub(wasm_v128_or(q4_lo_nib,  qh_b01_sh), bias_32);
+                const v128_t out_32 = wasm_i8x16_sub(wasm_v128_or(q4_hi_nib,  qh_b23_sh), bias_32);
+                const v128_t out_64 = wasm_i8x16_sub(wasm_v128_or(q4_lo_hnib, qh_b45_sh), bias_32);
+                const v128_t out_96 = wasm_i8x16_sub(wasm_v128_or(q4_hi_hnib, qh_b67_sh), bias_32);
+
+                wasm_v128_store(a + chunk +  0, out_0);
+                wasm_v128_store(a + chunk + 32, out_32);
+                wasm_v128_store(a + chunk + 64, out_64);
+                wasm_v128_store(a + chunk + 96, out_96);
             }
             a += 128;
             q4 += 64;