feat: x86 matmul and transpose support (SDPA llama runnable) #641
oreomaker merged 3 commits into UbiquitousLearning:main
Conversation
@coderabbitai review

✅ Actions performed: Review triggered.
📝 Walkthrough

This PR introduces optimized x86 CPU kernels for BLAS matrix multiplication (GEMM/GEMV with AVX/AVX2 acceleration) and tensor transpose operations (with SIMD variants), integrating them into the MatMul and Transpose operators. It also adds support for the GGUF Q4_0 quantized format in embedding operations.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/cpu/ops/TransposeOp.cpp (1)
95-103: ⚠️ Potential issue | 🔴 Critical

Fix critical bug: `batch` initialization must be 1, not 0, to correctly compute batch dimensions.

Line 96 initializes `batch = 0`, then multiplies it in a loop — the result is always 0. The x86 dispatch on lines 102–103 and 121–122 (and the ARM equivalents at 105–106 and 124–125) inherit this bug, causing `transpose_last_dims_*` to process 0 batches and produce no output for any CASE 3 transpose operation.

Fix:

```diff
- int batch = 0;
+ int batch = 1;
  for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/backends/cpu/ops/TransposeOp.cpp` around lines 95 - 103, The batch accumulator in TransposeOp.cpp is initialized to 0 which makes subsequent multiplications yield 0; change its initialization to 1 so the loop that multiplies over the leading dims correctly computes the batch size (in the branch that checks input_shape.size() != 2 && (dim0 + dim1 == 1)); update the variable declared as int batch = 1 and leave the existing loop and downstream calls to x86::transpose_last_dims_fp32 / transpose_last_dims_fp16 (and ARM equivalents) unchanged so they receive the correct batch value.
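For a quick sanity check, here is a minimal standalone sketch of the corrected batch computation (the shape below is a made-up example, not taken from the PR):

```cpp
#include <cstdio>
#include <vector>

// Product of all leading dims except the last two. The accumulator must start
// at 1: starting at 0 collapses every product to 0, which is exactly the bug.
int leading_batch(const std::vector<int>& shape) {
  int batch = 1;  // was 0 in the buggy code
  for (size_t i = 0; i + 2 < shape.size(); ++i) batch *= shape[i];
  return batch;
}

int main() {
  std::vector<int> shape = {2, 8, 64, 128};  // hypothetical [B, H, S, D]
  std::printf("batch = %d\n", leading_batch(shape));  // prints 16, not 0
}
```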
🧹 Nitpick comments (2)
mllm/core/aops/EmbeddingOp.cpp (1)
74-77: LGTM — also incidentally fixes a pre-existing Q4_K output-dtype bug

Before this change, a `kGGUF_Q4_K` weight would fall through to `out_dtype = weight_.dtype()` (line 74), allocating the output tensor with dtype `kGGUF_Q4_K`. `CPUEmbeddingOp::forward` then wrote `float` values into it via `coffsettedPtr<float>`, producing a type mismatch. This PR correctly maps both quantised formats to `kFloat32`.

Optionally, the two sequential `if` statements can be made an `else if` chain to make the mutual exclusivity explicit and avoid evaluating the second condition when the first matches.

♻️ Optional: use else-if chain

```diff
  auto out_dtype = weight_.dtype();
- if (weight_.dtype() == kUInt16) { out_dtype = kUInt16PerTensorAsy; }
- if (weight_.dtype() == kGGUF_Q4_0 || weight_.dtype() == kGGUF_Q4_K) { out_dtype = kFloat32; }
+ if (weight_.dtype() == kUInt16) {
+   out_dtype = kUInt16PerTensorAsy;
+ } else if (weight_.dtype() == kGGUF_Q4_0 || weight_.dtype() == kGGUF_Q4_K) {
+   out_dtype = kFloat32;
+ }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/core/aops/EmbeddingOp.cpp` around lines 74 - 77, The current logic sets out_dtype from weight_.dtype() then overwrites it for certain formats; change the two separate conditionals to an explicit mutually-exclusive chain so only one branch runs: check weight_.dtype() once, then use "if (weight_.dtype() == kUInt16) { out_dtype = kUInt16PerTensorAsy; } else if (weight_.dtype() == kGGUF_Q4_0 || weight_.dtype() == kGGUF_Q4_K) { out_dtype = kFloat32; }" to ensure kGGUF_Q4_K and kGGUF_Q4_0 map to kFloat32 before calling Tensor::empty(o_shape, out_dtype, i.device()), avoiding the CPUEmbeddingOp::forward write via coffsettedPtr<float> into a quantised-typed tensor.

mllm/backends/cpu/kernels/x86/mllm_blas/mllm_blas_sgemm.hpp (1)
43-61: Guard against null-pointer arithmetic on `C`.

When `C` is `nullptr` (common — `MatMulOp.cpp` passes `nullptr`), `C + b * C_batch_stride` is technically undefined behavior even when the stride is 0. The same issue exists in the `__mllm_blas_batch_matmul_fp32_gemv_nt_nt_decode_small_d_wv` template (lines 90–108).

🛡️ Suggested defensive fix (applies to both templates)

```diff
  auto a_ptr = A + b * A_batch_stride;
  auto b_ptr = B + b * B_batch_stride;
- auto c_ptr = C + b * C_batch_stride;
+ auto c_ptr = C ? C + b * C_batch_stride : nullptr;
  auto d_ptr = dst + b * Dst_batch_stride;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/backends/cpu/kernels/x86/mllm_blas/mllm_blas_sgemm.hpp` around lines 43 - 61, The pointer arithmetic on C can invoke UB when C == nullptr; change the per-batch pointer computation in both templates (the loop in the function containing __mllm_blas_matmul_fp32_gemv_nt_t_decode_small_d_qk and the other template __mllm_blas_batch_matmul_fp32_gemv_nt_nt_decode_small_d_wv) to compute c_ptr conditionally (e.g., set c_ptr to nullptr when C is nullptr, otherwise C + b * C_batch_stride) both inside the MLLM_AUTO_PARALLEL_FOR_BEGIN_NT block and the fallback for-loop so you never perform pointer arithmetic on a null base pointer.
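As a general illustration of the suggested guard (names and signature are invented for this sketch, not the project's API):

```cpp
#include <cstddef>

// Per-batch pointer setup for a batched kernel with an optional bias C.
// Offsetting a null base pointer is UB in C++ even when the offset is 0,
// so c_ptr is only derived from C when C is actually present.
void run_batched(const float* A, const float* B, const float* C, float* dst,
                 int batch_count, std::size_t A_stride, std::size_t B_stride,
                 std::size_t C_stride, std::size_t dst_stride) {
  for (int b = 0; b < batch_count; ++b) {
    const float* a_ptr = A + b * A_stride;
    const float* b_ptr = B + b * B_stride;
    const float* c_ptr = C ? C + b * C_stride : nullptr;  // the guard
    float* d_ptr = dst + b * dst_stride;
    // ... invoke the single-batch kernel with a_ptr, b_ptr, c_ptr, d_ptr ...
    (void)a_ptr; (void)b_ptr; (void)c_ptr; (void)d_ptr;
  }
}
```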
```cpp
__m256 a_vec =
    _mm256_set_ps(A[k + 3], A[k + 3], A[k + 2], A[k + 2], A[k + 1], A[k + 1], A[k], A[k]); // For broadcasting later
```
Dead code: a_vec is computed but never used.
The _mm256_set_ps(...) result on line 174 is assigned to a_vec but the subsequent code uses individual scalar broadcasts (a0_vec, a1_vec, etc.) instead. Remove the unused variable to avoid confusion and compiler warnings.
🧹 Proposed fix
```diff
  for (; k <= K - 4; k += 4) {
-   __m256 a_vec =
-       _mm256_set_ps(A[k + 3], A[k + 3], A[k + 2], A[k + 2], A[k + 1], A[k + 1], A[k], A[k]); // For broadcasting later
    float a0 = A[k + 0];
    float a1 = A[k + 1];
    float a2 = A[k + 2];
    float a3 = A[k + 3];
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/kernels/x86/mllm_blas/mllm_blas_sgemm.cpp` around lines 174
- 175, Remove the unused AVX variable a_vec declared with _mm256_set_ps in this
SGEMM implementation: the expression assigned to a_vec is never used later (the
code uses a0_vec, a1_vec, etc.), so delete the declaration of a_vec to eliminate
dead code and compiler warnings; verify no other code depends on the a_vec
symbol in the same function.
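For reference, a self-contained sketch of the scalar-broadcast pattern the surrounding code actually relies on (`_mm256_set1_ps` per scalar), which is what makes the interleaved `_mm256_set_ps` value dead. The loop body here is illustrative, not the project's kernel; compile with -mavx2 -mfma:

```cpp
#include <immintrin.h>

// y[0..7] += a0 * x0[0..7] + a1 * x1[0..7]. Each scalar is broadcast to all
// eight lanes with _mm256_set1_ps; no packed _mm256_set_ps value is needed.
void axpy2_avx(float a0, float a1, const float* x0, const float* x1, float* y) {
  __m256 a0_vec = _mm256_set1_ps(a0);  // broadcast a0
  __m256 a1_vec = _mm256_set1_ps(a1);  // broadcast a1
  __m256 acc = _mm256_loadu_ps(y);
  acc = _mm256_fmadd_ps(a0_vec, _mm256_loadu_ps(x0), acc);
  acc = _mm256_fmadd_ps(a1_vec, _mm256_loadu_ps(x1), acc);
  _mm256_storeu_ps(y, acc);
}
```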
```cpp
  }
  return;
} else {
  NYI("transpose_a && transpose_b not supported not supported in mllm_blas_matmul_fp32 gemm/gemv");
```
Typo: doubled "not supported" in NYI message.
Same typo on line 527.
✏️ Proposed fix
- NYI("transpose_a && transpose_b not supported not supported in mllm_blas_matmul_fp32 gemm/gemv");
+ NYI("transpose_a && transpose_b not supported in mllm_blas_matmul_fp32 gemm/gemv");And line 527:
- NYI("transpose_a && transpose_b not supported not supported in mllm_blas_matmul_fp32 gemm/gemv");
+ NYI("transpose_a && transpose_b not supported in mllm_blas_batch_matmul_fp32 gemm/gemv");📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/kernels/x86/mllm_blas/mllm_blas_sgemm.cpp` at line 483, Fix
the typo in the NYI message used in mllm_blas_matmul_fp32 (gemm/gemv) by
removing the duplicated phrase "not supported" so the message reads e.g.
"transpose_a && transpose_b not supported in mllm_blas_matmul_fp32 gemm/gemv";
update both NYI(...) calls (the two occurrences within mllm_blas_matmul_fp32) to
use the corrected single "not supported" wording.
```cpp
// Parallel is in the inner loops, not here.
for (int i = 0; i < BATCH; ++i) {
  __mllm_blas_sgemm_nt_t(M, N, K, A + i * A_batch_stride, K, B + i * B_batch_stride, K, dst + i * Dst_batch_stride, N, 0,
                         C, thread_count);
}
```
Batch NT_T GEMM path does not offset C (bias) by batch stride.
The GEMV batch templates (__mllm_blas_batch_matmul_fp32_gemv_nt_t_decode_small_d_qk) correctly compute c_ptr = C + b * C_batch_stride per batch, but the GEMM path here passes the same C pointer for every batch iteration. If C is ever non-null with a per-batch layout, this will produce incorrect results.
Currently `C` is always `nullptr` from MatMulOp.cpp, so there is no immediate breakage, but this is inconsistent with the GEMV path and leaves a latent bug.
🐛 Suggested fix for consistency
```diff
  for (int i = 0; i < BATCH; ++i) {
    __mllm_blas_sgemm_nt_t(M, N, K, A + i * A_batch_stride, K, B + i * B_batch_stride, K, dst + i * Dst_batch_stride, N, 0,
-                          C, thread_count);
+                          C ? C + i * C_batch_stride : nullptr, thread_count);
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/kernels/x86/mllm_blas/mllm_blas_sgemm.cpp` around lines 519
- 523, The batch GEMM loop passes the same C pointer for every batch, causing
incorrect bias usage when C is per-batch; inside the for loop that calls
__mllm_blas_sgemm_nt_t, compute a per-batch c_ptr = (C ? C + i * C_batch_stride
: nullptr) (matching the GEMV pattern) and pass c_ptr instead of C; update the
call site in the batch loop to use that c_ptr so each batch uses C + i *
C_batch_stride when C is non-null.
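For context, a hedged sketch of how per-batch strides are commonly derived for this NT_T layout; the concrete values in the PR come from MatMulOp.cpp, and the shapes here are assumptions:

```cpp
#include <cstddef>

// Row-major, per-batch element counts for A[M][K], B[N][K] (NT_T: B is
// transposed), an optional bias C[N], and dst[M][N]. Batch i of each operand
// then starts at base + i * stride, including C when it is non-null.
struct BatchStrides {
  std::size_t a, b, c, dst;
};

inline BatchStrides nt_t_batch_strides(std::size_t M, std::size_t N, std::size_t K) {
  return BatchStrides{M * K, N * K, N, M * N};
}
```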
```cpp
case kGGUF_Q4_0: {
  auto token_idx = *ins.coffsettedPtr<mllm_int64_t>({b, (int)s});
  if (token_idx >= 0) {
    dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0,
                        ous.coffsettedPtr<float>({b, (int)s, 0}), options_.hidden_size);
  }
  break;
}
```
Uninitialized output row when token_idx < 0
When token_idx < 0, the output row at {b, s, 0} is silently left uninitialized. This is consistent with the kGGUF_Q4_K case, but both share the same gap: a downstream consumer reading that row will see garbage. The kFloat32/kFloat16 cases are even less defensive (no guard at all), but since Q4_0 adds the check, a zero-fill for the negative branch would make it fully safe.
🛡️ Optional: zero-fill the output row for negative token indices
```diff
  case kGGUF_Q4_0: {
    auto token_idx = *ins.coffsettedPtr<mllm_int64_t>({b, (int)s});
    if (token_idx >= 0) {
      dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0,
                          ous.coffsettedPtr<float>({b, (int)s, 0}), options_.hidden_size);
+   } else {
+     std::memset(ous.coffsettedPtr<char>({b, (int)s, 0}), 0, options_.hidden_size * sizeof(float));
    }
    break;
  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` around lines 49 - 56, In the
kGGUF_Q4_0 branch of the switch, when token_idx < 0 the output row at
ous.coffsettedPtr<float>({b, (int)s, 0}) is left uninitialized; update the
negative-token branch to explicitly zero-fill that output row (length
options_.hidden_size) instead of doing nothing. Use the same target pointer used
for dequantize_row_q4_0 (ous.coffsettedPtr<float>({b, (int)s, 0})) and fill
options_.hidden_size floats (e.g., memset to 0 or std::fill_n) so downstream
readers see zeros; keep the dequantize call for token_idx >= 0 unchanged.
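A minimal sketch of the zero-fill convention for negative indices (the table pointer and `hidden` are stand-ins, and a plain float table replaces the Q4_0 dequantization for brevity):

```cpp
#include <algorithm>
#include <cstdint>

// Write one embedding row: copy for valid token ids, zero-fill for negative
// ids so downstream consumers never read an uninitialized row.
void write_row(std::int64_t token_idx, float* out_row, int hidden,
               const float* table) {
  if (token_idx >= 0) {
    std::copy_n(table + token_idx * hidden, hidden, out_row);
  } else {
    std::fill_n(out_row, hidden, 0.0f);  // explicit zeros, not garbage
  }
}
```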
```cpp
    }
    break;
  }
  default: NYI("Not supported weight dtype for arm llm embedding token op");
```
Misleading "arm" in NYI error message
The default-case error message "Not supported weight dtype for arm llm embedding token op" refers to "arm" in the generic CPU backend that now also serves x86 paths (the whole point of this PR). This message will appear to x86 users encountering an unsupported dtype.
✏️ Suggested fix
```diff
- default: NYI("Not supported weight dtype for arm llm embedding token op");
+ default: NYI("Not supported weight dtype for CPU embedding token op");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` at line 57, The NYI message in the
default branch of the dtype switch in EmbeddingOp.cpp is misleadingly specific
to "arm"; update the error emitted by NYI("Not supported weight dtype for arm
llm embedding token op") to a generic message (e.g., "Not supported weight dtype
for llm embedding token op" or include the actual backend name dynamically) so
it no longer references "arm"—locate the default case in the dtype switch inside
the embedding operator implementation and replace the string passed to NYI
accordingly.
Please check Guidelines for Contributing.