During the tuning phase, we observed an invalid config as follows:
```mlir
module attributes {dlti.target_system_spec = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : ui32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : ui64>, #dlti.dl_entry<"L3_cache_size_in_bytes", 110100480 : ui64>, #dlti.dl_entry<"num_threads", 56 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i64>>>} {
  func.func @entry(%arg0: tensor<128x11008xbf16>, %arg1: tensor<11008x4096xbf16>) -> tensor<128x4096xbf16> attributes {llvm.emit_c_interface} {
    %cst = arith.constant 0.000000e+00 : bf16
    %0 = tensor.empty() : tensor<128x4096xbf16>
    %1 = linalg.fill ins(%cst : bf16) outs(%0 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
    %2 = linalg.matmul {KBlock = 4096 : i32, KThreads = 2 : i32, MBlock = 32 : i32, MThreads = 1 : i32, NBlock = 32 : i32, NThreads = 28 : i32, cast = #linalg.type_fn<cast_signed>, innermostKBlock = 32 : i32, innermostMBlock = 32 : i32, innermostNBlock = 32 : i32} ins(%arg0, %arg1 : tensor<128x11008xbf16>, tensor<11008x4096xbf16>) outs(%1 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
    return %2 : tensor<128x4096xbf16>
  }
}
```
In this case, the existing tiling logic does not correctly handle the boundary of the K dimension, generating code like:
```mlir
%19 = scf.for %arg10 = %c0 to %c172 step %c128 iter_args(%arg11 = %extracted_slice_8) -> (tensor<32x32xf32>) {
  %21 = affine.apply affine_map<(d0) -> (d0 * 32)>(%arg10)
  %extracted_slice_10 = tensor.extract_slice %extracted_slice_4[0, %21] [32, 4096] [1, 1] : tensor<32x5504xbf16> to tensor<32x4096xbf16>
```
This leads to an out-of-bounds access at runtime.
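To make the boundary bug concrete, here is a minimal Python sketch of the offsets the generated loop computes. The constants are taken from the IR above (loop bounds `%c0`/`%c172`/`%c128`, the `d0 * 32` affine map, and the `[32, 4096]` slice size against a K extent of 5504); all names are illustrative, not part of the compiler.

```python
# Constants read off the generated IR above.
K_EXTENT = 5504      # K size of %extracted_slice_4 (tensor<32x5504xbf16>)
UPPER, STEP = 172, 128  # scf.for %c0 to %c172 step %c128
SLICE_SIZE = 4096    # K size of each tensor.extract_slice
SCALE = 32           # affine_map<(d0) -> (d0 * 32)>

def tile_ranges():
    """Yield the half-open [start, end) K-range each extract_slice reads."""
    for iv in range(0, UPPER, STEP):
        start = iv * SCALE
        yield (start, start + SLICE_SIZE)

ranges = list(tile_ranges())
# The first tile reads [0, 4096), which is in bounds; the second reads
# [4096, 8192), running past the K extent of 5504 -- the reported
# out-of-bounds access. A correct tiling would clamp the tail tile to
# min(SLICE_SIZE, K_EXTENT - start) = 1408 elements.
oob = [(s, e) for (s, e) in ranges if e > K_EXTENT]
```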