Description
During the tuning phase, we observed the following invalid config:
module attributes {dlti.target_system_spec = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : ui32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : ui64>, #dlti.dl_entry<"L3_cache_size_in_bytes", 110100480 : ui64>, #dlti.dl_entry<"num_threads", 56 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i64>>>} {
func.func @entry(%arg0: tensor<128x11008xbf16>, %arg1: tensor<11008x4096xbf16>) -> tensor<128x4096xbf16> attributes {llvm.emit_c_interface} {
%cst = arith.constant 0.000000e+00 : bf16
%0 = tensor.empty() : tensor<128x4096xbf16>
%1 = linalg.fill ins(%cst : bf16) outs(%0 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
%2 = linalg.matmul {KBlock = 4096 : i32, KThreads = 2 : i32, MBlock = 32 : i32, MThreads = 1 : i32, NBlock = 32 : i32, NThreads = 28 : i32, cast = #linalg.type_fn<cast_signed>, innermostKBlock = 32 : i32, innermostMBlock = 32 : i32, innermostNBlock = 32 : i32} ins(%arg0, %arg1 : tensor<128x11008xbf16>, tensor<11008x4096xbf16>) outs(%1 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
return %2 : tensor<128x4096xbf16>
}
}
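In this case, the existing tiling logic does not correctly handle the boundary of the K dimension (each of the 2 K threads gets 11008 / 2 = 5504 elements, which is not a multiple of KBlock = 4096), generating code like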
%19 = scf.for %arg10 = %c0 to %c172 step %c128 iter_args(%arg11 = %extracted_slice_8) -> (tensor<32x32xf32>) {
%21 = affine.apply affine_map<(d0) -> (d0 * 32)>(%arg10)
%extracted_slice_10 = tensor.extract_slice %extracted_slice_4[0, %21] [32, 4096] [1, 1] : tensor<32x5504xbf16> to tensor<32x4096xbf16>
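and causing an out-of-bounds access at runtime: on the second iteration (%arg10 = 128) the slice starts at offset 128 * 32 = 4096 with size 4096, which runs past the 5504-element K extent of the source tensor.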
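The arithmetic behind the failure can be reproduced with a small sketch. This is only an illustration derived from the IR above, not the project's tiling code: the helper name `k_slices` and the clamping with `min` are assumptions about what a boundary-aware slice size could look like, not a statement of the actual fix.

```python
def k_slices(k_per_thread, k_block, inner_k_block):
    """Yield (offset, generated_size, clamped_size) for each outer K step.

    Hypothetical helper mirroring the scf.for loop in the generated IR:
    the induction variable counts innermost K blocks.
    """
    trip_end = k_per_thread // inner_k_block  # 5504 // 32 = 172 (%c172)
    step = k_block // inner_k_block           # 4096 // 32 = 128 (%c128)
    for i in range(0, trip_end, step):
        offset = i * inner_k_block            # affine_map d0 * 32
        # Assumed boundary handling: never read past the end of K.
        clamped = min(k_block, k_per_thread - offset)
        yield offset, k_block, clamped


K, K_THREADS, K_BLOCK, INNER_K = 11008, 2, 4096, 32
k_per_thread = K // K_THREADS  # 5504, matches tensor<32x5504xbf16>

for offset, size, clamped in k_slices(k_per_thread, K_BLOCK, INNER_K):
    in_bounds = offset + size <= k_per_thread
    print(f"offset={offset} size={size} in_bounds={in_bounds} clamped={clamped}")
```

With the config from this issue, the second step starts at offset 4096 and the generated size of 4096 overruns the 5504-element extent, while a clamped size would be 1408 (44 innermost blocks of 32).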