
Single matmul runtime segfault due to out-of-bounds access in the K dimension #378

Open
@yifeizh2


During the tuning phase, we observed the following invalid configuration:

module attributes {dlti.target_system_spec = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : ui32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : ui64>, #dlti.dl_entry<"L3_cache_size_in_bytes", 110100480 : ui64>, #dlti.dl_entry<"num_threads", 56 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i64>>>} {
  func.func @entry(%arg0: tensor<128x11008xbf16>, %arg1: tensor<11008x4096xbf16>) -> tensor<128x4096xbf16> attributes {llvm.emit_c_interface} {
    %cst = arith.constant 0.000000e+00 : bf16
    %0 = tensor.empty() : tensor<128x4096xbf16>
    %1 = linalg.fill ins(%cst : bf16) outs(%0 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
    %2 = linalg.matmul {KBlock = 4096 : i32, KThreads = 2 : i32, MBlock = 32 : i32, MThreads = 1 : i32, NBlock = 32 : i32, NThreads = 28 : i32, cast = #linalg.type_fn<cast_signed>, innermostKBlock = 32 : i32, innermostMBlock = 32 : i32, innermostNBlock = 32 : i32} ins(%arg0, %arg1 : tensor<128x11008xbf16>, tensor<11008x4096xbf16>) outs(%1 : tensor<128x4096xbf16>) -> tensor<128x4096xbf16>
    return %2 : tensor<128x4096xbf16>
  }
}
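To see why this configuration is problematic, the K-dimension arithmetic can be worked through in a small Python sketch. It assumes (inferred from the generated IR shown next, not taken from the compiler source) that K is first split evenly across KThreads and that each thread then walks its slice in KBlock-sized steps, counted in units of innermostKBlock. `k_partition` is a hypothetical helper, not part of the project.

```python
# Hypothetical helper: shows how the tuning attributes on linalg.matmul
# appear to partition K. Assumes an even split across KThreads, then
# KBlock-sized tiles measured in innermostKBlock units (inferred from the IR).

def k_partition(k: int, k_threads: int, k_block: int, inner_k_block: int) -> dict:
    """Return the per-thread K extent and how KBlock tiles it."""
    per_thread_k = k // k_threads                  # K elements owned by one thread
    inner_blocks = per_thread_k // inner_k_block   # loop upper bound, in innermost blocks
    step = k_block // inner_k_block                # loop step, in innermost blocks
    remainder = per_thread_k % k_block             # size of the last, partial KBlock tile
    return {
        "per_thread_k": per_thread_k,
        "inner_blocks": inner_blocks,
        "step": step,
        "remainder": remainder,
    }

info = k_partition(k=11008, k_threads=2, k_block=4096, inner_k_block=32)
print(info)
# {'per_thread_k': 5504, 'inner_blocks': 172, 'step': 128, 'remainder': 1408}
```

The values 172 and 128 match the `%c172` upper bound and `%c128` step of the generated loop, and the nonzero remainder (1408 elements) means the second KBlock tile of each thread is only partially backed by data.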

In this case, the existing tiling logic does not correctly handle the boundary of the K dimension, generating code like

          %19 = scf.for %arg10 = %c0 to %c172 step %c128 iter_args(%arg11 = %extracted_slice_8) -> (tensor<32x32xf32>) {
            %21 = affine.apply affine_map<(d0) -> (d0 * 32)>(%arg10)
            %extracted_slice_10 = tensor.extract_slice %extracted_slice_4[0, %21] [32, 4096] [1, 1] : tensor<32x5504xbf16> to tensor<32x4096xbf16>

which causes an out-of-bounds access at runtime.
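The loop above can be mimicked in Python to list the K ranges it extracts from the per-thread `tensor<32x5504xbf16>` slice. `k_slices` is a hypothetical helper: `clamp=False` reproduces the fixed slice size of 4096 seen in the IR, while `clamp=True` shrinks the last tile to the remaining extent. Clamping is only one possible fix (in MLIR it would roughly correspond to computing a dynamic slice size, e.g. with `affine.min`); rejecting configurations whose KBlock does not evenly divide the per-thread K extent during tuning is another. Neither is necessarily what the project will adopt.

```python
# Hypothetical simulation of the generated per-thread K loop:
#   scf.for %arg10 = 0 to ub step step
#     offset = %arg10 * 32; extract_slice [0, offset] [32, size]

def k_slices(per_thread_k: int, k_block: int, inner_k_block: int, clamp: bool):
    """Return the [start, end) K ranges the loop would extract."""
    ub = per_thread_k // inner_k_block   # %c172 in the IR
    step = k_block // inner_k_block      # %c128 in the IR
    slices = []
    for iv in range(0, ub, step):
        start = iv * inner_k_block       # affine_map<(d0) -> (d0 * 32)>
        # Buggy behaviour: always use k_block; fixed: clamp to what remains.
        size = min(k_block, per_thread_k - start) if clamp else k_block
        slices.append((start, start + size))
    return slices

def out_of_bounds(slices, extent: int) -> bool:
    """True if any extracted range reads past the end of the tensor."""
    return any(end > extent for _, end in slices)

buggy = k_slices(5504, 4096, 32, clamp=False)
fixed = k_slices(5504, 4096, 32, clamp=True)
print(buggy, out_of_bounds(buggy, 5504))  # [(0, 4096), (4096, 8192)] True
print(fixed, out_of_bounds(fixed, 5504))  # [(0, 4096), (4096, 5504)] False
```

The second iteration starts at offset 4096 but still requests 4096 columns, reading up to column 8192 of a 5504-wide tensor, which is the out-of-bounds access described above.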

Labels: bug (Something isn't working)
