[MLIR][Linalg] Scalable Vectorization of Reduction on the Trailing Dimension #97788
Conversation
@llvm/pr-subscribers-mlir-linalg @llvm/pr-subscribers-mlir-sve
Author: Zhaoshi Zheng (zhaoshiz)
Changes: Allow scalable vectorization of linalg::reduce and linalg::generic with a reduction iterator. For now, only reduction on the trailing dimension is supported.
Patch is 22.75 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97788.diff
6 Files Affected:
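As a quick orientation before the diff: the rule being added is that a scalable vector dimension may map to a reduction iterator only if it is the trailing dimension. Below is a minimal standalone sketch of that rule (plain C++ with invented names; the real check lives in vectorizeScalableVectorPrecondition in the diff below and operates on MLIR types).

#include <cassert>
#include <cstddef>
#include <vector>

enum class Iterator { Parallel, Reduction };

// Accepts the request unless a non-trailing dim is both scalable and a
// reduction dim - mirroring the loop added in the patch below.
static bool scalableReductionOnTrailingDimOnly(
    const std::vector<Iterator> &iterators,
    const std::vector<bool> &scalableDims) {
  assert(iterators.size() == scalableDims.size());
  for (std::size_t i = 0; i + 1 < scalableDims.size(); ++i)
    if (scalableDims[i] && iterators[i] == Iterator::Reduction)
      return false;
  return true;
}

int main() {
  // vector_sizes [[4]] on a 1-D reduction: accepted.
  assert(scalableReductionOnTrailingDimOnly({Iterator::Reduction}, {true}));
  // vector_sizes [1, [4]] on ["parallel", "reduction"]: accepted.
  assert(scalableReductionOnTrailingDimOnly(
      {Iterator::Parallel, Iterator::Reduction}, {false, true}));
  // A scalable, non-trailing reduction dim: rejected.
  assert(!scalableReductionOnTrailingDimOnly(
      {Iterator::Reduction, Iterator::Parallel}, {true, false}));
  return 0;
}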
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index 3a75d2ac08157..b1aae46237451 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -582,6 +582,12 @@ static SmallVector<bool> getDimsToReduce(LinalgOp linalgOp) {
llvm::map_range(linalgOp.getIteratorTypesArray(), isReductionIterator));
}
+static bool isLinalgReduction(LinalgOp &op) {
+ return isa<linalg::ReduceOp>(op) ||
+ (isa<linalg::GenericOp>(op) &&
+ llvm::any_of(op.getIteratorTypesArray(), isReductionIterator));
+}
+
/// Build a vector.transfer_write of `value` into `outputOperand` at indices set
/// to all `0`; where `outputOperand` is an output operand of the LinalgOp
/// currently being vectorized. If `dest` has null rank, build an memref.store.
@@ -1773,6 +1779,9 @@ vectorizeDynamicLinalgOpPrecondition(linalg::LinalgOp op,
if (isa<ConvolutionOpInterface>(op.getOperation()))
return vectorizeDynamicConvOpPrecondition(op, flatten1DDepthwiseConv);
+ if (isLinalgReduction(op))
+ return reductionPreconditions(op);
+
// TODO: Masking only supports dynamic element-wise ops, linalg.generic ops,
// linalg.copy ops and ops that implement ContractionOpInterface for now.
if (!isElementwise(op) &&
@@ -1942,13 +1951,30 @@ vectorizeScalableVectorPrecondition(Operation *op,
if (inputVectorSizes.empty())
return success();
+ auto linalgOp = dyn_cast<LinalgOp>(op);
+ if (linalgOp && isLinalgReduction(linalgOp)) {
+ LDBG("Checking reduce op dims for scalable vectorization\n");
+ auto iteratorTypes = linalgOp.getIteratorTypesArray();
+ assert(iteratorTypes.size() == inputScalableVecDims.size() &&
+ "Number of iterator types and input scalable dims mismatch");
+ // For now, only support scalable vectorization of a reduction on the
+ // trailing dim.
+ for (size_t i = 0; i < inputScalableVecDims.size() - 1; ++i) {
+ if (inputScalableVecDims[i] && isReductionIterator(iteratorTypes[i])) {
+ LDBG("Non-trailing reduction dim requested for scalable "
+ "vectorization\n");
+ return failure();
+ }
+ }
+ return success();
+ }
+
bool isScalable = inputScalableVecDims.back();
if (!isScalable)
return success();
// Only element-wise and 1d depthwise conv ops supported in the presence of
// scalable dims.
- auto linalgOp = dyn_cast<LinalgOp>(op);
return success(linalgOp && (isElementwise(linalgOp) ||
isa<linalg::DepthwiseConv1DNwcWcOp>(op)));
}
diff --git a/mlir/test/Dialect/Linalg/vectorization-scalable.mlir b/mlir/test/Dialect/Linalg/vectorization-scalable.mlir
index d6f8d78358370..e0dae167b8625 100644
--- a/mlir/test/Dialect/Linalg/vectorization-scalable.mlir
+++ b/mlir/test/Dialect/Linalg/vectorization-scalable.mlir
@@ -142,3 +142,83 @@ module attributes {transform.with_named_sequence} {
}
}
+// -----
+
+func.func @vectorize_dynamic_reduction_1d(%arg0: tensor<?xf32>,
+ %arg1: tensor<f32>) -> tensor<f32> {
+
+ %0 = linalg.reduce ins(%arg0 : tensor<?xf32>) outs(%arg1 : tensor<f32>) dimensions = [0]
+ (%in: f32, %init: f32) {
+ %0 = arith.addf %in, %init : f32
+ linalg.yield %0 : f32
+ }
+ return %0 : tensor<f32>
+}
+
+// CHECK-LABEL: func.func @vectorize_dynamic_reduction_1d(
+// CHECK-SAME: %[[ARG_0:.*]]: tensor<?xf32>, %[[ARG_1:.*]]: tensor<f32>) -> tensor<f32> {
+// CHECK: %[[VAL_0:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_1:.*]] = tensor.dim %[[ARG_0]], %[[VAL_0]] : tensor<?xf32>
+// CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_3:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_4:.*]] = vector.create_mask %[[VAL_1]] : vector<[4]xi1>
+// CHECK: %[[VAL_5:.*]] = vector.mask %[[VAL_4]] { vector.transfer_read %[[ARG_0]][%[[VAL_2]]], %[[VAL_3]] {in_bounds = [true]} : tensor<?xf32>, vector<[4]xf32> } : vector<[4]xi1> -> vector<[4]xf32>
+// CHECK: %[[VAL_6:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_7:.*]] = vector.transfer_read %[[ARG_1]][], %[[VAL_6]] : tensor<f32>, vector<f32>
+// CHECK: %[[VAL_8:.*]] = vector.extractelement %[[VAL_7]][] : vector<f32>
+// CHECK: %[[VAL_9:.*]] = vector.mask %[[VAL_4]] { vector.multi_reduction <add>, %[[VAL_5]], %[[VAL_8]] [0] : vector<[4]xf32> to f32 } : vector<[4]xi1> -> f32
+// CHECK: %[[VAL_10:.*]] = vector.broadcast %[[VAL_9]] : f32 to vector<f32>
+// CHECK: %[[VAL_11:.*]] = vector.transfer_write %[[VAL_10]], %[[ARG_1]][] : vector<f32>, tensor<f32>
+// CHECK: return %[[VAL_11]] : tensor<f32>
+// CHECK: }
+
+module attributes {transform.with_named_sequence} {
+ transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+ %0 = transform.structured.match ops{["linalg.reduce"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+ transform.structured.vectorize %0 vector_sizes [[4]] : !transform.any_op
+ transform.yield
+ }
+}
+
+// -----
+
+func.func @vectorize_dynamic_reduction_2d(%arg0: tensor<?x?xf32>,
+ %arg1: tensor<?xf32>) -> tensor<?xf32> {
+ %0 = linalg.generic { indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
+ affine_map<(d0, d1) -> (d0)>],
+ iterator_types = ["parallel", "reduction"] }
+ ins(%arg0 : tensor<?x?xf32>)
+ outs(%arg1 : tensor<?xf32>) {
+ ^bb(%in: f32, %out: f32) :
+ %0 = arith.addf %in, %out : f32
+ linalg.yield %0 : f32
+ } -> tensor<?xf32>
+ return %0 : tensor<?xf32>
+}
+
+// CHECK-LABEL: func.func @vectorize_dynamic_reduction_2d(
+// CHECK-SAME: %[[ARG_0:.*]]: tensor<?x?xf32>, %[[ARG_1:.*]]: tensor<?xf32>) -> tensor<?xf32> {
+// CHECK: %[[VAL_0:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_1:.*]] = tensor.dim %[[ARG_0]], %[[VAL_0]] : tensor<?x?xf32>
+// CHECK: %[[VAL_2:.*]] = arith.constant 1 : index
+// CHECK: %[[VAL_3:.*]] = tensor.dim %[[ARG_0]], %[[VAL_2]] : tensor<?x?xf32>
+// CHECK: %[[VAL_4:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_5:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_6:.*]] = vector.create_mask %[[VAL_1]], %[[VAL_3]] : vector<1x[4]xi1>
+// CHECK: %[[VAL_7:.*]] = vector.mask %[[VAL_6]] { vector.transfer_read %[[ARG_0]][%[[VAL_4]], %[[VAL_4]]], %[[VAL_5]] {in_bounds = [true, true]} : tensor<?x?xf32>, vector<1x[4]xf32> } : vector<1x[4]xi1> -> vector<1x[4]xf32>
+// CHECK: %[[VAL_8:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_9:.*]] = vector.create_mask %[[VAL_1]] : vector<1xi1>
+// CHECK: %[[VAL_10:.*]] = vector.mask %[[VAL_9]] { vector.transfer_read %[[ARG_1]][%[[VAL_4]]], %[[VAL_8]] {in_bounds = [true]} : tensor<?xf32>, vector<1xf32> } : vector<1xi1> -> vector<1xf32>
+// CHECK: %[[VAL_11:.*]] = vector.mask %[[VAL_6]] { vector.multi_reduction <add>, %[[VAL_7]], %[[VAL_10]] [1] : vector<1x[4]xf32> to vector<1xf32> } : vector<1x[4]xi1> -> vector<1xf32>
+// CHECK: %[[VAL_12:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_13:.*]] = vector.mask %[[VAL_9]] { vector.transfer_write %[[VAL_11]], %[[ARG_1]][%[[VAL_12]]] {in_bounds = [true]} : vector<1xf32>, tensor<?xf32> } : vector<1xi1> -> tensor<?xf32>
+// CHECK: return %[[VAL_13]] : tensor<?xf32>
+// CHECK: }
+
+module attributes {transform.with_named_sequence} {
+ transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+ %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+ transform.structured.vectorize %0 vector_sizes [1, [4]] : !transform.any_op
+ transform.yield
+ }
+}
diff --git a/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir b/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir
index f70d23a193229..03cdd4f1cc2b6 100644
--- a/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir
+++ b/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir
@@ -298,6 +298,30 @@ func.func @scalable_dim_1d(%A: vector<[4]xf32>, %B: f32, %C: vector<[4]xi1>) ->
// CHECK: %[[VAL_4:.*]] = vector.extract %[[VAL_3]][0] : f32 from vector<1xf32>
// CHECK: return %[[VAL_4]] : f32
+func.func @scalable_dim_2d(%A: vector<2x[4]xf32>, %B: vector<2xf32>, %C: vector<2x[4]xi1>) -> vector<2xf32> {
+ %0 = vector.mask %C { vector.multi_reduction <add>, %A, %B [1] : vector<2x[4]xf32> to vector<2xf32> } : vector<2x[4]xi1> -> vector<2xf32>
+ return %0 : vector<2xf32>
+}
+
+// CHECK-LABEL: func.func @scalable_dim_2d(
+// CHECK-SAME: %[[ARG_0:.*]]: vector<2x[4]xf32>,
+// CHECK-SAME: %[[ARG_1:.*]]: vector<2xf32>,
+// CHECK-SAME: %[[ARG_2:.*]]: vector<2x[4]xi1>) -> vector<2xf32> {
+// CHECK-DAG: %[[CON_0:.*]] = arith.constant 1 : index
+// CHECK-DAG: %[[CON_1:.*]] = arith.constant 0 : index
+// CHECK-DAG: %[[CON_2:.*]] = arith.constant dense<0.000000e+00> : vector<2xf32>
+// CHECK: %[[VAL_0:.*]] = vector.extract %[[ARG_0]][0] : vector<[4]xf32> from vector<2x[4]xf32>
+// CHECK: %[[VAL_1:.*]] = vector.extract %[[ARG_1]][0] : f32 from vector<2xf32>
+// CHECK: %[[VAL_2:.*]] = vector.extract %[[ARG_2]][0] : vector<[4]xi1> from vector<2x[4]xi1>
+// CHECK: %[[VAL_3:.*]] = vector.mask %[[VAL_2]] { vector.reduction <add>, %[[VAL_0]], %[[VAL_1]] : vector<[4]xf32> into f32 } : vector<[4]xi1> -> f32
+// CHECK: %[[VAL_4:.*]] = vector.insertelement %[[VAL_3]], %[[CON_2]][%[[CON_1]] : index] : vector<2xf32>
+// CHECK: %[[VAL_5:.*]] = vector.extract %[[ARG_0]][1] : vector<[4]xf32> from vector<2x[4]xf32>
+// CHECK: %[[VAL_6:.*]] = vector.extract %[[ARG_1]][1] : f32 from vector<2xf32>
+// CHECK: %[[VAL_7:.*]] = vector.extract %[[ARG_2]][1] : vector<[4]xi1> from vector<2x[4]xi1>
+// CHECK: %[[VAL_8:.*]] = vector.mask %[[VAL_7]] { vector.reduction <add>, %[[VAL_5]], %[[VAL_6]] : vector<[4]xf32> into f32 } : vector<[4]xi1> -> f32
+// CHECK: %[[VAL_9:.*]] = vector.insertelement %[[VAL_8]], %[[VAL_4]][%[[CON_0]] : index] : vector<2xf32>
+// CHECK: return %[[VAL_9]] : vector<2xf32>
+
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%root : !transform.any_op {transform.readonly}) {
%func_op = transform.structured.match ops{["func.func"]} in %root : (!transform.any_op) -> !transform.op<"func.func">
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/generic_reduce_2d.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/generic_reduce_2d.mlir
new file mode 100644
index 0000000000000..42a6f55e56a6f
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/generic_reduce_2d.mlir
@@ -0,0 +1,95 @@
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE: -transform-interpreter -test-transform-dialect-erase-schedule \
+// DEFINE: -one-shot-bufferize="bufferize-function-boundaries" -buffer-deallocation-pipeline -cse -canonicalize -convert-vector-to-scf -arm-sve-legalize-vector-storage \
+// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
+// DEFINE: %{entry_point} = generic_reduce_2d_f32
+// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
+// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
+
+// RUN: %{compile}
+
+// RUN: %{run} | FileCheck %s --check-prefix=F32
+
+func.func @generic_reduce_2d_f32() {
+ // 2-D Tensor
+ %M = arith.constant 16 : index
+ %N = arith.constant 1000 : index
+ %c0_f32 = arith.constant 0.0 : f32
+
+ // Allocate the input and output tensors
+ %A_alloc = bufferization.alloc_tensor(%M, %N) : tensor<?x?xf32>
+ %C_alloc = bufferization.alloc_tensor(%M) : tensor<?xf32>
+
+ // Initialise the tensors
+ %pi = arith.constant 3.1416 : f32
+ %A_in = linalg.fill ins(%pi : f32) outs(%A_alloc : tensor<?x?xf32>) -> tensor<?x?xf32>
+ %C_in = linalg.fill ins(%c0_f32 : f32) outs(%C_alloc : tensor<?xf32>) -> tensor<?xf32>
+
+ // Reduce
+ %C_out = linalg.generic { indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
+ affine_map<(d0, d1) -> (d0)>],
+ iterator_types = ["parallel", "reduction"] }
+ ins(%A_in : tensor<?x?xf32>)
+ outs(%C_in : tensor<?xf32>) {
+ ^bb(%in: f32, %out: f32) :
+ %0 = arith.addf %in, %out : f32
+ linalg.yield %0 : f32
+ } -> tensor<?xf32>
+
+ // Print and verify the output
+ // F32-LABEL: SVE: START OF TEST OUTPUT
+ vector.print str "SVE: START OF TEST OUTPUT\n"
+
+ // F32-NEXT: Unranked Memref {{.*}} rank = 1 offset = 0 sizes = [16] strides = [1] data =
+ // F32-NEXT: [3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6]
+
+ %xf = tensor.cast %C_out : tensor<?xf32> to tensor<*xf32>
+ call @printMemrefF32(%xf) : (tensor<*xf32>) -> ()
+
+ // F32-NEXT: SVE: END OF TEST OUTPUT
+ vector.print str "SVE: END OF TEST OUTPUT\n"
+
+ return
+}
+
+module attributes {transform.with_named_sequence} {
+ // A sequence that will tile and vectorise a Reduce Op
+ transform.named_sequence @tile_and_vectorize_reduce(%func
+ : !transform.op<"func.func"> {transform.readonly}) {
+
+ // Step 0: Get a handle to the reduce Op
+ %reduce = transform.structured.match ops{["linalg.generic"]} in %func
+ : (!transform.op<"func.func">) -> !transform.any_op
+
+ // Step 1: Tile
+ %tiled_reduce, %loops:2 = transform.structured.tile_using_for %reduce tile_sizes [1, [4]]
+ : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
+
+ // Step 2: Vectorize
+ transform.structured.vectorize %tiled_reduce vector_sizes [1, [4]] : !transform.any_op
+
+ // Step 3: Lower vector.multi_reduction
+ transform.apply_patterns to %func {
+ transform.apply_patterns.vector.lower_masked_transfers
+ transform.apply_patterns.vector.lower_multi_reduction lowering_strategy = "innerreduction"
+ } : !transform.op<"func.func">
+
+ transform.yield
+ }
+
+  // A sequence that goes over all functions in this module and applies
+ // "tile_and_vectorize_reduce"
+ transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
+ %funcs = transform.structured.match ops{["func.func"]} in %module
+ : (!transform.any_op) -> !transform.op<"func.func">
+
+ transform.foreach %funcs : !transform.op<"func.func"> {
+ ^bb2(%func : !transform.op<"func.func">):
+ transform.include @tile_and_vectorize_reduce failures(propagate)
+ (%func) : (!transform.op<"func.func">) -> ()
+ }
+ transform.yield
+ }
+}
+
+func.func private @printMemrefF32(%ptr : tensor<*xf32>)
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_1d.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_1d.mlir
new file mode 100644
index 0000000000000..e9f7154b10d42
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_1d.mlir
@@ -0,0 +1,90 @@
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE: -transform-interpreter -test-transform-dialect-erase-schedule \
+// DEFINE: -one-shot-bufferize="bufferize-function-boundaries" -buffer-deallocation-pipeline -cse -canonicalize -convert-vector-to-scf -arm-sve-legalize-vector-storage \
+// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
+// DEFINE: %{entry_point} = reduce_1d_f32
+// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
+// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
+
+// RUN: %{compile}
+
+// RUN: %{run} | FileCheck %s --check-prefix=F32
+
+func.func @reduce_1d_f32() {
+ // 1-D Tensor
+ %N = arith.constant 1000 : index
+ %c0_f32 = arith.constant 0.0 : f32
+
+ // Allocate the input and output tensors
+ %A_alloc = bufferization.alloc_tensor(%N) : tensor<?xf32>
+ %C_alloc = bufferization.alloc_tensor() : tensor<f32>
+
+ // Initialise the tensors
+ %pi = arith.constant 3.1416 : f32
+ %A_in = linalg.fill ins(%pi : f32) outs(%A_alloc : tensor<?xf32>) -> tensor<?xf32>
+ %C_in = tensor.insert %c0_f32 into %C_alloc[] : tensor<f32>
+
+ // Reduce
+ %C_out = linalg.reduce ins(%A_in : tensor<?xf32>) outs(%C_in: tensor<f32>) dimensions = [0]
+ (%in: f32, %init: f32) {
+ %0 = arith.addf %in, %init : f32
+ linalg.yield %0 : f32
+ }
+
+ // Print and verify the output
+ // F32-LABEL: SVE: START OF TEST OUTPUT
+ vector.print str "SVE: START OF TEST OUTPUT\n"
+
+ // F32-NEXT: Unranked Memref {{.*}} rank = 0 offset = 0 sizes = [] strides = [] data =
+ // F32-NEXT: [3141.6]
+
+ %xf = tensor.cast %C_out : tensor<f32> to tensor<*xf32>
+ call @printMemrefF32(%xf) : (tensor<*xf32>) -> ()
+
+ // F32-NEXT: SVE: END OF TEST OUTPUT
+ vector.print str "SVE: END OF TEST OUTPUT\n"
+
+ return
+}
+
+module attributes {transform.with_named_sequence} {
+ // A sequence that will tile and vectorise a Reduce Op
+ transform.named_sequence @tile_and_vectorize_reduce(%func
+ : !transform.op<"func.func"> {transform.readonly}) {
+
+ // Step 0: Get a handle to the reduce Op
+ %reduce = transform.structured.match ops{["linalg.reduce"]} in %func
+ : (!transform.op<"func.func">) -> !transform.any_op
+
+ // Step 1: Tile
+ %tiled_reduce, %loops:1 = transform.structured.tile_using_for %reduce tile_sizes [[4]]
+ : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
+
+ // Step 2: Vectorize
+ transform.structured.vectorize %tiled_reduce vector_sizes [[4]] : !transform.any_op
+
+ // Step 3: Lower vector.multi_reduction
+ transform.apply_patterns to %func {
+ transform.apply_patterns.vector.lower_masked_transfers
+ transform.apply_patterns.vector.lower_multi_reduction lowering_strategy = "innerreduction"
+ } : !transform.op<"func.func">
+
+ transform.yield
+ }
+
+  // A sequence that goes over all functions in this module and applies
+ // "tile_and_vectorize_reduce"
+ transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
+ %funcs = transform.structured.match ops{["func.func"]} in %module
+ : (!transform.any_op) -> !transform.op<"func.func">
+
+ transform.foreach %funcs : !transform.op<"func.func"> {
+ ^bb2(%func : !transform.op<"func.func">):
+ transform.include @tile_and_vectorize_reduce failures(propagate)
+ (%func) : (!transform.op<"func.func">) -> ()
+ }
+ transform.yield
+ }
+}
+
+func.func private @printMemrefF32(%ptr : tensor<*xf32>)
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_2d.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_2d.mlir
new file mode 100644
index 0000000000000..349966d7c85d5
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_2d.mlir
@@ -0,0 +1,91 @@
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE: -transform-interpreter -test-transform-dialect-erase-schedule \
+// DEFINE: -one-shot-bufferize="bufferize-function-boundaries" -buffer-deallocation-pipeline -cse -canonicalize -convert-vector-to-scf -arm-sve-legalize-vector-storage \
+// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
+// DEFINE: %{entry_point} = reduce_2d_f32
+// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
+// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
+
+// RUN: %{compile}
+
+// RUN: %{run} | FileCheck %s --check-prefix=F32
+
+func.func @reduce_2d_f32() {
+ // 2-D Tensor
+ %M = arith.constant 16 : index
+ %N = arith.constant 1000 : index
+ %c0_f32 = arith.constant 0.0 : f32
+
+ // Allocate the input and output tensors
+ %A_alloc = bufferization.alloc_tensor(%M, %N) : tensor<?x?xf32>
+ %C_alloc = bufferization.alloc_tenso...
[truncated]
  }
  return success();
}

bool isScalable = inputScalableVecDims.back();
this seems to be missed by 5f6c036, lifting the restriction that only the trailing dimension can be scalably vectorized.
@banach-space, I think we should check that all dims are not scalable?
Indeed, feel free to remove (just add a note in the summary).
Indeed, feel free to remove (just add a note in the summary).
like mentioned in my Q: this allows vector sizes such as [[4], [4], 1] to be applied without checking the type of the linalg op.
Simply removing it will break some useful cases like matmul. Making sure we allow all correct combinations of vector sizes and op types and prevent unsupported cases is beyond the scope of this PR. I'm happy to work on it later.
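To make that concern concrete, here is a stripped-down standalone sketch (plain C++; the names are invented and this is not the MLIR code) of the early-exit behaviour being discussed: with the existing check, a request like [[4], [4], 1] is accepted before the op kind is ever inspected, whereas removing it would send every request to the elementwise/depthwise-conv whitelist and reject matmul.

#include <cstdio>
#include <vector>

// scalableDims mirrors the scalable flags of the requested vector sizes,
// e.g. [[4], [4], 1] -> {true, true, false}.
static bool scalablePreconditionSketch(const std::vector<bool> &scalableDims,
                                       bool opIsElementwiseOrDepthwiseConv) {
  if (scalableDims.empty())
    return true;
  // Early exit: if the trailing dim is not scalable, nothing else is checked.
  if (!scalableDims.back())
    return true;
  // Otherwise only a small whitelist of ops is accepted today.
  return opIsElementwiseOrDepthwiseConv;
}

int main() {
  // [[4], [4], 1] on a matmul-like op: accepted purely because the trailing
  // flag is false - the op kind is never consulted.
  std::printf("%d\n",
              scalablePreconditionSketch({true, true, false},
                                         /*opIsElementwiseOrDepthwiseConv=*/false));
  return 0;
}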
// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
// DEFINE: %{entry_point} = generic_reduce_2d_f32
// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext will expand to config.arm_emulator_utils_lib_dir/libmlir_runner_utils.so on linux. config.arm_emulator_utils_lib_dir is defined in https://github.com/llvm/llvm-project/blob/main/mlir/test/lit.site.cfg.py.in#L59
@zhaoshiz Apologies for the delay with this - I was travelling last week and still catching up with PRs. If not today, I promise to go over this tomorrow. In the meantime, would you mind fixing the conflict?
No worries @banach-space. By "conflict" do you mean ...? I tried to change it locally to check that all dims are not scalable: essentially we are doing a white-list of linalg ops in function vectorizeScalableVectorPrecondition. I have another question: a great number of mlir integration tests are written with ...
I meant this: ... I will reply more tomorrow :)
rebased and fixed the conflict
Thank you for working on this! 🙏🏻 Overall LG. I've made a few suggestions, but nothing major.
Essentially we are doing a white-list of linalg ops in function vectorizeScalableVectorPrecondition; the existing check
bool isScalable = inputScalableVecDims.back(); if (!isScalable) return success();
allows vector sizes like [[4],4] or [1, [4], 1] to proceed regardless of the linalg op being vectorized. So to fix it we'll need to verify that a lot of ops (e.g.: matvec) can be scalably vectorized and add them.
Please don't forget about this check:
llvm-project/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp, lines 1952 to 1953 in 1ed84a8:
return success(linalgOp && (isElementwise(linalgOp) ||
                            isa<linalg::DepthwiseConv1DNwcWcOp>(op)));
Also, note:
llvm-project/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp, lines 1756 to 1765 in 1ed84a8:
// Support dynamic shapes in 1D depthwise convolution, but only in the
// _channel_ dimension.
Value lhs = conv.getDpsInputOperand(0)->get();
ArrayRef<int64_t> lhsShape = cast<ShapedType>(lhs.getType()).getShape();
auto shapeWithoutCh = lhsShape.drop_back(1);
if (ShapedType::isDynamicShape(shapeWithoutCh)) {
  LDBG("Dynamically-shaped op vectorization precondition failed: only "
       "channel dim can be dynamic\n");
  return failure();
}
So, it looks like there are at least three hooks to check the preconditions for linalg.reduce:
vectorizeScalableVectorPrecondition
vectorizeDynamicLinalgOpPrecondition
vectorizeLinalgOpPrecondition
Given that we are adding limitations specific to scalable vectors, I think that vectorizeScalableVectorPrecondition is the right place for now.
I have another question: ...
Let me try something and I will get back to you tomorrow!
for (size_t i = 0; i < inputScalableVecDims.size() - 1; ++i) {
  if (inputScalableVecDims[i] && isReductionIterator(iteratorTypes[i])) {
    LDBG("Non-trailing reduction dim requested for scalable "
         "vectorization\n");
    return failure();
  }
}
return success();
}
Wouldn't this be sufficient?
// Only the trailing scalable dim is allowed to be scalable.
if (llvm::all_of(ArrayRef<bool>(inputScalableVecDims).drop_back(1), [](bool flag) {return flag == false;})
return failure();
As in, we only need to make sure that all the flags except for the trailing one are false. Btw, I might have made a typo - let me know if my suggestion "doesn't work" for you :)
Also, we shouldn't return success until all pre-conditions are checked (there's more further down).
this would work for linalg.reduce ops but prevent vectorizing dimensions with parallel iterators of linalg.generic ops, e.g.: requested vector sizes are [4, [4], 1] for the op below:
%result = linalg.generic {
indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>],
iterator_types = ["parallel", "parallel", "reduction"]
} ins(%lhs, %rhs : tensor<8x10xf32>,tensor<10x16xf32>)
outs(%init :tensor<8x16xf32>) {
^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
%0 = arith.mulf %lhs_one, %rhs_one : f32
%1 = arith.addf %init_one, %0 : f32
linalg.yield %1 : f32
} -> tensor<8x16xf32>
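To spell out the difference on the [4, [4], 1] request above, here is a small standalone comparison (plain C++, invented names; neither function is the actual MLIR code). Variant A is the suggestion above (reject any non-trailing scalable flag); variant B roughly mirrors the loop in the patch as posted (reject scalable flags only on non-trailing reduction dims).

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Iter { Parallel, Reduction };

// Variant A: reject any scalable flag before the trailing dim.
static bool onlyTrailingScalable(const std::vector<bool> &flags) {
  for (std::size_t i = 0; i + 1 < flags.size(); ++i)
    if (flags[i])
      return false;
  return true;
}

// Variant B: reject scalable flags only on non-trailing reduction dims,
// leaving parallel dims free to be scalable.
static bool noScalableNonTrailingReduction(const std::vector<Iter> &iters,
                                           const std::vector<bool> &flags) {
  for (std::size_t i = 0; i + 1 < flags.size(); ++i)
    if (flags[i] && iters[i] == Iter::Reduction)
      return false;
  return true;
}

int main() {
  // vector_sizes [4, [4], 1] on ["parallel", "parallel", "reduction"]:
  std::vector<Iter> iters = {Iter::Parallel, Iter::Parallel, Iter::Reduction};
  std::vector<bool> flags = {false, true, false};
  std::printf("variant A accepts: %d\n", onlyTrailingScalable(flags));                  // 0
  std::printf("variant B accepts: %d\n", noScalableNonTrailingReduction(iters, flags)); // 1
  return 0;
}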
if (isLinalgReduction(op))
  return reductionPreconditions(op);
Do we need this? It's already invoked by vectorizeLinalgOpPrecondition
linalg.reduce ops will fail the next check (L1792~L1795) and cause vectorizeLinalgOpPrecondition() to return failure.
For static-shaped reduce ops I don't think we need it. But after tiling with a scalable vector size like [[4]], we get dynamic-shaped ops.
working on tidying up the test cases, commented inline about issues in mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -1947,13 +1956,30 @@ vectorizeScalableVectorPrecondition(Operation *op,
if (inputVectorSizes.empty())
  return success();

auto linalgOp = dyn_cast<LinalgOp>(op);
if (linalgOp && isLinalgReduction(linalgOp)) { |
Note isLinalgReduction(linalgOp)
I've finally had a bit more time for this. I think that a lot of complexities in this PR stem from the fact that vectorizeScalableVectorPrecondition is a bit messy and hard to extend (my fault!). Also:
Simply removing it will break some useful cases like matmul. Making sure we allow all correct combinations of vector sizes and op types and prevent unsupported cases is beyond the scope of this PR. I'm happy to work on it later.
I think that we should refactor things a bit first and then build this PR on top of that. To quickly unblock you, here's what I'm proposing: ...
I think that you should be able to enable "reductions" quite easily. Sadly re-basing won't be straightforward :(
Also, the PR title "[MLIR][Linalg] Scalable Vectorization of Reduction" says to me that you are adding scalable vectorisation of e.g. linalg.reduce, but in practice you are doing something more generic - allowing reduction dimensions to be scalable. It's worth updating the summary.
Sorry for not getting back to you earlier. Could you clarify - are you asking how to run this on x86_64 or AArch64? The former will require cross-compiling. The latter should just work ™️ 😅 If it didn't, could you share more details?
We should be able to run the integration tests on SVE/SME in both ways: 1. qemu-aarch64 on x86_64; 2. native aarch64-linux,
and write integration tests with ... My question is: will ...?
It should. When ... (see llvm-project/mlir/test/Integration/lit.local.cfg, lines 18 to 19 in 93d7d9b).
The naming is not great though ... Btw, if you are building on X86, do other SVE integration tests work for you? I imagine that they fail - we've not really used ...
no, all SVE tests written with ...
Sorry about this - thank you for checking and for reporting 🙏🏻 Yes, we need to fix this. Would you have the cycles for this?
yea I'm happy to fix that but it'll take some time for me to set up an environment to test
If you test it on X86, then I can take care of testing on AArch64. IIUC, the former is easier for you to set-up?
yes, I can test on x86 and push a PR.
…mension Allow scalable vectorization of linalg::reduce and linalg::generic with reduction iterator. For now, only reduction on the trailing dimension is supported.
…uction Note: I don't have a setup to run these tests natively (arm64-linux with sve). I am able to run them using QEMU on a x86_64-linux with below cmake variables when building llvm: -DARM_EMULATOR_EXECUTABLE="<path_to_qemu_bin>/qemu-aarch64" \ -DARM_EMULATOR_OPTIONS="-L /usr/aarch64-linux-gnu" \ -DARM_EMULATOR_MLIR_CPU_RUNNER_EXECUTABLE="<path_to_llvm_arm64_build>/bin/mlir-cpu-runner-arm64" \ -DARM_EMULATOR_UTILS_LIB_DIR="<path_to_llvm_arm64_build>/lib"
rebased/reworked after #98639 is merged |
✅ With the latest revision this PR passed the C/C++ code formatter. |
You can test this locally with the following command:
git-clang-format --diff 05f0e86cc895181b3d2210458c78938f83353002 fba222e9377302c8263a847ba30268c334d2c5bf --extensions cpp -- mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
View the diff from clang-format here:
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index b2324d8aaf..7e3048b15f 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -2004,9 +2004,9 @@ vectorizeScalableVectorPrecondition(Operation *op,
if (iterators.back() == utils::IteratorType::reduction) {
if (iterators.size() != inputVectorSizes.size()) {
- LDBG("Non-trailing reduction dim requested for scalable "
- "vectorization\n");
- return failure();
+ LDBG("Non-trailing reduction dim requested for scalable "
+ "vectorization\n");
+ return failure();
}
}
// TODO: Support scalable vectorisation for reduction dims
if (iterators.back() == utils::IteratorType::reduction)
  return failure();
if (iterators.back() == utils::IteratorType::reduction) {
  if (iterators.size() != inputVectorSizes.size()) {
    LDBG("Non-trailing reduction dim requested for scalable "
         "vectorization\n");
    return failure();
  }
}

// If this is not the _last_ parallel dim, 1. above is not met
// If this is not the _last_ parallel dim, 1. or 3. above is not met
if (seenParalell)
  return failure();
There are two cases here. Should we turn this into a switch statement to combine this somehow?
switch (iterators.back()) {
case utils::IteratorType::reduction: {
// Check 3. above is met.
if (iterators.size() != inputVectorSizes.size()) {
LDBG("Non-trailing reduction dim requested for scalable "
"vectorization\n");
return failure();
break;
}
}
case utils::IteratorType::parallel: {
// Check 1. and 2. above are met.
if (seenParalell) {
LDBG("Inner parallel dim requested for scalable "
"vectorization\n");
return failure();
}
break;
}
WDYT? I'm open to suggestion :)
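For reference, here is a self-contained version of roughly this switch (plain C++ with invented types and parameters; not the reviewer's suggestion nor the committed code), with an explicit break per case so the two checks stay independent.

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Iter { Parallel, Reduction };

// Assumes the trailing vector dim was requested as scalable; returns whether
// the request should be accepted.
static bool trailingScalableDimOk(const std::vector<Iter> &iterators,
                                  std::size_t numInputVectorSizes,
                                  bool seenParallel) {
  switch (iterators.back()) {
  case Iter::Reduction:
    // Same condition as in the patch: require a vector size for every
    // iterator when the trailing (scalable) dim is a reduction.
    if (iterators.size() != numInputVectorSizes) {
      std::puts("Non-trailing reduction dim requested for scalable vectorization");
      return false;
    }
    break;
  case Iter::Parallel:
    // Existing rule: only the last parallel dim may be scalable.
    if (seenParallel) {
      std::puts("Inner parallel dim requested for scalable vectorization");
      return false;
    }
    break;
  }
  return true;
}

int main() {
  // ["parallel", "reduction"] with vector sizes for both dims: accepted.
  std::printf("%d\n", trailingScalableDimOk({Iter::Parallel, Iter::Reduction},
                                            /*numInputVectorSizes=*/2,
                                            /*seenParallel=*/false));
  return 0;
}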
@@ -586,6 +586,12 @@ static SmallVector<bool> getDimsToReduce(LinalgOp linalgOp) {
llvm::map_range(linalgOp.getIteratorTypesArray(), isReductionIterator));
}
static bool hasLinalgReduction(LinalgOp &op) { |
Please document. Also, why hasLinalgReduction rather than isaLinalgReduction?
transform.yield
}
}

// -----

func.func @linalg_generic_scalable_reduction_dim(%input: tensor<?x?xf32>,
    %acc: tensor<?xf32>) -> tensor<?xf32> {
func.func @linalg_generic_scalable_reduction_leading_dim(%input: tensor<?x?xf32>,
[nit] I think this would read a bit better:
func.func @linalg_generic_scalable_reduction_leading_dim(%input: tensor<?x?xf32>,
func.func @linalg_generic_reduction_scalable_leading_dim(%input: tensor<?x?xf32>,
func.func @vectorize_dynamic_reduction_scalable_1d(%arg0: tensor<?xf32>,
    %arg1: tensor<f32>) -> tensor<f32> {
[nit] Indentation - please align %arg1 with %arg0. Same comment below.
// CHECK: return %[[VAL_11]] : tensor<f32>
// CHECK: }
These two lines can be skipped:
// CHECK: %[[VAL_0:.*]] = arith.constant 0 : index
// CHECK: %[[VAL_1:.*]] = tensor.dim %[[ARG_0]], %[[VAL_0]] : tensor<?xf32>
// CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
// CHECK: %[[VAL_3:.*]] = arith.constant 0.000000e+00 : f32
// CHECK: %[[VAL_4:.*]] = vector.create_mask %[[VAL_1]] : vector<[4]xi1>
// CHECK: %[[VAL_5:.*]] = vector.mask %[[VAL_4]] { vector.transfer_read %[[ARG_0]][%[[VAL_2]]], %[[VAL_3]] {in_bounds = [true]} : tensor<?xf32>, vector<[4]xf32> } : vector<[4]xi1> -> vector<[4]xf32>
// CHECK: %[[VAL_6:.*]] = arith.constant 0.000000e+00 : f32
// CHECK: %[[VAL_7:.*]] = vector.transfer_read %[[ARG_1]][], %[[VAL_6]] : tensor<f32>, vector<f32>
// CHECK: %[[VAL_8:.*]] = vector.extractelement %[[VAL_7]][] : vector<f32>
// CHECK: %[[VAL_9:.*]] = vector.mask %[[VAL_4]] { vector.multi_reduction <add>, %[[VAL_5]], %[[VAL_8]] [0] : vector<[4]xf32> to f32 } : vector<[4]xi1> -> f32
// CHECK: %[[VAL_10:.*]] = vector.broadcast %[[VAL_9]] : f32 to vector<f32>
// CHECK: %[[VAL_11:.*]] = vector.transfer_write %[[VAL_10]], %[[ARG_1]][] : vector<f32>, tensor<f32>
[nit] Kind request - descriptive LIT variable names.
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
transform.structured.vectorize %0 vector_sizes [1, [4]] : !transform.any_op
[nit] Use [4, [8]] instead for consistency:
transform.structured.vectorize %0 vector_sizes [4, 8] : !transform.any_op
// REDEFINE: %{entry_point} = generic_reduce_1d_f32
// RUN: %{run} | FileCheck %s --check-prefix=GENERIC
func.func @reduce_1d_f32() { |
Would you mind adding a test for i16 as well? Or any integer value. Just to make sure that we check both FP and integers. Thanks!
In summary:
1. Do not allow scalable vectorization of the reduction dim of Matmul-like ops.
2. Allow scalable vectorization on only one dim of Matvec op.
Allowed combinations of scalable flags and iterator types:
Matmul:
  Iterators: ["parallel", "parallel", "reduction"]
  Scalable Flags: ["true", "true", "false"], ["false", "true", "false"]
Matvec:
  Iterators: ["parallel", "reduction"]
  Scalable Flags: ["false", "true"], ["true", "false"]
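As a quick reference, the combinations listed in this summary can be encoded as a tiny standalone table check (plain C++, purely illustrative; the real logic lives in vectorizeScalableVectorPrecondition and is not a string lookup).

#include <cassert>
#include <string>

// P = parallel iterator, R = reduction iterator;
// T = scalable dim requested, F = fixed-size dim.
static bool isAllowedCombination(const std::string &iterators,
                                 const std::string &scalableFlags) {
  if (iterators == "PPR") // matmul-like
    return scalableFlags == "TTF" || scalableFlags == "FTF";
  if (iterators == "PR") // matvec-like
    return scalableFlags == "FT" || scalableFlags == "TF";
  return false;
}

int main() {
  assert(isAllowedCombination("PPR", "TTF"));
  assert(isAllowedCombination("PPR", "FTF"));
  assert(!isAllowedCombination("PPR", "TTT")); // scalable reduction dim of matmul
  assert(isAllowedCombination("PR", "FT"));
  assert(isAllowedCombination("PR", "TF"));
  assert(!isAllowedCombination("PR", "TT"));   // only one scalable dim for matvec
  return 0;
}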
revised per comments
LGTM
Thank you for bearing with me and for addressing my comments. Great test coverage!
…mension (#97788)
Summary: Allow scalable vectorization of linalg::reduce and linalg::generic that has reduction iterator(s) with two restrictions: 1. The reduction dim is the last (innermost) dim of the op; and 2. Only the reduction dim is requested for scalable vectorization. One exception is that scalable vectorization of the reduction dim in Matmul-like ops are not supported even above restrictions are met.
Allowed combinations of scalable flags and iterator types:
Matmul: Iterators: ["parallel", "parallel", "reduction"] Scalable Flags: ["true", "true", "false"] ["false", "true", "false"]
Matvec: Iterators: ["parallel", "reduction"] Scalable Flags: ["false", "true"] ["true", "false"]
Test Plan: Reviewers: Subscribers: Tasks: Tags:
Differential Revision: https://phabricator.intern.facebook.com/D60250598
Allow scalable vectorization of linalg::reduce and linalg::generic with a reduction iterator, with two restrictions:
1. The reduction dim is the last (innermost) dim of the op; and
2. Only the reduction dim is requested for scalable vectorization.
One exception is that scalable vectorization of the reduction dim in Matmul-like ops is not supported even when the above restrictions are met.