[MLIR][Linalg] Scalable Vectorization of Reduction on the Trailing Dimension #97788
Conversation
@llvm/pr-subscribers-mlir-linalg @llvm/pr-subscribers-mlir-sve
Author: Zhaoshi Zheng (zhaoshiz)
Changes: Allow scalable vectorization of linalg::reduce and linalg::generic with a reduction iterator. For now, only reduction on the trailing dimension is supported.
Patch is 22.75 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97788.diff
6 Files Affected:
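As a quick orientation before the diff: the rule being added is that a scalable vector dimension may map to a reduction iterator only if it is the trailing dimension. Below is a minimal standalone sketch of that rule (plain C++ with invented names; the real check lives in vectorizeScalableVectorPrecondition in the diff below and operates on MLIR types).

#include <cassert>
#include <cstddef>
#include <vector>

enum class Iterator { Parallel, Reduction };

// Accepts the request unless a non-trailing dim is both scalable and a
// reduction dim - mirroring the loop added in the patch below.
static bool scalableReductionOnTrailingDimOnly(
    const std::vector<Iterator> &iterators,
    const std::vector<bool> &scalableDims) {
  assert(iterators.size() == scalableDims.size());
  for (std::size_t i = 0; i + 1 < scalableDims.size(); ++i)
    if (scalableDims[i] && iterators[i] == Iterator::Reduction)
      return false;
  return true;
}

int main() {
  // vector_sizes [[4]] on a 1-D reduction: accepted.
  assert(scalableReductionOnTrailingDimOnly({Iterator::Reduction}, {true}));
  // vector_sizes [1, [4]] on ["parallel", "reduction"]: accepted.
  assert(scalableReductionOnTrailingDimOnly(
      {Iterator::Parallel, Iterator::Reduction}, {false, true}));
  // A scalable, non-trailing reduction dim: rejected.
  assert(!scalableReductionOnTrailingDimOnly(
      {Iterator::Reduction, Iterator::Parallel}, {true, false}));
  return 0;
}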
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index 3a75d2ac08157..b1aae46237451 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -582,6 +582,12 @@ static SmallVector<bool> getDimsToReduce(LinalgOp linalgOp) {
llvm::map_range(linalgOp.getIteratorTypesArray(), isReductionIterator));
}
+static bool isLinalgReduction(LinalgOp &op) {
+ return isa<linalg::ReduceOp>(op) ||
+ (isa<linalg::GenericOp>(op) &&
+ llvm::any_of(op.getIteratorTypesArray(), isReductionIterator));
+}
+
/// Build a vector.transfer_write of `value` into `outputOperand` at indices set
/// to all `0`; where `outputOperand` is an output operand of the LinalgOp
/// currently being vectorized. If `dest` has null rank, build an memref.store.
@@ -1773,6 +1779,9 @@ vectorizeDynamicLinalgOpPrecondition(linalg::LinalgOp op,
if (isa<ConvolutionOpInterface>(op.getOperation()))
return vectorizeDynamicConvOpPrecondition(op, flatten1DDepthwiseConv);
+ if (isLinalgReduction(op))
+ return reductionPreconditions(op);
+
// TODO: Masking only supports dynamic element-wise ops, linalg.generic ops,
// linalg.copy ops and ops that implement ContractionOpInterface for now.
if (!isElementwise(op) &&
@@ -1942,13 +1951,30 @@ vectorizeScalableVectorPrecondition(Operation *op,
if (inputVectorSizes.empty())
return success();
+ auto linalgOp = dyn_cast<LinalgOp>(op);
+ if (linalgOp && isLinalgReduction(linalgOp)) {
+ LDBG("Checking reduce op dims for scalable vectorization\n");
+ auto iteratorTypes = linalgOp.getIteratorTypesArray();
+ assert(iteratorTypes.size() == inputScalableVecDims.size() &&
+ "Number of iterator types and input scalable dims mismatch");
+ // For now, only support scalable vectorization of a reduction on the
+ // trailing dim.
+ for (size_t i = 0; i < inputScalableVecDims.size() - 1; ++i) {
+ if (inputScalableVecDims[i] && isReductionIterator(iteratorTypes[i])) {
+ LDBG("Non-trailing reduction dim requested for scalable "
+ "vectorization\n");
+ return failure();
+ }
+ }
+ return success();
+ }
+
bool isScalable = inputScalableVecDims.back();
if (!isScalable)
return success();
// Only element-wise and 1d depthwise conv ops supported in the presence of
// scalable dims.
- auto linalgOp = dyn_cast<LinalgOp>(op);
return success(linalgOp && (isElementwise(linalgOp) ||
isa<linalg::DepthwiseConv1DNwcWcOp>(op)));
}
diff --git a/mlir/test/Dialect/Linalg/vectorization-scalable.mlir b/mlir/test/Dialect/Linalg/vectorization-scalable.mlir
index d6f8d78358370..e0dae167b8625 100644
--- a/mlir/test/Dialect/Linalg/vectorization-scalable.mlir
+++ b/mlir/test/Dialect/Linalg/vectorization-scalable.mlir
@@ -142,3 +142,83 @@ module attributes {transform.with_named_sequence} {
}
}
+// -----
+
+func.func @vectorize_dynamic_reduction_1d(%arg0: tensor<?xf32>,
+ %arg1: tensor<f32>) -> tensor<f32> {
+
+ %0 = linalg.reduce ins(%arg0 : tensor<?xf32>) outs(%arg1 : tensor<f32>) dimensions = [0]
+ (%in: f32, %init: f32) {
+ %0 = arith.addf %in, %init : f32
+ linalg.yield %0 : f32
+ }
+ return %0 : tensor<f32>
+}
+
+// CHECK-LABEL: func.func @vectorize_dynamic_reduction_1d(
+// CHECK-SAME: %[[ARG_0:.*]]: tensor<?xf32>, %[[ARG_1:.*]]: tensor<f32>) -> tensor<f32> {
+// CHECK: %[[VAL_0:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_1:.*]] = tensor.dim %[[ARG_0]], %[[VAL_0]] : tensor<?xf32>
+// CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_3:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_4:.*]] = vector.create_mask %[[VAL_1]] : vector<[4]xi1>
+// CHECK: %[[VAL_5:.*]] = vector.mask %[[VAL_4]] { vector.transfer_read %[[ARG_0]][%[[VAL_2]]], %[[VAL_3]] {in_bounds = [true]} : tensor<?xf32>, vector<[4]xf32> } : vector<[4]xi1> -> vector<[4]xf32>
+// CHECK: %[[VAL_6:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_7:.*]] = vector.transfer_read %[[ARG_1]][], %[[VAL_6]] : tensor<f32>, vector<f32>
+// CHECK: %[[VAL_8:.*]] = vector.extractelement %[[VAL_7]][] : vector<f32>
+// CHECK: %[[VAL_9:.*]] = vector.mask %[[VAL_4]] { vector.multi_reduction <add>, %[[VAL_5]], %[[VAL_8]] [0] : vector<[4]xf32> to f32 } : vector<[4]xi1> -> f32
+// CHECK: %[[VAL_10:.*]] = vector.broadcast %[[VAL_9]] : f32 to vector<f32>
+// CHECK: %[[VAL_11:.*]] = vector.transfer_write %[[VAL_10]], %[[ARG_1]][] : vector<f32>, tensor<f32>
+// CHECK: return %[[VAL_11]] : tensor<f32>
+// CHECK: }
+
+module attributes {transform.with_named_sequence} {
+ transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+ %0 = transform.structured.match ops{["linalg.reduce"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+ transform.structured.vectorize %0 vector_sizes [[4]] : !transform.any_op
+ transform.yield
+ }
+}
+
+// -----
+
+func.func @vectorize_dynamic_reduction_2d(%arg0: tensor<?x?xf32>,
+ %arg1: tensor<?xf32>) -> tensor<?xf32> {
+ %0 = linalg.generic { indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
+ affine_map<(d0, d1) -> (d0)>],
+ iterator_types = ["parallel", "reduction"] }
+ ins(%arg0 : tensor<?x?xf32>)
+ outs(%arg1 : tensor<?xf32>) {
+ ^bb(%in: f32, %out: f32) :
+ %0 = arith.addf %in, %out : f32
+ linalg.yield %0 : f32
+ } -> tensor<?xf32>
+ return %0 : tensor<?xf32>
+}
+
+// CHECK-LABEL: func.func @vectorize_dynamic_reduction_2d(
+// CHECK-SAME: %[[ARG_0:.*]]: tensor<?x?xf32>, %[[ARG_1:.*]]: tensor<?xf32>) -> tensor<?xf32> {
+// CHECK: %[[VAL_0:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_1:.*]] = tensor.dim %[[ARG_0]], %[[VAL_0]] : tensor<?x?xf32>
+// CHECK: %[[VAL_2:.*]] = arith.constant 1 : index
+// CHECK: %[[VAL_3:.*]] = tensor.dim %[[ARG_0]], %[[VAL_2]] : tensor<?x?xf32>
+// CHECK: %[[VAL_4:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_5:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_6:.*]] = vector.create_mask %[[VAL_1]], %[[VAL_3]] : vector<1x[4]xi1>
+// CHECK: %[[VAL_7:.*]] = vector.mask %[[VAL_6]] { vector.transfer_read %[[ARG_0]][%[[VAL_4]], %[[VAL_4]]], %[[VAL_5]] {in_bounds = [true, true]} : tensor<?x?xf32>, vector<1x[4]xf32> } : vector<1x[4]xi1> -> vector<1x[4]xf32>
+// CHECK: %[[VAL_8:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_9:.*]] = vector.create_mask %[[VAL_1]] : vector<1xi1>
+// CHECK: %[[VAL_10:.*]] = vector.mask %[[VAL_9]] { vector.transfer_read %[[ARG_1]][%[[VAL_4]]], %[[VAL_8]] {in_bounds = [true]} : tensor<?xf32>, vector<1xf32> } : vector<1xi1> -> vector<1xf32>
+// CHECK: %[[VAL_11:.*]] = vector.mask %[[VAL_6]] { vector.multi_reduction <add>, %[[VAL_7]], %[[VAL_10]] [1] : vector<1x[4]xf32> to vector<1xf32> } : vector<1x[4]xi1> -> vector<1xf32>
+// CHECK: %[[VAL_12:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_13:.*]] = vector.mask %[[VAL_9]] { vector.transfer_write %[[VAL_11]], %[[ARG_1]][%[[VAL_12]]] {in_bounds = [true]} : vector<1xf32>, tensor<?xf32> } : vector<1xi1> -> tensor<?xf32>
+// CHECK: return %[[VAL_13]] : tensor<?xf32>
+// CHECK: }
+
+module attributes {transform.with_named_sequence} {
+ transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+ %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+ transform.structured.vectorize %0 vector_sizes [1, [4]] : !transform.any_op
+ transform.yield
+ }
+}
diff --git a/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir b/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir
index f70d23a193229..03cdd4f1cc2b6 100644
--- a/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir
+++ b/mlir/test/Dialect/Vector/vector-multi-reduction-lowering.mlir
@@ -298,6 +298,30 @@ func.func @scalable_dim_1d(%A: vector<[4]xf32>, %B: f32, %C: vector<[4]xi1>) ->
// CHECK: %[[VAL_4:.*]] = vector.extract %[[VAL_3]][0] : f32 from vector<1xf32>
// CHECK: return %[[VAL_4]] : f32
+func.func @scalable_dim_2d(%A: vector<2x[4]xf32>, %B: vector<2xf32>, %C: vector<2x[4]xi1>) -> vector<2xf32> {
+ %0 = vector.mask %C { vector.multi_reduction <add>, %A, %B [1] : vector<2x[4]xf32> to vector<2xf32> } : vector<2x[4]xi1> -> vector<2xf32>
+ return %0 : vector<2xf32>
+}
+
+// CHECK-LABEL: func.func @scalable_dim_2d(
+// CHECK-SAME: %[[ARG_0:.*]]: vector<2x[4]xf32>,
+// CHECK-SAME: %[[ARG_1:.*]]: vector<2xf32>,
+// CHECK-SAME: %[[ARG_2:.*]]: vector<2x[4]xi1>) -> vector<2xf32> {
+// CHECK-DAG: %[[CON_0:.*]] = arith.constant 1 : index
+// CHECK-DAG: %[[CON_1:.*]] = arith.constant 0 : index
+// CHECK-DAG: %[[CON_2:.*]] = arith.constant dense<0.000000e+00> : vector<2xf32>
+// CHECK: %[[VAL_0:.*]] = vector.extract %[[ARG_0]][0] : vector<[4]xf32> from vector<2x[4]xf32>
+// CHECK: %[[VAL_1:.*]] = vector.extract %[[ARG_1]][0] : f32 from vector<2xf32>
+// CHECK: %[[VAL_2:.*]] = vector.extract %[[ARG_2]][0] : vector<[4]xi1> from vector<2x[4]xi1>
+// CHECK: %[[VAL_3:.*]] = vector.mask %[[VAL_2]] { vector.reduction <add>, %[[VAL_0]], %[[VAL_1]] : vector<[4]xf32> into f32 } : vector<[4]xi1> -> f32
+// CHECK: %[[VAL_4:.*]] = vector.insertelement %[[VAL_3]], %[[CON_2]][%[[CON_1]] : index] : vector<2xf32>
+// CHECK: %[[VAL_5:.*]] = vector.extract %[[ARG_0]][1] : vector<[4]xf32> from vector<2x[4]xf32>
+// CHECK: %[[VAL_6:.*]] = vector.extract %[[ARG_1]][1] : f32 from vector<2xf32>
+// CHECK: %[[VAL_7:.*]] = vector.extract %[[ARG_2]][1] : vector<[4]xi1> from vector<2x[4]xi1>
+// CHECK: %[[VAL_8:.*]] = vector.mask %[[VAL_7]] { vector.reduction <add>, %[[VAL_5]], %[[VAL_6]] : vector<[4]xf32> into f32 } : vector<[4]xi1> -> f32
+// CHECK: %[[VAL_9:.*]] = vector.insertelement %[[VAL_8]], %[[VAL_4]][%[[CON_0]] : index] : vector<2xf32>
+// CHECK: return %[[VAL_9]] : vector<2xf32>
+
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%root : !transform.any_op {transform.readonly}) {
%func_op = transform.structured.match ops{["func.func"]} in %root : (!transform.any_op) -> !transform.op<"func.func">
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/generic_reduce_2d.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/generic_reduce_2d.mlir
new file mode 100644
index 0000000000000..42a6f55e56a6f
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/generic_reduce_2d.mlir
@@ -0,0 +1,95 @@
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE: -transform-interpreter -test-transform-dialect-erase-schedule \
+// DEFINE: -one-shot-bufferize="bufferize-function-boundaries" -buffer-deallocation-pipeline -cse -canonicalize -convert-vector-to-scf -arm-sve-legalize-vector-storage \
+// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
+// DEFINE: %{entry_point} = generic_reduce_2d_f32
+// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
+// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
+
+// RUN: %{compile}
+
+// RUN: %{run} | FileCheck %s --check-prefix=F32
+
+func.func @generic_reduce_2d_f32() {
+ // 2-D Tensor
+ %M = arith.constant 16 : index
+ %N = arith.constant 1000 : index
+ %c0_f32 = arith.constant 0.0 : f32
+
+ // Allocate the input and output tensors
+ %A_alloc = bufferization.alloc_tensor(%M, %N) : tensor<?x?xf32>
+ %C_alloc = bufferization.alloc_tensor(%M) : tensor<?xf32>
+
+ // Initialise the tensors
+ %pi = arith.constant 3.1416 : f32
+ %A_in = linalg.fill ins(%pi : f32) outs(%A_alloc : tensor<?x?xf32>) -> tensor<?x?xf32>
+ %C_in = linalg.fill ins(%c0_f32 : f32) outs(%C_alloc : tensor<?xf32>) -> tensor<?xf32>
+
+ // Reduce
+ %C_out = linalg.generic { indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
+ affine_map<(d0, d1) -> (d0)>],
+ iterator_types = ["parallel", "reduction"] }
+ ins(%A_in : tensor<?x?xf32>)
+ outs(%C_in : tensor<?xf32>) {
+ ^bb(%in: f32, %out: f32) :
+ %0 = arith.addf %in, %out : f32
+ linalg.yield %0 : f32
+ } -> tensor<?xf32>
+
+ // Print and verify the output
+ // F32-LABEL: SVE: START OF TEST OUTPUT
+ vector.print str "SVE: START OF TEST OUTPUT\n"
+
+ // F32-NEXT: Unranked Memref {{.*}} rank = 1 offset = 0 sizes = [16] strides = [1] data =
+ // F32-NEXT: [3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6, 3141.6]
+
+ %xf = tensor.cast %C_out : tensor<?xf32> to tensor<*xf32>
+ call @printMemrefF32(%xf) : (tensor<*xf32>) -> ()
+
+ // F32-NEXT: SVE: END OF TEST OUTPUT
+ vector.print str "SVE: END OF TEST OUTPUT\n"
+
+ return
+}
+
+module attributes {transform.with_named_sequence} {
+ // A sequence that will tile and vectorise a Reduce Op
+ transform.named_sequence @tile_and_vectorize_reduce(%func
+ : !transform.op<"func.func"> {transform.readonly}) {
+
+ // Step 0: Get a handle to the reduce Op
+ %reduce = transform.structured.match ops{["linalg.generic"]} in %func
+ : (!transform.op<"func.func">) -> !transform.any_op
+
+ // Step 1: Tile
+ %tiled_reduce, %loops:2 = transform.structured.tile_using_for %reduce tile_sizes [1, [4]]
+ : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
+
+ // Step 2: Vectorize
+ transform.structured.vectorize %tiled_reduce vector_sizes [1, [4]] : !transform.any_op
+
+ // Step 3: Lower vector.multi_reduction
+ transform.apply_patterns to %func {
+ transform.apply_patterns.vector.lower_masked_transfers
+ transform.apply_patterns.vector.lower_multi_reduction lowering_strategy = "innerreduction"
+ } : !transform.op<"func.func">
+
+ transform.yield
+ }
+
+  // A sequence that goes over all functions in this module and applies
+ // "tile_and_vectorize_reduce"
+ transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
+ %funcs = transform.structured.match ops{["func.func"]} in %module
+ : (!transform.any_op) -> !transform.op<"func.func">
+
+ transform.foreach %funcs : !transform.op<"func.func"> {
+ ^bb2(%func : !transform.op<"func.func">):
+ transform.include @tile_and_vectorize_reduce failures(propagate)
+ (%func) : (!transform.op<"func.func">) -> ()
+ }
+ transform.yield
+ }
+}
+
+func.func private @printMemrefF32(%ptr : tensor<*xf32>)
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_1d.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_1d.mlir
new file mode 100644
index 0000000000000..e9f7154b10d42
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_1d.mlir
@@ -0,0 +1,90 @@
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE: -transform-interpreter -test-transform-dialect-erase-schedule \
+// DEFINE: -one-shot-bufferize="bufferize-function-boundaries" -buffer-deallocation-pipeline -cse -canonicalize -convert-vector-to-scf -arm-sve-legalize-vector-storage \
+// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
+// DEFINE: %{entry_point} = reduce_1d_f32
+// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
+// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
+
+// RUN: %{compile}
+
+// RUN: %{run} | FileCheck %s --check-prefix=F32
+
+func.func @reduce_1d_f32() {
+ // 1-D Tensor
+ %N = arith.constant 1000 : index
+ %c0_f32 = arith.constant 0.0 : f32
+
+ // Allocate the input and output tensors
+ %A_alloc = bufferization.alloc_tensor(%N) : tensor<?xf32>
+ %C_alloc = bufferization.alloc_tensor() : tensor<f32>
+
+ // Initialise the tensors
+ %pi = arith.constant 3.1416 : f32
+ %A_in = linalg.fill ins(%pi : f32) outs(%A_alloc : tensor<?xf32>) -> tensor<?xf32>
+ %C_in = tensor.insert %c0_f32 into %C_alloc[] : tensor<f32>
+
+ // Reduce
+ %C_out = linalg.reduce ins(%A_in : tensor<?xf32>) outs(%C_in: tensor<f32>) dimensions = [0]
+ (%in: f32, %init: f32) {
+ %0 = arith.addf %in, %init : f32
+ linalg.yield %0 : f32
+ }
+
+ // Print and verify the output
+ // F32-LABEL: SVE: START OF TEST OUTPUT
+ vector.print str "SVE: START OF TEST OUTPUT\n"
+
+ // F32-NEXT: Unranked Memref {{.*}} rank = 0 offset = 0 sizes = [] strides = [] data =
+ // F32-NEXT: [3141.6]
+
+ %xf = tensor.cast %C_out : tensor<f32> to tensor<*xf32>
+ call @printMemrefF32(%xf) : (tensor<*xf32>) -> ()
+
+ // F32-NEXT: SVE: END OF TEST OUTPUT
+ vector.print str "SVE: END OF TEST OUTPUT\n"
+
+ return
+}
+
+module attributes {transform.with_named_sequence} {
+ // A sequence that will tile and vectorise a Reduce Op
+ transform.named_sequence @tile_and_vectorize_reduce(%func
+ : !transform.op<"func.func"> {transform.readonly}) {
+
+ // Step 0: Get a handle to the reduce Op
+ %reduce = transform.structured.match ops{["linalg.reduce"]} in %func
+ : (!transform.op<"func.func">) -> !transform.any_op
+
+ // Step 1: Tile
+ %tiled_reduce, %loops:1 = transform.structured.tile_using_for %reduce tile_sizes [[4]]
+ : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
+
+ // Step 2: Vectorize
+ transform.structured.vectorize %tiled_reduce vector_sizes [[4]] : !transform.any_op
+
+ // Step 3: Lower vector.multi_reduction
+ transform.apply_patterns to %func {
+ transform.apply_patterns.vector.lower_masked_transfers
+ transform.apply_patterns.vector.lower_multi_reduction lowering_strategy = "innerreduction"
+ } : !transform.op<"func.func">
+
+ transform.yield
+ }
+
+  // A sequence that goes over all functions in this module and applies
+ // "tile_and_vectorize_reduce"
+ transform.named_sequence @__transform_main(%module: !transform.any_op {transform.readonly}) {
+ %funcs = transform.structured.match ops{["func.func"]} in %module
+ : (!transform.any_op) -> !transform.op<"func.func">
+
+ transform.foreach %funcs : !transform.op<"func.func"> {
+ ^bb2(%func : !transform.op<"func.func">):
+ transform.include @tile_and_vectorize_reduce failures(propagate)
+ (%func) : (!transform.op<"func.func">) -> ()
+ }
+ transform.yield
+ }
+}
+
+func.func private @printMemrefF32(%ptr : tensor<*xf32>)
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_2d.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_2d.mlir
new file mode 100644
index 0000000000000..349966d7c85d5
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSVE/reduce_2d.mlir
@@ -0,0 +1,91 @@
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE: -transform-interpreter -test-transform-dialect-erase-schedule \
+// DEFINE: -one-shot-bufferize="bufferize-function-boundaries" -buffer-deallocation-pipeline -cse -canonicalize -convert-vector-to-scf -arm-sve-legalize-vector-storage \
+// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
+// DEFINE: %{entry_point} = reduce_2d_f32
+// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
+// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
+
+// RUN: %{compile}
+
+// RUN: %{run} | FileCheck %s --check-prefix=F32
+
+func.func @reduce_2d_f32() {
+ // 2-D Tensor
+ %M = arith.constant 16 : index
+ %N = arith.constant 1000 : index
+ %c0_f32 = arith.constant 0.0 : f32
+
+ // Allocate the input and output tensors
+ %A_alloc = bufferization.alloc_tensor(%M, %N) : tensor<?x?xf32>
+ %C_alloc = bufferization.alloc_tenso...
[truncated]
  }
  return success();
}

bool isScalable = inputScalableVecDims.back();
this seems to be missed by 5f6c036, lifting the restriction that only the trailing dimension can be scalably vectorized.
@banach-space, I think we should check that all dims are not scalable?
Indeed, feel free to remove (just add a note in the summary).
Indeed, feel free to remove (just add a note in the summary).
like mentioned in my Q: this allows vector sizes such as [[4], [4], 1] to be applied without checking the type of the linalg op.
Simply removing it will break some useful cases like matmul. Making sure we allow all correct combinations of vector sizes and op types and prevent unsupported cases is beyond the scope of this PR. I'm happy to work on it later.
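To make that concern concrete, here is a stripped-down standalone sketch (plain C++; the names are invented and this is not the MLIR code) of the early-exit behaviour being discussed: with the existing check, a request like [[4], [4], 1] is accepted before the op kind is ever inspected, whereas removing it would send every request to the elementwise/depthwise-conv whitelist and reject matmul.

#include <cstdio>
#include <vector>

// scalableDims mirrors the scalable flags of the requested vector sizes,
// e.g. [[4], [4], 1] -> {true, true, false}.
static bool scalablePreconditionSketch(const std::vector<bool> &scalableDims,
                                       bool opIsElementwiseOrDepthwiseConv) {
  if (scalableDims.empty())
    return true;
  // Early exit: if the trailing dim is not scalable, nothing else is checked.
  if (!scalableDims.back())
    return true;
  // Otherwise only a small whitelist of ops is accepted today.
  return opIsElementwiseOrDepthwiseConv;
}

int main() {
  // [[4], [4], 1] on a matmul-like op: accepted purely because the trailing
  // flag is false - the op kind is never consulted.
  std::printf("%d\n",
              scalablePreconditionSketch({true, true, false},
                                         /*opIsElementwiseOrDepthwiseConv=*/false));
  return 0;
}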
// DEFINE: -convert-vector-to-llvm="enable-arm-sve" -test-lower-to-llvm -o %t
// DEFINE: %{entry_point} = generic_reduce_2d_f32
// DEFINE: %{run} = %mcr_aarch64_cmd %t -e %{entry_point} -entry-point-result=void --march=aarch64 --mattr="+sve"\
// DEFINE: -shared-libs=%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext,%mlir_native_utils_lib_dir/libmlir_c_runner_utils%shlibext
%mlir_native_utils_lib_dir/libmlir_runner_utils%shlibext will expand to config.arm_emulator_utils_lib_dir/libmlir_runner_utils.so on linux. config.arm_emulator_utils_lib_dir is defined in https://github.com/llvm/llvm-project/blob/main/mlir/test/lit.site.cfg.py.in#L59
@zhaoshiz Apologies for the delay with this - I was travelling last week and still catching up with PRs. If not today, I promise to go over this tomorrow. In the meantime, would you mind fixing the conflict?
No worries @banach-space. By "conflict" do you mean ...? I tried to change it locally to check that all dims are not scalable: essentially we are doing a white-list of linalg ops in function vectorizeScalableVectorPrecondition. I have another question: a great number of mlir integration tests are written with ...
I meant this: ... I will reply more tomorrow :)
rebased and fixed the conflict
Thank you for working on this! 🙏🏻 Overall LG. I've made a few suggestions, but nothing major.
Essentially we are doing a white-list of linalg ops in function vectorizeScalableVectorPrecondition; the existing check
bool isScalable = inputScalableVecDims.back(); if (!isScalable) return success();
allows vector sizes like [[4],4] or [1, [4], 1] to proceed regardless of the linalg op being vectorized. So to fix it we'll need to verify that a lot of ops (e.g.: matvec) can be scalably vectorized and add them.
Please don't forget about this check:
llvm-project/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp, lines 1952 to 1953 in 1ed84a8:
return success(linalgOp && (isElementwise(linalgOp) ||
                            isa<linalg::DepthwiseConv1DNwcWcOp>(op)));
Also, note:
llvm-project/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp, lines 1756 to 1765 in 1ed84a8:
// Support dynamic shapes in 1D depthwise convolution, but only in the
// _channel_ dimension.
Value lhs = conv.getDpsInputOperand(0)->get();
ArrayRef<int64_t> lhsShape = cast<ShapedType>(lhs.getType()).getShape();
auto shapeWithoutCh = lhsShape.drop_back(1);
if (ShapedType::isDynamicShape(shapeWithoutCh)) {
  LDBG("Dynamically-shaped op vectorization precondition failed: only "
       "channel dim can be dynamic\n");
  return failure();
}
So, it looks like there are at least three hooks to check the preconditions for linalg.reduce:
vectorizeScalableVectorPrecondition
vectorizeDynamicLinalgOpPrecondition
vectorizeLinalgOpPrecondition
Given that we are adding limitations specific to scalable vectors, I think that vectorizeScalableVectorPrecondition is the right place for now.
I have another question: ...
Let me try something and I will get back to you tomorrow!
for (size_t i = 0; i < inputScalableVecDims.size() - 1; ++i) {
  if (inputScalableVecDims[i] && isReductionIterator(iteratorTypes[i])) {
    LDBG("Non-trailing reduction dim requested for scalable "
         "vectorization\n");
    return failure();
  }
}
return success();
}
Wouldn't this be sufficient?
// Only the trailing scalable dim is allowed to be scalable.
if (llvm::all_of(ArrayRef<bool>(inputScalableVecDims).drop_back(1), [](bool flag) {return flag == false;})
return failure();
As in, we only need to make sure that all the flags except for the trailing one are false. Btw, I might have made a typo - let me know if my suggestion "doesn't work" for you :)
Also, we shouldn't return success until all pre-conditions are checked (there's more further down).
this would work for linalg.reduce ops but prevent vectorizing dimensions with parallel iterators of linalg.generic ops, e.g.: requested vector sizes are [4, [4], 1] for the op below:
%result = linalg.generic {
indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>],
iterator_types = ["parallel", "parallel", "reduction"]
} ins(%lhs, %rhs : tensor<8x10xf32>,tensor<10x16xf32>)
outs(%init :tensor<8x16xf32>) {
^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
%0 = arith.mulf %lhs_one, %rhs_one : f32
%1 = arith.addf %init_one, %0 : f32
linalg.yield %1 : f32
} -> tensor<8x16xf32>
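To spell out the difference on the [4, [4], 1] request above, here is a small standalone comparison (plain C++, invented names; neither function is the actual MLIR code). Variant A is the suggestion above (reject any non-trailing scalable flag); variant B roughly mirrors the loop in the patch as posted (reject scalable flags only on non-trailing reduction dims).

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Iter { Parallel, Reduction };

// Variant A: reject any scalable flag before the trailing dim.
static bool onlyTrailingScalable(const std::vector<bool> &flags) {
  for (std::size_t i = 0; i + 1 < flags.size(); ++i)
    if (flags[i])
      return false;
  return true;
}

// Variant B: reject scalable flags only on non-trailing reduction dims,
// leaving parallel dims free to be scalable.
static bool noScalableNonTrailingReduction(const std::vector<Iter> &iters,
                                           const std::vector<bool> &flags) {
  for (std::size_t i = 0; i + 1 < flags.size(); ++i)
    if (flags[i] && iters[i] == Iter::Reduction)
      return false;
  return true;
}

int main() {
  // vector_sizes [4, [4], 1] on ["parallel", "parallel", "reduction"]:
  std::vector<Iter> iters = {Iter::Parallel, Iter::Parallel, Iter::Reduction};
  std::vector<bool> flags = {false, true, false};
  std::printf("variant A accepts: %d\n", onlyTrailingScalable(flags));                  // 0
  std::printf("variant B accepts: %d\n", noScalableNonTrailingReduction(iters, flags)); // 1
  return 0;
}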
if (isLinalgReduction(op))
  return reductionPreconditions(op);
Do we need this? It's already invoked by vectorizeLinalgOpPrecondition
linalg.reduce ops will fail the next check (L1792~L1795) and cause vectorizeLinalgOpPrecondition() to return failure.
For static-shaped reduce ops I don't think we need it. But after tiling with a scalable vector size like [[4]], we get dynamic-shaped ops.
working on tidying up the test cases, commented inline about issues in mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -1947,13 +1956,30 @@ vectorizeScalableVectorPrecondition(Operation *op,
if (inputVectorSizes.empty())
  return success();

auto linalgOp = dyn_cast<LinalgOp>(op);
if (linalgOp && isLinalgReduction(linalgOp)) { |
Note isLinalgReduction(linalgOp)
I've finally had a bit more time for this. I think that a lot of complexities in this PR stem from the fact that vectorizeScalableVectorPrecondition is a bit messy and hard to extend (my fault!). Also:
Simply removing it will break some useful cases like matmul. Making sure we allow all correct combinations of vector sizes and op types and prevent unsupported cases is beyond the scope of this PR. I'm happy to work on it later.
I think that we should refactor things a bit first and then build this PR on top of that. To quickly unblock you, here's what I'm proposing: ...
I think that you should be able to enable "reductions" quite easily. Sadly re-basing won't be straightforward :(
Also, the PR title "[MLIR][Linalg] Scalable Vectorization of Reduction" says to me that you are adding scalable vectorisation of e.g. linalg.reduce, but in practice you are doing something more generic - allowing reduction dimensions to be scalable. It's worth updating the summary.
Sorry for not getting back to you earlier. Could you clarify - are you asking how to run this on x86_64 or AArch64? The former will require cross-compiling. The latter should just work ™️ 😅 If it didn't, could you share more details?
We should be able to run the integration tests on SVE/SME in both ways: 1. qemu-aarch64 on x86_64; 2. native aarch64-linux,
and write integration tests with ... My question is: will ...?
It should. When ... (see llvm-project/mlir/test/Integration/lit.local.cfg, lines 18 to 19 in 93d7d9b).
The naming is not great though ... Btw, if you are building on X86, do other SVE integration tests work for you? I imagine that they fail - we've not really used ...
no, all SVE tests written with ...
Sorry about this - thank you for checking and for reporting 🙏🏻 Yes, we need to fix this. Would you have the cycles for this?
yea I'm happy to fix that but it'll take some time for me to set up an environment to test
If you test it on X86, then I can take care of testing on AArch64. IIUC, the former is easier for you to set-up?
yes, I can test on x86 and push a PR.
…mension Allow scalable vectorization of linalg::reduce and linalg::generic with reduction iterator. For now, only reduction on the trailing dimension is supported.
…uction Note: I don't have a setup to run these tests natively (arm64-linux with sve). I am able to run them using QEMU on a x86_64-linux with below cmake variables when building llvm: -DARM_EMULATOR_EXECUTABLE="<path_to_qemu_bin>/qemu-aarch64" \ -DARM_EMULATOR_OPTIONS="-L /usr/aarch64-linux-gnu" \ -DARM_EMULATOR_MLIR_CPU_RUNNER_EXECUTABLE="<path_to_llvm_arm64_build>/bin/mlir-cpu-runner-arm64" \ -DARM_EMULATOR_UTILS_LIB_DIR="<path_to_llvm_arm64_build>/lib"
rebased/reworked after #98639 is merged |
✅ With the latest revision this PR passed the C/C++ code formatter. |
You can test this locally with the following command:
git-clang-format --diff 05f0e86cc895181b3d2210458c78938f83353002 fba222e9377302c8263a847ba30268c334d2c5bf --extensions cpp -- mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
View the diff from clang-format here:
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index b2324d8aaf..7e3048b15f 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -2004,9 +2004,9 @@ vectorizeScalableVectorPrecondition(Operation *op,
if (iterators.back() == utils::IteratorType::reduction) {
if (iterators.size() != inputVectorSizes.size()) {
- LDBG("Non-trailing reduction dim requested for scalable "
- "vectorization\n");
- return failure();
+ LDBG("Non-trailing reduction dim requested for scalable "
+ "vectorization\n");
+ return failure();
}
}
// TODO: Support scalable vectorisation for reduction dims
if (iterators.back() == utils::IteratorType::reduction)
  return failure();
if (iterators.back() == utils::IteratorType::reduction) {
  if (iterators.size() != inputVectorSizes.size()) {
    LDBG("Non-trailing reduction dim requested for scalable "
         "vectorization\n");
    return failure();
  }
}

// If this is not the _last_ parallel dim, 1. above is not met
// If this is not the _last_ parallel dim, 1. or 3. above is not met
if (seenParalell)
  return failure();
There are two cases here. Should we turn this into a switch statement to combine this somehow?
switch (iterators.back()) {
case utils::IteratorType::reduction: {
// Check 3. above is met.
if (iterators.size() != inputVectorSizes.size()) {
LDBG("Non-trailing reduction dim requested for scalable "
"vectorization\n");
return failure();
break;
}
}
case utils::IteratorType::parallel: {
// Check 1. and 2. above are met.
if (seenParalell) {
LDBG("Inner parallel dim requested for scalable "
"vectorization\n");
return failure();
}
break;
}
WDYT? I'm open to suggestion :)
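For reference, here is a self-contained version of roughly this switch (plain C++ with invented types and parameters; not the reviewer's suggestion nor the committed code), with an explicit break per case so the two checks stay independent.

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Iter { Parallel, Reduction };

// Assumes the trailing vector dim was requested as scalable; returns whether
// the request should be accepted.
static bool trailingScalableDimOk(const std::vector<Iter> &iterators,
                                  std::size_t numInputVectorSizes,
                                  bool seenParallel) {
  switch (iterators.back()) {
  case Iter::Reduction:
    // Same condition as in the patch: require a vector size for every
    // iterator when the trailing (scalable) dim is a reduction.
    if (iterators.size() != numInputVectorSizes) {
      std::puts("Non-trailing reduction dim requested for scalable vectorization");
      return false;
    }
    break;
  case Iter::Parallel:
    // Existing rule: only the last parallel dim may be scalable.
    if (seenParallel) {
      std::puts("Inner parallel dim requested for scalable vectorization");
      return false;
    }
    break;
  }
  return true;
}

int main() {
  // ["parallel", "reduction"] with vector sizes for both dims: accepted.
  std::printf("%d\n", trailingScalableDimOk({Iter::Parallel, Iter::Reduction},
                                            /*numInputVectorSizes=*/2,
                                            /*seenParallel=*/false));
  return 0;
}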
@@ -586,6 +586,12 @@ static SmallVector<bool> getDimsToReduce(LinalgOp linalgOp) {
llvm::map_range(linalgOp.getIteratorTypesArray(), isReductionIterator));
}
static bool hasLinalgReduction(LinalgOp &op) { |
Please document. Also, why hasLinalgReduction rather than isaLinalgReduction?
transform.yield
}
}

// -----

func.func @linalg_generic_scalable_reduction_dim(%input: tensor<?x?xf32>,
    %acc: tensor<?xf32>) -> tensor<?xf32> {
func.func @linalg_generic_scalable_reduction_leading_dim(%input: tensor<?x?xf32>,
[nit] I think this would read a bit better:
func.func @linalg_generic_scalable_reduction_leading_dim(%input: tensor<?x?xf32>,
func.func @linalg_generic_reduction_scalable_leading_dim(%input: tensor<?x?xf32>,
func.func @vectorize_dynamic_reduction_scalable_1d(%arg0: tensor<?xf32>,
    %arg1: tensor<f32>) -> tensor<f32> {
[nit] Indentation - please align %arg1 with %arg0. Same comment below.
// CHECK: return %[[VAL_11]] : tensor<f32>
// CHECK: }
These two lines can be skipped:
// CHECK: %[[VAL_0:.*]] = arith.constant 0 : index
// CHECK: %[[VAL_1:.*]] = tensor.dim %[[ARG_0]], %[[VAL_0]] : tensor<?xf32>
// CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
// CHECK: %[[VAL_3:.*]] = arith.constant 0.000000e+00 : f32
// CHECK: %[[VAL_4:.*]] = vector.create_mask %[[VAL_1]] : vector<[4]xi1>
// CHECK: %[[VAL_5:.*]] = vector.mask %[[VAL_4]] { vector.transfer_read %[[ARG_0]][%[[VAL_2]]], %[[VAL_3]] {in_bounds = [true]} : tensor<?xf32>, vector<[4]xf32> } : vector<[4]xi1> -> vector<[4]xf32>
// CHECK: %[[VAL_6:.*]] = arith.constant 0.000000e+00 : f32
// CHECK: %[[VAL_7:.*]] = vector.transfer_read %[[ARG_1]][], %[[VAL_6]] : tensor<f32>, vector<f32>
// CHECK: %[[VAL_8:.*]] = vector.extractelement %[[VAL_7]][] : vector<f32>
// CHECK: %[[VAL_9:.*]] = vector.mask %[[VAL_4]] { vector.multi_reduction <add>, %[[VAL_5]], %[[VAL_8]] [0] : vector<[4]xf32> to f32 } : vector<[4]xi1> -> f32
// CHECK: %[[VAL_10:.*]] = vector.broadcast %[[VAL_9]] : f32 to vector<f32>
// CHECK: %[[VAL_11:.*]] = vector.transfer_write %[[VAL_10]], %[[ARG_1]][] : vector<f32>, tensor<f32>
[nit] Kind request - descriptive LIT variable names.
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
transform.structured.vectorize %0 vector_sizes [1, [4]] : !transform.any_op
[nit] Use [4, [8]] instead for consistency:
transform.structured.vectorize %0 vector_sizes [4, 8] : !transform.any_op
// REDEFINE: %{entry_point} = generic_reduce_1d_f32
// RUN: %{run} | FileCheck %s --check-prefix=GENERIC
func.func @reduce_1d_f32() { |
Would you mind adding a test for i16 as well? Or any integer value. Just to make sure that we check both FP and integers. Thanks!
In summary:
1. Do not allow scalable vectorization of the reduction dim of Matmul-like ops.
2. Allow scalable vectorization on only one dim of Matvec op.
Allowed combinations of scalable flags and iterator types:
Matmul:
  Iterators: ["parallel", "parallel", "reduction"]
  Scalable Flags: ["true", "true", "false"], ["false", "true", "false"]
Matvec:
  Iterators: ["parallel", "reduction"]
  Scalable Flags: ["false", "true"], ["true", "false"]
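As a quick reference, the combinations listed in this summary can be encoded as a tiny standalone table check (plain C++, purely illustrative; the real logic lives in vectorizeScalableVectorPrecondition and is not a string lookup).

#include <cassert>
#include <string>

// P = parallel iterator, R = reduction iterator;
// T = scalable dim requested, F = fixed-size dim.
static bool isAllowedCombination(const std::string &iterators,
                                 const std::string &scalableFlags) {
  if (iterators == "PPR") // matmul-like
    return scalableFlags == "TTF" || scalableFlags == "FTF";
  if (iterators == "PR") // matvec-like
    return scalableFlags == "FT" || scalableFlags == "TF";
  return false;
}

int main() {
  assert(isAllowedCombination("PPR", "TTF"));
  assert(isAllowedCombination("PPR", "FTF"));
  assert(!isAllowedCombination("PPR", "TTT")); // scalable reduction dim of matmul
  assert(isAllowedCombination("PR", "FT"));
  assert(isAllowedCombination("PR", "TF"));
  assert(!isAllowedCombination("PR", "TT"));   // only one scalable dim for matvec
  return 0;
}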
revised per comments
LGTM
Thank you for bearing with me and for addressing my comments. Great test coverage!
…mension (#97788)
Summary: Allow scalable vectorization of linalg::reduce and linalg::generic that has reduction iterator(s) with two restrictions: 1. The reduction dim is the last (innermost) dim of the op; and 2. Only the reduction dim is requested for scalable vectorization. One exception is that scalable vectorization of the reduction dim in Matmul-like ops are not supported even above restrictions are met.
Allowed combinations of scalable flags and iterator types:
Matmul: Iterators: ["parallel", "parallel", "reduction"] Scalable Flags: ["true", "true", "false"] ["false", "true", "false"]
Matvec: Iterators: ["parallel", "reduction"] Scalable Flags: ["false", "true"] ["true", "false"]
Test Plan: Reviewers: Subscribers: Tasks: Tags:
Differential Revision: https://phabricator.intern.facebook.com/D60250598
Allow scalable vectorization of linalg::reduce and linalg::generic with a reduction iterator, with two restrictions:
1. The reduction dim is the last (innermost) dim of the op; and
2. Only the reduction dim is requested for scalable vectorization.
One exception is that scalable vectorization of the reduction dim in Matmul-like ops is not supported even when the above restrictions are met.