[mlir][vector] Fix for WarpOpScfForOp failure when scf.for has results that are unused. #141853
Conversation
@llvm/pr-subscribers-mlir-gpu @llvm/pr-subscribers-mlir-vector

Author: Charitha Saumya (charithaintc)

Changes

Discourse post: https://discourse.llvm.org/t/mlir-vector-distribution-warpopscfforop-fails-when-scf-for-has-results-that-are-unused/86390

The example below (the lit test added by this PR) crashes because `WarpOpScfForOp` currently does not support cases where an `scf.for` result is unused.
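The failing IR, reconstructed from the lit test added by this patch (`"some_def"`/`"some_use"` are placeholder ops):

```mlir
func.func @warp_scf_for_unused_yield(%arg0: index) {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
    %ini = "some_def"() : () -> (vector<128xf32>)
    %ini1 = "some_def"() : () -> (vector<128xf32>)
    %3:2 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1) -> (vector<128xf32>, vector<128xf32>) {
      %add = arith.addi %arg3, %c1 : index
      %1 = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
      %acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
      scf.yield %acc, %1 : vector<128xf32>, vector<128xf32>
    }
    gpu.yield %3#0 : vector<128xf32>
  }
  "some_use"(%0) : (vector<4xf32>) -> ()
  return
}
```

Note that `%3#1` (carrying `%1`) is produced by the loop but has no users outside it.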
Crash trace: (collapsed in the original post)

Solution: see the PR description at the end of this thread.

Full diff: https://github.com/llvm/llvm-project/pull/141853.diff

2 Files Affected:

- mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
- mlir/test/Dialect/Vector/vector-warp-distribute.mlir
diff --git a/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp b/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
index 045c192787f10..94435588459e6 100644
--- a/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
+++ b/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
@@ -15,10 +15,13 @@
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/Dialect/Vector/Transforms/VectorDistribution.h"
#include "mlir/IR/AffineExpr.h"
+#include "mlir/IR/Value.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"
#include "mlir/Transforms/RegionUtils.h"
+#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/Support/FormatVariadic.h"
+#include "llvm/Support/raw_ostream.h"
#include <utility>
using namespace mlir;
@@ -1554,22 +1557,37 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
llvm::SmallSetVector<Value, 32> escapingValues;
SmallVector<Type> inputTypes;
SmallVector<Type> distTypes;
+ auto collectEscapingValues = [&](Value value) {
+ if (!escapingValues.insert(value))
+ return;
+ Type distType = value.getType();
+ if (auto vecType = dyn_cast<VectorType>(distType)) {
+ AffineMap map = distributionMapFn(value);
+ distType = getDistributedType(vecType, map, warpOp.getWarpSize());
+ }
+ inputTypes.push_back(value.getType());
+ distTypes.push_back(distType);
+ };
+
mlir::visitUsedValuesDefinedAbove(
forOp.getBodyRegion(), [&](OpOperand *operand) {
Operation *parent = operand->get().getParentRegion()->getParentOp();
if (warpOp->isAncestor(parent)) {
- if (!escapingValues.insert(operand->get()))
- return;
- Type distType = operand->get().getType();
- if (auto vecType = dyn_cast<VectorType>(distType)) {
- AffineMap map = distributionMapFn(operand->get());
- distType = getDistributedType(vecType, map, warpOp.getWarpSize());
- }
- inputTypes.push_back(operand->get().getType());
- distTypes.push_back(distType);
+ collectEscapingValues(operand->get());
}
});
+ // Any forOp result that is not already yielded by the warpOp
+ // region is also considered escaping and must be returned by the
+ // original warpOp.
+ for (OpResult forResult : forOp.getResults()) {
+ // Check if this forResult is already yielded by the yield op.
+ if (llvm::is_contained(yield->getOperands(), forResult)) {
+ continue;
+ }
+ collectEscapingValues(forResult);
+ }
+
if (llvm::is_contained(distTypes, Type{}))
return failure();
@@ -1609,7 +1627,12 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
forOp.getResultTypes().end());
llvm::SmallDenseMap<Value, int64_t> argIndexMapping;
for (auto [i, retIdx] : llvm::enumerate(newRetIndices)) {
- warpInput.push_back(newWarpOp.getResult(retIdx));
+ auto newWarpResult = newWarpOp.getResult(retIdx);
+ // Unused forOp results yielded by the warpOp region are already included
+ // in the new ForOp.
+ if (llvm::is_contained(newOperands, newWarpResult))
+ continue;
+ warpInput.push_back(newWarpResult);
argIndexMapping[escapingValues[i]] = warpInputType.size();
warpInputType.push_back(inputTypes[i]);
}
diff --git a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
index 38771f2593449..6c7ac7a5196a7 100644
--- a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
+++ b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
@@ -584,6 +584,42 @@ func.func @warp_scf_for_multiple_yield(%arg0: index, %arg1: memref<?xf32>, %arg2
return
}
+// -----
+// CHECK-PROP-LABEL: func.func @warp_scf_for_unused_yield(
+// CHECK-PROP: %[[W0:.*]]:2 = gpu.warp_execute_on_lane_0(%{{.*}})[32] -> (vector<4xf32>, vector<4xf32>) {
+// CHECK-PROP: %[[INI0:.*]] = "some_def"() : () -> vector<128xf32>
+// CHECK-PROP: %[[INI1:.*]] = "some_def"() : () -> vector<128xf32>
+// CHECK-PROP: gpu.yield %[[INI0]], %[[INI1]] : vector<128xf32>, vector<128xf32>
+// CHECK-PROP: }
+// CHECK-PROP: %[[F:.*]]:2 = scf.for %{{.*}} iter_args(%{{.*}} = %[[W0]]#0, %{{.*}} = %[[W0]]#1) -> (vector<4xf32>, vector<4xf32>) {
+// CHECK-PROP: %[[W1:.*]]:2 = gpu.warp_execute_on_lane_0(%{{.*}})[32] args(%{{.*}} : vector<4xf32>, vector<4xf32>) -> (vector<4xf32>, vector<4xf32>) {
+// CHECK-PROP: %[[ACC0:.*]] = "some_def"(%{{.*}}) : (vector<128xf32>, index) -> vector<128xf32>
+// CHECK-PROP: %[[ACC1:.*]] = "some_def"(%{{.*}}) : (index, vector<128xf32>, vector<128xf32>) -> vector<128xf32>
+// CHECK-PROP: gpu.yield %[[ACC1]], %[[ACC0]] : vector<128xf32>, vector<128xf32>
+// CHECK-PROP: }
+// CHECK-PROP: scf.yield %[[W1]]#0, %[[W1]]#1 : vector<4xf32>, vector<4xf32>
+// CHECK-PROP: }
+// CHECK-PROP: "some_use"(%[[F]]#0) : (vector<4xf32>) -> ()
+func.func @warp_scf_for_unused_yield(%arg0: index) {
+ %c128 = arith.constant 128 : index
+ %c1 = arith.constant 1 : index
+ %c0 = arith.constant 0 : index
+ %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
+ %ini = "some_def"() : () -> (vector<128xf32>)
+ %ini1 = "some_def"() : () -> (vector<128xf32>)
+ %3:2 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1) -> (vector<128xf32>, vector<128xf32>) {
+ %add = arith.addi %arg3, %c1 : index
+ %1 = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
+ %acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
+ scf.yield %acc, %1 : vector<128xf32>, vector<128xf32>
+ }
+ gpu.yield %3#0 : vector<128xf32>
+ }
+ "some_use"(%0) : (vector<4xf32>) -> ()
+ return
+}
+
+
// -----
// CHECK-PROP-LABEL: func @vector_reduction(
@kurapov-peter, @Groverkss, @matthias-springer Can you please help with reviewing this PR?
// Any forOp result that is not already yielded by the warpOp
// region is also considered escaping and must be returned by the
// original warpOp.
for (OpResult forResult : forOp.getResults()) {
  // Check if this forResult is already yielded by the yield op.
  if (llvm::is_contained(yield->getOperands(), forResult)) {
    continue;
  }
  collectEscapingValues(forResult);
}
I didn't get this part, could you please clarify? This should collect the arguments of the `scf.for` that would be distributed and returned by the new signature of the original warp op. Why would the original warp op return the `scf.for`'s results?

Shouldn't the results be collected and inserted only into the second warp op, the one that is created inside the `scf.for`?
Great question. This is how the original implementation works. Initially, the original warpOp will return the used forOp results plus any escaping values. Later, the forOp results are replaced with the corresponding initArgs, here:

yieldOperand.set(forOp.getInitArgs()[forResult.getResultNumber()]);

I did not change the original logic flow. Hope this clarifies your question.
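A sketch of what that patching does (hypothetical IR with assumed names, not taken from the test): a warp-op terminator operand that is a loop result gets swapped for the loop's matching init value:

```mlir
// Before patching: the warp op yields a result of the loop directly.
%r:2 = scf.for ... iter_args(%a = %init0, %b = %init1) -> (vector<128xf32>, vector<128xf32>) { ... }
gpu.yield %r#1 : vector<128xf32>

// After yieldOperand.set(forOp.getInitArgs()[1]), since %r#1 has result number 1:
gpu.yield %init1 : vector<128xf32>
```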
Ok, I had to play with it for a bit to understand how it works. I think adding results of the `ForOp` to the yielded values as the first step is pretty confusing, since the resulting warp op after transformation should not have those. I've modified your lit test to make an example that will hopefully help others understand it better.
Initial IR:
func.func @warp_scf_for_unused_yield(%arg0: index) {
%c128 = arith.constant 128 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%0:2 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>, vector<4xf32>) {
%ini = "some_def"() : () -> (vector<128xf32>)
%ini1 = "some_def"() : () -> (vector<128xf32>)
%other = "other_def"() : () -> (vector<128xf32>)
%3:3 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1, %other_2 = %other) -> (vector<128xf32>, vector<128xf32>, vector<128xf32>) {
%add = arith.addi %arg3, %c1 : index
%1 = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
%acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
%other2 = "other2_def"(%arg4) : (vector<128xf32>) -> (vector<128xf32>)
scf.yield %acc, %1, %other2 : vector<128xf32>, vector<128xf32>, vector<128xf32>
}
gpu.yield %3#0, %other : vector<128xf32>, vector<128xf32>
}
"some_use"(%0#0) : (vector<4xf32>) -> ()
"other_use"(%0#1) : (vector<4xf32>) -> ()
return
}
Right after the `newWarpOp` creation, the warp op looks like:
%0:4 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>, vector<4xf32>, vector<4xf32>, vector<4xf32>) {
%1 = "some_def"() : () -> vector<128xf32>
%2 = "some_def"() : () -> vector<128xf32>
%3 = "other_def"() : () -> vector<128xf32>
%4:3 = scf.for %arg1 = %c0 to %c128 step %c1 iter_args(%arg2 = %1, %arg3 = %2, %arg4 = %3) -> (vector<128xf32>, vector<128xf32>, vector<128xf32>) {
%5 = arith.addi %arg1, %c1 : index
%6 = "some_def"(%arg3, %5) : (vector<128xf32>, index) -> vector<128xf32>
%7 = "some_def"(%5, %arg2, %6) : (index, vector<128xf32>, vector<128xf32>) -> vector<128xf32>
%8 = "other2_def"(%arg2) : (vector<128xf32>) -> vector<128xf32>
scf.yield %7, %6, %8 : vector<128xf32>, vector<128xf32>, vector<128xf32>
}
gpu.yield %4#0, %3, %4#1, %4#2 : vector<128xf32>, vector<128xf32>, vector<128xf32>, vector<128xf32>
}
The results of the `ForOp` are now returned even though the op itself will be sunk through `gpu.yield`. These are patched up later to produce the valid result. This is a bit unexpected after you read the comment for the transformation (which is why it caught my attention).

I think this is somewhat brittle. If it is done to preserve some existing code, I'd reconsider in favor of simplicity.
Thanks for the review :-)
I agree the existing logic is a bit confusing. I am also not sure why the forOp results are yielded instead of the iterArgs directly. Maybe the code owners have some insights on this?

But overall the logic will be similar apart from this part (even for a refactored version). This pattern is a bit complex anyway, since it has a lot of moving parts. :-)
Discourse post: https://discourse.llvm.org/t/mlir-vector-distribution-warpopscfforop-fails-when-scf-for-has-results-that-are-unused/86390
Example: (see the lit test in the diff above)

This example crashes because `WarpOpScfForOp` currently does not support cases where an `scf.for` result is unused. Here `%1` is yielded but has no users outside the `scf.for`. It also cannot be hoisted, due to loop-carried dependences.

Crash trace: (collapsed)
Solution
Currently, only values defined outside the `ForOp` but inside the original `WarpOp` are considered "escaping values". However, this is insufficient if the `ForOp` has unused results: in that case, the corresponding iterArgs must also be yielded by the original `WarpOp`. This PR adds the required code changes to achieve this.