[mlir][vector] Fix for WarpOpScfForOp failure when scf.for has results that are unused. #141853
Conversation
@llvm/pr-subscribers-mlir-gpu @llvm/pr-subscribers-mlir-vector

Author: Charitha Saumya (charithaintc)

Changes

Discourse post: https://discourse.llvm.org/t/mlir-vector-distribution-warpopscfforop-fails-when-scf-for-has-results-that-are-unused/86390

The example below (the lit test added by this PR) crashes because `WarpOpScfForOp` currently does not support cases where an `scf.for` result is unused.
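The failing IR, reconstructed from the lit test added by this patch (`"some_def"`/`"some_use"` are placeholder ops):

```mlir
func.func @warp_scf_for_unused_yield(%arg0: index) {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
    %ini = "some_def"() : () -> (vector<128xf32>)
    %ini1 = "some_def"() : () -> (vector<128xf32>)
    %3:2 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1) -> (vector<128xf32>, vector<128xf32>) {
      %add = arith.addi %arg3, %c1 : index
      %1 = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
      %acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
      scf.yield %acc, %1 : vector<128xf32>, vector<128xf32>
    }
    gpu.yield %3#0 : vector<128xf32>
  }
  "some_use"(%0) : (vector<4xf32>) -> ()
  return
}
```

Note that `%3#1` (carrying `%1`) is produced by the loop but has no users outside it.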
Crash trace: (collapsed in the original post)

Solution: see the PR description at the end of this thread.

Full diff: https://github.com/llvm/llvm-project/pull/141853.diff

2 Files Affected:

- mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
- mlir/test/Dialect/Vector/vector-warp-distribute.mlir
diff --git a/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp b/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
index 045c192787f10..94435588459e6 100644
--- a/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
+++ b/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
@@ -15,10 +15,13 @@
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/Dialect/Vector/Transforms/VectorDistribution.h"
#include "mlir/IR/AffineExpr.h"
+#include "mlir/IR/Value.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"
#include "mlir/Transforms/RegionUtils.h"
+#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/Support/FormatVariadic.h"
+#include "llvm/Support/raw_ostream.h"
#include <utility>
using namespace mlir;
@@ -1554,22 +1557,37 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
llvm::SmallSetVector<Value, 32> escapingValues;
SmallVector<Type> inputTypes;
SmallVector<Type> distTypes;
+ auto collectEscapingValues = [&](Value value) {
+ if (!escapingValues.insert(value))
+ return;
+ Type distType = value.getType();
+ if (auto vecType = dyn_cast<VectorType>(distType)) {
+ AffineMap map = distributionMapFn(value);
+ distType = getDistributedType(vecType, map, warpOp.getWarpSize());
+ }
+ inputTypes.push_back(value.getType());
+ distTypes.push_back(distType);
+ };
+
mlir::visitUsedValuesDefinedAbove(
forOp.getBodyRegion(), [&](OpOperand *operand) {
Operation *parent = operand->get().getParentRegion()->getParentOp();
if (warpOp->isAncestor(parent)) {
- if (!escapingValues.insert(operand->get()))
- return;
- Type distType = operand->get().getType();
- if (auto vecType = dyn_cast<VectorType>(distType)) {
- AffineMap map = distributionMapFn(operand->get());
- distType = getDistributedType(vecType, map, warpOp.getWarpSize());
- }
- inputTypes.push_back(operand->get().getType());
- distTypes.push_back(distType);
+ collectEscapingValues(operand->get());
}
});
+ // Any forOp result that is not already yielded by the warpOp
+ // region is also considered escaping and must be returned by the
+ // original warpOp.
+ for (OpResult forResult : forOp.getResults()) {
+ // Check if this forResult is already yielded by the yield op.
+ if (llvm::is_contained(yield->getOperands(), forResult)) {
+ continue;
+ }
+ collectEscapingValues(forResult);
+ }
+
if (llvm::is_contained(distTypes, Type{}))
return failure();
@@ -1609,7 +1627,12 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
forOp.getResultTypes().end());
llvm::SmallDenseMap<Value, int64_t> argIndexMapping;
for (auto [i, retIdx] : llvm::enumerate(newRetIndices)) {
- warpInput.push_back(newWarpOp.getResult(retIdx));
+ auto newWarpResult = newWarpOp.getResult(retIdx);
+ // Unused forOp results yielded by the warpOp region are already included
+ // in the new ForOp.
+ if (llvm::is_contained(newOperands, newWarpResult))
+ continue;
+ warpInput.push_back(newWarpResult);
argIndexMapping[escapingValues[i]] = warpInputType.size();
warpInputType.push_back(inputTypes[i]);
}
diff --git a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
index 38771f2593449..6c7ac7a5196a7 100644
--- a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
+++ b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
@@ -584,6 +584,42 @@ func.func @warp_scf_for_multiple_yield(%arg0: index, %arg1: memref<?xf32>, %arg2
return
}
+// -----
+// CHECK-PROP-LABEL: func.func @warp_scf_for_unused_yield(
+// CHECK-PROP: %[[W0:.*]]:2 = gpu.warp_execute_on_lane_0(%{{.*}})[32] -> (vector<4xf32>, vector<4xf32>) {
+// CHECK-PROP: %[[INI0:.*]] = "some_def"() : () -> vector<128xf32>
+// CHECK-PROP: %[[INI1:.*]] = "some_def"() : () -> vector<128xf32>
+// CHECK-PROP: gpu.yield %[[INI0]], %[[INI1]] : vector<128xf32>, vector<128xf32>
+// CHECK-PROP: }
+// CHECK-PROP: %[[F:.*]]:2 = scf.for %{{.*}} iter_args(%{{.*}} = %[[W0]]#0, %{{.*}} = %[[W0]]#1) -> (vector<4xf32>, vector<4xf32>) {
+// CHECK-PROP: %[[W1:.*]]:2 = gpu.warp_execute_on_lane_0(%{{.*}})[32] args(%{{.*}} : vector<4xf32>, vector<4xf32>) -> (vector<4xf32>, vector<4xf32>) {
+// CHECK-PROP: %[[ACC0:.*]] = "some_def"(%{{.*}}) : (vector<128xf32>, index) -> vector<128xf32>
+// CHECK-PROP: %[[ACC1:.*]] = "some_def"(%{{.*}}) : (index, vector<128xf32>, vector<128xf32>) -> vector<128xf32>
+// CHECK-PROP: gpu.yield %[[ACC1]], %[[ACC0]] : vector<128xf32>, vector<128xf32>
+// CHECK-PROP: }
+// CHECK-PROP: scf.yield %[[W1]]#0, %[[W1]]#1 : vector<4xf32>, vector<4xf32>
+// CHECK-PROP: }
+// CHECK-PROP: "some_use"(%[[F]]#0) : (vector<4xf32>) -> ()
+func.func @warp_scf_for_unused_yield(%arg0: index) {
+ %c128 = arith.constant 128 : index
+ %c1 = arith.constant 1 : index
+ %c0 = arith.constant 0 : index
+ %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
+ %ini = "some_def"() : () -> (vector<128xf32>)
+ %ini1 = "some_def"() : () -> (vector<128xf32>)
+ %3:2 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1) -> (vector<128xf32>, vector<128xf32>) {
+ %add = arith.addi %arg3, %c1 : index
+ %1 = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
+ %acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
+ scf.yield %acc, %1 : vector<128xf32>, vector<128xf32>
+ }
+ gpu.yield %3#0 : vector<128xf32>
+ }
+ "some_use"(%0) : (vector<4xf32>) -> ()
+ return
+}
+
+
// -----
// CHECK-PROP-LABEL: func @vector_reduction(
@kurapov-peter, @Groverkss, @matthias-springer Can you please help with reviewing this PR?
// Any forOp result that is not already yielded by the warpOp
// region is also considered escaping and must be returned by the
// original warpOp.
for (OpResult forResult : forOp.getResults()) {
  // Check if this forResult is already yielded by the yield op.
  if (llvm::is_contained(yield->getOperands(), forResult)) {
    continue;
  }
  collectEscapingValues(forResult);
}
I didn't get this part, could you please clarify? This should collect the arguments of the `scf.for` that would be distributed and returned by the new signature of the original warp op. Why would the original warp op return the `scf.for`'s results?

Shouldn't the results be collected and inserted only into the second warp op, the one that is created inside the `scf.for`?
Great question. This is how the original implementation works. Initially, the original warpOp will return the used forOp results plus any escaping values. Later, the forOp results are replaced with the corresponding initArgs, here:

yieldOperand.set(forOp.getInitArgs()[forResult.getResultNumber()]);

I did not change the original logic flow. Hope this clarifies your question.
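A sketch of what that patching does (hypothetical IR with assumed names, not taken from the test): a warp-op terminator operand that is a loop result gets swapped for the loop's matching init value:

```mlir
// Before patching: the warp op yields a result of the loop directly.
%r:2 = scf.for ... iter_args(%a = %init0, %b = %init1) -> (vector<128xf32>, vector<128xf32>) { ... }
gpu.yield %r#1 : vector<128xf32>

// After yieldOperand.set(forOp.getInitArgs()[1]), since %r#1 has result number 1:
gpu.yield %init1 : vector<128xf32>
```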
Ok, I had to play with it for a bit to understand how it works. I think adding results of the `ForOp` to the yielded values as the first step is pretty confusing, since the resulting warp op after transformation should not have those. I've modified your lit test to make an example that will hopefully help others understand it better.
Initial IR:
func.func @warp_scf_for_unused_yield(%arg0: index) {
%c128 = arith.constant 128 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%0:2 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>, vector<4xf32>) {
%ini = "some_def"() : () -> (vector<128xf32>)
%ini1 = "some_def"() : () -> (vector<128xf32>)
%other = "other_def"() : () -> (vector<128xf32>)
%3:3 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1, %other_2 = %other) -> (vector<128xf32>, vector<128xf32>, vector<128xf32>) {
%add = arith.addi %arg3, %c1 : index
%1 = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
%acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
%other2 = "other2_def"(%arg4) : (vector<128xf32>) -> (vector<128xf32>)
scf.yield %acc, %1, %other2 : vector<128xf32>, vector<128xf32>, vector<128xf32>
}
gpu.yield %3#0, %other : vector<128xf32>, vector<128xf32>
}
"some_use"(%0#0) : (vector<4xf32>) -> ()
"other_use"(%0#1) : (vector<4xf32>) -> ()
return
}
Right after the `newWarpOp` creation, the warp op looks like:
%0:4 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>, vector<4xf32>, vector<4xf32>, vector<4xf32>) {
%1 = "some_def"() : () -> vector<128xf32>
%2 = "some_def"() : () -> vector<128xf32>
%3 = "other_def"() : () -> vector<128xf32>
%4:3 = scf.for %arg1 = %c0 to %c128 step %c1 iter_args(%arg2 = %1, %arg3 = %2, %arg4 = %3) -> (vector<128xf32>, vector<128xf32>, vector<128xf32>) {
%5 = arith.addi %arg1, %c1 : index
%6 = "some_def"(%arg3, %5) : (vector<128xf32>, index) -> vector<128xf32>
%7 = "some_def"(%5, %arg2, %6) : (index, vector<128xf32>, vector<128xf32>) -> vector<128xf32>
%8 = "other2_def"(%arg2) : (vector<128xf32>) -> vector<128xf32>
scf.yield %7, %6, %8 : vector<128xf32>, vector<128xf32>, vector<128xf32>
}
gpu.yield %4#0, %3, %4#1, %4#2 : vector<128xf32>, vector<128xf32>, vector<128xf32>, vector<128xf32>
}
The results of the `ForOp` are now returned even though the op itself will be sunk through `gpu.yield`. These are patched up later to produce the valid result. This is a bit unexpected after you read the comment for the transformation (which is why it caught my attention).

I think this is somewhat brittle. If it is done to preserve some existing code, I'd reconsider in favor of simplicity.
Thanks for the review :-)
I agree the existing logic is a bit confusing. I am also not sure why the forOp results are yielded instead of the iterArgs directly. Maybe the code owners have some insights on this?

But overall the logic will be similar apart from this part (even for a refactored version). This pattern is a bit complex anyway, since it has a lot of moving parts. :-)
Discourse post: https://discourse.llvm.org/t/mlir-vector-distribution-warpopscfforop-fails-when-scf-for-has-results-that-are-unused/86390
Example: (see the lit test in the diff above)

This example crashes because `WarpOpScfForOp` currently does not support cases where an `scf.for` result is unused. Here `%1` is yielded but has no users outside the `scf.for`. It also cannot be hoisted, due to loop-carried dependences.

Crash trace: (collapsed)
Solution
Currently, only values defined outside the `ForOp` but inside the original `WarpOp` are considered "escaping values". However, this is insufficient if the `ForOp` has unused results: in that case, the corresponding iterArgs must also be yielded by the original `WarpOp`. This PR adds the required code changes to achieve this.