Skip to content

[mlir][TilingInterface] Use LoopLikeOpInterface in tiling using SCF to unify tiling with scf.for and scf.forall. #77874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jan 26, 2024

Conversation

MaheshRavishankar
Copy link
Contributor

@MaheshRavishankar MaheshRavishankar commented Jan 12, 2024

Using LoopLikeOpInterface as the basis for the implementation unifies all the tiling logic for both scf.for and scf.forall. The only difference is the actual loop generation. This is a follow up to #72178
Instead of many entry points for each loop type, the loop type is now passed as part of the options passed to the tiling method.

This is a breaking change with the following changes

  1. The scf::tileUsingSCFForOp is renamed to scf::tileUsingSCF
  2. The scf::tileUsingSCFForallOp is deprecated. The same
    functionality is obtained by using scf::tileUsingSCF and setting
    the loop type in scf::SCFTilingOptions passed into this method to
    scf::SCFTilingOptions::LoopType::ForallOp (using the
    setLoopType method).
  3. The scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp is
    renamed to scf::tileConsumerAndFuseProducerUsingSCF. The use of
    the controlFn in scf::SCFTileAndFuseOptions allows implementing
    any strategy with the default callback implemeting the greedy fusion.
  4. The scf::SCFTilingResult and scf::SCFTileAndFuseResult now use
    SmallVector<LoopLikeOpInterface>.
  5. To make scf::ForallOp implement the parts of
    LoopLikeOpInterface needed, the getOutputBlockArguments()
    method is replaced with getRegionIterArgs()

These changes now bring the tiling and fusion capabilities using scf.forall on par with what was already supported by scf.for

@llvmbot
Copy link
Member

llvmbot commented Jan 12, 2024

@llvm/pr-subscribers-flang-fir-hlfir
@llvm/pr-subscribers-mlir-affine
@llvm/pr-subscribers-mlir-scf
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-linalg

Author: None (MaheshRavishankar)

Changes

Using LoopLikeOpInterface as the basis for the implementation unifies all the tiling logic for both scf.for and scf.forall. The only difference is the actual loop generation.
Instead of many entry points for each loop type, the loop type is now passed as part of the options passed to the tiling method.

This is a breaking change with the following changes

  1. The scf::tileUsingSCFForOp is renamed to scf::tileUsingSCF 2) The scf::tileUsingSCFForallOp is deprecated. The same
    functionality is obtained by using scf::tileUsingSCF and setting
    the loop type in scf::SCFTilingOptions passed into this method to
    scf::SCFTilingOptions::LoopType::ForallOp (using the
    setLoopType method).
  2. The scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp is
    renamed to scf::tileConsumerAndFuseProducerUsingSCF. The use of
    the controlFn in scf::SCFTileAndFuseOptions allows implementing
    any strategy with the default callback implemeting the greedy fusion.
  3. The scf::SCFTilingResult and scf::SCFTileAndFuseResult now use
    SmallVector&lt;LoopLikeOpInterface&gt;.
  4. To make scf::ForallOp implement the parts of
    LoopLikeOpInterface needed, the getOutputBlockArguments()
    method is replaced with getRegionIterArgs()

These changes now bring the tiling and fusion capabilities using scf.forall on par with what was already supported


Patch is 107.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/77874.diff

18 Files Affected:

  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCFOps.td (+6-13)
  • (modified) mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h (+22-19)
  • (modified) mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp (+5-5)
  • (modified) mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp (+2-2)
  • (modified) mlir/lib/Dialect/SCF/IR/SCF.cpp (+47-3)
  • (modified) mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp (+411-398)
  • (modified) mlir/lib/Dialect/SCF/Utils/Utils.cpp (+4-4)
  • (modified) mlir/lib/Interfaces/LoopLikeInterface.cpp (+19-17)
  • (modified) mlir/test/Dialect/Linalg/generalize-tensor-unpack-tile.mlir (+4-4)
  • (modified) mlir/test/Dialect/Linalg/tile-conv.mlir (+1-1)
  • (modified) mlir/test/Dialect/Linalg/tile-tensors.mlir (+1-1)
  • (modified) mlir/test/Dialect/Tensor/tiling.mlir (+10-14)
  • (added) mlir/test/Interfaces/TilingInterface/tile-and-fuse-using-scfforall.mlir (+176)
  • (modified) mlir/test/Interfaces/TilingInterface/tile-pad-using-interface.mlir (+2-3)
  • (modified) mlir/test/Interfaces/TilingInterface/tile-using-interface.mlir (+9-9)
  • (modified) mlir/test/Interfaces/TilingInterface/tile-using-scfforall.mlir (+148-2)
  • (modified) mlir/test/lib/Interfaces/TilingInterface/TestTilingInterfaceTransformOps.cpp (+95-3)
  • (modified) mlir/test/lib/Interfaces/TilingInterface/TestTilingInterfaceTransformOps.td (+24-2)
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
index 8d65d3dd820baf..819d88df0ae57f 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
@@ -135,9 +135,9 @@ def ExecuteRegionOp : SCF_Op<"execute_region", [
 
 def ForOp : SCF_Op<"for",
       [AutomaticAllocationScope, DeclareOpInterfaceMethods<LoopLikeOpInterface,
-       ["getInitsMutable", "getSingleInductionVar", "getSingleLowerBound",
-        "getSingleStep", "getSingleUpperBound", "getYieldedValuesMutable",
-        "getLoopResults", "promoteIfSingleIteration",
+       ["getInitsMutable", "getRegionIterArgs", "getSingleInductionVar", 
+        "getSingleLowerBound", "getSingleStep", "getSingleUpperBound",
+        "getYieldedValuesMutable", "getLoopResults", "promoteIfSingleIteration",
         "replaceWithAdditionalYields"]>,
        AllTypesMatch<["lowerBound", "upperBound", "step"]>,
        ConditionallySpeculatable,
@@ -259,10 +259,6 @@ def ForOp : SCF_Op<"for",
 
     Value getInductionVar() { return getBody()->getArgument(0); }
 
-    Block::BlockArgListType getRegionIterArgs() {
-      return getBody()->getArguments().drop_front(getNumInductionVars());
-    }
-
     /// Return the `index`-th region iteration argument.
     BlockArgument getRegionIterArg(unsigned index) {
       assert(index < getNumRegionIterArgs() &&
@@ -304,8 +300,9 @@ def ForallOp : SCF_Op<"forall", [
        AttrSizedOperandSegments,
        AutomaticAllocationScope,
        DeclareOpInterfaceMethods<LoopLikeOpInterface,
-          ["promoteIfSingleIteration", "getSingleInductionVar",
-          "getSingleLowerBound", "getSingleUpperBound", "getSingleStep"]>,
+          ["getInitsMutable", "getRegionIterArgs", "getSingleInductionVar", 
+           "getSingleLowerBound", "getSingleUpperBound", "getSingleStep",
+           "promoteIfSingleIteration", "replaceWithAdditionalYields"]>,
        RecursiveMemoryEffects,
        SingleBlockImplicitTerminator<"scf::InParallelOp">,
        DeclareOpInterfaceMethods<RegionBranchOpInterface>,
@@ -585,10 +582,6 @@ def ForallOp : SCF_Op<"forall", [
                                     getNumDynamicControlOperands() + getRank());
     }
 
-    ArrayRef<BlockArgument> getOutputBlockArguments() {
-      return getBody()->getArguments().drop_front(getRank());
-    }
-
     ::mlir::ValueRange getInductionVars() {
       return getBody()->getArguments().take_front(getRank());
     }
diff --git a/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h b/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
index 5d2d78e6e6165b..965ef9e203be28 100644
--- a/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
+++ b/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
@@ -12,6 +12,7 @@
 #include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/Transforms/Transforms.h"
 #include "mlir/IR/PatternMatch.h"
+#include "mlir/Interfaces/LoopLikeInterface.h"
 #include "mlir/Interfaces/TilingInterface.h"
 
 #include <deque>
@@ -52,6 +53,14 @@ struct SCFTilingOptions {
     return *this;
   }
 
+  /// Specify which loop construct to use for tile and fuse.
+  enum class LoopType { ForOp, ForallOp };
+  LoopType loopType = LoopType::ForOp;
+  SCFTilingOptions &setLoopType(LoopType type) {
+    loopType = type;
+    return *this;
+  }
+
   /// Specify mapping of loops to devices. This is only respected when the loop
   /// constructs support such a mapping (like `scf.forall`). Will be ignored
   /// when using loop constructs that dont support such a mapping (like
@@ -71,7 +80,7 @@ struct SCFTilingResult {
   /// of the last op.
   SmallVector<Operation *> tiledOps;
   /// The `scf.for` operations that iterate over the tiles.
-  SmallVector<Operation *> loops;
+  SmallVector<LoopLikeOpInterface> loops;
   /// Values to use as replacements for the untiled op. Is the same size as the
   /// number of results of the untiled op.
   SmallVector<Value> replacements;
@@ -79,15 +88,9 @@ struct SCFTilingResult {
 
 /// Method to tile an op that implements the `TilingInterface` using
 /// `scf.for` for iterating over the tiles.
-FailureOr<SCFTilingResult> tileUsingSCFForOp(RewriterBase &rewriter,
-                                             TilingInterface op,
-                                             const SCFTilingOptions &options);
-
-/// Method to tile an op that implements the `TilingInterface` using
-/// `scf.forall`.
-FailureOr<SCFTilingResult>
-tileUsingSCFForallOp(RewriterBase &rewriter, TilingInterface op,
-                     const SCFTilingOptions &options);
+FailureOr<SCFTilingResult> tileUsingSCF(RewriterBase &rewriter,
+                                        TilingInterface op,
+                                        const SCFTilingOptions &options);
 
 /// Options used to control tile + fuse.
 struct SCFTileAndFuseOptions {
@@ -135,7 +138,7 @@ struct SCFFuseProducerOfSliceResult {
 std::optional<SCFFuseProducerOfSliceResult>
 tileAndFuseProducerOfSlice(RewriterBase &rewriter,
                            tensor::ExtractSliceOp candidateSliceOp,
-                           MutableArrayRef<scf::ForOp> loops);
+                           MutableArrayRef<LoopLikeOpInterface> loops);
 
 /// Reconstruct the fused producer from within the tiled-and-fused code. Based
 /// on the slice of the producer computed in place it is possible that within
@@ -187,10 +190,10 @@ tileAndFuseProducerOfSlice(RewriterBase &rewriter,
 /// where `%0` had other uses as well. If not reconstructed from within the loop
 /// body, uses of `%0` could not be replaced, making it still live and the
 /// fusion immaterial.
-void yieldReplacementForFusedProducer(
+LogicalResult yieldReplacementForFusedProducer(
     RewriterBase &rewriter, tensor::ExtractSliceOp sliceOp,
     scf::SCFFuseProducerOfSliceResult fusedProducerInfo,
-    MutableArrayRef<scf::ForOp> loops);
+    MutableArrayRef<LoopLikeOpInterface> loops);
 
 /// Transformation information returned after tile and fuse.
 struct SCFTileAndFuseResult {
@@ -201,7 +204,7 @@ struct SCFTileAndFuseResult {
   /// generated operation.
   llvm::SetVector<Operation *> tiledAndFusedOps;
   /// The `scf.for` operations that iterate over the tiles.
-  SmallVector<Operation *> loops;
+  SmallVector<LoopLikeOpInterface> loops;
   /// The replacement values to use for the tiled and fused operations.
   llvm::DenseMap<Value, Value> replacements;
 };
@@ -232,9 +235,9 @@ struct SCFTileAndFuseResult {
 /// }
 /// ```
 FailureOr<SCFTileAndFuseResult>
-tileConsumerAndFuseProducerGreedilyUsingSCFForOp(
-    RewriterBase &rewriter, TilingInterface consumer,
-    const SCFTileAndFuseOptions &options);
+tileConsumerAndFuseProducersUsingSCF(RewriterBase &rewriter,
+                                     TilingInterface consumer,
+                                     const SCFTileAndFuseOptions &options);
 
 /// Method to lower an `op` that implements the `TilingInterface` to
 /// loops/scalars.
@@ -249,8 +252,8 @@ struct SCFReductionTilingResult {
   Operation *mergeOp;
   /// Initial op
   Operation *initialOp;
-  /// The `scf.for` operations that iterate over the tiles.
-  SmallVector<scf::ForOp> loops;
+  /// The loop operations that iterate over the tiles.
+  SmallVector<LoopLikeOpInterface> loops;
 };
 
 /// Method to tile a reduction and generate a parallel op within a serial loop.
diff --git a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
index 97d2b4a3be5c56..5efa2b3eb7476f 100644
--- a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
+++ b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
@@ -485,8 +485,8 @@ transform::FuseOp::apply(transform::TransformRewriter &rewriter,
       tileSizes.size() - llvm::count(tileSizes, 0), transformResults,
       [&](TilingInterface tilingInterfaceOp)
           -> FailureOr<scf::SCFTileAndFuseResult> {
-        return tileConsumerAndFuseProducerGreedilyUsingSCFForOp(
-            rewriter, tilingInterfaceOp, tileAndFuseOptions);
+        return tileConsumerAndFuseProducersUsingSCF(rewriter, tilingInterfaceOp,
+                                                    tileAndFuseOptions);
       });
   return failed(result) ? DiagnosedSilenceableFailure::definiteFailure()
                         : DiagnosedSilenceableFailure::success();
@@ -584,7 +584,7 @@ static Operation *replaceForAllWithNewSignature(
   Operation *firstYieldOp = yieldingOps.front();
   rewriter.setInsertionPoint(firstYieldOp);
   Value src = tileAndFuseResult.tiledValues[0];
-  Value dst = newforallOp.getOutputBlockArguments().back();
+  Value dst = newforallOp.getRegionIterArgs().back();
   SmallVector<OpFoldResult> strides(offsets.size(), rewriter.getIndexAttr(1));
   rewriter.create<tensor::ParallelInsertSliceOp>(firstYieldOp->getLoc(), src,
                                                  dst, offsets, sizes, strides);
@@ -2063,7 +2063,7 @@ transform::ScalarizeOp::applyToOne(transform::TransformRewriter &rewriter,
   });
   SmallVector<int64_t> emptyTileSizes;
   rewriter.setInsertionPoint(target);
-  FailureOr<scf::SCFTilingResult> maybeTilingResult = tileUsingSCFForOp(
+  FailureOr<scf::SCFTilingResult> maybeTilingResult = tileUsingSCF(
       rewriter, cast<TilingInterface>(target.getOperation()), tilingOptions);
   if (failed(maybeTilingResult))
     return emitDefaultDefiniteFailure(target);
@@ -2647,7 +2647,7 @@ transform::TileUsingForOp::apply(transform::TransformRewriter &rewriter,
 
     tilingOptions.setInterchange(getInterchange());
     FailureOr<scf::SCFTilingResult> maybeTilingResult =
-        tileUsingSCFForOp(rewriter, tilingInterface, tilingOptions);
+        tileUsingSCF(rewriter, tilingInterface, tilingOptions);
     if (failed(maybeTilingResult))
       return DiagnosedSilenceableFailure::definiteFailure();
 
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp b/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
index 7f3ab1f1a24b2f..339d06cdeaf60a 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
@@ -358,7 +358,7 @@ static FailureOr<ForallTilingResult> tileToForallOpImpl(
 
   // 3. Clone the tileable op and update its destination operands to use the
   // output bbArgs of the ForallOp.
-  ArrayRef<BlockArgument> destBbArgs = forallOp.getOutputBlockArguments();
+  ArrayRef<BlockArgument> destBbArgs = forallOp.getRegionIterArgs();
   Operation *tiledOp = nullptr;
   SmallVector<Value> tiledValues;
   {
@@ -695,7 +695,7 @@ FailureOr<linalg::ForallReductionTilingResult> linalg::tileReductionUsingForall(
   // 4. Clone the tileable op and update its destination operands to use the
   // output bbArgs of the ForallOp.
   SmallVector<Value> tilingResults;
-  ArrayRef<BlockArgument> destBbArgs = forallOp.getOutputBlockArguments();
+  ArrayRef<BlockArgument> destBbArgs = forallOp.getRegionIterArgs();
   {
     // 4.a. RAII guard, inserting within forallOp, before terminator.
     OpBuilder::InsertionGuard g(b);
diff --git a/mlir/lib/Dialect/SCF/IR/SCF.cpp b/mlir/lib/Dialect/SCF/IR/SCF.cpp
index 5570c2ec688c8a..7dbc8f250b25eb 100644
--- a/mlir/lib/Dialect/SCF/IR/SCF.cpp
+++ b/mlir/lib/Dialect/SCF/IR/SCF.cpp
@@ -527,6 +527,10 @@ ParseResult ForOp::parse(OpAsmParser &parser, OperationState &result) {
 
 SmallVector<Region *> ForOp::getLoopRegions() { return {&getRegion()}; }
 
+Block::BlockArgListType ForOp::getRegionIterArgs() {
+  return getBody()->getArguments().drop_front(getNumInductionVars());
+}
+
 MutableArrayRef<OpOperand> ForOp::getInitsMutable() {
   return getInitArgsMutable();
 }
@@ -622,6 +626,47 @@ LogicalResult scf::ForallOp::promoteIfSingleIteration(RewriterBase &rewriter) {
   return success();
 }
 
+Block::BlockArgListType ForallOp::getRegionIterArgs() {
+  return getBody()->getArguments().drop_front(getRank());
+}
+
+MutableArrayRef<OpOperand> ForallOp::getInitsMutable() {
+  return getOutputsMutable();
+}
+
+FailureOr<LoopLikeOpInterface>
+ForallOp::replaceWithAdditionalYields(RewriterBase &rewriter,
+                                      ValueRange newInitOperands,
+                                      bool replaceInitOperandUsesInLoop,
+                                      const NewYieldValuesFn &newYieldValueFn) {
+  OpBuilder::InsertionGuard g(rewriter);
+  rewriter.setInsertionPoint(getOperation());
+  auto inits = llvm::to_vector(getOutputs());
+  inits.append(newInitOperands.begin(), newInitOperands.end());
+  auto newLoop = rewriter.create<scf::ForallOp>(
+      getLoc(), getMixedLowerBound(), getMixedUpperBound(), getMixedStep(),
+      inits, getMapping(), [](OpBuilder &, Location, ValueRange) {});
+
+  // Move the region of the current block to the newly created op.
+  Block *newLoopBody = newLoop.getBody();
+  rewriter.mergeBlocks(
+      getBody(), newLoopBody,
+      newLoopBody->getArguments().take_front(getBody()->getNumArguments()));
+
+  // Update the terminator.
+  {
+    OpBuilder::InsertionGuard g(rewriter);
+    auto terminator = cast<scf::InParallelOp>(newLoopBody->getTerminator());
+    rewriter.setInsertionPointToEnd(terminator.getBody());
+    newYieldValueFn(
+        rewriter, getLoc(),
+        newLoopBody->getArguments().take_back(newInitOperands.size()));
+  }
+  rewriter.replaceOp(getOperation(),
+                     newLoop->getResults().take_front(getNumResults()));
+  return cast<LoopLikeOpInterface>(newLoop.getOperation());
+}
+
 /// Promotes the loop body of a scf::ForallOp to its containing block.
 void mlir::scf::promote(RewriterBase &rewriter, scf::ForallOp forallOp) {
   OpBuilder::InsertionGuard g(rewriter);
@@ -1630,9 +1675,8 @@ struct FoldTensorCastOfOutputIntoForallOp
     // mapped to the tensor.cast old-typed results of the output bbArgs. The
     // destination have to be updated to point to the output bbArgs directly.
     auto terminator = newForallOp.getTerminator();
-    for (auto [yieldingOp, outputBlockArg] :
-         llvm::zip(terminator.getYieldingOps(),
-                   newForallOp.getOutputBlockArguments())) {
+    for (auto [yieldingOp, outputBlockArg] : llvm::zip(
+             terminator.getYieldingOps(), newForallOp.getRegionIterArgs())) {
       auto insertSliceOp = cast<tensor::ParallelInsertSliceOp>(yieldingOp);
       insertSliceOp.getDestMutable().assign(outputBlockArg);
     }
diff --git a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
index 38e0625d7ce093..20baa5c8dfcfa5 100644
--- a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
+++ b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
@@ -55,32 +55,8 @@ fillInterchangeVector(ArrayRef<int64_t> interchangeVector,
   return filledVector;
 }
 
-/// Convert a list of ops of type `SrcOpTy` to list of `Operation *`.
-template <typename SrcOpTy>
-static SmallVector<Operation *> getAsOperations(ArrayRef<SrcOpTy> ops) {
-  return llvm::to_vector(
-      llvm::map_range(ops, [](auto op) -> Operation * { return op; }));
-}
-template <typename SrcOpTy>
-static SmallVector<Operation *>
-getAsOperations(const SmallVector<SrcOpTy> &ops) {
-  return getAsOperations(ArrayRef<SrcOpTy>(ops));
-}
-
-/// Convert a list of `Operation *` to a list of `DstOpTy.
-template <typename DstOpTy>
-static SmallVector<DstOpTy> castToTypedOperations(ArrayRef<Operation *> ops) {
-  return llvm::to_vector(
-      llvm::map_range(ops, [](Operation *op) { return cast<DstOpTy>(op); }));
-}
-template <typename DstOpTy>
-static SmallVector<DstOpTy>
-castToTypedOperations(const SmallVector<Operation *> &ops) {
-  return castToTypedOperations<DstOpTy>(ArrayRef<Operation *>(ops));
-}
-
 //===----------------------------------------------------------------------===//
-// tileUsingSCFForOp implementation.
+// tileUsingSCF implementation.
 //===----------------------------------------------------------------------===//
 
 // Check if `stride` evenly divides the trip count `size - offset`.
@@ -135,66 +111,204 @@ static Operation *cloneOpAndUpdateDestinationArgs(RewriterBase &rewriter,
   return clonedOp;
 }
 
-/// Generate an empty loop nest that represents the tiled loop nest shell.
+/// Type of the call back function used to generate the body of the tiled
+/// loops. The loop generation methods use this callback to generate
+/// the body of the inner-most tile loop. The call back is provided with
+/// - `rewriter`: with insertion point set to the end of the inner most loop
+///    body
+/// - `ivs`: the induction variables for the surrounding loops.
+/// - `regionIterArgs`: the basic block arguments of the inner most loop that
+///   correspond to the init/result of the loop.
+/// - `resultOffsets/Sizes`: The call back returns the result of tiling the
+///   operation, but in `resultOffsets` and `resultSizes` the callback function
+///   is expected to populate the offsets/sizes to use while inserting the
+///   result back into the corresponding `regionIterArgs`. These values
+///   are used by the loop generation methods to create the appropriate yield
+///   values from the inner most loop.
+/// The call back returns the `TilingResult` obtained from tiling the operation.
+using TileLoopBodyFn = std::function<FailureOr<TilingResult>(
+    RewriterBase &rewriter, Location loc, ValueRange ivs,
+    ValueRange regionIterArgs,
+    SmallVector<SmallVector<OpFoldResult>> &resultOffsets,
+    SmallVector<SmallVector<OpFoldResult>> &resultSizes)>;
+
+/// Generate the tile-loop nest using `scf.for` operation.
 /// - `loopRanges` specifies the lb, ub and step of the untiled iteration space.
 /// - `tileSizes` is the tile sizes to use. Zero represent untiled loops.
-/// - In `offsets` and `sizes` return the multi-dimensional offset and size of
-///   the tile processed within the inner most loop.
-/// Note that this methods adds `scf.yield` operation for all but the innermost
-/// loop. These yield the value returned by the immediately inner loop. The
-/// caller is expected to add the scf.yield operation for the innermost loop.
-static SmallVector<scf::ForOp> generateTileLoopNest(
-    OpBuilder &builder, Location loc, ArrayRef<Range> loopRanges,
-    ArrayRef<OpFoldResult> tileSizes, SmallVector<OpFoldResult> &offsets,
-    SmallVector<OpFoldResult> &sizes, ValueRange destinationTensors = {}) {
-  if (loopRanges.empty())
-    return {};
+/// - `destinationTensors` are the init values to use for the outer most loop.
+/// - `tileLoopBodyFn` is called to generated the loop body of the inner most
+///    loop.
+/// - `loops` is an in-out parameter into which the generated loops are
+///    populated.
+/// The `TilingResult` returned by calling `tileLoopBodyFn` is returned back
+/// to the caller.
+static FailureOr<TilingResult> generateLoopNestUsingForOp(
+    RewriterBase &rewriter, Location loc, ArrayRef<Range> loopRanges,
+    ArrayRef<OpFoldResult> tileSizes, ValueRange destinationTensors,
+    TileLoopBodyFn tileLoopBodyFn, SmallVector<LoopLikeOpInterface> &loops) {
+  assert(!loopRanges.empty() && "unexpected empty loop ranges");
   assert(loopRanges.size() == tileSizes.size() &&
          "expected as many tile sizes as loop ranges");
-  OpBuilder::InsertionGuard guard(builder);
-  SmallVector<scf::ForOp> loops;
-  offsets.resize(loopRanges.size());
-  sizes.resize(loopRanges.size());
-
-  for (auto loopRange : llvm::enumerate(loopRanges)) {
-    Value offset =
-        getValueOrCreateConstantIndexOp(builder, loc, loopRange.value().offset);
-    Value size =
-        getValueOrCreateConstantIndexOp(builder, loc, loopRange.value().size);
-    Value tileSize = getValueOrCreateConstantIndexOp(
-        builder, loc, tileSizes[loopRange.index()]);
+  OpBuilder::InsertionGuard guard(rewriter);
+  SmallVector<Value> ivs;
+
+  for (auto [loopRange, tileSize] : llvm::zip_equal(loopRanges, tileSizes)) {
     // No loops if tile size is zero. Set offset and size to the loop
     // offset and size.
-    if (matchPattern(tileSize, m_Zero())) {
-      offsets[loopRange.index()] = offset;
-      sizes[loopRange.index()] = size;
+    if (isConstantIntValue(tileSize, 0)) {
       continue;
     }
 
-    auto loop = builder.create<scf::ForOp>(
-        loc, offset, size, tileSize, destinationTensors,
-        [&](OpBuilder &bodyBuilder, Location bodyLoc, Value iv,
-            ValueRange /*iterArgs*/) {
-          sizes[loopRange.index()] =
-   ...
[truncated]

Copy link
Contributor

@Abhishek-Varma Abhishek-Varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing @MaheshRavishankar !
Just one nit comment and CI related fix from my end.

OpBuilder::InsertionGuard g(rewriter);
auto terminator = cast<scf::InParallelOp>(newLoopBody->getTerminator());
rewriter.setInsertionPointToEnd(terminator.getBody());
newYieldValueFn(
Copy link
Member

@matthias-springer matthias-springer Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find the place in the code, but I think this lambda creates the parallel_insert_slice ops. It probably returns an empty SmallVector. This does not fit with the comment in LoopLikeInterface.h:

/// A function that returns the additional yielded values during
/// `replaceWithAdditionalYields`. `newBbArgs` are the newly added region
/// iter_args. This function should return as many values as there are block
/// arguments in `newBbArgs`.
using NewYieldValuesFn = std::function<SmallVector<Value>(
    OpBuilder &b, Location loc, ArrayRef<BlockArgument> newBbArgs)>;

It is a bit confusing that for scf.for we expect the function to return a list of Value and for scf.forall we expect the function to build new IR and not return anything.

How about generalizing NewYieldValuesFn to RewriteTerminatorFn = std::function<void(RewriterBase &rewriter, Operation *terminator, ArrayRef<BlockArgument> newBbArgs)>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but the scf.forall does not return any values. I looked at place where NewYieldValuesFn is used and it is quite a bit. Want to limit the breaking changes here, but I can do a quick follow up on that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that might not work as well. Consider

NewYieldValuesFn yieldFn = [&](OpBuilder &b, Location loc,
. This is written in a loop agnostic way, but yield value function will have to handle scf.forall separately. I think there is an inherent mismatch between the behavior of scf.forall and other loop constructs in the sense it doesnt yield anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I do have an issue here (and I didnt see it cause I am also missing a test that will show up this issue). So good catch... Ill drop this part and do something different...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an inherent mismatch between the behavior of scf.forall and other loop constructs in the sense it doesnt yield anything.

I ran into the same problem when cleaning up the LoopLikeOpInterface recently. I was able to add the interface to scf.while, but not scf.forall because of this issue.

I think we should generalize NewYieldValuesFn to RewriteTerminatorFn. That lambda can do whatever it wants.

/// A function that updates the terminator of a loop-like op during
/// `replaceWithAdditionalYields`. `newBbArgs` are the newly added region
/// iter_args. This function should rewrite the loop terminator (`terminator`)
/// such that the loop op verifies successfully.
using RewriteTerminatorFn =
    std::function<void(RewriterBase &rewriter, Operation *terminator, ArrayRef<BlockArgument> newBbArgs)>;

I think we also have to turn this interface function into an interface method:

    /// Append the specified additional "init" operands: replace this loop with
    /// a new loop that has the additional init operands. The loop body of this
    /// loop is moved over to the new loop.
    ///
    /// The newly added region iter_args are yielded from the loop.
    ::mlir::FailureOr<::mlir::LoopLikeOpInterface>
        replaceWithAdditionalIterOperands(::mlir::RewriterBase &rewriter,
                                          ::mlir::ValueRange newInitOperands,
                                          bool replaceInitOperandUsesInLoop)

For scf.forall, this interface could either:

  • "return failure();"
  • or: insert a parallel_insert_slice that inserts the newly added bbArg into itself (which is a no-op)
  • or: remove this interface method entirely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see actually how this work. Even with this new lambda, any algorithm that is working with LoopLikeOpInterface will have to special case scf.forall to handle the different yield approach it takes.

I went a different approach. I added a new interface method to LoopLikeOpInterface that allows handling of tiled yields. This allows unifying scf.for and scf.forall since they handle tiles very differently, and that gets now rolled into the interface method implementation each operation.
Also added a test that should have caught this issue earlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementations of yieldTiledValuesAndReplace make sense to me. But I wouldn't put it in LoopLikeOpInterface.

LoopLikeOpInterface was meant to expose basic loop properties and low-level transformations of loop-like ops. The new yieldTiledValuesAndReplace interface method is tiling-specific and does not belong in LoopLikeOpInterface imo. How about putting yieldTiledValuesAndReplace in a new interface in TilingInterface.td? Then we have all tiling-related stuff in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure a whole interface for just one method is worth it. I think this method is optional for the most part. No op needs to implement it. It is needed only if it wants to work with tiling, which is nice cause all the loop construct has to do to work with tiling is implement this method. Id prefer to keep it in LoopLikeOpInterface. If you feel strongly about it, I can move it out of the interface.

Copy link
Member

@matthias-springer matthias-springer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a small issue with NewYieldValuesFn.

Copy link
Contributor

@chelini chelini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this +1

Copy link
Contributor

@nicolasvasilache nicolasvasilache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the followup from #72178, right?
Let's mention it in the commit message so we can refer back to it later, had to do a bit of spelunking to find it.

Overall comment:

So IIRC, #72178 adopted the logic of tiling scf.forall from LinalgTransformOps.cpp and/or Linalg/Transforms/Tiling.cpp to redo scf.for.

This was a welcome refactoring and this PR is a nice continuation, nice!

To proceed confidently, I would now like to see red diffs in LinalgTransformOps.cpp and Linalg/Transforms/Tiling.cpp to see older implementations be actually replaced: atm this seems it is still adding more code in parallel to the old implementation.

For instance, I still see LinalgTransformOps.cpp calling code from Linalg/Transforms/Tiling.cpp:

  FailureOr<linalg::ForallTilingResult> maybeTilingResult = failure();
  if (!mixedNumThreads.empty()) {
    maybeTilingResult =
        linalg::tileToForallOp(rewriter, tileableOp, mixedNumThreads, mapping);
  } else {
    maybeTilingResult = linalg::tileToForallOpUsingTileSizes(
        rewriter, tileableOp, mixedTileSizes, mapping);
  }

Do you have a followup PR which cleans up the old?

Note, as part of this, I would also like to see something concrete on the num_threads discussion.

Besides the meta-comment that we now need to resolve, I'll be able to review this early next week.

Thanks!

/// populated.
/// The `TilingResult` returned by calling `tileLoopBodyFn` is returned back
/// to the caller.
static FailureOr<TilingResult> generateLoopNestUsingForOp(
Copy link
Contributor

@nicolasvasilache nicolasvasilache Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this differ / resemble older code that lowers Linalg to loops?

It would be nice if we could reuse the same utility for both places but I am a bit rusty of what utilities we have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean lowering to loops code? Maybe.. Ill look into that.

@MaheshRavishankar
Copy link
Contributor Author

This is the followup from #72178, right? Let's mention it in the commit message so we can refer back to it later, had to do a bit of spelunking to find it.

It is a change from a long list of related changes, but yes, #72178 is probably the start of it. Updated the comment anyway.

Overall comment:

So IIRC, #72178 adopted the logic of tiling scf.forall from LinalgTransformOps.cpp and/or Linalg/Transforms/Tiling.cpp to redo scf.for.

This was a welcome refactoring and this PR is a nice continuation, nice!

To proceed confidently, I would now like to see red diffs in LinalgTransformOps.cpp and Linalg/Transforms/Tiling.cpp to see older implementations be actually replaced: atm this seems it is still adding more code in parallel to the old implementation.

Yes, that is the goal, but I want to proceed incrementally. With this change, the same tests that work with scf.for work with scf.forall (so interchange, scalars, etc. etc) and avoids duplicating code for tiling using the two loop constructs. The next step for me is to move over the tiling logic that exists currently in Linalg/Tiling.cpp and LinalgTransformOps.cpp to this and fix as needed.

There is lot of context here spread all over the place w.r.t tiling. I want to unify all of it, but I dont think I can do that in one step. I would rather incrementally build up the required functionality in one place and deprecate/move pieces as needed. I do plan to do this in short order though..

For instance, I still see LinalgTransformOps.cpp calling code from Linalg/Transforms/Tiling.cpp:

  FailureOr<linalg::ForallTilingResult> maybeTilingResult = failure();
  if (!mixedNumThreads.empty()) {
    maybeTilingResult =
        linalg::tileToForallOp(rewriter, tileableOp, mixedNumThreads, mapping);
  } else {
    maybeTilingResult = linalg::tileToForallOpUsingTileSizes(
        rewriter, tileableOp, mixedTileSizes, mapping);
  }

Do you have a followup PR which cleans up the old?

Note, as part of this, I would also like to see something concrete on the num_threads discussion.

num_threads is the missing piece for me. I had asked this previously as to what is the difference between using loops of the form [0, num_threads) with the induction variable being some thread ID, vs. induction variable being [lb, ub) with step. It seems to be that the use of thread ID needs to come in only at the point of "resolution" of scf.forall but I maybe missing something. Again, treating it incrementally. Now with this PR I have the basics to start unwinding some of this.

Besides the meta-comment that we now need to resolve, I'll be able to review this early next week.

Thanks!

Thanks!

Copy link

github-actions bot commented Jan 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@MaheshRavishankar
Copy link
Contributor Author

@matthias-springer I am having the following three tests failing for me, for reasons that seem unrelated to this PR (related by the fact that this is happening with this PR, but unrelated in the sense that its the pad hoisting code failing even though the tiling is the same).

[build] Failed Tests (3):
[build]   MLIR :: Dialect/Linalg/transform-op-hoist-pad-build-packing-loop-nest.mlir
[build]   MLIR :: Dialect/Linalg/transform-op-hoist-pad.mlir
[build]   MLIR :: Dialect/Linalg/transform-op-peel-and-vectorize.mlir

I tried to triage this, but the issue seems deep in the weeds. I will continue looking, but if you have any pointers, that would be very helpful.

@MaheshRavishankar
Copy link
Contributor Author

@matthias-springer I am having the following three tests failing for me, for reasons that seem unrelated to this PR (related by the fact that this is happening with this PR, but unrelated in the sense that its the pad hoisting code failing even though the tiling is the same).

[build] Failed Tests (3):
[build]   MLIR :: Dialect/Linalg/transform-op-hoist-pad-build-packing-loop-nest.mlir
[build]   MLIR :: Dialect/Linalg/transform-op-hoist-pad.mlir
[build]   MLIR :: Dialect/Linalg/transform-op-peel-and-vectorize.mlir

I tried to triage this, but the issue seems deep in the weeds. I will continue looking, but if you have any pointers, that would be very helpful.

Never mind. Figured it out.. the hoisting logic needed some LICM to kick to work. Fixed now.

return getBody()->getArguments().drop_front(getRank());
}

MutableArrayRef<OpOperand> ForallOp::getInitsMutable() {
Copy link
Member

@matthias-springer matthias-springer Jan 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are missing a getYieldedValuesMutable implementation for ForallOp.

This is what the interface documentation says:

    Loop-carried variables can be exposed through this interface. There are
    3 components to a loop-carried variable.
    - The "region iter_arg" is the block argument of the entry block that
      represents the loop-carried variable in each iteration.
    - The "init value" is an operand of the loop op that serves as the initial
      region iter_arg value for the first iteration (if any).
    - The "yielded" value is the value that is forwarded from one iteration to
      serve as the region iter_arg of the next iteration.

    If one of the respective interface methods is implemented, so must the other
    two. The interface verifier ensures that the number of types of the region
    iter_args, init values and yielded values match.

This is checked by the op verifier, so you should be seeing verification failures.

The problem is that this op does not have yielded values.

We could have getYieldedValuesMutable return a std::optional<::llvm::MutableArrayRef<::mlir::OpOperand>> to account for the fact that some loops do not have yielding semantics. (Same as we do for getLoopResults.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, scf.forall should not implement getRegionIterArgs at all. The region iter_args in the loop like op interface are loop-carried variables, but the scf.forall does not have any loop-carried variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are missing a getYieldedValuesMutable implementation for ForallOp.

This is what the interface documentation says:

    Loop-carried variables can be exposed through this interface. There are
    3 components to a loop-carried variable.
    - The "region iter_arg" is the block argument of the entry block that
      represents the loop-carried variable in each iteration.
    - The "init value" is an operand of the loop op that serves as the initial
      region iter_arg value for the first iteration (if any).
    - The "yielded" value is the value that is forwarded from one iteration to
      serve as the region iter_arg of the next iteration.

    If one of the respective interface methods is implemented, so must the other
    two. The interface verifier ensures that the number of types of the region
    iter_args, init values and yielded values match.

Obviously that definition does not work for scf.forall. Might need to update this since the scf.forall does not yield any value. The region iter arg and init value can stay though.

This is checked by the op verifier, so you should be seeing verification failures.

I did. I fixed the verifier also, but I think as you note below, getYieldedValuesMutable should return an std::optional<..>

On second thought, scf.forall should not implement getRegionIterArgs at all. The region iter_args in the loop like op interface are loop-carried variables, but the scf.forall does not have any loop-carried variables.

We probably dont need that.. the region iter args can still be tied to init. If there is no yield value, then the verifier can handle it appropriately.

@MaheshRavishankar
Copy link
Contributor Author

@nicolasvasilache could you take a look again?

@nicolasvasilache
Copy link
Contributor

This PR looks fine to me, thanks!

I'd like to see the cleanup work truly start before we land more code.

There is lot of context here spread all over the place w.r.t tiling. I want to unify all of it, but I dont think I can do that in one step. I would rather incrementally build up the required functionality in one place and deprecate/move pieces as needed. I do plan to do this in short order though..

To be clear, I did not ask for a making this PR more complex but specifically asked for a followup PR to see that work has truly unambiguously started on the deprecation of the legacy implementation (I already voiced this back in Nov: #72178 (comment)).

Now is the time to see concrete movement on the deprecation before we land more code.

@MaheshRavishankar
Copy link
Contributor Author

This PR looks fine to me, thanks!

I'd like to see the cleanup work truly start before we land more code.

There is lot of context here spread all over the place w.r.t tiling. I want to unify all of it, but I dont think I can do that in one step. I would rather incrementally build up the required functionality in one place and deprecate/move pieces as needed. I do plan to do this in short order though..

To be clear, I did not ask for a making this PR more complex but specifically asked for a followup PR to see that work has truly unambiguously started on the deprecation of the legacy implementation (I already voiced this back in Nov: #72178 (comment)).

Now is the time to see concrete movement on the deprecation before we land more code.

I agree, my next step is to move the transform dialect op to use this implementation. So thats top on the list.

MaheshRavishankar added a commit to MaheshRavishankar/iree that referenced this pull request Jan 26, 2024
MaheshRavishankar added a commit to MaheshRavishankar/iree that referenced this pull request Jan 26, 2024
MaheshRavishankar and others added 15 commits January 25, 2024 20:22
… to unify tiling with `scf.for` and `scf.forall`.

Using `LoopLikeOpInterface` as the basis for the implementation
unifies all the tiling logic for both `scf.for` and `scf.forall`. The
only difference is the actual loop generation.
Instead of many entry points for each loop type, the loop type is now
passed as part of the options passed to the tiling method.

This is a breaking change with the following changes

1) The `scf::tileUsingSCFForOp` is renamed to `scf::tileUsingSCF`
2) The `scf::tileUsingSCFForallOp` is deprecated. The same
   functionality is obtained by using `scf::tileUsingSCF` and setting
   the loop type in `scf::SCFTilingOptions` passed into this method to
   `scf::SCFTilingOptions::LoopType::ForallOp` (using the
   `setLoopType` method).
3) The `scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp` is
   renamed to `scf::tileConsumerAndFuseProducerUsingSCF`. The use of
   the `controlFn` in `scf::SCFTileAndFuseOptions` allows implementing
   any strategy with the default callback implemeting the greedy fusion.
4) The `scf::SCFTilingResult` and `scf::SCFTileAndFuseResult` now use
   `SmallVector<LoopLikeOpInterface>`.
5) To make `scf::ForallOp` implement the parts of
   `LoopLikeOpInterface` needed, the `getOutputBlockArguments()`
   method is replaced with `getRegionIterArgs()`

This change also introduces a new interface method for
`LoopLikeOpInterface`, that allows loop constructs to handle tiled
yields.

These changes now bring the tiling and fusion capabilities using
`scf.forall` on par with what was already supported
- Remove passing as `const &` for the `llvm::function_ref`.
- Fix ordering of interface methods.
@MaheshRavishankar MaheshRavishankar dismissed nicolasvasilache’s stale review January 26, 2024 05:25

#77874 (comment) states that the PR is fine, and the follow up is agreed upon.

@MaheshRavishankar MaheshRavishankar merged commit 76ead96 into llvm:main Jan 26, 2024
MaheshRavishankar added a commit to iree-org/iree that referenced this pull request Jan 26, 2024
MaheshRavishankar added a commit to iree-org/iree that referenced this pull request Jan 26, 2024
[Advance LLVM to 76ead96c1d06ee0d828238bce96d0107e650b5fa:
[mlir][TilingInterface] Use `LoopLikeOpInterface` in tiling using SCF to
unify tiling with `scf.for` and `scf.forall`. (#77874)
(MaheshRavishankar on 2024-01-25 21:26:23 -0800) (40 of 41)

---------

Co-authored-by: hanhanW <[email protected]>
Groverkss pushed a commit to iree-org/iree that referenced this pull request Feb 2, 2024
[Advance LLVM to 76ead96c1d06ee0d828238bce96d0107e650b5fa:
[mlir][TilingInterface] Use `LoopLikeOpInterface` in tiling using SCF to
unify tiling with `scf.for` and `scf.forall`. (#77874)
(MaheshRavishankar on 2024-01-25 21:26:23 -0800) (40 of 41)

---------

Co-authored-by: hanhanW <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants