-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[mlir][TilingInterface] Use LoopLikeOpInterface
in tiling using SCF to unify tiling with scf.for
and scf.forall
.
#77874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[mlir][TilingInterface] Use LoopLikeOpInterface
in tiling using SCF to unify tiling with scf.for
and scf.forall
.
#77874
Conversation
@llvm/pr-subscribers-flang-fir-hlfir @llvm/pr-subscribers-mlir-linalg Author: None (MaheshRavishankar) ChangesUsing This is a breaking change with the following changes
These changes now bring the tiling and fusion capabilities using Patch is 107.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/77874.diff 18 Files Affected:
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
index 8d65d3dd820baf..819d88df0ae57f 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
@@ -135,9 +135,9 @@ def ExecuteRegionOp : SCF_Op<"execute_region", [
def ForOp : SCF_Op<"for",
[AutomaticAllocationScope, DeclareOpInterfaceMethods<LoopLikeOpInterface,
- ["getInitsMutable", "getSingleInductionVar", "getSingleLowerBound",
- "getSingleStep", "getSingleUpperBound", "getYieldedValuesMutable",
- "getLoopResults", "promoteIfSingleIteration",
+ ["getInitsMutable", "getRegionIterArgs", "getSingleInductionVar",
+ "getSingleLowerBound", "getSingleStep", "getSingleUpperBound",
+ "getYieldedValuesMutable", "getLoopResults", "promoteIfSingleIteration",
"replaceWithAdditionalYields"]>,
AllTypesMatch<["lowerBound", "upperBound", "step"]>,
ConditionallySpeculatable,
@@ -259,10 +259,6 @@ def ForOp : SCF_Op<"for",
Value getInductionVar() { return getBody()->getArgument(0); }
- Block::BlockArgListType getRegionIterArgs() {
- return getBody()->getArguments().drop_front(getNumInductionVars());
- }
-
/// Return the `index`-th region iteration argument.
BlockArgument getRegionIterArg(unsigned index) {
assert(index < getNumRegionIterArgs() &&
@@ -304,8 +300,9 @@ def ForallOp : SCF_Op<"forall", [
AttrSizedOperandSegments,
AutomaticAllocationScope,
DeclareOpInterfaceMethods<LoopLikeOpInterface,
- ["promoteIfSingleIteration", "getSingleInductionVar",
- "getSingleLowerBound", "getSingleUpperBound", "getSingleStep"]>,
+ ["getInitsMutable", "getRegionIterArgs", "getSingleInductionVar",
+ "getSingleLowerBound", "getSingleUpperBound", "getSingleStep",
+ "promoteIfSingleIteration", "replaceWithAdditionalYields"]>,
RecursiveMemoryEffects,
SingleBlockImplicitTerminator<"scf::InParallelOp">,
DeclareOpInterfaceMethods<RegionBranchOpInterface>,
@@ -585,10 +582,6 @@ def ForallOp : SCF_Op<"forall", [
getNumDynamicControlOperands() + getRank());
}
- ArrayRef<BlockArgument> getOutputBlockArguments() {
- return getBody()->getArguments().drop_front(getRank());
- }
-
::mlir::ValueRange getInductionVars() {
return getBody()->getArguments().take_front(getRank());
}
diff --git a/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h b/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
index 5d2d78e6e6165b..965ef9e203be28 100644
--- a/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
+++ b/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
@@ -12,6 +12,7 @@
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/Tensor/Transforms/Transforms.h"
#include "mlir/IR/PatternMatch.h"
+#include "mlir/Interfaces/LoopLikeInterface.h"
#include "mlir/Interfaces/TilingInterface.h"
#include <deque>
@@ -52,6 +53,14 @@ struct SCFTilingOptions {
return *this;
}
+ /// Specify which loop construct to use for tile and fuse.
+ enum class LoopType { ForOp, ForallOp };
+ LoopType loopType = LoopType::ForOp;
+ SCFTilingOptions &setLoopType(LoopType type) {
+ loopType = type;
+ return *this;
+ }
+
/// Specify mapping of loops to devices. This is only respected when the loop
/// constructs support such a mapping (like `scf.forall`). Will be ignored
/// when using loop constructs that dont support such a mapping (like
@@ -71,7 +80,7 @@ struct SCFTilingResult {
/// of the last op.
SmallVector<Operation *> tiledOps;
/// The `scf.for` operations that iterate over the tiles.
- SmallVector<Operation *> loops;
+ SmallVector<LoopLikeOpInterface> loops;
/// Values to use as replacements for the untiled op. Is the same size as the
/// number of results of the untiled op.
SmallVector<Value> replacements;
@@ -79,15 +88,9 @@ struct SCFTilingResult {
/// Method to tile an op that implements the `TilingInterface` using
/// `scf.for` for iterating over the tiles.
-FailureOr<SCFTilingResult> tileUsingSCFForOp(RewriterBase &rewriter,
- TilingInterface op,
- const SCFTilingOptions &options);
-
-/// Method to tile an op that implements the `TilingInterface` using
-/// `scf.forall`.
-FailureOr<SCFTilingResult>
-tileUsingSCFForallOp(RewriterBase &rewriter, TilingInterface op,
- const SCFTilingOptions &options);
+FailureOr<SCFTilingResult> tileUsingSCF(RewriterBase &rewriter,
+ TilingInterface op,
+ const SCFTilingOptions &options);
/// Options used to control tile + fuse.
struct SCFTileAndFuseOptions {
@@ -135,7 +138,7 @@ struct SCFFuseProducerOfSliceResult {
std::optional<SCFFuseProducerOfSliceResult>
tileAndFuseProducerOfSlice(RewriterBase &rewriter,
tensor::ExtractSliceOp candidateSliceOp,
- MutableArrayRef<scf::ForOp> loops);
+ MutableArrayRef<LoopLikeOpInterface> loops);
/// Reconstruct the fused producer from within the tiled-and-fused code. Based
/// on the slice of the producer computed in place it is possible that within
@@ -187,10 +190,10 @@ tileAndFuseProducerOfSlice(RewriterBase &rewriter,
/// where `%0` had other uses as well. If not reconstructed from within the loop
/// body, uses of `%0` could not be replaced, making it still live and the
/// fusion immaterial.
-void yieldReplacementForFusedProducer(
+LogicalResult yieldReplacementForFusedProducer(
RewriterBase &rewriter, tensor::ExtractSliceOp sliceOp,
scf::SCFFuseProducerOfSliceResult fusedProducerInfo,
- MutableArrayRef<scf::ForOp> loops);
+ MutableArrayRef<LoopLikeOpInterface> loops);
/// Transformation information returned after tile and fuse.
struct SCFTileAndFuseResult {
@@ -201,7 +204,7 @@ struct SCFTileAndFuseResult {
/// generated operation.
llvm::SetVector<Operation *> tiledAndFusedOps;
/// The `scf.for` operations that iterate over the tiles.
- SmallVector<Operation *> loops;
+ SmallVector<LoopLikeOpInterface> loops;
/// The replacement values to use for the tiled and fused operations.
llvm::DenseMap<Value, Value> replacements;
};
@@ -232,9 +235,9 @@ struct SCFTileAndFuseResult {
/// }
/// ```
FailureOr<SCFTileAndFuseResult>
-tileConsumerAndFuseProducerGreedilyUsingSCFForOp(
- RewriterBase &rewriter, TilingInterface consumer,
- const SCFTileAndFuseOptions &options);
+tileConsumerAndFuseProducersUsingSCF(RewriterBase &rewriter,
+ TilingInterface consumer,
+ const SCFTileAndFuseOptions &options);
/// Method to lower an `op` that implements the `TilingInterface` to
/// loops/scalars.
@@ -249,8 +252,8 @@ struct SCFReductionTilingResult {
Operation *mergeOp;
/// Initial op
Operation *initialOp;
- /// The `scf.for` operations that iterate over the tiles.
- SmallVector<scf::ForOp> loops;
+ /// The loop operations that iterate over the tiles.
+ SmallVector<LoopLikeOpInterface> loops;
};
/// Method to tile a reduction and generate a parallel op within a serial loop.
diff --git a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
index 97d2b4a3be5c56..5efa2b3eb7476f 100644
--- a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
+++ b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
@@ -485,8 +485,8 @@ transform::FuseOp::apply(transform::TransformRewriter &rewriter,
tileSizes.size() - llvm::count(tileSizes, 0), transformResults,
[&](TilingInterface tilingInterfaceOp)
-> FailureOr<scf::SCFTileAndFuseResult> {
- return tileConsumerAndFuseProducerGreedilyUsingSCFForOp(
- rewriter, tilingInterfaceOp, tileAndFuseOptions);
+ return tileConsumerAndFuseProducersUsingSCF(rewriter, tilingInterfaceOp,
+ tileAndFuseOptions);
});
return failed(result) ? DiagnosedSilenceableFailure::definiteFailure()
: DiagnosedSilenceableFailure::success();
@@ -584,7 +584,7 @@ static Operation *replaceForAllWithNewSignature(
Operation *firstYieldOp = yieldingOps.front();
rewriter.setInsertionPoint(firstYieldOp);
Value src = tileAndFuseResult.tiledValues[0];
- Value dst = newforallOp.getOutputBlockArguments().back();
+ Value dst = newforallOp.getRegionIterArgs().back();
SmallVector<OpFoldResult> strides(offsets.size(), rewriter.getIndexAttr(1));
rewriter.create<tensor::ParallelInsertSliceOp>(firstYieldOp->getLoc(), src,
dst, offsets, sizes, strides);
@@ -2063,7 +2063,7 @@ transform::ScalarizeOp::applyToOne(transform::TransformRewriter &rewriter,
});
SmallVector<int64_t> emptyTileSizes;
rewriter.setInsertionPoint(target);
- FailureOr<scf::SCFTilingResult> maybeTilingResult = tileUsingSCFForOp(
+ FailureOr<scf::SCFTilingResult> maybeTilingResult = tileUsingSCF(
rewriter, cast<TilingInterface>(target.getOperation()), tilingOptions);
if (failed(maybeTilingResult))
return emitDefaultDefiniteFailure(target);
@@ -2647,7 +2647,7 @@ transform::TileUsingForOp::apply(transform::TransformRewriter &rewriter,
tilingOptions.setInterchange(getInterchange());
FailureOr<scf::SCFTilingResult> maybeTilingResult =
- tileUsingSCFForOp(rewriter, tilingInterface, tilingOptions);
+ tileUsingSCF(rewriter, tilingInterface, tilingOptions);
if (failed(maybeTilingResult))
return DiagnosedSilenceableFailure::definiteFailure();
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp b/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
index 7f3ab1f1a24b2f..339d06cdeaf60a 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
@@ -358,7 +358,7 @@ static FailureOr<ForallTilingResult> tileToForallOpImpl(
// 3. Clone the tileable op and update its destination operands to use the
// output bbArgs of the ForallOp.
- ArrayRef<BlockArgument> destBbArgs = forallOp.getOutputBlockArguments();
+ ArrayRef<BlockArgument> destBbArgs = forallOp.getRegionIterArgs();
Operation *tiledOp = nullptr;
SmallVector<Value> tiledValues;
{
@@ -695,7 +695,7 @@ FailureOr<linalg::ForallReductionTilingResult> linalg::tileReductionUsingForall(
// 4. Clone the tileable op and update its destination operands to use the
// output bbArgs of the ForallOp.
SmallVector<Value> tilingResults;
- ArrayRef<BlockArgument> destBbArgs = forallOp.getOutputBlockArguments();
+ ArrayRef<BlockArgument> destBbArgs = forallOp.getRegionIterArgs();
{
// 4.a. RAII guard, inserting within forallOp, before terminator.
OpBuilder::InsertionGuard g(b);
diff --git a/mlir/lib/Dialect/SCF/IR/SCF.cpp b/mlir/lib/Dialect/SCF/IR/SCF.cpp
index 5570c2ec688c8a..7dbc8f250b25eb 100644
--- a/mlir/lib/Dialect/SCF/IR/SCF.cpp
+++ b/mlir/lib/Dialect/SCF/IR/SCF.cpp
@@ -527,6 +527,10 @@ ParseResult ForOp::parse(OpAsmParser &parser, OperationState &result) {
SmallVector<Region *> ForOp::getLoopRegions() { return {&getRegion()}; }
+Block::BlockArgListType ForOp::getRegionIterArgs() {
+ return getBody()->getArguments().drop_front(getNumInductionVars());
+}
+
MutableArrayRef<OpOperand> ForOp::getInitsMutable() {
return getInitArgsMutable();
}
@@ -622,6 +626,47 @@ LogicalResult scf::ForallOp::promoteIfSingleIteration(RewriterBase &rewriter) {
return success();
}
+Block::BlockArgListType ForallOp::getRegionIterArgs() {
+ return getBody()->getArguments().drop_front(getRank());
+}
+
+MutableArrayRef<OpOperand> ForallOp::getInitsMutable() {
+ return getOutputsMutable();
+}
+
+FailureOr<LoopLikeOpInterface>
+ForallOp::replaceWithAdditionalYields(RewriterBase &rewriter,
+ ValueRange newInitOperands,
+ bool replaceInitOperandUsesInLoop,
+ const NewYieldValuesFn &newYieldValueFn) {
+ OpBuilder::InsertionGuard g(rewriter);
+ rewriter.setInsertionPoint(getOperation());
+ auto inits = llvm::to_vector(getOutputs());
+ inits.append(newInitOperands.begin(), newInitOperands.end());
+ auto newLoop = rewriter.create<scf::ForallOp>(
+ getLoc(), getMixedLowerBound(), getMixedUpperBound(), getMixedStep(),
+ inits, getMapping(), [](OpBuilder &, Location, ValueRange) {});
+
+ // Move the region of the current block to the newly created op.
+ Block *newLoopBody = newLoop.getBody();
+ rewriter.mergeBlocks(
+ getBody(), newLoopBody,
+ newLoopBody->getArguments().take_front(getBody()->getNumArguments()));
+
+ // Update the terminator.
+ {
+ OpBuilder::InsertionGuard g(rewriter);
+ auto terminator = cast<scf::InParallelOp>(newLoopBody->getTerminator());
+ rewriter.setInsertionPointToEnd(terminator.getBody());
+ newYieldValueFn(
+ rewriter, getLoc(),
+ newLoopBody->getArguments().take_back(newInitOperands.size()));
+ }
+ rewriter.replaceOp(getOperation(),
+ newLoop->getResults().take_front(getNumResults()));
+ return cast<LoopLikeOpInterface>(newLoop.getOperation());
+}
+
/// Promotes the loop body of a scf::ForallOp to its containing block.
void mlir::scf::promote(RewriterBase &rewriter, scf::ForallOp forallOp) {
OpBuilder::InsertionGuard g(rewriter);
@@ -1630,9 +1675,8 @@ struct FoldTensorCastOfOutputIntoForallOp
// mapped to the tensor.cast old-typed results of the output bbArgs. The
// destination have to be updated to point to the output bbArgs directly.
auto terminator = newForallOp.getTerminator();
- for (auto [yieldingOp, outputBlockArg] :
- llvm::zip(terminator.getYieldingOps(),
- newForallOp.getOutputBlockArguments())) {
+ for (auto [yieldingOp, outputBlockArg] : llvm::zip(
+ terminator.getYieldingOps(), newForallOp.getRegionIterArgs())) {
auto insertSliceOp = cast<tensor::ParallelInsertSliceOp>(yieldingOp);
insertSliceOp.getDestMutable().assign(outputBlockArg);
}
diff --git a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
index 38e0625d7ce093..20baa5c8dfcfa5 100644
--- a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
+++ b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
@@ -55,32 +55,8 @@ fillInterchangeVector(ArrayRef<int64_t> interchangeVector,
return filledVector;
}
-/// Convert a list of ops of type `SrcOpTy` to list of `Operation *`.
-template <typename SrcOpTy>
-static SmallVector<Operation *> getAsOperations(ArrayRef<SrcOpTy> ops) {
- return llvm::to_vector(
- llvm::map_range(ops, [](auto op) -> Operation * { return op; }));
-}
-template <typename SrcOpTy>
-static SmallVector<Operation *>
-getAsOperations(const SmallVector<SrcOpTy> &ops) {
- return getAsOperations(ArrayRef<SrcOpTy>(ops));
-}
-
-/// Convert a list of `Operation *` to a list of `DstOpTy.
-template <typename DstOpTy>
-static SmallVector<DstOpTy> castToTypedOperations(ArrayRef<Operation *> ops) {
- return llvm::to_vector(
- llvm::map_range(ops, [](Operation *op) { return cast<DstOpTy>(op); }));
-}
-template <typename DstOpTy>
-static SmallVector<DstOpTy>
-castToTypedOperations(const SmallVector<Operation *> &ops) {
- return castToTypedOperations<DstOpTy>(ArrayRef<Operation *>(ops));
-}
-
//===----------------------------------------------------------------------===//
-// tileUsingSCFForOp implementation.
+// tileUsingSCF implementation.
//===----------------------------------------------------------------------===//
// Check if `stride` evenly divides the trip count `size - offset`.
@@ -135,66 +111,204 @@ static Operation *cloneOpAndUpdateDestinationArgs(RewriterBase &rewriter,
return clonedOp;
}
-/// Generate an empty loop nest that represents the tiled loop nest shell.
+/// Type of the call back function used to generate the body of the tiled
+/// loops. The loop generation methods use this callback to generate
+/// the body of the inner-most tile loop. The call back is provided with
+/// - `rewriter`: with insertion point set to the end of the inner most loop
+/// body
+/// - `ivs`: the induction variables for the surrounding loops.
+/// - `regionIterArgs`: the basic block arguments of the inner most loop that
+/// correspond to the init/result of the loop.
+/// - `resultOffsets/Sizes`: The call back returns the result of tiling the
+/// operation, but in `resultOffsets` and `resultSizes` the callback function
+/// is expected to populate the offsets/sizes to use while inserting the
+/// result back into the corresponding `regionIterArgs`. These values
+/// are used by the loop generation methods to create the appropriate yield
+/// values from the inner most loop.
+/// The call back returns the `TilingResult` obtained from tiling the operation.
+using TileLoopBodyFn = std::function<FailureOr<TilingResult>(
+ RewriterBase &rewriter, Location loc, ValueRange ivs,
+ ValueRange regionIterArgs,
+ SmallVector<SmallVector<OpFoldResult>> &resultOffsets,
+ SmallVector<SmallVector<OpFoldResult>> &resultSizes)>;
+
+/// Generate the tile-loop nest using `scf.for` operation.
/// - `loopRanges` specifies the lb, ub and step of the untiled iteration space.
/// - `tileSizes` is the tile sizes to use. Zero represent untiled loops.
-/// - In `offsets` and `sizes` return the multi-dimensional offset and size of
-/// the tile processed within the inner most loop.
-/// Note that this methods adds `scf.yield` operation for all but the innermost
-/// loop. These yield the value returned by the immediately inner loop. The
-/// caller is expected to add the scf.yield operation for the innermost loop.
-static SmallVector<scf::ForOp> generateTileLoopNest(
- OpBuilder &builder, Location loc, ArrayRef<Range> loopRanges,
- ArrayRef<OpFoldResult> tileSizes, SmallVector<OpFoldResult> &offsets,
- SmallVector<OpFoldResult> &sizes, ValueRange destinationTensors = {}) {
- if (loopRanges.empty())
- return {};
+/// - `destinationTensors` are the init values to use for the outer most loop.
+/// - `tileLoopBodyFn` is called to generated the loop body of the inner most
+/// loop.
+/// - `loops` is an in-out parameter into which the generated loops are
+/// populated.
+/// The `TilingResult` returned by calling `tileLoopBodyFn` is returned back
+/// to the caller.
+static FailureOr<TilingResult> generateLoopNestUsingForOp(
+ RewriterBase &rewriter, Location loc, ArrayRef<Range> loopRanges,
+ ArrayRef<OpFoldResult> tileSizes, ValueRange destinationTensors,
+ TileLoopBodyFn tileLoopBodyFn, SmallVector<LoopLikeOpInterface> &loops) {
+ assert(!loopRanges.empty() && "unexpected empty loop ranges");
assert(loopRanges.size() == tileSizes.size() &&
"expected as many tile sizes as loop ranges");
- OpBuilder::InsertionGuard guard(builder);
- SmallVector<scf::ForOp> loops;
- offsets.resize(loopRanges.size());
- sizes.resize(loopRanges.size());
-
- for (auto loopRange : llvm::enumerate(loopRanges)) {
- Value offset =
- getValueOrCreateConstantIndexOp(builder, loc, loopRange.value().offset);
- Value size =
- getValueOrCreateConstantIndexOp(builder, loc, loopRange.value().size);
- Value tileSize = getValueOrCreateConstantIndexOp(
- builder, loc, tileSizes[loopRange.index()]);
+ OpBuilder::InsertionGuard guard(rewriter);
+ SmallVector<Value> ivs;
+
+ for (auto [loopRange, tileSize] : llvm::zip_equal(loopRanges, tileSizes)) {
// No loops if tile size is zero. Set offset and size to the loop
// offset and size.
- if (matchPattern(tileSize, m_Zero())) {
- offsets[loopRange.index()] = offset;
- sizes[loopRange.index()] = size;
+ if (isConstantIntValue(tileSize, 0)) {
continue;
}
- auto loop = builder.create<scf::ForOp>(
- loc, offset, size, tileSize, destinationTensors,
- [&](OpBuilder &bodyBuilder, Location bodyLoc, Value iv,
- ValueRange /*iterArgs*/) {
- sizes[loopRange.index()] =
- ...
[truncated]
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing @MaheshRavishankar !
Just one nit comment and CI related fix from my end.
mlir/lib/Dialect/SCF/IR/SCF.cpp
Outdated
OpBuilder::InsertionGuard g(rewriter); | ||
auto terminator = cast<scf::InParallelOp>(newLoopBody->getTerminator()); | ||
rewriter.setInsertionPointToEnd(terminator.getBody()); | ||
newYieldValueFn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find the place in the code, but I think this lambda creates the parallel_insert_slice
ops. It probably returns an empty SmallVector
. This does not fit with the comment in LoopLikeInterface.h
:
/// A function that returns the additional yielded values during
/// `replaceWithAdditionalYields`. `newBbArgs` are the newly added region
/// iter_args. This function should return as many values as there are block
/// arguments in `newBbArgs`.
using NewYieldValuesFn = std::function<SmallVector<Value>(
OpBuilder &b, Location loc, ArrayRef<BlockArgument> newBbArgs)>;
It is a bit confusing that for scf.for
we expect the function to return a list of Value
and for scf.forall
we expect the function to build new IR and not return anything.
How about generalizing NewYieldValuesFn
to RewriteTerminatorFn = std::function<void(RewriterBase &rewriter, Operation *terminator, ArrayRef<BlockArgument> newBbArgs)>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but the scf.forall
does not return any values. I looked at place where NewYieldValuesFn
is used and it is quite a bit. Want to limit the breaking changes here, but I can do a quick follow up on that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually that might not work as well. Consider
NewYieldValuesFn yieldFn = [&](OpBuilder &b, Location loc, |
scf.forall
separately. I think there is an inherent mismatch between the behavior of scf.forall
and other loop constructs in the sense it doesnt yield anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I do have an issue here (and I didnt see it cause I am also missing a test that will show up this issue). So good catch... Ill drop this part and do something different...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is an inherent mismatch between the behavior of
scf.forall
and other loop constructs in the sense it doesnt yield anything.
I ran into the same problem when cleaning up the LoopLikeOpInterface
recently. I was able to add the interface to scf.while
, but not scf.forall
because of this issue.
I think we should generalize NewYieldValuesFn
to RewriteTerminatorFn
. That lambda can do whatever it wants.
/// A function that updates the terminator of a loop-like op during
/// `replaceWithAdditionalYields`. `newBbArgs` are the newly added region
/// iter_args. This function should rewrite the loop terminator (`terminator`)
/// such that the loop op verifies successfully.
using RewriteTerminatorFn =
std::function<void(RewriterBase &rewriter, Operation *terminator, ArrayRef<BlockArgument> newBbArgs)>;
I think we also have to turn this interface function into an interface method:
/// Append the specified additional "init" operands: replace this loop with
/// a new loop that has the additional init operands. The loop body of this
/// loop is moved over to the new loop.
///
/// The newly added region iter_args are yielded from the loop.
::mlir::FailureOr<::mlir::LoopLikeOpInterface>
replaceWithAdditionalIterOperands(::mlir::RewriterBase &rewriter,
::mlir::ValueRange newInitOperands,
bool replaceInitOperandUsesInLoop)
For scf.forall
, this interface could either:
- "return failure();"
- or: insert a parallel_insert_slice that inserts the newly added bbArg into itself (which is a no-op)
- or: remove this interface method entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont see actually how this work. Even with this new lambda, any algorithm that is working with LoopLikeOpInterface
will have to special case scf.forall
to handle the different yield approach it takes.
I went a different approach. I added a new interface method to LoopLikeOpInterface
that allows handling of tiled yields. This allows unifying scf.for
and scf.forall
since they handle tiles very differently, and that gets now rolled into the interface method implementation each operation.
Also added a test that should have caught this issue earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementations of yieldTiledValuesAndReplace
make sense to me. But I wouldn't put it in LoopLikeOpInterface
.
LoopLikeOpInterface
was meant to expose basic loop properties and low-level transformations of loop-like ops. The new yieldTiledValuesAndReplace
interface method is tiling-specific and does not belong in LoopLikeOpInterface
imo. How about putting yieldTiledValuesAndReplace
in a new interface in TilingInterface.td
? Then we have all tiling-related stuff in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure a whole interface for just one method is worth it. I think this method is optional for the most part. No op needs to implement it. It is needed only if it wants to work with tiling, which is nice cause all the loop construct has to do to work with tiling is implement this method. Id prefer to keep it in LoopLikeOpInterface
. If you feel strongly about it, I can move it out of the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, just a small issue with NewYieldValuesFn
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this +1
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterfaceTransformOps.td
Show resolved
Hide resolved
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterfaceTransformOps.cpp
Outdated
Show resolved
Hide resolved
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterfaceTransformOps.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the followup from #72178, right?
Let's mention it in the commit message so we can refer back to it later, had to do a bit of spelunking to find it.
Overall comment:
So IIRC, #72178 adopted the logic of tiling scf.forall
from LinalgTransformOps.cpp
and/or Linalg/Transforms/Tiling.cpp
to redo scf.for
.
This was a welcome refactoring and this PR is a nice continuation, nice!
To proceed confidently, I would now like to see red diffs in LinalgTransformOps.cpp
and Linalg/Transforms/Tiling.cpp
to see older implementations be actually replaced: atm this seems it is still adding more code in parallel to the old implementation.
For instance, I still see LinalgTransformOps.cpp calling code from Linalg/Transforms/Tiling.cpp
:
FailureOr<linalg::ForallTilingResult> maybeTilingResult = failure();
if (!mixedNumThreads.empty()) {
maybeTilingResult =
linalg::tileToForallOp(rewriter, tileableOp, mixedNumThreads, mapping);
} else {
maybeTilingResult = linalg::tileToForallOpUsingTileSizes(
rewriter, tileableOp, mixedTileSizes, mapping);
}
Do you have a followup PR which cleans up the old?
Note, as part of this, I would also like to see something concrete on the num_threads
discussion.
Besides the meta-comment that we now need to resolve, I'll be able to review this early next week.
Thanks!
/// populated. | ||
/// The `TilingResult` returned by calling `tileLoopBodyFn` is returned back | ||
/// to the caller. | ||
static FailureOr<TilingResult> generateLoopNestUsingForOp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this differ / resemble older code that lowers Linalg to loops?
It would be nice if we could reuse the same utility for both places but I am a bit rusty of what utilities we have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean lowering to loops code? Maybe.. Ill look into that.
It is a change from a long list of related changes, but yes, #72178 is probably the start of it. Updated the comment anyway.
Yes, that is the goal, but I want to proceed incrementally. With this change, the same tests that work with There is lot of context here spread all over the place w.r.t tiling. I want to unify all of it, but I dont think I can do that in one step. I would rather incrementally build up the required functionality in one place and deprecate/move pieces as needed. I do plan to do this in short order though..
Thanks! |
809341a
to
75b34c5
Compare
✅ With the latest revision this PR passed the C/C++ code formatter. |
@matthias-springer I am having the following three tests failing for me, for reasons that seem unrelated to this PR (related by the fact that this is happening with this PR, but unrelated in the sense that its the pad hoisting code failing even though the tiling is the same).
I tried to triage this, but the issue seems deep in the weeds. I will continue looking, but if you have any pointers, that would be very helpful. |
75b34c5
to
be18c25
Compare
Never mind. Figured it out.. the hoisting logic needed some LICM to kick to work. Fixed now. |
return getBody()->getArguments().drop_front(getRank()); | ||
} | ||
|
||
MutableArrayRef<OpOperand> ForallOp::getInitsMutable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are missing a getYieldedValuesMutable
implementation for ForallOp
.
This is what the interface documentation says:
Loop-carried variables can be exposed through this interface. There are
3 components to a loop-carried variable.
- The "region iter_arg" is the block argument of the entry block that
represents the loop-carried variable in each iteration.
- The "init value" is an operand of the loop op that serves as the initial
region iter_arg value for the first iteration (if any).
- The "yielded" value is the value that is forwarded from one iteration to
serve as the region iter_arg of the next iteration.
If one of the respective interface methods is implemented, so must the other
two. The interface verifier ensures that the number of types of the region
iter_args, init values and yielded values match.
This is checked by the op verifier, so you should be seeing verification failures.
The problem is that this op does not have yielded values.
We could have getYieldedValuesMutable
return a std::optional<::llvm::MutableArrayRef<::mlir::OpOperand>>
to account for the fact that some loops do not have yielding semantics. (Same as we do for getLoopResults
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, scf.forall
should not implement getRegionIterArgs
at all. The region iter_args in the loop like op interface are loop-carried variables, but the scf.forall
does not have any loop-carried variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are missing a
getYieldedValuesMutable
implementation forForallOp
.This is what the interface documentation says:
Loop-carried variables can be exposed through this interface. There are 3 components to a loop-carried variable. - The "region iter_arg" is the block argument of the entry block that represents the loop-carried variable in each iteration. - The "init value" is an operand of the loop op that serves as the initial region iter_arg value for the first iteration (if any). - The "yielded" value is the value that is forwarded from one iteration to serve as the region iter_arg of the next iteration. If one of the respective interface methods is implemented, so must the other two. The interface verifier ensures that the number of types of the region iter_args, init values and yielded values match.
Obviously that definition does not work for scf.forall
. Might need to update this since the scf.forall
does not yield any value. The region iter arg and init value can stay though.
This is checked by the op verifier, so you should be seeing verification failures.
I did. I fixed the verifier also, but I think as you note below, getYieldedValuesMutable
should return an std::optional<..>
On second thought, scf.forall should not implement getRegionIterArgs at all. The region iter_args in the loop like op interface are loop-carried variables, but the scf.forall does not have any loop-carried variables.
We probably dont need that.. the region iter args can still be tied to init. If there is no yield value, then the verifier can handle it appropriately.
be18c25
to
391d9a7
Compare
@nicolasvasilache could you take a look again? |
This PR looks fine to me, thanks! I'd like to see the cleanup work truly start before we land more code.
To be clear, I did not ask for a making this PR more complex but specifically asked for a followup PR to see that work has truly unambiguously started on the deprecation of the legacy implementation (I already voiced this back in Nov: #72178 (comment)). Now is the time to see concrete movement on the deprecation before we land more code. |
I agree, my next step is to move the transform dialect op to use this implementation. So thats top on the list. |
… to unify tiling with `scf.for` and `scf.forall`. Using `LoopLikeOpInterface` as the basis for the implementation unifies all the tiling logic for both `scf.for` and `scf.forall`. The only difference is the actual loop generation. Instead of many entry points for each loop type, the loop type is now passed as part of the options passed to the tiling method. This is a breaking change with the following changes 1) The `scf::tileUsingSCFForOp` is renamed to `scf::tileUsingSCF` 2) The `scf::tileUsingSCFForallOp` is deprecated. The same functionality is obtained by using `scf::tileUsingSCF` and setting the loop type in `scf::SCFTilingOptions` passed into this method to `scf::SCFTilingOptions::LoopType::ForallOp` (using the `setLoopType` method). 3) The `scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp` is renamed to `scf::tileConsumerAndFuseProducerUsingSCF`. The use of the `controlFn` in `scf::SCFTileAndFuseOptions` allows implementing any strategy with the default callback implemeting the greedy fusion. 4) The `scf::SCFTilingResult` and `scf::SCFTileAndFuseResult` now use `SmallVector<LoopLikeOpInterface>`. 5) To make `scf::ForallOp` implement the parts of `LoopLikeOpInterface` needed, the `getOutputBlockArguments()` method is replaced with `getRegionIterArgs()` This change also introduces a new interface method for `LoopLikeOpInterface`, that allows loop constructs to handle tiled yields. These changes now bring the tiling and fusion capabilities using `scf.forall` on par with what was already supported
- Remove passing as `const &` for the `llvm::function_ref`. - Fix ordering of interface methods.
98b49d4
to
5d2f0e9
Compare
#77874 (comment) states that the PR is fine, and the follow up is agreed upon.
[Advance LLVM to 76ead96c1d06ee0d828238bce96d0107e650b5fa: [mlir][TilingInterface] Use `LoopLikeOpInterface` in tiling using SCF to unify tiling with `scf.for` and `scf.forall`. (#77874) (MaheshRavishankar on 2024-01-25 21:26:23 -0800) (40 of 41) --------- Co-authored-by: hanhanW <[email protected]>
[Advance LLVM to 76ead96c1d06ee0d828238bce96d0107e650b5fa: [mlir][TilingInterface] Use `LoopLikeOpInterface` in tiling using SCF to unify tiling with `scf.for` and `scf.forall`. (#77874) (MaheshRavishankar on 2024-01-25 21:26:23 -0800) (40 of 41) --------- Co-authored-by: hanhanW <[email protected]>
Using
LoopLikeOpInterface
as the basis for the implementation unifies all the tiling logic for bothscf.for
andscf.forall
. The only difference is the actual loop generation. This is a follow up to #72178Instead of many entry points for each loop type, the loop type is now passed as part of the options passed to the tiling method.
This is a breaking change with the following changes
scf::tileUsingSCFForOp
is renamed toscf::tileUsingSCF
scf::tileUsingSCFForallOp
is deprecated. The samefunctionality is obtained by using
scf::tileUsingSCF
and settingthe loop type in
scf::SCFTilingOptions
passed into this method toscf::SCFTilingOptions::LoopType::ForallOp
(using thesetLoopType
method).scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp
isrenamed to
scf::tileConsumerAndFuseProducerUsingSCF
. The use ofthe
controlFn
inscf::SCFTileAndFuseOptions
allows implementingany strategy with the default callback implemeting the greedy fusion.
scf::SCFTilingResult
andscf::SCFTileAndFuseResult
now useSmallVector<LoopLikeOpInterface>
.scf::ForallOp
implement the parts ofLoopLikeOpInterface
needed, thegetOutputBlockArguments()
method is replaced with
getRegionIterArgs()
These changes now bring the tiling and fusion capabilities using
scf.forall
on par with what was already supported byscf.for