-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[LoopInterchange] Constrain number of load/stores in a loop #118973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-llvm-transforms Author: Madhur Amilkanthwar (madhur13490) ChangesIn the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 100th entry it terminates and thus overall wastes compile-time. It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors:
This patch just constrains the whole computation on the number of loadds and store instructions in the loop. In an other approach, I experimented with computing 1 and constraining the number of pairs but that did not lead to any additional benefit in terms of compile-time. However, when other issues are fixed, I can revisit this approach. Full diff: https://github.com/llvm/llvm-project/pull/118973.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index a0c0080c0bda1c..42db6ecc9f9c4a 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -66,8 +66,8 @@ using CharMatrix = std::vector<std::vector<char>>;
} // end anonymous namespace
-// Maximum number of dependencies that can be handled in the dependency matrix.
-static const unsigned MaxMemInstrCount = 100;
+// Maximum number of load-stores that can be handled in the dependency matrix.
+static const unsigned MaxMemInstrCount = 64;
// Maximum loop depth supported.
static const unsigned MaxLoopNestDepth = 10;
@@ -84,7 +84,7 @@ static void printDepMatrix(CharMatrix &DepMatrix) {
static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
Loop *L, DependenceInfo *DI,
- ScalarEvolution *SE) {
+ ScalarEvolution *SE, OptimizationRemarkEmitter *ORE) {
using ValueVector = SmallVector<Value *, 16>;
ValueVector MemInstr;
@@ -110,6 +110,17 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
LLVM_DEBUG(dbgs() << "Found " << MemInstr.size()
<< " Loads and Stores to analyze\n");
+ if (MemInstr.size() > MaxMemInstrCount) {
+ LLVM_DEBUG(dbgs() << "The transform doesn't support more than "
+ << MaxMemInstrCount << " load stores in a loop\n");
+ ORE->emit([&]() {
+ return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedLoop",
+ L->getStartLoc(),
+ L->getHeader())
+ << "Reached maximum loads/stores than can be supported in the loop.";
+ });
+ return false;
+ }
ValueVector::iterator I, IE, J, JE;
StringSet<> Seen;
@@ -161,12 +172,6 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
// Make sure we only add unique entries to the dependency matrix.
if (Seen.insert(StringRef(Dep.data(), Dep.size())).second)
DepMatrix.push_back(Dep);
-
- if (DepMatrix.size() > MaxMemInstrCount) {
- LLVM_DEBUG(dbgs() << "Cannot handle more than " << MaxMemInstrCount
- << " dependencies inside loop\n");
- return false;
- }
}
}
}
@@ -450,7 +455,7 @@ struct LoopInterchange {
CharMatrix DependencyMatrix;
Loop *OuterMostLoop = *(LoopList.begin());
if (!populateDependencyMatrix(DependencyMatrix, LoopNestDepth,
- OuterMostLoop, DI, SE)) {
+ OuterMostLoop, DI, SE, ORE)) {
LLVM_DEBUG(dbgs() << "Populating dependency matrix failed\n");
return false;
}
@@ -1728,7 +1733,6 @@ PreservedAnalyses LoopInterchangePass::run(LoopNest &LN,
// Ensure minimum depth of the loop nest to do the interchange.
if (!hasMinimumLoopDepth(LoopList))
return PreservedAnalyses::all();
-
DependenceInfo DI(&F, &AR.AA, &AR.SE, &AR.LI);
std::unique_ptr<CacheCost> CC =
CacheCost::getCacheCost(LN.getOutermostLoop(), AR, DI);
diff --git a/llvm/test/Transforms/LoopInterchange/many-load-stores.ll b/llvm/test/Transforms/LoopInterchange/many-load-stores.ll
new file mode 100644
index 00000000000000..e718d0b72cfed7
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/many-load-stores.ll
@@ -0,0 +1,262 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes=loop-interchange -S -disable-output | FileCheck %s
+; ModuleID = 'vadd.cpp'
+source_filename = "vadd.cpp"
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+@A = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+@B = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+@C = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+
+; CHECK: The transform doesn't support more than 64 load stores in a loop
+; CHECK: Populating dependency matrix failed
+; Function Attrs: mustprogress norecurse nounwind uwtable
+define dso_local noundef i32 @main() local_unnamed_addr #0 {
+ br label %1
+
+1: ; preds = %9, %0
+ %2 = phi i32 [ 0, %0 ], [ %10, %9 ]
+ %3 = icmp slt i32 %2, 2048
+ br i1 %3, label %5, label %4
+
+4: ; preds = %1
+ ret i32 0
+
+5: ; preds = %1
+ br label %6
+
+6: ; preds = %11, %5
+ %7 = phi i32 [ 0, %5 ], [ %208, %11 ]
+ %8 = icmp slt i32 %7, 85
+ br i1 %8, label %11, label %9
+
+9: ; preds = %6
+ %10 = add nsw i32 %2, 1
+ br label %1
+
+11: ; preds = %6
+ %12 = sext i32 %2 to i64
+ %13 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @B, i64 0, i64 %12
+ %14 = sext i32 %7 to i64
+ %15 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %14
+ %16 = load i32, i32* %15, align 4
+ %17 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @C, i64 0, i64 %12
+ %18 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %14
+ %19 = load i32, i32* %18, align 4
+ %20 = add nsw i32 %16, %19
+ %21 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @A, i64 0, i64 %12
+ %22 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %14
+ store i32 %20, i32* %22, align 4
+ %23 = add nsw i32 %7, 1
+ %24 = sext i32 %23 to i64
+ %25 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %24
+ %26 = load i32, i32* %25, align 4
+ %27 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %24
+ %28 = load i32, i32* %27, align 4
+ %29 = add nsw i32 %26, %28
+ %30 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %24
+ store i32 %29, i32* %30, align 4
+ %31 = add nsw i32 %23, 1
+ %32 = sext i32 %31 to i64
+ %33 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %32
+ %34 = load i32, i32* %33, align 4
+ %35 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %32
+ %36 = load i32, i32* %35, align 4
+ %37 = add nsw i32 %34, %36
+ %38 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %32
+ store i32 %37, i32* %38, align 4
+ %39 = add nsw i32 %31, 1
+ %40 = sext i32 %39 to i64
+ %41 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %40
+ %42 = load i32, i32* %41, align 4
+ %43 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %40
+ %44 = load i32, i32* %43, align 4
+ %45 = add nsw i32 %42, %44
+ %46 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %40
+ store i32 %45, i32* %46, align 4
+ %47 = add nsw i32 %39, 1
+ %48 = sext i32 %47 to i64
+ %49 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %48
+ %50 = load i32, i32* %49, align 4
+ %51 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %48
+ %52 = load i32, i32* %51, align 4
+ %53 = add nsw i32 %50, %52
+ %54 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %48
+ store i32 %53, i32* %54, align 4
+ %55 = add nsw i32 %47, 1
+ %56 = sext i32 %55 to i64
+ %57 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %56
+ %58 = load i32, i32* %57, align 4
+ %59 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %56
+ %60 = load i32, i32* %59, align 4
+ %61 = add nsw i32 %58, %60
+ %62 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %56
+ store i32 %61, i32* %62, align 4
+ %63 = add nsw i32 %55, 1
+ %64 = sext i32 %63 to i64
+ %65 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %64
+ %66 = load i32, i32* %65, align 4
+ %67 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %64
+ %68 = load i32, i32* %67, align 4
+ %69 = add nsw i32 %66, %68
+ %70 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %64
+ store i32 %69, i32* %70, align 4
+ %71 = add nsw i32 %63, 1
+ %72 = sext i32 %71 to i64
+ %73 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %72
+ %74 = load i32, i32* %73, align 4
+ %75 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %72
+ %76 = load i32, i32* %75, align 4
+ %77 = add nsw i32 %74, %76
+ %78 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %72
+ store i32 %77, i32* %78, align 4
+ %79 = add nsw i32 %71, 1
+ %80 = sext i32 %79 to i64
+ %81 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %80
+ %82 = load i32, i32* %81, align 4
+ %83 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %80
+ %84 = load i32, i32* %83, align 4
+ %85 = add nsw i32 %82, %84
+ %86 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %80
+ store i32 %85, i32* %86, align 4
+ %87 = add nsw i32 %79, 1
+ %88 = sext i32 %87 to i64
+ %89 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %88
+ %90 = load i32, i32* %89, align 4
+ %91 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %88
+ %92 = load i32, i32* %91, align 4
+ %93 = add nsw i32 %90, %92
+ %94 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %88
+ store i32 %93, i32* %94, align 4
+ %95 = add nsw i32 %87, 1
+ %96 = sext i32 %95 to i64
+ %97 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %96
+ %98 = load i32, i32* %97, align 4
+ %99 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %96
+ %100 = load i32, i32* %99, align 4
+ %101 = add nsw i32 %98, %100
+ %102 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %96
+ store i32 %101, i32* %102, align 4
+ %103 = add nsw i32 %95, 1
+ %104 = sext i32 %103 to i64
+ %105 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %104
+ %106 = load i32, i32* %105, align 4
+ %107 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %104
+ %108 = load i32, i32* %107, align 4
+ %109 = add nsw i32 %106, %108
+ %110 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %104
+ store i32 %109, i32* %110, align 4
+ %111 = add nsw i32 %103, 1
+ %112 = sext i32 %111 to i64
+ %113 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %112
+ %114 = load i32, i32* %113, align 4
+ %115 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %112
+ %116 = load i32, i32* %115, align 4
+ %117 = add nsw i32 %114, %116
+ %118 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %112
+ store i32 %117, i32* %118, align 4
+ %119 = add nsw i32 %111, 1
+ %120 = sext i32 %119 to i64
+ %121 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %120
+ %122 = load i32, i32* %121, align 4
+ %123 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %120
+ %124 = load i32, i32* %123, align 4
+ %125 = add nsw i32 %122, %124
+ %126 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %120
+ store i32 %125, i32* %126, align 4
+ %127 = add nsw i32 %119, 1
+ %128 = sext i32 %127 to i64
+ %129 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %128
+ %130 = load i32, i32* %129, align 4
+ %131 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %128
+ %132 = load i32, i32* %131, align 4
+ %133 = add nsw i32 %130, %132
+ %134 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %128
+ store i32 %133, i32* %134, align 4
+ %135 = add nsw i32 %127, 1
+ %136 = sext i32 %135 to i64
+ %137 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %136
+ %138 = load i32, i32* %137, align 4
+ %139 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %136
+ %140 = load i32, i32* %139, align 4
+ %141 = add nsw i32 %138, %140
+ %142 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %136
+ store i32 %141, i32* %142, align 4
+ %143 = add nsw i32 %135, 1
+ %144 = sext i32 %143 to i64
+ %145 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %144
+ %146 = load i32, i32* %145, align 4
+ %147 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %144
+ %148 = load i32, i32* %147, align 4
+ %149 = add nsw i32 %146, %148
+ %150 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %144
+ store i32 %149, i32* %150, align 4
+ %151 = add nsw i32 %143, 1
+ %152 = sext i32 %151 to i64
+ %153 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %152
+ %154 = load i32, i32* %153, align 4
+ %155 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %152
+ %156 = load i32, i32* %155, align 4
+ %157 = add nsw i32 %154, %156
+ %158 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %152
+ store i32 %157, i32* %158, align 4
+ %159 = add nsw i32 %151, 1
+ %160 = sext i32 %159 to i64
+ %161 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %160
+ %162 = load i32, i32* %161, align 4
+ %163 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %160
+ %164 = load i32, i32* %163, align 4
+ %165 = add nsw i32 %162, %164
+ %166 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %160
+ store i32 %165, i32* %166, align 4
+ %167 = add nsw i32 %159, 1
+ %168 = sext i32 %167 to i64
+ %169 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %168
+ %170 = load i32, i32* %169, align 4
+ %171 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %168
+ %172 = load i32, i32* %171, align 4
+ %173 = add nsw i32 %170, %172
+ %174 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %168
+ store i32 %173, i32* %174, align 4
+ %175 = add nsw i32 %167, 1
+ %176 = sext i32 %175 to i64
+ %177 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %176
+ %178 = load i32, i32* %177, align 4
+ %179 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %176
+ %180 = load i32, i32* %179, align 4
+ %181 = add nsw i32 %178, %180
+ %182 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %176
+ store i32 %181, i32* %182, align 4
+ %183 = add nsw i32 %175, 1
+ %184 = sext i32 %183 to i64
+ %185 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %184
+ %186 = load i32, i32* %185, align 4
+ %187 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %184
+ %188 = load i32, i32* %187, align 4
+ %189 = add nsw i32 %186, %188
+ %190 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %184
+ store i32 %189, i32* %190, align 4
+ %191 = add nsw i32 %183, 1
+ %192 = sext i32 %191 to i64
+ %193 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %192
+ %194 = load i32, i32* %193, align 4
+ %195 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %192
+ %196 = load i32, i32* %195, align 4
+ %197 = add nsw i32 %194, %196
+ %198 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %192
+ store i32 %197, i32* %198, align 4
+ %199 = add nsw i32 %191, 1
+ %200 = sext i32 %199 to i64
+ %201 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %200
+ %202 = load i32, i32* %201, align 4
+ %203 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %200
+ %204 = load i32, i32* %203, align 4
+ %205 = add nsw i32 %202, %204
+ %206 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %200
+ store i32 %205, i32* %206, align 4
+ %207 = add nsw i32 %199, 1
+ %208 = add nsw i32 %207, 24
+ br label %6
+}
+
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
ca6235d
to
2022daf
Compare
The idea makes sense, I think. I.e., bailing out before the more expensive dependence checks would be the sensible thing to do. But we do need numbers to discuss this trade-off:
Maybe you can share those numbers for the LLVM test-suite here. Ideally, interchange triggers the same amount of times, and we save compile-time by doing the check upfront. Also, it is probably better to now turn that constant |
2022daf
to
1a6c755
Compare
Also used remark message in the test and removed dependency from asserts build. |
Ah! I think the branch is messed up due to reverts and other things in it. To simplify things, I have rerun the experiments yesterday. I have created two branches. b) FWIW, since this patch eliminates the expensive computation of dependence info in many cases and bails out early, we see an overall improvement in compile time, as expected. Does this clear the confusion? |
@CongzheUalberta Happy new year! Did you get a chance to look at this further? |
Thanks for the clarification. Now I can see this patch could reduce the compile time of loop interchange.
I do second this. It would be more handy to turn MaxMemInstrCount into a command line option. |
1a6c755
to
484524c
Compare
Sure. Addressed in the latest commit with one minor sanity check on the value. |
Gentle ping!. FYI, I added a check for the new command-line option in the test. |
484524c
to
ac0fad8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I wrote earlier, I think the idea makes sense, so this LGTM.
I have left 2 nits about debug/remark messages, but they can addressed without another review.
Please wait a day before merging this to give others a last chance to comment.
|
||
if (MemInstr.size() > MaxMemInstrCount) { | ||
LLVM_DEBUG(dbgs() << "The transform doesn't support more than " | ||
<< MaxMemInstrCount << " load stores in a loop\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "load stores" -> loads/stores
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ORE->emit([&]() { | ||
return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedLoop", | ||
L->getStartLoc(), L->getHeader()) | ||
<< "Number of loads/stores in the loop are more than threshold."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: " .... more than threshold": I would like to add something about which threshold exactly we're talking about here. Or the message could be something like
Number of supported loads/stores exceeded, the supported maximum can be increased
with option -loop-interchange-max-meminstr-count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
In the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 99th entry it terminates and thus overall wastes compile-time. It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors: 1. Number of load-store pairs in a loop. 2. Number of common loop levels for each of the pair. This patch just constrains the whole computation on the number of loads and store instructions in the loop. With 64, I see 39 interchanges compared to 42 with the trunk. (With 128, it is 42 vs 43 but I see increase in the compile-time.) In an other approach, I experimented with computing 1 and constraining the number of pairs but that did not lead to any additional benefit in terms of compile-time. However, when other issues are fixed, I can revisit this approach.
ac0fad8
to
2e3567d
Compare
Thanks @sjoerdmeijer. I will merge to main early next week giving some more time for the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM giving all comments are addressed.
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12063 Here is the relevant piece of the build log for the reference
|
This is a work in progress patch to enable loop-interchange by default and is a continuation of the RFC: https://discourse.llvm.org/t/enabling-loop-interchange/82589 Basically, we promised to fix any compile-time and correctness issues in the different components involved here (loop-interchange and dependence analaysis.) before discussing enabling interchange by default. We think are close to complete this; I would like to explain where we are and wanted to check if there are any thoughts or concerns. A quick overview of the correctness and compile-time improvements that we have made include: Correctness: - [LoopInterchange] Remove 'S' Scalar Dependencies (llvm#119345) - [LoopInterchange] Fix overflow in cost calculation (llvm#111807) - [LoopInterchange] Handle LE and GE correctly (PR llvm#124901) @kasuga-fj - [DA] disambiguate evolution of base addresses (llvm#116628) Compile-times: - [LoopInterchange] Constrain number of load/stores in a loop (llvm#118973) - [LoopInterchange] Bail out early if minimum loop nest is not met (llvm#115128) - [LoopInterchange] Hoist isComputableLoopNest() in the control flow (llvm#124247) And in terms of remaining work, we think we are very close to fixing these depenence analysis issues: - [DA] do not handle array accesses of different offsets (llvm#123436) - [DA] Dependence analysis does not handle array accesses of different sizes (llvm#116630) - [DA] use NSW arithmetic llvm#116632 The compile-time increase with a geomean increase of 0.19% looks good (after committing llvm#124247), I think: stage1-O3: Benchmark kimwitu++ +0.10% sqlite3 +0.14% consumer-typeset +0.07% Bullet +0.06% tramp3d-v4 +0.21% mafft +0.39% ClamAVi +0.06% lencod +0.61% SPASS +0.17% 7zip +0.08% geomean +0.19% See also: http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au We might want to look into lencod to see if we can improve more, but not sure it is strictly necessary.
In the current state of the code, the transform computes entries for the dependency matrix until
MaxMemInstrCount
which is 100. After 99th entry, it terminates and thus overall wastes compile-time.It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors:
This patch constrains the whole computation on the number of loads and stores instructions in the loop.
In another approach, I experimented with computing 1 and constraining the number of pairs, but that did not lead to any additional benefit in terms of compile time. However, when other issues are fixed, I can revisit this approach.