[LoopInterchange] Constrain number of load/stores in a loop #118973

madhur13490 · 2024-12-06T14:01:00Z

In the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 99th entry, it terminates and thus overall wastes compile-time.

It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors:

Number of load-store pairs in a loop.
Number of common loop levels for each of the pair.

This patch constrains the whole computation on the number of loads and stores instructions in the loop.

In another approach, I experimented with computing 1 and constraining the number of pairs, but that did not lead to any additional benefit in terms of compile time. However, when other issues are fixed, I can revisit this approach.

llvmbot · 2024-12-06T14:01:38Z

@llvm/pr-subscribers-llvm-transforms

Author: Madhur Amilkanthwar (madhur13490)

Changes

In the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 100th entry it terminates and thus overall wastes compile-time.

It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors:

Number of load-store pairs in a loop.
Number of common loop levels for each of the pair.

This patch just constrains the whole computation on the number of loadds and store instructions in the loop.

In an other approach, I experimented with computing 1 and constraining the number of pairs but that did not lead to any additional benefit in terms of compile-time. However, when other issues are fixed, I can revisit this approach.

Full diff: https://github.com/llvm/llvm-project/pull/118973.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+15-11)
(added) llvm/test/Transforms/LoopInterchange/many-load-stores.ll (+262)

diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index a0c0080c0bda1c..42db6ecc9f9c4a 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -66,8 +66,8 @@ using CharMatrix = std::vector<std::vector<char>>;
 
 } // end anonymous namespace
 
-// Maximum number of dependencies that can be handled in the dependency matrix.
-static const unsigned MaxMemInstrCount = 100;
+// Maximum number of load-stores that can be handled in the dependency matrix.
+static const unsigned MaxMemInstrCount = 64;
 
 // Maximum loop depth supported.
 static const unsigned MaxLoopNestDepth = 10;
@@ -84,7 +84,7 @@ static void printDepMatrix(CharMatrix &DepMatrix) {
 
 static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
                                      Loop *L, DependenceInfo *DI,
-                                     ScalarEvolution *SE) {
+                                     ScalarEvolution *SE, OptimizationRemarkEmitter *ORE) {
   using ValueVector = SmallVector<Value *, 16>;
 
   ValueVector MemInstr;
@@ -110,6 +110,17 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
   LLVM_DEBUG(dbgs() << "Found " << MemInstr.size()
                     << " Loads and Stores to analyze\n");
 
+  if (MemInstr.size() > MaxMemInstrCount) {
+    LLVM_DEBUG(dbgs() << "The transform doesn't support more than "
+                      << MaxMemInstrCount << " load stores in a loop\n");
+    ORE->emit([&]() {
+      return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedLoop",
+                                L->getStartLoc(),
+                                L->getHeader())
+             << "Reached maximum loads/stores than can be supported in the loop.";
+    });
+    return false;
+  }
   ValueVector::iterator I, IE, J, JE;
   StringSet<> Seen;
 
@@ -161,12 +172,6 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
         // Make sure we only add unique entries to the dependency matrix.
         if (Seen.insert(StringRef(Dep.data(), Dep.size())).second)
           DepMatrix.push_back(Dep);
-
-        if (DepMatrix.size() > MaxMemInstrCount) {
-          LLVM_DEBUG(dbgs() << "Cannot handle more than " << MaxMemInstrCount
-                            << " dependencies inside loop\n");
-          return false;
-        }
       }
     }
   }
@@ -450,7 +455,7 @@ struct LoopInterchange {
     CharMatrix DependencyMatrix;
     Loop *OuterMostLoop = *(LoopList.begin());
     if (!populateDependencyMatrix(DependencyMatrix, LoopNestDepth,
-                                  OuterMostLoop, DI, SE)) {
+                                  OuterMostLoop, DI, SE, ORE)) {
       LLVM_DEBUG(dbgs() << "Populating dependency matrix failed\n");
       return false;
     }
@@ -1728,7 +1733,6 @@ PreservedAnalyses LoopInterchangePass::run(LoopNest &LN,
   // Ensure minimum depth of the loop nest to do the interchange.
   if (!hasMinimumLoopDepth(LoopList))
     return PreservedAnalyses::all();
-
   DependenceInfo DI(&F, &AR.AA, &AR.SE, &AR.LI);
   std::unique_ptr<CacheCost> CC =
       CacheCost::getCacheCost(LN.getOutermostLoop(), AR, DI);
diff --git a/llvm/test/Transforms/LoopInterchange/many-load-stores.ll b/llvm/test/Transforms/LoopInterchange/many-load-stores.ll
new file mode 100644
index 00000000000000..e718d0b72cfed7
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/many-load-stores.ll
@@ -0,0 +1,262 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes=loop-interchange -S -disable-output | FileCheck %s
+; ModuleID = 'vadd.cpp'
+source_filename = "vadd.cpp"
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+@A = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+@B = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+@C = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+
+; CHECK: The transform doesn't support more than 64 load stores in a loop
+; CHECK: Populating dependency matrix failed
+; Function Attrs: mustprogress norecurse nounwind uwtable
+define dso_local noundef i32 @main() local_unnamed_addr #0 {
+  br label %1
+
+1:                                                ; preds = %9, %0
+  %2 = phi i32 [ 0, %0 ], [ %10, %9 ]
+  %3 = icmp slt i32 %2, 2048
+  br i1 %3, label %5, label %4
+
+4:                                                ; preds = %1
+  ret i32 0
+
+5:                                                ; preds = %1
+  br label %6
+
+6:                                                ; preds = %11, %5
+  %7 = phi i32 [ 0, %5 ], [ %208, %11 ]
+  %8 = icmp slt i32 %7, 85
+  br i1 %8, label %11, label %9
+
+9:                                                ; preds = %6
+  %10 = add nsw i32 %2, 1
+  br label %1
+
+11:                                               ; preds = %6
+  %12 = sext i32 %2 to i64
+  %13 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @B, i64 0, i64 %12
+  %14 = sext i32 %7 to i64
+  %15 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %14
+  %16 = load i32, i32* %15, align 4
+  %17 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @C, i64 0, i64 %12
+  %18 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %14
+  %19 = load i32, i32* %18, align 4
+  %20 = add nsw i32 %16, %19
+  %21 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @A, i64 0, i64 %12
+  %22 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %14
+  store i32 %20, i32* %22, align 4
+  %23 = add nsw i32 %7, 1
+  %24 = sext i32 %23 to i64
+  %25 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %24
+  %26 = load i32, i32* %25, align 4
+  %27 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %24
+  %28 = load i32, i32* %27, align 4
+  %29 = add nsw i32 %26, %28
+  %30 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %24
+  store i32 %29, i32* %30, align 4
+  %31 = add nsw i32 %23, 1
+  %32 = sext i32 %31 to i64
+  %33 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %32
+  %34 = load i32, i32* %33, align 4
+  %35 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %32
+  %36 = load i32, i32* %35, align 4
+  %37 = add nsw i32 %34, %36
+  %38 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %32
+  store i32 %37, i32* %38, align 4
+  %39 = add nsw i32 %31, 1
+  %40 = sext i32 %39 to i64
+  %41 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %40
+  %42 = load i32, i32* %41, align 4
+  %43 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %40
+  %44 = load i32, i32* %43, align 4
+  %45 = add nsw i32 %42, %44
+  %46 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %40
+  store i32 %45, i32* %46, align 4
+  %47 = add nsw i32 %39, 1
+  %48 = sext i32 %47 to i64
+  %49 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %48
+  %50 = load i32, i32* %49, align 4
+  %51 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %48
+  %52 = load i32, i32* %51, align 4
+  %53 = add nsw i32 %50, %52
+  %54 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %48
+  store i32 %53, i32* %54, align 4
+  %55 = add nsw i32 %47, 1
+  %56 = sext i32 %55 to i64
+  %57 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %56
+  %58 = load i32, i32* %57, align 4
+  %59 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %56
+  %60 = load i32, i32* %59, align 4
+  %61 = add nsw i32 %58, %60
+  %62 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %56
+  store i32 %61, i32* %62, align 4
+  %63 = add nsw i32 %55, 1
+  %64 = sext i32 %63 to i64
+  %65 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %64
+  %66 = load i32, i32* %65, align 4
+  %67 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %64
+  %68 = load i32, i32* %67, align 4
+  %69 = add nsw i32 %66, %68
+  %70 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %64
+  store i32 %69, i32* %70, align 4
+  %71 = add nsw i32 %63, 1
+  %72 = sext i32 %71 to i64
+  %73 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %72
+  %74 = load i32, i32* %73, align 4
+  %75 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %72
+  %76 = load i32, i32* %75, align 4
+  %77 = add nsw i32 %74, %76
+  %78 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %72
+  store i32 %77, i32* %78, align 4
+  %79 = add nsw i32 %71, 1
+  %80 = sext i32 %79 to i64
+  %81 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %80
+  %82 = load i32, i32* %81, align 4
+  %83 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %80
+  %84 = load i32, i32* %83, align 4
+  %85 = add nsw i32 %82, %84
+  %86 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %80
+  store i32 %85, i32* %86, align 4
+  %87 = add nsw i32 %79, 1
+  %88 = sext i32 %87 to i64
+  %89 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %88
+  %90 = load i32, i32* %89, align 4
+  %91 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %88
+  %92 = load i32, i32* %91, align 4
+  %93 = add nsw i32 %90, %92
+  %94 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %88
+  store i32 %93, i32* %94, align 4
+  %95 = add nsw i32 %87, 1
+  %96 = sext i32 %95 to i64
+  %97 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %96
+  %98 = load i32, i32* %97, align 4
+  %99 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %96
+  %100 = load i32, i32* %99, align 4
+  %101 = add nsw i32 %98, %100
+  %102 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %96
+  store i32 %101, i32* %102, align 4
+  %103 = add nsw i32 %95, 1
+  %104 = sext i32 %103 to i64
+  %105 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %104
+  %106 = load i32, i32* %105, align 4
+  %107 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %104
+  %108 = load i32, i32* %107, align 4
+  %109 = add nsw i32 %106, %108
+  %110 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %104
+  store i32 %109, i32* %110, align 4
+  %111 = add nsw i32 %103, 1
+  %112 = sext i32 %111 to i64
+  %113 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %112
+  %114 = load i32, i32* %113, align 4
+  %115 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %112
+  %116 = load i32, i32* %115, align 4
+  %117 = add nsw i32 %114, %116
+  %118 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %112
+  store i32 %117, i32* %118, align 4
+  %119 = add nsw i32 %111, 1
+  %120 = sext i32 %119 to i64
+  %121 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %120
+  %122 = load i32, i32* %121, align 4
+  %123 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %120
+  %124 = load i32, i32* %123, align 4
+  %125 = add nsw i32 %122, %124
+  %126 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %120
+  store i32 %125, i32* %126, align 4
+  %127 = add nsw i32 %119, 1
+  %128 = sext i32 %127 to i64
+  %129 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %128
+  %130 = load i32, i32* %129, align 4
+  %131 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %128
+  %132 = load i32, i32* %131, align 4
+  %133 = add nsw i32 %130, %132
+  %134 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %128
+  store i32 %133, i32* %134, align 4
+  %135 = add nsw i32 %127, 1
+  %136 = sext i32 %135 to i64
+  %137 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %136
+  %138 = load i32, i32* %137, align 4
+  %139 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %136
+  %140 = load i32, i32* %139, align 4
+  %141 = add nsw i32 %138, %140
+  %142 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %136
+  store i32 %141, i32* %142, align 4
+  %143 = add nsw i32 %135, 1
+  %144 = sext i32 %143 to i64
+  %145 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %144
+  %146 = load i32, i32* %145, align 4
+  %147 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %144
+  %148 = load i32, i32* %147, align 4
+  %149 = add nsw i32 %146, %148
+  %150 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %144
+  store i32 %149, i32* %150, align 4
+  %151 = add nsw i32 %143, 1
+  %152 = sext i32 %151 to i64
+  %153 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %152
+  %154 = load i32, i32* %153, align 4
+  %155 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %152
+  %156 = load i32, i32* %155, align 4
+  %157 = add nsw i32 %154, %156
+  %158 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %152
+  store i32 %157, i32* %158, align 4
+  %159 = add nsw i32 %151, 1
+  %160 = sext i32 %159 to i64
+  %161 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %160
+  %162 = load i32, i32* %161, align 4
+  %163 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %160
+  %164 = load i32, i32* %163, align 4
+  %165 = add nsw i32 %162, %164
+  %166 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %160
+  store i32 %165, i32* %166, align 4
+  %167 = add nsw i32 %159, 1
+  %168 = sext i32 %167 to i64
+  %169 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %168
+  %170 = load i32, i32* %169, align 4
+  %171 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %168
+  %172 = load i32, i32* %171, align 4
+  %173 = add nsw i32 %170, %172
+  %174 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %168
+  store i32 %173, i32* %174, align 4
+  %175 = add nsw i32 %167, 1
+  %176 = sext i32 %175 to i64
+  %177 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %176
+  %178 = load i32, i32* %177, align 4
+  %179 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %176
+  %180 = load i32, i32* %179, align 4
+  %181 = add nsw i32 %178, %180
+  %182 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %176
+  store i32 %181, i32* %182, align 4
+  %183 = add nsw i32 %175, 1
+  %184 = sext i32 %183 to i64
+  %185 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %184
+  %186 = load i32, i32* %185, align 4
+  %187 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %184
+  %188 = load i32, i32* %187, align 4
+  %189 = add nsw i32 %186, %188
+  %190 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %184
+  store i32 %189, i32* %190, align 4
+  %191 = add nsw i32 %183, 1
+  %192 = sext i32 %191 to i64
+  %193 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %192
+  %194 = load i32, i32* %193, align 4
+  %195 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %192
+  %196 = load i32, i32* %195, align 4
+  %197 = add nsw i32 %194, %196
+  %198 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %192
+  store i32 %197, i32* %198, align 4
+  %199 = add nsw i32 %191, 1
+  %200 = sext i32 %199 to i64
+  %201 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %200
+  %202 = load i32, i32* %201, align 4
+  %203 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %200
+  %204 = load i32, i32* %203, align 4
+  %205 = add nsw i32 %202, %204
+  %206 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %200
+  store i32 %205, i32* %206, align 4
+  %207 = add nsw i32 %199, 1
+  %208 = add nsw i32 %207, 24
+  br label %6
+}
+

github-actions · 2024-12-06T14:04:34Z

✅ With the latest revision this PR passed the C/C++ code formatter.

sjoerdmeijer · 2024-12-10T16:55:06Z

The idea makes sense, I think. I.e., bailing out before the more expensive dependence checks would be the sensible thing to do. But we do need numbers to discuss this trade-off:

the compile time savings,
the number of interchanges that we miss out on, if any.

Maybe you can share those numbers for the LLVM test-suite here. Ideally, interchange triggers the same amount of times, and we save compile-time by doing the check upfront.

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

madhur13490 · 2024-12-11T16:18:26Z

The idea makes sense, I think. I.e., bailing out before the more expensive dependence checks would be the sensible thing to do. But we do need numbers to discuss this trade-off:

the compile time savings,

the number of interchanges that we miss out on, if any.

Maybe you can share those numbers for the LLVM test-suite here. Ideally, interchange triggers the same amount of times, and we save compile-time by doing the check upfront.

Thanks for reminding me.

When I profiled this change on LLVM Test-suite, I see 39 interchanges compared to 43 with the trunk. If I bump up the value of MaxMemInstrCount to 128, I see 42 interchanges. However, with 128 we see bump in the compile-time too. Please find numbers here

the branch is madhur13490/perf/revert_and_bound_instr_64

This change tries to achieve trade-off between the number of loop interchanges and compile-time. Indeed, compile-time is directly proportional to the value of MaxMemInstCount.

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

I have a separate change for this in which I will also expose min max depths of the loop nest using command-line options.

madhur13490 · 2024-12-12T07:51:50Z

Also used remark message in the test and removed dependency from asserts build.

CongzheUalberta · 2024-12-12T17:30:25Z

When I profiled this change on LLVM Test-suite, I see 39 interchanges compared to 43 with the trunk. If I bump up the value of MaxMemInstrCount to 128, I see 42 interchanges. However, with 128 we see bump in the compile-time too. Please find numbers here the branch is madhur13490/perf/revert_and_bound_instr_64

This change tries to achieve trade-off between the number of loop interchanges and compile-time. Indeed, compile-time is directly proportional to the value of MaxMemInstCount.

With your results I see compile time actually increases with this patch but not decreases? For example compare the results of commit ae719f0756 versus commit f8c5d74865/bfbcc6d728.

madhur13490 · 2024-12-13T04:38:08Z

When I profiled this change on LLVM Test-suite, I see 39 interchanges compared to 43 with the trunk. If I bump up the value of MaxMemInstrCount to 128, I see 42 interchanges. However, with 128 we see bump in the compile-time too. Please find numbers here the branch is madhur13490/perf/revert_and_bound_instr_64
This change tries to achieve trade-off between the number of loop interchanges and compile-time. Indeed, compile-time is directly proportional to the value of MaxMemInstCount.

With your results I see compile time actually increases with this patch but not decreases? For example compare the results of commit ae719f0756 versus commit f8c5d74865/bfbcc6d728.

Ah! I think the branch is messed up due to reverts and other things in it.

To simplify things, I have rerun the experiments yesterday. I have created two branches.
a) perf/li-enable - This branch just enables LoopInterchange; no other code. On webpage, please see the results. Geomean is +0.54%. Most interestingly, lencod is +2.56%

b) perf/li-enable_bound_load_store - In this branch, I do two things 1. Enable LoopInterchange 2. Add this patch on top of it. If you see results, geomean is just +0.28% and most interestingly in lencod we see just +1.09%.
Thus, with this patch, I see improvements in geomean (0.54 - 0.28 = 0.26%) as well as lencod (2.56 - 1.09 = 1.47%)

FWIW, since this patch eliminates the expensive computation of dependence info in many cases and bails out early, we see an overall improvement in compile time, as expected.

Does this clear the confusion?

madhur13490 · 2025-01-03T06:15:54Z

@CongzheUalberta Happy new year! Did you get a chance to look at this further?

CongzheUalberta · 2025-01-05T21:58:56Z

@CongzheUalberta Happy new year! Did you get a chance to look at this further?

Thanks for the clarification. Now I can see this patch could reduce the compile time of loop interchange.

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

I do second this. It would be more handy to turn MaxMemInstrCount into a command line option.

madhur13490 · 2025-01-08T09:03:17Z

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

I do second this. It would be more handy to turn MaxMemInstrCount into a command line option.

Sure. Addressed in the latest commit with one minor sanity check on the value.

madhur13490 · 2025-01-16T06:06:34Z

Gentle ping!. FYI, I added a check for the new command-line option in the test.

sjoerdmeijer

As I wrote earlier, I think the idea makes sense, so this LGTM.

I have left 2 nits about debug/remark messages, but they can addressed without another review.

Please wait a day before merging this to give others a last chance to comment.

sjoerdmeijer · 2025-01-16T10:58:35Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

-
+  if (MemInstr.size() > MaxMemInstrCount) {
+    LLVM_DEBUG(dbgs() << "The transform doesn't support more than "
+                      << MaxMemInstrCount << " load stores in a loop\n");


Nit: "load stores" -> loads/stores

sjoerdmeijer · 2025-01-16T11:34:43Z

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

+    ORE->emit([&]() {
+      return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedLoop",
+                                      L->getStartLoc(), L->getHeader())
+             << "Number of loads/stores in the loop are more than threshold.";


Nit: " .... more than threshold": I would like to add something about which threshold exactly we're talking about here. Or the message could be something like

Number of supported loads/stores exceeded, the supported maximum can be increased with option -loop-interchange-max-meminstr-count

In the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 99th entry it terminates and thus overall wastes compile-time. It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors: 1. Number of load-store pairs in a loop. 2. Number of common loop levels for each of the pair. This patch just constrains the whole computation on the number of loads and store instructions in the loop. With 64, I see 39 interchanges compared to 42 with the trunk. (With 128, it is 42 vs 43 but I see increase in the compile-time.) In an other approach, I experimented with computing 1 and constraining the number of pairs but that did not lead to any additional benefit in terms of compile-time. However, when other issues are fixed, I can revisit this approach.

madhur13490 · 2025-01-17T04:42:24Z

Thanks @sjoerdmeijer. I will merge to main early next week giving some more time for the comments.

CongzheUalberta

LGTM giving all comments are addressed.

llvm-ci · 2025-01-21T05:33:33Z

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12063

Here is the relevant piece of the build log for the reference

Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/pgo1.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate      -Xclang "-fprofile-instrument=clang"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate -Xclang -fprofile-instrument=clang
# RUN: at line 3
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c      --check-prefix="CLANG-PGO"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c --check-prefix=CLANG-PGO
# RUN: at line 5
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-generate      -Xclang "-fprofile-instrument=llvm"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-generate -Xclang -fprofile-instrument=llvm
# RUN: at line 7
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c      --check-prefix="LLVM-PGO"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c --check-prefix=LLVM-PGO
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c:55:19: error: LLVM-PGO-NEXT: expected string not found in input
# | // LLVM-PGO-NEXT: [ 20 10 2 1 ]
# |                   ^
# | <stdin>:3:28: note: scanning from here
# | ======== Counters =========
# |                            ^
# | <stdin>:4:1: note: possible intended match here
# | [ 20 10 3 1 ]
# | ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |            1: ======= GPU Profile ======= 
# |            2: Target: amdgcn-amd-amdhsa 
# |            3: ======== Counters ========= 
# | next:55'0                                X error: no match found
# |            4: [ 20 10 3 1 ] 
# | next:55'0     ~~~~~~~~~~~~~~
# | next:55'1     ?              possible intended match
# |            5: [ 10 ] 
# | next:55'0     ~~~~~~~
# |            6: [ 20 ] 
# | next:55'0     ~~~~~~~
# |            7: ========== Data =========== 
# | next:55'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...

@kasuga-fj

This is a work in progress patch to enable loop-interchange by default and is a continuation of the RFC: https://discourse.llvm.org/t/enabling-loop-interchange/82589 Basically, we promised to fix any compile-time and correctness issues in the different components involved here (loop-interchange and dependence analaysis.) before discussing enabling interchange by default. We think are close to complete this; I would like to explain where we are and wanted to check if there are any thoughts or concerns. A quick overview of the correctness and compile-time improvements that we have made include: Correctness: - [LoopInterchange] Remove 'S' Scalar Dependencies (llvm#119345) - [LoopInterchange] Fix overflow in cost calculation (llvm#111807) - [LoopInterchange] Handle LE and GE correctly (PR llvm#124901) @kasuga-fj - [DA] disambiguate evolution of base addresses (llvm#116628) Compile-times: - [LoopInterchange] Constrain number of load/stores in a loop (llvm#118973) - [LoopInterchange] Bail out early if minimum loop nest is not met (llvm#115128) - [LoopInterchange] Hoist isComputableLoopNest() in the control flow (llvm#124247) And in terms of remaining work, we think we are very close to fixing these depenence analysis issues: - [DA] do not handle array accesses of different offsets (llvm#123436) - [DA] Dependence analysis does not handle array accesses of different sizes (llvm#116630) - [DA] use NSW arithmetic llvm#116632 The compile-time increase with a geomean increase of 0.19% looks good (after committing llvm#124247), I think: stage1-O3: Benchmark kimwitu++ +0.10% sqlite3 +0.14% consumer-typeset +0.07% Bullet +0.06% tramp3d-v4 +0.21% mafft +0.39% ClamAVi +0.06% lencod +0.61% SPASS +0.17% 7zip +0.08% geomean +0.19% See also: http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au We might want to look into lencod to see if we can improve more, but not sure it is strictly necessary.

madhur13490 requested review from sebpop, CongzheUalberta and sjoerdmeijer December 6, 2024 14:01

llvmbot added the llvm:transforms label Dec 6, 2024

madhur13490 force-pushed the perf/bound-load-stores branch 2 times, most recently from ca6235d to 2022daf Compare December 6, 2024 14:41

madhur13490 requested a review from fhahn December 6, 2024 14:47

madhur13490 changed the title ~~[LoopInterchange] Constrain load/stores in a loop~~ [LoopInterchange] Constrain number of load/stores in a loop Dec 10, 2024

madhur13490 force-pushed the perf/bound-load-stores branch from 2022daf to 1a6c755 Compare December 12, 2024 07:50

madhur13490 force-pushed the perf/bound-load-stores branch from 1a6c755 to 484524c Compare January 8, 2025 09:02

madhur13490 force-pushed the perf/bound-load-stores branch from 484524c to ac0fad8 Compare January 16, 2025 07:26

sjoerdmeijer approved these changes Jan 16, 2025

View reviewed changes

madhur13490 force-pushed the perf/bound-load-stores branch from ac0fad8 to 2e3567d Compare January 16, 2025 15:33

CongzheUalberta approved these changes Jan 20, 2025

View reviewed changes

madhur13490 merged commit 5d281a4 into llvm:main Jan 21, 2025
8 checks passed

sjoerdmeijer mentioned this pull request Jan 29, 2025

[LoopInterchange] Enable it by default (WIP) #124911

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LoopInterchange] Constrain number of load/stores in a loop #118973

[LoopInterchange] Constrain number of load/stores in a loop #118973

madhur13490 commented Dec 6, 2024 •

edited

Loading

llvmbot commented Dec 6, 2024

github-actions bot commented Dec 6, 2024 •

edited

Loading

sjoerdmeijer commented Dec 10, 2024 •

edited

Loading

madhur13490 commented Dec 11, 2024

madhur13490 commented Dec 12, 2024

CongzheUalberta commented Dec 12, 2024

madhur13490 commented Dec 13, 2024 •

edited

Loading

madhur13490 commented Jan 3, 2025

CongzheUalberta commented Jan 5, 2025

madhur13490 commented Jan 8, 2025 •

edited

Loading

madhur13490 commented Jan 16, 2025 •

edited

Loading

sjoerdmeijer left a comment •

edited

Loading

sjoerdmeijer Jan 16, 2025

madhur13490 Jan 17, 2025

sjoerdmeijer Jan 16, 2025

madhur13490 Jan 17, 2025

madhur13490 commented Jan 17, 2025

CongzheUalberta left a comment

llvm-ci commented Jan 21, 2025

[LoopInterchange] Constrain number of load/stores in a loop #118973

[LoopInterchange] Constrain number of load/stores in a loop #118973

Conversation

madhur13490 commented Dec 6, 2024 • edited Loading

llvmbot commented Dec 6, 2024

github-actions bot commented Dec 6, 2024 • edited Loading

sjoerdmeijer commented Dec 10, 2024 • edited Loading

madhur13490 commented Dec 11, 2024

madhur13490 commented Dec 12, 2024

CongzheUalberta commented Dec 12, 2024

madhur13490 commented Dec 13, 2024 • edited Loading

madhur13490 commented Jan 3, 2025

CongzheUalberta commented Jan 5, 2025

madhur13490 commented Jan 8, 2025 • edited Loading

madhur13490 commented Jan 16, 2025 • edited Loading

sjoerdmeijer left a comment • edited Loading

Choose a reason for hiding this comment

sjoerdmeijer Jan 16, 2025

Choose a reason for hiding this comment

madhur13490 Jan 17, 2025

Choose a reason for hiding this comment

sjoerdmeijer Jan 16, 2025

Choose a reason for hiding this comment

madhur13490 Jan 17, 2025

Choose a reason for hiding this comment

madhur13490 commented Jan 17, 2025

CongzheUalberta left a comment

Choose a reason for hiding this comment

llvm-ci commented Jan 21, 2025

madhur13490 commented Dec 6, 2024 •

edited

Loading

github-actions bot commented Dec 6, 2024 •

edited

Loading

sjoerdmeijer commented Dec 10, 2024 •

edited

Loading

madhur13490 commented Dec 13, 2024 •

edited

Loading

madhur13490 commented Jan 8, 2025 •

edited

Loading

madhur13490 commented Jan 16, 2025 •

edited

Loading

sjoerdmeijer left a comment •

edited

Loading