Skip to content

[LoopInterchange] Constrain number of load/stores in a loop #118973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 21, 2025

Conversation

madhur13490
Copy link
Contributor

@madhur13490 madhur13490 commented Dec 6, 2024

In the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 99th entry, it terminates and thus overall wastes compile-time.

It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors:

  1. Number of load-store pairs in a loop.
  2. Number of common loop levels for each of the pair.

This patch constrains the whole computation on the number of loads and stores instructions in the loop.

In another approach, I experimented with computing 1 and constraining the number of pairs, but that did not lead to any additional benefit in terms of compile time. However, when other issues are fixed, I can revisit this approach.

@llvmbot
Copy link
Member

llvmbot commented Dec 6, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Madhur Amilkanthwar (madhur13490)

Changes

In the current state of the code, the transform computes entries for the dependency matrix until MaxMemInstrCount which is 100. After 100th entry it terminates and thus overall wastes compile-time.

It would be nice if we can compute total number of entries upfront and early exit if the number of entries > 100. However, computing the number of entries is not always possible as it depends on two factors:

  1. Number of load-store pairs in a loop.
  2. Number of common loop levels for each of the pair.

This patch just constrains the whole computation on the number of loadds and store instructions in the loop.

In an other approach, I experimented with computing 1 and constraining the number of pairs but that did not lead to any additional benefit in terms of compile-time. However, when other issues are fixed, I can revisit this approach.


Full diff: https://github.com/llvm/llvm-project/pull/118973.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+15-11)
  • (added) llvm/test/Transforms/LoopInterchange/many-load-stores.ll (+262)
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index a0c0080c0bda1c..42db6ecc9f9c4a 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -66,8 +66,8 @@ using CharMatrix = std::vector<std::vector<char>>;
 
 } // end anonymous namespace
 
-// Maximum number of dependencies that can be handled in the dependency matrix.
-static const unsigned MaxMemInstrCount = 100;
+// Maximum number of load-stores that can be handled in the dependency matrix.
+static const unsigned MaxMemInstrCount = 64;
 
 // Maximum loop depth supported.
 static const unsigned MaxLoopNestDepth = 10;
@@ -84,7 +84,7 @@ static void printDepMatrix(CharMatrix &DepMatrix) {
 
 static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
                                      Loop *L, DependenceInfo *DI,
-                                     ScalarEvolution *SE) {
+                                     ScalarEvolution *SE, OptimizationRemarkEmitter *ORE) {
   using ValueVector = SmallVector<Value *, 16>;
 
   ValueVector MemInstr;
@@ -110,6 +110,17 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
   LLVM_DEBUG(dbgs() << "Found " << MemInstr.size()
                     << " Loads and Stores to analyze\n");
 
+  if (MemInstr.size() > MaxMemInstrCount) {
+    LLVM_DEBUG(dbgs() << "The transform doesn't support more than "
+                      << MaxMemInstrCount << " load stores in a loop\n");
+    ORE->emit([&]() {
+      return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedLoop",
+                                L->getStartLoc(),
+                                L->getHeader())
+             << "Reached maximum loads/stores than can be supported in the loop.";
+    });
+    return false;
+  }
   ValueVector::iterator I, IE, J, JE;
   StringSet<> Seen;
 
@@ -161,12 +172,6 @@ static bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level,
         // Make sure we only add unique entries to the dependency matrix.
         if (Seen.insert(StringRef(Dep.data(), Dep.size())).second)
           DepMatrix.push_back(Dep);
-
-        if (DepMatrix.size() > MaxMemInstrCount) {
-          LLVM_DEBUG(dbgs() << "Cannot handle more than " << MaxMemInstrCount
-                            << " dependencies inside loop\n");
-          return false;
-        }
       }
     }
   }
@@ -450,7 +455,7 @@ struct LoopInterchange {
     CharMatrix DependencyMatrix;
     Loop *OuterMostLoop = *(LoopList.begin());
     if (!populateDependencyMatrix(DependencyMatrix, LoopNestDepth,
-                                  OuterMostLoop, DI, SE)) {
+                                  OuterMostLoop, DI, SE, ORE)) {
       LLVM_DEBUG(dbgs() << "Populating dependency matrix failed\n");
       return false;
     }
@@ -1728,7 +1733,6 @@ PreservedAnalyses LoopInterchangePass::run(LoopNest &LN,
   // Ensure minimum depth of the loop nest to do the interchange.
   if (!hasMinimumLoopDepth(LoopList))
     return PreservedAnalyses::all();
-
   DependenceInfo DI(&F, &AR.AA, &AR.SE, &AR.LI);
   std::unique_ptr<CacheCost> CC =
       CacheCost::getCacheCost(LN.getOutermostLoop(), AR, DI);
diff --git a/llvm/test/Transforms/LoopInterchange/many-load-stores.ll b/llvm/test/Transforms/LoopInterchange/many-load-stores.ll
new file mode 100644
index 00000000000000..e718d0b72cfed7
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/many-load-stores.ll
@@ -0,0 +1,262 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes=loop-interchange -S -disable-output | FileCheck %s
+; ModuleID = 'vadd.cpp'
+source_filename = "vadd.cpp"
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+@A = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+@B = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+@C = dso_local local_unnamed_addr global [2048 x [2048 x i32]] zeroinitializer, align 4
+
+; CHECK: The transform doesn't support more than 64 load stores in a loop
+; CHECK: Populating dependency matrix failed
+; Function Attrs: mustprogress norecurse nounwind uwtable
+define dso_local noundef i32 @main() local_unnamed_addr #0 {
+  br label %1
+
+1:                                                ; preds = %9, %0
+  %2 = phi i32 [ 0, %0 ], [ %10, %9 ]
+  %3 = icmp slt i32 %2, 2048
+  br i1 %3, label %5, label %4
+
+4:                                                ; preds = %1
+  ret i32 0
+
+5:                                                ; preds = %1
+  br label %6
+
+6:                                                ; preds = %11, %5
+  %7 = phi i32 [ 0, %5 ], [ %208, %11 ]
+  %8 = icmp slt i32 %7, 85
+  br i1 %8, label %11, label %9
+
+9:                                                ; preds = %6
+  %10 = add nsw i32 %2, 1
+  br label %1
+
+11:                                               ; preds = %6
+  %12 = sext i32 %2 to i64
+  %13 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @B, i64 0, i64 %12
+  %14 = sext i32 %7 to i64
+  %15 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %14
+  %16 = load i32, i32* %15, align 4
+  %17 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @C, i64 0, i64 %12
+  %18 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %14
+  %19 = load i32, i32* %18, align 4
+  %20 = add nsw i32 %16, %19
+  %21 = getelementptr inbounds [2048 x [2048 x i32]], [2048 x [2048 x i32]]* @A, i64 0, i64 %12
+  %22 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %14
+  store i32 %20, i32* %22, align 4
+  %23 = add nsw i32 %7, 1
+  %24 = sext i32 %23 to i64
+  %25 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %24
+  %26 = load i32, i32* %25, align 4
+  %27 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %24
+  %28 = load i32, i32* %27, align 4
+  %29 = add nsw i32 %26, %28
+  %30 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %24
+  store i32 %29, i32* %30, align 4
+  %31 = add nsw i32 %23, 1
+  %32 = sext i32 %31 to i64
+  %33 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %32
+  %34 = load i32, i32* %33, align 4
+  %35 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %32
+  %36 = load i32, i32* %35, align 4
+  %37 = add nsw i32 %34, %36
+  %38 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %32
+  store i32 %37, i32* %38, align 4
+  %39 = add nsw i32 %31, 1
+  %40 = sext i32 %39 to i64
+  %41 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %40
+  %42 = load i32, i32* %41, align 4
+  %43 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %40
+  %44 = load i32, i32* %43, align 4
+  %45 = add nsw i32 %42, %44
+  %46 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %40
+  store i32 %45, i32* %46, align 4
+  %47 = add nsw i32 %39, 1
+  %48 = sext i32 %47 to i64
+  %49 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %48
+  %50 = load i32, i32* %49, align 4
+  %51 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %48
+  %52 = load i32, i32* %51, align 4
+  %53 = add nsw i32 %50, %52
+  %54 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %48
+  store i32 %53, i32* %54, align 4
+  %55 = add nsw i32 %47, 1
+  %56 = sext i32 %55 to i64
+  %57 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %56
+  %58 = load i32, i32* %57, align 4
+  %59 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %56
+  %60 = load i32, i32* %59, align 4
+  %61 = add nsw i32 %58, %60
+  %62 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %56
+  store i32 %61, i32* %62, align 4
+  %63 = add nsw i32 %55, 1
+  %64 = sext i32 %63 to i64
+  %65 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %64
+  %66 = load i32, i32* %65, align 4
+  %67 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %64
+  %68 = load i32, i32* %67, align 4
+  %69 = add nsw i32 %66, %68
+  %70 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %64
+  store i32 %69, i32* %70, align 4
+  %71 = add nsw i32 %63, 1
+  %72 = sext i32 %71 to i64
+  %73 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %72
+  %74 = load i32, i32* %73, align 4
+  %75 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %72
+  %76 = load i32, i32* %75, align 4
+  %77 = add nsw i32 %74, %76
+  %78 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %72
+  store i32 %77, i32* %78, align 4
+  %79 = add nsw i32 %71, 1
+  %80 = sext i32 %79 to i64
+  %81 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %80
+  %82 = load i32, i32* %81, align 4
+  %83 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %80
+  %84 = load i32, i32* %83, align 4
+  %85 = add nsw i32 %82, %84
+  %86 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %80
+  store i32 %85, i32* %86, align 4
+  %87 = add nsw i32 %79, 1
+  %88 = sext i32 %87 to i64
+  %89 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %88
+  %90 = load i32, i32* %89, align 4
+  %91 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %88
+  %92 = load i32, i32* %91, align 4
+  %93 = add nsw i32 %90, %92
+  %94 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %88
+  store i32 %93, i32* %94, align 4
+  %95 = add nsw i32 %87, 1
+  %96 = sext i32 %95 to i64
+  %97 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %96
+  %98 = load i32, i32* %97, align 4
+  %99 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %96
+  %100 = load i32, i32* %99, align 4
+  %101 = add nsw i32 %98, %100
+  %102 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %96
+  store i32 %101, i32* %102, align 4
+  %103 = add nsw i32 %95, 1
+  %104 = sext i32 %103 to i64
+  %105 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %104
+  %106 = load i32, i32* %105, align 4
+  %107 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %104
+  %108 = load i32, i32* %107, align 4
+  %109 = add nsw i32 %106, %108
+  %110 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %104
+  store i32 %109, i32* %110, align 4
+  %111 = add nsw i32 %103, 1
+  %112 = sext i32 %111 to i64
+  %113 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %112
+  %114 = load i32, i32* %113, align 4
+  %115 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %112
+  %116 = load i32, i32* %115, align 4
+  %117 = add nsw i32 %114, %116
+  %118 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %112
+  store i32 %117, i32* %118, align 4
+  %119 = add nsw i32 %111, 1
+  %120 = sext i32 %119 to i64
+  %121 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %120
+  %122 = load i32, i32* %121, align 4
+  %123 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %120
+  %124 = load i32, i32* %123, align 4
+  %125 = add nsw i32 %122, %124
+  %126 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %120
+  store i32 %125, i32* %126, align 4
+  %127 = add nsw i32 %119, 1
+  %128 = sext i32 %127 to i64
+  %129 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %128
+  %130 = load i32, i32* %129, align 4
+  %131 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %128
+  %132 = load i32, i32* %131, align 4
+  %133 = add nsw i32 %130, %132
+  %134 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %128
+  store i32 %133, i32* %134, align 4
+  %135 = add nsw i32 %127, 1
+  %136 = sext i32 %135 to i64
+  %137 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %136
+  %138 = load i32, i32* %137, align 4
+  %139 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %136
+  %140 = load i32, i32* %139, align 4
+  %141 = add nsw i32 %138, %140
+  %142 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %136
+  store i32 %141, i32* %142, align 4
+  %143 = add nsw i32 %135, 1
+  %144 = sext i32 %143 to i64
+  %145 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %144
+  %146 = load i32, i32* %145, align 4
+  %147 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %144
+  %148 = load i32, i32* %147, align 4
+  %149 = add nsw i32 %146, %148
+  %150 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %144
+  store i32 %149, i32* %150, align 4
+  %151 = add nsw i32 %143, 1
+  %152 = sext i32 %151 to i64
+  %153 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %152
+  %154 = load i32, i32* %153, align 4
+  %155 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %152
+  %156 = load i32, i32* %155, align 4
+  %157 = add nsw i32 %154, %156
+  %158 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %152
+  store i32 %157, i32* %158, align 4
+  %159 = add nsw i32 %151, 1
+  %160 = sext i32 %159 to i64
+  %161 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %160
+  %162 = load i32, i32* %161, align 4
+  %163 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %160
+  %164 = load i32, i32* %163, align 4
+  %165 = add nsw i32 %162, %164
+  %166 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %160
+  store i32 %165, i32* %166, align 4
+  %167 = add nsw i32 %159, 1
+  %168 = sext i32 %167 to i64
+  %169 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %168
+  %170 = load i32, i32* %169, align 4
+  %171 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %168
+  %172 = load i32, i32* %171, align 4
+  %173 = add nsw i32 %170, %172
+  %174 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %168
+  store i32 %173, i32* %174, align 4
+  %175 = add nsw i32 %167, 1
+  %176 = sext i32 %175 to i64
+  %177 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %176
+  %178 = load i32, i32* %177, align 4
+  %179 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %176
+  %180 = load i32, i32* %179, align 4
+  %181 = add nsw i32 %178, %180
+  %182 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %176
+  store i32 %181, i32* %182, align 4
+  %183 = add nsw i32 %175, 1
+  %184 = sext i32 %183 to i64
+  %185 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %184
+  %186 = load i32, i32* %185, align 4
+  %187 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %184
+  %188 = load i32, i32* %187, align 4
+  %189 = add nsw i32 %186, %188
+  %190 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %184
+  store i32 %189, i32* %190, align 4
+  %191 = add nsw i32 %183, 1
+  %192 = sext i32 %191 to i64
+  %193 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %192
+  %194 = load i32, i32* %193, align 4
+  %195 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %192
+  %196 = load i32, i32* %195, align 4
+  %197 = add nsw i32 %194, %196
+  %198 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %192
+  store i32 %197, i32* %198, align 4
+  %199 = add nsw i32 %191, 1
+  %200 = sext i32 %199 to i64
+  %201 = getelementptr inbounds [2048 x i32], [2048 x i32]* %13, i64 0, i64 %200
+  %202 = load i32, i32* %201, align 4
+  %203 = getelementptr inbounds [2048 x i32], [2048 x i32]* %17, i64 0, i64 %200
+  %204 = load i32, i32* %203, align 4
+  %205 = add nsw i32 %202, %204
+  %206 = getelementptr inbounds [2048 x i32], [2048 x i32]* %21, i64 0, i64 %200
+  store i32 %205, i32* %206, align 4
+  %207 = add nsw i32 %199, 1
+  %208 = add nsw i32 %207, 24
+  br label %6
+}
+

Copy link

github-actions bot commented Dec 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@madhur13490 madhur13490 force-pushed the perf/bound-load-stores branch 2 times, most recently from ca6235d to 2022daf Compare December 6, 2024 14:41
@madhur13490 madhur13490 requested a review from fhahn December 6, 2024 14:47
@madhur13490 madhur13490 changed the title [LoopInterchange] Constrain load/stores in a loop [LoopInterchange] Constrain number of load/stores in a loop Dec 10, 2024
@sjoerdmeijer
Copy link
Collaborator

sjoerdmeijer commented Dec 10, 2024

The idea makes sense, I think. I.e., bailing out before the more expensive dependence checks would be the sensible thing to do. But we do need numbers to discuss this trade-off:

  • the compile time savings,
  • the number of interchanges that we miss out on, if any.

Maybe you can share those numbers for the LLVM test-suite here. Ideally, interchange triggers the same amount of times, and we save compile-time by doing the check upfront.

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

@madhur13490
Copy link
Contributor Author

The idea makes sense, I think. I.e., bailing out before the more expensive dependence checks would be the sensible thing to do. But we do need numbers to discuss this trade-off:

  • the compile time savings,
  • the number of interchanges that we miss out on, if any.

Maybe you can share those numbers for the LLVM test-suite here. Ideally, interchange triggers the same amount of times, and we save compile-time by doing the check upfront.

Thanks for reminding me.

When I profiled this change on LLVM Test-suite, I see 39 interchanges compared to 43 with the trunk. If I bump up the value of MaxMemInstrCount to 128, I see 42 interchanges. However, with 128 we see bump in the compile-time too. Please find numbers here

the branch is madhur13490/perf/revert_and_bound_instr_64

This change tries to achieve trade-off between the number of loop interchanges and compile-time. Indeed, compile-time is directly proportional to the value of MaxMemInstCount.

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

I have a separate change for this in which I will also expose min max depths of the loop nest using command-line options.

@madhur13490 madhur13490 force-pushed the perf/bound-load-stores branch from 2022daf to 1a6c755 Compare December 12, 2024 07:50
@madhur13490
Copy link
Contributor Author

Also used remark message in the test and removed dependency from asserts build.

@CongzheUalberta
Copy link
Contributor

When I profiled this change on LLVM Test-suite, I see 39 interchanges compared to 43 with the trunk. If I bump up the value of MaxMemInstrCount to 128, I see 42 interchanges. However, with 128 we see bump in the compile-time too. Please find numbers here the branch is madhur13490/perf/revert_and_bound_instr_64

This change tries to achieve trade-off between the number of loop interchanges and compile-time. Indeed, compile-time is directly proportional to the value of MaxMemInstCount.

With your results I see compile time actually increases with this patch but not decreases? For example compare the results of commit ae719f0756 versus commit f8c5d74865/bfbcc6d728.

@madhur13490
Copy link
Contributor Author

madhur13490 commented Dec 13, 2024

When I profiled this change on LLVM Test-suite, I see 39 interchanges compared to 43 with the trunk. If I bump up the value of MaxMemInstrCount to 128, I see 42 interchanges. However, with 128 we see bump in the compile-time too. Please find numbers here the branch is madhur13490/perf/revert_and_bound_instr_64
This change tries to achieve trade-off between the number of loop interchanges and compile-time. Indeed, compile-time is directly proportional to the value of MaxMemInstCount.

With your results I see compile time actually increases with this patch but not decreases? For example compare the results of commit ae719f0756 versus commit f8c5d74865/bfbcc6d728.

Ah! I think the branch is messed up due to reverts and other things in it.

To simplify things, I have rerun the experiments yesterday. I have created two branches.
a) perf/li-enable - This branch just enables LoopInterchange; no other code. On webpage, please see the results. Geomean is +0.54%. Most interestingly, lencod is +2.56%

b) perf/li-enable_bound_load_store - In this branch, I do two things 1. Enable LoopInterchange 2. Add this patch on top of it. If you see results, geomean is just +0.28% and most interestingly in lencod we see just +1.09%.
Thus, with this patch, I see improvements in geomean (0.54 - 0.28 = 0.26%) as well as lencod (2.56 - 1.09 = 1.47%)

FWIW, since this patch eliminates the expensive computation of dependence info in many cases and bails out early, we see an overall improvement in compile time, as expected.

Screenshot 2024-12-13 at 10 21 44 AM

Does this clear the confusion?

@madhur13490
Copy link
Contributor Author

@CongzheUalberta Happy new year! Did you get a chance to look at this further?

@CongzheUalberta
Copy link
Contributor

@CongzheUalberta Happy new year! Did you get a chance to look at this further?

Thanks for the clarification. Now I can see this patch could reduce the compile time of loop interchange.

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

I do second this. It would be more handy to turn MaxMemInstrCount into a command line option.

@madhur13490 madhur13490 force-pushed the perf/bound-load-stores branch from 1a6c755 to 484524c Compare January 8, 2025 09:02
@madhur13490
Copy link
Contributor Author

madhur13490 commented Jan 8, 2025

Also, it is probably better to now turn that constant MaxMemInstrCount into an option with a default value, so that it could be easily adjusted on the command line for testing purposes.

I do second this. It would be more handy to turn MaxMemInstrCount into a command line option.

Sure. Addressed in the latest commit with one minor sanity check on the value.

@madhur13490
Copy link
Contributor Author

madhur13490 commented Jan 16, 2025

Gentle ping!. FYI, I added a check for the new command-line option in the test.

@madhur13490 madhur13490 force-pushed the perf/bound-load-stores branch from 484524c to ac0fad8 Compare January 16, 2025 07:26
Copy link
Collaborator

@sjoerdmeijer sjoerdmeijer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I wrote earlier, I think the idea makes sense, so this LGTM.

I have left 2 nits about debug/remark messages, but they can addressed without another review.

Please wait a day before merging this to give others a last chance to comment.


if (MemInstr.size() > MaxMemInstrCount) {
LLVM_DEBUG(dbgs() << "The transform doesn't support more than "
<< MaxMemInstrCount << " load stores in a loop\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "load stores" -> loads/stores

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ORE->emit([&]() {
return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedLoop",
L->getStartLoc(), L->getHeader())
<< "Number of loads/stores in the loop are more than threshold.";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: " .... more than threshold": I would like to add something about which threshold exactly we're talking about here. Or the message could be something like

Number of supported loads/stores exceeded, the supported maximum can be increased
with option -loop-interchange-max-meminstr-count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

In the current state of the code, the transform computes
entries for the dependency matrix until MaxMemInstrCount which
is 100. After 99th entry it terminates and thus overall wastes
compile-time.

It would be nice if we can compute total number of entries
upfront and early exit if the number of entries > 100.
However, computing the number of entries is not always possible
as it depends on two factors:
1. Number of load-store pairs in a loop.
2. Number of common loop levels for each of the pair.

This patch just constrains the whole computation on the number of
loads and store instructions in the loop. With 64, I see 39 interchanges
compared to 42 with the trunk. (With 128, it is 42 vs 43 but I see
increase in the compile-time.)

In an other approach, I experimented with computing 1 and constraining
the number of pairs but that did not lead to any additional benefit
in terms of compile-time. However, when other issues are fixed,
I can revisit this approach.
@madhur13490 madhur13490 force-pushed the perf/bound-load-stores branch from ac0fad8 to 2e3567d Compare January 16, 2025 15:33
@madhur13490
Copy link
Contributor Author

Thanks @sjoerdmeijer. I will merge to main early next week giving some more time for the comments.

Copy link
Contributor

@CongzheUalberta CongzheUalberta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM giving all comments are addressed.

@madhur13490 madhur13490 merged commit 5d281a4 into llvm:main Jan 21, 2025
8 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jan 21, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12063

Here is the relevant piece of the build log for the reference
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/pgo1.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate      -Xclang "-fprofile-instrument=clang"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-instr-generate -Xclang -fprofile-instrument=clang
# RUN: at line 3
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c      --check-prefix="CLANG-PGO"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c --check-prefix=CLANG-PGO
# RUN: at line 5
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-generate      -Xclang "-fprofile-instrument=llvm"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fprofile-generate -Xclang -fprofile-instrument=llvm
# RUN: at line 7
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c      --check-prefix="LLVM-PGO"
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/pgo1.c.tmp
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c --check-prefix=LLVM-PGO
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c:55:19: error: LLVM-PGO-NEXT: expected string not found in input
# | // LLVM-PGO-NEXT: [ 20 10 2 1 ]
# |                   ^
# | <stdin>:3:28: note: scanning from here
# | ======== Counters =========
# |                            ^
# | <stdin>:4:1: note: possible intended match here
# | [ 20 10 3 1 ]
# | ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/pgo1.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |            1: ======= GPU Profile ======= 
# |            2: Target: amdgcn-amd-amdhsa 
# |            3: ======== Counters ========= 
# | next:55'0                                X error: no match found
# |            4: [ 20 10 3 1 ] 
# | next:55'0     ~~~~~~~~~~~~~~
# | next:55'1     ?              possible intended match
# |            5: [ 10 ] 
# | next:55'0     ~~~~~~~
# |            6: [ 20 ] 
# | next:55'0     ~~~~~~~
# |            7: ========== Data =========== 
# | next:55'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...

sjoerdmeijer added a commit to sjoerdmeijer/llvm-project that referenced this pull request Jan 29, 2025
This is a work in progress patch to enable loop-interchange by default
and is a continuation of the RFC:

https://discourse.llvm.org/t/enabling-loop-interchange/82589

Basically, we promised to fix any compile-time and correctness issues in
the different components involved here (loop-interchange and dependence
analaysis.) before discussing enabling interchange by default. We think
are close to complete this; I would like to explain where we are and
wanted to check if there are any thoughts or concerns. A quick overview
of the correctness and compile-time improvements that we have made include:

Correctness:
- [LoopInterchange] Remove 'S' Scalar Dependencies (llvm#119345)
- [LoopInterchange] Fix overflow in cost calculation (llvm#111807)
- [LoopInterchange] Handle LE and GE correctly (PR llvm#124901) @kasuga-fj
- [DA] disambiguate evolution of base addresses (llvm#116628)

Compile-times:
- [LoopInterchange] Constrain number of load/stores in a loop (llvm#118973)
- [LoopInterchange] Bail out early if minimum loop nest is not met (llvm#115128)
- [LoopInterchange] Hoist isComputableLoopNest() in the control flow (llvm#124247)

And in terms of remaining work, we think we are very close to fixing
these depenence analysis issues:
- [DA] do not handle array accesses of different offsets (llvm#123436)
- [DA] Dependence analysis does not handle array accesses of different sizes (llvm#116630)
- [DA] use NSW arithmetic llvm#116632

The compile-time increase with a geomean increase of 0.19% looks good
(after committing llvm#124247), I think:

  stage1-O3:
  Benchmark
  kimwitu++        +0.10%
  sqlite3          +0.14%
  consumer-typeset +0.07%
  Bullet           +0.06%
  tramp3d-v4       +0.21%
  mafft            +0.39%
  ClamAVi          +0.06%
  lencod           +0.61%
  SPASS            +0.17%
  7zip             +0.08%
  geomean          +0.19%

See also:
http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au

We might want to look into lencod to see if we can improve more, but not
sure it is strictly necessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants