[LoopVectorize] Enable hoisting of runtime checks by default #71538
Conversation
With commit https://reviews.llvm.org/D152366 I introduced functionality that permitted the hoisting of runtime memory checks from a vectorised inner loop to the preheader of the next outer-most loop. This is useful for benchmarks like SPEC2017's x264, where the inner loop is vectorised and only has a small trip count. In such cases the runtime memory checks become expensive, and since the checks never fail in the case of x264 it makes sense to do this. However, this behaviour was controlled by the flag -hoist-runtime-checks, which was off by default. This patch enables this flag by default for all targets, since I believe this is a generally beneficial thing to do. I have tested this with SPEC2017 and I see 2.3% and 2.6% improvements with x264 on neoverse-v1 and neoverse-n1, respectively. Similarly, I saw slight improvements in the overall geomean on both machines. The only other notable change was a 1% drop in the roms benchmark, which was compensated for by a 1% improvement in fotonik3d.
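To make the transformation concrete, here is a minimal C sketch of the kind of nest this helps; the function and loop shape are illustrative, not taken from x264:

```c
#include <stddef.h>

/* Illustrative sketch: a nest whose vectorised inner loop has a small
 * trip count m, as in x264. Without hoisting, an overlap check for one
 * inner trip ([dst+i*m, dst+(i+1)*m) vs [src+i*m, src+(i+1)*m)) runs
 * before every entry to the inner vector loop. With hoisting, the check
 * is widened to cover the whole nest ([dst, dst+n*m) vs [src, src+n*m)),
 * so its operands are loop-invariant and can be computed once up front. */
void kernel(int *dst, const int *src, size_t n, size_t m) {
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < m; j++)
      dst[i * m + j] += src[i * m + j];
}
```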
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-llvm-analysis
Author: David Sherwood (david-arm)
Patch is 20.72 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/71538.diff 5 Files Affected:
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 3d1edd5f038a25e..05ca09968207fec 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -142,7 +142,7 @@ static cl::opt<bool, true> HoistRuntimeChecks(
"hoist-runtime-checks", cl::Hidden,
cl::desc(
"Hoist inner loop runtime memory checks to outer loop if possible"),
- cl::location(VectorizerParams::HoistRuntimeChecks), cl::init(false));
+ cl::location(VectorizerParams::HoistRuntimeChecks), cl::init(true));
bool VectorizerParams::HoistRuntimeChecks;
bool VectorizerParams::isInterleaveForced() {
diff --git a/llvm/test/Transforms/LoopVectorize/invariant-store-vectorization.ll b/llvm/test/Transforms/LoopVectorize/invariant-store-vectorization.ll
index 9e36649bcf73d6a..52101fda6309f64 100644
--- a/llvm/test/Transforms/LoopVectorize/invariant-store-vectorization.ll
+++ b/llvm/test/Transforms/LoopVectorize/invariant-store-vectorization.ll
@@ -13,9 +13,6 @@ target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f3
; address.
-; memory check is found.conflict = b[max(n-1,1)] > a && (ptr a)+1 > (ptr b)
-
-
define i32 @inv_val_store_to_inv_address_with_reduction(ptr %a, i64 %n, ptr %b) {
; CHECK-LABEL: @inv_val_store_to_inv_address_with_reduction(
; CHECK-NEXT: entry:
@@ -346,74 +343,75 @@ define i32 @multiple_uniform_stores(ptr nocapture %var1, ptr nocapture readonly
; CHECK-NEXT: [[CMP20:%.*]] = icmp eq i32 [[ITR:%.*]], 0
; CHECK-NEXT: br i1 [[CMP20]], label [[FOR_END10:%.*]], label [[FOR_COND1_PREHEADER_PREHEADER:%.*]]
; CHECK: for.cond1.preheader.preheader:
-; CHECK-NEXT: [[SCEVGEP3:%.*]] = getelementptr i8, ptr [[VAR2:%.*]], i64 4
-; CHECK-NEXT: [[INVARIANT_GEP5:%.*]] = getelementptr i8, ptr [[VAR1:%.*]], i64 4
+; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[ITR]], -1
+; CHECK-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = shl nuw nsw i64 [[TMP1]], 2
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i8, ptr [[VAR1:%.*]], i64 [[TMP2]]
+; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[TMP3]], i64 4
+; CHECK-NEXT: [[SCEVGEP2:%.*]] = getelementptr i8, ptr [[VAR2:%.*]], i64 4
; CHECK-NEXT: br label [[FOR_COND1_PREHEADER:%.*]]
; CHECK: for.cond1.preheader:
; CHECK-NEXT: [[INDVARS_IV23:%.*]] = phi i64 [ [[INDVARS_IV_NEXT24:%.*]], [[FOR_INC8:%.*]] ], [ 0, [[FOR_COND1_PREHEADER_PREHEADER]] ]
; CHECK-NEXT: [[J_022:%.*]] = phi i32 [ [[J_1_LCSSA:%.*]], [[FOR_INC8]] ], [ 0, [[FOR_COND1_PREHEADER_PREHEADER]] ]
-; CHECK-NEXT: [[TMP0:%.*]] = shl nuw nsw i64 [[INDVARS_IV23]], 2
-; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[VAR1]], i64 [[TMP0]]
-; CHECK-NEXT: [[GEP6:%.*]] = getelementptr i8, ptr [[INVARIANT_GEP5]], i64 [[TMP0]]
; CHECK-NEXT: [[CMP218:%.*]] = icmp ult i32 [[J_022]], [[ITR]]
; CHECK-NEXT: br i1 [[CMP218]], label [[FOR_BODY3_LR_PH:%.*]], label [[FOR_INC8]]
; CHECK: for.body3.lr.ph:
; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, ptr [[VAR1]], i64 [[INDVARS_IV23]]
-; CHECK-NEXT: [[TMP1:%.*]] = zext i32 [[J_022]] to i64
+; CHECK-NEXT: [[TMP4:%.*]] = zext i32 [[J_022]] to i64
; CHECK-NEXT: [[ARRAYIDX5_PROMOTED:%.*]] = load i32, ptr [[ARRAYIDX5]], align 4
-; CHECK-NEXT: [[TMP2:%.*]] = xor i32 [[J_022]], -1
-; CHECK-NEXT: [[TMP3:%.*]] = add i32 [[TMP2]], [[ITR]]
-; CHECK-NEXT: [[TMP4:%.*]] = zext i32 [[TMP3]] to i64
-; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[TMP4]], 1
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[TMP3]], 3
+; CHECK-NEXT: [[TMP5:%.*]] = xor i32 [[J_022]], -1
+; CHECK-NEXT: [[TMP6:%.*]] = add i32 [[TMP5]], [[ITR]]
+; CHECK-NEXT: [[TMP7:%.*]] = zext i32 [[TMP6]] to i64
+; CHECK-NEXT: [[TMP8:%.*]] = add nuw nsw i64 [[TMP7]], 1
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[TMP6]], 3
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
; CHECK: vector.memcheck:
-; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[TMP1]], 2
-; CHECK-NEXT: [[SCEVGEP2:%.*]] = getelementptr i8, ptr [[VAR2]], i64 [[TMP6]]
-; CHECK-NEXT: [[TMP7:%.*]] = xor i32 [[J_022]], -1
-; CHECK-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], [[ITR]]
-; CHECK-NEXT: [[TMP9:%.*]] = zext i32 [[TMP8]] to i64
-; CHECK-NEXT: [[TMP10:%.*]] = add nuw nsw i64 [[TMP1]], [[TMP9]]
-; CHECK-NEXT: [[TMP11:%.*]] = shl nuw nsw i64 [[TMP10]], 2
-; CHECK-NEXT: [[SCEVGEP4:%.*]] = getelementptr i8, ptr [[SCEVGEP3]], i64 [[TMP11]]
-; CHECK-NEXT: [[BOUND0:%.*]] = icmp ult ptr [[SCEVGEP]], [[SCEVGEP4]]
-; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[SCEVGEP2]], [[GEP6]]
+; CHECK-NEXT: [[TMP9:%.*]] = shl nuw nsw i64 [[TMP4]], 2
+; CHECK-NEXT: [[SCEVGEP1:%.*]] = getelementptr i8, ptr [[VAR2]], i64 [[TMP9]]
+; CHECK-NEXT: [[TMP10:%.*]] = xor i32 [[J_022]], -1
+; CHECK-NEXT: [[TMP11:%.*]] = add i32 [[TMP10]], [[ITR]]
+; CHECK-NEXT: [[TMP12:%.*]] = zext i32 [[TMP11]] to i64
+; CHECK-NEXT: [[TMP13:%.*]] = add nuw nsw i64 [[TMP4]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = shl nuw nsw i64 [[TMP13]], 2
+; CHECK-NEXT: [[SCEVGEP3:%.*]] = getelementptr i8, ptr [[SCEVGEP2]], i64 [[TMP14]]
+; CHECK-NEXT: [[BOUND0:%.*]] = icmp ugt ptr [[SCEVGEP3]], [[VAR1]]
+; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[SCEVGEP1]], [[SCEVGEP]]
; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
; CHECK-NEXT: br i1 [[FOUND_CONFLICT]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
-; CHECK-NEXT: [[N_VEC:%.*]] = and i64 [[TMP5]], 8589934588
-; CHECK-NEXT: [[IND_END:%.*]] = add nuw nsw i64 [[N_VEC]], [[TMP1]]
-; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, i32 [[ARRAYIDX5_PROMOTED]], i64 0
-; CHECK-NEXT: [[INVARIANT_GEP:%.*]] = getelementptr i32, ptr [[VAR2]], i64 [[TMP1]]
+; CHECK-NEXT: [[N_VEC:%.*]] = and i64 [[TMP8]], 8589934588
+; CHECK-NEXT: [[IND_END:%.*]] = add nuw nsw i64 [[N_VEC]], [[TMP4]]
+; CHECK-NEXT: [[TMP15:%.*]] = insertelement <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, i32 [[ARRAYIDX5_PROMOTED]], i64 0
+; CHECK-NEXT: [[INVARIANT_GEP:%.*]] = getelementptr i32, ptr [[VAR2]], i64 [[TMP4]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ [[TMP12]], [[VECTOR_PH]] ], [ [[TMP14:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ [[TMP15]], [[VECTOR_PH]] ], [ [[TMP17:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[GEP:%.*]] = getelementptr i32, ptr [[INVARIANT_GEP]], i64 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[GEP]], align 4, !alias.scope !23
-; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[VEC_PHI]], [[WIDE_LOAD]]
-; CHECK-NEXT: [[TMP14]] = add <4 x i32> [[TMP13]], <i32 1, i32 1, i32 1, i32 1>
+; CHECK-NEXT: [[TMP16:%.*]] = add <4 x i32> [[VEC_PHI]], [[WIDE_LOAD]]
+; CHECK-NEXT: [[TMP17]] = add <4 x i32> [[TMP16]], <i32 1, i32 1, i32 1, i32 1>
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
-; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT: br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
+; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi <4 x i32> [ [[TMP14]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP16:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[DOTLCSSA]])
-; CHECK-NEXT: store i32 [[TMP16]], ptr [[ARRAYIDX5]], align 4
-; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP5]], [[N_VEC]]
+; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi <4 x i32> [ [[TMP17]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP19:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[DOTLCSSA]])
+; CHECK-NEXT: store i32 [[TMP19]], ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP8]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_INC8_LOOPEXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
-; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ [[TMP1]], [[FOR_BODY3_LR_PH]] ], [ [[TMP1]], [[VECTOR_MEMCHECK]] ]
-; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP16]], [[MIDDLE_BLOCK]] ], [ [[ARRAYIDX5_PROMOTED]], [[FOR_BODY3_LR_PH]] ], [ [[ARRAYIDX5_PROMOTED]], [[VECTOR_MEMCHECK]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ [[TMP4]], [[FOR_BODY3_LR_PH]] ], [ [[TMP4]], [[VECTOR_MEMCHECK]] ]
+; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP19]], [[MIDDLE_BLOCK]] ], [ [[ARRAYIDX5_PROMOTED]], [[FOR_BODY3_LR_PH]] ], [ [[ARRAYIDX5_PROMOTED]], [[VECTOR_MEMCHECK]] ]
; CHECK-NEXT: br label [[FOR_BODY3:%.*]]
; CHECK: for.body3:
-; CHECK-NEXT: [[TMP17:%.*]] = phi i32 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[TMP19:%.*]], [[FOR_BODY3]] ]
+; CHECK-NEXT: [[TMP20:%.*]] = phi i32 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[TMP22:%.*]], [[FOR_BODY3]] ]
; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY3]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[VAR2]], i64 [[INDVARS_IV]]
-; CHECK-NEXT: [[TMP18:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP17]], [[TMP18]]
-; CHECK-NEXT: [[TMP19]] = add nsw i32 [[ADD]], 1
-; CHECK-NEXT: store i32 [[TMP19]], ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT: [[TMP21:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP20]], [[TMP21]]
+; CHECK-NEXT: [[TMP22]] = add nsw i32 [[ADD]], 1
+; CHECK-NEXT: store i32 [[TMP22]], ptr [[ARRAYIDX5]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[ITR]]
diff --git a/llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll b/llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll
index 47104645c18a63e..fc6dcc3d278c916 100644
--- a/llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll
+++ b/llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll
@@ -1,5 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt -passes=loop-vectorize -force-vector-width=4 -S < %s | FileCheck %s
+; RUN: opt -passes=loop-vectorize -force-vector-width=4 -hoist-runtime-checks=false -S < %s | FileCheck %s --check-prefix=CHECK
+; RUN: opt -passes=loop-vectorize -force-vector-width=4 -S < %s | FileCheck %s --check-prefix=CHECK-HOIST
; This is the test case from PR26314.
; When we were retrying dependence checking with memchecks only,
@@ -31,29 +32,29 @@ define void @Test(ptr nocapture %obj, i64 %z) #0 {
; CHECK-LABEL: @Test(
; CHECK-NEXT: [[TMP1:%.*]] = shl i64 [[Z:%.*]], 2
; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[TMP1]], 256
-; CHECK-NEXT: [[UGLYGEP2:%.*]] = getelementptr i8, ptr [[OBJ:%.*]], i64 [[TMP1]]
+; CHECK-NEXT: [[SCEVGEP2:%.*]] = getelementptr i8, ptr [[OBJ:%.*]], i64 [[TMP1]]
; CHECK-NEXT: br label [[DOTOUTER_PREHEADER:%.*]]
; CHECK: .outer.preheader:
; CHECK-NEXT: [[I:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[I_NEXT:%.*]], [[DOTOUTER:%.*]] ]
; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[I]], 7
; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[TMP3]], 256
-; CHECK-NEXT: [[UGLYGEP:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP4]]
+; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP4]]
; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[TMP2]], [[TMP3]]
-; CHECK-NEXT: [[UGLYGEP1:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP5]]
+; CHECK-NEXT: [[SCEVGEP1:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP5]]
; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[I]], 2
; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[TMP6]], 128
-; CHECK-NEXT: [[UGLYGEP3:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP7]]
+; CHECK-NEXT: [[SCEVGEP3:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[TMP6]], 132
-; CHECK-NEXT: [[UGLYGEP4:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP8]]
+; CHECK-NEXT: [[SCEVGEP4:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP8]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], ptr [[OBJ]], i64 0, i32 1, i64 [[I]]
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[Z]], 4
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
; CHECK: vector.memcheck:
-; CHECK-NEXT: [[BOUND0:%.*]] = icmp ult ptr [[UGLYGEP]], [[UGLYGEP2]]
-; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[OBJ]], [[UGLYGEP1]]
+; CHECK-NEXT: [[BOUND0:%.*]] = icmp ult ptr [[SCEVGEP]], [[SCEVGEP2]]
+; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[OBJ]], [[SCEVGEP1]]
; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
-; CHECK-NEXT: [[BOUND05:%.*]] = icmp ult ptr [[UGLYGEP]], [[UGLYGEP4]]
-; CHECK-NEXT: [[BOUND16:%.*]] = icmp ult ptr [[UGLYGEP3]], [[UGLYGEP1]]
+; CHECK-NEXT: [[BOUND05:%.*]] = icmp ult ptr [[SCEVGEP]], [[SCEVGEP4]]
+; CHECK-NEXT: [[BOUND16:%.*]] = icmp ult ptr [[SCEVGEP3]], [[SCEVGEP1]]
; CHECK-NEXT: [[FOUND_CONFLICT7:%.*]] = and i1 [[BOUND05]], [[BOUND16]]
; CHECK-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[FOUND_CONFLICT]], [[FOUND_CONFLICT7]]
; CHECK-NEXT: br i1 [[CONFLICT_RDX]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
@@ -103,7 +104,74 @@ define void @Test(ptr nocapture %obj, i64 %z) #0 {
; CHECK-NEXT: store i32 [[TMP25]], ptr [[TMP23]], align 4
; CHECK-NEXT: [[J_NEXT]] = add nuw nsw i64 [[J]], 1
; CHECK-NEXT: [[EXITCOND_INNER:%.*]] = icmp eq i64 [[J_NEXT]], [[Z]]
-; CHECK-NEXT: br i1 [[EXITCOND_INNER]], label [[DOTOUTER]], label [[DOTINNER]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND_INNER]], label [[DOTOUTER]], label [[DOTINNER]], !llvm.loop [[LOOP11:![0-9]+]]
+;
+; CHECK-HOIST-LABEL: @Test(
+; CHECK-HOIST-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[OBJ:%.*]], i64 256
+; CHECK-HOIST-NEXT: [[TMP1:%.*]] = shl i64 [[Z:%.*]], 2
+; CHECK-HOIST-NEXT: [[TMP2:%.*]] = add i64 [[TMP1]], 4224
+; CHECK-HOIST-NEXT: [[SCEVGEP1:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP2]]
+; CHECK-HOIST-NEXT: [[SCEVGEP2:%.*]] = getelementptr i8, ptr [[OBJ]], i64 [[TMP1]]
+; CHECK-HOIST-NEXT: [[SCEVGEP3:%.*]] = getelementptr i8, ptr [[OBJ]], i64 128
+; CHECK-HOIST-NEXT: br label [[DOTOUTER_PREHEADER:%.*]]
+; CHECK-HOIST: .outer.preheader:
+; CHECK-HOIST-NEXT: [[I:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[I_NEXT:%.*]], [[DOTOUTER:%.*]] ]
+; CHECK-HOIST-NEXT: [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], ptr [[OBJ]], i64 0, i32 1, i64 [[I]]
+; CHECK-HOIST-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[Z]], 4
+; CHECK-HOIST-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
+; CHECK-HOIST: vector.memcheck:
+; CHECK-HOIST-NEXT: [[BOUND0:%.*]] = icmp ult ptr [[SCEVGEP]], [[SCEVGEP2]]
+; CHECK-HOIST-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[OBJ]], [[SCEVGEP1]]
+; CHECK-HOIST-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-HOIST-NEXT: [[BOUND14:%.*]] = icmp ult ptr [[SCEVGEP3]], [[SCEVGEP1]]
+; CHECK-HOIST-NEXT: br i1 [[FOUND_CONFLICT]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
+; CHECK-HOIST: vector.ph:
+; CHECK-HOIST-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[Z]], 4
+; CHECK-HOIST-NEXT: [[N_VEC:%.*]] = sub i64 [[Z]], [[N_MOD_VF]]
+; CHECK-HOIST-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK-HOIST: vector.body:
+; CHECK-HOIST-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-HOIST-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0
+; CHECK-HOIST-NEXT: [[TMP5:%.*]] = getelementptr inbounds [[STRUCT_S]], ptr [[OBJ]], i64 0, i32 0, i64 [[TMP4]]
+; CHECK-HOIST-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[TMP5]], i32 0
+; CHECK-HOIST-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP6]], align 4, !alias.scope !0
+; CHECK-HOIST-NEXT: [[TMP7:%.*]] = load i32, ptr [[TMP3]], align 4, !alias.scope !3
+; CHECK-HOIST-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP7]], i64 0
+; CHECK-HOIST-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-HOIST-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[BROADCAST_SPLAT]], [[WIDE_LOAD]]
+; CHECK-HOIST-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[STRUCT_S]], ptr [[OBJ]], i64 0, i32 2, i64 [[I]], i64 [[TMP4]]
+; CHECK-HOIST-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i32 0
+; CHECK-HOIST-NEXT: [[WIDE_LOAD5:%.*]] = load <4 x i32>, ptr [[TMP10]], align 4, !alias.scope !5, !noalias !7
+; CHECK-HOIST-NEXT: [[TMP11:%.*]] = add nsw <4 x i32> [[TMP8]], [[WIDE_LOAD5]]
+; CHECK-HOIST-NEXT: store <4 x i32> [[TMP11]], ptr [[TMP10]], align 4, !alias.scope !5, !noalias !7
+; CHECK-HOIST-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-HOIST-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-HOIST-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-HOIST: middle.block:
+; CHECK-HOIST-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[Z]], [[N_VEC]]
+; CHECK-HOIST-NEXT: br i1 [[CMP_N]], label [[DOTOUTER]], label [[SCALAR_PH]]
+; CHECK-HOIST: scalar.ph:
+; CHECK-HOIST-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[DOTOUTER_PREHEADER]] ], [ 0, [[VECTOR_MEMCHECK]] ]
+; CHECK-HOIST-NEXT: br label [[DOTINNER:%.*]]
+; CHECK-HOIST: .exit:
+; CHECK-HOIST-NEXT: ret void
+; CHECK-HOIST: .outer:
+; CHECK-HOIST-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
+; CHECK-HOIST-NEXT: [[EXITCOND_OUTER:%.*]] = icmp eq i64 [[I_NEXT]], 32
+; CHECK-HOIST-NEXT: br i1 [[EXITCOND_OUTER]], label [[DOTEXIT:%.*]], label [[DOTOUTER_PREHEADER]]
+; CHECK-HOIST: .inner:
+; CHECK-HOIST-NEXT: [[J:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[J_NEXT:%.*]], [[DOTINNER]] ]
+; CHECK-HOIST-NEXT: [[TMP13:%.*]] = getelementptr inbounds [[STRUCT_S]], ptr [[OBJ]], i64 0, i32 0, i64 [[J]]
+; CHECK-HOIST-NEXT: [[TMP14:%.*]] = load i32, ptr [[TMP13]], align 4
+; CHECK-HOIST-NEXT: [[TMP15:%.*]] = load i32, ptr [[TMP3]], align 4
+; CHECK-HOIST-NEXT: [[TMP16:%.*]] = add nsw i32 [[TMP15]], [[TMP14]]
+; CHECK-HOIST-NEXT: [[TMP17:%.*]] = getelementptr inbounds [[STRUCT_S]], ptr [[OBJ]], i64 0, i32 2, i64 [[I]], i64 [[J]]
+; CHECK-HOIST-NEXT: [[TMP18:%.*]] = load i32, ptr [[TMP17]], align 4
+; CHECK-HOIST-NEXT: [[TMP19:%.*]] = add nsw i32 [[TMP16]], [[TMP18]]
+; CHECK-HOIST-NEXT: store i32 [[TMP19]], ptr [[TMP17]], align 4
+; CHECK-HOIST-NEXT: [[J_NEXT]] = add nuw nsw i64 [[J]], 1
+; CHECK-HOIST-NEXT: [[EXITCOND_INNER:%.*]] = icmp eq i64 [[J_NEXT]], [[Z]]
+; CHECK-HOIST-NEXT: br i1 [[EXITCOND_INNER]], label [[DOTOUTER]], label [[DOTINNER]], !llvm.loop [[LOOP11:![0-9]+]]
;
br label %.outer.preheader
diff --git a/llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll b/llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll
index bdafe9f4daaaede..08b630e70d23363 100644
--- a/llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll
+++ b/llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll
@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt %s -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -S | FileCheck %s
+; RUN: opt %s -passes=loop-vectorize -hoist-runtime-checks=false -force-vector-width=4 -force-vector-interleave=1 -S | FileCheck %s
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
diff --git a/llvm/test/Tra...
[truncated]
I can confirm that the embedded benchmarks I have tried don't show any problems from this after the recent fix; they were all improving where they changed.
@@ -13,9 +13,6 @@ target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f3
 ; address.
-; memory check is found.conflict = b[max(n-1,1)] > a && (ptr a)+1 > (ptr b)
Why is this removed? It looks like there is a memcheck below.
That's a good question! I wasn't sure what to do here, because the exact contents of found.conflict depend upon whether you're trying to hoist the runtime memory checks or not. Perhaps it was just lazy of me to remove it - I'll update the comment to reflect what it's actually checking now.
In the original version of my patch I removed a comment about memory checks, thinking that the check had now changed with my patch. However, the comment refers to a function that does not have a nested loop, so the checks don't actually change.
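For reference, the conflict predicate itself keeps the usual interval-overlap shape visible in the CHECK lines above (BOUND0, BOUND1, FOUND_CONFLICT); a minimal C rendering, with hypothetical names:

```c
#include <stdbool.h>

/* Sketch of the overlap predicate the vectorizer's memcheck computes;
 * names are illustrative, not LLVM API. Two ranges [a_start, a_end) and
 * [b_start, b_end) conflict iff each one starts below the other's end. */
static bool found_conflict(const char *a_start, const char *a_end,
                           const char *b_start, const char *b_end) {
  bool bound0 = a_start < b_end;  /* cf. the icmp ult for BOUND0 */
  bool bound1 = b_start < a_end;  /* cf. the icmp ult for BOUND1 */
  return bound0 && bound1;        /* FOUND_CONFLICT = and i1 */
}
```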
Thanks for sharing the patch! I'll run some perf tests on my side; they should hopefully be done by Monday.
Hi @fhahn, good morning! Did you get a chance to look at this?
Hi, sorry for the delay, there were some issues with our perf testing infrastructure. But the benchmarks are running now, and I will share the results soon hopefully. Please ping me again tomorrow if I forget to respond by then :)
There's one internal benchmark where there may be a ~2% regression with this patch (and extra vectorizer passes enabled). I'll confirm tomorrow whether this is actually caused by the patch or is just noise.
I had a look, and there are some cases where the more lightweight diff checks are more profitable than the hoisted checks after 32d1197. One particular case is when the start of the AddRecs in the inner loop is the same AddRec in the outer loop.
It would probably be good to refine the heuristics for when to pick hoisting vs diff checks.
Hmm, I guess the trick is to do this in a way that doesn't make x264 slower, because I know one of the loops in x264 that benefits from this does have lightweight diff checks. Even diff checks are still expensive for short inner loops. Do you know of any specific benchmarks where this is a problem?
Anyway, I'm happy to take a further look at this and see if I can get the best of both worlds. If necessary, the only alternative is to enable this under a TTI hook, but I was hoping to avoid that because it's a very invasive change.
For cases like this one (where the start of the inner AddRecs is the same outer-loop AddRec), the difference should also become loop invariant: the diff checks subtract the starts, so if the outer AddRecs are the same, this simplifies to the difference of their starts, which is invariant.
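A sketch of that cancellation, with illustrative names: if both inner AddRecs start at the same kind of outer AddRec, say base_a + i*M and base_b + i*M, the subtraction in the diff check drops the i*M term entirely:

```c
#include <stdint.h>

/* Sketch only: the diff-check operand when both inner AddRecs start at
 * an outer AddRec with the same step M. The i*M byte offsets cancel, so
 * the result does not depend on the outer induction variable i and can
 * be computed once outside the nest. */
intptr_t diff_check_operand(int *base_a, int *base_b, intptr_t i, intptr_t M) {
  intptr_t start_a = (intptr_t)(base_a + i * M); /* start of inner AddRec A */
  intptr_t start_b = (intptr_t)(base_b + i * M); /* start of inner AddRec B */
  return start_b - start_a; /* i*M cancels: invariant in i */
}
```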
Ah, thanks @fhahn, that makes sense then!
Gentle ping! #73515 has now landed, so I think this patch should be ready to go.
With that fixed, and from the perf I've seen, this LGTM. Thanks!
LGTM with the earlier adjustments, thanks!