[LoopVectorize] Add support for reverse loops in isDereferenceableAndAlignedInLoop #96752
Conversation
@llvm/pr-subscribers-backend-systemz @llvm/pr-subscribers-llvm-analysis

Author: David Sherwood (david-arm)

Changes: Currently when we encounter a negative step in the induction variable, isDereferenceableAndAlignedInLoop bails out because the element size is signed greater than the step. The motivation for this patch comes from PR #88385 where a reviewer requested reusing isDereferenceableAndAlignedInLoop, but that PR itself does support reverse loops.

Patch is 22.41 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96752.diff

2 Files Affected:
diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index 2b8197066e8e9..18431da682891 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -293,7 +293,9 @@ bool llvm::isDereferenceableAndAlignedInLoop(LoadInst *LI, Loop *L,
// TODO: Handle overlapping accesses.
// We should be computing AccessSize as (TC - 1) * Step + EltSize.
- if (EltSize.sgt(Step->getAPInt()))
+ bool StepIsNegative = Step->getAPInt().isNegative();
+ APInt AbsStep = Step->getAPInt().abs();
+ if (EltSize.ugt(AbsStep))
return false;
// Compute the total access size for access patterns with unit stride and
@@ -301,13 +303,14 @@ bool llvm::isDereferenceableAndAlignedInLoop(LoadInst *LI, Loop *L,
// same.
// For patterns with gaps (i.e. non unit stride), we are
// accessing EltSize bytes at every Step.
- APInt AccessSize = TC * Step->getAPInt();
+ APInt AccessSize = TC * AbsStep;
assert(SE.isLoopInvariant(AddRec->getStart(), L) &&
"implied by addrec definition");
Value *Base = nullptr;
if (auto *StartS = dyn_cast<SCEVUnknown>(AddRec->getStart())) {
- Base = StartS->getValue();
+ if (!StepIsNegative)
+ Base = StartS->getValue();
} else if (auto *StartS = dyn_cast<SCEVAddExpr>(AddRec->getStart())) {
// Handle (NewBase + offset) as start value.
const auto *Offset = dyn_cast<SCEVConstant>(StartS->getOperand(0));
@@ -318,11 +321,24 @@ bool llvm::isDereferenceableAndAlignedInLoop(LoadInst *LI, Loop *L,
// TODO: generalize if a case found which warrants
if (Offset->getAPInt().urem(Alignment.value()) != 0)
return false;
+ if (StepIsNegative) {
+ // In the last iteration of the loop the address we access will be
+ // lower than the first by (TC - 1) * Step. So we need to make sure
+ // that there is enough room in Offset to accommodate this.
+ APInt SubOffset = (TC - 1) * AbsStep;
+ if (Offset->getAPInt().ult(SubOffset))
+ return false;
+ // We can safely use the new base because the decrementing pointer is
+ // always guaranteed to be >= new base. The total access size needs to
+ // take into account the start offset and the loaded element size.
+ AccessSize = Offset->getAPInt() + EltSize;
+ } else {
+ bool Overflow = false;
+ AccessSize = AccessSize.uadd_ov(Offset->getAPInt(), Overflow);
+ if (Overflow)
+ return false;
+ }
Base = NewBase->getValue();
- bool Overflow = false;
- AccessSize = AccessSize.uadd_ov(Offset->getAPInt(), Overflow);
- if (Overflow)
- return false;
}
}
diff --git a/llvm/test/Transforms/LoopVectorize/load-deref-pred-align.ll b/llvm/test/Transforms/LoopVectorize/load-deref-pred-align.ll
index 8d4be05a4390e..88802a48afa7a 100644
--- a/llvm/test/Transforms/LoopVectorize/load-deref-pred-align.ll
+++ b/llvm/test/Transforms/LoopVectorize/load-deref-pred-align.ll
@@ -17,7 +17,7 @@ define i16 @test_access_size_not_multiple_of_align(i64 %len, ptr %test_base) {
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE2:%.*]] ]
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[PRED_LOAD_CONTINUE2]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP15:%.*]], [[PRED_LOAD_CONTINUE2]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[TEST_BASE:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
@@ -43,16 +43,16 @@ define i16 @test_access_size_not_multiple_of_align(i64 %len, ptr %test_base) {
; CHECK: pred.load.continue2:
; CHECK-NEXT: [[TMP14:%.*]] = phi <2 x i16> [ [[TMP8]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP13]], [[PRED_LOAD_IF1]] ]
; CHECK-NEXT: [[PREDPHI:%.*]] = select <2 x i1> [[TMP3]], <2 x i16> [[TMP14]], <2 x i16> zeroinitializer
-; CHECK-NEXT: [[TMP16]] = add <2 x i16> [[VEC_PHI]], [[PREDPHI]]
+; CHECK-NEXT: [[TMP15]] = add <2 x i16> [[VEC_PHI]], [[PREDPHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
-; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
-; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
+; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[TMP18:%.*]] = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> [[TMP16]])
+; CHECK-NEXT: [[TMP17:%.*]] = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> [[TMP15]])
; CHECK-NEXT: br i1 true, label [[LOOP_EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i16 [ 0, [[ENTRY]] ], [ [[TMP18]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i16 [ 0, [[ENTRY]] ], [ [[TMP17]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LATCH:%.*]] ]
@@ -72,7 +72,7 @@ define i16 @test_access_size_not_multiple_of_align(i64 %len, ptr %test_base) {
; CHECK-NEXT: [[EXIT:%.*]] = icmp eq i64 [[IV]], 4095
; CHECK-NEXT: br i1 [[EXIT]], label [[LOOP_EXIT]], label [[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: loop_exit:
-; CHECK-NEXT: [[ACCUM_NEXT_LCSSA:%.*]] = phi i16 [ [[ACCUM_NEXT]], [[LATCH]] ], [ [[TMP18]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[ACCUM_NEXT_LCSSA:%.*]] = phi i16 [ [[ACCUM_NEXT]], [[LATCH]] ], [ [[TMP17]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i16 [[ACCUM_NEXT_LCSSA]]
;
entry:
@@ -114,7 +114,7 @@ define i32 @test_access_size_multiple_of_align_but_offset_by_1(i64 %len, ptr %te
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE2:%.*]] ]
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[PRED_LOAD_CONTINUE2]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP15:%.*]], [[PRED_LOAD_CONTINUE2]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[TEST_BASE:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
@@ -140,16 +140,16 @@ define i32 @test_access_size_multiple_of_align_but_offset_by_1(i64 %len, ptr %te
; CHECK: pred.load.continue2:
; CHECK-NEXT: [[TMP14:%.*]] = phi <2 x i32> [ [[TMP8]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP13]], [[PRED_LOAD_IF1]] ]
; CHECK-NEXT: [[PREDPHI:%.*]] = select <2 x i1> [[TMP3]], <2 x i32> [[TMP14]], <2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP16]] = add <2 x i32> [[VEC_PHI]], [[PREDPHI]]
+; CHECK-NEXT: [[TMP15]] = add <2 x i32> [[VEC_PHI]], [[PREDPHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
-; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
-; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
+; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[TMP18:%.*]] = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> [[TMP16]])
+; CHECK-NEXT: [[TMP17:%.*]] = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> [[TMP15]])
; CHECK-NEXT: br i1 true, label [[LOOP_EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[TMP18]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[TMP17]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LATCH:%.*]] ]
@@ -169,7 +169,7 @@ define i32 @test_access_size_multiple_of_align_but_offset_by_1(i64 %len, ptr %te
; CHECK-NEXT: [[EXIT:%.*]] = icmp eq i64 [[IV]], 4095
; CHECK-NEXT: br i1 [[EXIT]], label [[LOOP_EXIT]], label [[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: loop_exit:
-; CHECK-NEXT: [[ACCUM_NEXT_LCSSA:%.*]] = phi i32 [ [[ACCUM_NEXT]], [[LATCH]] ], [ [[TMP18]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[ACCUM_NEXT_LCSSA:%.*]] = phi i32 [ [[ACCUM_NEXT]], [[LATCH]] ], [ [[TMP17]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i32 [[ACCUM_NEXT_LCSSA]]
;
entry:
@@ -198,3 +198,230 @@ latch:
loop_exit:
ret i32 %accum.next
}
+
+; Test reverse loops where we can prove loads in predicated blocks are safe
+; to load unconditionally.
+define void @test_rev_loops_deref_loads(ptr nocapture noundef writeonly %dest) {
+; CHECK-LABEL: @test_rev_loops_deref_loads(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[LOCAL_DEST:%.*]] = alloca [1024 x i32], align 4
+; CHECK-NEXT: [[LOCAL_SRC:%.*]] = alloca [1024 x i32], align 4
+; CHECK-NEXT: [[LOCAL_CMP:%.*]] = alloca [1024 x i32], align 4
+; CHECK-NEXT: call void @init(ptr [[LOCAL_SRC]])
+; CHECK-NEXT: call void @init(ptr [[LOCAL_CMP]])
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE4:%.*]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = sub i64 1023, [[INDEX]]
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_CMP]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[TMP2]], i32 -1
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP3]], align 4
+; CHECK-NEXT: [[REVERSE:%.*]] = shufflevector <2 x i32> [[WIDE_LOAD]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
+; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <2 x i32> [[REVERSE]], <i32 3, i32 3>
+; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i1> [[TMP4]], <i1 true, i1 true>
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr [1024 x i32], ptr [[LOCAL_SRC]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr i32, ptr [[TMP6]], i32 0
+; CHECK-NEXT: [[TMP8:%.*]] = getelementptr i32, ptr [[TMP7]], i32 -1
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <2 x i32>, ptr [[TMP8]], align 4
+; CHECK-NEXT: [[REVERSE2:%.*]] = shufflevector <2 x i32> [[WIDE_LOAD1]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
+; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x i1> [[TMP5]], i32 0
+; CHECK-NEXT: br i1 [[TMP9]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
+; CHECK: pred.store.if:
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_DEST]], i64 0, i64 [[TMP0]]
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i32> [[REVERSE2]], i32 0
+; CHECK-NEXT: [[TMP12:%.*]] = shl nsw i32 [[TMP11]], 2
+; CHECK-NEXT: store i32 [[TMP12]], ptr [[TMP10]], align 4
+; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
+; CHECK: pred.store.continue:
+; CHECK-NEXT: [[TMP13:%.*]] = extractelement <2 x i1> [[TMP5]], i32 1
+; CHECK-NEXT: br i1 [[TMP13]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4]]
+; CHECK: pred.store.if3:
+; CHECK-NEXT: [[TMP14:%.*]] = add i64 [[OFFSET_IDX]], -1
+; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_DEST]], i64 0, i64 [[TMP14]]
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[REVERSE2]], i32 1
+; CHECK-NEXT: [[TMP17:%.*]] = shl nsw i32 [[TMP16]], 2
+; CHECK-NEXT: store i32 [[TMP17]], ptr [[TMP15]], align 4
+; CHECK-NEXT: br label [[PRED_STORE_CONTINUE4]]
+; CHECK: pred.store.continue4:
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
+; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ -1, [[MIDDLE_BLOCK]] ], [ 1023, [[ENTRY:%.*]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_CMP]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP19:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[CMP3_NOT:%.*]] = icmp eq i32 [[TMP19]], 3
+; CHECK-NEXT: br i1 [[CMP3_NOT]], label [[FOR_INC]], label [[IF_THEN:%.*]]
+; CHECK: if.then:
+; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_SRC]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP20:%.*]] = load i32, ptr [[ARRAYIDX5]], align 4
+; CHECK-NEXT: [[MUL:%.*]] = shl nsw i32 [[TMP20]], 2
+; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_DEST]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: store i32 [[MUL]], ptr [[ARRAYIDX7]], align 4
+; CHECK-NEXT: br label [[FOR_INC]]
+; CHECK: for.inc:
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
+; CHECK-NEXT: [[CMP2_NOT:%.*]] = icmp eq i64 [[INDVARS_IV]], 0
+; CHECK-NEXT: br i1 [[CMP2_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK: for.cond.cleanup:
+; CHECK-NEXT: call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(1024) [[DEST:%.*]], ptr noundef nonnull align 4 dereferenceable(1024) [[LOCAL_DEST]], i64 1024, i1 false)
+; CHECK-NEXT: ret void
+;
+entry:
+ %local_dest = alloca [1024 x i32], align 4
+ %local_src = alloca [1024 x i32], align 4
+ %local_cmp = alloca [1024 x i32], align 4
+ call void @init(ptr %local_src)
+ call void @init(ptr %local_cmp)
+ br label %for.body
+
+for.body:
+ %indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %for.inc ]
+ %arrayidx = getelementptr inbounds [1024 x i32], ptr %local_cmp, i64 0, i64 %indvars.iv
+ %0 = load i32, ptr %arrayidx, align 4
+ %cmp3.not = icmp eq i32 %0, 3
+ br i1 %cmp3.not, label %for.inc, label %if.then
+
+if.then:
+ %arrayidx5 = getelementptr inbounds [1024 x i32], ptr %local_src, i64 0, i64 %indvars.iv
+ %1 = load i32, ptr %arrayidx5, align 4
+ %mul = shl nsw i32 %1, 2
+ %arrayidx7 = getelementptr inbounds [1024 x i32], ptr %local_dest, i64 0, i64 %indvars.iv
+ store i32 %mul, ptr %arrayidx7, align 4
+ br label %for.inc
+
+for.inc:
+ %indvars.iv.next = add nsw i64 %indvars.iv, -1
+ %cmp2.not = icmp eq i64 %indvars.iv, 0
+ br i1 %cmp2.not, label %for.cond.cleanup, label %for.body
+
+for.cond.cleanup:
+ call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(1024) %dest, ptr noundef nonnull align 4 dereferenceable(1024) %local_dest, i64 1024, i1 false)
+ ret void
+}
+
+
+; Test reverse loops where we *cannot* prove loads in predicated blocks are safe
+; to load unconditionally.
+define void @test_rev_loops_non_deref_loads(ptr nocapture noundef writeonly %dest) {
+; CHECK-LABEL: @test_rev_loops_non_deref_loads(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[LOCAL_DEST:%.*]] = alloca [1024 x i32], align 4
+; CHECK-NEXT: [[LOCAL_SRC:%.*]] = alloca [1024 x i32], align 4
+; CHECK-NEXT: [[LOCAL_CMP:%.*]] = alloca [1024 x i32], align 4
+; CHECK-NEXT: call void @init(ptr [[LOCAL_SRC]])
+; CHECK-NEXT: call void @init(ptr [[LOCAL_CMP]])
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE2:%.*]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 1023, i64 1022>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE2]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = add <2 x i64> [[VEC_IND]], <i64 -1, i64 -1>
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i64> [[TMP0]], i32 0
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_CMP]], i64 0, i64 [[TMP1]]
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[TMP2]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[TMP3]], i32 -1
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP4]], align 4
+; CHECK-NEXT: [[REVERSE:%.*]] = shufflevector <2 x i32> [[WIDE_LOAD]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
+; CHECK-NEXT: [[TMP5:%.*]] = icmp eq <2 x i32> [[REVERSE]], <i32 3, i32 3>
+; CHECK-NEXT: [[TMP6:%.*]] = xor <2 x i1> [[TMP5]], <i1 true, i1 true>
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i1> [[TMP6]], i32 0
+; CHECK-NEXT: br i1 [[TMP7]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
+; CHECK: pred.store.if:
+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i64> [[TMP0]], i32 0
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_SRC]], i64 0, i64 [[TMP8]]
+; CHECK-NEXT: [[TMP10:%.*]] = load i32, ptr [[TMP9]], align 4
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[TMP0]], i32 0
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_DEST]], i64 0, i64 [[TMP11]]
+; CHECK-NEXT: [[TMP13:%.*]] = shl nsw i32 [[TMP10]], 2
+; CHECK-NEXT: store i32 [[TMP13]], ptr [[TMP12]], align 4
+; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
+; CHECK: pred.store.continue:
+; CHECK-NEXT: [[TMP14:%.*]] = phi i32 [ poison, [[VECTOR_BODY]] ], [ [[TMP10]], [[PRED_STORE_IF]] ]
+; CHECK-NEXT: [[TMP15:%.*]] = extractelement <2 x i1> [[TMP6]], i32 1
+; CHECK-NEXT: br i1 [[TMP15]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2]]
+; CHECK: pred.store.if1:
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i64> [[TMP0]], i32 1
+; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_SRC]], i64 0, i64 [[TMP16]]
+; CHECK-NEXT: [[TMP18:%.*]] = load i32, ptr [[TMP17]], align 4
+; CHECK-NEXT: [[TMP19:%.*]] = extractelement <2 x i64> [[TMP0]], i32 1
+; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_DEST]], i64 0, i64 [[TMP19]]
+; CHECK-NEXT: [[TMP21:%.*]] = shl nsw i32 [[TMP18]], 2
+; CHECK-NEXT: store i32 [[TMP21]], ptr [[TMP20]], align 4
+; CHECK-NEXT: br label [[PRED_STORE_CONTINUE2]]
+; CHECK: pred.store.continue2:
+; CHECK-NEXT: [[TMP22:%.*]] = phi i32 [ poison, [[PRED_STORE_CONTINUE]] ], [ [[TMP18]], [[PRED_STORE_IF1]] ]
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <2 x i64> [[VEC_IND]], <i64 -2, i64 -2>
+; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
+; CHECK-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ -1, [[MIDDLE_BLOCK]] ], [ 1023, [[ENTRY:%.*]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]
+; CHECK-NEXT: [[OFF:%.*]] = add i64 [[INDVARS_IV]], -1
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [1024 x i32], ptr [[LOCAL_CMP]], i64 0, i64 [[OFF]]
+; CHECK-NEXT: [[TMP24:%.*...
[truncated]
llvm/lib/Analysis/Loads.cpp
Outdated
// We can safely use the new base because the decrementing pointer is
// always guaranteed to be >= new base. The total access size needs to
// take into account the start offset and the loaded element size.
AccessSize = Offset->getAPInt() + EltSize; |
Does this need an overflow check and bail out in case we have a narrow IV where incrementing by EltSize wraps? Alternatively we could probably also perform the computation always in something like i64?
Good point! I've added the overflow check.
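For reference, a minimal sketch of the kind of overflow-checked addition being discussed, using APInt::uadd_ov (Offset and EltSize stand in for the values from the hunk above; both are assumed to have the same bit width):

#include "llvm/ADT/APInt.h"
using namespace llvm;

// Sketch only: compute Offset + EltSize and report failure instead of letting
// the unsigned addition silently wrap.
static bool computeAccessSizeChecked(const APInt &Offset, const APInt &EltSize,
                                     APInt &AccessSize) {
  bool Overflow = false;
  AccessSize = Offset.uadd_ov(EltSize, Overflow);
  return !Overflow; // Caller bails out of the dereferenceability proof on wrap.
}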
Thanks for the update! Do you think it would be possible to add a test case that covers the overflow case (e.g. with an i8 induction that wraps when adding the element size)?
I tried doing that with an i8, but the offset returned by SCEV is always 64-bit. So in practice I'd need a test with an alloca just less than UINT64_MAX bytes in size and an offset just less than UINT64_MAX, in order to exercise the overflow case. I wasn't sure if this was fragile, i.e. are allocas of such a large size even legal given that I don't think GEPs even support an offset greater than SINT64_MAX (since offsets are signed)? Having said that, I'm open to ideas about how to write a sensible test for this!
Would something like the below work? I might be missing something, but it looks like the overflow check should trigger there?
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128-ni:1"
declare void @init(ptr nocapture nofree)
define void @test(ptr nocapture noundef writeonly %dest) {
entry:
%local_dest = alloca [128 x i8], align 4
%local_src = alloca [128 x i8], align 4
%local_cmp = alloca [128 x i8], align 4
call void @init(ptr %local_src)
call void @init(ptr %local_cmp)
br label %for.body
for.body:
%iv = phi i8 [ 255, %entry ], [ %iv.next, %for.inc ]
%arrayidx = getelementptr inbounds [1024 x i8], ptr %local_cmp, i8 0, i8 %iv
%0 = load i8, ptr %arrayidx, align 1
%cmp3.not = icmp eq i8 %0, 3
br i1 %cmp3.not, label %for.inc, label %if.then
if.then:
%arrayidx5 = getelementptr inbounds [1024 x i8], ptr %local_src, i8 0, i8 %iv
%1 = load i8, ptr %arrayidx5, align 1
%mul = shl nsw i8 %1, 2
%arrayidx7 = getelementptr inbounds [1024 x i8], ptr %local_dest, i8 0, i8 %iv
store i8 %mul, ptr %arrayidx7, align 1
br label %for.inc
for.inc:
%iv.next = add nsw i8 %iv, -1
%cmp2.not = icmp eq i8 %iv, 200
br i1 %cmp2.not, label %for.cond.cleanup, label %for.body
for.cond.cleanup:
call void @llvm.memcpy.p0.p0.i8(ptr noundef nonnull align 4 dereferenceable(1024) %dest, ptr noundef nonnull align 4 dereferenceable(1024) %local_dest, i8 1024, i1 false)
ret void
}
Or perhaps bail out entirely if the offset is negative?
I think it might make sense for now to bail out for negative values - it doesn't cause any existing tests to fail. It would also make it almost impossible to write an overflow test, because you'd need something like an offset of 0x7FFFFFFFFFFFFFFF and an element size of 0x7FFFFFFFFFFFFFFF.
I think it makes sense to restrict this to non-negative values, thanks! It would probably make sense to split off the patch with the extra bail-out separately. Also adding @preames who added some of that code originally I think
So I originally thought about splitting off the change in a separate patch, but I couldn't see any way of testing it by itself because the overflow checks catch this by accident. If you want I can split it off into a separate PR - there just won't be any tests for it.
Created #99490 for the negative offset check.
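For reference, the extra bail-out being split off amounts to something like the following fragment of the hunk quoted earlier (a sketch, not the exact committed lines):

// Sketch: refuse negative start offsets so the unsigned arithmetic on Offset
// below can never be handed a huge wrapped value.
if (Offset->getAPInt().isNegative())
  return false;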
An alternate approach here would be to compute the AccessSize as a SCEV (which should just be umax(End,Start) - umin(Start,End) + ElementSize), and then check to see if it's a SCEVConstant. End should be evaluateAtIteration on the AddRec. The base would then be umin(End, Start) as a SCEV, and then see if it's an understandable one. In general, doing the math manually seems like a recipe for error here.
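Roughly, that suggestion could look like the following sketch (the function name and the exact way the backedge-taken count is obtained are assumptions; only constant results would be usable by the caller):

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
using namespace llvm;

// Sketch: End is the address in the last iteration, the accessed range is
// [umin(Start, End), umax(Start, End)], and the size of that range plus one
// element is what must be dereferenceable.
static const SCEVConstant *computeAccessSize(ScalarEvolution &SE,
                                             const SCEVAddRecExpr *AddRec,
                                             const SCEV *BTC, // backedge-taken count
                                             const SCEV *EltSizeSCEV) {
  const SCEV *Start = AddRec->getStart();
  const SCEV *End = AddRec->evaluateAtIteration(BTC, SE);
  const SCEV *Min = SE.getUMinExpr(Start, End);
  const SCEV *Max = SE.getUMaxExpr(Start, End);
  const SCEV *Size = SE.getAddExpr(SE.getMinusSCEV(Max, Min), EltSizeSCEV);
  return dyn_cast<SCEVConstant>(Size); // null if not a compile-time constant
}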
Thanks for this suggestion @preames! It was a bit fiddly to get it working without using SCEVExpander, but I do think it's now less prone to error. As part of this patch I've also added some tests for loops with strided accesses for positive and negative steps too as we didn't seem to be testing that.
@preames @fhahn After rebasing my early exit auto-vectorisation patch on top of this one, I realised some of the tests no longer vectorise because isDereferenceableAndAlignedInLoop now returns false for some cases. In particular, if a function argument has the attribute dereferenceable(100) then the SCEV min/max code fails to return a SCEVConstant. This is because it is unable to determine that a calculation such as (ptr + offset) will not wrap. I don't think there is a fundamental issue blocking this - it's just that the function ScalarEvolution::getRangeRef knows about allocas, but doesn't know about the dereferenceable attribute. So the question is - are we happy with accepting these minor regressions when using SCEV for now? Or is it worth creating a new patch to teach ScalarEvolution::getRangeRef new tricks first?
That's interesting! Is it possible that the current tests don't catch this issue because load-deref-pred-align.ll doesn't have a variant with dereferenceable()?
llvm/lib/Analysis/Loads.cpp
Outdated
const auto *Offset = dyn_cast<SCEVConstant>(StartS->getOperand(0));
const auto *NewBase = dyn_cast<SCEVUnknown>(StartS->getOperand(1));
if (StartS->getNumOperands() == 2 && Offset && NewBase) {
const SCEV *End = AddRec->evaluateAtIteration( |
Could we re-use the code for computing start and end from https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/LoopAccessAnalysis.cpp#L206
I think that's possible, but it's not a drop-in replacement and is quite inefficient in the current context. For example, getStartAndEndForAccess recalculates the trip count and element size which we've already calculated, although I could rip out a lot of the code in isDereferenceableAndAlignedInLoop so that I avoid doing this. The function also requires passing in a PointerBounds DenseMap, which we don't care about. I could probably refactor getStartAndEndForAccess by adding a new variant that doesn't require a map. Ultimately I will still need to extract an actual base Value* pointer from a SCEV, but perhaps this would reduce the overall complexity.
OK, I've done some refactoring and taught isDereferenceableAndAlignedInLoop to call getStartAndEndForAccess. This at least reduces some of the complexity and is actually an improvement on the existing code.
llvm/lib/Analysis/Loads.cpp
Outdated
    return false;

  // Now calculate the total access size, which is (max - min) + element_size.
const SCEV *Diff = SE.getMinusSCEV(Max, Min); |
Is the separate handling for SCEVUnknown and SCEVAddExpr still needed when expanding the bounds here? I'd expect that computing start/end of the add rec would work for both cases without the need for a distinction?
Yes I think so. That's because the Min value is a SCEV* whereas we need a Value*, and you can only do that by analysing the SCEV type and handling it on a case-by-case basis. We cannot use the SCEVExpander here because that would create new IR. This is the downside to using SCEV here.
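To illustrate the point, here is a sketch of what that case-by-case extraction looks like (an illustration only, not the committed code):

#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/Value.h"
using namespace llvm;

// Sketch: recover a Value* base from the SCEV of the minimum accessed address
// without expanding any new IR (so no SCEVExpander).
static Value *getBaseFromMinSCEV(const SCEV *Min) {
  if (const auto *U = dyn_cast<SCEVUnknown>(Min))
    return U->getValue(); // Start was a plain pointer.
  if (const auto *Add = dyn_cast<SCEVAddExpr>(Min))
    // Only accept the simple (constant offset + pointer) form.
    if (Add->getNumOperands() == 2 && isa<SCEVConstant>(Add->getOperand(0)))
      if (const auto *U = dyn_cast<SCEVUnknown>(Add->getOperand(1)))
        return U->getValue();
  return nullptr; // Anything more complex: give up rather than guess.
}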
llvm/lib/Analysis/Loads.cpp
Outdated
  if (!ASC)
    return false;

if (const SCEVUnknown *NewBase = dyn_cast<SCEVUnknown>(Min)) { |
Retrieving the new base and access size here seems to unfortunately add a bit of complexity.
What could make this much simpler would be if we would get the base pointer (underlying object) and the # of known dereferenceable bits. Then construct SCEVs for the start and end of the object and check if the Start/End of the accessed pointer is >= and <= the bounds of the object.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/Loads.cpp#L97 would highlight a way to get the number of dereferenceable bytes
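A sketch of the alternative being suggested (assumptions: the caller has the DataLayout to hand and separately builds SCEVs for the object bounds to compare against the access's Start/End):

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"
using namespace llvm;

// Sketch: find the underlying object and how many bytes of it are known
// dereferenceable; 0 means "unknown / not usable".
static uint64_t getObjectDerefBytes(const Value *Ptr, const DataLayout &DL) {
  const Value *Obj = getUnderlyingObject(Ptr);
  bool CanBeNull = false, CanBeFreed = false;
  uint64_t DerefBytes =
      Obj->getPointerDereferenceableBytes(DL, CanBeNull, CanBeFreed);
  // Assumption: callers still need the usual checks that the object cannot be
  // null or freed while the loop runs, as the existing Loads.cpp code does.
  if (CanBeNull || CanBeFreed)
    return 0;
  return DerefBytes;
}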
We could do certainly, but then we're effectively doing some of the work that's already done in isDereferenceableAndAlignedPointer. It's almost like we want to call isDereferenceableAndAlignedPointer on both the min and max addressed pointers in the loop.
Actually I still think we end up with a similar problem in that we need to discover the base Value* in order to call getPointerDereferenceableBytes. I understand what you're suggesting, but not sure it will reduce the complexity in practice. I'll have a think about it some more.
Yeah that's right, whereas my early exit autovec tests have a test for dereferenceable pointers. It turns out there is a relatively simple fix by teaching getRangeRef about the dereferenceable attribute:
…ble objects
Whilst dealing with review comments on llvm#96752 I discovered that SCEV does not know about the dereferenceable attribute on function arguments so I have updated getRangeRef to make use of it by calling getPointerDereferenceableBytes.
Rebased.
@@ -3,7 +3,7 @@

 target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64"

 @src = external global [8 x i32], align 4
declare void @init() |
Why is this change needed?
I think it was for an earlier version of the patch, but it's no longer required. I've reverted it now!
  const SCEV *ScStart;
  const SCEV *ScEnd;

  if (SE->isLoopInvariant(PtrExpr, Lp)) {
    ScStart = ScEnd = PtrExpr;
  } else if (auto *AR = dyn_cast<SCEVAddRecExpr>(PtrExpr)) {
const SCEV *Ex = PSE.getSymbolicMaxBackedgeTakenCount(); |
as PSE caches the max symbolic BTC (I think) it's probably better to avoid the extra parameter.
The extra parameter is required in order to reuse this function in Loads.cpp, because isDereferenceableAndAlignedInLoop only has access to ScalarEvolution. In the places where getStartAndEndForAccess is called below we still call PSE.getSymbolicMaxBackedgeTakenCount() anyway so it will be cached. The alternative would be to have two versions of getStartAndEndForAccess - one with a ScalarEvolution parameter and the other with a PredicatedScalarEvolution - but that's not great for code reuse.
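For illustration, the two-overload option could take roughly this shape (the signatures here are assumptions for the sake of the example, not the final upstream API):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Type.h"
#include <utility>
using namespace llvm;

// SE-based worker, callable from Loads.cpp where only ScalarEvolution exists.
std::pair<const SCEV *, const SCEV *>
getStartAndEndForAccess(const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy,
                        const SCEV *MaxBECount, ScalarEvolution *SE);

// Thin PSE-based wrapper for LoopAccessAnalysis callers; the symbolic max BTC
// is still obtained (and therefore cached) through PSE.
inline std::pair<const SCEV *, const SCEV *>
getStartAndEndForAccess(const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy,
                        PredicatedScalarEvolution &PSE) {
  return getStartAndEndForAccess(Lp, PtrExpr, AccessTy,
                                 PSE.getSymbolicMaxBackedgeTakenCount(),
                                 PSE.getSE());
}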
@@ -839,6 +839,12 @@ bool sortPtrAccesses(ArrayRef<Value *> VL, Type *ElemTy, const DataLayout &DL,
 bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                          ScalarEvolution &SE, bool CheckType = true);

std::pair<const SCEV *, const SCEV *> getStartAndEndForAccess( |
Could you add a comment here?
Done
Could you move the comment from llvm/lib/Analysis/LoopAccessAnalysis.cpp, as it is more precise? I think from the current description, it sounds like the returned value is the last address accessed in the loop, but unless I am missing something it is actually (address of last access) + access size.
Done!
static std::pair<const SCEV *, const SCEV *> getStartAndEndForAccess(
    const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy,
    PredicatedScalarEvolution &PSE,
std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess( |
might be good to add an overload without the extra EltSize argument for users here
Done
  br label %for.body

for.body:
%indvars.iv = phi i64 [ 511, %entry ], [ %indvars.iv.next, %for.inc ] |
Suggested change:
-  %indvars.iv = phi i64 [ 511, %entry ], [ %indvars.iv.next, %for.inc ]
+  %iv = phi i64 [ 511, %entry ], [ %indvars.iv.next, %for.inc ]
Done
  br i1 %cmp2.not, label %for.cond.cleanup, label %for.body

for.cond.cleanup:
call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(1024) %dest, ptr noundef nonnull align 4 dereferenceable(1024) %local_dest, i64 1024, i1 false) |
Suggested change:
-  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(1024) %dest, ptr noundef nonnull align 4 dereferenceable(1024) %local_dest, i64 1024, i1 false)
+  call void @llvm.memcpy.p0.p0.i64(ptr %dest, ptr %local_dest, i64 1024, i1 false)
Done
  br label %for.body

for.body:
%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %for.inc ] |
Suggested change:
-  %indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %for.inc ]
+  %iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %for.inc ]
Done
  br label %for.inc

for.inc:
%indvars.iv.next = add nsw i64 %indvars.iv, -1 |
Suggested change:
-  %indvars.iv.next = add nsw i64 %indvars.iv, -1
+  %iv.next = add nsw i64 %indvars.iv, -1
Done
llvm/lib/Analysis/Loads.cpp
Outdated
  if (isa<SCEVConstant>(PtrDiff))
    MaxPtrDiff = cast<SCEVConstant>(PtrDiff)->getAPInt();
  else
MaxPtrDiff = SE.getUnsignedRangeMax(PtrDiff); |
Sufficient to?
Suggested change:
-  if (isa<SCEVConstant>(PtrDiff))
-    MaxPtrDiff = cast<SCEVConstant>(PtrDiff)->getAPInt();
-  else
-    MaxPtrDiff = SE.getUnsignedRangeMax(PtrDiff);
+  MaxPtrDiff = SE.getUnsignedRangeMax(PtrDiff);
Done
llvm/lib/Analysis/Loads.cpp
Outdated
    return false;

  const SCEV *EltSizeSCEV;
std::pair<const SCEV *, const SCEV *> Range = getStartAndEndForAccess( |
Suggested change:
-  std::pair<const SCEV *, const SCEV *> Range = getStartAndEndForAccess(
+  const auto &[AccessStart, AccessEnd] = getStartAndEndForAccess(
Done
…ble objects (llvm#104778)
Whilst dealing with review comments on llvm#96752 I discovered that SCEV does not know about the dereferenceable attribute on function arguments so I have updated getRangeRef to make use of it by calling getPointerDereferenceableBytes.
Rebase and some minor refactoring to hopefully make it look a bit neater.
Rebase.
llvm/lib/Analysis/Loads.cpp
Outdated
  const SCEV *MaxBECount =
      SE.getPredicatedSymbolicMaxBackedgeTakenCount(L, *Predicates);
if (isa<SCEVCouldNotCompute>(MaxBECount)) |
Extending to support predicated BTCs instead of constant trip counts should probably be done separately if possible?
It should probably also be possible to add a test for predicated BTCs, unless I am missing anything?
Yeah sorry, you're right. It shouldn't be part of this change. I think I just did it whilst addressing one of your review comments about reusing getStartAndEndForAccess. It should be fine to pass in SE.getSmallConstantMaxTripCount for now.
Actually, I can't pass in the result of SE.getSmallConstantMaxTripCount anymore because it returns an unsigned value. Instead I have changed it to call SE.getPredicatedConstantMaxBackedgeTakenCount, which is what getSmallConstantMaxTripCount does anyway.
Sounds good, thanks!
This adds missing test coverage for isDereferenceableAndAlignedInLoop, related to #96752.
  // If given a uniform (i.e. non-varying) address, see if we can prove the
  // access is safe within the loop w/o needing predication.
if (L->isLoopInvariant(Ptr)) |
It looks like this code path was previously not tested and dropping it may cause a regression. I added 747f7f3 which should cover this codepath.
No problem. Thanks for adding the test! I've re-added the code now.
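For context, the re-added fast path is roughly the following fragment (variable names follow the snippet quoted above; treat this as a sketch rather than the exact committed code):

// A loop-invariant address needs only a single dereferenceability/alignment
// query at the loop header instead of per-iteration reasoning.
if (L->isLoopInvariant(Ptr))
  return isDereferenceableAndAlignedPointer(Ptr, Alignment, EltSize, DL,
                                            HeaderFirstNonPHI, AC, &DT);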
…d assumption.
This adds missing test coverage for isDereferenceableAndAlignedInLoop, related to llvm/llvm-project#96752.
LGTM, thanks! Just one comment remaining re moving the existing comment for getStartAndEndForAccess.
…AlignedInLoop
Currently when we encounter a negative step in the induction variable isDereferenceableAndAlignedInLoop bails out because the element size is signed greater than the step. This patch adds support for negative steps in cases where we detect the start address for the load is of the form base + offset. In this case the address decrements in each iteration so we need to calculate the access size differently. I have done this by calling getStartAndEndForAccess from LoopAccessAnalysis.cpp. The changed test in LoopVectorize/X86/load-deref-pred.ll now passes because previously we were calculating the total access size incorrectly, whereas now it is 412 bytes and fits perfectly into the alloca.
Rebase + addressed review comment. I ran "make check-all" downstream after rebasing and the tests look good.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/9122. Here is the relevant piece of the build log for the reference.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/169/builds/7413. Here is the relevant piece of the build log for the reference.
Looks like this is causing an assert failure for build https://lab.llvm.org/buildbot/#/builders/169/builds/7413. I'll revert asap.
…eableAndAlignedInLoop (llvm#96752)"
This reverts commit bfedf64.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/85/builds/4376. Here is the relevant piece of the build log for the reference.
…eableAndAlignedInLoop llvm#96752"
The last attempt failed a sanitiser build because we were creating a reference to a null Predicates pointer in isDereferenceableAndAlignedInLoop. This was exposed by the unit test IsDerefReadOnlyLoop in unittests/Analysis/LoadsTest.cpp. I fixed this by falling back on getConstantMaxBackedgeTakenCount if Predicates is null.
…eableAndAlignedInLoop #96752" (#123616)
The last attempt failed a sanitiser build because we were creating a reference to a null Predicates pointer in isDereferenceableAndAlignedInLoop. This was exposed by the unit test IsDerefReadOnlyLoop in unittests/Analysis/LoadsTest.cpp. I fixed this by falling back on getConstantMaxBackedgeTakenCount if Predicates is null - see line 316 in llvm/lib/Analysis/Loads.cpp. There are no other changes.
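A sketch of the fallback described there (assuming Predicates is the optional SmallVectorImpl<const SCEVPredicate *> pointer threaded into isDereferenceableAndAlignedInLoop):

// Sketch: only dereference Predicates when the caller supplied it; otherwise
// fall back to the plain constant max backedge-taken count.
const SCEV *MaxBECount =
    Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
               : SE.getConstantMaxBackedgeTakenCount(L);
if (isa<SCEVCouldNotCompute>(MaxBECount))
  return false;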
Currently when we encounter a negative step in the induction variable, isDereferenceableAndAlignedInLoop bails out because the element size is signed greater than the step. This patch adds support for negative steps in cases where we detect the start address for the load is of the form base + offset. In this case the address decrements in each iteration, so we need to calculate the access size differently. I have done this by calling getStartAndEndForAccess from LoopAccessAnalysis.cpp.

The motivation for this patch comes from PR #88385 where a reviewer requested reusing isDereferenceableAndAlignedInLoop, but that PR itself does support reverse loops.

The changed test in LoopVectorize/X86/load-deref-pred.ll now passes because previously we were calculating the total access size incorrectly, whereas now it is 412 bytes and fits perfectly into the alloca.
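As a rough illustration of the arithmetic (a sketch assuming a constant trip count and step with matching APInt widths; the committed code derives the same quantity via getStartAndEndForAccess and SCEV):

#include "llvm/ADT/APInt.h"
using namespace llvm;

// For a stride of Step bytes over TC iterations, the bytes that must be
// dereferenceable are (TC - 1) * |Step| + EltSize, measured from the lowest
// address touched. For a negative Step that lowest address is the final
// iteration's address rather than the loop's start address.
static APInt accessSizeForConstantStride(const APInt &TC, const APInt &Step,
                                         const APInt &EltSize) {
  return (TC - 1) * Step.abs() + EltSize;
}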