-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[Flang]Fix for changed code at the end of AllocaIP. #92430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flang]Fix for changed code at the end of AllocaIP. #92430
Conversation
@llvm/pr-subscribers-clang @llvm/pr-subscribers-flang-openmp Author: Mats Petersson (Leporacanthicus) ChangesSome of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching or something like that. The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP. Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created). A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added. Full diff: https://github.com/llvm/llvm-project/pull/92430.diff 3 Files Affected:
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90 b/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90
new file mode 100644
index 0000000000000..fddce25ae22cc
--- /dev/null
+++ b/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90
@@ -0,0 +1,23 @@
+!! The main point of this test is to check that the code compiles at all, so the
+!! checking is not very detailed. Not hitting an assert, crashing or otherwise failing
+!! to compile is the key point. Also, emitting llvm is required for this to happen.
+! RUN: %flang_fc1 -emit-llvm -fopenmp -o - %s 2>&1 | FileCheck %s
+subroutine proc
+ implicit none
+ real(8),allocatable :: F(:)
+ real(8),allocatable :: A(:)
+
+!$omp parallel private(A) reduction(+:F)
+ allocate(A(10))
+!$omp end parallel
+end subroutine proc
+
+!CHECK-LABEL: define void @proc_()
+!CHECK: call void
+!CHECK-SAME: @__kmpc_fork_call(ptr {{.*}}, i32 1, ptr @[[OMP_PAR:.*]], {{.*}})
+
+!CHECK: define internal void @[[OMP_PAR]]
+!CHECK: omp.par.region8:
+!CHECK-NEXT: call ptr @malloc
+!CHECK-SAME: i64 10
+
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 9f0c07ef0da91..c0f4b5b066097 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -1391,7 +1391,8 @@ IRBuilder<>::InsertPoint OpenMPIRBuilder::createParallel(
// Change the location to the outer alloca insertion point to create and
// initialize the allocas we pass into the parallel region.
- Builder.restoreIP(OuterAllocaIP);
+ InsertPointTy NewOuter(OuterAllocaBlock, OuterAllocaBlock->begin());
+ Builder.restoreIP(NewOuter);
AllocaInst *TIDAddrAlloca = Builder.CreateAlloca(Int32, nullptr, "tid.addr");
AllocaInst *ZeroAddrAlloca =
Builder.CreateAlloca(Int32, nullptr, "zero.addr");
@@ -2135,7 +2136,8 @@ OpenMPIRBuilder::InsertPointTy OpenMPIRBuilder::createReductions(
// values.
unsigned NumReductions = ReductionInfos.size();
Type *RedArrayTy = ArrayType::get(Builder.getPtrTy(), NumReductions);
- Builder.restoreIP(AllocaIP);
+ Builder.SetInsertPoint(AllocaIP.getBlock()->getTerminator());
+ // Builder.restoreIP(AllocaIP);
Value *RedArray = Builder.CreateAlloca(RedArrayTy, nullptr, "red.array");
Builder.SetInsertPoint(InsertBlock, InsertBlock->end());
@@ -2536,7 +2538,10 @@ OpenMPIRBuilder::applyStaticWorkshareLoop(DebugLoc DL, CanonicalLoopInfo *CLI,
getOrCreateRuntimeFunction(M, omp::OMPRTL___kmpc_for_static_fini);
// Allocate space for computed loop bounds as expected by the "init" function.
- Builder.restoreIP(AllocaIP);
+
+ // Builder.restoreIP(AllocaIP);
+ Builder.SetInsertPoint(AllocaIP.getBlock()->getTerminator());
+
Type *I32Type = Type::getInt32Ty(M.getContext());
Value *PLastIter = Builder.CreateAlloca(I32Type, nullptr, "p.lastiter");
Value *PLowerBound = Builder.CreateAlloca(IVTy, nullptr, "p.lowerbound");
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index bfd7d65912bdb..b5f46b641f596 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -1194,6 +1194,17 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
MutableArrayRef<BlockArgument> reductionArgs =
opInst.getRegion().getArguments().take_back(
opInst.getNumReductionVars());
+
+ SmallVector<llvm::Value *> byRefVars;
+ if (isByRef) {
+ for (unsigned i = 0; i < opInst.getNumReductionVars(); ++i) {
+ // Allocate reduction variable (which is a pointer to the real reduciton
+ // variable allocated in the inlined region)
+ byRefVars.push_back(builder.CreateAlloca(
+ moduleTranslation.convertType(reductionDecls[i].getType())));
+ }
+ }
+
for (unsigned i = 0; i < opInst.getNumReductionVars(); ++i) {
SmallVector<llvm::Value *> phis;
@@ -1206,18 +1217,13 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
assert(phis.size() == 1 &&
"expected one value to be yielded from the "
"reduction neutral element declaration region");
- builder.restoreIP(allocaIP);
if (isByRef) {
- // Allocate reduction variable (which is a pointer to the real reduciton
- // variable allocated in the inlined region)
- llvm::Value *var = builder.CreateAlloca(
- moduleTranslation.convertType(reductionDecls[i].getType()));
// Store the result of the inlined region to the allocated reduction var
// ptr
- builder.CreateStore(phis[0], var);
+ builder.CreateStore(phis[0], byRefVars[i]);
- privateReductionVariables.push_back(var);
+ privateReductionVariables.push_back(byRefVars[i]);
moduleTranslation.mapValue(reductionArgs[i], phis[0]);
reductionVariableMap.try_emplace(opInst.getReductionVars()[i], phis[0]);
} else {
|
I'm aware MLIR tests are failing, will try to fix tomorrow. |
@@ -1194,6 +1194,17 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder, | |||
MutableArrayRef<BlockArgument> reductionArgs = | |||
opInst.getRegion().getArguments().take_back( | |||
opInst.getNumReductionVars()); | |||
|
|||
SmallVector<llvm::Value *> byRefVars; | |||
if (isByRef) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #92244
Ideally, the only thing added to the alloca block are allocas. Apparently in this case something is splitting the alloca block before the AllocaIP, do you know what exactly? Resetting the insertion point to the beginning of the alloca block causes the inserted instructions to be reverser order. I don't think that's a big deal (unless the alloca size needs a value that has previously computed), but I would consider making a dedicated alloca block that is never split to be the cleaner solution. I created
I'd prefer to not make the split conditional as it is here, it adds another possibly unnecessary BB but is more robust. What do you think? |
One possible reason is that for allocatable variables the privatization demands that we allocate only if the original variable is allocated. This introduces an |
I strongly recommend not doing it this way. An unconditional alloca in the entry block has no overhead (it just moves the stack pointer for the required space). Doing it conditionally does introduces overhead. Even worse, some LLVM passes treat allocas in the entry block specially (e.g. the inliner moves them to the caller's entry block to avoid having to dynamically readjust the stack pointer; mem2reg; SROA) and adding a branch will effectively pushes all allocas after the introduced branch into a non-entry block and thus worse optimization for them too. See the discussions at https://discourse.llvm.org/t/alloca-outside-of-entry-block/11088/5, https://discourse.llvm.org/t/mem2reg-for-non-entry-blocks/26696/5 It would be better if the memory is allocated unconditionally, and just not used it if the variable not allocated. |
@Meinersbur The case is not that we are using I'm still trying to make this work correctly for all cases. |
In any case, the block that the conditional is inserted into should not be the alloca block. If there are alloca instructions after the split point, those allocas will be pushed out of the entry block. I would consider anything that inserts control flow into the AllocaIP to be fragile. If we only ever want to insert instructions to the end of the (current) alloca block, we should just pass the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes.
@Meinersbur thank you for your comments. I agree that it would be better for all allocas to come first in the entry block and not to be inserted after conditionals. But I think this is out of scope of this PR, which is trying to fix an assertion failure while building a real application.
Would it be acceptable for you if we addressed the alloca placement in a later patch?
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Outdated
Show resolved
Hide resolved
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Outdated
Show resolved
Hide resolved
@Meinersbur : I completely agree that this is far from a perfect solution. The trouble is that the "good solution" involves refactoring quite a bit of the infrastructure involved to fix that. Which I started looking at, but we're would like to get something that works in now, and make a proper refactoring later on. |
@@ -0,0 +1,23 @@ | |||
!! The main point of this test is to check that the code compiles at all, so the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this comment:
I don't think we want the new AllocaIP to be in that second block, in case someone adds more alloca ops - they should go into the first block.
My PR, that Kiran mentioned, has several of this kind of fixes [I will check if it fixes this case too,]. Whenever possible, I try to make the new location at the terminator of the original AllocaIP block.
I looked the generated IR of this test, you can check the CFG of the outlined parallel region here.
As you can see in the CFG, alloca
s are still scattered through out the BBs of the outlined region's function. I think if we want to keep all alloca
s only at the entry block of the outlined function, only carefully updating the insertion point will not be enough.
I think the problem is that we convert and inlined entire "sub-regions" individually; for example, in this test we inline the whole logic/IR for A
first and then F
(or vice versa). So no matter where the IP is, since for allocatables there is if-else
condtion to inline, the currently set IP will have to be split and alloca
s that were created before will either be pushed up or down to a different BB than the one we will insert the new alloca
s to.
Instead, I think we should do that a post-processing step after the region is completely emitted. So we hoist the all alloca
s after we finish emitting the outlined region to the entry BB of the outlined function. However, I don't know if there are any unseen issues here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created an internal ticket to improve the code generation. It's just a much bigger task than this one started out as. I then found more problems, so this took much longer than I expected.
b0875c4
to
535d84e
Compare
Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching or something like that. The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP. Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created). A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.
535d84e
to
2577a63
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if @Meinersbur is happy. It's a shame about the fallout to the clang tests but I can see that would be very difficult to avoid.
Thank you for persisting with this difficult patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocas should be placed at the start of the entry block.
Would it be acceptable to make this change such that the alloca's are at the very start of the block? I have tried that, and it seems to work. It will still mean changes to the tests, and I can't guarantee ALL alloca's for all Fortran (via MLIR) generated code will be at the start of the entry block. We have a further ticket to clean this up, but it's a bigger task, which we will do, but it will take a few weeks at least. |
Yes, that should be fine. You could also use something like getFirstNonPHIOrDbgOrAlloca to compute the insert point after the current allocas. (Calling this many times could have quadratic runtime though -- I don't have enough context here to say whether that's a problem or not.) |
Add a second reduction of a by_val type, so that we can test for the most recent bug found. Also, since the test is now different, rename it. This test is going all the way to LLVM-IR to ensure the whole codegeneration works here.
Also fix some tests, as this moves some allocas to an earlier location.
2577a63
to
243e5fa
Compare
My concern was addressed -- and I don't have the OpenMP context to comment on the broader change.
@ergawy Could you take a look at this, given that you did something similar [even if it was much smaller] recently? |
progress? |
I'm afraid I'm still waiting for review from some of the key people here. Apparently there was some OpenMP conference last week, which kept some people very busy, so I'm hoping by the time they've read emails and got over jet-lag, etc, they'll get round to this. |
Sorry, this totally slipped my mind. I will take a look today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
It was incorrect to set the insertion point to the init block after inlining the initialization region because the code generated in the init block depends upon the value yielded from the init region. When there were multiple reduction initialization regions each with multiple blocks, this could lead to the initilization region being inlined after the init block which depends upon it. Moving the insertion point to before inlining the initialization block turned up further issues around the handling of the terminator for the initialization block, which are also fixed here. This fixes a bug in llvm#92430 (but the affected code couldn't compile before llvm#92430 anyway).
It was incorrect to set the insertion point to the init block after inlining the initialization region because the code generated in the init block depends upon the value yielded from the init region. When there were multiple reduction initialization regions each with multiple blocks, this could lead to the initilization region being inlined after the init block which depends upon it. Moving the insertion point to before inlining the initialization block turned up further issues around the handling of the terminator for the initialization block, which are also fixed here. This fixes a bug in #92430 (but the affected code couldn't compile before #92430 anyway).
Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching. The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP. Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created). A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.
…6052) It was incorrect to set the insertion point to the init block after inlining the initialization region because the code generated in the init block depends upon the value yielded from the init region. When there were multiple reduction initialization regions each with multiple blocks, this could lead to the initilization region being inlined after the init block which depends upon it. Moving the insertion point to before inlining the initialization block turned up further issues around the handling of the terminator for the initialization block, which are also fixed here. This fixes a bug in llvm#92430 (but the affected code couldn't compile before llvm#92430 anyway).
Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching.
The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP.
Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created).
A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.