[Flang]Fix for changed code at the end of AllocaIP. #92430

Merged: 6 commits into llvm:main on Jun 18, 2024

Conversation

@Leporacanthicus (Contributor) commented May 16, 2024

Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching.

The fix is to rebuild the insertion point from the block, rather than use builder.restoreIP.

Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are created before their users).

A simple test is also added, mainly to ensure the minimal reproducer doesn't fail to compile in the future.
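
To illustrate the shape of the change (a minimal sketch against the LLVM IRBuilder API, not the patch itself; the helper name is made up), the insertion point is rebuilt from the block that is known to be the alloca block, instead of restoring a saved insert point whose instruction may have moved:

  #include "llvm/IR/IRBuilder.h"

  // Sketch: derive the insertion point from the block itself, so a stale
  // saved instruction cannot trip the BB->getParent()/IP->getParent() assert.
  static void setAllocaInsertPoint(llvm::IRBuilderBase &Builder,
                                   llvm::BasicBlock *AllocaBB) {
    if (llvm::Instruction *Term = AllocaBB->getTerminator())
      Builder.SetInsertPoint(Term);     // keep allocas ahead of the branch
    else
      Builder.SetInsertPoint(AllocaBB); // otherwise append at the end
  }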

@llvmbot added the mlir:llvm, mlir, flang (Flang issues not falling into any other category), mlir:openmp, flang:fir-hlfir, flang:openmp, and clang:openmp (OpenMP related changes to Clang) labels on May 16, 2024
@llvmbot (Member) commented May 16, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-mlir-openmp
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-flang-fir-hlfir

@llvm/pr-subscribers-flang-openmp

Author: Mats Petersson (Leporacanthicus)


Full diff: https://github.com/llvm/llvm-project/pull/92430.diff

3 Files Affected:

  • (added) flang/test/Lower/OpenMP/parallel-reduction-allocate.f90 (+23)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+8-3)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+13-7)
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90 b/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90
new file mode 100644
index 0000000000000..fddce25ae22cc
--- /dev/null
+++ b/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90
@@ -0,0 +1,23 @@
+!! The main point of this test is to check that the code compiles at all, so the
+!! checking is not very detailed. Not hitting an assert, crashing or otherwise failing
+!! to compile is the key point. Also, emitting llvm is required for this to happen.
+! RUN: %flang_fc1 -emit-llvm -fopenmp -o - %s 2>&1 | FileCheck %s
+subroutine proc
+  implicit none
+  real(8),allocatable :: F(:)
+  real(8),allocatable :: A(:)
+
+!$omp parallel private(A) reduction(+:F)
+  allocate(A(10))
+!$omp end parallel
+end subroutine proc
+
+!CHECK-LABEL: define void @proc_()
+!CHECK: call void
+!CHECK-SAME: @__kmpc_fork_call(ptr {{.*}}, i32 1, ptr @[[OMP_PAR:.*]], {{.*}})
+
+!CHECK: define internal void @[[OMP_PAR]]
+!CHECK: omp.par.region8:
+!CHECK-NEXT: call ptr @malloc
+!CHECK-SAME: i64 10
+
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 9f0c07ef0da91..c0f4b5b066097 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -1391,7 +1391,8 @@ IRBuilder<>::InsertPoint OpenMPIRBuilder::createParallel(
 
   // Change the location to the outer alloca insertion point to create and
   // initialize the allocas we pass into the parallel region.
-  Builder.restoreIP(OuterAllocaIP);
+  InsertPointTy NewOuter(OuterAllocaBlock, OuterAllocaBlock->begin());
+  Builder.restoreIP(NewOuter);
   AllocaInst *TIDAddrAlloca = Builder.CreateAlloca(Int32, nullptr, "tid.addr");
   AllocaInst *ZeroAddrAlloca =
       Builder.CreateAlloca(Int32, nullptr, "zero.addr");
@@ -2135,7 +2136,8 @@ OpenMPIRBuilder::InsertPointTy OpenMPIRBuilder::createReductions(
   // values.
   unsigned NumReductions = ReductionInfos.size();
   Type *RedArrayTy = ArrayType::get(Builder.getPtrTy(), NumReductions);
-  Builder.restoreIP(AllocaIP);
+  Builder.SetInsertPoint(AllocaIP.getBlock()->getTerminator());
+  //  Builder.restoreIP(AllocaIP);
   Value *RedArray = Builder.CreateAlloca(RedArrayTy, nullptr, "red.array");
 
   Builder.SetInsertPoint(InsertBlock, InsertBlock->end());
@@ -2536,7 +2538,10 @@ OpenMPIRBuilder::applyStaticWorkshareLoop(DebugLoc DL, CanonicalLoopInfo *CLI,
       getOrCreateRuntimeFunction(M, omp::OMPRTL___kmpc_for_static_fini);
 
   // Allocate space for computed loop bounds as expected by the "init" function.
-  Builder.restoreIP(AllocaIP);
+
+  //  Builder.restoreIP(AllocaIP);
+  Builder.SetInsertPoint(AllocaIP.getBlock()->getTerminator());
+
   Type *I32Type = Type::getInt32Ty(M.getContext());
   Value *PLastIter = Builder.CreateAlloca(I32Type, nullptr, "p.lastiter");
   Value *PLowerBound = Builder.CreateAlloca(IVTy, nullptr, "p.lowerbound");
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index bfd7d65912bdb..b5f46b641f596 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -1194,6 +1194,17 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
     MutableArrayRef<BlockArgument> reductionArgs =
         opInst.getRegion().getArguments().take_back(
             opInst.getNumReductionVars());
+
+    SmallVector<llvm::Value *> byRefVars;
+    if (isByRef) {
+      for (unsigned i = 0; i < opInst.getNumReductionVars(); ++i) {
+        // Allocate reduction variable (which is a pointer to the real reduciton
+        // variable allocated in the inlined region)
+        byRefVars.push_back(builder.CreateAlloca(
+            moduleTranslation.convertType(reductionDecls[i].getType())));
+      }
+    }
+
     for (unsigned i = 0; i < opInst.getNumReductionVars(); ++i) {
       SmallVector<llvm::Value *> phis;
 
@@ -1206,18 +1217,13 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
       assert(phis.size() == 1 &&
              "expected one value to be yielded from the "
              "reduction neutral element declaration region");
-      builder.restoreIP(allocaIP);
 
       if (isByRef) {
-        // Allocate reduction variable (which is a pointer to the real reduciton
-        // variable allocated in the inlined region)
-        llvm::Value *var = builder.CreateAlloca(
-            moduleTranslation.convertType(reductionDecls[i].getType()));
         // Store the result of the inlined region to the allocated reduction var
         // ptr
-        builder.CreateStore(phis[0], var);
+        builder.CreateStore(phis[0], byRefVars[i]);
 
-        privateReductionVariables.push_back(var);
+        privateReductionVariables.push_back(byRefVars[i]);
         moduleTranslation.mapValue(reductionArgs[i], phis[0]);
         reductionVariableMap.try_emplace(opInst.getReductionVars()[i], phis[0]);
       } else {

@Leporacanthicus changed the title from "Fix for changed code at the end of AllocaIP." to "[Flang]Fix for changed code at the end of AllocaIP." on May 16, 2024
@Leporacanthicus (Contributor, Author)

I'm aware MLIR tests are failing, will try to fix tomorrow.

@@ -1194,6 +1194,17 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
MutableArrayRef<BlockArgument> reductionArgs =
opInst.getRegion().getArguments().take_back(
opInst.getNumReductionVars());

SmallVector<llvm::Value *> byRefVars;
if (isByRef) {

See #92244

@Meinersbur (Member)

Ideally, the only things added to the alloca block are allocas. Apparently in this case something is splitting the alloca block before the AllocaIP; do you know what exactly?

Resetting the insertion point to the beginning of the alloca block causes the inserted instructions to end up in reverse order. I don't think that's a big deal (unless an alloca size needs a value that was previously computed), but I would consider making a dedicated alloca block that is never split to be the cleaner solution. I created splitBB for this purpose. There is this code in createTeams:

  if (&OuterAllocaBB == Builder.GetInsertBlock()) {
    BasicBlock *BodyBB = splitBB(Builder, /*CreateBranch=*/true, "teams.entry");
    Builder.SetInsertPoint(BodyBB, BodyBB->begin());
  }

I'd prefer not to make the split conditional as it is here; an unconditional split adds another, possibly unnecessary, BB but is more robust.

What do you think?

@kiranchandramohan (Contributor)

Ideally, the only things added to the alloca block are allocas. Apparently in this case something is splitting the alloca block before the AllocaIP; do you know what exactly?

One possible reason is that for allocatable variables the privatization demands that we allocate only if the original variable is allocated. This introduces an if and hence multiple blocks.

@Meinersbur (Member)

One possible reason is that for allocatable variables the privatization demands that we allocate only if the original variable is allocated. This introduces an if and hence multiple blocks.

I strongly recommend not doing it this way. An unconditional alloca in the entry block has no overhead (it just moves the stack pointer for the required space). Doing it conditionally does introduce overhead. Even worse, some LLVM passes treat allocas in the entry block specially (e.g. the inliner moves them to the caller's entry block to avoid having to dynamically readjust the stack pointer; mem2reg; SROA), and adding a branch effectively pushes all allocas after the introduced branch into a non-entry block, resulting in worse optimization for them too.

See the discussions at https://discourse.llvm.org/t/alloca-outside-of-entry-block/11088/5, https://discourse.llvm.org/t/mem2reg-for-non-entry-blocks/26696/5

It would be better if the memory is allocated unconditionally, and just not used if the variable is not allocated.
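
A rough IRBuilder sketch of that shape (hypothetical helper and names, not from this patch; assumes the entry block already has a terminator and that the builder currently points at the end of an open block):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/IRBuilder.h"

  // Sketch: the private copy gets an unconditional entry-block alloca; only
  // the copy-in is guarded by the "is it allocated?" test.
  static llvm::AllocaInst *emitGuardedCopyIn(llvm::IRBuilderBase &Builder,
                                             llvm::BasicBlock *EntryBB,
                                             llvm::Type *BoxTy,
                                             llvm::Value *IsAllocated,
                                             llvm::Value *OrigBox) {
    llvm::IRBuilderBase::InsertPoint SavedIP = Builder.saveIP();
    Builder.SetInsertPoint(EntryBB->getTerminator());
    llvm::AllocaInst *Priv = Builder.CreateAlloca(BoxTy, nullptr, "priv.box");
    Builder.restoreIP(SavedIP);

    // Only the copy-in is conditional; the alloca above is not.
    llvm::Function *F = EntryBB->getParent();
    auto *CopyBB = llvm::BasicBlock::Create(F->getContext(), "copy.in", F);
    auto *ContBB = llvm::BasicBlock::Create(F->getContext(), "copy.cont", F);
    Builder.CreateCondBr(IsAllocated, CopyBB, ContBB);
    Builder.SetInsertPoint(CopyBB);
    Builder.CreateStore(Builder.CreateLoad(BoxTy, OrigBox), Priv);
    Builder.CreateBr(ContBB);
    Builder.SetInsertPoint(ContBB);
    return Priv;
  }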

@Leporacanthicus (Contributor, Author)

@Meinersbur The case is not that we put the alloca inside a conditional. We do something like "if thing.is_allocated() then allocate and copy into the private variable", and the alloca itself is done unconditionally. But that "if allocated" check means there is an alternative path, so new blocks get created, and the end marker used in the original block changes, so any previously stored insertion point becomes invalid.

I'm still trying to make this work correctly for all cases.

@Meinersbur (Member)

@Meinersbur The case is not that we put the alloca inside a conditional. We do something like "if thing.is_allocated() then allocate and copy into the private variable", and the alloca itself is done unconditionally. But that "if allocated" check means there is an alternative path, so new blocks get created, and the end marker used in the original block changes, so any previously stored insertion point becomes invalid.

In any case, the block that the conditional is inserted into should not be the alloca block. If there are alloca instructions after the split point, those allocas will be pushed out of the entry block. I would consider anything that inserts control flow into the AllocaIP to be fragile. If we only ever want to insert instructions to the end of the (current) alloca block, we should just pass the BasicBlock* everywhere, not an InsertionPoint.

@tblah (Contributor) left a comment

Thanks for the changes.

@Meinersbur thank you for your comments. I agree that it would be better for all allocas to come first in the entry block and not to be inserted after conditionals. But I think this is out of scope of this PR, which is trying to fix an assertion failure while building a real application.

Would it be acceptable for you if we addressed the alloca placement in a later patch?

@tblah tblah mentioned this pull request May 30, 2024
@Leporacanthicus (Contributor, Author)

@Meinersbur: I completely agree that this is far from a perfect solution. The trouble is that the "good solution" involves refactoring quite a bit of the infrastructure involved. I have started looking at that, but we would like to get something that works in now and do a proper refactoring later on.

@@ -0,0 +1,23 @@
!! The main point of this test is to check that the code compiles at all, so the
@ergawy (Member) commented Jun 3, 2024

From this comment:

I don't think we want the new AllocaIP to be in that second block, in case someone adds more alloca ops - they should go into the first block.
My PR, that Kiran mentioned, has several fixes of this kind [I will check if it fixes this case too]. Whenever possible, I try to make the new location at the terminator of the original AllocaIP block.

I looked at the generated IR of this test; you can check the CFG of the outlined parallel region here.

As you can see in the CFG, allocas are still scattered throughout the BBs of the outlined region's function. I think if we want to keep all allocas only in the entry block of the outlined function, carefully updating the insertion point alone will not be enough.

I think the problem is that we convert and inline entire "sub-regions" individually; for example, in this test we inline the whole logic/IR for A first and then F (or vice versa). So no matter where the IP is, since for allocatables there is an if-else condition to inline, the currently set IP will have to be split, and allocas that were created before will be pushed up or down into a different BB than the one we insert the new allocas into.

Instead, I think we should do this as a post-processing step after the region is completely emitted: hoist all the allocas to the entry BB of the outlined function once we finish emitting the outlined region. However, I don't know if there are any unseen issues here.
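
For reference, such a post-processing step might look roughly like this (an untested sketch, not part of this patch; it only moves fixed-size allocas, whose operands cannot depend on later instructions):

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"

  // Sketch: hoist constant-sized allocas from later blocks of an outlined
  // function into its entry block, after the allocas already there.
  static void hoistAllocasToEntry(llvm::Function &F) {
    llvm::BasicBlock &Entry = F.getEntryBlock();
    llvm::Instruction *InsertPt = &*Entry.getFirstNonPHIOrDbgOrAlloca();
    for (llvm::BasicBlock &BB : llvm::drop_begin(F))
      for (llvm::Instruction &I : llvm::make_early_inc_range(BB))
        if (auto *AI = llvm::dyn_cast<llvm::AllocaInst>(&I))
          if (llvm::isa<llvm::ConstantInt>(AI->getArraySize()))
            AI->moveBefore(InsertPt);
  }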

@Leporacanthicus (Contributor, Author)

I have created an internal ticket to improve the code generation. It's just a much bigger task than this one started out as. I then found more problems, so this took much longer than I expected.

@llvmbot added the clang (Clang issues not falling into any other category) label on Jun 4, 2024
@tblah (Contributor) left a comment

LGTM if @Meinersbur is happy. It's a shame about the fallout to the clang tests but I can see that would be very difficult to avoid.

Thank you for persisting with this difficult patch.

@Leporacanthicus Leporacanthicus requested a review from nikic June 6, 2024 10:15
@nikic (Contributor) previously requested changes on Jun 6, 2024

Allocas should be placed at the start of the entry block.

@Leporacanthicus (Contributor, Author)

Allocas should be placed at the start of the entry block.

Would it be acceptable to make this change such that the allocas are at the very start of the block? I have tried that, and it seems to work. It will still mean changes to the tests, and I can't guarantee ALL allocas for all Fortran (via MLIR) generated code will be at the start of the entry block.

We have a further ticket to clean this up; it's a bigger task, which we will do, but it will take a few weeks at least.

@nikic (Contributor) commented Jun 6, 2024

Allocas should be placed at the start of the entry block.

Would it be acceptable to make this change such that the allocas are at the very start of the block? I have tried that, and it seems to work. It will still mean changes to the tests, and I can't guarantee ALL allocas for all Fortran (via MLIR) generated code will be at the start of the entry block.

Yes, that should be fine.

You could also use something like getFirstNonPHIOrDbgOrAlloca to compute the insert point after the current allocas. (Calling this many times could have quadratic runtime though -- I don't have enough context here to say whether that's a problem or not.)
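
A minimal sketch of that suggestion (hypothetical helper, not from this patch; assumes the LLVM C++ API where getFirstNonPHIOrDbgOrAlloca returns a block iterator):

  #include "llvm/IR/IRBuilder.h"

  // Sketch: create a new alloca just after the allocas already sitting at
  // the top of the entry block. Note this rescans the block on every call.
  static llvm::AllocaInst *createEntryAlloca(llvm::Function &F, llvm::Type *Ty,
                                             const llvm::Twine &Name) {
    llvm::BasicBlock &Entry = F.getEntryBlock();
    llvm::IRBuilder<> B(&Entry, Entry.getFirstNonPHIOrDbgOrAlloca());
    return B.CreateAlloca(Ty, /*ArraySize=*/nullptr, Name);
  }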

Add a second reduction of a by_val type, so that we can test
for the most recent bug found. Also, since the test is now
different, rename it.

This test is going all the way to LLVM-IR to ensure the whole
code generation works here.
Also fix some tests, as this moves some allocas to an earlier
location.
@nikic nikic dismissed their stale review June 7, 2024 08:20

My concern was addressed -- and I don't have the OpenMP context to comment on the broader change.

@Leporacanthicus (Contributor, Author)

@ergawy Could you take a look at this, given that you did something similar [even if it was much smaller] recently?

@ye-luo (Contributor) commented Jun 17, 2024

progress?

@Leporacanthicus (Contributor, Author)

progress?

I'm afraid I'm still waiting for review from some of the key people here. Apparently there was some OpenMP conference last week, which kept some people very busy, so I'm hoping by the time they've read emails and got over jet-lag, etc, they'll get round to this.

@ergawy (Member) commented Jun 18, 2024

@ergawy Could you take a look at this, given that you did something similar [even if it was much smaller] recently?

Sorry, this totally slipped my mind. I will take a look today.

@ergawy (Member) left a comment

LGTM

@Leporacanthicus Leporacanthicus merged commit e5f1639 into llvm:main Jun 18, 2024
7 checks passed
tblah added a commit to tblah/llvm-project that referenced this pull request Jun 19, 2024
It was incorrect to set the insertion point to the init block after
inlining the initialization region because the code generated in the
init block depends upon the value yielded from the init region. When
there were multiple reduction initialization regions each with multiple
blocks, this could lead to the initialization region being inlined after
the init block which depends upon it.

Moving the insertion point to before inlining the initialization block
turned up further issues around the handling of the terminator for the
initialization block, which are also fixed here.

This fixes a bug in llvm#92430 (but the affected code couldn't compile
before llvm#92430 anyway).
tblah added a commit that referenced this pull request Jun 21, 2024
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
Labels
clang:openmp (OpenMP related changes to Clang), clang (Clang issues not falling into any other category), flang:fir-hlfir, flang:openmp, flang (Flang issues not falling into any other category), mlir:llvm, mlir:openmp, mlir
8 participants