[Flang]Fix for changed code at the end of AllocaIP. #92430

Leporacanthicus · 2024-05-16T17:02:51Z

Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching.

The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP.

Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created).

A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.

llvmbot · 2024-05-16T17:03:21Z

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-mlir-openmp
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-flang-fir-hlfir

@llvm/pr-subscribers-flang-openmp

Author: Mats Petersson (Leporacanthicus)

Changes

Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching or something like that.

The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP.

Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created).

A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.

Full diff: https://github.com/llvm/llvm-project/pull/92430.diff

3 Files Affected:

(added) flang/test/Lower/OpenMP/parallel-reduction-allocate.f90 (+23)
(modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+8-3)
(modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+13-7)

diff --git a/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90 b/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90
new file mode 100644
index 0000000000000..fddce25ae22cc
--- /dev/null
+++ b/flang/test/Lower/OpenMP/parallel-reduction-allocate.f90
@@ -0,0 +1,23 @@
+!! The main point of this test is to check that the code compiles at all, so the
+!! checking is not very detailed. Not hitting an assert, crashing or otherwise failing
+!! to compile is the key point. Also, emitting llvm is required for this to happen.
+! RUN: %flang_fc1 -emit-llvm -fopenmp -o - %s 2>&1 | FileCheck %s
+subroutine proc
+  implicit none
+  real(8),allocatable :: F(:)
+  real(8),allocatable :: A(:)
+
+!$omp parallel private(A) reduction(+:F)
+  allocate(A(10))
+!$omp end parallel
+end subroutine proc
+
+!CHECK-LABEL: define void @proc_()
+!CHECK: call void
+!CHECK-SAME: @__kmpc_fork_call(ptr {{.*}}, i32 1, ptr @[[OMP_PAR:.*]], {{.*}})
+
+!CHECK: define internal void @[[OMP_PAR]]
+!CHECK: omp.par.region8:
+!CHECK-NEXT: call ptr @malloc
+!CHECK-SAME: i64 10
+
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 9f0c07ef0da91..c0f4b5b066097 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -1391,7 +1391,8 @@ IRBuilder<>::InsertPoint OpenMPIRBuilder::createParallel(
 
   // Change the location to the outer alloca insertion point to create and
   // initialize the allocas we pass into the parallel region.
-  Builder.restoreIP(OuterAllocaIP);
+  InsertPointTy NewOuter(OuterAllocaBlock, OuterAllocaBlock->begin());
+  Builder.restoreIP(NewOuter);
   AllocaInst *TIDAddrAlloca = Builder.CreateAlloca(Int32, nullptr, "tid.addr");
   AllocaInst *ZeroAddrAlloca =
       Builder.CreateAlloca(Int32, nullptr, "zero.addr");
@@ -2135,7 +2136,8 @@ OpenMPIRBuilder::InsertPointTy OpenMPIRBuilder::createReductions(
   // values.
   unsigned NumReductions = ReductionInfos.size();
   Type *RedArrayTy = ArrayType::get(Builder.getPtrTy(), NumReductions);
-  Builder.restoreIP(AllocaIP);
+  Builder.SetInsertPoint(AllocaIP.getBlock()->getTerminator());
+  //  Builder.restoreIP(AllocaIP);
   Value *RedArray = Builder.CreateAlloca(RedArrayTy, nullptr, "red.array");
 
   Builder.SetInsertPoint(InsertBlock, InsertBlock->end());
@@ -2536,7 +2538,10 @@ OpenMPIRBuilder::applyStaticWorkshareLoop(DebugLoc DL, CanonicalLoopInfo *CLI,
       getOrCreateRuntimeFunction(M, omp::OMPRTL___kmpc_for_static_fini);
 
   // Allocate space for computed loop bounds as expected by the "init" function.
-  Builder.restoreIP(AllocaIP);
+
+  //  Builder.restoreIP(AllocaIP);
+  Builder.SetInsertPoint(AllocaIP.getBlock()->getTerminator());
+
   Type *I32Type = Type::getInt32Ty(M.getContext());
   Value *PLastIter = Builder.CreateAlloca(I32Type, nullptr, "p.lastiter");
   Value *PLowerBound = Builder.CreateAlloca(IVTy, nullptr, "p.lowerbound");
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index bfd7d65912bdb..b5f46b641f596 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -1194,6 +1194,17 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
     MutableArrayRef<BlockArgument> reductionArgs =
         opInst.getRegion().getArguments().take_back(
             opInst.getNumReductionVars());
+
+    SmallVector<llvm::Value *> byRefVars;
+    if (isByRef) {
+      for (unsigned i = 0; i < opInst.getNumReductionVars(); ++i) {
+        // Allocate reduction variable (which is a pointer to the real reduciton
+        // variable allocated in the inlined region)
+        byRefVars.push_back(builder.CreateAlloca(
+            moduleTranslation.convertType(reductionDecls[i].getType())));
+      }
+    }
+
     for (unsigned i = 0; i < opInst.getNumReductionVars(); ++i) {
       SmallVector<llvm::Value *> phis;
 
@@ -1206,18 +1217,13 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
       assert(phis.size() == 1 &&
              "expected one value to be yielded from the "
              "reduction neutral element declaration region");
-      builder.restoreIP(allocaIP);
 
       if (isByRef) {
-        // Allocate reduction variable (which is a pointer to the real reduciton
-        // variable allocated in the inlined region)
-        llvm::Value *var = builder.CreateAlloca(
-            moduleTranslation.convertType(reductionDecls[i].getType()));
         // Store the result of the inlined region to the allocated reduction var
         // ptr
-        builder.CreateStore(phis[0], var);
+        builder.CreateStore(phis[0], byRefVars[i]);
 
-        privateReductionVariables.push_back(var);
+        privateReductionVariables.push_back(byRefVars[i]);
         moduleTranslation.mapValue(reductionArgs[i], phis[0]);
         reductionVariableMap.try_emplace(opInst.getReductionVars()[i], phis[0]);
       } else {

Leporacanthicus · 2024-05-16T18:51:40Z

I'm aware MLIR tests are failing, will try to fix tomorrow.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp

tblah · 2024-05-17T09:51:04Z

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp

@@ -1194,6 +1194,17 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
    MutableArrayRef<BlockArgument> reductionArgs =
        opInst.getRegion().getArguments().take_back(
            opInst.getNumReductionVars());
+
+    SmallVector<llvm::Value *> byRefVars;
+    if (isByRef) {


llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp

Meinersbur · 2024-05-17T12:28:05Z

Ideally, the only thing added to the alloca block are allocas. Apparently in this case something is splitting the alloca block before the AllocaIP, do you know what exactly?

Resetting the insertion point to the beginning of the alloca block causes the inserted instructions to be reverser order. I don't think that's a big deal (unless the alloca size needs a value that has previously computed), but I would consider making a dedicated alloca block that is never split to be the cleaner solution. I created splitBB for this purpose. There is this code in createTeams:

  if (&OuterAllocaBB == Builder.GetInsertBlock()) {
    BasicBlock *BodyBB = splitBB(Builder, /*CreateBranch=*/true, "teams.entry");
    Builder.SetInsertPoint(BodyBB, BodyBB->begin());
  }

I'd prefer to not make the split conditional as it is here, it adds another possibly unnecessary BB but is more robust.

What do you think?

kiranchandramohan · 2024-05-18T14:51:52Z

Ideally, the only thing added to the alloca block are allocas. Apparently in this case something is splitting the alloca block before the AllocaIP, do you know what exactly?

One possible reason is that for allocatable variables the privatization demands that we allocate only if the original variable is allocated. This introduces an if and hence multiple blocks.

Meinersbur · 2024-05-21T10:06:44Z

One possible reason is that for allocatable variables the privatization demands that we allocate only if the original variable is allocated. This introduces an if and hence multiple blocks.

I strongly recommend not doing it this way. An unconditional alloca in the entry block has no overhead (it just moves the stack pointer for the required space). Doing it conditionally does introduces overhead. Even worse, some LLVM passes treat allocas in the entry block specially (e.g. the inliner moves them to the caller's entry block to avoid having to dynamically readjust the stack pointer; mem2reg; SROA) and adding a branch will effectively pushes all allocas after the introduced branch into a non-entry block and thus worse optimization for them too.

See the discussions at https://discourse.llvm.org/t/alloca-outside-of-entry-block/11088/5, https://discourse.llvm.org/t/mem2reg-for-non-entry-blocks/26696/5

It would be better if the memory is allocated unconditionally, and just not used it if the variable not allocated.

Leporacanthicus · 2024-05-21T17:22:44Z

@Meinersbur The case is not that we are using alloca in conditionals - but that we do something like if thing.is_allocated() then allocate-and copy into private variable - we do the alloca, unconditionally. But those if allocated means there's an alternative path, so there's new blocks being created. And this means the end marker used in the original block gets changed so any previously stored insertion point becomes invalid.

I'm still trying to make this work correctly for all cases.

Meinersbur · 2024-05-22T12:53:40Z

@Meinersbur The case is not that we are using alloca in conditionals - but that we do something like if thing.is_allocated() then allocate-and copy into private variable - we do the alloca, unconditionally. But those if allocated means there's an alternative path, so there's new blocks being created. And this means the end marker used in the original block gets changed so any previously stored insertion point becomes invalid.

In any case, the block that the conditional is inserted into should not be the alloca block. If there are alloca instructions after the split point, those allocas will be pushed out of the entry block. I would consider anything that inserts control flow into the AllocaIP to be fragile. If we only ever want to insert instructions to the end of the (current) alloca block, we should just pass the BasicBlock* everywhere, not an InsertionPoint.

tblah

Thanks for the changes.

@Meinersbur thank you for your comments. I agree that it would be better for all allocas to come first in the entry block and not to be inserted after conditionals. But I think this is out of scope of this PR, which is trying to fix an assertion failure while building a real application.

Would it be acceptable for you if we addressed the alloca placement in a later patch?

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp

Leporacanthicus · 2024-05-30T13:57:06Z

@Meinersbur : I completely agree that this is far from a perfect solution. The trouble is that the "good solution" involves refactoring quite a bit of the infrastructure involved to fix that. Which I started looking at, but we're would like to get something that works in now, and make a proper refactoring later on.

ergawy · 2024-06-03T13:16:26Z

flang/test/Lower/OpenMP/parallel-reduction-allocate.f90

@@ -0,0 +1,23 @@
+!! The main point of this test is to check that the code compiles at all, so the


From this comment:

I don't think we want the new AllocaIP to be in that second block, in case someone adds more alloca ops - they should go into the first block.
My PR, that Kiran mentioned, has several of this kind of fixes [I will check if it fixes this case too,]. Whenever possible, I try to make the new location at the terminator of the original AllocaIP block.

I looked the generated IR of this test, you can check the CFG of the outlined parallel region here.

As you can see in the CFG, allocas are still scattered through out the BBs of the outlined region's function. I think if we want to keep all allocas only at the entry block of the outlined function, only carefully updating the insertion point will not be enough.

I think the problem is that we convert and inlined entire "sub-regions" individually; for example, in this test we inline the whole logic/IR for A first and then F (or vice versa). So no matter where the IP is, since for allocatables there is if-else condtion to inline, the currently set IP will have to be split and allocas that were created before will either be pushed up or down to a different BB than the one we will insert the new allocas to.

Instead, I think we should do that a post-processing step after the region is completely emitted. So we hoist the all allocas after we finish emitting the outlined region to the entry BB of the outlined function. However, I don't know if there are any unseen issues here.

I have created an internal ticket to improve the code generation. It's just a much bigger task than this one started out as. I then found more problems, so this took much longer than I expected.

Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching or something like that. The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP. Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created). A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.

tblah

LGTM if @Meinersbur is happy. It's a shame about the fallout to the clang tests but I can see that would be very difficult to avoid.

Thank you for persisting with this difficult patch.

nikic

Allocas should be placed at the start of the entry block.

Leporacanthicus · 2024-06-06T14:12:32Z

Allocas should be placed at the start of the entry block.

Would it be acceptable to make this change such that the alloca's are at the very start of the block? I have tried that, and it seems to work. It will still mean changes to the tests, and I can't guarantee ALL alloca's for all Fortran (via MLIR) generated code will be at the start of the entry block.

We have a further ticket to clean this up, but it's a bigger task, which we will do, but it will take a few weeks at least.

nikic · 2024-06-06T14:17:53Z

Allocas should be placed at the start of the entry block.

Would it be acceptable to make this change such that the alloca's are at the very start of the block? I have tried that, and it seems to work. It will still mean changes to the tests, and I can't guarantee ALL alloca's for all Fortran (via MLIR) generated code will be at the start of the entry block.

Yes, that should be fine.

You could also use something like getFirstNonPHIOrDbgOrAlloca to compute the insert point after the current allocas. (Calling this many times could have quadratic runtime though -- I don't have enough context here to say whether that's a problem or not.)

Add a second reduction of a by_val type, so that we can test for the most recent bug found. Also, since the test is now different, rename it. This test is going all the way to LLVM-IR to ensure the whole codegeneration works here.

Also fix some tests, as this moves some allocas to an earlier location.

My concern was addressed -- and I don't have the OpenMP context to comment on the broader change.

Leporacanthicus · 2024-06-07T15:27:01Z

@ergawy Could you take a look at this, given that you did something similar [even if it was much smaller] recently?

ye-luo · 2024-06-17T14:50:47Z

progress?

Leporacanthicus · 2024-06-17T19:59:39Z

progress?

I'm afraid I'm still waiting for review from some of the key people here. Apparently there was some OpenMP conference last week, which kept some people very busy, so I'm hoping by the time they've read emails and got over jet-lag, etc, they'll get round to this.

ergawy · 2024-06-18T03:12:38Z

@ergawy Could you take a look at this, given that you did something similar [even if it was much smaller] recently?

Sorry, this totally slipped my mind. I will take a look today.

ergawy

LGTM

It was incorrect to set the insertion point to the init block after inlining the initialization region because the code generated in the init block depends upon the value yielded from the init region. When there were multiple reduction initialization regions each with multiple blocks, this could lead to the initilization region being inlined after the init block which depends upon it. Moving the insertion point to before inlining the initialization block turned up further issues around the handling of the terminator for the initialization block, which are also fixed here. This fixes a bug in llvm#92430 (but the affected code couldn't compile before llvm#92430 anyway).

It was incorrect to set the insertion point to the init block after inlining the initialization region because the code generated in the init block depends upon the value yielded from the init region. When there were multiple reduction initialization regions each with multiple blocks, this could lead to the initilization region being inlined after the init block which depends upon it. Moving the insertion point to before inlining the initialization block turned up further issues around the handling of the terminator for the initialization block, which are also fixed here. This fixes a bug in #92430 (but the affected code couldn't compile before #92430 anyway).

Some of the OpenMP code can change the instruction pointed at by the insertion point. This leads to an assert in the compiler about BB->getParent() and IP->getParent() not matching. The fix is to rebuild the insertionpoint from the block, rather than use builder.restoreIP. Also, move some of the alloca generation, rather than skipping back and forth between insert points (and ensure all the allocas are done before their users are created). A simple test, mainly to ensure the minimal reproducer doesn't fail to compile in the future is also added.

…6052) It was incorrect to set the insertion point to the init block after inlining the initialization region because the code generated in the init block depends upon the value yielded from the init region. When there were multiple reduction initialization regions each with multiple blocks, this could lead to the initilization region being inlined after the init block which depends upon it. Moving the insertion point to before inlining the initialization block turned up further issues around the handling of the terminator for the initialization block, which are also fixed here. This fixes a bug in llvm#92430 (but the affected code couldn't compile before llvm#92430 anyway).

Leporacanthicus requested review from tblah and kiranchandramohan May 16, 2024 17:02

llvmbot added mlir:llvm mlir flang Flang issues not falling into any other category mlir:openmp flang:fir-hlfir flang:openmp clang:openmp OpenMP related changes to Clang labels May 16, 2024

Leporacanthicus requested a review from pawosm-arm May 16, 2024 17:03

Leporacanthicus requested review from DavidTruby and kparzysz May 16, 2024 17:03

Leporacanthicus changed the title ~~Fix for changed code at the end of AllocaIP.~~ [Flang]Fix for changed code at the end of AllocaIP. May 16, 2024

tblah reviewed May 17, 2024

View reviewed changes

kiranchandramohan requested a review from Meinersbur May 17, 2024 09:59

tblah reviewed May 29, 2024

View reviewed changes

tblah mentioned this pull request May 30, 2024

[Flang] OpenMP ICE #93727

Closed

kiranchandramohan mentioned this pull request May 31, 2024

[OpenMP][LLVM] Update alloca IP after PrivCB in OMPIRBUIlder #93920

Merged

kiranchandramohan requested a review from ergawy May 31, 2024 07:38

ergawy reviewed Jun 3, 2024

View reviewed changes

Leporacanthicus force-pushed the openmp-insert-points branch from b0875c4 to 535d84e Compare June 4, 2024 16:43

llvmbot added the clang Clang issues not falling into any other category label Jun 4, 2024

Leporacanthicus added 3 commits June 5, 2024 10:50

Make it work for both MLIR and Flang tests

5cd1a43

Review fixes

8bd7596

Leporacanthicus force-pushed the openmp-insert-points branch from 535d84e to 2577a63 Compare June 5, 2024 19:15

tblah approved these changes Jun 6, 2024

View reviewed changes

Leporacanthicus requested a review from nikic June 6, 2024 10:15

nikic previously requested changes Jun 6, 2024

View reviewed changes

Leporacanthicus added 3 commits June 6, 2024 17:00

Fix bugs

a07d09a

Rename and improve test

8663c66

Add a second reduction of a by_val type, so that we can test for the most recent bug found. Also, since the test is now different, rename it. This test is going all the way to LLVM-IR to ensure the whole codegeneration works here.

Select different place for allocas

243e5fa

Also fix some tests, as this moves some allocas to an earlier location.

Leporacanthicus force-pushed the openmp-insert-points branch from 2577a63 to 243e5fa Compare June 6, 2024 20:41

tblah approved these changes Jun 7, 2024

View reviewed changes

kiranchandramohan mentioned this pull request Jun 12, 2024

[flang][OpenMP] Enable delayed privatization for omp parallel by default #90945

Merged

ergawy approved these changes Jun 18, 2024

View reviewed changes

Leporacanthicus merged commit e5f1639 into llvm:main Jun 18, 2024
7 checks passed

tblah mentioned this pull request Jun 19, 2024

[mlir][LLVMIR][OpenMP] fix dominance for reduction init block #96052

Merged

		@@ -0,0 +1,23 @@
		!! The main point of this test is to check that the code compiles at all, so the

[Flang]Fix for changed code at the end of AllocaIP. #92430

[Flang]Fix for changed code at the end of AllocaIP. #92430

Uh oh!

Conversation

Leporacanthicus commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Leporacanthicus commented May 16, 2024

Uh oh!

Uh oh!

Uh oh!

tblah May 17, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Meinersbur commented May 17, 2024

Uh oh!

kiranchandramohan commented May 18, 2024

Uh oh!

Meinersbur commented May 21, 2024

Uh oh!

Leporacanthicus commented May 21, 2024

Uh oh!

Meinersbur commented May 22, 2024

Uh oh!

tblah left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Leporacanthicus commented May 30, 2024

Uh oh!

ergawy Jun 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Leporacanthicus Jun 5, 2024

Choose a reason for hiding this comment

Uh oh!

tblah left a comment

Choose a reason for hiding this comment

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Leporacanthicus commented Jun 6, 2024

Uh oh!

nikic commented Jun 6, 2024

Uh oh!

Leporacanthicus commented Jun 7, 2024

Uh oh!

ye-luo commented Jun 17, 2024

Uh oh!

Leporacanthicus commented Jun 17, 2024

Uh oh!

ergawy commented Jun 18, 2024

Uh oh!

ergawy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Leporacanthicus commented May 16, 2024 •

edited

Loading

llvmbot commented May 16, 2024 •

edited

Loading

ergawy Jun 3, 2024 •

edited

Loading