Skip to content

[LV] Use frozen start value for FindLastIV if needed. #132691

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Apr 4, 2025

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Mar 24, 2025

FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

Alive2 showing the issue: https://alive2.llvm.org/ce/z/mB4Jtb

Depends on #132689
(included in PR) and #132690
(included in PR)

Fixes #126836

fhahn added 4 commits March 24, 2025 07:52
This moves the logic for computing the FindLastIV reduction result to
its own opcode. A follow-up patch will update the new opcode to also
take the start value, to fix
llvm#126836.
Keep the start value as operand of ComputeFindLastIVResult. A follow-up
patch will use this to make sure the start value is frozen if needed.
FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

Alive2 showing the issue: https://alive2.llvm.org/ce/z/mB4Jtb
@llvmbot
Copy link
Member

llvmbot commented Mar 24, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

Alive2 showing the issue: https://alive2.llvm.org/ce/z/mB4Jtb

Depends on #132689
(included in PR) and #132690
(included in PR)

Fixes #126836


Patch is 48.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132691.diff

10 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/LoopUtils.h (+1-1)
  • (modified) llvm/lib/Transforms/Utils/LoopUtils.cpp (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+74-30)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h (+17)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+39-9)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp (+3-1)
  • (modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-no-wrap.ll (+6-4)
  • (modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp.ll (+42-30)
diff --git a/llvm/include/llvm/Transforms/Utils/LoopUtils.h b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
index 193f505fb03fe..416a0a70325d1 100644
--- a/llvm/include/llvm/Transforms/Utils/LoopUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
@@ -423,7 +423,7 @@ Value *createAnyOfReduction(IRBuilderBase &B, Value *Src,
 /// Create a reduction of the given vector \p Src for a reduction of the
 /// kind RecurKind::IFindLastIV or RecurKind::FFindLastIV. The reduction
 /// operation is described by \p Desc.
-Value *createFindLastIVReduction(IRBuilderBase &B, Value *Src,
+Value *createFindLastIVReduction(IRBuilderBase &B, Value *Src, Value *Start,
                                  const RecurrenceDescriptor &Desc);
 
 /// Create an ordered reduction intrinsic using the given recurrence
diff --git a/llvm/lib/Transforms/Utils/LoopUtils.cpp b/llvm/lib/Transforms/Utils/LoopUtils.cpp
index 2e7685254f512..f57d95e7722dc 100644
--- a/llvm/lib/Transforms/Utils/LoopUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUtils.cpp
@@ -1233,11 +1233,11 @@ Value *llvm::createAnyOfReduction(IRBuilderBase &Builder, Value *Src,
 }
 
 Value *llvm::createFindLastIVReduction(IRBuilderBase &Builder, Value *Src,
+                                       Value *Start,
                                        const RecurrenceDescriptor &Desc) {
   assert(RecurrenceDescriptor::isFindLastIVRecurrenceKind(
              Desc.getRecurrenceKind()) &&
          "Unexpected reduction kind");
-  Value *StartVal = Desc.getRecurrenceStartValue();
   Value *Sentinel = Desc.getSentinelValue();
   Value *MaxRdx = Src->getType()->isVectorTy()
                       ? Builder.CreateIntMaxReduce(Src, true)
@@ -1246,7 +1246,7 @@ Value *llvm::createFindLastIVReduction(IRBuilderBase &Builder, Value *Src,
   // reduction is sentinel value.
   Value *Cmp =
       Builder.CreateCmp(CmpInst::ICMP_NE, MaxRdx, Sentinel, "rdx.select.cmp");
-  return Builder.CreateSelect(Cmp, MaxRdx, StartVal, "rdx.select");
+  return Builder.CreateSelect(Cmp, MaxRdx, Start, "rdx.select");
 }
 
 Value *llvm::getReductionIdentity(Intrinsic::ID RdxID, Type *Ty,
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 92160a421e59c..d5949e3d52918 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7612,7 +7612,8 @@ static void fixReductionScalarResumeWhenVectorizingEpilog(
     BasicBlock *BypassBlock) {
   auto *EpiRedResult = dyn_cast<VPInstruction>(R);
   if (!EpiRedResult ||
-      EpiRedResult->getOpcode() != VPInstruction::ComputeReductionResult)
+      (EpiRedResult->getOpcode() != VPInstruction::ComputeReductionResult &&
+       EpiRedResult->getOpcode() != VPInstruction::ComputeFindLastIVResult))
     return;
 
   auto *EpiRedHeaderPhi =
@@ -7633,14 +7634,17 @@ static void fixReductionScalarResumeWhenVectorizingEpilog(
   } else if (RecurrenceDescriptor::isFindLastIVRecurrenceKind(
                  RdxDesc.getRecurrenceKind())) {
     using namespace llvm::PatternMatch;
-    Value *Cmp, *OrigResumeV;
+    Value *Cmp, *OrigResumeV, *CmpOp;
     bool IsExpectedPattern =
         match(MainResumeValue, m_Select(m_OneUse(m_Value(Cmp)),
                                         m_Specific(RdxDesc.getSentinelValue()),
                                         m_Value(OrigResumeV))) &&
-        match(Cmp,
-              m_SpecificICmp(ICmpInst::ICMP_EQ, m_Specific(OrigResumeV),
-                             m_Specific(RdxDesc.getRecurrenceStartValue())));
+        (match(Cmp, m_SpecificICmp(ICmpInst::ICMP_EQ, m_Specific(OrigResumeV),
+                                   m_Value(CmpOp))) &&
+         (match(CmpOp,
+                m_Freeze(m_Specific(RdxDesc.getRecurrenceStartValue()))) ||
+          (CmpOp == RdxDesc.getRecurrenceStartValue() &&
+           isGuaranteedNotToBeUndefOrPoison(CmpOp))));
     assert(IsExpectedPattern && "Unexpected reduction resume pattern");
     (void)IsExpectedPattern;
     MainResumeValue = OrigResumeV;
@@ -9817,8 +9821,10 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
           Builder.createSelect(Cond, OrigExitingVPV, PhiR, {}, "", FMFs);
       OrigExitingVPV->replaceUsesWithIf(NewExitingVPV, [](VPUser &U, unsigned) {
         return isa<VPInstruction>(&U) &&
-               cast<VPInstruction>(&U)->getOpcode() ==
-                   VPInstruction::ComputeReductionResult;
+               (cast<VPInstruction>(&U)->getOpcode() ==
+                    VPInstruction::ComputeReductionResult ||
+                cast<VPInstruction>(&U)->getOpcode() ==
+                    VPInstruction::ComputeFindLastIVResult);
       });
       if (CM.usePredicatedReductionSelect(
               PhiR->getRecurrenceDescriptor().getOpcode(), PhiTy))
@@ -9861,10 +9867,36 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     // bc.merge.rdx phi nodes, hence it needs to be created unconditionally here
     // even for in-loop reductions, until the reduction resume value handling is
     // also modeled in VPlan.
+    VPInstruction *FinalReductionResult;
     VPBuilder::InsertPointGuard Guard(Builder);
-    Builder.setInsertPoint(MiddleVPBB, IP);
-    auto *FinalReductionResult = Builder.createNaryOp(
-        VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
+    if (RecurrenceDescriptor::isFindLastIVRecurrenceKind(
+            RdxDesc.getRecurrenceKind())) {
+      VPValue *Start = PhiR->getStartValue();
+      if (!isGuaranteedNotToBeUndefOrPoison(
+              PhiR->getStartValue()->getLiveInIRValue())) {
+        Builder.setInsertPoint(cast<VPBasicBlock>(Plan->getEntry()));
+        Start = Builder.createNaryOp(Instruction::Freeze, {Start}, {}, "fr");
+      }
+      Builder.setInsertPoint(MiddleVPBB, IP);
+      FinalReductionResult =
+          Builder.createNaryOp(VPInstruction::ComputeFindLastIVResult,
+                               {PhiR, Start, NewExitingVPV}, ExitDL);
+      // Update all users outside the vector region.
+      for (VPUser *U : to_vector(OrigExitingVPV->users())) {
+        auto *R = cast<VPRecipeBase>(U);
+        if (R->getParent() && R->getParent()->getParent())
+          continue;
+
+        for (unsigned Idx = 0; Idx != R->getNumOperands(); ++Idx) {
+          if (R->getOperand(Idx) == PhiR->getStartValue())
+            R->setOperand(Idx, Start);
+        }
+      }
+    } else {
+      Builder.setInsertPoint(MiddleVPBB, IP);
+      FinalReductionResult = Builder.createNaryOp(
+          VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
+    }
     // Update all users outside the vector region.
     OrigExitingVPV->replaceUsesWithIf(
         FinalReductionResult, [FinalReductionResult](VPUser &User, unsigned) {
@@ -10353,24 +10385,7 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
   VPBasicBlock *Header = VectorLoop->getEntryBasicBlock();
   Header->setName("vec.epilog.vector.body");
 
-  // Re-use the trip count and steps expanded for the main loop, as
-  // skeleton creation needs it as a value that dominates both the scalar
-  // and vector epilogue loops
-  // TODO: This is a workaround needed for epilogue vectorization and it
-  // should be removed once induction resume value creation is done
-  // directly in VPlan.
-  for (auto &R : make_early_inc_range(*Plan.getEntry())) {
-    auto *ExpandR = dyn_cast<VPExpandSCEVRecipe>(&R);
-    if (!ExpandR)
-      continue;
-    auto *ExpandedVal =
-        Plan.getOrAddLiveIn(ExpandedSCEVs.find(ExpandR->getSCEV())->second);
-    ExpandR->replaceAllUsesWith(ExpandedVal);
-    if (Plan.getTripCount() == ExpandR)
-      Plan.resetTripCount(ExpandedVal);
-    ExpandR->eraseFromParent();
-  }
-
+  DenseMap<Value *, Value *> ToFrozen;
   // Ensure that the start values for all header phi recipes are updated before
   // vectorizing the epilogue loop.
   for (VPRecipeBase &R : Header->phis()) {
@@ -10436,6 +10451,10 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
         ResumeV =
             Builder.CreateICmpNE(ResumeV, RdxDesc.getRecurrenceStartValue());
       } else if (RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) {
+        ToFrozen[RdxDesc.getRecurrenceStartValue()] =
+            cast<PHINode>(ResumeV)->getIncomingValueForBlock(
+                EPI.MainLoopIterationCountCheck);
+
         // VPReductionPHIRecipe for FindLastIV reductions requires an adjustment
         // to the resume value. The resume value is adjusted to the sentinel
         // value when the final value from the main vector loop equals the start
@@ -10444,8 +10463,8 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
         // variable.
         BasicBlock *ResumeBB = cast<Instruction>(ResumeV)->getParent();
         IRBuilder<> Builder(ResumeBB, ResumeBB->getFirstNonPHIIt());
-        Value *Cmp =
-            Builder.CreateICmpEQ(ResumeV, RdxDesc.getRecurrenceStartValue());
+        Value *Cmp = Builder.CreateICmpEQ(
+            ResumeV, ToFrozen[RdxDesc.getRecurrenceStartValue()]);
         ResumeV =
             Builder.CreateSelect(Cmp, RdxDesc.getSentinelValue(), ResumeV);
       }
@@ -10461,6 +10480,31 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
     VPValue *StartVal = Plan.getOrAddLiveIn(ResumeV);
     cast<VPHeaderPHIRecipe>(&R)->setStartValue(StartVal);
   }
+
+  // Re-use the trip count and steps expanded for the main loop, as
+  // skeleton creation needs it as a value that dominates both the scalar
+  // and vector epilogue loops
+  // TODO: This is a workaround needed for epilogue vectorization and it
+  // should be removed once induction resume value creation is done
+  // directly in VPlan.
+  for (auto &R : make_early_inc_range(*Plan.getEntry())) {
+    auto *VPI = dyn_cast<VPInstruction>(&R);
+    if (VPI) {
+      VPI->replaceAllUsesWith(Plan.getOrAddLiveIn(
+          ToFrozen[VPI->getOperand(0)->getLiveInIRValue()]));
+      continue;
+    }
+
+    auto *ExpandR = dyn_cast<VPExpandSCEVRecipe>(&R);
+    if (!ExpandR)
+      continue;
+    auto *ExpandedVal =
+        Plan.getOrAddLiveIn(ExpandedSCEVs.find(ExpandR->getSCEV())->second);
+    ExpandR->replaceAllUsesWith(ExpandedVal);
+    if (Plan.getTripCount() == ExpandR)
+      Plan.resetTripCount(ExpandedVal);
+    ExpandR->eraseFromParent();
+  }
 }
 
 // Generate bypass values from the additional bypass block. Note that when the
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 3059b87ae63c8..64e7f2bddb668 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -866,6 +866,7 @@ class VPInstruction : public VPRecipeWithIRFlags,
     BranchOnCount,
     BranchOnCond,
     Broadcast,
+    ComputeFindLastIVResult,
     ComputeReductionResult,
     // Takes the VPValue to extract from as first operand and the lane or part
     // to extract as second operand, counting from the end starting with 1 for
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 38bec733dbf73..24a166bd336d1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -51,6 +51,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
 
   switch (Opcode) {
   case Instruction::ExtractElement:
+  case Instruction::Freeze:
     return inferScalarType(R->getOperand(0));
   case Instruction::Select: {
     Type *ResTy = inferScalarType(R->getOperand(1));
@@ -66,6 +67,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
                inferScalarType(R->getOperand(1)) &&
            "different types inferred for different operands");
     return IntegerType::get(Ctx, 1);
+  case VPInstruction::ComputeFindLastIVResult:
   case VPInstruction::ComputeReductionResult: {
     auto *PhiR = cast<VPReductionPHIRecipe>(R->getOperand(0));
     auto *OrigPhi = cast<PHINode>(PhiR->getUnderlyingValue());
diff --git a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
index 8c11d93734667..3a727866a2875 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
@@ -216,6 +216,16 @@ using BinaryVPInstruction_match =
     BinaryRecipe_match<Op0_t, Op1_t, Opcode, /*Commutative*/ false,
                        VPInstruction>;
 
+template <typename Op0_t, typename Op1_t, typename Op2_t, unsigned Opcode,
+          bool Commutative, typename... RecipeTys>
+using TernaryRecipe_match = Recipe_match<std::tuple<Op0_t, Op1_t, Op2_t>,
+                                         Opcode, Commutative, RecipeTys...>;
+
+template <typename Op0_t, typename Op1_t, typename Op2_t, unsigned Opcode>
+using TernaryVPInstruction_match =
+    TernaryRecipe_match<Op0_t, Op1_t, Op2_t, Opcode, /*Commutative*/ false,
+                        VPInstruction>;
+
 template <typename Op0_t, typename Op1_t, unsigned Opcode,
           bool Commutative = false>
 using AllBinaryRecipe_match =
@@ -234,6 +244,13 @@ m_VPInstruction(const Op0_t &Op0, const Op1_t &Op1) {
   return BinaryVPInstruction_match<Op0_t, Op1_t, Opcode>(Op0, Op1);
 }
 
+template <unsigned Opcode, typename Op0_t, typename Op1_t, typename Op2_t>
+inline TernaryVPInstruction_match<Op0_t, Op1_t, Op2_t, Opcode>
+m_VPInstruction(const Op0_t &Op0, const Op1_t &Op1, const Op2_t &Op2) {
+  return TernaryVPInstruction_match<Op0_t, Op1_t, Op2_t, Opcode>(
+      {Op0, Op1, Op2});
+}
+
 template <typename Op0_t>
 inline UnaryVPInstruction_match<Op0_t, VPInstruction::Not>
 m_Not(const Op0_t &Op0) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index c7190b3187d94..f78ced71e1260 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -423,6 +423,7 @@ bool VPInstruction::canGenerateScalarForFirstLane() const {
   if (isSingleScalar() || isVectorToScalar())
     return true;
   switch (Opcode) {
+  case Instruction::Freeze:
   case Instruction::ICmp:
   case Instruction::PHI:
   case Instruction::Select:
@@ -474,6 +475,12 @@ Value *VPInstruction::generate(VPTransformState &State) {
     Value *Idx = State.get(getOperand(1), /*IsScalar=*/true);
     return Builder.CreateExtractElement(Vec, Idx, Name);
   }
+  case Instruction::Freeze: {
+    if (State.hasVectorValue(this))
+      return State.get(this);
+    Value *Op = State.get(getOperand(0), true);
+    return Builder.CreateFreeze(Op, Name);
+  }
   case Instruction::ICmp: {
     bool OnlyFirstLaneUsed = vputils::onlyFirstLaneUsed(this);
     Value *A = State.get(getOperand(0), OnlyFirstLaneUsed);
@@ -614,6 +621,28 @@ Value *VPInstruction::generate(VPTransformState &State) {
     return Builder.CreateVectorSplat(
         State.VF, State.get(getOperand(0), /*IsScalar*/ true), "broadcast");
   }
+  case VPInstruction::ComputeFindLastIVResult: {
+    // The recipe's operands are the reduction phi, followed by one operand for
+    // each part of the reduction.
+    unsigned UF = getNumOperands() - 2;
+    Value *ReducedPartRdx = State.get(getOperand(2));
+    for (unsigned Part = 1; Part < UF; ++Part) {
+      ReducedPartRdx = createMinMaxOp(Builder, RecurKind::SMax, ReducedPartRdx,
+                                      State.get(getOperand(2 + Part)));
+    }
+
+    // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
+    // and will be removed by breaking up the recipe further.
+    auto *PhiR = cast<VPReductionPHIRecipe>(getOperand(0));
+    // Get its reduction variable descriptor.
+    const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
+    RecurKind RK = RdxDesc.getRecurrenceKind();
+
+    assert(RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK));
+    assert(!PhiR->isInLoop());
+    return createFindLastIVReduction(Builder, ReducedPartRdx,
+                                     State.get(getOperand(1), true), RdxDesc);
+  }
   case VPInstruction::ComputeReductionResult: {
     // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
     // and will be removed by breaking up the recipe further.
@@ -623,6 +652,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
     const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
 
     RecurKind RK = RdxDesc.getRecurrenceKind();
+    assert(!RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK) &&
+           "should be handled by ComputeFindLastIVResult");
 
     Type *PhiTy = OrigPhi->getType();
     // The recipe's operands are the reduction phi, followed by one operand for
@@ -658,9 +689,6 @@ Value *VPInstruction::generate(VPTransformState &State) {
         if (Op != Instruction::ICmp && Op != Instruction::FCmp)
           ReducedPartRdx = Builder.CreateBinOp(
               (Instruction::BinaryOps)Op, RdxPart, ReducedPartRdx, "bin.rdx");
-        else if (RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK))
-          ReducedPartRdx =
-              createMinMaxOp(Builder, RecurKind::SMax, ReducedPartRdx, RdxPart);
         else
           ReducedPartRdx = createMinMaxOp(Builder, RK, ReducedPartRdx, RdxPart);
       }
@@ -669,8 +697,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
     // Create the reduction after the loop. Note that inloop reductions create
     // the target reduction in the loop using a Reduction recipe.
     if ((State.VF.isVector() ||
-         RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) ||
-         RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
+         RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) &&
         !PhiR->isInLoop()) {
       // TODO: Support in-order reductions based on the recurrence descriptor.
       // All ops in the reduction inherit fast-math-flags from the recurrence
@@ -681,9 +708,6 @@ Value *VPInstruction::generate(VPTransformState &State) {
       if (RecurrenceDescriptor::isAnyOfRecurrenceKind(RK))
         ReducedPartRdx =
             createAnyOfReduction(Builder, ReducedPartRdx, RdxDesc, OrigPhi);
-      else if (RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK))
-        ReducedPartRdx =
-            createFindLastIVReduction(Builder, ReducedPartRdx, RdxDesc);
       else
         ReducedPartRdx = createSimpleReduction(Builder, ReducedPartRdx, RK);
 
@@ -829,6 +853,7 @@ bool VPInstruction::isVectorToScalar() const {
   return getOpcode() == VPInstruction::ExtractFromEnd ||
          getOpcode() == Instruction::ExtractElement ||
          getOpcode() == VPInstruction::FirstActiveLane ||
+         getOpcode() == VPInstruction::ComputeFindLastIVResult ||
          getOpcode() == VPInstruction::ComputeReductionResult ||
          getOpcode() == VPInstruction::AnyOf;
 }
@@ -921,6 +946,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
   case Instruction::ICmp:
   case Instruction::Select:
   case Instruction::Or:
+  case Instruction::Freeze:
     // TODO: Cover additional opcodes.
     return vputils::onlyFirstLaneUsed(this);
   case VPInstruction::ActiveLaneMask:
@@ -933,6 +959,8 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
     return true;
   case VPInstruction::PtrAdd:
     return Op == getOperand(0) || vputils::onlyFirstLaneUsed(this);
+  case VPInstruction::ComputeFindLastIVResult:
+    return Op == getOperand(1);
   };
   llvm_unreachable("switch should return");
 }
@@ -1011,6 +1039,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::ExtractFromEnd:
     O << "extract-from-end";
     break;
+  case VPInstruction::ComputeFindLastIVResult:
+    O << "compute-find-last-iv-result";
+    break;
   case VPInstruction::ComputeReductionResult:
     O << "compute-reduction-result";
     break;
@@ -1571,7 +1602,6 @@ void VPWidenRecipe::execute(VPTransformState &State) {
   }
   case Instruction::Freeze: {
     Value *Op = State.get(getOperand(0));
-
     Value *Freeze = Builder.CreateFreeze(Op);
     State.set(this, Freeze);
     break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
index a36c2aeb3da5c..a513a255344cc 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -348,7 +348,9 @@ void UnrollState::unrollBlock(VPBlockBase *VPB) {
     // the pa...
[truncated]

@lukel97
Copy link
Contributor

lukel97 commented Mar 24, 2025

Makes sense to me. Although I think AnyOf reductions might have the same problem too?

With opt -p loop-vectorize -force-vector-width=4:

define i32 @select_i32_from_icmp(ptr %v, i32 %a, i32 %b, i64 %n) {
entry:
  br label %loop

loop:                                      ; preds = %entry, %loop
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %rdx = phi i32 [ %a, %entry ], [ %sel, %loop ]
  %gep.v.iv = getelementptr inbounds i32, ptr %v, i64 %iv
  %load.v.iv = load i32, ptr %gep.v.iv, align 4
  %cmp.load.iv.3 = icmp eq i32 %load.v.iv, 3
  %sel = select i1 %cmp.load.iv.3, i32 %rdx, i32 %b
  %iv.next = add nuw nsw i64 %iv, 1
  %exit.cond = icmp eq i64 %iv.next, %n
  br i1 %exit.cond, label %exit, label %loop

exit:                                     ; preds = %loop
  ret i32 %sel
}

Becomes

define i32 @select_i32_from_icmp(ptr %v, i32 %a, i32 %b, i64 %n) {
entry:
  %min.iters.check = icmp ult i64 %n, 4
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %entry
  %n.mod.vf = urem i64 %n, 4
  %n.vec = sub i64 %n, %n.mod.vf
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x i1> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
  %0 = add i64 %index, 0
  %1 = getelementptr inbounds i32, ptr %v, i64 %0
  %2 = getelementptr inbounds i32, ptr %1, i32 0
  %wide.load = load <4 x i32>, ptr %2, align 4
  %3 = icmp eq <4 x i32> %wide.load, splat (i32 3)
  %4 = xor <4 x i1> %3, splat (i1 true)
  %5 = or <4 x i1> %vec.phi, %4
  %index.next = add nuw i64 %index, 4
  %6 = icmp eq i64 %index.next, %n.vec
  br i1 %6, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %7 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %5)
  %8 = freeze i1 %7
  %rdx.select = select i1 %8, i32 %b, i32 %a
  %cmp.n = icmp eq i64 %n, %n.vec
  br i1 %cmp.n, label %exit, label %scalar.ph

scalar.ph:                                        ; preds = %entry, %middle.block
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %entry ]
  %bc.merge.rdx = phi i32 [ %rdx.select, %middle.block ], [ %a, %entry ]
  br label %loop

loop:                                             ; preds = %scalar.ph, %loop
  %iv = phi i64 [ %bc.resume.val, %scalar.ph ], [ %iv.next, %loop ]
  %rdx = phi i32 [ %bc.merge.rdx, %scalar.ph ], [ %sel, %loop ]
  %gep.v.iv = getelementptr inbounds i32, ptr %v, i64 %iv
  %load.v.iv = load i32, ptr %gep.v.iv, align 4
  %cmp.load.iv.3 = icmp eq i32 %load.v.iv, 3
  %sel = select i1 %cmp.load.iv.3, i32 %rdx, i32 %b
  %iv.next = add nuw nsw i64 %iv, 1
  %exit.cond = icmp eq i64 %iv.next, %n
  br i1 %exit.cond, label %exit, label %loop, !llvm.loop !3

exit:                                             ; preds = %middle.block, %loop
  %sel.lcssa = phi i32 [ %sel, %loop ], [ %rdx.select, %middle.block ]
  ret i32 %sel.lcssa
}

We now have two uses of %a in %rdx.select and %bc.merge.rdx.

EDIT: Alive2 seems to confirm the transformation is incorrect: https://alive2.llvm.org/ce/z/XL3Yrk

Copy link
Contributor

@david-arm david-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix @fhahn!

// Update all users outside the vector region.
for (VPUser *U : to_vector(OrigExitingVPV->users())) {
auto *R = cast<VPRecipeBase>(U);
if (R->getParent() && R->getParent()->getParent())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be based on the assumption that the parent's parent is a vector loop region. Is it worth asserting this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to check that the parent's region is the vector loop region, thanks

if (R->getParent() && R->getParent()->getParent())
continue;

for (unsigned Idx = 0; Idx != R->getNumOperands(); ++Idx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a commonly-used idiom, something we probably have an equivalent IR helper for. Is it worth adding a helper in VPUser?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added VPUser::replaceUsesOfWith in line with User::replaceUsesOfWith

@@ -10448,6 +10451,10 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
ResumeV =
Builder.CreateICmpNE(ResumeV, RdxDesc.getRecurrenceStartValue());
} else if (RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) {
ToFrozen[RdxDesc.getRecurrenceStartValue()] =
cast<PHINode>(ResumeV)->getIncomingValueForBlock(
EPI.MainLoopIterationCountCheck);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confused me at first because there is no obvious indication that EPI.MainLoopIterationCountCheck is a Block pointer. Doesn't need doing in this PR, but it would be good to rename MainLoopIterationCountCheck to something like MainLoopIterationCountCheckBlock.

@@ -10365,24 +10385,7 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
VPBasicBlock *Header = VectorLoop->getEntryBasicBlock();
Header->setName("vec.epilog.vector.body");

// Re-use the trip count and steps expanded for the main loop, as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in theory this code could be moved to the end of the function in a separate NFC patch, with this patch then adding the new bits:

    auto *VPI = dyn_cast<VPInstruction>(&R);
    if (VPI) {
      VPI->replaceAllUsesWith(Plan.getOrAddLiveIn(
          ToFrozen[VPI->getOperand(0)->getLiveInIRValue()]));
      continue;
    }

However, I don't want to create unnecessary burden as I know this is an important fix. I'll leave it up to you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it included in the patch for now, could also split it off.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. 👍

@@ -474,6 +475,12 @@ Value *VPInstruction::generate(VPTransformState &State) {
Value *Idx = State.get(getOperand(1), /*IsScalar=*/true);
return Builder.CreateExtractElement(Vec, Idx, Name);
}
case Instruction::Freeze: {
if (State.hasVectorValue(this))
return State.get(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we still generate a freeze instruction if we have a vector value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check was left over from earlier version, removed, thanks

@@ -939,6 +946,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
case Instruction::ICmp:
case Instruction::Select:
case Instruction::Or:
case Instruction::Freeze:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably also add this opcode to VPInstruction::opcodeMayReadOrWriteFromMemory and return false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

Copy link
Contributor

@Mel-Chen Mel-Chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your help!

@@ -6,6 +6,7 @@ define i64 @select_icmp_nuw_nsw(ptr %a, ptr %b, i64 %ii, i64 %n) {
; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i64 [[II:%.*]], i64 [[N:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
; CHECK-NEXT: [[FR:%.*]] = freeze i64 [[II]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we insert the freeze in vector preheader, or the scalar loop also need the frozen start value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked a bit more and I think the only issue is with epilogue vectorization; Here's a proof where we go via the main vector loop to the scalar remainder loop https://alive2.llvm.org/ce/z/yFSBrq and here where we go from the vector loop to the exit block https://alive2.llvm.org/ce/z/D6Ppzr.

When vectorizing the epilogue, freeze is needed for the start value used in the selects and the resume phi from the main vector loop:

The latter 2 show requiring freezing the resume phi. That means we cannot freeze in the preheader. We could move the freeze to the main iteration count check, but that would be a bit fragile to find and other transforms can sink the freeze if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So it's because
[[TMP11:%.*]] = icmp eq i8 [[BC_MERGE_RDX]], [[START]]
passes the undef start value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we reach vec.epilog.ph without going through the main vector loop, %bc.merge.rdx will be the start value. If it is not frozen we have 2 uses of undef (both operands of the compare), and each use may have a different concrete value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the incoming value of FindLastIV PHI in epilogue vectorization can be directly broadcasted with the sentinel value to avoid generating this icmp, I'm not sure if AnyOf has a similar approach to bypass the icmp that propagates undef/poison. Therefore, I think continuing with the approach of freezing the start value is reasonable, as it allows both FindLastIV and AnyOf to share the same approach. :)

Copy link

github-actions bot commented Mar 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link

github-actions bot commented Mar 31, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/lib/Transforms/Vectorize/LoopVectorize.cpp llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll llvm/test/Transforms/LoopVectorize/epilog-iv-select-cmp.ll

The following files introduce new uses of undef:

  • llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to current main, so now only the PR changes should be included.

Updated the PR to only freeze when vectorizing the epilogue, as this is the only case where it should be needed

@@ -939,6 +946,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
case Instruction::ICmp:
case Instruction::Select:
case Instruction::Or:
case Instruction::Freeze:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

@@ -474,6 +475,12 @@ Value *VPInstruction::generate(VPTransformState &State) {
Value *Idx = State.get(getOperand(1), /*IsScalar=*/true);
return Builder.CreateExtractElement(Vec, Idx, Name);
}
case Instruction::Freeze: {
if (State.hasVectorValue(this))
return State.get(this);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check was left over from earlier version, removed, thanks

if (R->getParent() && R->getParent()->getParent())
continue;

for (unsigned Idx = 0; Idx != R->getNumOperands(); ++Idx) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added VPUser::replaceUsesOfWith in line with User::replaceUsesOfWith

// Update all users outside the vector region.
for (VPUser *U : to_vector(OrigExitingVPV->users())) {
auto *R = cast<VPRecipeBase>(U);
if (R->getParent() && R->getParent()->getParent())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to check that the parent's region is the vector loop region, thanks

@@ -10365,24 +10385,7 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
VPBasicBlock *Header = VectorLoop->getEntryBasicBlock();
Header->setName("vec.epilog.vector.body");

// Re-use the trip count and steps expanded for the main loop, as
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it included in the patch for now, could also split it off.

@@ -6,6 +6,7 @@ define i64 @select_icmp_nuw_nsw(ptr %a, ptr %b, i64 %ii, i64 %n) {
; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i64 [[II:%.*]], i64 [[N:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
; CHECK-NEXT: [[FR:%.*]] = freeze i64 [[II]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked a bit more and I think the only issue is with epilogue vectorization; Here's a proof where we go via the main vector loop to the scalar remainder loop https://alive2.llvm.org/ce/z/yFSBrq and here where we go from the vector loop to the exit block https://alive2.llvm.org/ce/z/D6Ppzr.

When vectorizing the epilogue, freeze is needed for the start value used in the selects and the resume phi from the main vector loop:

The latter 2 show requiring freezing the resume phi. That means we cannot freeze in the preheader. We could move the freeze to the main iteration count check, but that would be a bit fragile to find and other transforms can sink the freeze if needed.

Copy link
Contributor

@david-arm david-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update - looks almost ready to go!

@@ -9892,14 +9895,22 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
// also modeled in VPlan.
VPInstruction *FinalReductionResult;
VPBuilder::InsertPointGuard Guard(Builder);
Builder.setInsertPoint(MiddleVPBB, IP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this been sunk into the if/else cases when we are inserting into the same location in each case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored, this was left over after moving the code

auto *R = cast<VPRecipeBase>(U);
if (R->getParent()->getParent() == VectorLoopRegion)
continue;
R->replaceUsesOfWith(PhiR->getStartValue(), Start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to assume that the only users of PhiR->getStartValue() we need to update are operands of the users of OrigExitingVPV? I guess for FindLastIV that is true because it's usually going to be a select? Could you add a comment here explaining the logic?

Furthermore, what if PhiR->getStartValue() == Start already? I guess it doesn't do any harm to call replaceUsesOfWith - I just wondered if we should do something like:

  if (PhiR->getStartValue() != Start)
    R->replaceUsesOfWith(PhiR->getStartValue(), Start);

I believe the values could be equal if we managed to prove that Start is not undef or poison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something here, isn't Start defined as VPValue *Start = PhiR->getStartValue(); just above? So isn't this always a no-op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was left over after moving the logic to the epilogue code, removed now, thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same doubt. PhiR->getStartValue() and Start looks the same.

@@ -1615,6 +1622,7 @@ void VPWidenRecipe::execute(VPTransformState &State) {
}
case Instruction::Freeze: {
Value *Op = State.get(getOperand(0));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Stray new line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks!

@@ -1372,6 +1372,7 @@ static bool isDefinedInsideLoopRegions(const VPValue *VPV) {
bool VPValue::isDefinedOutsideLoopRegions() const {
return !isDefinedInsideLoopRegions(this);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Stray new line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripped, thanks

@pawosm-arm
Copy link
Contributor

This is a vital patch. I'm sorry for this blatant question, but, is there any chance of landing it this week?

Copy link
Contributor

@lukel97 lukel97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -10377,6 +10380,36 @@ static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
VPInstruction::ResumePhi,
{VectorTC, MainPlan.getCanonicalIV()->getStartValue()}, {},
"vec.epilog.resume.val");

// When vectorizing the epilogue, FindLastIV reductions can introduce multiple
// uses of undef/poison. If the reduction start value is not guaranteed to be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this makes it even wordier, but shouldn't this be If the reduction start value is not guaranteed to **not** be undef or poison? In other words, if the start value could be either undef or poison we need to freeze it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I update the comment to say If the reduction start value may be undef or poison....

VPValue *OrigStart = VPI->getOperand(1);
if (isGuaranteedNotToBeUndefOrPoison(OrigStart->getLiveInIRValue()))
continue;
VPBuilder Builder(Plan.getEntry());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe hoist this out of the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

// directly in VPlan.
for (auto &R : make_early_inc_range(*Plan.getEntry())) {
auto *VPI = dyn_cast<VPInstruction>(&R);
if (VPI) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantee is there that VPI corresponds to the frozen start value? Do we need to check for VPInstructions with opcode Instruction::Freeze? I assume this is supposed to match up with the VPInstructions added by AddFreezeForFindLastIVReductions above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, there can only be freeze VPInstruction in the header, updated to check though

@@ -10365,24 +10385,7 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
VPBasicBlock *Header = VectorLoop->getEntryBasicBlock();
Header->setName("vec.epilog.vector.body");

// Re-use the trip count and steps expanded for the main loop, as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. 👍

Copy link
Contributor

@Mel-Chen Mel-Chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after all current comments are fixed.
BTW, I can help update the llvm-testsuite to cover this case if there isn't already a unit test for it.

@@ -1403,6 +1403,13 @@ void VPValue::replaceUsesWithIf(
}
}

void VPUser::replaceUsesOfWith(VPValue *From, VPValue *To) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can remove this since we don't use the function in this patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed again

@@ -246,6 +246,9 @@ class VPUser {
New->addUser(*this);
}

/// Replaces all uses of \p From in the VPUser with \p To.
void replaceUsesOfWith(VPValue *From, VPValue *To);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed again

@@ -6,6 +6,7 @@ define i64 @select_icmp_nuw_nsw(ptr %a, ptr %b, i64 %ii, i64 %n) {
; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i64 [[II:%.*]], i64 [[N:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
; CHECK-NEXT: [[FR:%.*]] = freeze i64 [[II]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the incoming value of FindLastIV PHI in epilogue vectorization can be directly broadcasted with the sentinel value to avoid generating this icmp, I'm not sure if AnyOf has a similar approach to bypass the icmp that propagates undef/poison. Therefore, I think continuing with the approach of freezing the start value is reasonable, as it allows both FindLastIV and AnyOf to share the same approach. :)

@vzakhari
Copy link
Contributor

vzakhari commented Apr 2, 2025

Sorry for the ignorant question in advance - I am not familiar with the mechanics of vectorization.

How does this work with FP values? If the undef value happens to be NaN, then do we care that the comparison will return false (regardless of whether the initial value was updated or not by the vector loop)?

Copy link
Contributor

@david-arm david-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for fixing this @fhahn.

for (auto &R : make_early_inc_range(*Plan.getEntry())) {
auto *VPI = dyn_cast<VPInstruction>(&R);
if (VPI && VPI->getOpcode() == Instruction::Freeze) {
VPI->replaceAllUsesWith(Plan.getOrAddLiveIn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand what you're doing here, but it looks a bit odd at first glance. You're essentially replacing one freeze in the epilogue entry block with another from the MainLoopIterationCountCheck block, right? Perhaps worth a comment explaining what's happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I also moved the comment about re-using the trip count inside the loop, thanks

fhahn added a commit that referenced this pull request Apr 3, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 3, 2025
@fhahn fhahn merged commit 2bdc1a1 into llvm:main Apr 4, 2025
10 of 11 checks passed
@fhahn fhahn deleted the findlastiv-poison-safe branch April 4, 2025 10:48
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 4, 2025
…2691)

FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use, when the epilogue is
vectorized.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

See the following scenarios in Alive2:
* Both main and epilogue vector loops execute, go to exit block: https://alive2.llvm.org/ce/z/_TSvRr
* Both main and epilogue vector loops execute, go to scalar loop: https://alive2.llvm.org/ce/z/CsPj5v
* Only epilogue vector loop executes, go to exit block: https://alive2.llvm.org/ce/z/5XqkNV
* Only epilogue vector loop executes, go to scalar loop: https://alive2.llvm.org/ce/z/JUpqRN

The latter 2 show requiring freezing the resume phi. That means we cannot freeze
in the preheader. We could move the freeze to the main iteration count check, but
that would be a bit fragile to find and other transforms can sink the freeze if needed.

Depends on llvm/llvm-project#132689
and llvm/llvm-project#132690.

Fixes llvm/llvm-project#126836

PR: llvm/llvm-project#132691
@Mel-Chen
Copy link
Contributor

Mel-Chen commented Apr 7, 2025

Sorry for the ignorant question in advance - I am not familiar with the mechanics of vectorization.

How does this work with FP values? If the undef value happens to be NaN, then do we care that the comparison will return false (regardless of whether the initial value was updated or not by the vector loop)?

@vzakhari Both AnyOf and FindLastIV do not support floating-point types so far.

@yus3710-fj
Copy link
Contributor

Thank you for the fix!

I'd like to backport this fix to the release branch, but I'm not sure how to do it because it seems to depend on some other fixes. I was hoping that someone familiar with this fix could backport it.

@pawosm-arm
Copy link
Contributor

Hi @fhahn, I need to kindly ask you a question. A fix for the issue #126836 has been aimed for LLVM20. Do you have any plans to implement it for the LLVM20 release branch too? It seems the PRs that fix this issue depend too much on the current state of the main branch to make them cherry-pick as-is.

pawosm-arm pushed a commit to arm/arm-toolchain that referenced this pull request Apr 9, 2025
pawosm-arm pushed a commit to arm/arm-toolchain that referenced this pull request Apr 9, 2025
FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use, when the epilogue is
vectorized.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

See the following scenarios in Alive2:
* Both main and epilogue vector loops execute, go to exit block: https://alive2.llvm.org/ce/z/_TSvRr
* Both main and epilogue vector loops execute, go to scalar loop: https://alive2.llvm.org/ce/z/CsPj5v
* Only epilogue vector loop executes, go to exit block: https://alive2.llvm.org/ce/z/5XqkNV
* Only epilogue vector loop executes, go to scalar loop: https://alive2.llvm.org/ce/z/JUpqRN

The latter 2 show requiring freezing the resume phi. That means we cannot freeze
in the preheader. We could move the freeze to the main iteration count check, but
that would be a bit fragile to find and other transforms can sink the freeze if needed.

Depends on llvm/llvm-project#132689
and llvm/llvm-project#132690.

Fixes llvm/llvm-project#126836

PR: llvm/llvm-project#132691
pawosm-arm pushed a commit to pawosm-arm/llvm-project that referenced this pull request Apr 10, 2025
pawosm-arm pushed a commit to pawosm-arm/llvm-project that referenced this pull request Apr 10, 2025
FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use, when the epilogue is
vectorized.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

See the following scenarios in Alive2:
* Both main and epilogue vector loops execute, go to exit block: https://alive2.llvm.org/ce/z/_TSvRr
* Both main and epilogue vector loops execute, go to scalar loop: https://alive2.llvm.org/ce/z/CsPj5v
* Only epilogue vector loop executes, go to exit block: https://alive2.llvm.org/ce/z/5XqkNV
* Only epilogue vector loop executes, go to scalar loop: https://alive2.llvm.org/ce/z/JUpqRN

The latter 2 show requiring freezing the resume phi. That means we cannot freeze
in the preheader. We could move the freeze to the main iteration count check, but
that would be a bit fragile to find and other transforms can sink the freeze if needed.

Depends on llvm#132689
and llvm#132690.

Fixes llvm#126836

PR: llvm#132691
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[LV] wrf in SPEC2017 emits "Verification Error" on Grace when using llvmorg-20.1.0-rc1
8 participants