[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. #98495

tianqingw · 2024-07-11T15:19:59Z

The !unpredictable metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
FoldTwoEntryPHINode() to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface getBranchMispredictPenalty() is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.

llvmbot · 2024-07-11T15:27:16Z

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Author: Tianqing Wang (tianqingw)

Changes

The !unpredictable metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
FoldTwoEntryPHINode() to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface getBranchMispredictPenalty() is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.

Full diff: https://github.com/llvm/llvm-project/pull/98495.diff

7 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
(modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+5)
(modified) llvm/lib/Target/X86/X86TargetTransformInfo.h (+2)
(modified) llvm/lib/Transforms/Utils/SimplifyCFG.cpp (+6-3)
(added) llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll (+96)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index dcdd9f82cde8e..9fbb6e1cff445 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -419,6 +419,11 @@ class TargetTransformInfo {
   /// this factor, it is very likely to be predicted correctly.
   BranchProbability getPredictableBranchThreshold() const;
 
+  // Returns an integer indicating how aggressive the target wants for
+  // eliminating unpredictable branches. A zero return value means extra
+  // optimization applied to them should be minimal.
+  unsigned getBranchMispredictPenalty() const;
+
   /// Return true if branch divergence exists.
   ///
   /// Branch divergence has a significantly negative impact on GPU performance
@@ -1820,6 +1825,7 @@ class TargetTransformInfo::Concept {
                                              ArrayRef<const Value *> Operands,
                                              TargetCostKind CostKind) = 0;
   virtual BranchProbability getPredictableBranchThreshold() = 0;
+  virtual unsigned getBranchMispredictPenalty() = 0;
   virtual bool hasBranchDivergence(const Function *F = nullptr) = 0;
   virtual bool isSourceOfDivergence(const Value *V) = 0;
   virtual bool isAlwaysUniform(const Value *V) = 0;
@@ -2228,6 +2234,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   BranchProbability getPredictableBranchThreshold() override {
     return Impl.getPredictableBranchThreshold();
   }
+  unsigned getBranchMispredictPenalty() override {
+    return Impl.getBranchMispredictPenalty();
+  }
   bool hasBranchDivergence(const Function *F = nullptr) override {
     return Impl.hasBranchDivergence(F);
   }
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 01624de190d51..d4b65ee6f5bd0 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -99,6 +99,8 @@ class TargetTransformInfoImplBase {
     return BranchProbability(99, 100);
   }
 
+  unsigned getBranchMispredictPenalty() const { return 0; }
+
   bool hasBranchDivergence(const Function *F = nullptr) const { return false; }
 
   bool isSourceOfDivergence(const Value *V) const { return false; }
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index c175d1737e54b..0ee9c8ee0cdf8 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -279,6 +279,10 @@ BranchProbability TargetTransformInfo::getPredictableBranchThreshold() const {
              : TTIImpl->getPredictableBranchThreshold();
 }
 
+unsigned TargetTransformInfo::getBranchMispredictPenalty() const {
+  return TTIImpl->getBranchMispredictPenalty();
+}
+
 bool TargetTransformInfo::hasBranchDivergence(const Function *F) const {
   return TTIImpl->hasBranchDivergence(F);
 }
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 32a3683355b72..984586f4ae5f6 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6756,3 +6756,8 @@ InstructionCost X86TTIImpl::getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
     return AM.Scale != 0;
   return -1;
 }
+
+unsigned X86TTIImpl::getBranchMispredictPenalty() const {
+  // TODO: Hook MispredictPenalty of SchedMachineModel into this.
+  return 14;
+}
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index 5eccb1aea308d..d2b5c093e7003 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -294,6 +294,8 @@ class X86TTIImpl : public BasicTTIImplBase<X86TTIImpl> {
   bool supportsEfficientVectorElementLoadStore() const;
   bool enableInterleavedAccessVectorization();
 
+  unsigned getBranchMispredictPenalty() const;
+
 private:
   bool supportsGather() const;
   InstructionCost getGSVectorCost(unsigned Opcode, TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
index 3fa3c0f1f52b0..128238bba10b0 100644
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -3508,7 +3508,8 @@ static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,
   // jump to one specific 'then' block (if we have two of them).
   // It isn't beneficial to speculatively execute the code
   // from the block that we know is predictably not entered.
-  if (!DomBI->getMetadata(LLVMContext::MD_unpredictable)) {
+  bool IsUnpredictable = DomBI->getMetadata(LLVMContext::MD_unpredictable);
+  if (!IsUnpredictable) {
     uint64_t TWeight, FWeight;
     if (extractBranchWeights(*DomBI, TWeight, FWeight) &&
         (TWeight + FWeight) != 0) {
@@ -3549,8 +3550,10 @@ static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,
   // that need to be moved to the dominating block.
   SmallPtrSet<Instruction *, 4> AggressiveInsts;
   InstructionCost Cost = 0;
-  InstructionCost Budget =
-      TwoEntryPHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic;
+  unsigned Threshold = TwoEntryPHINodeFoldingThreshold;
+  if (IsUnpredictable)
+    Threshold += TTI.getBranchMispredictPenalty();
+  InstructionCost Budget = Threshold * TargetTransformInfo::TCC_Basic;
 
   bool Changed = false;
   for (BasicBlock::iterator II = BB->begin(); isa<PHINode>(II);) {
diff --git a/llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll b/llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll
new file mode 100644
index 0000000000000..88aa8a619207d
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll
@@ -0,0 +1,96 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; Two-entry phi nodes with unpredictable conditions may get increased budget for folding.
+; RUN: opt < %s -S -passes=simplifycfg | FileCheck --check-prefix=CHECK-NOFOLD %s
+; RUN: opt -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 < %s -S -passes=simplifycfg | FileCheck --check-prefix=CHECK-FOLD %s
+
+define { <2 x float>, <2 x float> } @foo(float %speed, <2 x float> %velocity.coerce0, <2 x float> %velocity.coerce1) {
+; CHECK-NOFOLD-LABEL: define { <2 x float>, <2 x float> } @foo(
+; CHECK-NOFOLD-SAME: float [[SPEED:%.*]], <2 x float> [[VELOCITY_COERCE0:%.*]], <2 x float> [[VELOCITY_COERCE1:%.*]]) {
+; CHECK-NOFOLD-NEXT:  [[ENTRY:.*]]:
+; CHECK-NOFOLD-NEXT:    [[CMP:%.*]] = fcmp fast ogt float [[SPEED]], 0x3F747AE140000000
+; CHECK-NOFOLD-NEXT:    br i1 [[CMP]], label %[[IF_THEN:.*]], label %[[IF_END:.*]], !unpredictable [[META0:![0-9]+]]
+; CHECK-NOFOLD:       [[IF_THEN]]:
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_0_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 0
+; CHECK-NOFOLD-NEXT:    [[MUL_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_0_VEC_EXTRACT]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 1
+; CHECK-NOFOLD-NEXT:    [[MUL8_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_4_VEC_EXTRACT]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[ADD_I_I_I_I:%.*]] = fadd fast float [[MUL8_I_I_I_I]], [[MUL_I_I_I_I]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE1]], i64 0
+; CHECK-NOFOLD-NEXT:    [[MUL13_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_14_8_VEC_EXTRACT]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[ADD14_I_I_I_I:%.*]] = fadd fast float [[ADD_I_I_I_I]], [[MUL13_I_I_I_I]]
+; CHECK-NOFOLD-NEXT:    [[TMP0:%.*]] = tail call fast noundef float @llvm.sqrt.f32(float [[ADD14_I_I_I_I]])
+; CHECK-NOFOLD-NEXT:    [[MUL_I_I_I:%.*]] = fdiv fast float 0x3FEFD70A40000000, [[TMP0]]
+; CHECK-NOFOLD-NEXT:    [[SUB_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[TMP1:%.*]] = insertelement <2 x float> poison, float [[SUB_I]], i64 0
+; CHECK-NOFOLD-NEXT:    [[SUB8_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_INSERT25:%.*]] = insertelement <2 x float> [[TMP1]], float [[SUB8_I]], i64 1
+; CHECK-NOFOLD-NEXT:    [[SUB13_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_INSERT35:%.*]] = insertelement <2 x float> [[VELOCITY_COERCE1]], float [[SUB13_I]], i64 0
+; CHECK-NOFOLD-NEXT:    br label %[[IF_END]]
+; CHECK-NOFOLD:       [[IF_END]]:
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_0:%.*]] = phi nsz <2 x float> [ [[VELOCITY_SROA_0_4_VEC_INSERT25]], %[[IF_THEN]] ], [ zeroinitializer, %[[ENTRY]] ]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_14_0:%.*]] = phi nsz <2 x float> [ [[VELOCITY_SROA_14_8_VEC_INSERT35]], %[[IF_THEN]] ], [ zeroinitializer, %[[ENTRY]] ]
+; CHECK-NOFOLD-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } poison, <2 x float> [[VELOCITY_SROA_0_0]], 0
+; CHECK-NOFOLD-NEXT:    [[DOTFCA_1_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } [[DOTFCA_0_INSERT]], <2 x float> [[VELOCITY_SROA_14_0]], 1
+; CHECK-NOFOLD-NEXT:    ret { <2 x float>, <2 x float> } [[DOTFCA_1_INSERT]]
+;
+; CHECK-FOLD-LABEL: define { <2 x float>, <2 x float> } @foo(
+; CHECK-FOLD-SAME: float [[SPEED:%.*]], <2 x float> [[VELOCITY_COERCE0:%.*]], <2 x float> [[VELOCITY_COERCE1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-FOLD-NEXT:  [[ENTRY:.*:]]
+; CHECK-FOLD-NEXT:    [[CMP:%.*]] = fcmp fast ogt float [[SPEED]], 0x3F747AE140000000
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_0_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 0
+; CHECK-FOLD-NEXT:    [[MUL_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_0_VEC_EXTRACT]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 1
+; CHECK-FOLD-NEXT:    [[MUL8_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_4_VEC_EXTRACT]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[ADD_I_I_I_I:%.*]] = fadd fast float [[MUL8_I_I_I_I]], [[MUL_I_I_I_I]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE1]], i64 0
+; CHECK-FOLD-NEXT:    [[MUL13_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_14_8_VEC_EXTRACT]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[ADD14_I_I_I_I:%.*]] = fadd fast float [[ADD_I_I_I_I]], [[MUL13_I_I_I_I]]
+; CHECK-FOLD-NEXT:    [[TMP0:%.*]] = tail call fast float @llvm.sqrt.f32(float [[ADD14_I_I_I_I]])
+; CHECK-FOLD-NEXT:    [[MUL_I_I_I:%.*]] = fdiv fast float 0x3FEFD70A40000000, [[TMP0]]
+; CHECK-FOLD-NEXT:    [[SUB_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[TMP1:%.*]] = insertelement <2 x float> poison, float [[SUB_I]], i64 0
+; CHECK-FOLD-NEXT:    [[SUB8_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_INSERT25:%.*]] = insertelement <2 x float> [[TMP1]], float [[SUB8_I]], i64 1
+; CHECK-FOLD-NEXT:    [[SUB13_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_INSERT35:%.*]] = insertelement <2 x float> [[VELOCITY_COERCE1]], float [[SUB13_I]], i64 0
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_0:%.*]] = select nsz i1 [[CMP]], <2 x float> [[VELOCITY_SROA_0_4_VEC_INSERT25]], <2 x float> zeroinitializer, !unpredictable [[META0:![0-9]+]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_14_0:%.*]] = select nsz i1 [[CMP]], <2 x float> [[VELOCITY_SROA_14_8_VEC_INSERT35]], <2 x float> zeroinitializer, !unpredictable [[META0]]
+; CHECK-FOLD-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } poison, <2 x float> [[VELOCITY_SROA_0_0]], 0
+; CHECK-FOLD-NEXT:    [[DOTFCA_1_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } [[DOTFCA_0_INSERT]], <2 x float> [[VELOCITY_SROA_14_0]], 1
+; CHECK-FOLD-NEXT:    ret { <2 x float>, <2 x float> } [[DOTFCA_1_INSERT]]
+;
+entry:
+  %cmp = fcmp fast ogt float %speed, 0x3F747AE140000000
+  br i1 %cmp, label %if.then, label %if.end, !unpredictable !0
+
+if.then:
+  %velocity.sroa.0.0.vec.extract = extractelement <2 x float> %velocity.coerce0, i64 0
+  %mul.i.i.i.i = fmul fast float %velocity.sroa.0.0.vec.extract, %velocity.sroa.0.0.vec.extract
+  %velocity.sroa.0.4.vec.extract = extractelement <2 x float> %velocity.coerce0, i64 1
+  %mul8.i.i.i.i = fmul fast float %velocity.sroa.0.4.vec.extract, %velocity.sroa.0.4.vec.extract
+  %add.i.i.i.i = fadd fast float %mul8.i.i.i.i, %mul.i.i.i.i
+  %velocity.sroa.14.8.vec.extract = extractelement <2 x float> %velocity.coerce1, i64 0
+  %mul13.i.i.i.i = fmul fast float %velocity.sroa.14.8.vec.extract, %velocity.sroa.14.8.vec.extract
+  %add14.i.i.i.i = fadd fast float %add.i.i.i.i, %mul13.i.i.i.i
+  %0 = tail call fast noundef float @llvm.sqrt.f32(float %add14.i.i.i.i)
+  %mul.i.i.i = fdiv fast float 0x3FEFD70A40000000, %0
+  %sub.i = fmul fast float %mul.i.i.i, %velocity.sroa.0.0.vec.extract
+  %1 = insertelement <2 x float> poison, float %sub.i, i64 0
+  %sub8.i = fmul fast float %mul.i.i.i, %velocity.sroa.0.4.vec.extract
+  %velocity.sroa.0.4.vec.insert25 = insertelement <2 x float> %1, float %sub8.i, i64 1
+  %sub13.i = fmul fast float %mul.i.i.i, %velocity.sroa.14.8.vec.extract
+  %velocity.sroa.14.8.vec.insert35 = insertelement <2 x float> %velocity.coerce1, float %sub13.i, i64 0
+  br label %if.end
+
+if.end:
+  %velocity.sroa.0.0 = phi nsz <2 x float> [ %velocity.sroa.0.4.vec.insert25, %if.then ], [ zeroinitializer, %entry ]
+  %velocity.sroa.14.0 = phi nsz <2 x float> [ %velocity.sroa.14.8.vec.insert35, %if.then ], [ zeroinitializer, %entry ]
+  %.fca.0.insert = insertvalue { <2 x float>, <2 x float> } poison, <2 x float> %velocity.sroa.0.0, 0
+  %.fca.1.insert = insertvalue { <2 x float>, <2 x float> } %.fca.0.insert, <2 x float> %velocity.sroa.14.0, 1
+  ret { <2 x float>, <2 x float> } %.fca.1.insert
+}
+
+declare float @llvm.sqrt.f32(float)
+
+!0 = !{}

llvm/lib/Transforms/Utils/SimplifyCFG.cpp

RKSimon · 2024-07-18T15:33:29Z

llvm/include/llvm/Analysis/TargetTransformInfo.h

+  // Returns an integer indicating how aggressive the target wants for
+  // eliminating unpredictable branches. A zero return value means extra
+  // optimization applied to them should be minimal.
+  unsigned getBranchMispredictPenalty() const;


Can we be more specific about how this penalty (and the threshold) is used - should it be comparable to TCK_Latency costs (and should it return InstructrionCost?).

I'm actually not sure about it and I'd like to have more discussion. IMO this is still very new and more use cases are needed to determine what's the best way for optimizations to leverage it. But for the time being, using it as an latency seems to be good.

Not sure about InstructionCost. With InstructionCost we need to:

Define what Invalid represents. Perhaps just means unknown and no extra optimization.

Define what 0 represents, and how it differs from Invalid. Maybe a (potentially hypothetical) hardware with very cheap branching, in which case we need to avoid folding branches at best effort? I can't think about a practical case yet.

github-actions · 2024-07-18T18:10:55Z

✅ With the latest revision this PR passed the C/C++ code formatter.

williamweixiao

LGTM. Let's wait one or two days to see if there are comments from other reviewers.

dianqk

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll

dianqk · 2024-07-22T01:18:49Z

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

I believe this is the right approach. The SimplifyCFG should only perform some trivial transforms from if to select in this context.

@nikic @dtcxzyw @arsenm might provide additional ideas.

is unpredictable. The `!unpredictable` metadata has been present for a long time, but it's usage in optimizations is still limited. This patch teaches `FoldTwoEntryPHINode()` to be more aggressive with an unpredictable branch to reduce mispredictions. A TTI interface `getBranchMispredictPenalty()` is added to distinguish between different hardwares to ensure we don't go too far for simpler cores. For simplicity, only a naive x86 implementation is included for the time being.

tianqingw · 2024-07-22T07:20:04Z

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

This is intended to be a minimum viable product for 19.x. That's the reason why it's limited to x86 target for the time being. Moving to EarlyIfConversion can be a next step.

The SimplifyCFG should only perform some trivial transforms from if to select in this context.

It is still "transforms from if to select". FoldTwoEntryPHINode() does things in a very conservative way, that it'll only fold if every instruction in IfTrue and IfFalse has no side effect, and can be traced from the operands of PHI. No change is done to these instructions.

I added a SpeculateUnpredictables option to prevent redundant runs.

nikic · 2024-07-23T10:48:29Z

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

+
+InstructionCost X86TTIImpl::getBranchMispredictPenalty() const {
+  // TODO: Hook MispredictPenalty of SchedMachineModel into this.
+  return 14;


How did you arrive at the value 14? Why is the TODO a TODO and not implemented in this PR?

nikic · 2024-07-23T11:52:41Z

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

This is intended to be a minimum viable product for 19.x. That's the reason why it's limited to x86 target for the time being. Moving to EarlyIfConversion can be a next step.

If I understand correctly, EarlyIfConversion already does this optimization, just on the machine level. Currently, it only allows critical path extension by half the mispredict penalty. It would seem reasonable to allow a larger extension for !unpredictable branches. So I don't really understand this idea of "implement it in SimplifyCFG now for 19.x and then move it to EarlyIfConversion later". If you plan to move this to EarlyIfConversion I would rather revert this (including from the release branch) and implement it there directly.

My own 2c on SimplifyCFG vs EarlyIfConversion is that doing it in SimplifyCFG can allow it to benefit from other select optimizations in the middle end, at the cost of losing branch based optimizations. Though the way this was implemented, with this transform only running at the very end of the middle-end pipeline, you wouldn't get the benefit of other middle-end optimizations anyway. If interaction with other middle-end optimizations is the reason for having this in SimplifyCFG, it may make sense to have a phase ordering test to demonstrate them.

nikic · 2024-07-23T11:54:41Z

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

I don't think removing it entirely makes sense or is possible, because InstCombine only recognizes many patterns in select form. So it's important that we convert "small" cases to selects in the middle-end. (I don't think the same is really true for the kinds of huge pattern handled here though.)

… is unpredictable. (#98495) The `!unpredictable` metadata has been present for a long time, but it's usage in optimizations is still limited. This patch teaches `FoldTwoEntryPHINode()` to be more aggressive with an unpredictable branch to reduce mispredictions. A TTI interface `getBranchMispredictPenalty()` is added to distinguish between different hardwares to ensure we don't go too far for simpler cores. For simplicity, only a naive x86 implementation is included for the time being.

dianqk · 2024-08-05T10:14:51Z

Any other comments? :)

tianqingw · 2024-08-07T00:15:12Z

Just a heads-up, I'm working on migrating it to CodeGen, though it's a little more work than expected because some of the infrastructures are not there yet (#102101).

tianqingw marked this pull request as ready for review July 11, 2024 15:26

llvmbot added backend:X86 llvm:analysis llvm:transforms labels Jul 11, 2024

williamweixiao requested review from williamweixiao, paulwalker-arm, preames and arsenm July 15, 2024 02:03

williamweixiao reviewed Jul 15, 2024

View reviewed changes

llvm/lib/Transforms/Utils/SimplifyCFG.cpp Outdated Show resolved Hide resolved

williamweixiao requested a review from HaohaiWen July 17, 2024 01:38

tcreech-intel mentioned this pull request Jul 18, 2024

Introduce UnpredictableProfileLoader for PMU branch-miss profiles #99027

Draft

tianqingw requested review from nikic, RKSimon and dianqk July 18, 2024 14:30

RKSimon reviewed Jul 18, 2024

View reviewed changes

williamweixiao approved these changes Jul 20, 2024

View reviewed changes

dianqk reviewed Jul 22, 2024

View reviewed changes

llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll Outdated Show resolved Hide resolved

tianqingw added 6 commits July 22, 2024 13:42

Add debug log.

645ba1f

Returns InstructionCost for getBranchMispredictPenalty().

da55feb

clang-format.

b008d5f

Canonicalize the test.

fa967f8

Add "speculate-unpredictables" SimplifyCFGOptions.

7a8e32b

tianqingw force-pushed the simpcfg-unpredictable branch from c64d637 to 7a8e32b Compare July 22, 2024 06:54

williamweixiao merged commit 3d494bf into llvm:main Jul 22, 2024
7 checks passed

nikic reviewed Jul 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. #98495

[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. #98495

Uh oh!

tianqingw commented Jul 11, 2024

Uh oh!

llvmbot commented Jul 11, 2024 •

edited

Loading

Uh oh!

Uh oh!

RKSimon Jul 18, 2024

Uh oh!

tianqingw Jul 18, 2024

Uh oh!

github-actions bot commented Jul 18, 2024 •

edited

Loading

Uh oh!

williamweixiao left a comment

Uh oh!

dianqk left a comment

Uh oh!

Uh oh!

dianqk commented Jul 22, 2024

Uh oh!

tianqingw commented Jul 22, 2024

Uh oh!

Uh oh!

nikic Jul 23, 2024

Uh oh!

nikic commented Jul 23, 2024

Uh oh!

nikic commented Jul 23, 2024

Uh oh!

dianqk commented Aug 5, 2024

Uh oh!

tianqingw commented Aug 7, 2024

Uh oh!

Uh oh!

[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. #98495

[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. #98495

Uh oh!

Conversation

tianqingw commented Jul 11, 2024

Uh oh!

llvmbot commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

RKSimon Jul 18, 2024

Choose a reason for hiding this comment

Uh oh!

tianqingw Jul 18, 2024

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jul 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

williamweixiao left a comment

Choose a reason for hiding this comment

Uh oh!

dianqk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dianqk commented Jul 22, 2024

Uh oh!

tianqingw commented Jul 22, 2024

Uh oh!

Uh oh!

nikic Jul 23, 2024

Choose a reason for hiding this comment

Uh oh!

nikic commented Jul 23, 2024

Uh oh!

nikic commented Jul 23, 2024

Uh oh!

dianqk commented Aug 5, 2024

Uh oh!

tianqingw commented Aug 7, 2024

Uh oh!

Uh oh!

llvmbot commented Jul 11, 2024 •

edited

Loading

github-actions bot commented Jul 18, 2024 •

edited

Loading