Skip to content

[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. #98495

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 22, 2024

Conversation

tianqingw
Copy link
Contributor

The !unpredictable metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
FoldTwoEntryPHINode() to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface getBranchMispredictPenalty() is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.

@llvmbot
Copy link
Member

llvmbot commented Jul 11, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Author: Tianqing Wang (tianqingw)

Changes

The !unpredictable metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
FoldTwoEntryPHINode() to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface getBranchMispredictPenalty() is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.


Full diff: https://github.com/llvm/llvm-project/pull/98495.diff

7 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.h (+2)
  • (modified) llvm/lib/Transforms/Utils/SimplifyCFG.cpp (+6-3)
  • (added) llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll (+96)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index dcdd9f82cde8e..9fbb6e1cff445 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -419,6 +419,11 @@ class TargetTransformInfo {
   /// this factor, it is very likely to be predicted correctly.
   BranchProbability getPredictableBranchThreshold() const;
 
+  // Returns an integer indicating how aggressive the target wants for
+  // eliminating unpredictable branches. A zero return value means extra
+  // optimization applied to them should be minimal.
+  unsigned getBranchMispredictPenalty() const;
+
   /// Return true if branch divergence exists.
   ///
   /// Branch divergence has a significantly negative impact on GPU performance
@@ -1820,6 +1825,7 @@ class TargetTransformInfo::Concept {
                                              ArrayRef<const Value *> Operands,
                                              TargetCostKind CostKind) = 0;
   virtual BranchProbability getPredictableBranchThreshold() = 0;
+  virtual unsigned getBranchMispredictPenalty() = 0;
   virtual bool hasBranchDivergence(const Function *F = nullptr) = 0;
   virtual bool isSourceOfDivergence(const Value *V) = 0;
   virtual bool isAlwaysUniform(const Value *V) = 0;
@@ -2228,6 +2234,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   BranchProbability getPredictableBranchThreshold() override {
     return Impl.getPredictableBranchThreshold();
   }
+  unsigned getBranchMispredictPenalty() override {
+    return Impl.getBranchMispredictPenalty();
+  }
   bool hasBranchDivergence(const Function *F = nullptr) override {
     return Impl.hasBranchDivergence(F);
   }
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 01624de190d51..d4b65ee6f5bd0 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -99,6 +99,8 @@ class TargetTransformInfoImplBase {
     return BranchProbability(99, 100);
   }
 
+  unsigned getBranchMispredictPenalty() const { return 0; }
+
   bool hasBranchDivergence(const Function *F = nullptr) const { return false; }
 
   bool isSourceOfDivergence(const Value *V) const { return false; }
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index c175d1737e54b..0ee9c8ee0cdf8 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -279,6 +279,10 @@ BranchProbability TargetTransformInfo::getPredictableBranchThreshold() const {
              : TTIImpl->getPredictableBranchThreshold();
 }
 
+unsigned TargetTransformInfo::getBranchMispredictPenalty() const {
+  return TTIImpl->getBranchMispredictPenalty();
+}
+
 bool TargetTransformInfo::hasBranchDivergence(const Function *F) const {
   return TTIImpl->hasBranchDivergence(F);
 }
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 32a3683355b72..984586f4ae5f6 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6756,3 +6756,8 @@ InstructionCost X86TTIImpl::getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
     return AM.Scale != 0;
   return -1;
 }
+
+unsigned X86TTIImpl::getBranchMispredictPenalty() const {
+  // TODO: Hook MispredictPenalty of SchedMachineModel into this.
+  return 14;
+}
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index 5eccb1aea308d..d2b5c093e7003 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -294,6 +294,8 @@ class X86TTIImpl : public BasicTTIImplBase<X86TTIImpl> {
   bool supportsEfficientVectorElementLoadStore() const;
   bool enableInterleavedAccessVectorization();
 
+  unsigned getBranchMispredictPenalty() const;
+
 private:
   bool supportsGather() const;
   InstructionCost getGSVectorCost(unsigned Opcode, TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
index 3fa3c0f1f52b0..128238bba10b0 100644
--- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
@@ -3508,7 +3508,8 @@ static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,
   // jump to one specific 'then' block (if we have two of them).
   // It isn't beneficial to speculatively execute the code
   // from the block that we know is predictably not entered.
-  if (!DomBI->getMetadata(LLVMContext::MD_unpredictable)) {
+  bool IsUnpredictable = DomBI->getMetadata(LLVMContext::MD_unpredictable);
+  if (!IsUnpredictable) {
     uint64_t TWeight, FWeight;
     if (extractBranchWeights(*DomBI, TWeight, FWeight) &&
         (TWeight + FWeight) != 0) {
@@ -3549,8 +3550,10 @@ static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,
   // that need to be moved to the dominating block.
   SmallPtrSet<Instruction *, 4> AggressiveInsts;
   InstructionCost Cost = 0;
-  InstructionCost Budget =
-      TwoEntryPHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic;
+  unsigned Threshold = TwoEntryPHINodeFoldingThreshold;
+  if (IsUnpredictable)
+    Threshold += TTI.getBranchMispredictPenalty();
+  InstructionCost Budget = Threshold * TargetTransformInfo::TCC_Basic;
 
   bool Changed = false;
   for (BasicBlock::iterator II = BB->begin(); isa<PHINode>(II);) {
diff --git a/llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll b/llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll
new file mode 100644
index 0000000000000..88aa8a619207d
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/two-entry-phi-fold-unpredictable.ll
@@ -0,0 +1,96 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; Two-entry phi nodes with unpredictable conditions may get increased budget for folding.
+; RUN: opt < %s -S -passes=simplifycfg | FileCheck --check-prefix=CHECK-NOFOLD %s
+; RUN: opt -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 < %s -S -passes=simplifycfg | FileCheck --check-prefix=CHECK-FOLD %s
+
+define { <2 x float>, <2 x float> } @foo(float %speed, <2 x float> %velocity.coerce0, <2 x float> %velocity.coerce1) {
+; CHECK-NOFOLD-LABEL: define { <2 x float>, <2 x float> } @foo(
+; CHECK-NOFOLD-SAME: float [[SPEED:%.*]], <2 x float> [[VELOCITY_COERCE0:%.*]], <2 x float> [[VELOCITY_COERCE1:%.*]]) {
+; CHECK-NOFOLD-NEXT:  [[ENTRY:.*]]:
+; CHECK-NOFOLD-NEXT:    [[CMP:%.*]] = fcmp fast ogt float [[SPEED]], 0x3F747AE140000000
+; CHECK-NOFOLD-NEXT:    br i1 [[CMP]], label %[[IF_THEN:.*]], label %[[IF_END:.*]], !unpredictable [[META0:![0-9]+]]
+; CHECK-NOFOLD:       [[IF_THEN]]:
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_0_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 0
+; CHECK-NOFOLD-NEXT:    [[MUL_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_0_VEC_EXTRACT]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 1
+; CHECK-NOFOLD-NEXT:    [[MUL8_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_4_VEC_EXTRACT]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[ADD_I_I_I_I:%.*]] = fadd fast float [[MUL8_I_I_I_I]], [[MUL_I_I_I_I]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE1]], i64 0
+; CHECK-NOFOLD-NEXT:    [[MUL13_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_14_8_VEC_EXTRACT]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[ADD14_I_I_I_I:%.*]] = fadd fast float [[ADD_I_I_I_I]], [[MUL13_I_I_I_I]]
+; CHECK-NOFOLD-NEXT:    [[TMP0:%.*]] = tail call fast noundef float @llvm.sqrt.f32(float [[ADD14_I_I_I_I]])
+; CHECK-NOFOLD-NEXT:    [[MUL_I_I_I:%.*]] = fdiv fast float 0x3FEFD70A40000000, [[TMP0]]
+; CHECK-NOFOLD-NEXT:    [[SUB_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[TMP1:%.*]] = insertelement <2 x float> poison, float [[SUB_I]], i64 0
+; CHECK-NOFOLD-NEXT:    [[SUB8_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_INSERT25:%.*]] = insertelement <2 x float> [[TMP1]], float [[SUB8_I]], i64 1
+; CHECK-NOFOLD-NEXT:    [[SUB13_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_INSERT35:%.*]] = insertelement <2 x float> [[VELOCITY_COERCE1]], float [[SUB13_I]], i64 0
+; CHECK-NOFOLD-NEXT:    br label %[[IF_END]]
+; CHECK-NOFOLD:       [[IF_END]]:
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_0_0:%.*]] = phi nsz <2 x float> [ [[VELOCITY_SROA_0_4_VEC_INSERT25]], %[[IF_THEN]] ], [ zeroinitializer, %[[ENTRY]] ]
+; CHECK-NOFOLD-NEXT:    [[VELOCITY_SROA_14_0:%.*]] = phi nsz <2 x float> [ [[VELOCITY_SROA_14_8_VEC_INSERT35]], %[[IF_THEN]] ], [ zeroinitializer, %[[ENTRY]] ]
+; CHECK-NOFOLD-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } poison, <2 x float> [[VELOCITY_SROA_0_0]], 0
+; CHECK-NOFOLD-NEXT:    [[DOTFCA_1_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } [[DOTFCA_0_INSERT]], <2 x float> [[VELOCITY_SROA_14_0]], 1
+; CHECK-NOFOLD-NEXT:    ret { <2 x float>, <2 x float> } [[DOTFCA_1_INSERT]]
+;
+; CHECK-FOLD-LABEL: define { <2 x float>, <2 x float> } @foo(
+; CHECK-FOLD-SAME: float [[SPEED:%.*]], <2 x float> [[VELOCITY_COERCE0:%.*]], <2 x float> [[VELOCITY_COERCE1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-FOLD-NEXT:  [[ENTRY:.*:]]
+; CHECK-FOLD-NEXT:    [[CMP:%.*]] = fcmp fast ogt float [[SPEED]], 0x3F747AE140000000
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_0_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 0
+; CHECK-FOLD-NEXT:    [[MUL_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_0_VEC_EXTRACT]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE0]], i64 1
+; CHECK-FOLD-NEXT:    [[MUL8_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_0_4_VEC_EXTRACT]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[ADD_I_I_I_I:%.*]] = fadd fast float [[MUL8_I_I_I_I]], [[MUL_I_I_I_I]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_EXTRACT:%.*]] = extractelement <2 x float> [[VELOCITY_COERCE1]], i64 0
+; CHECK-FOLD-NEXT:    [[MUL13_I_I_I_I:%.*]] = fmul fast float [[VELOCITY_SROA_14_8_VEC_EXTRACT]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[ADD14_I_I_I_I:%.*]] = fadd fast float [[ADD_I_I_I_I]], [[MUL13_I_I_I_I]]
+; CHECK-FOLD-NEXT:    [[TMP0:%.*]] = tail call fast float @llvm.sqrt.f32(float [[ADD14_I_I_I_I]])
+; CHECK-FOLD-NEXT:    [[MUL_I_I_I:%.*]] = fdiv fast float 0x3FEFD70A40000000, [[TMP0]]
+; CHECK-FOLD-NEXT:    [[SUB_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_0_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[TMP1:%.*]] = insertelement <2 x float> poison, float [[SUB_I]], i64 0
+; CHECK-FOLD-NEXT:    [[SUB8_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_0_4_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_4_VEC_INSERT25:%.*]] = insertelement <2 x float> [[TMP1]], float [[SUB8_I]], i64 1
+; CHECK-FOLD-NEXT:    [[SUB13_I:%.*]] = fmul fast float [[MUL_I_I_I]], [[VELOCITY_SROA_14_8_VEC_EXTRACT]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_14_8_VEC_INSERT35:%.*]] = insertelement <2 x float> [[VELOCITY_COERCE1]], float [[SUB13_I]], i64 0
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_0_0:%.*]] = select nsz i1 [[CMP]], <2 x float> [[VELOCITY_SROA_0_4_VEC_INSERT25]], <2 x float> zeroinitializer, !unpredictable [[META0:![0-9]+]]
+; CHECK-FOLD-NEXT:    [[VELOCITY_SROA_14_0:%.*]] = select nsz i1 [[CMP]], <2 x float> [[VELOCITY_SROA_14_8_VEC_INSERT35]], <2 x float> zeroinitializer, !unpredictable [[META0]]
+; CHECK-FOLD-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } poison, <2 x float> [[VELOCITY_SROA_0_0]], 0
+; CHECK-FOLD-NEXT:    [[DOTFCA_1_INSERT:%.*]] = insertvalue { <2 x float>, <2 x float> } [[DOTFCA_0_INSERT]], <2 x float> [[VELOCITY_SROA_14_0]], 1
+; CHECK-FOLD-NEXT:    ret { <2 x float>, <2 x float> } [[DOTFCA_1_INSERT]]
+;
+entry:
+  %cmp = fcmp fast ogt float %speed, 0x3F747AE140000000
+  br i1 %cmp, label %if.then, label %if.end, !unpredictable !0
+
+if.then:
+  %velocity.sroa.0.0.vec.extract = extractelement <2 x float> %velocity.coerce0, i64 0
+  %mul.i.i.i.i = fmul fast float %velocity.sroa.0.0.vec.extract, %velocity.sroa.0.0.vec.extract
+  %velocity.sroa.0.4.vec.extract = extractelement <2 x float> %velocity.coerce0, i64 1
+  %mul8.i.i.i.i = fmul fast float %velocity.sroa.0.4.vec.extract, %velocity.sroa.0.4.vec.extract
+  %add.i.i.i.i = fadd fast float %mul8.i.i.i.i, %mul.i.i.i.i
+  %velocity.sroa.14.8.vec.extract = extractelement <2 x float> %velocity.coerce1, i64 0
+  %mul13.i.i.i.i = fmul fast float %velocity.sroa.14.8.vec.extract, %velocity.sroa.14.8.vec.extract
+  %add14.i.i.i.i = fadd fast float %add.i.i.i.i, %mul13.i.i.i.i
+  %0 = tail call fast noundef float @llvm.sqrt.f32(float %add14.i.i.i.i)
+  %mul.i.i.i = fdiv fast float 0x3FEFD70A40000000, %0
+  %sub.i = fmul fast float %mul.i.i.i, %velocity.sroa.0.0.vec.extract
+  %1 = insertelement <2 x float> poison, float %sub.i, i64 0
+  %sub8.i = fmul fast float %mul.i.i.i, %velocity.sroa.0.4.vec.extract
+  %velocity.sroa.0.4.vec.insert25 = insertelement <2 x float> %1, float %sub8.i, i64 1
+  %sub13.i = fmul fast float %mul.i.i.i, %velocity.sroa.14.8.vec.extract
+  %velocity.sroa.14.8.vec.insert35 = insertelement <2 x float> %velocity.coerce1, float %sub13.i, i64 0
+  br label %if.end
+
+if.end:
+  %velocity.sroa.0.0 = phi nsz <2 x float> [ %velocity.sroa.0.4.vec.insert25, %if.then ], [ zeroinitializer, %entry ]
+  %velocity.sroa.14.0 = phi nsz <2 x float> [ %velocity.sroa.14.8.vec.insert35, %if.then ], [ zeroinitializer, %entry ]
+  %.fca.0.insert = insertvalue { <2 x float>, <2 x float> } poison, <2 x float> %velocity.sroa.0.0, 0
+  %.fca.1.insert = insertvalue { <2 x float>, <2 x float> } %.fca.0.insert, <2 x float> %velocity.sroa.14.0, 1
+  ret { <2 x float>, <2 x float> } %.fca.1.insert
+}
+
+declare float @llvm.sqrt.f32(float)
+
+!0 = !{}

// Returns an integer indicating how aggressive the target wants for
// eliminating unpredictable branches. A zero return value means extra
// optimization applied to them should be minimal.
unsigned getBranchMispredictPenalty() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific about how this penalty (and the threshold) is used - should it be comparable to TCK_Latency costs (and should it return InstructrionCost?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure about it and I'd like to have more discussion. IMO this is still very new and more use cases are needed to determine what's the best way for optimizations to leverage it. But for the time being, using it as an latency seems to be good.

Not sure about InstructionCost. With InstructionCost we need to:

  1. Define what Invalid represents. Perhaps just means unknown and no extra optimization.
  2. Define what 0 represents, and how it differs from Invalid. Maybe a (potentially hypothetical) hardware with very cheap branching, in which case we need to avoid folding branches at best effort? I can't think about a practical case yet.

Copy link

github-actions bot commented Jul 18, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@williamweixiao williamweixiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's wait one or two days to see if there are comments from other reviewers.

Copy link
Member

@dianqk dianqk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

@dianqk
Copy link
Member

dianqk commented Jul 22, 2024

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

I believe this is the right approach. The SimplifyCFG should only perform some trivial transforms from if to select in this context.

@nikic @dtcxzyw @arsenm might provide additional ideas.

is unpredictable.

The `!unpredictable` metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
@tianqingw tianqingw force-pushed the simpcfg-unpredictable branch from c64d637 to 7a8e32b Compare July 22, 2024 06:54
@tianqingw
Copy link
Contributor Author

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

This is intended to be a minimum viable product for 19.x. That's the reason why it's limited to x86 target for the time being. Moving to EarlyIfConversion can be a next step.

The SimplifyCFG should only perform some trivial transforms from if to select in this context.

It is still "transforms from if to select". FoldTwoEntryPHINode() does things in a very conservative way, that it'll only fold if every instruction in IfTrue and IfFalse has no side effect, and can be traced from the operands of PHI. No change is done to these instructions.

I added a SpeculateUnpredictables option to prevent redundant runs.

@williamweixiao williamweixiao merged commit 3d494bf into llvm:main Jul 22, 2024
7 checks passed

InstructionCost X86TTIImpl::getBranchMispredictPenalty() const {
// TODO: Hook MispredictPenalty of SchedMachineModel into this.
return 14;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you arrive at the value 14? Why is the TODO a TODO and not implemented in this PR?

@nikic
Copy link
Contributor

nikic commented Jul 23, 2024

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

This is intended to be a minimum viable product for 19.x. That's the reason why it's limited to x86 target for the time being. Moving to EarlyIfConversion can be a next step.

If I understand correctly, EarlyIfConversion already does this optimization, just on the machine level. Currently, it only allows critical path extension by half the mispredict penalty. It would seem reasonable to allow a larger extension for !unpredictable branches. So I don't really understand this idea of "implement it in SimplifyCFG now for 19.x and then move it to EarlyIfConversion later". If you plan to move this to EarlyIfConversion I would rather revert this (including from the release branch) and implement it there directly.

My own 2c on SimplifyCFG vs EarlyIfConversion is that doing it in SimplifyCFG can allow it to benefit from other select optimizations in the middle end, at the cost of losing branch based optimizations. Though the way this was implemented, with this transform only running at the very end of the middle-end pipeline, you wouldn't get the benefit of other middle-end optimizations anyway. If interaction with other middle-end optimizations is the reason for having this in SimplifyCFG, it may make sense to have a phase ordering test to demonstrate them.

@nikic
Copy link
Contributor

nikic commented Jul 23, 2024

I'm wondering if it's possible to remove the FoldTwoEntryPHINode altogether and hand it in EarlyIfConversion.

I don't think removing it entirely makes sense or is possible, because InstCombine only recognizes many patterns in select form. So it's important that we convert "small" cases to selects in the middle-end. (I don't think the same is really true for the kinds of huge pattern handled here though.)

yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
… is unpredictable. (#98495)

The `!unpredictable` metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
@dianqk
Copy link
Member

dianqk commented Aug 5, 2024

Any other comments? :)

@tianqingw
Copy link
Contributor Author

Just a heads-up, I'm working on migrating it to CodeGen, though it's a little more work than expected because some of the infrastructures are not there yet (#102101).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants