[SystemZ] Enable MachineCombiner for FP reassociation #83546


Merged (9 commits) on Apr 30, 2024

Conversation

JonPsson1
Contributor

This patch is still somewhat unfinished/experimental. It can use either the PPC patterns or an alternative set of new patterns. Some feedback would be welcome at this point (see below).

Main points:

  • Enable MachineCombiner
  • Disable reg/mem folding during isel as MachineCombiner can only work with reg/reg instructions.
  • Implement optimizeLoadInstr() so that PeepholeOptimizer can fold loads into reg/mem instructions after
    MachineCombiner.
  • Run a new simple pass in SystemZ backend that converts any remaining reg/reg pseudos. It also does some
    limited reg/mem folding across MBB boundaries which gives a nice improvement on imagick (13%!).

I have experimented to figure out which pattern combinations give the best improvement, measured as the decrease in the maximum depth of any instruction in the MBB, as reported by MachineTraceMetrics. The common-code binary patterns alone give a total static improvement of -10294. Adding different fma patterns gave some further improvements:

| Pattern set | Static decrease (additional rows relative to -10294) | With reassoc-additions: static / performance |
|---|---|---|
| Common-code binary patterns only | -10294 (total) | |
| PowerPC patterns (fma2add, fma3-ch) | -3211 | |
| fma1add, fma2, addfma | -6269 | -6812 / possibly better |
| fma1add, fma2 | -5760 | |
| All experimental patterns | -6237 | -6865 / about same |

Looking at these numbers, the pattern set {fma+fma, fma+add, add+fma} seemed the most promising combination, while the PPC patterns {fma+fma+fma, fma+fma+add} achieved only about half of that improvement.

However, even though these different FMA pattern settings give further static improvements, none of them shows any further performance improvement on SPEC. With the experimental reassoc-additions pass (right column) a bit more depth improvement resulted, but even with this there is no noticeable performance win.

A bit surprisingly, I instead saw a 13% improvement on imagick without the MachineCombiner, purely from the global reg/mem folding I tried. With the MachineCombiner using only the common-code patterns for binops, lbm improves ~20% and cactus ~2%.

My conclusion is that the MachineCombiner is worth enabling, and even though there is no noticeable performance improvement on SPEC, it seems good to have some basic FMA patterns, either like -z-fma or -ppc-fma. The simpler solution is probably to reuse the two PPC patterns, while the z-fma ones give a better maxdepth improvement, which may or may not be valuable in some cases. Compile time seems unchanged.

The experiments and results above include a minor fix in MachineCombiner (#82025), which hopefully can be committed, though it is probably not vital.

The full experimental patch:
MachineComb.patch.tar.gz

@llvmbot
Member

llvmbot commented Mar 1, 2024

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)

Changes



Patch is 116.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/83546.diff

27 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineCombinerPattern.h (+9)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+1-1)
  • (modified) llvm/lib/CodeGen/MachineCombiner.cpp (+8-3)
  • (modified) llvm/lib/Target/SystemZ/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/SystemZ/SystemZ.h (+2)
  • (added) llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp (+127)
  • (modified) llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp (+7)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrFP.td (+16-8)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrFormats.td (+30)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp (+554-14)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.h (+39)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrVector.td (+33-33)
  • (modified) llvm/lib/Target/SystemZ/SystemZOperators.td (+29)
  • (modified) llvm/lib/Target/SystemZ/SystemZScheduleZ13.td (+3-3)
  • (modified) llvm/lib/Target/SystemZ/SystemZScheduleZ14.td (+5-5)
  • (modified) llvm/lib/Target/SystemZ/SystemZScheduleZ15.td (+5-5)
  • (modified) llvm/lib/Target/SystemZ/SystemZScheduleZ16.td (+5-5)
  • (modified) llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp (+10)
  • (modified) llvm/lib/Target/X86/X86InstrInfo.cpp (+1-1)
  • (modified) llvm/lib/Target/X86/X86InstrInfo.h (+1-1)
  • (modified) llvm/test/CodeGen/SystemZ/fp-add-02.ll (+14)
  • (modified) llvm/test/CodeGen/SystemZ/fp-mul-02.ll (+11-1)
  • (added) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp-01.ll (+690)
  • (added) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp-03.ll (+90)
  • (added) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp-04.ll (+123)
  • (added) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp-08.ll (+115)
  • (added) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp-09.ll (+177)
diff --git a/llvm/include/llvm/CodeGen/MachineCombinerPattern.h b/llvm/include/llvm/CodeGen/MachineCombinerPattern.h
index 89eed7463bd783..c0b3927450cb69 100644
--- a/llvm/include/llvm/CodeGen/MachineCombinerPattern.h
+++ b/llvm/include/llvm/CodeGen/MachineCombinerPattern.h
@@ -176,6 +176,15 @@ enum class MachineCombinerPattern {
   FMSUB,
   FNMSUB,
 
+  // SystemZ patterns.  (EXPERIMENTAL)
+  FMA2_P1P0,
+  FMA2_P0P1,
+  FMA2,
+  FMA1_Add_L,
+  FMA1_Add_R,
+  FMA3,     // These are inspired by PPC
+  FMA2_Add, //
+
   // X86 VNNI
   DPWSSD,
 
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index e7787aafb98e2d..dc48056c3d9bcd 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -1698,7 +1698,7 @@ class TargetInstrInfo : public MCInstrInfo {
   /// instruction that defines FoldAsLoadDefReg, and the function returns
   /// the machine instruction generated due to folding.
   virtual MachineInstr *optimizeLoadInstr(MachineInstr &MI,
-                                          const MachineRegisterInfo *MRI,
+                                          MachineRegisterInfo *MRI,
                                           Register &FoldAsLoadDefReg,
                                           MachineInstr *&DefMI) const {
     return nullptr;
diff --git a/llvm/lib/CodeGen/MachineCombiner.cpp b/llvm/lib/CodeGen/MachineCombiner.cpp
index c65937935ed820..577331441eef28 100644
--- a/llvm/lib/CodeGen/MachineCombiner.cpp
+++ b/llvm/lib/CodeGen/MachineCombiner.cpp
@@ -155,9 +155,7 @@ MachineCombiner::getOperandDef(const MachineOperand &MO) {
   // We need a virtual register definition.
   if (MO.isReg() && MO.getReg().isVirtual())
     DefInstr = MRI->getUniqueVRegDef(MO.getReg());
-  // PHI's have no depth etc.
-  if (DefInstr && DefInstr->isPHI())
-    DefInstr = nullptr;
+  // (PATCH PROPOSED for PHIs: https://github.com/llvm/llvm-project/pull/82025)
   return DefInstr;
 }
 
@@ -317,6 +315,13 @@ static CombinerObjective getCombinerObjective(MachineCombinerPattern P) {
   case MachineCombinerPattern::FMADD_XA:
   case MachineCombinerPattern::FMSUB:
   case MachineCombinerPattern::FNMSUB:
+  case MachineCombinerPattern::FMA2_P1P0:
+  case MachineCombinerPattern::FMA2_P0P1:
+  case MachineCombinerPattern::FMA2:
+  case MachineCombinerPattern::FMA1_Add_L:
+  case MachineCombinerPattern::FMA1_Add_R:
+  case MachineCombinerPattern::FMA3:
+  case MachineCombinerPattern::FMA2_Add:
     return CombinerObjective::MustReduceDepth;
   case MachineCombinerPattern::REASSOC_XY_BCA:
   case MachineCombinerPattern::REASSOC_XY_BAC:
diff --git a/llvm/lib/Target/SystemZ/CMakeLists.txt b/llvm/lib/Target/SystemZ/CMakeLists.txt
index 0614e07bde8ac1..235df220729447 100644
--- a/llvm/lib/Target/SystemZ/CMakeLists.txt
+++ b/llvm/lib/Target/SystemZ/CMakeLists.txt
@@ -20,6 +20,7 @@ add_llvm_target(SystemZCodeGen
   SystemZConstantPoolValue.cpp
   SystemZCopyPhysRegs.cpp
   SystemZElimCompare.cpp
+  SystemZFinalizeReassociation.cpp
   SystemZFrameLowering.cpp
   SystemZHazardRecognizer.cpp
   SystemZISelDAGToDAG.cpp
diff --git a/llvm/lib/Target/SystemZ/SystemZ.h b/llvm/lib/Target/SystemZ/SystemZ.h
index d7aa9e4e18cbbb..49a200babfff57 100644
--- a/llvm/lib/Target/SystemZ/SystemZ.h
+++ b/llvm/lib/Target/SystemZ/SystemZ.h
@@ -195,12 +195,14 @@ FunctionPass *createSystemZShortenInstPass(SystemZTargetMachine &TM);
 FunctionPass *createSystemZLongBranchPass(SystemZTargetMachine &TM);
 FunctionPass *createSystemZLDCleanupPass(SystemZTargetMachine &TM);
 FunctionPass *createSystemZCopyPhysRegsPass(SystemZTargetMachine &TM);
+FunctionPass *createSystemZFinalizeReassociationPass(SystemZTargetMachine &TM);
 FunctionPass *createSystemZPostRewritePass(SystemZTargetMachine &TM);
 FunctionPass *createSystemZTDCPass();
 
 void initializeSystemZCopyPhysRegsPass(PassRegistry &);
 void initializeSystemZDAGToDAGISelPass(PassRegistry &);
 void initializeSystemZElimComparePass(PassRegistry &);
+void initializeSystemZFinalizeReassociationPass(PassRegistry &);
 void initializeSystemZLDCleanupPass(PassRegistry &);
 void initializeSystemZLongBranchPass(PassRegistry &);
 void initializeSystemZPostRewritePass(PassRegistry &);
diff --git a/llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp b/llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp
new file mode 100644
index 00000000000000..d441b8bbc87381
--- /dev/null
+++ b/llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp
@@ -0,0 +1,127 @@
+//===---- SystemZFinalizeReassociation.cpp - Finalize FP reassociation ----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass is the last step of the process of enabling reassociation with
+// the MachineCombiner. These are the steps involved:
+//
+// 1. Instruction selection: Disable reg/mem folding for any operations that
+//    are reassociable since MachineCombiner will not succeed
+//    otherwise. Instead select a reg/reg pseudo that pretends to clobber CC.
+//
+// 2. MachineCombiner: Performs reassociation with the reg/reg instructions.
+//
+// 3. PeepholeOptimizer: fold loads into reg/mem instructions after
+//    reassociation. The reg/mem opcode sets CC which is why the special
+//    reg/reg pseudo is needed.
+//
+// 4. Convert any remaining pseudos into the target opcodes that do not
+//    clobber CC (this pass).
+//
+//===----------------------------------------------------------------------===//
+
+#include "SystemZMachineFunctionInfo.h"
+#include "SystemZTargetMachine.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/TargetInstrInfo.h"
+#include "llvm/CodeGen/TargetRegisterInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+namespace {
+
+class SystemZFinalizeReassociation : public MachineFunctionPass {
+public:
+  static char ID;
+  SystemZFinalizeReassociation()
+    : MachineFunctionPass(ID), TII(nullptr), MRI(nullptr) {
+    initializeSystemZFinalizeReassociationPass(*PassRegistry::getPassRegistry());
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override;
+  void getAnalysisUsage(AnalysisUsage &AU) const override;
+
+private:
+
+  bool visitMBB(MachineBasicBlock &MBB);
+
+  const SystemZInstrInfo *TII;
+  MachineRegisterInfo *MRI;
+};
+
+char SystemZFinalizeReassociation::ID = 0;
+
+} // end anonymous namespace
+
+INITIALIZE_PASS(SystemZFinalizeReassociation, "systemz-finalize-reassoc",
+                "SystemZ Finalize Reassociation", false, false)
+
+FunctionPass *llvm::
+createSystemZFinalizeReassociationPass(SystemZTargetMachine &TM) {
+  return new SystemZFinalizeReassociation();
+}
+
+void SystemZFinalizeReassociation::getAnalysisUsage(AnalysisUsage &AU) const {
+  AU.setPreservesCFG();
+  MachineFunctionPass::getAnalysisUsage(AU);
+}
+
+bool SystemZFinalizeReassociation::visitMBB(MachineBasicBlock &MBB) {
+  bool Changed = false;
+  for (MachineInstr &MI : llvm::make_early_inc_range(MBB)) {
+    unsigned PseudoOpcode = MI.getOpcode();
+    unsigned TargetOpcode =
+      PseudoOpcode == SystemZ::WFADB_CCPseudo    ? SystemZ::WFADB
+      : PseudoOpcode == SystemZ::WFASB_CCPseudo  ? SystemZ::WFASB
+      : PseudoOpcode == SystemZ::WFSDB_CCPseudo  ? SystemZ::WFSDB
+      : PseudoOpcode == SystemZ::WFSSB_CCPseudo  ? SystemZ::WFSSB
+      : PseudoOpcode == SystemZ::WFMDB_CCPseudo  ? SystemZ::WFMDB
+      : PseudoOpcode == SystemZ::WFMSB_CCPseudo  ? SystemZ::WFMSB
+      : PseudoOpcode == SystemZ::WFMADB_CCPseudo ? SystemZ::WFMADB
+      : PseudoOpcode == SystemZ::WFMASB_CCPseudo ? SystemZ::WFMASB
+      : 0;
+    if (TargetOpcode) {
+      // PeepholeOptimizer will not fold any loads across basic blocks, which
+      // however seems beneficial, so do it here:
+      bool Folded = false;
+      for (unsigned Op = 1; Op <= 2; ++Op) {
+        Register Reg = MI.getOperand(Op).getReg();
+        if (MachineInstr *DefMI = MRI->getVRegDef(Reg))
+          if (TII->optimizeLoadInstr(MI, MRI, Reg, DefMI)) {
+            MI.eraseFromParent();
+            DefMI->eraseFromParent();
+            MRI->markUsesInDebugValueAsUndef(Reg);
+            Folded = true;
+            break;
+          }
+      }
+
+      if (!Folded) {
+        MI.setDesc(TII->get(TargetOpcode));
+        int CCIdx = MI.findRegisterDefOperandIdx(SystemZ::CC);
+        MI.removeOperand(CCIdx);
+      }
+      Changed = true;
+    }
+  }
+  return Changed;
+}
+
+bool SystemZFinalizeReassociation::runOnMachineFunction(MachineFunction &F) {
+  TII = F.getSubtarget<SystemZSubtarget>().getInstrInfo();
+  MRI = &F.getRegInfo();
+
+  bool Modified = false;
+  for (auto &MBB : F)
+    Modified |= visitMBB(MBB);
+
+  return Modified;
+}
diff --git a/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp b/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
index 815eca1240d827..652980d3c2c0df 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
@@ -347,6 +347,13 @@ class SystemZDAGToDAGISel : public SelectionDAGISel {
   // Try to expand a boolean SELECT_CCMASK using an IPM sequence.
   SDValue expandSelectBoolean(SDNode *Node);
 
+  // Return true if the flags of N and the subtarget allow for reassociation.
+  bool isReassociable(SDNode *N) const {
+    return N->getFlags().hasAllowReassociation() &&
+           N->getFlags().hasNoSignedZeros() &&
+           Subtarget->hasVector();
+  }
+
 public:
   static char ID;
 
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrFP.td b/llvm/lib/Target/SystemZ/SystemZInstrFP.td
index 6e67425c1e788b..5b383b3fa6ae88 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrFP.td
+++ b/llvm/lib/Target/SystemZ/SystemZInstrFP.td
@@ -430,8 +430,10 @@ let Uses = [FPC], mayRaiseFPException = 1,
     def ADBR : BinaryRRE<"adbr", 0xB31A, any_fadd, FP64,  FP64>;
     def AXBR : BinaryRRE<"axbr", 0xB34A, any_fadd, FP128, FP128>;
   }
-  defm AEB : BinaryRXEAndPseudo<"aeb", 0xED0A, any_fadd, FP32, load, 4>;
-  defm ADB : BinaryRXEAndPseudo<"adb", 0xED1A, any_fadd, FP64, load, 8>;
+  defm AEB : BinaryRXEAndPseudo<"aeb", 0xED0A, z_any_fadd_noreassoc, FP32,
+                                load, 4>;
+  defm ADB : BinaryRXEAndPseudo<"adb", 0xED1A, z_any_fadd_noreassoc, FP64,
+                                load, 8>;
 }
 
 // Subtraction.
@@ -441,8 +443,10 @@ let Uses = [FPC], mayRaiseFPException = 1,
   def SDBR : BinaryRRE<"sdbr", 0xB31B, any_fsub, FP64,  FP64>;
   def SXBR : BinaryRRE<"sxbr", 0xB34B, any_fsub, FP128, FP128>;
 
-  defm SEB : BinaryRXEAndPseudo<"seb",  0xED0B, any_fsub, FP32, load, 4>;
-  defm SDB : BinaryRXEAndPseudo<"sdb",  0xED1B, any_fsub, FP64, load, 8>;
+  defm SEB : BinaryRXEAndPseudo<"seb",  0xED0B, z_any_fsub_noreassoc, FP32,
+                                load, 4>;
+  defm SDB : BinaryRXEAndPseudo<"sdb",  0xED1B, z_any_fsub_noreassoc, FP64,
+                                load, 8>;
 }
 
 // Multiplication.
@@ -452,8 +456,10 @@ let Uses = [FPC], mayRaiseFPException = 1 in {
     def MDBR  : BinaryRRE<"mdbr",  0xB31C, any_fmul, FP64,  FP64>;
     def MXBR  : BinaryRRE<"mxbr",  0xB34C, any_fmul, FP128, FP128>;
   }
-  defm MEEB : BinaryRXEAndPseudo<"meeb", 0xED17, any_fmul, FP32, load, 4>;
-  defm MDB  : BinaryRXEAndPseudo<"mdb",  0xED1C, any_fmul, FP64, load, 8>;
+  defm MEEB : BinaryRXEAndPseudo<"meeb", 0xED17, z_any_fmul_noreassoc, FP32,
+                                 load, 4>;
+  defm MDB  : BinaryRXEAndPseudo<"mdb",  0xED1C, z_any_fmul_noreassoc, FP64,
+                                 load, 8>;
 }
 
 // f64 multiplication of two FP32 registers.
@@ -495,8 +501,10 @@ let Uses = [FPC], mayRaiseFPException = 1 in {
   def MAEBR : TernaryRRD<"maebr", 0xB30E, z_any_fma, FP32, FP32>;
   def MADBR : TernaryRRD<"madbr", 0xB31E, z_any_fma, FP64, FP64>;
 
-  defm MAEB : TernaryRXFAndPseudo<"maeb", 0xED0E, z_any_fma, FP32, FP32, load, 4>;
-  defm MADB : TernaryRXFAndPseudo<"madb", 0xED1E, z_any_fma, FP64, FP64, load, 8>;
+  defm MAEB : TernaryRXFAndPseudo<"maeb", 0xED0E, z_any_fma_noreassoc, FP32,
+                                  FP32, load, 4>;
+  defm MADB : TernaryRXFAndPseudo<"madb", 0xED1E, z_any_fma_noreassoc, FP64,
+                                  FP64, load, 8>;
 }
 
 // Fused multiply-subtract.
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrFormats.td b/llvm/lib/Target/SystemZ/SystemZInstrFormats.td
index bb9fa0fc33ffa0..9fdda0a5d60031 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrFormats.td
+++ b/llvm/lib/Target/SystemZ/SystemZInstrFormats.td
@@ -5536,3 +5536,33 @@ multiclass StringRRE<string mnemonic, bits<16> opcode,
                       [(set GR64:$end, (operator GR64:$start1, GR64:$start2,
                                                  GR32:$char))]>;
 }
+
+multiclass BinaryVRRcAndCCPseudo<string mnemonic, bits<16> opcode,
+                                 SDPatternOperator operator,
+                                 SDPatternOperator reassoc_operator,
+                                 TypedReg tr1, TypedReg tr2, bits<4> type = 0,
+                                 bits<4> m5 = 0, bits<4> m6 = 0,
+                                 string fp_mnemonic = ""> {
+  def "" : BinaryVRRc<mnemonic, opcode, operator, tr1, tr2, type, m5, m6,
+                      fp_mnemonic>;
+  let Defs = [CC], AddedComplexity = 1 in // Win over "".
+    def _CCPseudo : Pseudo<(outs tr1.op:$V1), (ins tr2.op:$V2, tr2.op:$V3),
+                           [(set (tr1.vt tr1.op:$V1),
+                                 (reassoc_operator (tr2.vt tr2.op:$V2),
+                                                   (tr2.vt tr2.op:$V3)))]>;
+}
+
+multiclass TernaryVRReAndCCPseudo<string mnemonic, bits<16> opcode,
+                                  SDPatternOperator operator,
+                                  SDPatternOperator reassoc_operator,
+                                  TypedReg tr1, TypedReg tr2, bits<4> m5 = 0,
+                                  bits<4> type = 0, string fp_mnemonic = ""> {
+  def "" : TernaryVRRe<mnemonic, opcode, operator, tr1, tr2, m5, type,
+                       fp_mnemonic>;
+  let Defs = [CC], AddedComplexity = 1 in // Win over "".
+    def _CCPseudo : Pseudo<(outs tr1.op:$V1),
+                           (ins tr2.op:$V2, tr2.op:$V3, tr1.op:$V4),
+            [(set (tr1.vt tr1.op:$V1), (reassoc_operator (tr2.vt tr2.op:$V2),
+                                                         (tr2.vt tr2.op:$V3),
+                                                         (tr1.vt tr1.op:$V4)))]>;
+}
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 046a12208467b4..1d8db507e30f14 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -21,6 +21,7 @@
 #include "llvm/CodeGen/LivePhysRegs.h"
 #include "llvm/CodeGen/LiveVariables.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineCombinerPattern.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineInstr.h"
@@ -610,6 +611,112 @@ void SystemZInstrInfo::insertSelect(MachineBasicBlock &MBB,
     .addImm(CCValid).addImm(CCMask);
 }
 
+static void transferDeadCC(MachineInstr *OldMI, MachineInstr *NewMI) {
+  if (OldMI->registerDefIsDead(SystemZ::CC)) {
+    MachineOperand *CCDef = NewMI->findRegisterDefOperand(SystemZ::CC);
+    if (CCDef != nullptr)
+      CCDef->setIsDead(true);
+  }
+}
+
+void SystemZInstrInfo::transferMIFlag(MachineInstr *OldMI, MachineInstr *NewMI,
+                                      MachineInstr::MIFlag Flag) const {
+  if (OldMI->getFlag(Flag))
+    NewMI->setFlag(Flag);
+}
+
+static cl::opt<unsigned> MaxUsersGlobalFold("maxusersglobfold", cl::init(7));
+
+MachineInstr *SystemZInstrInfo::optimizeLoadInstr(MachineInstr &MI,
+                                                  MachineRegisterInfo *MRI,
+                                                  Register &FoldAsLoadDefReg,
+                                                  MachineInstr *&DefMI) const {
+  const TargetRegisterInfo *TRI = MRI->getTargetRegisterInfo();
+
+  // Check whether we can move the DefMI load, and that it only has one use.
+  DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
+  assert(DefMI);
+  bool SawStore = false;
+  if (!DefMI->isSafeToMove(nullptr, SawStore) ||
+      !MRI->hasOneNonDBGUse(FoldAsLoadDefReg))
+    return nullptr;
+
+  // For reassociable FP operations, any loads have been purposefully left
+  // unfolded so that MachineCombiner can do its work on reg/reg
+  // opcodes. After that, as many loads as possible are now folded.
+  unsigned LoadOpcode = 0;
+  unsigned RegMemOpcode = 0;
+  const TargetRegisterClass *FPRC = nullptr;
+  RegMemOpcode = MI.getOpcode() == SystemZ::WFADB_CCPseudo    ? SystemZ::ADB
+                 : MI.getOpcode() == SystemZ::WFSDB_CCPseudo  ? SystemZ::SDB
+                 : MI.getOpcode() == SystemZ::WFMDB_CCPseudo  ? SystemZ::MDB
+                 : MI.getOpcode() == SystemZ::WFMADB_CCPseudo ? SystemZ::MADB
+                                                              : 0;
+  if (RegMemOpcode) {
+    LoadOpcode = SystemZ::VL64;
+    FPRC = &SystemZ::FP64BitRegClass;
+  } else {
+    RegMemOpcode = MI.getOpcode() == SystemZ::WFASB_CCPseudo    ? SystemZ::AEB
+                   : MI.getOpcode() == SystemZ::WFSSB_CCPseudo  ? SystemZ::SEB
+                   : MI.getOpcode() == SystemZ::WFMSB_CCPseudo  ? SystemZ::MEEB
+                   : MI.getOpcode() == SystemZ::WFMASB_CCPseudo ? SystemZ::MAEB
+                                                                : 0;
+    if (RegMemOpcode) {
+      LoadOpcode = SystemZ::VL32;
+      FPRC = &SystemZ::FP32BitRegClass;
+    }
+  }
+  if (!RegMemOpcode || DefMI->getOpcode() != LoadOpcode)
+    return nullptr;
+
+  DebugLoc DL = MI.getDebugLoc();
+  Register DstReg = MI.getOperand(0).getReg();
+  MachineOperand LHS = MI.getOperand(1);
+  MachineOperand RHS = MI.getOperand(2);
+  bool IsTernary =
+    (RegMemOpcode == SystemZ::MADB || RegMemOpcode == SystemZ::MAEB);
+  MachineOperand *AccMO = IsTernary ? &MI.getOperand(3) : nullptr;
+  if ((RegMemOpcode == SystemZ::SDB || RegMemOpcode == SystemZ::SEB) &&
+      FoldAsLoadDefReg != RHS.getReg())
+    return nullptr;
+  if (IsTernary && FoldAsLoadDefReg == AccMO->getReg())
+    return nullptr;
+  MachineOperand &RegMO = RHS.getReg() == FoldAsLoadDefReg ? LHS : RHS;
+  // It seems that folding globally into 2-address with too many users of reg
+  // is detrimental (resulting in reg copies before each use).
+  if (!MRI->hasOneNonDBGUse(RegMO.getReg()) && !RegMO.isKill() &&
+      DefMI->getParent() != MI.getParent()) {
+    unsigned NumUsers = 0;
+    for (MachineInstr &UI : MRI->use_nodbg_instructions(RegMO.getReg())) {
+      NumUsers++; (void)UI;
+    }
+    if (NumUsers > MaxUsersGlobalFold)
+      return nullptr;
+  }
+
+  MachineOperand &Base = DefMI->getOperand(1);
+  MachineOperand &Disp = DefMI->getOperand(2);
+  MachineOperand &Indx = DefMI->getOperand(3);
+  if (Base.isReg())  // Could be a FrameIndex.
+    Base.setIsKill(false);
+  Indx.setIsKill(false);
+
+  MachineInstrBuilder MIB =
+    BuildMI(*MI.getParent(), MI, DL, get(RegMemOpcode), DstReg);
+  if (IsTernary)
+    MIB.add(*AccMO);
+  MIB.add(RegMO);
+  MIB.add(Base).add(Disp).add(Indx);
+  MIB.addMemOperand(*DefMI->memoperands_begin());
+  transferMIFlag(&MI, MIB, MachineInstr::NoFPExcept);
+  MIB->addRegisterDead(SystemZ::CC, TRI);
+  MRI->setRegClass(DstReg, FPRC);
+  if (IsTernary)
+    MRI->setRegClass(AccMO->getReg(), FPRC);
+  MRI->setRegClass(RegMO.getReg(), FPRC);
+  return MIB;
+}
+
 bool SystemZInstrInfo::foldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
                                      Register Reg,
                                      MachineRegisterInfo *MRI) const {
@@ -937,20 +1044,6 @@ static LogicOp interpretAndImmediate(unsigned Opcode) {
   }
 }
 
-static void transferDeadCC(MachineInstr *OldMI, MachineInstr *NewMI) {
-  if (OldMI->registerDefIsDead(SystemZ::CC)) {
-    MachineOperand *CCDef = NewMI->findRegisterDefOperand(SystemZ::CC);
-    ...
[truncated]


github-actions bot commented Mar 1, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff f73e87f53f5d8a86c29251dedc9dbd264179203a add84038dd93f7944aa8d4e9f30547d77912c3f6 -- llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp llvm/lib/Target/SystemZ/SystemZInstrInfo.h llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 2b61cff727..d1d2a68a32 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -610,10 +610,9 @@ void SystemZInstrInfo::insertSelect(MachineBasicBlock &MBB,
     .addImm(CCValid).addImm(CCMask);
 }
 
-MachineInstr *SystemZInstrInfo::optimizeLoadInstr(MachineInstr &MI,
-                                                  const MachineRegisterInfo *MRI,
-                                                  Register &FoldAsLoadDefReg,
-                                                  MachineInstr *&DefMI) const {
+MachineInstr *SystemZInstrInfo::optimizeLoadInstr(
+    MachineInstr &MI, const MachineRegisterInfo *MRI,
+    Register &FoldAsLoadDefReg, MachineInstr *&DefMI) const {
   // Check whether we can move the DefMI load, and that it only has one use.
   DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
   assert(DefMI);
@@ -1461,8 +1460,7 @@ MachineInstr *SystemZInstrInfo::foldMemoryOperandImpl(
   if (get(RegMemOpcode).hasImplicitDefOfPhysReg(SystemZ::CC)) {
     assert(LoadMI.getParent() == MI.getParent() && "Assuming a local fold.");
     assert(LoadMI != InsertPt && "Assuming InsertPt not to be first in MBB.");
-    for (MachineBasicBlock::iterator MII = std::prev(InsertPt);;
-         --MII) {
+    for (MachineBasicBlock::iterator MII = std::prev(InsertPt);; --MII) {
       if (MII->definesRegister(SystemZ::CC, /*TRI=*/nullptr)) {
         if (!MII->registerDefIsDead(SystemZ::CC, /*TRI=*/nullptr))
           return nullptr;
@@ -1490,12 +1488,12 @@ MachineInstr *SystemZInstrInfo::foldMemoryOperandImpl(
   MachineOperand &Base = LoadMI.getOperand(1);
   MachineOperand &Disp = LoadMI.getOperand(2);
   MachineOperand &Indx = LoadMI.getOperand(3);
-  MachineInstrBuilder MIB =
-      BuildMI(*MI.getParent(), InsertPt, MI.getDebugLoc(), get(RegMemOpcode), DstReg)
-          .add(RegMO)
-          .add(Base)
-          .add(Disp)
-          .add(Indx);
+  MachineInstrBuilder MIB = BuildMI(*MI.getParent(), InsertPt, MI.getDebugLoc(),
+                                    get(RegMemOpcode), DstReg)
+                                .add(RegMO)
+                                .add(Base)
+                                .add(Disp)
+                                .add(Indx);
   MIB->addRegisterDead(SystemZ::CC, &RI);
   MRI->setRegClass(DstReg, FPRC);
   MRI->setRegClass(RegMO.getReg(), FPRC);
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp b/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
index dced64d6b2..14675b049e 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
@@ -30,10 +30,10 @@
 
 using namespace llvm;
 
-static cl::opt<bool> EnableMachineCombinerPass(
-    "systemz-machine-combiner",
-    cl::desc("Enable the machine combiner pass"),
-    cl::init(true), cl::Hidden);
+static cl::opt<bool>
+    EnableMachineCombinerPass("systemz-machine-combiner",
+                              cl::desc("Enable the machine combiner pass"),
+                              cl::init(true), cl::Hidden);
 
 // NOLINTNEXTLINE(readability-identifier-naming)
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeSystemZTarget() {

@uweigand
Member

uweigand commented Mar 1, 2024

A bit surprisingly, I instead saw a 13% improvement on imagick without the MachineCombiner, just from the global reg/mem folding I tried.

Haven't looked into the patch in detail yet, but this is interesting. Should we then always do the reg/mem folding late (for all FP operations, including non-reassociable ones), if it is generally beneficial? This might actually even simplify the code (and could be done as a separate PR before the MachineCombiner reassociation patch?).

@JonPsson1 JonPsson1 force-pushed the MachineComb_patch branch from fab2c85 to ac44853 Compare April 4, 2024 12:56
@JonPsson1
Contributor Author

JonPsson1 commented Apr 4, 2024

Patch rebased.

  • The little patch for the MachineCombiner has been committed.

  • Scratch the improvement from the extra reg/mem folding - sorry, this was broken, as isSafeToMove()
    does no alias-analysis (AA) check, nor any checking across basic blocks.

  • Benchmarks: Still a 20% improvement on LBM with add/sub/mul reassociation (common code). Adding FMA patterns still gives no visible improvement on SPEC.

  • Patch is not quite ready yet, with some steps needed before a final update and (performance) review:

FMA: Choose a strategy (PPC or Z -- see the previous comment from March 1st), and maybe add some tests for 'float' as well. Alternatively, skip the FMA patterns for now until they prove themselves on benchmarks? After this, the patch could be simplified/refactored a bit further.

Should we also fold 20-bit offset addresses in optimizeLoadInstr()? That still costs an extra instruction compared to a separate load - an LAY here, while ISel uses other forms of immediate add, e.g. AGHI. I suppose this should be done, but I have no benchmarking showing it is worth the extra trouble. It makes some sense in that it saves an FP register.

Sidenotes:

  • Saw just a dozen or so more reg/mem opcodes without unsafe math / machine-combiner. It seems that machine-sink
    puts a few loads next to their single users (from a predecessor block). Shouldn't matter much, but good to know - the peephole optimizer could then actually be a way to fold more than what local isel can do. Probably quite marginal, though.

  • This is so far only for FP - looking into integer as well is probably a very good next step. (CC @dominik-steenken )

@JonPsson1
Contributor Author

Patch finalised for review. Holding off on the FMA patterns, as they so far give no performance improvement, and optimizeLoadInstr() has been simplified to only fold 12-bit displacements.

  • The optimizeLoadInstr() MRI argument needs to be non-const, as the SystemZ implementation calls setRegClass().

  • The reg/reg pseudos that clobber CC need to be converted to the real target instructions in case they
    were not folded into reg/mem forms during peephole-opt. This should be done before the pre-RA machine
    scheduler, but the question is where. The patch currently adds a simple pass that does only this, which
    is slightly bulky for such a simple task. As an alternative, it seems possible to do this in
    createMachineScheduler() instead, although that is not the basic intent of that method (and in theory it
    is better to do it as early as possible). Another option might be to ignore the effects of the CC operand
    entirely during regalloc (scheduling) and handle it somewhere later.

  • I tried changing the SystemZ::optimizeLoadInstr() check for hasOneNonDBGUse into an assertion, as I
    thought the peephole optimizer should already be doing this check. However, it seems it checks for a single
    instruction user, not just a single use operand. Changing this in peephole-opt gave test failures on X86, which may
    or may not be real. Perhaps worth a follow-up.

@uweigand
Member

A couple more general issues that I think we should consider:

  • Instead of adding a CC clobber first and then removing it, did we try making optimizeLoadInstr only fold the load if CC is not live at this point? This is what foldMemoryOperandImpl does ... Would this have worse performance?
  • Talking about foldMemoryOperandImpl, it's a bit weird that there are three different target hooks to fold a load: optimizeLoadInstr, foldMemoryOperandImpl with FrameIndex, and foldMemoryOperandImpl with LoadMI. The X86 back-end implements the "real" folding in foldMemoryOperandImpl with LoadMI, and has the other two call into that. Our back-end only implements foldMemoryOperandImpl with FrameIndex so far, and your patch now duplicates a subset of this as optimizeLoadInstr. What would it take for us to do the same as X86, and would this maybe have other benefits?
  • X86 also implements an unfoldMemoryOperand callback. I was wondering if this would allow reassociation after creating reg/mem instruction, but it looks like MachineCombiner never even uses that callback. Still, would we see any benefit from implementing it? (If it's unrelated to the combiner, that is of course a separate topic.)

Not saying we need to implement any of this right now, but I think we should at least have an opinion on what the best direction is going forward.

@JonPsson1
Contributor Author

Instead of adding a CC clobber first and then removing it, did we try making optimizeLoadInstr only fold the load if CC is not live at this point? This is what foldMemoryOperandImpl does ... Would this have worse performance?

I agree that now that the general reg/reg scheme doesn't seem worthwhile, it may not be worth using the CC-pseudos and the extra pass to convert them back. I tried to do the CC check in optimizeLoadInstr() instead of using CC-pseudos. Performance-wise I see no difference, and doing it this way only misses a few foldings, which seem acceptable:

"CC pseudos" <-> "CC check"

seb            :                 4457                 4307     -150
adb            :                 9180                 9033     -147
sdb            :                 4336                 4275      -61
aeb            :                 6626                 6574      -52
mdb            :                 8717                 8715       -2

The backwards search for a CC def should be OK compile-time-wise, since many instructions clobber CC (so the scan terminates quickly), but if an issue emerges, the peephole optimizer could probably easily be extended to maintain a LivePhysRegs tracker. The good thing is that we can trust the MBB live-in list for CC.

Talking about foldMemoryOperandImpl, it's a bit weird that there are three different target hooks to fold a load: optimizeLoadInstr, foldMemoryOperandImpl with FrameIndex, and foldMemoryOperandImpl with LoadMI.

I tried your suggestion to have optimizeLoadInstr() call foldMemoryOperand(), which in turn calls foldMemoryOperandImpl(). This was NFC on SPEC, and the only test changes I saw were that anyregcc.ll and stackmap.ll on SystemZ now fail. I am not sure about these, but it seems that foldMemoryOperand() optimizes these in foldPatchPoint, with the result that some of the loads are folded into those instructions. I have not updated those tests yet, as I am not sure if this is correct / desired on SystemZ (?).

Since foldMemoryOperand() sets the memoperands on the returned MI, it wouldn't work to call foldMemoryOperandImpl() directly from optimizeLoadInstr(). So the alternatives are to do these foldings in optimizeLoadInstr() as in HEAD^, or to do it this way, possibly bailing out on STACKMAP etc. in optimizeLoadInstr() if that is not desired.

It seems slightly more extensible to have foldMemoryOperandImpl() do it, but maybe it doesn't really matter.

X86 also implements an unfoldMemoryOperand callback. I was wondering if this would allow reassociation after creating reg/mem instruction, but it looks like MachineCombiner never even uses that callback. Still, would we see any benefit from implementing it? (If it's unrelated to the combiner, that is of course a separate topic.)

It seems that LICM can do this to hoist the memory access out of the loop, but that will also make the register live across the loop, so I am not sure how useful it would be. I haven't tried it.

@uweigand
Member

I agree that now that the general reg/reg scheme doesn't seem worthwhile, it may not be worth using the CC-pseudos and the extra pass to convert them back. I tried to do the CC check in optimizeLoadInstr() instead of using CC-pseudos. Performance-wise I see no difference, and doing it this way only misses a few foldings, which seem acceptable:

OK, that looks good.

The backwards search for a CC def should be OK compile-time-wise, since many instructions clobber CC, but if an issue emerges, the peephole optimizer could probably easily be extended to maintain a LivePhysRegs tracker. The good thing is that we can trust the MBB live-in list for CC.

Also, if we use the approach below to have optimizeLoadInstr call foldMemoryOperand, then we already get the CC handling in foldMemoryOperandImpl, so we wouldn't have to duplicate the implementation.

I tried your suggestion to have optimizeLoadInstr() call foldMemoryOperand(), which in turn calls foldMemoryOperandImpl(). This was NFC on SPEC, and the only test changes I saw were that anyregcc.ll and stackmap.ll on SystemZ now fail. I am not sure about these, but it seems that foldMemoryOperand() optimizes these in foldPatchPoint, with the result that some of the loads are folded into those instructions. I have not updated those tests yet, as I am not sure if this is correct / desired on SystemZ (?).

I don't know what the specific differences are, but in general if this is OK on X86 it should be OK on SystemZ too.

Since foldMemoryOperand() sets the memoperands on the returned MI, it wouldn't work to call foldMemoryOperandImpl() directly from optimizeLoadInstr().

I agree. We should call foldMemoryOperand, basically the same as X86 does.

X86 also implements an unfoldMemoryOperand callback. I was wondering if this would allow reassociation after creating reg/mem instruction, but it looks like MachineCombiner never even uses that callback. Still, would we see any benefit from implementing it? (If it's unrelated to the combiner, that is of course a separate topic.)

It seems that LICM can do this to hoist the memory access out of the loop, but that will also make the register live across the loop, so I am not sure how useful it would be. I haven't tried it.

OK, that's something we can maybe look at some later time, it isn't really related to this patch.

@JonPsson1
Contributor Author

Updated tests: anyregcc.ll and stackmap.ll.

Also, if we use the approach below to have optimizeLoadInstr call foldMemoryOperand, then we already get the CC handling in foldMemoryOperandImpl, so wouldn't have to duplicate the implementation.

I think we need two different implementations: a quick one if LiveIntervals is available (the "FI version"), and one that does the scan (the "MI version"). Even though the "MI version" has an LIS argument, it is optional and null here, as the code is still in SSA form.

Member

@uweigand uweigand left a comment

This looks nearly ready to merge now, but see inline comments in particular about the assert. Thanks!

Register FoldAsLoadDefReg = LoadMI.getOperand(0).getReg();
// We don't really need Ops, but do a sanity check:
assert(Ops.size() == 1 && FoldAsLoadDefReg == MI.getOperand(Ops[0]).getReg() &&
"Expected MI to have the only use of the load.");
Member

I don't think it is correct to use an assertion here. Common code might call us with more than one element in Ops, e.g. if there's a load into a register that is used as both input register operands of the add. (This won't happen in the call from optimizeLoadInstr - but common code may call this foldMemoryOperandImpl for other reasons.) We should just return nullptr in this case.

// opcodes. After that, as many loads as possible are now folded.
// TODO: This may be beneficial with other opcodes as well as machine-sink
// can move loads close to their user in a different MBB, which the isel
// matcher did not see.
Member

I agree this could be more generic, using the instruction maps just like the other foldMemoryOperandImpl. But that can be done later.


int UseOpIdx = MI.findRegisterUseOperandIdx(FoldAsLoadDefReg);
assert(UseOpIdx != -1 && "Expected FoldAsLoadDefReg to be used by MI.");
return foldMemoryOperand(MI, {((unsigned) UseOpIdx)}, *DefMI);
Member

The X86 version clears FoldAsLoadDefReg if successful. Not sure why, since common code doesn't appear to rely on that. But I think we should be consistent here.

@JonPsson1 JonPsson1 merged commit 6c32a1f into llvm:main Apr 30, 2024
2 of 4 checks passed
@JonPsson1 JonPsson1 deleted the MachineComb_patch branch April 30, 2024 15:10