[SystemZ] Enable MachineCombiner for FP reassociation #83546
Conversation
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)

Changes: This patch is still slightly unfinished/experimental. It has the option of using either the PPC patterns or alternative new patterns. Some feedback would be welcome at this point (see below). Main points:
I have experimented with figuring out which pattern combinations would give the best improvement, measured by the decrease of the maximum depth of any instruction in the MBB as reported by MachineTraceMetrics. Just the common-code binary patterns give a total static improvement of -10294. Adding different FMA patterns gave some further improvements:
Looking at these numbers, it seemed like the pattern combination {fma+fma, fma+add, add+fma} was the most promising, and that the PPC patterns {fma+fma+fma, fma+fma+add} achieved only about half of that. However, even though these different settings of FMA patterns give further static improvements, there doesn't seem to be any further performance improvement on SPEC for any of them. With the experimental reassoc-additions pass (right column) a bit more depth improvement resulted, but even with this there is no noticeable performance win.

A bit surprisingly, I instead saw a 13% improvement on imagick without the MachineCombiner, just because of the global reg/mem folding I tried. Then with the MachineCombiner using only common-code patterns for binops, lbm is improved ~20%, and cactus ~2%.

My conclusions are that the MachineCombiner is worth enabling, and even though there is no noticeable performance improvement on SPEC it seems good to have some basic FMA patterns, either like -z-fma or -ppc-fma. The simpler solution is probably to reuse the 2 PPC patterns, while the z-fma ones give better max-depth improvement, which may or may not be valuable sometimes. Compile time seems to be the same. The experiments and results above have included a minor fix in MachineCombiner (#82025), which hopefully can be committed but is probably not vital.

The full experimental patch: Patch is 116.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/83546.diff

27 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/MachineCombinerPattern.h b/llvm/include/llvm/CodeGen/MachineCombinerPattern.h
index 89eed7463bd783..c0b3927450cb69 100644
--- a/llvm/include/llvm/CodeGen/MachineCombinerPattern.h
+++ b/llvm/include/llvm/CodeGen/MachineCombinerPattern.h
@@ -176,6 +176,15 @@ enum class MachineCombinerPattern {
FMSUB,
FNMSUB,
+ // SystemZ patterns. (EXPERIMENTAL)
+ FMA2_P1P0,
+ FMA2_P0P1,
+ FMA2,
+ FMA1_Add_L,
+ FMA1_Add_R,
+ FMA3, // These are inspired by PPC
+ FMA2_Add, //
+
// X86 VNNI
DPWSSD,
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index e7787aafb98e2d..dc48056c3d9bcd 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -1698,7 +1698,7 @@ class TargetInstrInfo : public MCInstrInfo {
/// instruction that defines FoldAsLoadDefReg, and the function returns
/// the machine instruction generated due to folding.
virtual MachineInstr *optimizeLoadInstr(MachineInstr &MI,
- const MachineRegisterInfo *MRI,
+ MachineRegisterInfo *MRI,
Register &FoldAsLoadDefReg,
MachineInstr *&DefMI) const {
return nullptr;
diff --git a/llvm/lib/CodeGen/MachineCombiner.cpp b/llvm/lib/CodeGen/MachineCombiner.cpp
index c65937935ed820..577331441eef28 100644
--- a/llvm/lib/CodeGen/MachineCombiner.cpp
+++ b/llvm/lib/CodeGen/MachineCombiner.cpp
@@ -155,9 +155,7 @@ MachineCombiner::getOperandDef(const MachineOperand &MO) {
// We need a virtual register definition.
if (MO.isReg() && MO.getReg().isVirtual())
DefInstr = MRI->getUniqueVRegDef(MO.getReg());
- // PHI's have no depth etc.
- if (DefInstr && DefInstr->isPHI())
- DefInstr = nullptr;
+ // (PATCH PROPOSED for PHIs: https://github.com/llvm/llvm-project/pull/82025)
return DefInstr;
}
@@ -317,6 +315,13 @@ static CombinerObjective getCombinerObjective(MachineCombinerPattern P) {
case MachineCombinerPattern::FMADD_XA:
case MachineCombinerPattern::FMSUB:
case MachineCombinerPattern::FNMSUB:
+ case MachineCombinerPattern::FMA2_P1P0:
+ case MachineCombinerPattern::FMA2_P0P1:
+ case MachineCombinerPattern::FMA2:
+ case MachineCombinerPattern::FMA1_Add_L:
+ case MachineCombinerPattern::FMA1_Add_R:
+ case MachineCombinerPattern::FMA3:
+ case MachineCombinerPattern::FMA2_Add:
return CombinerObjective::MustReduceDepth;
case MachineCombinerPattern::REASSOC_XY_BCA:
case MachineCombinerPattern::REASSOC_XY_BAC:
diff --git a/llvm/lib/Target/SystemZ/CMakeLists.txt b/llvm/lib/Target/SystemZ/CMakeLists.txt
index 0614e07bde8ac1..235df220729447 100644
--- a/llvm/lib/Target/SystemZ/CMakeLists.txt
+++ b/llvm/lib/Target/SystemZ/CMakeLists.txt
@@ -20,6 +20,7 @@ add_llvm_target(SystemZCodeGen
SystemZConstantPoolValue.cpp
SystemZCopyPhysRegs.cpp
SystemZElimCompare.cpp
+ SystemZFinalizeReassociation.cpp
SystemZFrameLowering.cpp
SystemZHazardRecognizer.cpp
SystemZISelDAGToDAG.cpp
diff --git a/llvm/lib/Target/SystemZ/SystemZ.h b/llvm/lib/Target/SystemZ/SystemZ.h
index d7aa9e4e18cbbb..49a200babfff57 100644
--- a/llvm/lib/Target/SystemZ/SystemZ.h
+++ b/llvm/lib/Target/SystemZ/SystemZ.h
@@ -195,12 +195,14 @@ FunctionPass *createSystemZShortenInstPass(SystemZTargetMachine &TM);
FunctionPass *createSystemZLongBranchPass(SystemZTargetMachine &TM);
FunctionPass *createSystemZLDCleanupPass(SystemZTargetMachine &TM);
FunctionPass *createSystemZCopyPhysRegsPass(SystemZTargetMachine &TM);
+FunctionPass *createSystemZFinalizeReassociationPass(SystemZTargetMachine &TM);
FunctionPass *createSystemZPostRewritePass(SystemZTargetMachine &TM);
FunctionPass *createSystemZTDCPass();
void initializeSystemZCopyPhysRegsPass(PassRegistry &);
void initializeSystemZDAGToDAGISelPass(PassRegistry &);
void initializeSystemZElimComparePass(PassRegistry &);
+void initializeSystemZFinalizeReassociationPass(PassRegistry &);
void initializeSystemZLDCleanupPass(PassRegistry &);
void initializeSystemZLongBranchPass(PassRegistry &);
void initializeSystemZPostRewritePass(PassRegistry &);
diff --git a/llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp b/llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp
new file mode 100644
index 00000000000000..d441b8bbc87381
--- /dev/null
+++ b/llvm/lib/Target/SystemZ/SystemZFinalizeReassociation.cpp
@@ -0,0 +1,127 @@
+//===---- SystemZFinalizeReassociation.cpp - Finalize FP reassociation ----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass is the last step of the process of enabling reassociation with
+// the MachineCombiner. These are the steps involved:
+//
+// 1. Instruction selection: Disable reg/mem folding for any operations that
+// are reassociable since MachineCombiner will not succeed
+// otherwise. Instead select a reg/reg pseudo that pretends to clobber CC.
+//
+// 2. MachineCombiner: Performs reassociation with the reg/reg instructions.
+//
+// 3. PeepholeOptimizer: fold loads into reg/mem instructions after
+// reassociation. The reg/mem opcode sets CC which is why the special
+// reg/reg pseudo is needed.
+//
+// 4. Convert any remaining pseudos into the target opcodes that do not
+// clobber CC (this pass).
+//
+//===----------------------------------------------------------------------===//
+
+#include "SystemZMachineFunctionInfo.h"
+#include "SystemZTargetMachine.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/TargetInstrInfo.h"
+#include "llvm/CodeGen/TargetRegisterInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+namespace {
+
+class SystemZFinalizeReassociation : public MachineFunctionPass {
+public:
+ static char ID;
+ SystemZFinalizeReassociation()
+ : MachineFunctionPass(ID), TII(nullptr), MRI(nullptr) {
+ initializeSystemZFinalizeReassociationPass(*PassRegistry::getPassRegistry());
+ }
+
+ bool runOnMachineFunction(MachineFunction &MF) override;
+ void getAnalysisUsage(AnalysisUsage &AU) const override;
+
+private:
+
+ bool visitMBB(MachineBasicBlock &MBB);
+
+ const SystemZInstrInfo *TII;
+ MachineRegisterInfo *MRI;
+};
+
+char SystemZFinalizeReassociation::ID = 0;
+
+} // end anonymous namespace
+
+INITIALIZE_PASS(SystemZFinalizeReassociation, "systemz-finalize-reassoc",
+ "SystemZ Finalize Reassociation", false, false)
+
+FunctionPass *llvm::
+createSystemZFinalizeReassociationPass(SystemZTargetMachine &TM) {
+ return new SystemZFinalizeReassociation();
+}
+
+void SystemZFinalizeReassociation::getAnalysisUsage(AnalysisUsage &AU) const {
+ AU.setPreservesCFG();
+ MachineFunctionPass::getAnalysisUsage(AU);
+}
+
+bool SystemZFinalizeReassociation::visitMBB(MachineBasicBlock &MBB) {
+ bool Changed = false;
+ for (MachineInstr &MI : llvm::make_early_inc_range(MBB)) {
+ unsigned PseudoOpcode = MI.getOpcode();
+ unsigned TargetOpcode =
+ PseudoOpcode == SystemZ::WFADB_CCPseudo ? SystemZ::WFADB
+ : PseudoOpcode == SystemZ::WFASB_CCPseudo ? SystemZ::WFASB
+ : PseudoOpcode == SystemZ::WFSDB_CCPseudo ? SystemZ::WFSDB
+ : PseudoOpcode == SystemZ::WFSSB_CCPseudo ? SystemZ::WFSSB
+ : PseudoOpcode == SystemZ::WFMDB_CCPseudo ? SystemZ::WFMDB
+ : PseudoOpcode == SystemZ::WFMSB_CCPseudo ? SystemZ::WFMSB
+ : PseudoOpcode == SystemZ::WFMADB_CCPseudo ? SystemZ::WFMADB
+ : PseudoOpcode == SystemZ::WFMASB_CCPseudo ? SystemZ::WFMASB
+ : 0;
+ if (TargetOpcode) {
+ // PeepholeOptimizer will not fold any loads across basic blocks, which
+ // however seems beneficial, so do it here:
+ bool Folded = false;
+ for (unsigned Op = 1; Op <= 2; ++Op) {
+ Register Reg = MI.getOperand(Op).getReg();
+ if (MachineInstr *DefMI = MRI->getVRegDef(Reg))
+ if (TII->optimizeLoadInstr(MI, MRI, Reg, DefMI)) {
+ MI.eraseFromParent();
+ DefMI->eraseFromParent();
+ MRI->markUsesInDebugValueAsUndef(Reg);
+ Folded = true;
+ break;
+ }
+ }
+
+ if (!Folded) {
+ MI.setDesc(TII->get(TargetOpcode));
+ int CCIdx = MI.findRegisterDefOperandIdx(SystemZ::CC);
+ MI.removeOperand(CCIdx);
+ }
+ Changed = true;
+ }
+ }
+ return Changed;
+}
+
+bool SystemZFinalizeReassociation::runOnMachineFunction(MachineFunction &F) {
+ TII = F.getSubtarget<SystemZSubtarget>().getInstrInfo();
+ MRI = &F.getRegInfo();
+
+ bool Modified = false;
+ for (auto &MBB : F)
+ Modified |= visitMBB(MBB);
+
+ return Modified;
+}
diff --git a/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp b/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
index 815eca1240d827..652980d3c2c0df 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
@@ -347,6 +347,13 @@ class SystemZDAGToDAGISel : public SelectionDAGISel {
// Try to expand a boolean SELECT_CCMASK using an IPM sequence.
SDValue expandSelectBoolean(SDNode *Node);
+ // Return true if the flags of N and the subtarget allows for reassociation.
+ bool isReassociable(SDNode *N) const {
+ return N->getFlags().hasAllowReassociation() &&
+ N->getFlags().hasNoSignedZeros() &&
+ Subtarget->hasVector();
+ }
+
public:
static char ID;
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrFP.td b/llvm/lib/Target/SystemZ/SystemZInstrFP.td
index 6e67425c1e788b..5b383b3fa6ae88 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrFP.td
+++ b/llvm/lib/Target/SystemZ/SystemZInstrFP.td
@@ -430,8 +430,10 @@ let Uses = [FPC], mayRaiseFPException = 1,
def ADBR : BinaryRRE<"adbr", 0xB31A, any_fadd, FP64, FP64>;
def AXBR : BinaryRRE<"axbr", 0xB34A, any_fadd, FP128, FP128>;
}
- defm AEB : BinaryRXEAndPseudo<"aeb", 0xED0A, any_fadd, FP32, load, 4>;
- defm ADB : BinaryRXEAndPseudo<"adb", 0xED1A, any_fadd, FP64, load, 8>;
+ defm AEB : BinaryRXEAndPseudo<"aeb", 0xED0A, z_any_fadd_noreassoc, FP32,
+ load, 4>;
+ defm ADB : BinaryRXEAndPseudo<"adb", 0xED1A, z_any_fadd_noreassoc, FP64,
+ load, 8>;
}
// Subtraction.
@@ -441,8 +443,10 @@ let Uses = [FPC], mayRaiseFPException = 1,
def SDBR : BinaryRRE<"sdbr", 0xB31B, any_fsub, FP64, FP64>;
def SXBR : BinaryRRE<"sxbr", 0xB34B, any_fsub, FP128, FP128>;
- defm SEB : BinaryRXEAndPseudo<"seb", 0xED0B, any_fsub, FP32, load, 4>;
- defm SDB : BinaryRXEAndPseudo<"sdb", 0xED1B, any_fsub, FP64, load, 8>;
+ defm SEB : BinaryRXEAndPseudo<"seb", 0xED0B, z_any_fsub_noreassoc, FP32,
+ load, 4>;
+ defm SDB : BinaryRXEAndPseudo<"sdb", 0xED1B, z_any_fsub_noreassoc, FP64,
+ load, 8>;
}
// Multiplication.
@@ -452,8 +456,10 @@ let Uses = [FPC], mayRaiseFPException = 1 in {
def MDBR : BinaryRRE<"mdbr", 0xB31C, any_fmul, FP64, FP64>;
def MXBR : BinaryRRE<"mxbr", 0xB34C, any_fmul, FP128, FP128>;
}
- defm MEEB : BinaryRXEAndPseudo<"meeb", 0xED17, any_fmul, FP32, load, 4>;
- defm MDB : BinaryRXEAndPseudo<"mdb", 0xED1C, any_fmul, FP64, load, 8>;
+ defm MEEB : BinaryRXEAndPseudo<"meeb", 0xED17, z_any_fmul_noreassoc, FP32,
+ load, 4>;
+ defm MDB : BinaryRXEAndPseudo<"mdb", 0xED1C, z_any_fmul_noreassoc, FP64,
+ load, 8>;
}
// f64 multiplication of two FP32 registers.
@@ -495,8 +501,10 @@ let Uses = [FPC], mayRaiseFPException = 1 in {
def MAEBR : TernaryRRD<"maebr", 0xB30E, z_any_fma, FP32, FP32>;
def MADBR : TernaryRRD<"madbr", 0xB31E, z_any_fma, FP64, FP64>;
- defm MAEB : TernaryRXFAndPseudo<"maeb", 0xED0E, z_any_fma, FP32, FP32, load, 4>;
- defm MADB : TernaryRXFAndPseudo<"madb", 0xED1E, z_any_fma, FP64, FP64, load, 8>;
+ defm MAEB : TernaryRXFAndPseudo<"maeb", 0xED0E, z_any_fma_noreassoc, FP32,
+ FP32, load, 4>;
+ defm MADB : TernaryRXFAndPseudo<"madb", 0xED1E, z_any_fma_noreassoc, FP64,
+ FP64, load, 8>;
}
// Fused multiply-subtract.
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrFormats.td b/llvm/lib/Target/SystemZ/SystemZInstrFormats.td
index bb9fa0fc33ffa0..9fdda0a5d60031 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrFormats.td
+++ b/llvm/lib/Target/SystemZ/SystemZInstrFormats.td
@@ -5536,3 +5536,33 @@ multiclass StringRRE<string mnemonic, bits<16> opcode,
[(set GR64:$end, (operator GR64:$start1, GR64:$start2,
GR32:$char))]>;
}
+
+multiclass BinaryVRRcAndCCPseudo<string mnemonic, bits<16> opcode,
+ SDPatternOperator operator,
+ SDPatternOperator reassoc_operator,
+ TypedReg tr1, TypedReg tr2, bits<4> type = 0,
+ bits<4> m5 = 0, bits<4> m6 = 0,
+ string fp_mnemonic = ""> {
+ def "" : BinaryVRRc<mnemonic, opcode, operator, tr1, tr2, type, m5, m6,
+ fp_mnemonic>;
+ let Defs = [CC], AddedComplexity = 1 in // Win over "".
+ def _CCPseudo : Pseudo<(outs tr1.op:$V1), (ins tr2.op:$V2, tr2.op:$V3),
+ [(set (tr1.vt tr1.op:$V1),
+ (reassoc_operator (tr2.vt tr2.op:$V2),
+ (tr2.vt tr2.op:$V3)))]>;
+}
+
+multiclass TernaryVRReAndCCPseudo<string mnemonic, bits<16> opcode,
+ SDPatternOperator operator,
+ SDPatternOperator reassoc_operator,
+ TypedReg tr1, TypedReg tr2, bits<4> m5 = 0,
+ bits<4> type = 0, string fp_mnemonic = ""> {
+ def "" : TernaryVRRe<mnemonic, opcode, operator, tr1, tr2, m5, type,
+ fp_mnemonic>;
+ let Defs = [CC], AddedComplexity = 1 in // Win over "".
+ def _CCPseudo : Pseudo<(outs tr1.op:$V1),
+ (ins tr2.op:$V2, tr2.op:$V3, tr1.op:$V4),
+ [(set (tr1.vt tr1.op:$V1), (reassoc_operator (tr2.vt tr2.op:$V2),
+ (tr2.vt tr2.op:$V3),
+ (tr1.vt tr1.op:$V4)))]>;
+}
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 046a12208467b4..1d8db507e30f14 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -21,6 +21,7 @@
#include "llvm/CodeGen/LivePhysRegs.h"
#include "llvm/CodeGen/LiveVariables.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineCombinerPattern.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
@@ -610,6 +611,112 @@ void SystemZInstrInfo::insertSelect(MachineBasicBlock &MBB,
.addImm(CCValid).addImm(CCMask);
}
+static void transferDeadCC(MachineInstr *OldMI, MachineInstr *NewMI) {
+ if (OldMI->registerDefIsDead(SystemZ::CC)) {
+ MachineOperand *CCDef = NewMI->findRegisterDefOperand(SystemZ::CC);
+ if (CCDef != nullptr)
+ CCDef->setIsDead(true);
+ }
+}
+
+void SystemZInstrInfo::transferMIFlag(MachineInstr *OldMI, MachineInstr *NewMI,
+ MachineInstr::MIFlag Flag) const {
+ if (OldMI->getFlag(Flag))
+ NewMI->setFlag(Flag);
+}
+
+static cl::opt<unsigned> MaxUsersGlobalFold("maxusersglobfold", cl::init(7));
+
+MachineInstr *SystemZInstrInfo::optimizeLoadInstr(MachineInstr &MI,
+ MachineRegisterInfo *MRI,
+ Register &FoldAsLoadDefReg,
+ MachineInstr *&DefMI) const {
+ const TargetRegisterInfo *TRI = MRI->getTargetRegisterInfo();
+
+ // Check whether we can move the DefMI load, and that it only has one use.
+ DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
+ assert(DefMI);
+ bool SawStore = false;
+ if (!DefMI->isSafeToMove(nullptr, SawStore) ||
+ !MRI->hasOneNonDBGUse(FoldAsLoadDefReg))
+ return nullptr;
+
+ // For reassociable FP operations, any loads have been purposefully left
+ // unfolded so that MachineCombiner can do its work on reg/reg
+ // opcodes. After that, as many loads as possible are now folded.
+ unsigned LoadOpcode = 0;
+ unsigned RegMemOpcode = 0;
+ const TargetRegisterClass *FPRC = nullptr;
+ RegMemOpcode = MI.getOpcode() == SystemZ::WFADB_CCPseudo ? SystemZ::ADB
+ : MI.getOpcode() == SystemZ::WFSDB_CCPseudo ? SystemZ::SDB
+ : MI.getOpcode() == SystemZ::WFMDB_CCPseudo ? SystemZ::MDB
+ : MI.getOpcode() == SystemZ::WFMADB_CCPseudo ? SystemZ::MADB
+ : 0;
+ if (RegMemOpcode) {
+ LoadOpcode = SystemZ::VL64;
+ FPRC = &SystemZ::FP64BitRegClass;
+ } else {
+ RegMemOpcode = MI.getOpcode() == SystemZ::WFASB_CCPseudo ? SystemZ::AEB
+ : MI.getOpcode() == SystemZ::WFSSB_CCPseudo ? SystemZ::SEB
+ : MI.getOpcode() == SystemZ::WFMSB_CCPseudo ? SystemZ::MEEB
+ : MI.getOpcode() == SystemZ::WFMASB_CCPseudo ? SystemZ::MAEB
+ : 0;
+ if (RegMemOpcode) {
+ LoadOpcode = SystemZ::VL32;
+ FPRC = &SystemZ::FP32BitRegClass;
+ }
+ }
+ if (!RegMemOpcode || DefMI->getOpcode() != LoadOpcode)
+ return nullptr;
+
+ DebugLoc DL = MI.getDebugLoc();
+ Register DstReg = MI.getOperand(0).getReg();
+ MachineOperand LHS = MI.getOperand(1);
+ MachineOperand RHS = MI.getOperand(2);
+ bool IsTernary =
+ (RegMemOpcode == SystemZ::MADB || RegMemOpcode == SystemZ::MAEB);
+ MachineOperand *AccMO = IsTernary ? &MI.getOperand(3) : nullptr;
+ if ((RegMemOpcode == SystemZ::SDB || RegMemOpcode == SystemZ::SEB) &&
+ FoldAsLoadDefReg != RHS.getReg())
+ return nullptr;
+ if (IsTernary && FoldAsLoadDefReg == AccMO->getReg())
+ return nullptr;
+ MachineOperand &RegMO = RHS.getReg() == FoldAsLoadDefReg ? LHS : RHS;
+ // It seems that folding globally into 2-address with too many users of reg
+ // is detrimental (resulting in reg copys before each use).
+ if (!MRI->hasOneNonDBGUse(RegMO.getReg()) && !RegMO.isKill() &&
+ DefMI->getParent() != MI.getParent()) {
+ unsigned NumUsers = 0;
+ for (MachineInstr &UI : MRI->use_nodbg_instructions(RegMO.getReg())) {
+ NumUsers++; (void)UI;
+ }
+ if (NumUsers > MaxUsersGlobalFold)
+ return nullptr;
+ }
+
+ MachineOperand &Base = DefMI->getOperand(1);
+ MachineOperand &Disp = DefMI->getOperand(2);
+ MachineOperand &Indx = DefMI->getOperand(3);
+ if (Base.isReg()) // Could be a FrameIndex.
+ Base.setIsKill(false);
+ Indx.setIsKill(false);
+
+ MachineInstrBuilder MIB =
+ BuildMI(*MI.getParent(), MI, DL, get(RegMemOpcode), DstReg);
+ if (IsTernary)
+ MIB.add(*AccMO);
+ MIB.add(RegMO);
+ MIB.add(Base).add(Disp).add(Indx);
+ MIB.addMemOperand(*DefMI->memoperands_begin());
+ transferMIFlag(&MI, MIB, MachineInstr::NoFPExcept);
+ MIB->addRegisterDead(SystemZ::CC, TRI);
+ MRI->setRegClass(DstReg, FPRC);
+ if (IsTernary)
+ MRI->setRegClass(AccMO->getReg(), FPRC);
+ MRI->setRegClass(RegMO.getReg(), FPRC);
+ return MIB;
+}
+
bool SystemZInstrInfo::foldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
Register Reg,
MachineRegisterInfo *MRI) const {
@@ -937,20 +1044,6 @@ static LogicOp interpretAndImmediate(unsigned Opcode) {
}
}
-static void transferDeadCC(MachineInstr *OldMI, MachineInstr *NewMI) {
- if (OldMI->registerDefIsDead(SystemZ::CC)) {
- MachineOperand *CCDef = NewMI->findRegisterDefOperand(SystemZ::CC);
- ...
[truncated]
You can test this locally with the following command:

git-clang-format --diff f73e87f53f5d8a86c29251dedc9dbd264179203a add84038dd93f7944aa8d4e9f30547d77912c3f6 -- llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp llvm/lib/Target/SystemZ/SystemZInstrInfo.h llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp

View the diff from clang-format here:

diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 2b61cff727..d1d2a68a32 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -610,10 +610,9 @@ void SystemZInstrInfo::insertSelect(MachineBasicBlock &MBB,
.addImm(CCValid).addImm(CCMask);
}
-MachineInstr *SystemZInstrInfo::optimizeLoadInstr(MachineInstr &MI,
- const MachineRegisterInfo *MRI,
- Register &FoldAsLoadDefReg,
- MachineInstr *&DefMI) const {
+MachineInstr *SystemZInstrInfo::optimizeLoadInstr(
+ MachineInstr &MI, const MachineRegisterInfo *MRI,
+ Register &FoldAsLoadDefReg, MachineInstr *&DefMI) const {
// Check whether we can move the DefMI load, and that it only has one use.
DefMI = MRI->getVRegDef(FoldAsLoadDefReg);
assert(DefMI);
@@ -1461,8 +1460,7 @@ MachineInstr *SystemZInstrInfo::foldMemoryOperandImpl(
if (get(RegMemOpcode).hasImplicitDefOfPhysReg(SystemZ::CC)) {
assert(LoadMI.getParent() == MI.getParent() && "Assuming a local fold.");
assert(LoadMI != InsertPt && "Assuming InsertPt not to be first in MBB.");
- for (MachineBasicBlock::iterator MII = std::prev(InsertPt);;
- --MII) {
+ for (MachineBasicBlock::iterator MII = std::prev(InsertPt);; --MII) {
if (MII->definesRegister(SystemZ::CC, /*TRI=*/nullptr)) {
if (!MII->registerDefIsDead(SystemZ::CC, /*TRI=*/nullptr))
return nullptr;
@@ -1490,12 +1488,12 @@ MachineInstr *SystemZInstrInfo::foldMemoryOperandImpl(
MachineOperand &Base = LoadMI.getOperand(1);
MachineOperand &Disp = LoadMI.getOperand(2);
MachineOperand &Indx = LoadMI.getOperand(3);
- MachineInstrBuilder MIB =
- BuildMI(*MI.getParent(), InsertPt, MI.getDebugLoc(), get(RegMemOpcode), DstReg)
- .add(RegMO)
- .add(Base)
- .add(Disp)
- .add(Indx);
+ MachineInstrBuilder MIB = BuildMI(*MI.getParent(), InsertPt, MI.getDebugLoc(),
+ get(RegMemOpcode), DstReg)
+ .add(RegMO)
+ .add(Base)
+ .add(Disp)
+ .add(Indx);
MIB->addRegisterDead(SystemZ::CC, &RI);
MRI->setRegClass(DstReg, FPRC);
MRI->setRegClass(RegMO.getReg(), FPRC);
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp b/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
index dced64d6b2..14675b049e 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
@@ -30,10 +30,10 @@
using namespace llvm;
-static cl::opt<bool> EnableMachineCombinerPass(
- "systemz-machine-combiner",
- cl::desc("Enable the machine combiner pass"),
- cl::init(true), cl::Hidden);
+static cl::opt<bool>
+ EnableMachineCombinerPass("systemz-machine-combiner",
+ cl::desc("Enable the machine combiner pass"),
+ cl::init(true), cl::Hidden);
// NOLINTNEXTLINE(readability-identifier-naming)
extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeSystemZTarget() {
I haven't looked into the patch in detail yet, but this is interesting. Should we then always do the reg/mem folding late (for all FP operations, including non-reassociable ones), if it is generally beneficial? This might actually even simplify the code (and could be done as a separate PR before the MachineCombiner reassociation patch?).
Force-pushed fab2c85 to ac44853.
Patch rebased.
FMA: Choose a strategy (PPC or Z -- see previous comment from March 1st), and maybe add some tests for 'float' as well. Alternatively, skip FMA patterns for now until they prove themselves on benchmarks? After this, the patch could be simplified/refactored a bit further. Should we also fold 20-bit offset addresses in optimizeLoadInstr()? That is still an extra instruction compared to a separate load - LAY here, while ISel uses other forms of immediate-add, e.g. AGHI. I suppose this should be done, but I have Sidenotes:
Force-pushed ac44853 to 6f18d91.
Patch finalised for review. Holding off on the FMA patterns as they so far give no performance improvement, and optimizeLoadInstr() has been simplified to only fold 12-bit displacements.
A couple more general issues that I think we should consider:
Not saying we need to implement any of this right now, but I think we should at least have an opinion on what the best direction is going forward.
Force-pushed 6f18d91 to e9e7ff3.
I agree that now that the general reg/reg scheme doesn't seem worthwhile, it may not be worth using the CC-pseudos and the extra pass to convert them back. I tried to do the CC check in optimizeLoadInstr() instead of using CC-pseudos. Performance-wise I see no difference, and doing it this way only misses a few foldings, which seem acceptable:
The backwards search for the CC def should be OK compile-time-wise, as many instructions clobber CC; but if an issue emerges, the peephole optimizer could probably be extended easily to maintain a LivePhysRegs tracker. The good thing is that we can trust the MBB live-in list for CC.
I tried your suggestion to have optimizeLoadInstr() call foldMemoryOperand(), which in turn calls foldMemoryOperandImpl(). This was NFC on SPEC, and the only test changes I saw were that the SystemZ tests anyregcc.ll and stackmap.ll now fail. I am not sure about these, but it seems that foldMemoryOperand() optimizes them in foldPatchPoint, with the result that some of the loads are folded into those instructions. I have not updated those tests yet, as I am not sure whether this is correct/desired on SystemZ(?). Since foldMemoryOperand() sets the memoperands on the returned MI, it wouldn't work to call foldMemoryOperandImpl() directly from optimizeLoadInstr(). The alternatives then are to do these foldings in optimizeLoadInstr() as in HEAD^, or to do it this way, possibly stopping on STACKMAP etc. in optimizeLoadInstr() if that is not desired. It seems slightly more extensible to have foldMemoryOperandImpl() do it, but maybe it doesn't really matter.
It seems that LICM can do this to hoist the memory access out of the loop, but that will also make the register live across the loop, so I'm not sure how useful it would be. I haven't tried it.
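The backwards CC-def scan mentioned above can be sketched with a toy model (all names here are invented for illustration, not LLVM's API): walk from the insertion point toward the block start; the first instruction found that defines CC decides the question, and if none is found, the block's live-in list for CC decides.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a machine instruction's CC-def information.
struct ToyInstr {
  bool DefsCC;      // instruction clobbers CC
  bool CCDefIsDead; // ...and that def is marked dead (no later reader)
};

// Scan backwards from InsertPt. If the nearest CC def is dead, nothing
// between it and InsertPt reads CC, so clobbering CC here is safe. If we
// reach the top of the block, fall back to the CC live-in flag.
bool ccSafeToClobber(const std::vector<ToyInstr> &Block, int InsertPt,
                     bool CCLiveIn) {
  for (int I = InsertPt - 1; I >= 0; --I)
    if (Block[I].DefsCC)
      return Block[I].CCDefIsDead;
  return !CCLiveIn; // block start reached: trust the MBB live-in list
}
```

This also shows why the MBB live-in list matters: it is the only way to answer the question when no CC def exists between the block start and the fold point.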
OK, that looks good.
Also, if we use the approach below to have optimizeLoadInstr call foldMemoryOperand, then we already get the CC handling in foldMemoryOperandImpl, so we wouldn't have to duplicate the implementation.
I don't know what the specific differences are, but in general if this is OK on X86 it should be OK on SystemZ too.
I agree. We should call foldMemoryOperand, basically the same as X86 does.
OK, that's something we can maybe look at some later time, it isn't really related to this patch. |
Updated tests: anyregcc.ll and stackmap.ll.
I think we need two different implementations: one quick one if LiveIntervals is available (the "FI version"), and one that does the scan (the "MI version"). Even though there is the LIS argument to the "MI version", it is optional and null, as this is still in SSA form.
This looks nearly ready to merge now, but see inline comments in particular about the assert. Thanks!
Register FoldAsLoadDefReg = LoadMI.getOperand(0).getReg();
// We don't really need Ops, but do a sanity check:
assert(Ops.size() == 1 && FoldAsLoadDefReg == MI.getOperand(Ops[0]).getReg() &&
       "Expected MI to have the only use of the load.");
I don't think it is correct to use an assertion here. Common code might call us with more than one element in Ops; e.g. if there's load into a register that is used as both input register operands of the add. (This won't happen in the call from optimizeLoadInstr - but common code may call this foldMemoryOperandImpl for other reasons.) We should just return nullptr in this case.
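The suggested fix can be sketched with toy types (invented for illustration; these are not LLVM's classes): instead of asserting on the precondition, the fold helper rejects the call when common code passes more than one operand index, e.g. when the same loaded register feeds both inputs of an add.

```cpp
#include <vector>

// Toy stand-in for a machine instruction: the register used by each operand.
struct ToyMI {
  std::vector<unsigned> UseRegs;
};

// Returns the operand index that would be folded, or -1 when folding must be
// rejected (multiple fold candidates, or the load's register is not used at
// the given operand). The Ops.size() check replaces the assert.
int foldableOperand(const ToyMI &MI, const std::vector<unsigned> &Ops,
                    unsigned FoldAsLoadDefReg) {
  if (Ops.size() != 1)
    return -1; // was an assert; bail out gracefully instead
  unsigned Idx = Ops[0];
  if (Idx >= MI.UseRegs.size() || MI.UseRegs[Idx] != FoldAsLoadDefReg)
    return -1;
  return static_cast<int>(Idx);
}
```

The point of the sketch is only the control flow: a precondition that callers may legitimately violate should produce a "no fold" result, not an assertion failure.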
// opcodes. After that, as many loads as possible are now folded. | ||
// TODO: This may be beneficial with other opcodes as well as machine-sink | ||
// can move loads close to their user in a different MBB, which the isel | ||
// matcher did not see. |
I agree this could be more generic, using the instruction maps just like the other foldMemoryOperandImpl. But that can be done later.
int UseOpIdx = MI.findRegisterUseOperandIdx(FoldAsLoadDefReg);
assert(UseOpIdx != -1 && "Expected FoldAsLoadDefReg to be used by MI.");
return foldMemoryOperand(MI, {((unsigned) UseOpIdx)}, *DefMI);
The X86 version clears FoldAsLoadDefReg if successful. Not sure why, since common code doesn't appear to rely on that. But I think we should be consistent here.
Commits: "Remove RegMem parts", "Machinecombiner tests"
Force-pushed 09d50bd to add8403.
This patch is still slightly unfinished/experimental. It has the option of using either PPC or alternative new patterns. Some feedback would be nice at this point (see below).
Main points:
- MachineCombiner.
- Limited reg/mem folding across MBB boundaries, which gives a nice improvement on imagick (13%!).
I have experimented with figuring out which pattern combinations would give the best improvement, measured by the decrease of the maximum depth of any instruction in the MBB as reported by MachineTraceMetrics. Just the common-code binary patterns give a total static improvement of -10294. Adding different FMA patterns gave some further improvements:
Looking at these numbers, it seemed like the pattern combination {fma+fma, fma+add, add+fma} was the most promising, and that the PPC patterns {fma+fma+fma, fma+fma+add} achieved only about half of that.
However, even though these different settings of FMA patterns give further static improvements, there doesn't seem to be any further performance improvement on SPEC for any of them. With the experimental reassoc-additions pass (right column) a bit more depth improvement resulted, but even with this there is no noticeable performance win.
A bit surprisingly, I instead saw a 13% improvement on imagick without the MachineCombiner, just because of the global reg/mem folding I tried. Then with the MachineCombiner using only common-code patterns for binops, lbm is improved ~20%, and cactus ~2%.
My conclusions are that the MachineCombiner is worth enabling, and even though there is no noticeable performance improvement on SPEC it seems good to have some basic FMA patterns, either like -z-fma or -ppc-fma. The simpler solution is probably to reuse the 2 PPC patterns, while the z-fma ones give better max-depth improvement, which may or may not be valuable sometimes. Compile time seems to be the same.
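The depth metric the experiments optimize for can be illustrated with a tiny self-contained example (not LLVM code; the names are invented): reassociating a chain of dependent FP additions turns a linear dependency chain into a balanced tree, reducing the maximum dependency depth that MachineTraceMetrics would report.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>

// A toy expression tree: leaves are registers/loads (depth 0), inner nodes
// are FP operations whose depth is 1 + the deeper of their two inputs.
struct Expr {
  std::unique_ptr<Expr> L, R; // both null for a leaf
  int depth() const {
    if (!L && !R)
      return 0;
    int DL = L ? L->depth() : 0;
    int DR = R ? R->depth() : 0;
    return 1 + std::max(DL, DR);
  }
};

std::unique_ptr<Expr> leaf() { return std::make_unique<Expr>(); }

std::unique_ptr<Expr> add(std::unique_ptr<Expr> A, std::unique_ptr<Expr> B) {
  auto E = std::make_unique<Expr>();
  E->L = std::move(A);
  E->R = std::move(B);
  return E;
}
```

For four inputs, the serial form ((a+b)+c)+d has depth 3, while the reassociated form (a+b)+(c+d) has depth 2; the reported static improvement is the sum of such reductions over all instructions.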
The experiments and results per above have included a minor fix in MachineCombiner (#82025), which hopefully can be committed but is probably not vital.
The full experimental patch:
MachineComb.patch.tar.gz