-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Experimental patch for understanding the pre-RA MachineScheduler #136483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
JonPsson1
wants to merge
1
commit into
llvm:main
Choose a base branch
from
JonPsson1:MISched_stats
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-schedfile -sched-show-ints -regalloc-stats
You can test this locally with the following command:git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/include/llvm/CodeGen/MachineScheduler.h llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h llvm/include/llvm/CodeGen/SlotIndexes.h llvm/include/llvm/CodeGen/TargetRegisterInfo.h llvm/lib/CodeGen/MachineScheduler.cpp llvm/lib/CodeGen/RegAllocGreedy.cpp llvm/lib/CodeGen/RegAllocGreedy.h llvm/lib/CodeGen/TargetRegisterInfo.cpp llvm/lib/Target/SystemZ/SystemZRegisterInfo.h View the diff from clang-format here.diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index 8a94ac886..10fa1bc8e 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -427,7 +427,8 @@ class ScheduleDAGMILive : public ScheduleDAGMI {
// Show liveness visually of registers before and after scheduling. Enabled
// with -sched-show-ints and -misched-only-block=NUM.
- void showIntervals(std::string Msg, std::string I, MachineBasicBlock *MBB) const;
+ void showIntervals(std::string Msg, std::string I,
+ MachineBasicBlock *MBB) const;
protected:
RegisterClassInfo *RegClassInfo;
diff --git a/llvm/include/llvm/CodeGen/SlotIndexes.h b/llvm/include/llvm/CodeGen/SlotIndexes.h
index 6da9b858c..e4113ed93 100644
--- a/llvm/include/llvm/CodeGen/SlotIndexes.h
+++ b/llvm/include/llvm/CodeGen/SlotIndexes.h
@@ -100,8 +100,8 @@ class raw_ostream;
unsigned getIndex() const {
return listEntry()->getIndex() | getSlot();
}
- private:
+ private:
/// Returns the slot for this SlotIndex.
Slot getSlot() const {
return static_cast<Slot>(lie.getInt());
diff --git a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
index 90d5132d4..6dbaba5f5 100644
--- a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
@@ -962,6 +962,7 @@ public:
private:
mutable std::set<unsigned> PrioRegClasses;
void initializePrioRegClasses() const;
+
public:
bool isPrioRC(unsigned RegClassID) const;
bool isPrioVirtReg(Register Reg, const MachineRegisterInfo *MRI) const;
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 300a79a70..c6bb8e5b7 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -71,12 +71,12 @@
#include <vector>
#include "llvm/IR/Module.h"
-#include <sstream>
-#include <iomanip>
+#include <cstdio>
#include <fstream>
+#include <iomanip>
+#include <sstream>
#include <stdio.h>
#include <unistd.h>
-#include <cstdio>
using namespace llvm;
@@ -1644,14 +1644,13 @@ bool isDefedByImplicitDef(MachineInstr *UseMI, const LiveInterval &LI,
// Get number of live vregs at MI, summarized for FP and GPR regs.
static void getNumLive(MachineInstr &MI, LiveIntervals *LIS,
- MachineRegisterInfo &MRI,
- const TargetRegisterInfo *TRI,
- std::vector<Register> &HandledVRegs,
- unsigned &NumLiveFP, unsigned &NumLiveGPR) {
+ MachineRegisterInfo &MRI, const TargetRegisterInfo *TRI,
+ std::vector<Register> &HandledVRegs, unsigned &NumLiveFP,
+ unsigned &NumLiveGPR) {
MachineFunction *MF = MI.getParent()->getParent();
const TargetInstrInfo *TII = MF->getSubtarget().getInstrInfo();
- unsigned MIIdx = LIS->getSlotIndexes()->getInstructionIndex(MI)
- .getRegSlot().getIndex();
+ unsigned MIIdx =
+ LIS->getSlotIndexes()->getInstructionIndex(MI).getRegSlot().getIndex();
std::vector<unsigned> RCCounts(TRI->getNumRegClasses(), 0);
NumLiveFP = NumLiveGPR = 0;
@@ -1695,9 +1694,9 @@ static void getNumLive(MachineInstr &MI, LiveIntervals *LIS,
NumLiveGPR += (NumGPRLoHi - E) / 2;
}
-void ScheduleDAGMILive::countRegOverlaps(RegOverlaps &RO,
- MachineBasicBlock::iterator FirstItr,
- MachineBasicBlock::iterator EndItr) const {
+void ScheduleDAGMILive::countRegOverlaps(
+ RegOverlaps &RO, MachineBasicBlock::iterator FirstItr,
+ MachineBasicBlock::iterator EndItr) const {
std::vector<Register> HandledVRegs;
for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I)
HandledVRegs.push_back(I);
@@ -1744,8 +1743,7 @@ static void countProperties(Properties &P, ScheduleDAGMILive *DAG) {
if (FPDef) {
P.FPDefs++;
P.FPTotLat += SU->Latency;
- }
- else if (IsDef) {
+ } else if (IsDef) {
P.OtherDefs++;
P.OtherLat += SU->Latency;
}
@@ -1770,10 +1768,9 @@ std::string getRegionID(unsigned CurrRegionIdx, const MachineBasicBlock *MBB) {
const Function *F = &MBB->getParent()->getFunction();
std::string TmpStr;
raw_string_ostream OS(TmpStr);
- OS << F->getParent()->getModuleIdentifier() << "OOOOOOO"
- << F->getName() << "OOOOOOO"
- << "MBB" << MBB->getNumber()
- << "Region" << CurrRegionIdx;
+ OS << F->getParent()->getModuleIdentifier() << "OOOOOOO" << F->getName()
+ << "OOOOOOO"
+ << "MBB" << MBB->getNumber() << "Region" << CurrRegionIdx;
return OS.str();
}
@@ -1803,8 +1800,7 @@ void ScheduleDAGMILive::showIntervals(std::string Msg,
if ((pos = R.find("-")) != std::string::npos) {
Start = std::stoi(R.substr(0, pos));
End = std::stoi(R.substr(pos + 1, R.length() - 1));
- }
- else
+ } else
Start = End = std::stoi(R);
for (unsigned i = Start; i <= End; i++)
VRegsToShow.push_back(i);
@@ -1833,18 +1829,22 @@ void ScheduleDAGMILive::showIntervals(std::string Msg,
}
std::vector<unsigned> Colors;
- Colors.push_back(31); Colors.push_back(32); Colors.push_back(33);
- Colors.push_back(34); Colors.push_back(35); Colors.push_back(36);
+ Colors.push_back(31);
+ Colors.push_back(32);
+ Colors.push_back(33);
+ Colors.push_back(34);
+ Colors.push_back(35);
+ Colors.push_back(36);
Colors.push_back(37);
- std::map<MachineInstr*, unsigned> NodeNum;
+ std::map<MachineInstr *, unsigned> NodeNum;
for (unsigned Idx = 0, End = SUnits.size(); Idx != End; ++Idx) {
const SUnit *SU = &SUnits[Idx];
NodeNum[SU->getInstr()] = Idx;
}
// Print reg numbers vertically.
- for(unsigned i = 0; i < 5; i++) {
+ for (unsigned i = 0; i < 5; i++) {
for (auto Reg : VRegsToShow) {
const LiveInterval &LI = LIS->getInterval(Register::index2VirtReg(Reg));
if (LI.empty())
@@ -1859,8 +1859,8 @@ void ScheduleDAGMILive::showIntervals(std::string Msg,
}
for (auto &I : *MBB) {
- unsigned CurrIdx = LIS->getSlotIndexes()->getInstructionIndex(I)
- .getRegSlot().getIndex();
+ unsigned CurrIdx =
+ LIS->getSlotIndexes()->getInstructionIndex(I).getRegSlot().getIndex();
// Indicate liveness for each vreg.
for (auto Reg : VRegsToShow) {
const LiveInterval &LI = LIS->getInterval(Register::index2VirtReg(Reg));
@@ -1869,7 +1869,7 @@ void ScheduleDAGMILive::showIntervals(std::string Msg,
dbgs() << "\033[" << Colors[Reg % Colors.size()] << "m";
bool reads, writes;
std::tie(reads, writes) =
- I.readsWritesVirtualRegister(Register::index2VirtReg(Reg));
+ I.readsWritesVirtualRegister(Register::index2VirtReg(Reg));
if (writes)
dbgs() << "\xE2\x94\xA2";
else if (reads)
@@ -1909,11 +1909,11 @@ void ScheduleDAGMILive::showIntervals(std::string Msg,
// different policies, compare and decide per region and write this file).
static cl::opt<bool> SCHEDFILE("schedfile", cl::init(false));
using std::string;
-string getcwd( void ) {
- char buff[PATH_MAX];
- getcwd( buff, PATH_MAX );
- string cwd( buff );
- return cwd;
+string getcwd(void) {
+ char buff[PATH_MAX];
+ getcwd(buff, PATH_MAX);
+ string cwd(buff);
+ return cwd;
}
string getNextSchedPolicy() {
string CWD = getcwd();
@@ -1943,7 +1943,7 @@ string getNextSchedPolicy() {
/// ScheduleDAGMILive then it will want to override this virtual method in order
/// to update any specialized state.
void ScheduleDAGMILive::schedule() {
- GenericScheduler *Strategy = ((GenericScheduler*) SchedImpl.get());
+ GenericScheduler *Strategy = ((GenericScheduler *)SchedImpl.get());
if (SCHEDFILE) {
string NextPolicy = getNextSchedPolicy();
string RegionID = getRegionID(CurrRegionIdx, BB);
@@ -1951,22 +1951,18 @@ void ScheduleDAGMILive::schedule() {
dbgs() << RegionID << " BOTTOMUP\n";
Strategy->RegionPolicy.OnlyTopDown = false;
Strategy->RegionPolicy.OnlyBottomUp = true;
- }
- else if (NextPolicy.compare("TOPDW") == 0) {
+ } else if (NextPolicy.compare("TOPDW") == 0) {
dbgs() << RegionID << " TOPDOWN\n";
Strategy->RegionPolicy.OnlyTopDown = true;
Strategy->RegionPolicy.OnlyBottomUp = false;
- }
- else if (NextPolicy.compare("BIDIR") == 0) {
+ } else if (NextPolicy.compare("BIDIR") == 0) {
dbgs() << RegionID << " BIDIR\n";
Strategy->RegionPolicy.OnlyTopDown = false;
Strategy->RegionPolicy.OnlyBottomUp = false;
- }
- else if (NextPolicy.compare("NOSCH") == 0) {
+ } else if (NextPolicy.compare("NOSCH") == 0) {
dbgs() << RegionID << " NOSCH\n";
return;
- }
- else
+ } else
assert(0 && "Bad read of next sched policy.");
}
@@ -1999,12 +1995,12 @@ void ScheduleDAGMILive::schedule() {
const SUnit *RegionLast = &SUnits[SUnits.size() - 1];
bool IsFirstInMBB = RegionFirst->getInstr() == BB->begin();
MachineBasicBlock::iterator BeforeRegItr;
- MachineBasicBlock::iterator
- AfterRegItr = std::next(RegionLast->getInstr()->getIterator());
+ MachineBasicBlock::iterator AfterRegItr =
+ std::next(RegionLast->getInstr()->getIterator());
if (!IsFirstInMBB)
BeforeRegItr = std::prev(RegionFirst->getInstr()->getIterator());
- MachineBasicBlock::iterator
- FirstItr = IsFirstInMBB ? BB->begin() : std::next(BeforeRegItr);
+ MachineBasicBlock::iterator FirstItr =
+ IsFirstInMBB ? BB->begin() : std::next(BeforeRegItr);
RegOverlaps OL0, OL1;
if (SchedPrintPressures)
countRegOverlaps(OL0, FirstItr, AfterRegItr);
@@ -2047,8 +2043,8 @@ void ScheduleDAGMILive::schedule() {
countProperties(P, this);
printRegionID(CurrRegionIdx, BB);
dbgs() << "_GPR" << OL1.GPR - OL0.GPR << "_FP" << OL1.FP - OL0.FP;
- dbgs() << "_GPRMax" << OL1.GPRMax - OL0.GPRMax << "_FPMax" <<
- OL1.FPMax - OL0.FPMax;
+ dbgs() << "_GPRMax" << OL1.GPRMax - OL0.GPRMax << "_FPMax"
+ << OL1.FPMax - OL0.FPMax;
dbgs() << "_Size" << SUnits.size();
dbgs() << "_Stores" << P.NumStores << "_Loads" << P.NumLoads;
dbgs() << "_FPDefs" << P.FPDefs << "_OtherDefs" << P.OtherDefs;
@@ -3272,8 +3268,9 @@ void SchedBoundary::bumpNode(SUnit *SU) {
unsigned IncMOps = SchedModel->getNumMicroOps(SU->getInstr());
// Allow for input order by skipping this assertion:
// assert(
- // (CurrMOps == 0 || (CurrMOps + IncMOps) <= SchedModel->getIssueWidth()) &&
- // "Cannot schedule this instruction's MicroOps in the current cycle.");
+ // (CurrMOps == 0 || (CurrMOps + IncMOps) <= SchedModel->getIssueWidth())
+ // && "Cannot schedule this instruction's MicroOps in the current
+ // cycle.");
unsigned ReadyCycle = (isTop() ? SU->TopReadyCycle : SU->BotReadyCycle);
LLVM_DEBUG(dbgs() << " Ready @" << ReadyCycle << "c\n");
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.cpp b/llvm/lib/CodeGen/RegAllocGreedy.cpp
index b0677843c..db82d8aeb 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.cpp
+++ b/llvm/lib/CodeGen/RegAllocGreedy.cpp
@@ -2878,8 +2878,7 @@ bool RAGreedy::hasVirtRegAlloc() {
std::string getFunctionID(const Function *F) {
std::string TmpStr;
raw_string_ostream OS(TmpStr);
- OS << F->getParent()->getModuleIdentifier() << "OOOOOOO"
- << F->getName();
+ OS << F->getParent()->getModuleIdentifier() << "OOOOOOO" << F->getName();
return OS.str();
}
static cl::opt<bool> RegallocStats("regalloc-stats", cl::init(false));
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.h b/llvm/lib/CodeGen/RegAllocGreedy.h
index 3df92f7ad..46eaa3fc6 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.h
+++ b/llvm/lib/CodeGen/RegAllocGreedy.h
@@ -288,6 +288,7 @@ private:
unsigned NumFPSpilled;
unsigned NumOtherSpilled;
+
public:
RAGreedy(RequiredAnalyses &Analyses, const RegAllocFilterFunc F = nullptr);
diff --git a/llvm/lib/CodeGen/TargetRegisterInfo.cpp b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
index 10abf95e7..f0e2e6411 100644
--- a/llvm/lib/CodeGen/TargetRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
@@ -481,8 +481,8 @@ bool TargetRegisterInfo::isPrioRC(unsigned RegClassID) const {
return PrioRegClasses.count(RegClassID);
}
-bool TargetRegisterInfo::
-isPrioVirtReg(Register Reg, const MachineRegisterInfo *MRI) const {
+bool TargetRegisterInfo::isPrioVirtReg(Register Reg,
+ const MachineRegisterInfo *MRI) const {
initializePrioRegClasses();
return (Reg.isVirtual() &&
PrioRegClasses.count(MRI->getRegClass(Reg)->getID()));
|
Can you copy this to discource? It is the usual (and more convenient) place to discuss such things. |
Thanks. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When I realized that the GenericScheduler caused a lot more spilling to happen in cactus along with a performance penalty (5%), I started to try to understand what was going wrong. After many experiments, this is now my method of how to analyze a test case in regards to the pre-ra scheduler and spilling, which I would like to share (not meant to be merged):
The regalloc stats are per function, while the others are per scheduling region. The strings with "OOOO..."-delimiters are unique IDs for scheduling regions or functions that a script can parse along with the important numbers to the right.
Compiling this one file once more per above with -mllvm -misched-inputorder added (same output as -enable-misched=false but with these dumps) shows some but much less FP spilling:
REGALLOCSTAT: ML_BSSN_Advect.bcOOOOOOO_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd _NumInstrs9857_NumVirtRegs10318_NumFPSpilled483_NumOtherSpilled491
So this one function in the file spills 1848 FP vregs with the scheduler, but only 483 without it. To show the regions where the scheduler has caused more FP regs overlap than what was before scheduling:
The (positive) FP values here (66, 81234, ...) are the differences between the sums of overlapping FP vregs before and after scheduling. This sum is basically counted as the number of live FP regs at each MI in the region, summed together across all MIs in the region.
With this "interestingness test", llvm-reduce can be used to do the actual reduction. A wrapper for this purpose is reduce_region.sh, that needs LLCBIN in the environment and the input file as an argument. The test uses llc, so the ML_BSSN_Advect.ll passed is the output aftr the middle-end passes have run (clang -emit-llvm...).
In reduce_region.sh above, the values of -50 and -2 may be good to start with, and as a second step a yet smaller test case could be made by changing the numbers to, let's say -10 -1 or even -10 0, which may be useful. To get a side-by-side overview of the scheduling effects, I use two panes in a tmux session and adjust the llc output to look at both before and after, but there is also a script that doesn't work perfectly (yet) due to escape sequences (may need to adjust some spacing in it): sidebyside_sched.sh show_out.
It is quite interesting to see visually how the registers have been become more overlapping with each other.
-showgr is passed to llc).
The schedule has changed to the right with more overlapping lines and higher F values, which should explain the extra spilling. The original schedule had the VL64s (load from memory) scheduled low, while the GenericScheduler has allowed them to float up, which wasn't so good in this case.
This type of reduction/visualization has given me many clues to potential improvements of the scheduler strategy. This file and example shows the probably still most important heuristic, namely to schedule loads (and other instructions defining but not redefining a reg, with no uses becoming live) low in a big region. It seems that this is basically all you need to do to not cause spilling - at least in big regions like this. There are of course a few files that do not like that, so there is still room for improvement in SystemZPreRASchedStrategy. Usually, but not always, can the original problem be understood in the reduced region and if fixed also handles the original file.
In the SystemZPreRASchedStrategy this has evolved into computeSULivenessScore() over a series of many, many experiments with this. I have tried many good and reasonable ideas where only some happened to give good benchmark results as well. I have experimented with (small) DFS trees but removed them again as they where not needed to give good results. It is probably worth another try with them at some point. I also tried reversing this "loads down" heuristic to "stores up", but that didn't work very well for one or other reason (same with bi-directional). This is probably also worth revisting at some point with a careful approach - storing a value right after the last def should reduce spilling and be an improvement. There are still many things to be tried, but the current strategy is at a good point with good results. FP benchmarks are improving but Int not so much (no regressions however).
I may be wrong, but my feeling is that to get best performance this probably has to be tuned and benchmarked for a specific target. On the other hand, probably some other target could still get an improvement with this strategy, maybe after some adjustments.
There is also the -schedfile option which I have tried by building all different policies (including disabling) and then recompiled a final time taking for each region the policy that gave the least spill (from a file). This could also be interesting to revisit, but so far it hasn't given any increased performance.
The scripts:
sched_scripts.tar.gz
@uweigand @dominik-steenken @stefan-sf-ibm @atrick @michaelmaitland @wangpc-pp @mshockwave