[MachineScheduler] Experimental option to partially disable pre-ra scheduling. #90181

Conversation
You can test this locally with the following command:

git-clang-format --diff b446f9079bdc1123e1aafe5a535d8dbea835d8e8 bae21d51a33c66672e3a67dc919a6b23fa266b09 -- llvm/include/llvm/CodeGen/MachineScheduler.h llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h llvm/lib/CodeGen/MachineScheduler.cpp

View the diff from clang-format here.

diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index 6a0c886cbd..5c8711ef68 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1075,9 +1075,25 @@ public:
/// Represent the type of SchedCandidate found within a single queue.
/// pickNodeBidirectional depends on these listed by decreasing priority.
enum CandReason : uint8_t {
- NoCand, Only1, PhysReg, RegExcess, RegCritical, Stall, Cluster, Weak,
- RegMax, ResourceReduce, ResourceDemand, BotHeightReduce, BotPathReduce,
- TopDepthReduce, TopPathReduce, NextDefUse, RegPressure, NodeOrder};
+ NoCand,
+ Only1,
+ PhysReg,
+ RegExcess,
+ RegCritical,
+ Stall,
+ Cluster,
+ Weak,
+ RegMax,
+ ResourceReduce,
+ ResourceDemand,
+ BotHeightReduce,
+ BotPathReduce,
+ TopDepthReduce,
+ TopPathReduce,
+ NextDefUse,
+ RegPressure,
+ NodeOrder
+ };
#ifndef NDEBUG
static const char *getReasonStr(GenericSchedulerBase::CandReason Reason);
@@ -1218,11 +1234,11 @@ class GenericScheduler : public GenericSchedulerBase {
// TODO: Integrate with SchedDFSResult class.
// SU -> Nodes above in subtree.
- std::vector<std::set<const SUnit *> > TreeSUs;
+ std::vector<std::set<const SUnit *>> TreeSUs;
// SU -> Virtual regs defined above in subtree.
- std::vector<std::set<Register> > TreeDefs;
+ std::vector<std::set<Register>> TreeDefs;
// SU -> Regs used but not defined above in subtree.
- std::vector<std::set<Register> > TreeUses;
+ std::vector<std::set<Register>> TreeUses;
// If this SU is non-null, it is the start of a subtree to be scheduled as
// a unit.
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index f071dc40c3..5dbb04aaed 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -2509,9 +2509,10 @@ static bool doOOOSchedForRegion(unsigned NumRegionInstrs) {
return OOOSCHED && NumRegionInstrs > NoOOOSchedBelow;
}
-static bool doHeightHeurForRegion(const ScheduleDAGMI *DAG, unsigned DAGHeight) {
- return HEIGHTHEUR &&
- DAGHeight != 0 && (DAG->SUnits.size() / DAGHeight) < HEIGHTIFWFAC;
+static bool doHeightHeurForRegion(const ScheduleDAGMI *DAG,
+ unsigned DAGHeight) {
+ return HEIGHTHEUR && DAGHeight != 0 &&
+ (DAG->SUnits.size() / DAGHeight) < HEIGHTIFWFAC;
}
bool SchedBoundary::checkHazard(SUnit *SU) {
@@ -2756,8 +2757,9 @@ void SchedBoundary::bumpNode(SUnit *SU) {
const MCSchedClassDesc *SC = DAG->getSchedClass(SU);
unsigned IncMOps = SchedModel->getNumMicroOps(SU->getInstr());
assert((doOOOSchedForRegion(DAG->SUnits.size()) || INPUTORDER ||
- (CurrMOps == 0 || (CurrMOps + IncMOps) <= SchedModel->getIssueWidth())) &&
- "Cannot schedule this instruction's MicroOps in the current cycle.");
+ (CurrMOps == 0 ||
+ (CurrMOps + IncMOps) <= SchedModel->getIssueWidth())) &&
+ "Cannot schedule this instruction's MicroOps in the current cycle.");
unsigned ReadyCycle = (isTop() ? SU->TopReadyCycle : SU->BotReadyCycle);
LLVM_DEBUG(dbgs() << " Ready @" << ReadyCycle << "c\n");
@@ -3172,7 +3174,8 @@ const char *GenericSchedulerBase::getReasonStr(
case BotHeightReduce:return "BOT-HEIGHT";
case BotPathReduce: return "BOT-PATH ";
case NextDefUse: return "DEF-USE ";
- case RegPressure: return "REG-PRESS ";
+ case RegPressure:
+ return "REG-PRESS ";
case NodeOrder: return "ORDER ";
};
llvm_unreachable("Unknown reason!");
@@ -3337,9 +3340,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
if (DFS) {
DAG->computeDFSResult();
const SchedDFSResult *DFSResult = DAG->getDFSResult();
- TreeDefs = std::vector<std::set<Register> > (DAG->SUnits.size());
- TreeUses = std::vector<std::set<Register> > (DAG->SUnits.size());
- TreeSUs = std::vector<std::set<const SUnit *> >(DAG->SUnits.size());
+ TreeDefs = std::vector<std::set<Register>>(DAG->SUnits.size());
+ TreeUses = std::vector<std::set<Register>>(DAG->SUnits.size());
+ TreeSUs = std::vector<std::set<const SUnit *>>(DAG->SUnits.size());
assert(NextSubtreeSU == nullptr);
for (unsigned Idx = 0, End = DAG->SUnits.size(); Idx != End; ++Idx) {
@@ -3408,15 +3411,16 @@ void GenericScheduler::initLiveRegs(ScheduleDAGMILive *DAG) {
for (unsigned I = 0, E = DAG->MRI.getNumVirtRegs(); I != E; ++I) {
Register VirtReg = Register::index2VirtReg(I);
const LiveInterval &LI = DAG->getLIS()->getInterval(VirtReg);
- LiveQueryResult LRQ = LI.Query(DAG->getLIS()->
- getInstructionIndex(*DAG->SUnits.back().getInstr()));
+ LiveQueryResult LRQ = LI.Query(
+ DAG->getLIS()->getInstructionIndex(*DAG->SUnits.back().getInstr()));
if (LRQ.valueOut())
LiveRegs.insert(VirtReg);
}
- LLVM_DEBUG( dbgs() << " Live outs: ";
- for (auto Reg : LiveRegs)
- dbgs() << "%" << Register::virtReg2Index(Reg) << ", ";
- dbgs() << "\n";);
+ LLVM_DEBUG(dbgs() << " Live outs: ";
+ for (auto Reg
+ : LiveRegs) dbgs()
+ << "%" << Register::virtReg2Index(Reg) << ", ";
+ dbgs() << "\n";);
}
/// Initialize the per-region scheduling policy.
@@ -3630,7 +3634,8 @@ int biasPhysReg(const SUnit *SU, bool isTop) {
}
// Compute the current PressureDiff for MI and return it in PDiff.
-void GenericScheduler::getMIPDiff(const MachineInstr *MI, PressureDiff &PDiff) const {
+void GenericScheduler::getMIPDiff(const MachineInstr *MI,
+ PressureDiff &PDiff) const {
std::set<Register> Kills, Defs;
for (auto &MO : MI->explicit_operands()) {
@@ -3640,14 +3645,15 @@ void GenericScheduler::getMIPDiff(const MachineInstr *MI, PressureDiff &PDiff) c
LiveQueryResult LRQ = LI.Query(DAG->getLIS()->getInstructionIndex(*MI));
if (MO.isUse() && !LiveRegs.count(MO.getReg()))
Kills.insert(MO.getReg());
- else if (MO.isDef() && LRQ.valueOut() != nullptr && LRQ.valueIn() == nullptr)
+ else if (MO.isDef() && LRQ.valueOut() != nullptr &&
+ LRQ.valueIn() == nullptr)
Defs.insert(MO.getReg());
}
for (auto &Kill : Kills)
- PDiff.addPressureChange(Kill, false/*IsDec*/, &DAG->MRI);
+ PDiff.addPressureChange(Kill, false /*IsDec*/, &DAG->MRI);
for (auto &Def : Defs)
- PDiff.addPressureChange(Def, true/*IsDec*/, &DAG->MRI);
+ PDiff.addPressureChange(Def, true /*IsDec*/, &DAG->MRI);
}
// Compute the current PressureDiff for a subtree beginning at the NodeNum SU.
@@ -3673,7 +3679,7 @@ void GenericScheduler::getTreePDiff(unsigned NodeNum,
// subtree defines (and are not redefining).
for (auto R : TreeDefs[NodeNum])
if (LiveRegs.count(R) && !TreeUses[NodeNum].count(R))
- PDiff.addPressureChange(R, true/*IsDec*/, &DAG->MRI);
+ PDiff.addPressureChange(R, true /*IsDec*/, &DAG->MRI);
}
// Compare two pressure diffs and retun a non-zero value only in cases where
@@ -3787,7 +3793,7 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
if (doOOOSchedForRegion(DAG->SUnits.size())) {
bool SkipPhysRegs = biasPhysReg(TryCand.SU, TryCand.AtTop) &&
- biasPhysReg(Cand.SU, TryCand.AtTop);
+ biasPhysReg(Cand.SU, TryCand.AtTop);
if (REGPRESS && !SkipPhysRegs) {
// Schedule from NextQueue until it's empty.
@@ -3808,7 +3814,8 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
CandMIPDiff.dump(*TRI);
dbgs() << "SU(" << TryCand.SU->NodeNum << ") PDiff: \t";
TryCandMIPDiff.dump(*TRI););
- tryLess(TryCandRPScore, 0, TryCand, Cand, GenericSchedulerBase::RegPressure);
+ tryLess(TryCandRPScore, 0, TryCand, Cand,
+ GenericSchedulerBase::RegPressure);
return TryCand.Reason != NoCand;
}
@@ -3858,12 +3865,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
return MaxPredNodeNum;
};
bool TryCandIncr = onlyIncreases(TryCandMIPDiff);
- bool CandIncr = onlyIncreases(CandMIPDiff);
+ bool CandIncr = onlyIncreases(CandMIPDiff);
if (TryCandIncr != CandIncr) {
- bool TryCandHeur = (TryCandIncr &&
- maxPredNum(TryCand.SU) < Cand.SU->NodeNum);
- bool CandHeur = (CandIncr &&
- maxPredNum(Cand.SU) < TryCand.SU->NodeNum);
+ bool TryCandHeur =
+ (TryCandIncr && maxPredNum(TryCand.SU) < Cand.SU->NodeNum);
+ bool CandHeur = (CandIncr && maxPredNum(Cand.SU) < TryCand.SU->NodeNum);
if (tryLess(TryCandHeur, CandHeur, TryCand, Cand,
GenericSchedulerBase::RegPressure))
return TryCand.Reason != NoCand;
@@ -3872,7 +3878,7 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
if (!SkipPhysRegs && doHeightHeurForRegion(DAG, DAGHeight) &&
tryLatency(TryCand, Cand, *Zone))
- return TryCand.Reason != NoCand;
+ return TryCand.Reason != NoCand;
// Fall through to original instruction order.
if (TryCand.SU->NodeNum > Cand.SU->NodeNum) {
@@ -3987,10 +3993,11 @@ void GenericScheduler::pickNodeFromQueue(SchedBoundary &Zone,
// getMaxPressureDelta temporarily modifies the tracker.
RegPressureTracker &TempTracker = const_cast<RegPressureTracker&>(RPTracker);
- LLVM_DEBUG( dbgs() << "Live regs: ";
- for (auto R : LiveRegs)
- dbgs() << "%" << Register::virtReg2Index(R) << ", ";
- dbgs() << "\n";);
+ LLVM_DEBUG(dbgs() << "Live regs: ";
+ for (auto R
+ : LiveRegs) dbgs()
+ << "%" << Register::virtReg2Index(R) << ", ";
+ dbgs() << "\n";);
ReadyQueue &Q = Zone.Available;
for (SUnit *SU : Q) {
@@ -4202,12 +4209,11 @@ void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
} else {
const LiveInterval &LI = DAG->getLIS()->getInterval(MO.getReg());
LiveQueryResult LRQ =
- LI.Query(DAG->getLIS()->getInstructionIndex(*MI));
- if(!LRQ.valueIn())
+ LI.Query(DAG->getLIS()->getInstructionIndex(*MI));
+ if (!LRQ.valueIn())
LiveRegs.erase(MO.getReg());
}
- }
- else if (MO.readsReg())
+ } else if (MO.readsReg())
LiveRegs.insert(MO.getReg());
}
++NumScheduled;
@@ -4218,24 +4224,23 @@ void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
for (auto *TSU : TreeSUs[SU->NodeNum])
if (!TSU->isScheduled)
NextQueue.insert(TSU);
- LLVM_DEBUG(dbgs() << "Scheduling subtree: ";
- for (auto *NxSU : NextQueue)
- dbgs() << NxSU->NodeNum << " ";
+ LLVM_DEBUG(dbgs() << "Scheduling subtree: "; for (auto *NxSU
+ : NextQueue) dbgs()
+ << NxSU->NodeNum << " ";
dbgs() << "\n";);
}
NextSubtreeSU = nullptr;
}
if (!NextQueue.empty()) {
- assert (NextQueue.count(SU) && "Failed to schedule planned SU.");
+ assert(NextQueue.count(SU) && "Failed to schedule planned SU.");
NextQueue.erase(SU);
#ifndef NDEBUG
const SchedDFSResult *DFSResult = DAG->getDFSResult();
unsigned SUTree = DFSResult->getSubtreeID(SU);
for (const SDep &Pred : SU->Preds) {
const SUnit *PredSU = Pred.getSUnit();
- assert((PredSU->isBoundaryNode() ||
- Pred.getKind() != SDep::Data ||
+ assert((PredSU->isBoundaryNode() || Pred.getKind() != SDep::Data ||
(DFSResult->getSubtreeID(PredSU) == SUTree &&
NextQueue.count(PredSU)) ||
LiveRegs.count(Pred.getReg()) ||
ping - no interest in this, anyone? I think it would be good as a first step to at least know of the current deficiencies in the machine scheduler across targets...
Thanks for your patch. I have a few questions and observations:
Are you running both the pre-ra and post-ra schedulers? Do you see the regression if you run both vs. only running pre-ra?
It sounds like the long-term solution is to address spilling directly. Do you know if the issue is spilling in general, or if it has to do specifically with larger regions?
Is this dynamic instruction count or runtime? Do you see any regressions due to this change in SPEC? What is the difference between the performance with this change vs disabling pre-ra scheduling entirely? I can take a look to see how this change is impacting benchmarks on RISC-V.
On SystemZ, both machine schedulers are run. Post-RA scheduling is done with a target-specific MachineSchedStrategy and HazardRecognizer, which have been running for years, so I wouldn't suspect them. In this case it seems obvious to me that the pre-ra scheduler is to blame given the increase in spilling, which only it can affect. I checked this anyway, and found that there is some additional variation from disabling the post-RA scheduler, but the significant part is definitely the pre-RA pass.
As can be seen in the table I posted, this does indeed seem to be a problem in larger regions. It seems the scheduler has a lot of nodes to pick from and fails to make sensible choices (looking at the DAG), so a lot of unfortunate overlapping results. I am kind of optimistic that this could be fixed by some new heuristic, but I am not sure exactly how.
Runtime (and spill/reload instructions).
On SystemZ, no. This is probably because disabling bigger regions mainly affects this benchmark.
I have seen a few minor regressions by disabling pre-ra scheduling altogether (<5%).
That would be nice - thanks! It would be interesting to see what size limit you end up using, as that may be slightly different (number of target instructions).
// spilling heavily with huge regions (like >350 instructions). This option
// makes any sched region bigger than its value have pre-ra scheduling
// skipped.
cl::opt<unsigned> NoSchedAbove("nosched-above", cl::init(~0U));
Some questions on the definition of the option:

- Should this option name be specific to MachineScheduler (pre-ra)? Maybe misched-nosched-above?
- Should this option have a cl::desc?
- Should it be cl::Hidden or not?
- Should it have the prefix experimental (experimental-xxx) so it is clear that it is experimental?
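If all of those suggestions were applied, the declaration might look something like this (a sketch only; the exact name and description wording are illustrative, not from the patch):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical form of the option with a misched-specific, experimental
// name, a description, and hidden from -help by default.
static cl::opt<unsigned> NoSchedAbove(
    "experimental-misched-nosched-above", cl::Hidden,
    cl::desc("Skip pre-RA scheduling of regions with more instructions "
             "than this value (default: no limit)."),
    cl::init(~0U));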
Summary

I ran some experiments based on #90181. I only ran spec2006 C/C++ benchmarks. The experiments use a value of -nosched-above=350 and compare against the default. In summary, there is no significant change on spec2006 with this option enabled.

Dynamic IC

spec2006int: There was no change in dynamic IC for most benchmarks. 400.perlbench and 403.gcc

spec2006fp: There was no change in dynamic IC for 433.milc, 444.namd, 450.soplex, 470.lbm. 447.deal saw a 0.00003% decrease in dynamic IC and 453.povray saw a 0.0007%

Spills

spec2006int

spec2006fp

Conclusion

I don't mind adding this functionality, especially since it will be behind an option, and that

Some ideas
@michaelmaitland Thank you for taking your time with this! Interesting that on RISC-V this does not seem to matter much, at least generally. It would then be interesting to see if any individual file got better on RISC-V by disabling the scheduler (-mllvm -enable-misched=false). It would be most interesting to see this per scheduled region, or even per function, but any big regression should also be visible in a file comparison. I fully agree that this option is only temporary, if even that. Instead of actually pushing this patch as it is, I will soon add here some experimental pre-RA heuristics that should at least in theory be better for an out-of-order target. I am hoping that this will be evaluated on other OOO targets, and if there indeed is common ground here, some new OOO strategy could be developed.
That's a good question, but I don't have a simple answer to it. One observation is that before scheduling the instructions are in a fairly reasonable order, with long chains of def-use-def-use... but then the scheduler decides to mix those chains up so that many more registers are live in parallel than before. The GenericScheduler is dumb in that it does this after looking at just individual SUnits, and does not consider anything like sub-trees or groups of instructions. It will never do any "lookahead" and finish off a sequence of instructions where it could reduce liveness; instead it may continue to add too much ILP even though the heuristic is related to register pressure. So I would say that in this worst case (cactus) it is simply adding too much ILP in the huge block, while the unscheduled instructions were in a pretty good order to begin with.
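A contrived illustration of the chain-mixing effect (made up for this comment, not taken from cactus):

// Hypothetical example: two independent def-use chains.
double f(double x);

double chains(double a0, double b0) {
  // Source order: chain A completes before chain B starts, so at most
  // two values are live at any point.
  double a1 = f(a0), a2 = f(a1);
  double b1 = f(b0), b2 = f(b1);
  // An ILP-seeking schedule interleaves the chains (a0, b0, a1, b1, ...),
  // keeping both chains live at once and roughly doubling the overlap
  // the register allocator has to color.
  return a2 + b2;
}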
For one thing, it looks at the input order of the instructions and determines the type of register pressure set that is most important to keep track of during scheduling. In this FP-intensive region, the GPRs are incorrectly deemed most important, and so the scheduler causes a lot of FP spills. But as said earlier, fixing this alone did not help much. Maybe the real problem is that GenericScheduler is cycle-exact (with Available/Pending queues) and does resource balancing. The register pressure heuristics seem to be nice add-ons on top of that, but do not do the full job. Therefore, I am thinking of a different scheduler more aimed at OOO targets, where register pressure is the first priority, followed by some ILP / latency optimization that does not cause any spilling.
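The priority order of such a strategy could be sketched roughly like this (hypothetical, with simplified stand-in types; loosely modeled on the tryCandidate structure in this patch):

// Hypothetical OOO-oriented candidate comparison: register pressure
// first, then remaining path length (ILP), then original source order.
struct Candidate {
  int RPScore;      // negative = reduces register pressure
  unsigned Height;  // critical path below this node
  unsigned NodeNum; // position in input order
};

bool preferTryCand(const Candidate &TryCand, const Candidate &Cand) {
  if (TryCand.RPScore != Cand.RPScore)
    return TryCand.RPScore < Cand.RPScore; // less pressure wins
  if (TryCand.Height != Cand.Height)
    return TryCand.Height > Cand.Height;   // longer remaining path wins
  return TryCand.NodeNum > Cand.NodeNum;   // fall back to source order (bottom-up)
}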
As I will post here soon, one part of my experiments has been to not have the Pending queue at all (unless Available grows huge). The motivation is that for a non-cycle-exact target it doesn't matter if e.g. a load has a latency that makes it not fit in the current cycle, so schedule it directly instead and close the live range.
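A minimal model of that idea (simplified stand-in types, not the actual SchedBoundary code; MaxAvailable is a made-up name for the "grows huge" threshold):

#include <vector>

struct SUnit { unsigned ReadyCycle; };

struct Boundary {
  std::vector<SUnit *> Available, Pending;
  unsigned CurrCycle = 0;
  bool NoHazards = true;      // corresponds to -misched-nohazards below
  unsigned MaxAvailable = 32; // made-up fallback threshold

  void releaseNode(SUnit *SU) {
    // With hazard checking disabled, a node that is not cycle-ready is
    // still made available, so e.g. a load can be placed (bottom-up)
    // immediately next to its user to shorten the live range.
    if ((NoHazards && Available.size() < MaxAvailable) ||
        SU->ReadyCycle <= CurrCycle)
      Available.push_back(SU);
    else
      Pending.push_back(SU);
  }
};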
Have you tried setting
I tried it now at your suggestion, but it is actually making things worse.
I have now tried to boil down the more important parts of my experimental results that were kind of successful on SystemZ to something more generic that would hopefully be of interest to other (OOO) targets. This is still experimental, and the major features are therefore controlled with the CL options listed below, in order. My question now is if there is any possibility of collaboration across targets on this? Could this (if used on SystemZ) become a SchedStrategy in MachineScheduler, or would it need to be a SystemZ-specific strategy? See previous comments for motivation. And as always, any feedback is most welcome.

-misched-nohazards (default true): Disables cycle hazards so that even if a predecessor is not ready in the current cycle, the SU is still put in the Available queue instead of in Pending. The idea is to be able to reduce live ranges by e.g. putting a load immediately before its user.

-misched-regpress (default true): Attempts to reduce live range overlapping / spilling. My conclusion here (on SystemZ / SPEC) has been that the input (unscheduled) order is often quite good already from this perspective. The scheduler is probably most likely

The DFS trees can also help additionally here by giving a bit of "lookahead". Each SU is (with this patch) mapped to a set of registers used and defined above it in its DFS subtree. If the SU can be scheduled as a unit with its subtree predecessors without causing any new registers to become live, this is done if it will close a live range by scheduling a def of a live register (a rough standalone sketch of this check follows at the end of this comment). I have found that subtrees of sizes 3-5 seem to work best. Regardless of whether the comparison between Cand and TryCand is done with just the SUnits or also using the DFS subtrees, a set of live registers is used to get the actual current effects. I have not yet tried to merge this with preexisting code.

I have tried bi-directional scheduling with the idea of scheduling e.g. a store immediately after its defining instruction going top-down - kind of the reverse of the bottom-up "load immediately before use". Interestingly, this was not as successful as expected, so I have reverted to doing this only bottom-up. The stores are now handled in a simpler way: given that source order is typically good for register pressure, and that a store can't just be pushed up in the list due to latency considerations of its predecessors, I have found a less aggressive approach: put the store after its predecessor as seen in the input order. This has the effect of moving up a store if it was for some reason put further down in the input. This is done not just for a memory store but for any instruction that kills a register without defining one. Todo: this should probably only kick in when useful, but currently it is done regardless of the current register pressure.

-misched-dfs (default true): Computes the DFSResult for use per the above.

-dfs-size (default 4): The max size of DFS subtrees that may be scheduled as a unit.

-misched-heightheur (default true): After the register pressure heuristics, an attempt to increase ILP by considering e.g. the height/depth of SUs is made. The idea is to do this without causing increased spilling, but the balance between the two still has room for improvement.

-misched-heightifwfac (default 3): A rough heuristic to only apply the height heuristic (ILP) on DAGs with a "width factor" less than this value (e.g. a DAG with 30 SUnits and height 15 has width factor 30/15 = 2, which is below the default of 3, so the heuristic applies). The idea is to not become too aggressive with this on regions with many SUs that are mostly in parallel. Only if the DAG is quite "high and narrow" should this be done.
This could likely be improved.

-misched-ooo: Enables this experimental scheduling (if not passed, the patch is NFC).

-no-ooosched-below: If a non-zero value is passed, normal scheduling is done for regions with fewer instructions than this value. This can be interesting as a strategy can have different levels of success on small/huge regions. Originally, before the machine scheduler got enabled, SystemZ used to run the "Bottom-up register reduction list scheduling" for ISel scheduling. I tried this setup again now but found that doing so does not help cactus. Generally however, across benchmarks, this gives the least spilling for regions of fewer than 100 instructions. On some bigger regions though (>7500 instructions), "list-burr" actually causes more spilling than "source". So it is not just the GenericScheduler that messes this huge region up. In the range of 100 to 1000 instructions, the current behavior ("source" + mischeduler) seems to be beneficial, but it cannot handle huge regions.

This version is not a great improvement on SystemZ compared to just disabling the MachineScheduler pre-ra entirely. I know there is a little room for further performance improvements, but I am not sure that would be enough to motivate a SystemZ-specific scheduling strategy. Ideally other targets would find this interesting and begin to use/develop it.

@michaelmaitland Did you try disabling the machine scheduler and comparing spilling? What happens if you try this (with -misched-ooo -misched-heightheur=false)?

@atrick Any comments on my use of the DFS subtrees?
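Here is the promised standalone sketch of the subtree check described above (simplified types and a hypothetical helper name; the actual patch keeps TreeDefs/TreeUses per SU and feeds a PressureDiff instead of returning a bool):

#include <set>

using Register = unsigned;

struct SubtreeInfo {
  std::set<Register> TreeDefs; // regs defined above the SU in its subtree
  std::set<Register> TreeUses; // regs used but not defined in the subtree
};

// Bottom-up: scheduling the subtree as a unit is attractive if it closes
// a live range (defines a live reg it doesn't also use) without opening
// new ones (all of its uses are already live).
bool profitableSubtree(const SubtreeInfo &T,
                       const std::set<Register> &LiveRegs) {
  for (Register R : T.TreeUses)
    if (!LiveRegs.count(R))
      return false; // would make a new register live
  for (Register R : T.TreeDefs)
    if (LiveRegs.count(R) && !T.TreeUses.count(R))
      return true;  // closes a live range
  return false;
}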
This is a typical problem with pre-ra scheduling. The pressure looks fine before scheduling. The list scheduler backs itself into a corner trying to expose ILP before it reaches the pressure limit, then can't back itself out. It would be great if DFS subtrees can handle this case. You may need to dig into the design and see if it can recognize good subtrees on cactus, and tweak its own heuristics, which can be sensitive. You would have to decide whether to always enable subtrees for your target (and pay any compile-time cost), or somehow identify DAG patterns where it's worth enabling subtrees. In general, it would be awesome if the scheduler could recognize certain classes of DAGs and set the heuristics accordingly before it starts scheduling. So in this case, it should know to try to keep register pressure below the machine's limit because it will need those registers later. GenericScheduler's strategy is to avoid perturbing the schedule unless it sees an opportunity, but that can still fail with large subtrees like this. Another approach would be a 2-pass scheduler. The first pass would expose ILP. The second pass would then know how far register pressure had been exceeded and would reschedule to avoid spills. The generic scheduler doesn't do that because it's trying to be compile-time friendly and trying not to perturb the original instruction schedule.
I don't think this branch will actually land, although there may be some ideas worth trying. I now have another PR in progress that will replace this one: see #135076. @michaelmaitland Maybe you would like to try the new SystemZPreRASchedStrategy and see if it works on your target as well?
CC @mshockwave |
I found a big regression on SPEC/cactus which comes from enabling the machine scheduler before register allocation - cactus ran 8% faster by just disabling the pre-ra machine scheduler. The reason seems to be that it produces a schedule that leads to heavily increased spilling during register allocation. I have experimented quite a bit with the scheduler heuristics but do not yet have an overall solution to this. This is merely a first step that makes it possible to experimentally disable scheduling of bigger regions.
This is actually relevant not only for SystemZ (@greened @efriedma-quic @asb @ecnelises @uweigand @dominik-steenken @arsenm @michaelmaitland):
With this setting (-nosched-above=350), I see a 9% speedup on cactus on SystemZ / z16. I think it would be interesting to try this on at least aarch64 and PowerPC as well, given the reduced spilling. Of course, maybe some number other than 350 would be optimal, but it looks like a good start. Would anyone be willing to do this?
This is a screenshot of some of my own personal experimental measurements on SPEC17/SystemZ:
I have, during scheduling, counted the register overlaps before and after the scheduling of a region, both for FP and GPR registers (on SystemZ). This is a summary which, from left to right, shows the total number of regions of a particular size (interval), then in parentheses the number of regions where the resulting register overlaps were different (the size ranges are first in steps of 10, and then in bigger steps; only size intervals with something changing are printed, and some bigger ones are omitted). The FP / GPR columns show the overall results - green / negative is good, as overlaps decreased, while red shows the scheduler strategy has increased overlaps.
If overlaps increase, that means register pressure will increase and more spilling will typically result. As can be seen, the results are only satisfactory for the upper right hand corner - the GPR register pressure in small (common) regions. Bigger regions (rare) with FP are really bad. This overlap measurement unit is sort of an integral (sum) of all the overlaps across the region at the point of each instruction.
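For clarity, my reading of that measurement as a standalone sketch (hypothetical helper, not the actual measurement code):

#include <set>
#include <vector>

using Register = unsigned;

// Sum, over every instruction point in the region, of the number of
// live registers overlapping there - an "integral" of liveness over the
// region, comparable before and after scheduling.
unsigned overlapSum(const std::vector<std::set<Register>> &LiveAtPoint) {
  unsigned Sum = 0;
  for (const auto &Live : LiveAtPoint)
    Sum += Live.size();
  return Sum;
}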
A first next step might be to disable big regions non-experimentally for now on SystemZ. In the long run, I am hoping to have better general heuristics that works also on big regions. Some thoughts:
I was hoping that ilp-min would remedy this situation, but to my surprise it did not. It is actually a lot worse than the default strategy overall on SPEC/SystemZ, as well as on the cactus file, in terms of register spilling.
The default MachineSchedStrategy does many things that make sense mostly on an in-order target, it seems to me. An out-of-order target would primarily like to reduce spilling, and secondarily also do things like critical path / height adjustments and try to balance out usages of processor resources.
The default scheduler makes cycle-based decisions and puts an SU in Pending if it is not cycle-ready, even though it would be possible to schedule it. The alternative would be to put all available SUs in Available, and thus make it possible to reduce live ranges by e.g. scheduling a load immediately next to its user.
It seems to me that there should be an OOO strategy that should never increase spilling (hopefully improve it), while increasing ILP in a sensible manner. It shouldn't worry too much about cycle exact scheduling or processor resources balancing. Has anyone else worked on this also?
While building the sched graph, it tracks the register pressures and decides what should be the critical pressure set during scheduling. The cactus example is an FP-intensive region, but in the given input order the GPR registers are instead deemed the more important during scheduling (this is however not the real problem in this case).
Overall, SPEC/SystemZ shows an 18% increase in spilled FP live ranges with the machine scheduler enabled, while the rest (GPR) stays about the same. This is too much to ignore, of course.
I have done quite a bit of experimenting with this and have been able to get even some additional improvement on cactus/SystemZ with a simpler scheduling strategy (12%). It goes bottom-up, always schedules a load lower, and pushes stores up a bit, to decrease the overlapping of registers. In addition, it adds a height/depth heuristic which gives the additional performance improvement. I did try bi-directional extensively, but to my surprise that was not helpful. This is quite tricky to get right, it seems, and I would be happy to collaborate with anyone else interested in this.