[MachineScheduler] Experimental option to partially disable pre-ra scheduling. #90181

Conversation
You can test this locally with the following command:

git-clang-format --diff b446f9079bdc1123e1aafe5a535d8dbea835d8e8 bae21d51a33c66672e3a67dc919a6b23fa266b09 -- llvm/include/llvm/CodeGen/MachineScheduler.h llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h llvm/lib/CodeGen/MachineScheduler.cpp

View the diff from clang-format here.

diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index 6a0c886cbd..5c8711ef68 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1075,9 +1075,25 @@ public:
/// Represent the type of SchedCandidate found within a single queue.
/// pickNodeBidirectional depends on these listed by decreasing priority.
enum CandReason : uint8_t {
- NoCand, Only1, PhysReg, RegExcess, RegCritical, Stall, Cluster, Weak,
- RegMax, ResourceReduce, ResourceDemand, BotHeightReduce, BotPathReduce,
- TopDepthReduce, TopPathReduce, NextDefUse, RegPressure, NodeOrder};
+ NoCand,
+ Only1,
+ PhysReg,
+ RegExcess,
+ RegCritical,
+ Stall,
+ Cluster,
+ Weak,
+ RegMax,
+ ResourceReduce,
+ ResourceDemand,
+ BotHeightReduce,
+ BotPathReduce,
+ TopDepthReduce,
+ TopPathReduce,
+ NextDefUse,
+ RegPressure,
+ NodeOrder
+ };
#ifndef NDEBUG
static const char *getReasonStr(GenericSchedulerBase::CandReason Reason);
@@ -1218,11 +1234,11 @@ class GenericScheduler : public GenericSchedulerBase {
// TODO: Integrate with SchedDFSResult class.
// SU -> Nodes above in subtree.
- std::vector<std::set<const SUnit *> > TreeSUs;
+ std::vector<std::set<const SUnit *>> TreeSUs;
// SU -> Virtual regs defined above in subtree.
- std::vector<std::set<Register> > TreeDefs;
+ std::vector<std::set<Register>> TreeDefs;
// SU -> Regs used but not defined above in subtree.
- std::vector<std::set<Register> > TreeUses;
+ std::vector<std::set<Register>> TreeUses;
// If this SU is non-null, it is the start of a subtree to be scheduled as
// a unit.
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index f071dc40c3..5dbb04aaed 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -2509,9 +2509,10 @@ static bool doOOOSchedForRegion(unsigned NumRegionInstrs) {
return OOOSCHED && NumRegionInstrs > NoOOOSchedBelow;
}
-static bool doHeightHeurForRegion(const ScheduleDAGMI *DAG, unsigned DAGHeight) {
- return HEIGHTHEUR &&
- DAGHeight != 0 && (DAG->SUnits.size() / DAGHeight) < HEIGHTIFWFAC;
+static bool doHeightHeurForRegion(const ScheduleDAGMI *DAG,
+ unsigned DAGHeight) {
+ return HEIGHTHEUR && DAGHeight != 0 &&
+ (DAG->SUnits.size() / DAGHeight) < HEIGHTIFWFAC;
}
bool SchedBoundary::checkHazard(SUnit *SU) {
@@ -2756,8 +2757,9 @@ void SchedBoundary::bumpNode(SUnit *SU) {
const MCSchedClassDesc *SC = DAG->getSchedClass(SU);
unsigned IncMOps = SchedModel->getNumMicroOps(SU->getInstr());
assert((doOOOSchedForRegion(DAG->SUnits.size()) || INPUTORDER ||
- (CurrMOps == 0 || (CurrMOps + IncMOps) <= SchedModel->getIssueWidth())) &&
- "Cannot schedule this instruction's MicroOps in the current cycle.");
+ (CurrMOps == 0 ||
+ (CurrMOps + IncMOps) <= SchedModel->getIssueWidth())) &&
+ "Cannot schedule this instruction's MicroOps in the current cycle.");
unsigned ReadyCycle = (isTop() ? SU->TopReadyCycle : SU->BotReadyCycle);
LLVM_DEBUG(dbgs() << " Ready @" << ReadyCycle << "c\n");
@@ -3172,7 +3174,8 @@ const char *GenericSchedulerBase::getReasonStr(
case BotHeightReduce:return "BOT-HEIGHT";
case BotPathReduce: return "BOT-PATH ";
case NextDefUse: return "DEF-USE ";
- case RegPressure: return "REG-PRESS ";
+ case RegPressure:
+ return "REG-PRESS ";
case NodeOrder: return "ORDER ";
};
llvm_unreachable("Unknown reason!");
@@ -3337,9 +3340,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
if (DFS) {
DAG->computeDFSResult();
const SchedDFSResult *DFSResult = DAG->getDFSResult();
- TreeDefs = std::vector<std::set<Register> > (DAG->SUnits.size());
- TreeUses = std::vector<std::set<Register> > (DAG->SUnits.size());
- TreeSUs = std::vector<std::set<const SUnit *> >(DAG->SUnits.size());
+ TreeDefs = std::vector<std::set<Register>>(DAG->SUnits.size());
+ TreeUses = std::vector<std::set<Register>>(DAG->SUnits.size());
+ TreeSUs = std::vector<std::set<const SUnit *>>(DAG->SUnits.size());
assert(NextSubtreeSU == nullptr);
for (unsigned Idx = 0, End = DAG->SUnits.size(); Idx != End; ++Idx) {
@@ -3408,15 +3411,16 @@ void GenericScheduler::initLiveRegs(ScheduleDAGMILive *DAG) {
for (unsigned I = 0, E = DAG->MRI.getNumVirtRegs(); I != E; ++I) {
Register VirtReg = Register::index2VirtReg(I);
const LiveInterval &LI = DAG->getLIS()->getInterval(VirtReg);
- LiveQueryResult LRQ = LI.Query(DAG->getLIS()->
- getInstructionIndex(*DAG->SUnits.back().getInstr()));
+ LiveQueryResult LRQ = LI.Query(
+ DAG->getLIS()->getInstructionIndex(*DAG->SUnits.back().getInstr()));
if (LRQ.valueOut())
LiveRegs.insert(VirtReg);
}
- LLVM_DEBUG( dbgs() << " Live outs: ";
- for (auto Reg : LiveRegs)
- dbgs() << "%" << Register::virtReg2Index(Reg) << ", ";
- dbgs() << "\n";);
+ LLVM_DEBUG(dbgs() << " Live outs: ";
+ for (auto Reg
+ : LiveRegs) dbgs()
+ << "%" << Register::virtReg2Index(Reg) << ", ";
+ dbgs() << "\n";);
}
/// Initialize the per-region scheduling policy.
@@ -3630,7 +3634,8 @@ int biasPhysReg(const SUnit *SU, bool isTop) {
}
// Compute the current PressureDiff for MI and return it in PDiff.
-void GenericScheduler::getMIPDiff(const MachineInstr *MI, PressureDiff &PDiff) const {
+void GenericScheduler::getMIPDiff(const MachineInstr *MI,
+ PressureDiff &PDiff) const {
std::set<Register> Kills, Defs;
for (auto &MO : MI->explicit_operands()) {
@@ -3640,14 +3645,15 @@ void GenericScheduler::getMIPDiff(const MachineInstr *MI, PressureDiff &PDiff) c
LiveQueryResult LRQ = LI.Query(DAG->getLIS()->getInstructionIndex(*MI));
if (MO.isUse() && !LiveRegs.count(MO.getReg()))
Kills.insert(MO.getReg());
- else if (MO.isDef() && LRQ.valueOut() != nullptr && LRQ.valueIn() == nullptr)
+ else if (MO.isDef() && LRQ.valueOut() != nullptr &&
+ LRQ.valueIn() == nullptr)
Defs.insert(MO.getReg());
}
for (auto &Kill : Kills)
- PDiff.addPressureChange(Kill, false/*IsDec*/, &DAG->MRI);
+ PDiff.addPressureChange(Kill, false /*IsDec*/, &DAG->MRI);
for (auto &Def : Defs)
- PDiff.addPressureChange(Def, true/*IsDec*/, &DAG->MRI);
+ PDiff.addPressureChange(Def, true /*IsDec*/, &DAG->MRI);
}
// Compute the current PressureDiff for a subtree beginning at the NodeNum SU.
@@ -3673,7 +3679,7 @@ void GenericScheduler::getTreePDiff(unsigned NodeNum,
// subtree defines (and are not redefining).
for (auto R : TreeDefs[NodeNum])
if (LiveRegs.count(R) && !TreeUses[NodeNum].count(R))
- PDiff.addPressureChange(R, true/*IsDec*/, &DAG->MRI);
+ PDiff.addPressureChange(R, true /*IsDec*/, &DAG->MRI);
}
// Compare two pressure diffs and retun a non-zero value only in cases where
@@ -3787,7 +3793,7 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
if (doOOOSchedForRegion(DAG->SUnits.size())) {
bool SkipPhysRegs = biasPhysReg(TryCand.SU, TryCand.AtTop) &&
- biasPhysReg(Cand.SU, TryCand.AtTop);
+ biasPhysReg(Cand.SU, TryCand.AtTop);
if (REGPRESS && !SkipPhysRegs) {
// Schedule from NextQueue until it's empty.
@@ -3808,7 +3814,8 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
CandMIPDiff.dump(*TRI);
dbgs() << "SU(" << TryCand.SU->NodeNum << ") PDiff: \t";
TryCandMIPDiff.dump(*TRI););
- tryLess(TryCandRPScore, 0, TryCand, Cand, GenericSchedulerBase::RegPressure);
+ tryLess(TryCandRPScore, 0, TryCand, Cand,
+ GenericSchedulerBase::RegPressure);
return TryCand.Reason != NoCand;
}
@@ -3858,12 +3865,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
return MaxPredNodeNum;
};
bool TryCandIncr = onlyIncreases(TryCandMIPDiff);
- bool CandIncr = onlyIncreases(CandMIPDiff);
+ bool CandIncr = onlyIncreases(CandMIPDiff);
if (TryCandIncr != CandIncr) {
- bool TryCandHeur = (TryCandIncr &&
- maxPredNum(TryCand.SU) < Cand.SU->NodeNum);
- bool CandHeur = (CandIncr &&
- maxPredNum(Cand.SU) < TryCand.SU->NodeNum);
+ bool TryCandHeur =
+ (TryCandIncr && maxPredNum(TryCand.SU) < Cand.SU->NodeNum);
+ bool CandHeur = (CandIncr && maxPredNum(Cand.SU) < TryCand.SU->NodeNum);
if (tryLess(TryCandHeur, CandHeur, TryCand, Cand,
GenericSchedulerBase::RegPressure))
return TryCand.Reason != NoCand;
@@ -3872,7 +3878,7 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
if (!SkipPhysRegs && doHeightHeurForRegion(DAG, DAGHeight) &&
tryLatency(TryCand, Cand, *Zone))
- return TryCand.Reason != NoCand;
+ return TryCand.Reason != NoCand;
// Fall through to original instruction order.
if (TryCand.SU->NodeNum > Cand.SU->NodeNum) {
@@ -3987,10 +3993,11 @@ void GenericScheduler::pickNodeFromQueue(SchedBoundary &Zone,
// getMaxPressureDelta temporarily modifies the tracker.
RegPressureTracker &TempTracker = const_cast<RegPressureTracker&>(RPTracker);
- LLVM_DEBUG( dbgs() << "Live regs: ";
- for (auto R : LiveRegs)
- dbgs() << "%" << Register::virtReg2Index(R) << ", ";
- dbgs() << "\n";);
+ LLVM_DEBUG(dbgs() << "Live regs: ";
+ for (auto R
+ : LiveRegs) dbgs()
+ << "%" << Register::virtReg2Index(R) << ", ";
+ dbgs() << "\n";);
ReadyQueue &Q = Zone.Available;
for (SUnit *SU : Q) {
@@ -4202,12 +4209,11 @@ void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
} else {
const LiveInterval &LI = DAG->getLIS()->getInterval(MO.getReg());
LiveQueryResult LRQ =
- LI.Query(DAG->getLIS()->getInstructionIndex(*MI));
- if(!LRQ.valueIn())
+ LI.Query(DAG->getLIS()->getInstructionIndex(*MI));
+ if (!LRQ.valueIn())
LiveRegs.erase(MO.getReg());
}
- }
- else if (MO.readsReg())
+ } else if (MO.readsReg())
LiveRegs.insert(MO.getReg());
}
++NumScheduled;
@@ -4218,24 +4224,23 @@ void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
for (auto *TSU : TreeSUs[SU->NodeNum])
if (!TSU->isScheduled)
NextQueue.insert(TSU);
- LLVM_DEBUG(dbgs() << "Scheduling subtree: ";
- for (auto *NxSU : NextQueue)
- dbgs() << NxSU->NodeNum << " ";
+ LLVM_DEBUG(dbgs() << "Scheduling subtree: "; for (auto *NxSU
+ : NextQueue) dbgs()
+ << NxSU->NodeNum << " ";
dbgs() << "\n";);
}
NextSubtreeSU = nullptr;
}
if (!NextQueue.empty()) {
- assert (NextQueue.count(SU) && "Failed to schedule planned SU.");
+ assert(NextQueue.count(SU) && "Failed to schedule planned SU.");
NextQueue.erase(SU);
#ifndef NDEBUG
const SchedDFSResult *DFSResult = DAG->getDFSResult();
unsigned SUTree = DFSResult->getSubtreeID(SU);
for (const SDep &Pred : SU->Preds) {
const SUnit *PredSU = Pred.getSUnit();
- assert((PredSU->isBoundaryNode() ||
- Pred.getKind() != SDep::Data ||
+ assert((PredSU->isBoundaryNode() || Pred.getKind() != SDep::Data ||
(DFSResult->getSubtreeID(PredSU) == SUTree &&
NextQueue.count(PredSU)) ||
LiveRegs.count(Pred.getReg()) ||
ping - no interest in this, anyone? I think it would be good as a first step to at least know of the current deficiencies in the machine scheduler across targets...
Thanks for your patch. I have a few questions and observations:
Are you running both the pre-ra and post-ra schedulers? Do you see the regression if you run both vs. only running pre-ra?
It sounds like the long-term solution is to address spilling directly. Do you know if the issue is spilling in general, or if it has to do specifically with larger regions?
Is this dynamic instruction count or runtime? Do you see any regressions due to this change in SPEC? What is the difference between the performance with this change vs disabling pre-ra scheduling entirely? I can take a look to see how this change is impacting benchmarks on RISC-V.
On SystemZ, both machine schedulers are run. Post-RA scheduling is done with a target-specific MachineSchedStrategy and HazardRecognizer, which have been running for years, so I wouldn't suspect them. In this case it seems obvious to me that the pre-ra scheduler is to blame given the increase in spilling, which only it can affect. I checked this anyway, and found that there is some additional variation from disabling the post-RA scheduler, but the significant part is definitely the pre-RA pass.
As can be seen in the table I posted, this does indeed seem to be a problem in larger regions. It seems the scheduler has a lot of nodes to pick from and fails to make sensible choices (looking at the DAG), so a lot of unfortunate overlapping results. I am kind of optimistic that this could be fixed by some new heuristic, but I am not sure exactly how.
Runtime (and spill/reload instructions).
On SystemZ, no. This is probably because disabling bigger regions mainly affects this benchmark.
I have seen a few minor regressions by disabling pre-ra scheduling altogether (<5%).
That would be nice - thanks! It would be interesting to see what size limit you end up using, as that may be slightly different (number of target instructions).
// spilling heavily with huge regions (like >350 instructions). This option
// makes any sched region bigger than its value have pre-ra scheduling
// skipped.
cl::opt<unsigned> NoSchedAbove("nosched-above", cl::init(~0U));
Some questions on the definition of the option:

- Should this option name be specific to MachineScheduler (pre-ra)? Maybe misched-nosched-above?
- Should this option have a cl::desc?
- Should it be cl::Hidden or not?
- Should it have the prefix experimental (experimental-xxx) so it is clear that it is experimental?
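If all of those suggestions were applied, the declaration might look something like this (a sketch only; the exact name and description wording are illustrative, not from the patch):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical form of the option with a misched-specific, experimental
// name, a description, and hidden from -help by default.
static cl::opt<unsigned> NoSchedAbove(
    "experimental-misched-nosched-above", cl::Hidden,
    cl::desc("Skip pre-RA scheduling of regions with more instructions "
             "than this value (default: no limit)."),
    cl::init(~0U));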
Summary

I ran some experiments based on #90181. I only ran spec2006 C/C++ benchmarks. The experiments use a value of -nosched-above=350 and compare against the default. In summary, there is no significant change on spec2006 with this option enabled.

Dynamic IC

spec2006int: There was no change in dynamic IC for most benchmarks. 400.perlbench and 403.gcc

spec2006fp: There was no change in dynamic IC for 433.milc, 444.namd, 450.soplex, 470.lbm. 447.deal saw a 0.00003% decrease in dynamic IC and 453.povray saw a 0.0007%

Spills

spec2006int

spec2006fp

Conclusion

I don't mind adding this functionality, especially since it will be behind an option, and that

Some ideas
@michaelmaitland Thank you for taking your time with this! Interesting that on RISC-V this does not seem to matter much, at least generally. It would then be interesting to see if any individual file got better on RISC-V by disabling the scheduler (-mllvm -enable-misched=false). It would be most interesting to see this per scheduled region, or even per function, but any big regression should also be visible in a file comparison. I fully agree that this option is only temporary, if even that. Instead of actually pushing this patch as it is, I will soon add here some experimental pre-RA heuristics that should at least in theory be better for an out-of-order target. I am hoping that this will be evaluated on other OOO targets, and if there indeed is common ground here, some new OOO strategy could be developed.
That's a good question, but I don't have a simple answer to it. One observation is that before scheduling the instructions are in a fairly reasonable order, with long chains of def-use-def-use... but then the scheduler decides to mix those chains up so that many more registers are live in parallel than before. The GenericScheduler is dumb in that it does this after looking at just individual SUnits, and does not consider anything like sub-trees or groups of instructions. It will never do any "lookahead" and finish off a sequence of instructions where it could reduce liveness; instead it may continue to add too much ILP even though the heuristic is related to register pressure. So I would say that in this worst case (cactus) it is simply adding too much ILP in the huge block, while the unscheduled instructions were in a pretty good order to begin with.
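A contrived illustration of the chain-mixing effect (made up for this comment, not taken from cactus):

// Hypothetical example: two independent def-use chains.
double f(double x);

double chains(double a0, double b0) {
  // Source order: chain A completes before chain B starts, so at most
  // two values are live at any point.
  double a1 = f(a0), a2 = f(a1);
  double b1 = f(b0), b2 = f(b1);
  // An ILP-seeking schedule interleaves the chains (a0, b0, a1, b1, ...),
  // keeping both chains live at once and roughly doubling the overlap
  // the register allocator has to color.
  return a2 + b2;
}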
For one thing, it looks at the input order of the instructions and determines the type of register pressure set that is most important to keep track of during scheduling. In this FP-intensive region, the GPRs are incorrectly deemed most important, and so the scheduler causes a lot of FP spills. But as said earlier, fixing this alone did not help much. Maybe the real problem is that GenericScheduler is cycle-exact (with Available/Pending queues) and does resource balancing. The register pressure heuristics seem to be nice add-ons on top of that, but do not do the full job. Therefore, I am thinking of a different scheduler more aimed at OOO targets, where register pressure is the first priority, followed by some ILP / latency optimization that does not cause any spilling.
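The priority order of such a strategy could be sketched roughly like this (hypothetical, with simplified stand-in types; loosely modeled on the tryCandidate structure in this patch):

// Hypothetical OOO-oriented candidate comparison: register pressure
// first, then remaining path length (ILP), then original source order.
struct Candidate {
  int RPScore;      // negative = reduces register pressure
  unsigned Height;  // critical path below this node
  unsigned NodeNum; // position in input order
};

bool preferTryCand(const Candidate &TryCand, const Candidate &Cand) {
  if (TryCand.RPScore != Cand.RPScore)
    return TryCand.RPScore < Cand.RPScore; // less pressure wins
  if (TryCand.Height != Cand.Height)
    return TryCand.Height > Cand.Height;   // longer remaining path wins
  return TryCand.NodeNum > Cand.NodeNum;   // fall back to source order (bottom-up)
}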
As I will post here soon, one part of my experiments has been to not have the Pending queue at all (unless Available grows huge). The motivation is that for a non-cycle-exact target it doesn't matter if e.g. a load has a latency that makes it not fit in the current cycle, so schedule it directly instead and close the live range.
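A minimal model of that idea (simplified stand-in types, not the actual SchedBoundary code; MaxAvailable is a made-up name for the "grows huge" threshold):

#include <vector>

struct SUnit { unsigned ReadyCycle; };

struct Boundary {
  std::vector<SUnit *> Available, Pending;
  unsigned CurrCycle = 0;
  bool NoHazards = true;      // corresponds to -misched-nohazards below
  unsigned MaxAvailable = 32; // made-up fallback threshold

  void releaseNode(SUnit *SU) {
    // With hazard checking disabled, a node that is not cycle-ready is
    // still made available, so e.g. a load can be placed (bottom-up)
    // immediately next to its user to shorten the live range.
    if ((NoHazards && Available.size() < MaxAvailable) ||
        SU->ReadyCycle <= CurrCycle)
      Available.push_back(SU);
    else
      Pending.push_back(SU);
  }
};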
Have you tried setting
I tried it now at your suggestion, but it is actually making things worse.
I have now tried to boil down the more important parts of my experimental results that were kind of successful on SystemZ to something more generic that would hopefully be of interest to other (OOO) targets. This is still experimental, and the major features are therefore controlled with the CL options listed below, in order. My question now is if there is any possibility of collaboration across targets on this? Could this (if used on SystemZ) become a SchedStrategy in MachineScheduler, or would it need to be a SystemZ-specific strategy? See previous comments for motivation. And as always, any feedback is most welcome.

-misched-nohazards (default true): Disables cycle hazards so that even if a predecessor is not ready in the current cycle, the SU is still put in the Available queue instead of in Pending. The idea is to be able to reduce live ranges by e.g. putting a load immediately before its user.

-misched-regpress (default true): Attempts to reduce live range overlapping / spilling. My conclusion here (on SystemZ / SPEC) has been that the input (unscheduled) order is often quite good already from this perspective. The scheduler is probably most likely

The DFS trees can also help additionally here by giving a bit of "lookahead". Each SU is (with this patch) mapped to a set of registers used and defined above it in its DFS subtree. If the SU can be scheduled as a unit with its subtree predecessors without causing any new registers to become live, this is done if it will close a live range by scheduling a def of a live register (a rough standalone sketch of this check follows at the end of this comment). I have found that subtrees of sizes 3-5 seem to work best. Regardless of whether the comparison between Cand and TryCand is done with just the SUnits or also using the DFS subtrees, a set of live registers is used to get the actual current effects. I have not yet tried to merge this with preexisting code.

I have tried bi-directional scheduling with the idea of scheduling e.g. a store immediately after its defining instruction going top-down - kind of the reverse of the bottom-up "load immediately before use". Interestingly, this was not as successful as expected, so I have reverted to doing this only bottom-up. The stores are now handled in a simpler way: given that source order is typically good for register pressure, and that a store can't just be pushed up in the list due to latency considerations of its predecessors, I have found a less aggressive approach: put the store after its predecessor as seen in the input order. This has the effect of moving up a store if it was for some reason put further down in the input. This is done not just for a memory store but for any instruction that kills a register without defining one. Todo: this should probably only kick in when useful, but currently it is done regardless of the current register pressure.

-misched-dfs (default true): Computes the DFSResult for use per the above.

-dfs-size (default 4): The max size of DFS subtrees that may be scheduled as a unit.

-misched-heightheur (default true): After the register pressure heuristics, an attempt to increase ILP by considering e.g. the height/depth of SUs is made. The idea is to do this without causing increased spilling, but the balance between the two still has room for improvement.

-misched-heightifwfac (default 3): A rough heuristic to only apply the height heuristic (ILP) on DAGs with a "width factor" less than this value (e.g. a DAG with 30 SUnits and height 15 has width factor 30/15 = 2, which is below the default of 3, so the heuristic applies). The idea is to not become too aggressive with this on regions with many SUs that are mostly in parallel. Only if the DAG is quite "high and narrow" should this be done.
This could likely be improved.

-misched-ooo: Enables this experimental scheduling (if not passed, the patch is NFC).

-no-ooosched-below: If a non-zero value is passed, normal scheduling is done for regions with fewer instructions than this value. This can be interesting as a strategy can have different levels of success on small/huge regions. Originally, before the machine scheduler got enabled, SystemZ used to run the "Bottom-up register reduction list scheduling" for ISel scheduling. I tried this setup again now but found that doing so does not help cactus. Generally however, across benchmarks, this gives the least spilling for regions of fewer than 100 instructions. On some bigger regions though (>7500 instructions), "list-burr" actually causes more spilling than "source". So it is not just the GenericScheduler that messes this huge region up. In the range of 100 to 1000 instructions, the current behavior ("source" + mischeduler) seems to be beneficial, but it cannot handle huge regions.

This version is not a great improvement on SystemZ compared to just disabling the MachineScheduler pre-ra entirely. I know there is a little room for further performance improvements, but I am not sure that would be enough to motivate a SystemZ-specific scheduling strategy. Ideally other targets would find this interesting and begin to use/develop it.

@michaelmaitland Did you try disabling the machine scheduler and comparing spilling? What happens if you try this (with -misched-ooo -misched-heightheur=false)?

@atrick Any comments on my use of the DFS subtrees?
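Here is the promised standalone sketch of the subtree check described above (simplified types and a hypothetical helper name; the actual patch keeps TreeDefs/TreeUses per SU and feeds a PressureDiff instead of returning a bool):

#include <set>

using Register = unsigned;

struct SubtreeInfo {
  std::set<Register> TreeDefs; // regs defined above the SU in its subtree
  std::set<Register> TreeUses; // regs used but not defined in the subtree
};

// Bottom-up: scheduling the subtree as a unit is attractive if it closes
// a live range (defines a live reg it doesn't also use) without opening
// new ones (all of its uses are already live).
bool profitableSubtree(const SubtreeInfo &T,
                       const std::set<Register> &LiveRegs) {
  for (Register R : T.TreeUses)
    if (!LiveRegs.count(R))
      return false; // would make a new register live
  for (Register R : T.TreeDefs)
    if (LiveRegs.count(R) && !T.TreeUses.count(R))
      return true;  // closes a live range
  return false;
}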
This is a typical problem with pre-ra scheduling. The pressure looks fine before scheduling. The list scheduler backs itself into a corner trying to expose ILP before it reaches the pressure limit, then can't back itself out. It would be great if DFS subtrees can handle this case. You may need to dig into the design and see if it can recognize good subtrees on cactus, and tweak its own heuristics, which can be sensitive. You would have to decide whether to always enable subtrees for your target (and pay any compile-time cost), or somehow identify DAG patterns where it's worth enabling subtrees. In general, it would be awesome if the scheduler could recognize certain classes of DAGs and set the heuristics accordingly before it starts scheduling. So in this case, it should know to try to keep register pressure below the machine's limit because it will need those registers later. GenericScheduler's strategy is to avoid perturbing the schedule unless it sees an opportunity, but that can still fail with large subtrees like this. Another approach would be a 2-pass scheduler. The first pass would expose ILP. The second pass would then know how far register pressure had been exceeded and would reschedule to avoid spills. The generic scheduler doesn't do that because it's trying to be compile-time friendly and trying not to perturb the original instruction schedule.
I don't think this branch will actually land, although there may be some ideas worth trying. I now have another PR in progress that will replace this one: see #135076. @michaelmaitland Maybe you would like to try the new SystemZPreRASchedStrategy and see if it works on your target as well?
CC @mshockwave |
I found a big regression on SPEC/cactus which comes from enabling the machine scheduler before register allocation - cactus ran 8% faster by just disabling the pre-ra machine scheduler. The reason seems to be that it produces a schedule that leads to heavily increased spilling during register allocation. I have experimented quite a bit with the scheduler heuristics but do not yet have an overall solution to this. This is merely a first step that makes it possible to experimentally disable scheduling of bigger regions.
This is actually relevant not only for SystemZ (@greened @efriedma-quic @asb @ecnelises @uweigand @dominik-steenken @arsenm @michaelmaitland):
With this setting (-nosched-above=350), I see a 9% speedup on cactus on SystemZ / z16. I think it would be interesting to try this on at least aarch64 and PowerPC as well, given the reduced spilling. Of course, maybe some number other than 350 would be optimal, but it looks like a good start. Would anyone be willing to do this?
This is a screenshot of some of my own personal experimental measurements on SPEC17/SystemZ:
I have, during scheduling, counted the register overlaps before and after the scheduling of a region, both for FP and GPR registers (on SystemZ). This is a summary which, from left to right, shows the total number of regions of a particular size (interval), then in parentheses the number of regions where the resulting register overlaps were different (the size ranges are first in steps of 10, and then in bigger steps; only size intervals with something changing are printed, and some bigger ones are omitted). The FP / GPR columns show the overall results - green / negative is good, as overlaps decreased, while red shows the scheduler strategy has increased overlaps.
If overlaps increase, that means register pressure will increase and more spilling will typically result. As can be seen, the results are only satisfactory for the upper right hand corner - the GPR register pressure in small (common) regions. Bigger regions (rare) with FP are really bad. This overlap measurement unit is sort of an integral (sum) of all the overlaps across the region at the point of each instruction.
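For clarity, my reading of that measurement as a standalone sketch (hypothetical helper, not the actual measurement code):

#include <set>
#include <vector>

using Register = unsigned;

// Sum, over every instruction point in the region, of the number of
// live registers overlapping there - an "integral" of liveness over the
// region, comparable before and after scheduling.
unsigned overlapSum(const std::vector<std::set<Register>> &LiveAtPoint) {
  unsigned Sum = 0;
  for (const auto &Live : LiveAtPoint)
    Sum += Live.size();
  return Sum;
}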
A first next step might be to disable big regions non-experimentally for now on SystemZ. In the long run, I am hoping to have better general heuristics that works also on big regions. Some thoughts:
I was hoping that ilp-min would remedy this situation, but to my surprise it did not. It is actually a lot worse than the default strategy overall on SPEC/SystemZ, as well as on the cactus file, in terms of register spilling.
The default MachineSchedStrategy does many things that make sense mostly on an in-order target, it seems to me. An out-of-order target would primarily like to reduce spilling, and secondarily also do things like critical path / height adjustments and try to balance out usages of processor resources.
The default scheduler makes cycle-based decisions and puts an SU in Pending if it is not cycle-ready, even though it would be possible to schedule it. The alternative would be to put all available SUs in Available, and thus make it possible to reduce live ranges by e.g. scheduling a load immediately next to its user.
It seems to me that there should be an OOO strategy that should never increase spilling (hopefully improve it), while increasing ILP in a sensible manner. It shouldn't worry too much about cycle exact scheduling or processor resources balancing. Has anyone else worked on this also?
While building the sched graph, it tracks the register pressures and decides what should be the critical pressure set during scheduling. The cactus example is an FP-intensive region, but in the given input order the GPR registers are instead deemed the more important during scheduling (this is however not the real problem in this case).
Overall, SPEC/SystemZ shows an 18% increase in spilled FP live ranges with the machine scheduler enabled, while the rest (GPR) stays about the same. This is too much to ignore, of course.
I have done quite a bit of experimenting with this and have been able to get even some additional improvement on cactus/SystemZ with a simpler scheduling strategy (12%). It goes bottom-up, always schedules a load lower, and pushes stores up a bit, to decrease the overlapping of registers. In addition, it adds a height/depth heuristic which gives the additional performance improvement. I did try bi-directional extensively, but to my surprise that was not helpful. This is quite tricky to get right, it seems, and I would be happy to collaborate with anyone else interested in this.