
[AMDGPU][Scheduler] Refactor VGPR rematerialization during scheduling #118722


Closed
wants to merge 13 commits into from

Conversation

lucas-rami
Contributor

AMDGPU scheduler's PreRARematStage attempts to increase function occupancy w.r.t. VGPR usage by rematerializing trivial VGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one by one, recomputing occupancy in all affected regions after each sink to determine whether overall occupancy has increased. If it has, changes are committed to the scheduler's state; otherwise the IR modifications are reverted and the scheduling stage gives up. In both cases, the stage currently performs repeated queries for up-to-date occupancy estimates and copies state so that sinking decisions can be reversed when occupancy turns out not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

  1. Analysis and collection (canIncreaseOccupancy). The number of VGPRs that must be saved to increase occupancy in each region at minimum occupancy is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to increase function occupancy (according to slightly optimistic heuristics). If there aren't enough such instructions, the scheduling stage stops here.
  2. Rematerialization (sinkTriviallyRematInsts). Instructions collected in the first step are rematerialized one by one. The scheduler's state can now be updated directly, since the occupancy analysis has already been done and no state will need to be rolled back. Register pressure for impacted regions is recomputed only once, as opposed to at every sinking decision.

This is the first in a series of changes to improve rematerialization during scheduling. The immediate next steps are to add support for SGPR/AGPR rematerialization, then potentially relax some constraints on instructions considered eligible for rematerialization.


github-actions bot commented Dec 5, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Dec 5, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

(PR description repeated verbatim above.)


Patch is 43.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118722.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+214-197)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+40-20)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.cpp (+4)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp (+13)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h (+7)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (+141-4)
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 57f517bfba0ebb..e43014efa08c5a 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -25,8 +25,13 @@
 
 #include "GCNSchedStrategy.h"
 #include "AMDGPUIGroupLP.h"
+#include "GCNRegPressure.h"
 #include "SIMachineFunctionInfo.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/STLExtras.h"
 #include "llvm/CodeGen/RegisterClassInfo.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Support/ErrorHandling.h"
 
 #define DEBUG_TYPE "machine-scheduler"
 
@@ -945,20 +950,20 @@ bool PreRARematStage::initGCNSchedStage() {
     return false;
 
   const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
-  // Check maximum occupancy
+  // Rematerialization will not help if occupancy is LDS-limited.
   if (ST.computeOccupancy(MF.getFunction(), MFI.getLDSSize()) ==
       DAG.MinOccupancy)
     return false;
 
   // FIXME: This pass will invalidate cached MBBLiveIns for regions
-  // inbetween the defs and region we sinked the def to. Cached pressure
-  // for regions where a def is sinked from will also be invalidated. Will
-  // need to be fixed if there is another pass after this pass.
+  // inbetween the defs and region we sinked the def to. Will need to be fixed
+  // if there is another pass after this pass.
   assert(!S.hasNextStage());
 
-  collectRematerializableInstructions();
-  if (RematerializableInsts.empty() || !sinkTriviallyRematInsts(ST, TII))
+  std::vector<RematInstruction> RematInstructions;
+  if (!canIncreaseOccupancy(RematInstructions))
     return false;
+  sinkTriviallyRematInsts(RematInstructions, ST, TII);
 
   LLVM_DEBUG(
       dbgs() << "Retrying function scheduling with improved occupancy of "
@@ -1254,8 +1259,7 @@ GCNSchedStage::getScheduleMetrics(const std::vector<SUnit> &InputSchedule) {
 #ifndef NDEBUG
   LLVM_DEBUG(
       printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
+      dbgs() << "\n\t" << "Metric: "
              << (SumBubbles
                      ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
                      : 1)
@@ -1290,8 +1294,7 @@ GCNSchedStage::getScheduleMetrics(const GCNScheduleDAGMILive &DAG) {
 #ifndef NDEBUG
   LLVM_DEBUG(
       printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
+      dbgs() << "\n\t" << "Metric: "
              << (SumBubbles
                      ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
                      : 1)
@@ -1339,8 +1342,7 @@ bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter) {
       dbgs()
       << "\n\t      *** In shouldRevertScheduling ***\n"
       << "      *********** BEFORE UnclusteredHighRPStage ***********\n");
-  ScheduleMetrics MBefore =
-      getScheduleMetrics(DAG.SUnits);
+  ScheduleMetrics MBefore = getScheduleMetrics(DAG.SUnits);
   LLVM_DEBUG(
       dbgs()
       << "\n      *********** AFTER UnclusteredHighRPStage ***********\n");
@@ -1467,231 +1469,246 @@ void GCNSchedStage::revertScheduling() {
   DAG.Regions[RegionIdx] = std::pair(DAG.RegionBegin, DAG.RegionEnd);
 }
 
-void PreRARematStage::collectRematerializableInstructions() {
+/// Allows to easily filter for this stage's debug output.
+#define RA_DEBUG(X) LLVM_DEBUG(dbgs() << "[PreRARemat] "; X;)
+
+bool PreRARematStage::canIncreaseOccupancy(
+    std::vector<RematInstruction> &RematInstructions) {
   const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(DAG.TRI);
-  for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
-    Register Reg = Register::index2VirtReg(I);
-    if (!DAG.LIS->hasInterval(Reg))
-      continue;
 
-    // TODO: Handle AGPR and SGPR rematerialization
-    if (!SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
-        !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
-      continue;
+  RA_DEBUG(dbgs() << "Collecting rematerializable instructions\n");
 
-    MachineOperand *Op = DAG.MRI.getOneDef(Reg);
-    MachineInstr *Def = Op->getParent();
-    if (Op->getSubReg() != 0 || !isTriviallyReMaterializable(*Def))
+  // Maps optimizable regions (i.e., regions at minimum and VGPR-limited
+  // occupancy) to the numbers of VGPRs that must be deducted from their maximum
+  // VGPR pressure for their occupancy to be increased by one.
+  DenseMap<unsigned, unsigned> OptRegions;
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    if (!DAG.RegionsWithMinOcc[I])
       continue;
+    GCNRegPressure &RP = DAG.Pressure[I];
 
-    MachineInstr *UseI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
-    if (Def->getParent() == UseI->getParent())
+    // We do not rematerialize SGPR-defining regions yet so do not bother
+    // optimizing regions whose occupancy is SGPR-limited.
+    if (ST.getOccupancyWithNumSGPRs(RP.getSGPRNum()) == DAG.MinOccupancy)
       continue;
 
-    // We are only collecting defs that are defined in another block and are
-    // live-through or used inside regions at MinOccupancy. This means that the
-    // register must be in the live-in set for the region.
-    bool AddedToRematList = false;
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      auto It = DAG.LiveIns[I].find(Reg);
-      if (It != DAG.LiveIns[I].end() && !It->second.none()) {
-        if (DAG.RegionsWithMinOcc[I]) {
-          RematerializableInsts[I][Def] = UseI;
-          AddedToRematList = true;
-        }
-
-        // Collect regions with rematerializable reg as live-in to avoid
-        // searching later when updating RP.
-        RematDefToLiveInRegions[Def].push_back(I);
-      }
-    }
-    if (!AddedToRematList)
-      RematDefToLiveInRegions.erase(Def);
-  }
-}
-
-bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
-                                              const TargetInstrInfo *TII) {
-  // Temporary copies of cached variables we will be modifying and replacing if
-  // sinking succeeds.
-  SmallVector<
-      std::pair<MachineBasicBlock::iterator, MachineBasicBlock::iterator>, 32>
-      NewRegions;
-  DenseMap<unsigned, GCNRPTracker::LiveRegSet> NewLiveIns;
-  DenseMap<unsigned, GCNRegPressure> NewPressure;
-  BitVector NewRescheduleRegions;
-  LiveIntervals *LIS = DAG.LIS;
-
-  NewRegions.resize(DAG.Regions.size());
-  NewRescheduleRegions.resize(DAG.Regions.size());
-
-  // Collect only regions that has a rematerializable def as a live-in.
-  SmallSet<unsigned, 16> ImpactedRegions;
-  for (const auto &It : RematDefToLiveInRegions)
-    ImpactedRegions.insert(It.second.begin(), It.second.end());
-
-  // Make copies of register pressure and live-ins cache that will be updated
-  // as we rematerialize.
-  for (auto Idx : ImpactedRegions) {
-    NewPressure[Idx] = DAG.Pressure[Idx];
-    NewLiveIns[Idx] = DAG.LiveIns[Idx];
+    unsigned NumVGPRs = RP.getVGPRNum(ST.hasGFX90AInsts());
+    unsigned NumToIncreaseOcc = ST.getNumVGPRsToIncreaseOccupancy(NumVGPRs);
+    OptRegions.insert({I, NumToIncreaseOcc});
+    RA_DEBUG(dbgs() << "Region " << I << " has min. occupancy: decrease by "
+                    << NumToIncreaseOcc << " VGPR(s) to improve occupancy\n");
   }
-  NewRegions = DAG.Regions;
-  NewRescheduleRegions.reset();
+  if (OptRegions.empty())
+    return false;
 
-  DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
-  bool Improved = false;
-  for (auto I : ImpactedRegions) {
-    if (!DAG.RegionsWithMinOcc[I])
-      continue;
+  // Tracks estimated rematerialization gains (i.e., reduction in RP) for
+  // this instruction in each optimizable region.
+  auto ReduceRPInRegion = [&](auto OptIt, unsigned I,
+                              LaneBitmask Mask) -> bool {
+    auto NumRegs = SIRegisterInfo::getNumCoveredRegs(Mask);
+    unsigned &RPExcess = OptIt->getSecond();
+    if (NumRegs >= RPExcess) {
+      OptRegions.erase(I);
+      LLVM_DEBUG(dbgs() << "sinking increases occupancy in region " << I
+                        << "\n");
+    } else {
+      RPExcess -= NumRegs;
+      LLVM_DEBUG(dbgs() << "sinking reduces excess pressure in region " << I
+                        << " by " << NumRegs << " (" << RPExcess << " left)\n");
+    }
+    return OptRegions.empty();
+  };
+
+  // We need up-to-date live-out info. to query live-out register masks in
+  // regions containing rematerializable instructions.
+  DAG.RegionLiveOuts.buildLiveRegMap();
+
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    auto Region = DAG.Regions[I];
+    for (auto MI = Region.first; MI != Region.second; ++MI) {
+      // We only support instructions with at least one register (but
+      // non-subregister) operand.
+      MachineInstr &DefMI = *MI;
+      if (DefMI.isBundle() || !DefMI.getNumOperands() ||
+          !DefMI.getOperand(0).isReg() || DefMI.getOperand(0).getSubReg())
+        continue;
 
-    Improved = false;
-    int VGPRUsage = NewPressure[I].getVGPRNum(ST.hasGFX90AInsts());
-    int SGPRUsage = NewPressure[I].getSGPRNum();
+      // We only support rematerializing virtual VGPRs with one definition and
+      // one use.
+      Register Reg = DefMI.getOperand(0).getReg();
+      if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
+          !SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
+          !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
+        continue;
 
-    // TODO: Handle occupancy drop due to AGPR and SGPR.
-    // Check if cause of occupancy drop is due to VGPR usage and not SGPR.
-    if (ST.getOccupancyWithNumSGPRs(SGPRUsage) == DAG.MinOccupancy)
-      break;
+      // The instruction must be trivially rematerializable and have no virtual
+      // register use.
+      if (!isTriviallyReMaterializable(DefMI))
+        continue;
 
-    // The occupancy of this region could have been improved by a previous
-    // iteration's sinking of defs.
-    if (NewPressure[I].getOccupancy(ST) > DAG.MinOccupancy) {
-      NewRescheduleRegions[I] = true;
-      Improved = true;
-      continue;
-    }
+      // We only care to rematerialize the instruction if its single use is in a
+      // different block.
+      MachineInstr *UseMI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
+      if (DefMI.getParent() == UseMI->getParent())
+        continue;
 
-    // First check if we have enough trivially rematerializable instructions to
-    // improve occupancy. Optimistically assume all instructions we are able to
-    // sink decreased RP.
-    int TotalSinkableRegs = 0;
-    for (const auto &It : RematerializableInsts[I]) {
-      MachineInstr *Def = It.first;
-      Register DefReg = Def->getOperand(0).getReg();
-      TotalSinkableRegs +=
-          SIRegisterInfo::getNumCoveredRegs(NewLiveIns[I][DefReg]);
-    }
-    int VGPRsAfterSink = VGPRUsage - TotalSinkableRegs;
-    unsigned OptimisticOccupancy = ST.getOccupancyWithNumVGPRs(VGPRsAfterSink);
-    // If in the most optimistic scenario, we cannot improve occupancy, then do
-    // not attempt to sink any instructions.
-    if (OptimisticOccupancy <= DAG.MinOccupancy)
-      break;
+      RA_DEBUG(dbgs() << "In region " << I << ", instruction " << DefMI
+                      << " is rematerializable with single use " << *UseMI);
+      auto &Remat = RematInstructions.emplace_back(&DefMI, I, UseMI);
+
+      bool RematUseful = false;
+      if (auto It = OptRegions.find(I); It != OptRegions.end()) {
+        // Optimistically consider that moving the instruction out of its
+        // defining region will reduce RP in the latter; this assumes that
+        // maximum RP in the region is reached somewhere between the defining
+        // instruction and the end of the region.
+        RA_DEBUG(dbgs() << "  Instruction's defining region is optimizable: ");
+        RematUseful = true;
+        auto RegMask = DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg];
+        if (ReduceRPInRegion(It, I, RegMask))
+          return true;
+      }
 
-    unsigned ImproveOccupancy = 0;
-    SmallVector<MachineInstr *, 4> SinkedDefs;
-    for (auto &It : RematerializableInsts[I]) {
-      MachineInstr *Def = It.first;
-      MachineBasicBlock::iterator InsertPos =
-          MachineBasicBlock::iterator(It.second);
-      Register Reg = Def->getOperand(0).getReg();
-      // Rematerialize MI to its use block. Since we are only rematerializing
-      // instructions that do not have any virtual reg uses, we do not need to
-      // call LiveRangeEdit::allUsesAvailableAt() and
-      // LiveRangeEdit::canRematerializeAt().
-      TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
-                         Def->getOperand(0).getSubReg(), *Def, *DAG.TRI);
-      MachineInstr *NewMI = &*std::prev(InsertPos);
-      LIS->InsertMachineInstrInMaps(*NewMI);
-      LIS->removeInterval(Reg);
-      LIS->createAndComputeVirtRegInterval(Reg);
-      InsertedMIToOldDef[NewMI] = Def;
-
-      // Update region boundaries in scheduling region we sinked from since we
-      // may sink an instruction that was at the beginning or end of its region
-      DAG.updateRegionBoundaries(NewRegions, Def, /*NewMI =*/nullptr,
-                                 /*Removing =*/true);
-
-      // Update region boundaries in region we sinked to.
-      DAG.updateRegionBoundaries(NewRegions, InsertPos, NewMI);
-
-      LaneBitmask PrevMask = NewLiveIns[I][Reg];
-      // FIXME: Also update cached pressure for where the def was sinked from.
-      // Update RP for all regions that has this reg as a live-in and remove
-      // the reg from all regions as a live-in.
-      for (auto Idx : RematDefToLiveInRegions[Def]) {
-        NewLiveIns[Idx].erase(Reg);
-        if (InsertPos->getParent() != DAG.Regions[Idx].first->getParent()) {
-          // Def is live-through and not used in this block.
-          NewPressure[Idx].inc(Reg, PrevMask, LaneBitmask::getNone(), DAG.MRI);
+      for (unsigned LVRegion = 0; LVRegion != E; ++LVRegion) {
+        // We are only collecting regions in which the register is a live-in
+        // (and may be live-through).
+        auto It = DAG.LiveIns[LVRegion].find(Reg);
+        if (It == DAG.LiveIns[LVRegion].end() || It->second.none())
+          continue;
+        Remat.LiveInRegions.insert(LVRegion);
+        RA_DEBUG(dbgs() << "  Def is live-in in region " << LVRegion << ": ");
+
+        // Account for the reduction in RP due to the rematerialization in an
+        // optimizable region in which the defined register is a live-in. This
+        // is exact for live-through region but optimistic in the using region,
+        // where RP is actually reduced only if maximum RP is reached somewhere
+        // between the beginning of the region and the rematerializable
+        // instruction's use.
+        if (auto It = OptRegions.find(LVRegion); It != OptRegions.end()) {
+          RematUseful = true;
+          if (ReduceRPInRegion(It, LVRegion, DAG.LiveIns[LVRegion][Reg]))
+            return true;
         } else {
-          // Def is used and rematerialized into this block.
-          GCNDownwardRPTracker RPT(*LIS);
-          auto *NonDbgMI = &*skipDebugInstructionsForward(
-              NewRegions[Idx].first, NewRegions[Idx].second);
-          RPT.reset(*NonDbgMI, &NewLiveIns[Idx]);
-          RPT.advance(NewRegions[Idx].second);
-          NewPressure[Idx] = RPT.moveMaxPressure();
+          LLVM_DEBUG(dbgs() << "unoptimizable region\n");
         }
       }
 
-      SinkedDefs.push_back(Def);
-      ImproveOccupancy = NewPressure[I].getOccupancy(ST);
-      if (ImproveOccupancy > DAG.MinOccupancy)
-        break;
+      // If the instruction is not a live-in or live-out in any optimizable
+      // region then there is no point in rematerializing it.
+      if (!RematUseful) {
+        RematInstructions.pop_back();
+        RA_DEBUG(
+            dbgs()
+            << "  No impact on any optimizable region, dropping instruction\n");
+      }
     }
+  }
 
-    // Remove defs we just sinked from all regions' list of sinkable defs
-    for (auto &Def : SinkedDefs)
-      for (auto TrackedIdx : RematDefToLiveInRegions[Def])
-        RematerializableInsts[TrackedIdx].erase(Def);
+  RA_DEBUG(dbgs() << "Cannot increase occupancy through rematerialization\n");
+  return false;
+}
 
-    if (ImproveOccupancy <= DAG.MinOccupancy)
-      break;
+void PreRARematStage::sinkTriviallyRematInsts(
+    ArrayRef<RematInstruction> RematInstructions, const GCNSubtarget &ST,
+    const TargetInstrInfo *TII) {
+  // Collect regions whose live-ins or register pressure will change due to
+  // rematerialization, and map those whose maximum RP we need to fully
+  // recompute at the end to true.
+  SmallDenseMap<unsigned, bool> ImpactedRegions;
+  // Maps rematerialized instuctions to the one they were rematerialized from.
+  DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
+  LiveIntervals *LIS = DAG.LIS;
 
-    NewRescheduleRegions[I] = true;
-    Improved = true;
+  for (const RematInstruction &Remat : RematInstructions) {
+    MachineInstr *DefMI = Remat.RematMI;
+    MachineBasicBlock::iterator InsertPos(Remat.UseMI);
+    Register Reg = DefMI->getOperand(0).getReg();
+
+    // Rematerialize MI to its use block. Since we are only rematerializing
+    // instructions that do not have any virtual reg uses, we do not need to
+    // call LiveRangeEdit::allUsesAvailableAt() and
+    // LiveRangeEdit::canRematerializeAt().
+    TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
+                       DefMI->getOperand(0).getSubReg(), *DefMI, *DAG.TRI);
+    MachineInstr *NewMI = &*std::prev(InsertPos);
+    LIS->InsertMachineInstrInMaps(*NewMI);
+    LIS->removeInterval(Reg);
+    LIS->createAndComputeVirtRegInterval(Reg);
+    InsertedMIToOldDef[NewMI] = DefMI;
+
+    // Update region boundaries in scheduling region we sinked from since we
+    // may sink an instruction that was at the beginning or end of its region.
+    DAG.updateRegionBoundaries(DAG.Regions, DefMI, /*NewMI =*/nullptr,
+                               /*Removing =*/true);
+
+    // Update region boundaries in region we sinked to.
+    DAG.updateRegionBoundaries(DAG.Regions, InsertPos, NewMI);
+
+    // Collect all regions impacted by the rematerialization and update their
+    // live-in/RP information.
+    for (unsigned I : Remat.LiveInRegions) {
+      ImpactedRegions.insert({I, false});
+      GCNRPTracker::LiveRegSet &RegionLiveIns = DAG.LiveIns[I];
+
+      // The register is no longer a live-in in all regions but the one that
+      // contains the single use. In live-through regions, maximum register
+      // pressure decreases predictably so we can directly update it. In the
+      // using region, maximum register pressure may or may not decrease, so we
+      // will mark it for re-computation after all materializations.
+      RegionLiveIns.erase(Reg);
+      if (Remat.UseMI->getParent() != DAG.Regions[I].first->getParent()) {
+        // Register is live-through and not used in this block.
+        LaneBitmask PrevMask = RegionLiveIns[Reg];
+        DAG.Pressure[I].inc(Reg, PrevMask, LaneBitmask::getNone(), DAG.MRI);
+      } else {
+        // Register is used in this block.
+        ImpactedRegions[I] = true;
+      }
+    }
+
+    // RP in the region from which the instruction was rematerialized may or may
+    // not change, recompute-it fully later.
+    ImpactedRegions[Remat.DefRegion] = true;
   }
 
-  if (!Improved) {
-    // Occupancy was not improved for all regions that were at MinOccupancy.
-    // Undo sinking and remove newly rematerialized instructions.
-    for (auto &Entry : InsertedMIToOldDef) {
-      MachineInstr *MI = Entry.first;
-      MachineInstr *OldMI = Entry.second;
-      Register Reg = MI->getOperand(0).getReg();
-      LIS->RemoveMachineInstrFromMaps(*MI);
-      MI->eraseFromParent();
-      OldMI->clearRegisterDeads(Reg);
-      LIS->removeInterval(Reg);
-      LIS->createAndComputeVirtRegInterval(Reg);
+  // All regions impacted by at least one rematerialization must be rescheduled.
+  BitVector NewRescheduleRegions(DAG.Regions.size());
+  for (auto &[I, RecomputeRP] : ImpactedRegions) {
+    NewRescheduleRegions[I] = true;
+    DAG.MBBLiveIns.erase(DAG.Regions[I].first->getParent());
+
+    // Recompute maximum RP in regions in which at least one instruction was
+    // rematerialized from or to.
+    if (Recom...
[truncated]


github-actions bot commented Dec 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Comment on lines +439 to +441
/// Single use of the rematerializable instruction's defined register,
/// located in a different block.
MachineInstr *UseMI;
Contributor

I thought the scheduling was per block only?

Contributor Author

AFAIU it is per region, and in the case of a rematerialization both the region where the instruction was rematerialized from and the one it is rematerialized to will be rescheduled after this scheduling stage takes place.

In the future we could improve rematerialization to work even when the def and use are in the same block but not the same region, which from what I understand can happen and makes sense.

// non-subregister) operand.
MachineInstr &DefMI = *MI;
if (DefMI.isBundle() || !DefMI.getNumOperands() ||
!DefMI.getOperand(0).isReg() || DefMI.getOperand(0).getSubReg())
Contributor

Rematerializing undef subregister defs should work. Also does this need to verify that operand 0 is a def? But this is repeating isReallyTriviallyReMaterializeable's logic anyway, most of these pre-checks should be redundant

Contributor Author

Thanks for the catches. We indeed don't need to check that operand 0 is a def here as it is already implied by isTriviallyReMaterializable. I now allow undef subregister defs as well (see test_occ_7_sink_one_def_of_undef_subreg_for_8).

However, I am now forced to do something that seems slightly weird to me during rematerialization; I have to explicitly set the subreg on the new MI's def after calling TII.reMaterialize even though the subreg is passed to the method already (if I don't do this I hit an assert in DAG.LIS->createAndComputeVirtRegInterval(Reg) at the end). Not sure this is the correct way to do this.
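The workaround described can be sketched roughly as follows. This is an illustrative fragment built from the LLVM-internal APIs named in this thread, not compilable on its own; SubReg stands for the original def's subregister index, and the surrounding variables (InsertPos, DefMI, DAG) are assumed from the patch context:

```cpp
// Sketch only: rematerialize, then explicitly re-apply the original subreg
// index on the new def, since (per this thread) the cloned def may
// otherwise end up with a different, non-intersecting subregister.
TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg, SubReg, *DefMI,
                   *DAG.TRI);
MachineInstr *NewMI = &*std::prev(InsertPos);
NewMI->getOperand(0).setSubReg(SubReg); // the explicit fix-up in question
DAG.LIS->InsertMachineInstrInMaps(*NewMI);
DAG.LIS->removeInterval(Reg);
DAG.LIS->createAndComputeVirtRegInterval(Reg); // asserts without the fix-up
```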

Contributor

The subregister should survive the clone and substitute. What assert?

Contributor Author

This one in LiveInterval::computeSubRangeUndefs. Looking at the in-flight MIR before explicitly setting the subreg after rematerialization, it seems that the NewMI's def reg is being assigned a different, non-intersecting subregister than the original one, which leads to the two masks tested there being completely unrelated.

In the test_occ_7_sink_one_def_of_undef_subreg_for_8 test for example, after calling SIInstrInfo::reMaterialize but before setting the subreg explicitly, the new MI uses sub2 instead of sub1.

[PreRARemat] Rematerializing undef %32.sub1:vreg_64_align2 = V_MOV_B32_e32 23, implicit $exec
[PreRARemat] New MI:         undef %32.sub2:vreg_64_align2 = V_MOV_B32_e32 23, implicit $exec

@lucas-rami lucas-rami requested a review from jrbyrnes December 5, 2024 15:02
Comment on lines +1715 to +1719
// Even though TargetInstrInfo::isReallyTriviallyReMaterializable already
// ensures that the instruction has no virtual register uses,
// SIInstrInfo::isReallyTriviallyReMaterializable may consider an instruction
// rematerializable and return before calling its parent's method, so we need
// to double-check here.
Contributor

Seems like a bug? All of these rematerialize functions are confusing

Contributor Author

@lucas-rami lucas-rami Dec 6, 2024

It is a bit confusing to me as well. Our SIInstrInfo override considers that rematerializing VOP{1,2,3} instructions with virtual register uses is trivial, which I don't really agree with. Removing the override in favor of its parent's method breaks many unit tests (58 files report errors) however, so if that's ok I will leave this as is for now and investigate this further in the future.

Contributor

Our SIInstrInfo override considers that rematerializing VOP{1,2,3} instructions with virtual register uses is trivial, which I don't really agree with

I think this is one of those cases that was supposed to be moved up to the generic implementation. The actual uses consider liveness, so it shouldn't be baked into the low level check

Contributor

I think 08d7eec never got finished / reapplied

@jrbyrnes (Contributor) left a comment

The analysis doesn't prove that we will change occupancy through rematerialization -- I think this is a complex problem involving the live regs at the rematerialization point, which may change with scheduling. Because of this, we may want to keep the rollback functionality?

for (auto [Remat, NewMI] : llvm::zip_equal(RematInstructions, NewMIs)) {
// Remove rematerialized instruction from BBLiveInMap since we are sinking
// it from its MBB. Also remove it from live intervals and from its MBB.
MachineInstr *OldMI = Remat.RematMI;
DAG.BBLiveInMap.erase(OldMI);
Contributor

I don't think this is maintaining the map the way we want.

Contributor Author

After taking a closer look, it's probably not, indeed (and the same may apply to MBBLiveIns). Working on a fix.

Contributor Author

Spent some more time looking at this, and it seems neither BBLiveInMap nor MBBLiveIns is accessed after sinking instructions. Their only uses are in computeBlockPressure, which is only called by other scheduling stages.

However, as I was trying to understand the exact purpose of BBLiveInMap, I noticed that while GCNScheduleDAGMILive::getRegionLiveInMap seems to want to compute the live register set at the beginning of each block's initial region, the BB pointer is never updated. It seems like the update should be inside the outer do/while loop.

auto I = Regions.rbegin(), E = Regions.rend();
auto *BB = I->first->getParent();
do {
auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
RegionFirstMIs.push_back(MI);
do {
++I;
} while (I != E && I->first->getParent() == BB);
} while (I != E);

What happens right now is that non-initial regions (if any) in the first block are skipped, but they are not skipped in other blocks. I don't see why the first block should be treated differently; is that a bug?
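For illustration, the corrected iteration can be sketched with plain (block, region) pairs. These are hypothetical simplified stand-ins, not the actual GCNScheduleDAGMILive types; the point is only that the current block must be refreshed inside the outer do/while loop so the first region of *every* block is recorded:

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for a scheduling region: (block id, region id).
using Region = std::pair<int, int>;

// Collect the first region of each block, mirroring the intent of
// getRegionLiveInMap. The fix is that the current block (BB) is refreshed
// inside the outer loop instead of being read once before it.
std::vector<Region>
collectFirstRegionPerBlock(const std::vector<Region> &Regions) {
  std::vector<Region> FirstRegions;
  if (Regions.empty())
    return FirstRegions;
  auto I = Regions.begin(), E = Regions.end();
  do {
    int BB = I->first; // refreshed on every outer iteration
    FirstRegions.push_back(*I);
    do {
      ++I;
    } while (I != E && I->first == BB); // skip this block's other regions
  } while (I != E);
  return FirstRegions;
}
```

With the block read only once (the buggy form), every region of blocks after the first would be visited; with the refresh inside the loop, only one region per block survives.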

Contributor

Spent some more time looking at this, and it seems neither BBLiveInMap nor MBBLiveIns is accessed after sinking instructions.

We should probably either completely abandon trying to maintain these, or do it correctly.

What happens right now is that non-initial regions (if any) in the first block are skipped

Yes, I would expect this to build a map of region identifiers to the LiveRegSet for each region -- similar to getRegionLiveOutMap.

However, we currently only use the LiveRegMap to get the LiveRegs per block and not per region, so we aren't exposed to any undue RP inaccuracies from this currently

Contributor Author

Sounds good. I removed the updates to BBLiveInMap and MBBLiveIns from the stage for now, since they are not accessed after it anyway. I also fixed the issue in getRegionLiveInMap.

SlotIndex NewDef = Slots->getInstructionIndex(*NewMI).getRegSlot();
SlotIndex NewKill = Slots->getInstructionIndex(*Remat.UseMI).getRegSlot();

auto UpdateSegment = [&](LiveRange::Segments &Segments) -> void {
Contributor

Can we use the LIS for this?

Contributor Author

Do you mean for this particular segment change? If so, I don't know of an existing function that does this. We would just like to avoid a more expensive full recomputation of the live interval with

DAG.LIS->removeInterval(Reg);
DAG.LIS->createAndComputeVirtRegInterval(Reg);
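To make the kind of segment surgery concrete, here is a minimal sketch with plain integers standing in for SlotIndex and a bare struct standing in for LiveRange::Segment (hypothetical simplified types; the real LiveIntervals API is richer). Instead of recomputing the whole interval, only the segment that covered the old definition is rewritten to span the rematerialized def and its kill:

```cpp
#include <vector>

// Hypothetical stand-in for LiveRange::Segment: a half-open [Start, End)
// range of slot indices.
struct Segment {
  int Start, End;
  bool operator==(const Segment &O) const {
    return Start == O.Start && End == O.End;
  }
};

// Rewrite the segment that covered the old definition so that it spans the
// rematerialized def and its single use, leaving other segments untouched.
// Returns false if no segment covered OldDef.
bool updateSegment(std::vector<Segment> &Segments, int OldDef, int NewDef,
                   int NewKill) {
  for (Segment &S : Segments) {
    if (S.Start <= OldDef && OldDef < S.End) {
      S.Start = NewDef;
      S.End = NewKill;
      return true;
    }
  }
  return false;
}
```

The cost is linear in the number of segments of one live range, versus a full walk of all uses for createAndComputeVirtRegInterval.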

NewPressure[Idx] = DAG.Pressure[Idx];
NewLiveIns[Idx] = DAG.LiveIns[Idx];
unsigned NumVGPRs = RP.getVGPRNum(ST.hasGFX90AInsts());
unsigned NumToIncreaseOcc = ST.getNumVGPRsToIncreaseOccupancy(NumVGPRs);
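The getNumVGPRsToIncreaseOccupancy query can be illustrated with a toy occupancy model. The constants and rounding below are illustrative only, not the real per-target granularity rules from AMDGPUSubtarget: given a region's VGPR count, we compute how many VGPRs must be saved for one more wave to fit.

```cpp
#include <algorithm>

// Illustrative only: a toy model where waves share a 256-VGPR budget and
// occupancy caps at 10 waves. Real targets apply allocation-granularity
// rounding; these numbers are NOT the actual hardware values.
constexpr unsigned TotalVGPRs = 256;
constexpr unsigned MaxWaves = 10;

unsigned occupancyAtVGPRs(unsigned NumVGPRs) {
  return NumVGPRs ? std::min(MaxWaves, TotalVGPRs / NumVGPRs) : MaxWaves;
}

// Number of VGPRs that must be saved so that occupancy increases by one
// wave, mirroring what getNumVGPRsToIncreaseOccupancy computes for real
// hardware. Returns 0 when occupancy is already maximal.
unsigned vgprsToIncreaseOccupancy(unsigned NumVGPRs) {
  unsigned Occ = occupancyAtVGPRs(NumVGPRs);
  if (Occ >= MaxWaves)
    return 0;
  unsigned MaxVGPRsAtNextOcc = TotalVGPRs / (Occ + 1);
  return NumVGPRs - MaxVGPRsAtNextOcc;
}
```

In this model a region using 128 VGPRs sits at 2 waves; saving 43 VGPRs (down to 85) reaches 3 waves. The stage's analysis step sums such per-region deficits to decide whether enough rematerializable instructions exist.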
Contributor

What about the case where we want to eliminate spilling? For spilling in the unified register file case (hasGFX90AInsts), there are additional constraints we need to consider (see #85860).

Contributor Author

At this point I am not considering the trade-off between occupancy and spilling, i.e., I am not allowing any spilling while trying to increase occupancy. In that context, is there anything else I should consider? I plan to properly model excess ArchVGPR/AccVGPR usage in both unified and non-unified register files in a follow-up PR, which will be able to rematerialize instructions for all classes of virtual def registers.

Contributor

Should we be trying to increase occupancy in cases where there is spilling? Or just eliminate the spilling? Eliminating spilling with a reduction in ArchVGPR usage alone may not be possible if AccVGPR pressure is too high.

Contributor Author

Should we be trying to increase occupancy in cases where there is spilling? Or just eliminate the spilling?

Thanks for the suggestion; I had indeed not explicitly considered trying to reduce/remove spilling even when we are unable to increase occupancy. I have added support for it. In situations where occupancy is 1 and there is spilling, the stage now collects as many rematerializable instructions as necessary to reduce or eliminate spilling, and rematerializes just enough of them to eliminate (or at least reduce) spilling even if we cannot increase occupancy (see the new unit test test_occ_1_sink_for_spill). If it is possible to both eliminate spilling and increase occupancy, it rematerializes just enough instructions for both to happen.

Eliminating spilling with a reduction in ArchVGPR usage alone may not be possible if AccVGPR pressure is too high.

For now I still use the same approximation as in the current implementation, and plan to properly model excess RP for all register types in a further PR if that is ok.

Comment on lines +1540 to +1544
Register Reg = DefMI.getOperand(0).getReg();
if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
!SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
!DAG.MRI.hasOneDef(Reg))
continue;
Contributor

Most of this is covered by isTriviallyReMaterializable, I think. Wouldn't !hasInterval mean the input is malformed?

Contributor Author

If TargetInstrInfo::isTriviallyReMaterializable returns early on

MI.getOpcode() == TargetOpcode::IMPLICIT_DEF && MI.getNumOperands() == 1

then I think I still need the "is virtual" check. And I am not sure that TargetInstrInfo::isReallyTriviallyReMaterializable implies that the register has only one def.

I left the hasInterval check because it was there before, so I assumed the interval could be missing, but perhaps not.

@lucas-rami (Contributor Author)

The analysis doesn't prove that we will change occupancy through rematerialization -- I think this is a complex problem involving the live regs at the rematerialization point, which may change with scheduling. Because of this, we may want to keep the rollback functionality?

I completely agree, but my rationale for removing the rollback functionality is that, after rescheduling, finalizeGCNRegion -> checkScheduling -> revertScheduling can already revert scheduling on a per-region basis if we caused a drop in occupancy or extra spilling, at the cost of a compile-time hit compared to rolling back immediately. I also suppose that rescheduling post-remat can improve occupancy/spilling beyond what we can assess at the end of rematerialization; doing an immediate rollback would make us miss those potential improvements, I think.

@jrbyrnes (Contributor) commented Jan 3, 2025

The analysis doesn't prove that we will change occupancy through rematerialization -- I think this is a complex problem involving the live regs at the rematerialization point, which may change with scheduling. Because of this, we may want to keep the rollback functionality?

I completely agree, but my rationale for removing the rollback functionality is that, after rescheduling, finalizeGCNRegion -> checkScheduling -> revertScheduling can already revert scheduling on a per-region basis if we caused a drop in occupancy or extra spilling, at the cost of a compile-time hit compared to rolling back immediately. I also suppose that rescheduling post-remat can improve occupancy/spilling beyond what we can assess at the end of rematerialization; doing an immediate rollback would make us miss those potential improvements, I think.

revertScheduling doesn't un-materialize the remats.

It's possible that we satisfy the heuristic via rematerialization but don't increase occupancy before rescheduling. Post-remat rescheduling may improve occupancy/spilling, but it also may not; I think we should err on the side of not doing the remats, as this may introduce regressions.

If we have reason to believe that we should always try to reschedule after remat, then we should move the remat rollback functionality into revertScheduling

- The stage now tries to remove any VGPR spilling before trying to
  increase occupancy. In all cases it rematerializes just enough
  instructions to reduce/eliminate spilling and/or increase occupancy.
- Stop updating MBBLiveIns and BBLiveInMap in the stage, since they are
  not accessed afterward.
- Fixed the region boundary update logic; it did not work when a region
  became empty because all of its MIs were rematerialized.
- Fixed an issue in getRegionLiveInMap wherein the BB pointer was not
  updated correctly.
@lucas-rami (Contributor Author) commented Jan 7, 2025

revertScheduling doesn't un-materialize the remats.

I did not realize this, thanks. I see the need for potentially rolling back our remat decisions.

If we have reason to believe that we should always try to reschedule after remat, then we should move the remat rollback functionality into revertScheduling

I think we should always try to reschedule, since the remat heuristics are only "slightly optimistic". In particular, they are exact in regions where rematerialized registers are simply live-through. My motivating example for this series of changes is a case of LICM hoisting many defs into the pre-header of a grid-stride loop containing the whole kernel (which I think is a fairly common situation), causing a lot of spilling. Most affected regions in such cases are those where rematerializable registers are live-through.

I am working on adding post-scheduling rollback functionality to the remat stage, trying to minimize the performance impact in the happy path where no rollback is required.

@lucas-rami (Contributor Author)

I am working on adding post-scheduling rollback functionality to the remat stage, trying to minimize the performance impact in the happy path where no rollback is required.

The last commit re-introduces the ability to roll back rematerializations when the achievable occupancy estimated by the stage cannot be reached. Unit tests are added for the two kinds of regions where our heuristics may be optimistic: regions defining rematerializable registers and regions using rematerializable registers.

In some scenarios rematerialization alone allows us to increase occupancy pre-rescheduling, but the scheduler then does a sub-optimal job and decreases the occupancy back to its pre-stage value. In such cases only the rescheduling is reverted; the rematerializations are not rolled back. This happens, for example, in the post-remat reschedule of bb.2 in test_occ_7_sink_for_8_occ (machine-scheduler-sink-trivial-remats.mir), where rescheduling lowers the occupancy from 8 to 7. I am looking into this as a separate issue.

@lucas-rami (Contributor Author)

Closing because of seemingly infinite processing; replaced by #125885.

@lucas-rami lucas-rami closed this Feb 5, 2025