[AMDGPU][Scheduler] Refactor VGPR rematerialization during scheduling #118722
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

AMDGPU scheduler's PreRARematStage attempts to increase function occupancy w.r.t. VGPR usage by rematerializing trivial VGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up. In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

1. canIncreaseOccupancy: the number of VGPRs to save to increase occupancy in each region at minimum occupancy is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to increase function occupancy (according to slightly optimistic heuristics). If there aren't enough such instructions, the scheduling stage stops here.
2. sinkTriviallyRematInsts: instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state, since we have already done the occupancy analysis and know we won't have to roll back any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision.

This is the first change in a series of changes to improve rematerialization during scheduling. The next immediate steps are to add support for SGPR/AGPR rematerialization, then potentially relax some constraints on instructions we consider eligible for rematerialization.

Patch is 43.43 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/118722.diff

7 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 57f517bfba0ebb..e43014efa08c5a 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -25,8 +25,13 @@
#include "GCNSchedStrategy.h"
#include "AMDGPUIGroupLP.h"
+#include "GCNRegPressure.h"
#include "SIMachineFunctionInfo.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/RegisterClassInfo.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Support/ErrorHandling.h"
#define DEBUG_TYPE "machine-scheduler"
@@ -945,20 +950,20 @@ bool PreRARematStage::initGCNSchedStage() {
return false;
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
- // Check maximum occupancy
+ // Rematerialization will not help if occupancy is LDS-limited.
if (ST.computeOccupancy(MF.getFunction(), MFI.getLDSSize()) ==
DAG.MinOccupancy)
return false;
// FIXME: This pass will invalidate cached MBBLiveIns for regions
- // inbetween the defs and region we sinked the def to. Cached pressure
- // for regions where a def is sinked from will also be invalidated. Will
- // need to be fixed if there is another pass after this pass.
+ // inbetween the defs and region we sinked the def to. Will need to be fixed
+ // if there is another pass after this pass.
assert(!S.hasNextStage());
- collectRematerializableInstructions();
- if (RematerializableInsts.empty() || !sinkTriviallyRematInsts(ST, TII))
+ std::vector<RematInstruction> RematInstructions;
+ if (!canIncreaseOccupancy(RematInstructions))
return false;
+ sinkTriviallyRematInsts(RematInstructions, ST, TII);
LLVM_DEBUG(
dbgs() << "Retrying function scheduling with improved occupancy of "
@@ -1254,8 +1259,7 @@ GCNSchedStage::getScheduleMetrics(const std::vector<SUnit> &InputSchedule) {
#ifndef NDEBUG
LLVM_DEBUG(
printScheduleModel(ReadyCyclesSorted);
- dbgs() << "\n\t"
- << "Metric: "
+ dbgs() << "\n\t" << "Metric: "
<< (SumBubbles
? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
: 1)
@@ -1290,8 +1294,7 @@ GCNSchedStage::getScheduleMetrics(const GCNScheduleDAGMILive &DAG) {
#ifndef NDEBUG
LLVM_DEBUG(
printScheduleModel(ReadyCyclesSorted);
- dbgs() << "\n\t"
- << "Metric: "
+ dbgs() << "\n\t" << "Metric: "
<< (SumBubbles
? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
: 1)
@@ -1339,8 +1342,7 @@ bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter) {
dbgs()
<< "\n\t *** In shouldRevertScheduling ***\n"
<< " *********** BEFORE UnclusteredHighRPStage ***********\n");
- ScheduleMetrics MBefore =
- getScheduleMetrics(DAG.SUnits);
+ ScheduleMetrics MBefore = getScheduleMetrics(DAG.SUnits);
LLVM_DEBUG(
dbgs()
<< "\n *********** AFTER UnclusteredHighRPStage ***********\n");
@@ -1467,231 +1469,246 @@ void GCNSchedStage::revertScheduling() {
DAG.Regions[RegionIdx] = std::pair(DAG.RegionBegin, DAG.RegionEnd);
}
-void PreRARematStage::collectRematerializableInstructions() {
+/// Allows to easily filter for this stage's debug output.
+#define RA_DEBUG(X) LLVM_DEBUG(dbgs() << "[PreRARemat] "; X;)
+
+bool PreRARematStage::canIncreaseOccupancy(
+ std::vector<RematInstruction> &RematInstructions) {
const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(DAG.TRI);
- for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
- Register Reg = Register::index2VirtReg(I);
- if (!DAG.LIS->hasInterval(Reg))
- continue;
- // TODO: Handle AGPR and SGPR rematerialization
- if (!SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
- !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
- continue;
+ RA_DEBUG(dbgs() << "Collecting rematerializable instructions\n");
- MachineOperand *Op = DAG.MRI.getOneDef(Reg);
- MachineInstr *Def = Op->getParent();
- if (Op->getSubReg() != 0 || !isTriviallyReMaterializable(*Def))
+ // Maps optimizable regions (i.e., regions at minimum and VGPR-limited
+ // occupancy) to the numbers of VGPRs that must be deducted from their maximum
+ // VGPR pressure for their occupancy to be increased by one.
+ DenseMap<unsigned, unsigned> OptRegions;
+ for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+ if (!DAG.RegionsWithMinOcc[I])
continue;
+ GCNRegPressure &RP = DAG.Pressure[I];
- MachineInstr *UseI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
- if (Def->getParent() == UseI->getParent())
+ // We do not rematerialize SGPR-defining regions yet so do not bother
+ // optimizing regions whose occupancy is SGPR-limited.
+ if (ST.getOccupancyWithNumSGPRs(RP.getSGPRNum()) == DAG.MinOccupancy)
continue;
- // We are only collecting defs that are defined in another block and are
- // live-through or used inside regions at MinOccupancy. This means that the
- // register must be in the live-in set for the region.
- bool AddedToRematList = false;
- for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
- auto It = DAG.LiveIns[I].find(Reg);
- if (It != DAG.LiveIns[I].end() && !It->second.none()) {
- if (DAG.RegionsWithMinOcc[I]) {
- RematerializableInsts[I][Def] = UseI;
- AddedToRematList = true;
- }
-
- // Collect regions with rematerializable reg as live-in to avoid
- // searching later when updating RP.
- RematDefToLiveInRegions[Def].push_back(I);
- }
- }
- if (!AddedToRematList)
- RematDefToLiveInRegions.erase(Def);
- }
-}
-
-bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
- const TargetInstrInfo *TII) {
- // Temporary copies of cached variables we will be modifying and replacing if
- // sinking succeeds.
- SmallVector<
- std::pair<MachineBasicBlock::iterator, MachineBasicBlock::iterator>, 32>
- NewRegions;
- DenseMap<unsigned, GCNRPTracker::LiveRegSet> NewLiveIns;
- DenseMap<unsigned, GCNRegPressure> NewPressure;
- BitVector NewRescheduleRegions;
- LiveIntervals *LIS = DAG.LIS;
-
- NewRegions.resize(DAG.Regions.size());
- NewRescheduleRegions.resize(DAG.Regions.size());
-
- // Collect only regions that has a rematerializable def as a live-in.
- SmallSet<unsigned, 16> ImpactedRegions;
- for (const auto &It : RematDefToLiveInRegions)
- ImpactedRegions.insert(It.second.begin(), It.second.end());
-
- // Make copies of register pressure and live-ins cache that will be updated
- // as we rematerialize.
- for (auto Idx : ImpactedRegions) {
- NewPressure[Idx] = DAG.Pressure[Idx];
- NewLiveIns[Idx] = DAG.LiveIns[Idx];
+ unsigned NumVGPRs = RP.getVGPRNum(ST.hasGFX90AInsts());
+ unsigned NumToIncreaseOcc = ST.getNumVGPRsToIncreaseOccupancy(NumVGPRs);
+ OptRegions.insert({I, NumToIncreaseOcc});
+ RA_DEBUG(dbgs() << "Region " << I << " has min. occupancy: decrease by "
+ << NumToIncreaseOcc << " VGPR(s) to improve occupancy\n");
}
- NewRegions = DAG.Regions;
- NewRescheduleRegions.reset();
+ if (OptRegions.empty())
+ return false;
- DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
- bool Improved = false;
- for (auto I : ImpactedRegions) {
- if (!DAG.RegionsWithMinOcc[I])
- continue;
+ // Tracks estimated rematerialization gains (i.e., reduction in RP) for
+ // this instruction in each optimizable region.
+ auto ReduceRPInRegion = [&](auto OptIt, unsigned I,
+ LaneBitmask Mask) -> bool {
+ auto NumRegs = SIRegisterInfo::getNumCoveredRegs(Mask);
+ unsigned &RPExcess = OptIt->getSecond();
+ if (NumRegs >= RPExcess) {
+ OptRegions.erase(I);
+ LLVM_DEBUG(dbgs() << "sinking increases occupancy in region " << I
+ << "\n");
+ } else {
+ RPExcess -= NumRegs;
+ LLVM_DEBUG(dbgs() << "sinking reduces excess pressure in region " << I
+ << " by " << NumRegs << " (" << RPExcess << " left)\n");
+ }
+ return OptRegions.empty();
+ };
+
+ // We need up-to-date live-out info. to query live-out register masks in
+ // regions containing rematerializable instructions.
+ DAG.RegionLiveOuts.buildLiveRegMap();
+
+ for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+ auto Region = DAG.Regions[I];
+ for (auto MI = Region.first; MI != Region.second; ++MI) {
+ // We only support instructions with at least one register (but
+ // non-subregister) operand.
+ MachineInstr &DefMI = *MI;
+ if (DefMI.isBundle() || !DefMI.getNumOperands() ||
+ !DefMI.getOperand(0).isReg() || DefMI.getOperand(0).getSubReg())
+ continue;
- Improved = false;
- int VGPRUsage = NewPressure[I].getVGPRNum(ST.hasGFX90AInsts());
- int SGPRUsage = NewPressure[I].getSGPRNum();
+ // We only support rematerializing virtual VGPRs with one definition and
+ // one use.
+ Register Reg = DefMI.getOperand(0).getReg();
+ if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
+ !SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
+ !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
+ continue;
- // TODO: Handle occupancy drop due to AGPR and SGPR.
- // Check if cause of occupancy drop is due to VGPR usage and not SGPR.
- if (ST.getOccupancyWithNumSGPRs(SGPRUsage) == DAG.MinOccupancy)
- break;
+ // The instruction must be trivially rematerializable and have no virtual
+ // register use.
+ if (!isTriviallyReMaterializable(DefMI))
+ continue;
- // The occupancy of this region could have been improved by a previous
- // iteration's sinking of defs.
- if (NewPressure[I].getOccupancy(ST) > DAG.MinOccupancy) {
- NewRescheduleRegions[I] = true;
- Improved = true;
- continue;
- }
+ // We only care to rematerialize the instruction if its single use is in a
+ // different block.
+ MachineInstr *UseMI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
+ if (DefMI.getParent() == UseMI->getParent())
+ continue;
- // First check if we have enough trivially rematerializable instructions to
- // improve occupancy. Optimistically assume all instructions we are able to
- // sink decreased RP.
- int TotalSinkableRegs = 0;
- for (const auto &It : RematerializableInsts[I]) {
- MachineInstr *Def = It.first;
- Register DefReg = Def->getOperand(0).getReg();
- TotalSinkableRegs +=
- SIRegisterInfo::getNumCoveredRegs(NewLiveIns[I][DefReg]);
- }
- int VGPRsAfterSink = VGPRUsage - TotalSinkableRegs;
- unsigned OptimisticOccupancy = ST.getOccupancyWithNumVGPRs(VGPRsAfterSink);
- // If in the most optimistic scenario, we cannot improve occupancy, then do
- // not attempt to sink any instructions.
- if (OptimisticOccupancy <= DAG.MinOccupancy)
- break;
+ RA_DEBUG(dbgs() << "In region " << I << ", instruction " << DefMI
+ << " is rematerializable with single use " << *UseMI);
+ auto &Remat = RematInstructions.emplace_back(&DefMI, I, UseMI);
+
+ bool RematUseful = false;
+ if (auto It = OptRegions.find(I); It != OptRegions.end()) {
+ // Optimistically consider that moving the instruction out of its
+ // defining region will reduce RP in the latter; this assumes that
+ // maximum RP in the region is reached somewhere between the defining
+ // instruction and the end of the region.
+ RA_DEBUG(dbgs() << " Instruction's defining region is optimizable: ");
+ RematUseful = true;
+ auto RegMask = DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg];
+ if (ReduceRPInRegion(It, I, RegMask))
+ return true;
+ }
- unsigned ImproveOccupancy = 0;
- SmallVector<MachineInstr *, 4> SinkedDefs;
- for (auto &It : RematerializableInsts[I]) {
- MachineInstr *Def = It.first;
- MachineBasicBlock::iterator InsertPos =
- MachineBasicBlock::iterator(It.second);
- Register Reg = Def->getOperand(0).getReg();
- // Rematerialize MI to its use block. Since we are only rematerializing
- // instructions that do not have any virtual reg uses, we do not need to
- // call LiveRangeEdit::allUsesAvailableAt() and
- // LiveRangeEdit::canRematerializeAt().
- TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
- Def->getOperand(0).getSubReg(), *Def, *DAG.TRI);
- MachineInstr *NewMI = &*std::prev(InsertPos);
- LIS->InsertMachineInstrInMaps(*NewMI);
- LIS->removeInterval(Reg);
- LIS->createAndComputeVirtRegInterval(Reg);
- InsertedMIToOldDef[NewMI] = Def;
-
- // Update region boundaries in scheduling region we sinked from since we
- // may sink an instruction that was at the beginning or end of its region
- DAG.updateRegionBoundaries(NewRegions, Def, /*NewMI =*/nullptr,
- /*Removing =*/true);
-
- // Update region boundaries in region we sinked to.
- DAG.updateRegionBoundaries(NewRegions, InsertPos, NewMI);
-
- LaneBitmask PrevMask = NewLiveIns[I][Reg];
- // FIXME: Also update cached pressure for where the def was sinked from.
- // Update RP for all regions that has this reg as a live-in and remove
- // the reg from all regions as a live-in.
- for (auto Idx : RematDefToLiveInRegions[Def]) {
- NewLiveIns[Idx].erase(Reg);
- if (InsertPos->getParent() != DAG.Regions[Idx].first->getParent()) {
- // Def is live-through and not used in this block.
- NewPressure[Idx].inc(Reg, PrevMask, LaneBitmask::getNone(), DAG.MRI);
+ for (unsigned LVRegion = 0; LVRegion != E; ++LVRegion) {
+ // We are only collecting regions in which the register is a live-in
+ // (and may be live-through).
+ auto It = DAG.LiveIns[LVRegion].find(Reg);
+ if (It == DAG.LiveIns[LVRegion].end() || It->second.none())
+ continue;
+ Remat.LiveInRegions.insert(LVRegion);
+ RA_DEBUG(dbgs() << " Def is live-in in region " << LVRegion << ": ");
+
+ // Account for the reduction in RP due to the rematerialization in an
+ // optimizable region in which the defined register is a live-in. This
+ // is exact for live-through region but optimistic in the using region,
+ // where RP is actually reduced only if maximum RP is reached somewhere
+ // between the beginning of the region and the rematerializable
+ // instruction's use.
+ if (auto It = OptRegions.find(LVRegion); It != OptRegions.end()) {
+ RematUseful = true;
+ if (ReduceRPInRegion(It, LVRegion, DAG.LiveIns[LVRegion][Reg]))
+ return true;
} else {
- // Def is used and rematerialized into this block.
- GCNDownwardRPTracker RPT(*LIS);
- auto *NonDbgMI = &*skipDebugInstructionsForward(
- NewRegions[Idx].first, NewRegions[Idx].second);
- RPT.reset(*NonDbgMI, &NewLiveIns[Idx]);
- RPT.advance(NewRegions[Idx].second);
- NewPressure[Idx] = RPT.moveMaxPressure();
+ LLVM_DEBUG(dbgs() << "unoptimizable region\n");
}
}
- SinkedDefs.push_back(Def);
- ImproveOccupancy = NewPressure[I].getOccupancy(ST);
- if (ImproveOccupancy > DAG.MinOccupancy)
- break;
+ // If the instruction is not a live-in or live-out in any optimizable
+ // region then there is no point in rematerializing it.
+ if (!RematUseful) {
+ RematInstructions.pop_back();
+ RA_DEBUG(
+ dbgs()
+ << " No impact on any optimizable region, dropping instruction\n");
+ }
}
+ }
- // Remove defs we just sinked from all regions' list of sinkable defs
- for (auto &Def : SinkedDefs)
- for (auto TrackedIdx : RematDefToLiveInRegions[Def])
- RematerializableInsts[TrackedIdx].erase(Def);
+ RA_DEBUG(dbgs() << "Cannot increase occupancy through rematerialization\n");
+ return false;
+}
- if (ImproveOccupancy <= DAG.MinOccupancy)
- break;
+void PreRARematStage::sinkTriviallyRematInsts(
+ ArrayRef<RematInstruction> RematInstructions, const GCNSubtarget &ST,
+ const TargetInstrInfo *TII) {
+ // Collect regions whose live-ins or register pressure will change due to
+ // rematerialization, and map those whose maximum RP we need to fully
+ // recompute at the end to true.
+ SmallDenseMap<unsigned, bool> ImpactedRegions;
+ // Maps rematerialized instuctions to the one they were rematerialized from.
+ DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
+ LiveIntervals *LIS = DAG.LIS;
- NewRescheduleRegions[I] = true;
- Improved = true;
+ for (const RematInstruction &Remat : RematInstructions) {
+ MachineInstr *DefMI = Remat.RematMI;
+ MachineBasicBlock::iterator InsertPos(Remat.UseMI);
+ Register Reg = DefMI->getOperand(0).getReg();
+
+ // Rematerialize MI to its use block. Since we are only rematerializing
+ // instructions that do not have any virtual reg uses, we do not need to
+ // call LiveRangeEdit::allUsesAvailableAt() and
+ // LiveRangeEdit::canRematerializeAt().
+ TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
+ DefMI->getOperand(0).getSubReg(), *DefMI, *DAG.TRI);
+ MachineInstr *NewMI = &*std::prev(InsertPos);
+ LIS->InsertMachineInstrInMaps(*NewMI);
+ LIS->removeInterval(Reg);
+ LIS->createAndComputeVirtRegInterval(Reg);
+ InsertedMIToOldDef[NewMI] = DefMI;
+
+ // Update region boundaries in scheduling region we sinked from since we
+ // may sink an instruction that was at the beginning or end of its region.
+ DAG.updateRegionBoundaries(DAG.Regions, DefMI, /*NewMI =*/nullptr,
+ /*Removing =*/true);
+
+ // Update region boundaries in region we sinked to.
+ DAG.updateRegionBoundaries(DAG.Regions, InsertPos, NewMI);
+
+ // Collect all regions impacted by the rematerialization and update their
+ // live-in/RP information.
+ for (unsigned I : Remat.LiveInRegions) {
+ ImpactedRegions.insert({I, false});
+ GCNRPTracker::LiveRegSet &RegionLiveIns = DAG.LiveIns[I];
+
+ // The register is no longer a live-in in all regions but the one that
+ // contains the single use. In live-through regions, maximum register
+ // pressure decreases predictably so we can directly update it. In the
+ // using region, maximum register pressure may or may not decrease, so we
+ // will mark it for re-computation after all materializations.
+ RegionLiveIns.erase(Reg);
+ if (Remat.UseMI->getParent() != DAG.Regions[I].first->getParent()) {
+ // Register is live-through and not used in this block.
+ LaneBitmask PrevMask = RegionLiveIns[Reg];
+ DAG.Pressure[I].inc(Reg, PrevMask, LaneBitmask::getNone(), DAG.MRI);
+ } else {
+ // Register is used in this block.
+ ImpactedRegions[I] = true;
+ }
+ }
+
+ // RP in the region from which the instruction was rematerialized may or may
+ // not change, recompute-it fully later.
+ ImpactedRegions[Remat.DefRegion] = true;
}
- if (!Improved) {
- // Occupancy was not improved for all regions that were at MinOccupancy.
- // Undo sinking and remove newly rematerialized instructions.
- for (auto &Entry : InsertedMIToOldDef) {
- MachineInstr *MI = Entry.first;
- MachineInstr *OldMI = Entry.second;
- Register Reg = MI->getOperand(0).getReg();
- LIS->RemoveMachineInstrFromMaps(*MI);
- MI->eraseFromParent();
- OldMI->clearRegisterDeads(Reg);
- LIS->removeInterval(Reg);
- LIS->createAndComputeVirtRegInterval(Reg);
+ // All regions impacted by at least one rematerialization must be rescheduled.
+ BitVector NewRescheduleRegions(DAG.Regions.size());
+ for (auto &[I, RecomputeRP] : ImpactedRegions) {
+ NewRescheduleRegions[I] = true;
+ DAG.MBBLiveIns.erase(DAG.Regions[I].first->getParent());
+
+ // Recompute maximum RP in regions in which at least one instruction was
+ // rematerialized from or to.
+ if (Recom...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
/// Single use of the rematerializable instruction's defined register,
/// located in a different block.
MachineInstr *UseMI;
I thought the scheduling was per block only?
AFAIU it is per region, and in the case of a rematerialization both the region where the instruction was rematerialized from and the one it is rematerialized to will be rescheduled after this scheduling stage takes place.
In the future we could improve rematerialization to work even if the def and use are in the same block but not the same region, which from what I understand can happen and make sense.
// non-subregister) operand.
MachineInstr &DefMI = *MI;
if (DefMI.isBundle() || !DefMI.getNumOperands() ||
    !DefMI.getOperand(0).isReg() || DefMI.getOperand(0).getSubReg())
Rematerializing undef subregister defs should work. Also, does this need to verify that operand 0 is a def? But this is repeating isReallyTriviallyReMaterializable's logic anyway; most of these pre-checks should be redundant.
Thanks for the catches. We indeed don't need to check that operand 0 is a def here, as it is already implied by isTriviallyReMaterializable. I now allow undef subregister defs as well (see test_occ_7_sink_one_def_of_undef_subreg_for_8).

However, I am now forced to do something that seems slightly weird to me during rematerialization: I have to explicitly set the subreg on the new MI's def after calling TII.reMaterialize, even though the subreg is already passed to that method (if I don't do this I hit an assert in DAG.LIS->createAndComputeVirtRegInterval(Reg) at the end). Not sure this is the correct way to do this.
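For concreteness, a minimal sketch of the workaround being described, with variable names mirroring the diff above; whether the explicit setSubReg is the right fix is exactly what is in question here:

// Sketch only: InsertPos, Reg, DefMI, DAG and TII are assumed to be as in
// the diff above.
unsigned SubReg = DefMI->getOperand(0).getSubReg();
TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg, SubReg, *DefMI,
                   *DAG.TRI);
MachineInstr *NewMI = &*std::prev(InsertPos);
// Explicitly re-set the subreg on the new def, even though SubReg was
// already passed to reMaterialize(); without this, the assert in
// LiveInterval::computeSubRangeUndefs fires when recomputing the interval.
NewMI->getOperand(0).setSubReg(SubReg);
DAG.LIS->InsertMachineInstrInMaps(*NewMI);
DAG.LIS->removeInterval(Reg);
DAG.LIS->createAndComputeVirtRegInterval(Reg);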
The subregister should survive the clone and substitute. What assert?
This one in LiveInterval::computeSubRangeUndefs. Looking at the in-flight MIR before explicitly setting the subreg after rematerialization, it seems that the NewMI's def reg is being assigned a different, non-intersecting subregister than the original one, which causes the two masks tested there to be completely unrelated.

In the test_occ_7_sink_one_def_of_undef_subreg_for_8 test for example, after calling SIInstrInfo::reMaterialize but before setting the subreg explicitly, the new MI uses sub2 instead of sub1:

[PreRARemat] Rematerializing undef %32.sub1:vreg_64_align2 = V_MOV_B32_e32 23, implicit $exec
[PreRARemat] New MI: undef %32.sub2:vreg_64_align2 = V_MOV_B32_e32 23, implicit $exec
// Even though TargetInstrInfo::isReallyTriviallyReMaterializable already
// ensures that the instruction has no virtual register uses,
// SIInstrInfo::isReallyTriviallyReMaterializable may consider an instruction
// rematerializable and return before calling its parent's method, so we need
// to double-check here.
Seems like a bug? All of these rematerialize functions are confusing
It is a bit confusing to me as well. Our SIInstrInfo override considers that rematerializing VOP{1,2,3} instructions with virtual register uses is trivial, which I don't really agree with. Removing the override in favor of its parent's method breaks many unit tests (58 files report errors), however, so if that's ok I will leave this as is for now and investigate this further in the future.
Our SIInstrInfo override considers that rematerializing VOP{1,2,3} instructions with virtual register uses is trivial, which I don't really agree with
I think this is one of those cases that was supposed to be moved up to the generic implementation. The actual uses consider liveness, so it shouldn't be baked into the low level check
I think 08d7eec never got finished / reapplied
The analysis doesn't prove that we will change occupancy through rematerialization -- I think this is a complex problem involving the live regs at the rematerialization point, which may change with scheduling. Because of this, we may want to keep the rollback functionality?
for (auto [Remat, NewMI] : llvm::zip_equal(RematInstructions, NewMIs)) {
  // Remove rematerialized instruction from BBLiveInMap since we are sinking
  // it from its MBB. Also remove it from live intervals and from its MBB.
  MachineInstr *OldMI = Remat.RematMI;
  DAG.BBLiveInMap.erase(OldMI);
I don't think this is maintaining the map the way we want.
After taking a closer look, it's probably not indeed (and potentially the same applies to MBBLiveIns). Working to fix this.
Spent some more time looking at this, and it seems both BBLiveInMap and MBBLiveIns are not accessed after sinking instructions. Their only uses are in computeBlockPressure, which is only called for other scheduling stages.

However, as I was trying to understand the exact purpose of BBLiveInMap, I noticed that while GCNScheduleDAGMILive::getRegionLiveInMap seems to want to compute the live reg set at the beginning of the initial region of each block, the BB pointer is never updated. Seems like it should be inside the outer do/while loop.
llvm-project/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp, lines 886 to 894 at 6780ab3:

auto I = Regions.rbegin(), E = Regions.rend();
auto *BB = I->first->getParent();
do {
  auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
  RegionFirstMIs.push_back(MI);
  do {
    ++I;
  } while (I != E && I->first->getParent() == BB);
} while (I != E);
What happens right now is that non-initial regions (if any) in the first block are skipped but they are not in other blocks. I don't see why the first block should be treated differently, is that a bug?
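A sketch of the fix being suggested, assuming the snippet quoted above: recompute BB at each iteration of the outer loop so that non-initial regions are skipped in every block, not just the first one.

auto I = Regions.rbegin(), E = Regions.rend();
do {
  // Moved inside the outer loop: BB now tracks the block of the current
  // initial region instead of staying pinned to the first block.
  auto *BB = I->first->getParent();
  auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
  RegionFirstMIs.push_back(MI);
  do {
    ++I;
  } while (I != E && I->first->getParent() == BB);
} while (I != E);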
Spent some more time looking at this, and it seems both BBLiveInMap and MBBLiveIns are not accessed after sinking instructions.
We should probably either completely abandon trying to maintain these, or do it correctly.
What happens right now is that non-initial regions (if any) in the first block are skipped
Yes, I would expect this to build a map of region identifiers to the LiveRegSet for each region -- similar to getRegionLiveOutMap.
However, we currently only use the LiveRegMap to get the LiveRegs per block and not per region, so we aren't exposed to any undue RP inaccuracies from this currently.
Sounds good, I removed the updates to BBLiveInMap and MBBLiveIns from the stage for now since they are not accessed after it anyway. I also fixed the issue in getRegionLiveInMap.
SlotIndex NewDef = Slots->getInstructionIndex(*NewMI).getRegSlot();
SlotIndex NewKill = Slots->getInstructionIndex(*Remat.UseMI).getRegSlot();

auto UpdateSegment = [&](LiveRange::Segments &Segments) -> void {
Can we use the LIS for this?
Do you mean for this particular segment change? If so, I don't know if there exists some function that does this already. We would just like to avoid doing a more expensive full recomputation of the LI with:

DAG.LIS->removeInterval(Reg);
DAG.LIS->createAndComputeVirtRegInterval(Reg);
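For illustration, one way the UpdateSegment lambda quoted above could complete, assuming the register's live range has exactly one segment (single def, single use); this is a sketch, not necessarily the PR's implementation:

auto UpdateSegment = [&](LiveRange::Segments &Segments) -> void {
  // With one def and one use, the live range has a single segment: move
  // its boundaries to the new def/kill slots and update the value
  // number's definition point in place, avoiding a full recomputation.
  assert(Segments.size() == 1 && "expected single-segment live range");
  LiveRange::Segment &Seg = Segments.front();
  Seg.start = NewDef;
  Seg.end = NewKill;
  if (Seg.valno)
    Seg.valno->def = NewDef;
};
UpdateSegment(DAG.LIS->getInterval(Reg).segments);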
NewPressure[Idx] = DAG.Pressure[Idx];
NewLiveIns[Idx] = DAG.LiveIns[Idx];
unsigned NumVGPRs = RP.getVGPRNum(ST.hasGFX90AInsts());
unsigned NumToIncreaseOcc = ST.getNumVGPRsToIncreaseOccupancy(NumVGPRs);
What about the case to eliminate spilling? For spilling in unified register file case (hasGFX90AInsts), there are additional constraints we need to consider (see #85860)
At this point I am not considering the trade-off between occupancy and spilling, i.e., I am not allowing any spilling while trying to increase occupancy. In that context, is there anything else I should consider? I plan to properly model excess ArchVGPR/AccVGPR in both unified and non-unified RFs in a following PR, which will be able to remat instructions for all classes of virtual def regs.
Should we be trying to increase occupancy in cases where there is spilling? Or just eliminate the spilling? Eliminating spilling with a reduction in ArchVGPR usage alone may not be possible if AccVGPR pressure is too high.
Should we be trying to increase occupancy in cases where there is spilling? Or just eliminate the spilling?

Thanks for the suggestion, I had not explicitly considered trying to reduce/remove spilling even if we are unable to increase occupancy. I have added support for it. In situations where occupancy is 1 and there is spilling, the stage now tries to collect as many rematerializable instructions as necessary to reduce/eliminate spilling, and will rematerialize just enough of them to eliminate spilling (or at least reduce it) even if we cannot increase occupancy (see new unit test test_occ_1_sink_for_spill). If it is possible to both eliminate spilling and increase occupancy, then it rematerializes just enough instructions for both to happen.
Eliminating spilling with a reduction in ArchVGPR usage alone may not be possible if AccVGPR pressure is too high.
For now I still use the same approximation as in the current implementation, and plan to properly model excess RP for all register types in a further PR if that is ok.
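As a hedged sketch of how region collection might account for spilling as described — modeling the spilling threshold via getAddressableNumVGPRs is an assumption for illustration, not necessarily how the PR computes it:

// Also treat spilling regions as optimizable: the excess over the
// addressable VGPR budget is the number of VGPRs to save to eliminate
// spilling. Illustrative approximation only; names mirror the diff above.
for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
  unsigned NumVGPRs = DAG.Pressure[I].getVGPRNum(ST.hasGFX90AInsts());
  unsigned Budget = ST.getAddressableNumVGPRs();
  if (NumVGPRs > Budget)
    OptRegions.insert({I, NumVGPRs - Budget}); // reduce/eliminate spilling
  else if (DAG.RegionsWithMinOcc[I])
    OptRegions.insert({I, ST.getNumVGPRsToIncreaseOccupancy(NumVGPRs)});
}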
Register Reg = DefMI.getOperand(0).getReg();
if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
    !SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
    !DAG.MRI.hasOneDef(Reg))
  continue;
Most of this is covered by isTriviallyReMaterializable, I think. !hasInterval would be malformed?
If TargetInstrInfo::isTriviallyReMaterializable early returns on

MI.getOpcode() == TargetOpcode::IMPLICIT_DEF && MI.getNumOperands() == 1

then I think I need the "is virtual" check. And I am not sure that TargetInstrInfo::isReallyTriviallyReMaterializable implies that the register has only one def.

I left the hasInterval check because it was here before so I assumed the interval could be missing, but perhaps not.
I completely agree, but my rationale for removing the rollback functionality is that after rescheduling...
revertScheduling doesn't un-materialize the remats. It's possible that we satisfy the heuristic via rematerialization, but don't increase occupancy before rescheduling. Post-remat rescheduling may improve occupancy / spilling, but it also may not; I think we should err on the side of not doing the remats as this may introduce regressions. If we have reason to believe that we should always try to reschedule after remat, then we should move the remat rollback functionality into revertScheduling.
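For reference, un-materializing a remat amounts to the rollback loop this patch removes, mirroring the deleted code in the diff above:

// Erase each rematerialized instruction, resurrect the original def, and
// recompute the register's live interval (from the deleted rollback path).
for (auto &Entry : InsertedMIToOldDef) {
  MachineInstr *MI = Entry.first;     // rematerialized copy
  MachineInstr *OldMI = Entry.second; // original def
  Register Reg = MI->getOperand(0).getReg();
  LIS->RemoveMachineInstrFromMaps(*MI);
  MI->eraseFromParent();
  OldMI->clearRegisterDeads(Reg);
  LIS->removeInterval(Reg);
  LIS->createAndComputeVirtRegInterval(Reg);
}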
- The stage now tries to remove any VGPR spilling before trying to increase occupancy. In all cases it remats just enough instructions to reduce/eliminate spilling and/or increase occupancy.
- Stop updating MBBLiveIns and BBLiveInMap in the stage, since these are not accessed.
- Fixed region boundary update logic; it did not work when a region became empty due to remating all of its MIs.
- Fixed issue in getRegionLiveInMap wherein the BB pointer was not updated correctly.
I did not realize this, thanks. I see the need for potentially rolling back our remat decisions.

I think we should always try to reschedule, since the remat heuristics are only "slightly optimistic". In particular, they are exact in regions where rematerialized registers are simply live-through. My motivating example for this series of changes is a case of LICM bringing many defs inside the pre-header of a grid stripe loop containing the whole kernel (which I think is a fairly common situation), causing a lot of spilling. Most affected regions in such cases are those where rematable registers are live-through. I am working on adding post-scheduling rollback functionality to the remat stage and trying to minimize the performance impact in the happy path where no rollback is required.
The last commit re-introduces the ability to roll back rematerializations when the achievable occupancy estimated by the stage cannot be reached. Unit tests are added for the two possible kinds of regions where our heuristics may be optimistic: regions defining rematable registers and regions using rematable registers. In some scenarios rematerializations alone allow us to increase occupancy pre-rescheduling, but the scheduler then does a sub-optimal job and decreases the occupancy again to its pre-stage values. In such cases only the rescheduling is reverted; rematerializations are not rolled back. This happens, for example, in the post-remat reschedule of...
Closing because of seemingly infinite processing; replaced by #125885.