[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling #125885
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions. This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.
In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rolled back. This replaces #118722, which seems to be stuck in an infinite processing loop.

Patch is 264.49 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125885.diff

11 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/MachineRegisterInfo.h b/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
index 4fddc2033b81b7c..4dd7b20a68a0a96 100644
--- a/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
@@ -23,6 +23,7 @@
#include "llvm/ADT/iterator_range.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBundle.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/RegisterBank.h"
@@ -592,6 +593,9 @@ class MachineRegisterInfo {
/// multiple uses.
bool hasOneNonDBGUser(Register RegNo) const;
+ /// If the register has a single non-Debug instruction using the specified
+ /// register, returns it; otherwise returns nullptr.
+ MachineInstr *getOneNonDBGUser(Register RegNo) const;
/// hasAtMostUses - Return true if the given register has at most \p MaxUsers
/// non-debug user instructions.
diff --git a/llvm/lib/CodeGen/MachineRegisterInfo.cpp b/llvm/lib/CodeGen/MachineRegisterInfo.cpp
index 937f63f6c5e0046..b7135251781ad98 100644
--- a/llvm/lib/CodeGen/MachineRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/MachineRegisterInfo.cpp
@@ -432,6 +432,11 @@ bool MachineRegisterInfo::hasOneNonDBGUser(Register RegNo) const {
return hasSingleElement(use_nodbg_instructions(RegNo));
}
+MachineInstr *MachineRegisterInfo::getOneNonDBGUser(Register RegNo) const {
+ auto RegNoDbgUsers = use_nodbg_instructions(RegNo);
+ return hasSingleElement(RegNoDbgUsers) ? &*RegNoDbgUsers.begin() : nullptr;
+}
+
bool MachineRegisterInfo::hasAtMostUserInstrs(Register Reg,
unsigned MaxUsers) const {
return hasNItemsOrLess(use_instr_nodbg_begin(Reg), use_instr_nodbg_end(),
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 6e693066de10b6b..ceaf710f5d47fd3 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -25,8 +25,13 @@
#include "GCNSchedStrategy.h"
#include "AMDGPUIGroupLP.h"
+#include "GCNRegPressure.h"
#include "SIMachineFunctionInfo.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/RegisterClassInfo.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Support/ErrorHandling.h"
#define DEBUG_TYPE "machine-scheduler"
@@ -307,11 +312,11 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
HasHighPressure = true;
if (SGPRDelta > VGPRDelta) {
Cand.RPDelta.CriticalMax =
- PressureChange(AMDGPU::RegisterPressureSets::SReg_32);
+ PressureChange(AMDGPU::RegisterPressureSets::SReg_32);
Cand.RPDelta.CriticalMax.setUnitInc(SGPRDelta);
} else {
Cand.RPDelta.CriticalMax =
- PressureChange(AMDGPU::RegisterPressureSets::VGPR_32);
+ PressureChange(AMDGPU::RegisterPressureSets::VGPR_32);
Cand.RPDelta.CriticalMax.setUnitInc(VGPRDelta);
}
}
@@ -324,7 +329,7 @@ void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
const RegPressureTracker &RPTracker,
SchedCandidate &Cand,
bool IsBottomUp) {
- const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo*>(TRI);
+ const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(TRI);
ArrayRef<unsigned> Pressure = RPTracker.getRegSetPressureAtPos();
unsigned SGPRPressure = 0;
unsigned VGPRPressure = 0;
@@ -420,7 +425,7 @@ SUnit *GCNSchedStrategy::pickNodeBidirectional(bool &IsTopNode) {
pickNodeFromQueue(Top, TopPolicy, DAG->getTopRPTracker(), TCand,
/*IsBottomUp=*/false);
assert(TCand.SU == TopCand.SU &&
- "Last pick result should correspond to re-picking right now");
+ "Last pick result should correspond to re-picking right now");
}
#endif
}
@@ -890,13 +895,13 @@ GCNScheduleDAGMILive::getRegionLiveInMap() const {
std::vector<MachineInstr *> RegionFirstMIs;
RegionFirstMIs.reserve(Regions.size());
auto I = Regions.rbegin(), E = Regions.rend();
- auto *BB = I->first->getParent();
do {
+ const MachineBasicBlock *MBB = I->first->getParent();
auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
RegionFirstMIs.push_back(MI);
do {
++I;
- } while (I != E && I->first->getParent() == BB);
+ } while (I != E && I->first->getParent() == MBB);
} while (I != E);
return getLiveRegMap(RegionFirstMIs, /*After=*/false, *LIS);
}
@@ -1081,31 +1086,32 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
return true;
}
-bool PreRARematStage::initGCNSchedStage() {
- if (!GCNSchedStage::initGCNSchedStage())
- return false;
-
- if (DAG.RegionsWithMinOcc.none() || DAG.Regions.size() == 1)
- return false;
-
- const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
- // Rematerialization will not help if occupancy is not limited by reg usage.
- if (ST.getOccupancyWithWorkGroupSizes(MF).second == DAG.MinOccupancy)
- return false;
+/// Allows to easily filter for this stage's debug output.
+#define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << "[PreRARemat] "; X;)
- // FIXME: This pass will invalidate cached MBBLiveIns for regions
- // inbetween the defs and region we sinked the def to. Cached pressure
- // for regions where a def is sinked from will also be invalidated. Will
- // need to be fixed if there is another pass after this pass.
+bool PreRARematStage::initGCNSchedStage() {
+ // FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
+ // regions inbetween the defs and region we sinked the def to. Will need to be
+ // fixed if there is another pass after this pass.
assert(!S.hasNextStage());
- collectRematerializableInstructions();
- if (RematerializableInsts.empty() || !sinkTriviallyRematInsts(ST, TII))
+ if (!GCNSchedStage::initGCNSchedStage() || DAG.RegionsWithMinOcc.none() ||
+ DAG.Regions.size() == 1 || !canIncreaseOccupancyOrReduceSpill())
return false;
- LLVM_DEBUG(
- dbgs() << "Retrying function scheduling with improved occupancy of "
- << DAG.MinOccupancy << " from rematerializing\n");
+ // Rematerialize identified instructions and update scheduler's state.
+ rematerialize();
+ if (GCNTrackers)
+ DAG.RegionLiveOuts.buildLiveRegMap();
+ REMAT_DEBUG(
+ dbgs() << "Retrying function scheduling with new min. occupancy of "
+ << AchievedOcc << " from rematerializing (original was "
+ << DAG.MinOccupancy << ", target was " << TargetOcc << ")\n");
+ if (AchievedOcc > DAG.MinOccupancy) {
+ DAG.MinOccupancy = AchievedOcc;
+ SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
+ MFI.increaseOccupancy(MF, DAG.MinOccupancy);
+ }
return true;
}
@@ -1397,8 +1403,7 @@ GCNSchedStage::getScheduleMetrics(const std::vector<SUnit> &InputSchedule) {
#ifndef NDEBUG
LLVM_DEBUG(
printScheduleModel(ReadyCyclesSorted);
- dbgs() << "\n\t"
- << "Metric: "
+ dbgs() << "\n\t" << "Metric: "
<< (SumBubbles
? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
: 1)
@@ -1433,8 +1438,7 @@ GCNSchedStage::getScheduleMetrics(const GCNScheduleDAGMILive &DAG) {
#ifndef NDEBUG
LLVM_DEBUG(
printScheduleModel(ReadyCyclesSorted);
- dbgs() << "\n\t"
- << "Metric: "
+ dbgs() << "\n\t" << "Metric: "
<< (SumBubbles
? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
: 1)
@@ -1482,8 +1486,7 @@ bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter) {
dbgs()
<< "\n\t *** In shouldRevertScheduling ***\n"
<< " *********** BEFORE UnclusteredHighRPStage ***********\n");
- ScheduleMetrics MBefore =
- getScheduleMetrics(DAG.SUnits);
+ ScheduleMetrics MBefore = getScheduleMetrics(DAG.SUnits);
LLVM_DEBUG(
dbgs()
<< "\n *********** AFTER UnclusteredHighRPStage ***********\n");
@@ -1516,13 +1519,9 @@ bool ClusteredLowOccStage::shouldRevertScheduling(unsigned WavesAfter) {
}
bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter) {
- if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
- return true;
-
- if (mayCauseSpilling(WavesAfter))
- return true;
-
- return false;
+ return GCNSchedStage::shouldRevertScheduling(WavesAfter) ||
+ mayCauseSpilling(WavesAfter) ||
+ (IncreaseOccupancy && WavesAfter < TargetOcc);
}
bool ILPInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) {
@@ -1615,237 +1614,307 @@ void GCNSchedStage::revertScheduling() {
DAG.Regions[RegionIdx] = std::pair(DAG.RegionBegin, DAG.RegionEnd);
}
-void PreRARematStage::collectRematerializableInstructions() {
+bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(DAG.TRI);
- for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
- Register Reg = Register::index2VirtReg(I);
- if (!DAG.LIS->hasInterval(Reg))
- continue;
-
- // TODO: Handle AGPR and SGPR rematerialization
- if (!SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
- !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
- continue;
- MachineOperand *Op = DAG.MRI.getOneDef(Reg);
- MachineInstr *Def = Op->getParent();
- if (Op->getSubReg() != 0 || !isTriviallyReMaterializable(*Def))
- continue;
-
- MachineInstr *UseI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
- if (Def->getParent() == UseI->getParent())
- continue;
-
- // We are only collecting defs that are defined in another block and are
- // live-through or used inside regions at MinOccupancy. This means that the
- // register must be in the live-in set for the region.
- bool AddedToRematList = false;
- for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
- auto It = DAG.LiveIns[I].find(Reg);
- if (It != DAG.LiveIns[I].end() && !It->second.none()) {
- if (DAG.RegionsWithMinOcc[I]) {
- RematerializableInsts[I][Def] = UseI;
- AddedToRematList = true;
- }
-
- // Collect regions with rematerializable reg as live-in to avoid
- // searching later when updating RP.
- RematDefToLiveInRegions[Def].push_back(I);
+ REMAT_DEBUG(dbgs() << "Collecting rematerializable instructions in "
+ << MF.getFunction().getName() << '\n');
+
+ // Maps optimizable regions (i.e., regions at minimum and VGPR-limited
+ // occupancy, or regions with VGPR spilling) to their excess RP.
+ DenseMap<unsigned, unsigned> OptRegions;
+
+ // Note that the maximum number of VGPRs to use to eliminate spill may be
+ // lower than the maximum number to increase occupancy when the function has
+ // the "amdgpu-num-vgpr" attribute.
+ const std::pair<unsigned, unsigned> OccBounds =
+ ST.getOccupancyWithWorkGroupSizes(MF);
+ // FIXME: we should be able to just call ST.getMaxNumArchVGPRs() but that
+ // would use the occupancy bounds as determined by
+ // MF.getFunction().getWavesPerEU(), which look incorrect in some cases.
+ const unsigned MaxVGPRsNoSpill =
+ ST.getBaseMaxNumVGPRs(MF.getFunction(),
+ {ST.getMinNumArchVGPRs(OccBounds.second),
+ ST.getMaxNumArchVGPRs(OccBounds.first)},
+ false);
+ const unsigned MaxVGPRsIncOcc = ST.getMaxNumArchVGPRs(DAG.MinOccupancy + 1);
+ IncreaseOccupancy = OccBounds.second > DAG.MinOccupancy;
+
+ // Collect optimizable regions. If there is spilling in any region we will
+ // just try to reduce it. Otherwise we will try to increase occupancy by one.
+ for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+ GCNRegPressure &RP = DAG.Pressure[I];
+ unsigned NumVGPRs = RP.getArchVGPRNum();
+ unsigned ExcessRP = 0;
+ if (NumVGPRs > MaxVGPRsNoSpill) {
+ if (IncreaseOccupancy) {
+ // We won't try to increase occupancy.
+ IncreaseOccupancy = false;
+ OptRegions.clear();
+ }
+ // Region has VGPR spilling, we will try to reduce spilling as much as
+ // possible.
+ ExcessRP = NumVGPRs - MaxVGPRsNoSpill;
+ REMAT_DEBUG(dbgs() << "Region " << I << " is spilling VGPRs, save "
+ << ExcessRP << " VGPR(s) to eliminate spilling\n");
+ } else if (IncreaseOccupancy) {
+ if (ST.getOccupancyWithNumSGPRs(RP.getSGPRNum()) == DAG.MinOccupancy) {
+ // Occupancy is SGPR-limited in the region, no point in trying to
+ // increase it through VGPR usage.
+ IncreaseOccupancy = false;
+ OptRegions.clear();
+ } else if (NumVGPRs > MaxVGPRsIncOcc) {
+ // Occupancy is VGPR-limited.
+ ExcessRP = NumVGPRs - MaxVGPRsIncOcc;
+ REMAT_DEBUG(dbgs() << "Region " << I << " has min. occupancy: save "
+ << ExcessRP << " VGPR(s) to improve occupancy\n");
}
}
- if (!AddedToRematList)
- RematDefToLiveInRegions.erase(Def);
+ if (ExcessRP)
+ OptRegions.insert({I, ExcessRP});
}
-}
-
-bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
- const TargetInstrInfo *TII) {
- // Temporary copies of cached variables we will be modifying and replacing if
- // sinking succeeds.
- SmallVector<
- std::pair<MachineBasicBlock::iterator, MachineBasicBlock::iterator>, 32>
- NewRegions;
- DenseMap<unsigned, GCNRPTracker::LiveRegSet> NewLiveIns;
- DenseMap<unsigned, GCNRegPressure> NewPressure;
- BitVector NewRescheduleRegions;
- LiveIntervals *LIS = DAG.LIS;
-
- NewRegions.resize(DAG.Regions.size());
- NewRescheduleRegions.resize(DAG.Regions.size());
-
- // Collect only regions that has a rematerializable def as a live-in.
- SmallSet<unsigned, 16> ImpactedRegions;
- for (const auto &It : RematDefToLiveInRegions)
- ImpactedRegions.insert(It.second.begin(), It.second.end());
-
- // Make copies of register pressure and live-ins cache that will be updated
- // as we rematerialize.
- for (auto Idx : ImpactedRegions) {
- NewPressure[Idx] = DAG.Pressure[Idx];
- NewLiveIns[Idx] = DAG.LiveIns[Idx];
- }
- NewRegions = DAG.Regions;
- NewRescheduleRegions.reset();
-
- DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
- bool Improved = false;
- for (auto I : ImpactedRegions) {
- if (!DAG.RegionsWithMinOcc[I])
- continue;
+ if (OptRegions.empty())
+ return false;
- Improved = false;
- int VGPRUsage = NewPressure[I].getVGPRNum(ST.hasGFX90AInsts());
- int SGPRUsage = NewPressure[I].getSGPRNum();
+ // When we are reducing spilling, the target is the minimum achievable
+ // occupancy implied by workgroup sizes.
+ TargetOcc = IncreaseOccupancy ? DAG.MinOccupancy + 1 : OccBounds.first;
+
+ // Accounts for a reduction in RP in an optimizable region. Returns whether we
+ // estimate that we have identified enough rematerialization opportunities to
+ // achieve our goal.
+ auto ReduceRPInRegion = [&](auto OptIt, LaneBitmask Mask) -> bool {
+ auto NumRegs = SIRegisterInfo::getNumCoveredRegs(Mask);
+ unsigned I = OptIt->getFirst();
+ unsigned &Excess = OptIt->getSecond();
+ if (NumRegs >= Excess)
+ OptRegions.erase(I);
+ else
+ Excess -= NumRegs;
+ return OptRegions.empty();
+ };
+
+ // We need up-to-date live-out info. to query live-out register masks in
+ // regions containing rematerializable instructions.
+ DAG.RegionLiveOuts.buildLiveRegMap();
+
+ for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+ auto Region = DAG.Regions[I];
+ for (auto MI = Region.first; MI != Region.second; ++MI) {
+ // The instruction must be trivially rematerializable.
+ MachineInstr &DefMI = *MI;
+ if (!isTriviallyReMaterializable(DefMI))
+ continue;
- // TODO: Handle occupancy drop due to AGPR and SGPR.
- // Check if cause of occupancy drop is due to VGPR usage and not SGPR.
- if (ST.getOccupancyWithNumSGPRs(SGPRUsage) == DAG.MinOccupancy)
- break;
+ // We only support rematerializing virtual VGPRs with one definition.
+ Register Reg = DefMI.getOperand(0).getReg();
+ if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
+ !SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
+ !DAG.MRI.hasOneDef(Reg))
+ continue;
- // The occupancy of this region could have been improved by a previous
- // iteration's sinking of defs.
- if (NewPressure[I].getOccupancy(ST) > DAG.MinOccupancy) {
- NewRescheduleRegions[I] = true;
- Improved = true;
- continue;
- }
+ // We only care to rematerialize the instruction if it has a single
+ // non-debug user in a different block.
+ MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
+ if (!UseMI || DefMI.getParent() == UseMI->getParent())
+ continue;
- // First check if we have enough trivially rematerializable instructions to
- // improve occupancy. Optimistically assume all instructions we are able to
- // sink decreased RP.
- int TotalSinkableRegs = 0;
- for (const auto &It : RematerializableInsts[I]) {
- MachineInstr *Def = It.first;
- Register DefReg = Def->getOperand(0).getReg();
- TotalSinkableRegs +=
- SIRegisterInfo::getNumCoveredRegs(NewLiveIns[I][DefReg]);
- }
- int VGPRsAfterSink = VGPRUsage - TotalSinkableRegs;
- unsigned OptimisticOccupancy = ST.getOccupancyWithNumVGPRs(VGPRsAfterSink);
- // If in the most optimistic scenario, we cannot improve occupancy, then do
- // not attempt to sink any instructions.
- if (OptimisticOccupancy <= DAG.MinOccupancy)
- break;
+ REMAT_DEBUG(dbgs() << "In region " << I
+ << ", rematerializable instruction " << DefMI);
+ auto &Remat = Rematerializations.emplace_back(&DefMI, I, UseMI);
+
+ bool RematUseful = false;
+ if (auto It = OptRegions.find(I); It != OptRegions.end()) {
+ // Optimistically consider that moving the instruction out of its
+ // defining region will reduce RP in the latter; this assumes that
+ // maximum RP in the region is reached somewhere between the defining
+ // instruction and the end of the region.
+ REMAT_DEBUG(
+ dbgs() << " Instruction's defining region is optimizable\n");
+ RematUseful = true;
+ if (ReduceRPInRegion(
+ It, DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg]))
+ return true;
+ }
- unsigned ImproveOccupancy = 0;
- SmallVector<MachineInstr *, 4> SinkedDefs;
- for (auto &It : RematerializableInsts[I]) {
- MachineInstr *Def = It.first;
- MachineBasicBlock::iterator InsertPos =
- MachineBasicBlock::iterator(It.second);
- Register Reg = Def->getOperand(0).getReg();
- // Rematerialize MI to its use block. Since we are only rematerializing
- // instructions that do not have any virtual reg uses, we do not need to
- // call LiveRangeEdit::allUsesAvailableAt() and
- // LiveRangeEdit::canRematerializeAt().
- TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
- Def->getOperand(0).getSubReg(), *Def, *DAG.TRI);
- MachineInstr *NewMI = &*std::prev(InsertPos);
- LIS->InsertMachineInstrInMaps(*NewMI);
- LIS->removeInterval(Reg);
- LIS->createAndComputeVirtRegInterval(Reg);
- InsertedMIToOldDef[NewMI] = Def;
-
- // Update region boundaries in scheduling region we sinked from since we
- // may sink an instruction that was at the beginning or end of its region
- DAG.updateRegionBoundaries(NewRegions, Def, /*NewMI =*/nullptr,
- /*Removing =*/true);
-
- // Update region boundaries in region we sinked to.
- DAG.updateRegionBoundaries(NewRegions, InsertPos, NewMI);
-
- LaneBitmask PrevMask = NewLiveIns[I][Reg];
- // FIXME: Also update cached pressure for where the def was sinked from.
- // Update RP for all regions that has this reg as a live-in and remove
- // the reg from all regions as a live-in.
- ...
[truncated]
The typo in GCNSubtarget::getMaxNumArchVGPRs made the function return the wrong value in all cases. The returned value is now correct; this causes some unit tests to break in largely cosmetic ways, since that value is used to calculate excess RP in the remat stage.
@@ -6047,9 +7328,9 @@ body: |
  bb.4:
    S_NOP 0, implicit %52, implicit %62
In a few cases here and below I moved this instruction to the top of the region to allow rematerializations to increase occupancy. Otherwise the instruction defining %60 is rematerialized at the beginning of the region, which doesn't decrease maximum RP in the region, even though my optimistic heuristics estimate that it should have been enough. Ideally, I think the scheduler should be able to reorder instructions in such cases, but it does not. Alternatively, we could make the stage perform more rematerializations when it realizes pre-rescheduling that it did not achieve the expected goal.
- Properly account for AGPRs in excess RP calculations.
- Account for SGPR spilling when deciding whether to try to increase occupancy.
- Don't roll back when rescheduling managed to improve occupancy after remats.
✅ With the latest revision this PR passed the C/C++ code formatter.
This prevents rolling back rematerializations in case rescheduling improved occupancy even further than the target. Also fix format.
HasRematDependency = true;
break;
}
if (NumAGPRs > MaxVGPRs) {
Can you set `ExcessArchVGPRs = NumArchVGPRs` -- we can spill AGPR to VGPR.
Didn't know that, thanks.
Actually, on second thought, I'm not sure about this. I think our best bet is to assume perfect allocation -- meaning we need `NumAGPRs - MaxVGPRs` available ArchVGPRs.
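A minimal sketch of how that "perfect allocation" suggestion could translate into the excess computation; the names mirror the snippet under discussion, but the helper itself and its placement are assumptions, not the PR's actual code:

// Sketch only (assumed shape, not the PR's implementation): under a perfect
// split of the unified register file, AGPR demand beyond MaxVGPRs must be
// covered by ArchVGPRs, so it is added to the ArchVGPR demand before
// computing the excess the remat stage would try to eliminate.
static unsigned computeExcessArchVGPRs(unsigned NumArchVGPRs,
                                       unsigned NumAGPRs, unsigned MaxVGPRs) {
  unsigned Demand = NumArchVGPRs;
  if (NumAGPRs > MaxVGPRs)
    Demand += NumAGPRs - MaxVGPRs; // AGPR overflow needs ArchVGPR slots.
  return Demand > MaxVGPRs ? Demand - MaxVGPRs : 0;
}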
Now that single-MI regions are no longer filtered out by the machine scheduler, we can greatly simplify the logic to update region boundaries during rematerialization/rollback, at the cost of a single walk through the MF's instructions at the beginning of the stage.
- Allow remat in the same block if the using MI is in a different region. Add unit test.
- Take into account excess ArchVGPR/AGPR pressure above addressable limits. Add unit tests to make sure this works.
- Fix unused register in existing unit test.
Also update one recently changed unit test from main.
This fails in some complex use cases, and it is not clear whether it is meaningfully faster than just letting the LIS object recompute it from scratch. Also removed a stale comment in isTriviallyReMaterializable: we can now remat MIs with virtual register uses.
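For reference, the "recompute from scratch" pattern being referred to, as it appears elsewhere in this patch; the wrapper function below is only an illustrative sketch, not code from the PR:

// Drop the stale live interval of the rematerialized register and let
// LiveIntervals rebuild it, instead of updating it incrementally.
static void recomputeLiveInterval(llvm::LiveIntervals &LIS,
                                  llvm::MachineInstr &NewMI,
                                  llvm::Register Reg) {
  LIS.InsertMachineInstrInMaps(NewMI);      // index the rematerialized copy
  LIS.removeInterval(Reg);                  // discard the old interval
  LIS.createAndComputeVirtRegInterval(Reg); // recompute it from scratch
}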
AMD internal CI/CD passed with the last patch.
// We may be able to save one more whole ArchVGPR allocation granule.
if (NumRegs >= ArchVGPRsToAlignment) {
  NumSavedRegs += Granule;
  ArchVGPRsToAlignment = Granule - (NumRegs - ArchVGPRsToAlignment);
I think `NumRegs - ArchVGPRsToAlignment` can be greater than `Granule`.
From the code above, unless I am mistaken, we should get `NumRegs := NumRegs - (NumRegs / Granule) * Granule == NumRegs % Granule`, so `NumRegs < Granule`. Additionally, we have `NumRegs >= ArchVGPRsToAlignment` from the if condition and `Granule >= ArchVGPRsToAlignment > 0` by construction. Hence, `NumRegs - ArchVGPRsToAlignment < Granule`, and the new `ArchVGPRsToAlignment` stays within `(0, Granule]`.
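A minimal standalone check of that bound; the names mirror the snippet under review, while the helper and assertions are only illustrative:

#include <cassert>

// Recomputes the alignment gap for the case discussed above and asserts the
// claimed bound: with NumRegs reduced modulo Granule and
// 0 < ArchVGPRsToAlignment <= Granule, the new gap stays in (0, Granule].
unsigned newAlignmentGap(unsigned NumRegs, unsigned Granule,
                         unsigned ArchVGPRsToAlignment) {
  NumRegs %= Granule; // per the reasoning above, NumRegs < Granule from here
  assert(ArchVGPRsToAlignment > 0 && ArchVGPRsToAlignment <= Granule);
  assert(NumRegs >= ArchVGPRsToAlignment && "case taken in the snippet above");
  unsigned Gap = Granule - (NumRegs - ArchVGPRsToAlignment);
  // 0 <= NumRegs - ArchVGPRsToAlignment < Granule, hence 0 < Gap <= Granule.
  assert(Gap > 0 && Gap <= Granule);
  return Gap;
}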
; GISEL-GFX942-NEXT: v_accvgpr_write_b32 a4, v13 ; Reload Reuse
; GISEL-GFX942-NEXT: buffer_store_dwordx4 v[2:5], v62, s[4:7], 0 offen
; GISEL-GFX942-NEXT: buffer_store_dwordx4 v[6:9], v62, s[4:7], 0 offen offset:16
; GISEL-GFX942-NEXT: v_mov_b32_e32 v1, 0x2000
What happened here? Looks like our rematerialization did not help with spilling
I checked the debug output. The upstream implementation identifies `v_mov_b32_e32 v1, 0x2000` as a candidate for rematerialization but eventually does not remat anything because it determines it won't be able to improve occupancy. My implementation remats the instruction to reduce the identified spilling (4 VGPRs) in the region below (starting on line 448) by 1 VGPR. That's not enough to eliminate spilling, but it reduces it.
Okay -- I took a look at this test. The scheduler sees that after the rematerialization we have improved the VGPR pressure, which should theoretically help with spilling. However, the heuristic which decides whether we are spilling or not is wrong, as I mentioned in my other comment.
This is a bit unclear from just looking at the resultant code because, during RA, we are doing a suboptimal job and thus have spilling. This is due to how RA partitions the register file between AGPRs and VGPRs. We find our minimum allowable wavesPerEU is 4, so we have 128 vector registers, but RA splits this into 64 AGPRs and 64 VGPRs. So, ironically, we have a limit of 64 ArchVGPRs, which is the same as the incorrect limit imposed by requiring an occupancy of 8 during scheduling. However, this is an accident and should be fixed; I think Matt is working on improving this. Also, if we use `attributes #0 = { "amdgpu-agpr-alloc"="0" }` we avoid the RA issue (and thus have no spilling), but the remat heuristics still think we are going to spill.
Thanks a lot for looking into this. With the spilling heuristic change you suggested the test now behaves as before.
namespace {
/// Models excess register pressure in a region and tracks our progress as we
/// identify rematerialization opportunities.
struct ExcessRP {
It would be nice if we could build better ExcessRP queries into GCNRegPressure and rewrite this heuristic to use those queries in combination with `GCNRegPressure.inc()`. This will help with code maintenance -- maybe as a follow-up PR.
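One hypothetical shape such a query could take; this is purely an illustration of the suggestion and not an existing GCNRegPressure API (only `getArchVGPRNum()` below is taken from the patch):

#include "GCNRegPressure.h"
using namespace llvm;

// Illustrative follow-up idea only: let callers ask a GCNRegPressure value
// directly how far a region is over a given ArchVGPR budget, so the remat
// stage could pair it with pressure updates instead of keeping its own
// ExcessRP bookkeeping.
static unsigned excessArchVGPRs(const GCNRegPressure &RP,
                                unsigned MaxArchVGPRs) {
  unsigned NumArchVGPRs = RP.getArchVGPRNum(); // also used by this patch
  return NumArchVGPRs > MaxArchVGPRs ? NumArchVGPRs - MaxArchVGPRs : 0;
}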
// The boundaries of an empty region may not point to a valid MI from which we
// can get the parent MBB, so we have to iterate over all the MF's MBBs.
Shouldn't need to do this, the region should know what blocks it covers directly?
Just to clarify my rationale for this. When we end up emptying a region completely through rematerializations, the region boundaries end up as `Region.first == Region.second` (`== BB.end()` if the region is the last in the BB). AFAIU I cannot retrieve the BB from the boundaries alone in this case, hence this linear search.
To avoid this linear search I added `PreRARematStage::RegionBB`, which stores the parent MBB of each region before remats.
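A small sketch of the situation described above and of the fallback that `RegionBB` enables; the container type and helper below are illustrative assumptions, not the patch's exact code:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include <utility>
using namespace llvm;

using RegionBoundaries =
    std::pair<MachineBasicBlock::iterator, MachineBasicBlock::iterator>;

// If rematerialization emptied the region, its two iterators coincide and may
// equal MBB.end(), so they cannot be dereferenced to find the parent block;
// fall back to the block recorded per region before any rematerialization.
static MachineBasicBlock *getRegionBlock(const RegionBoundaries &Region,
                                         unsigned RegionIdx,
                                         ArrayRef<MachineBasicBlock *> RegionBB) {
  if (Region.first != Region.second)
    return Region.first->getParent(); // non-empty region: ask its first MI
  return RegionBB[RegionIdx];         // empty region: use the recorded MBB
}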
Additionally, this fixes a bug where MIRegion would only be filled after collecting rematerializable instructions, even though the collection phase depends on MIRegion to determine whether the user of an MI is located in a different region from that MI. This does not break any test.
Using the base versions was necessary when we were not using the default range of waves/EU. We use the default range now so we can directly use the "regular" methods.
Co-authored-by: Matt Arsenault <[email protected]>
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/18130

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/17917

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/18163
I reverted this patch locally and that got rid of the issue.
#125885 did not update the test.
It's just a few instructions that moved; I regenerated the test and committed it directly: 382a085
Thank you
Thanks @Pierre-vh!
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/16108

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/26826

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/18608

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/56/builds/25326
@lucas-rami, are you working on a fix or should this be reverted?
Is this not the same issue solved by 382a085 and reported by a different builder that checked out LLVM at my commit (before the fix) each time? Line 509 of
Apologies if I am missing something here.
Sorry. I missed that the fix happened here: #125885 (comment)
I get sanitizer errors when building at this commit. Here's the asan error:
Yes, the same on bots.
@lucas-rami we had to revert to fix the bots.
Sounds good, thanks for the ping and sorry for the faulty commit, working on addressing the issue.
…ling (llvm#125885)

AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to roll back any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rolled back.
…ng scheduling (#125885)" (#139548)

This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to issues detected by the address sanitizer (MIs have to be removed from live intervals before being removed from their parent MBB). Original commit description below.

AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to roll back any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rolled back.
AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to roll back any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rolled back.
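A condensed sketch of this two-step flow, paraphrased from the `initGCNSchedStage` changes in the diff above (illustration only, not the verbatim implementation):

// Step 1 (analysis and collection): decide whether we are reducing spilling
// or targeting one extra wave, and collect just enough remat candidates to
// (optimistically) reach that goal. Step 2 (rematerialization): sink the
// collected instructions and update the scheduler's cached state once.
bool PreRARematStage::initGCNSchedStage() {
  if (!GCNSchedStage::initGCNSchedStage() || DAG.RegionsWithMinOcc.none() ||
      DAG.Regions.size() == 1 || !canIncreaseOccupancyOrReduceSpill())
    return false;
  rematerialize();
  // No rollback of cached state is needed here; if rescheduling still fails
  // to reach the target occupancy, shouldRevertScheduling() undoes the work.
  if (AchievedOcc > DAG.MinOccupancy)
    DAG.MinOccupancy = AchievedOcc;
  return true;
}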
This replaces #118722, which seems to be stuck in an infinite processing loop.