[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling #125885


Merged
merged 20 commits into llvm:main from pre-ra-remat-refactor on May 8, 2025

Conversation

lucas-rami
Contributor

The AMDGPU scheduler's PreRARematStage attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it has, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

  • Analysis and collection (canIncreaseOccupancyOrReduceSpill). The number of ArchVGPRs to save in order to reduce spilling, or to increase function occupancy by one when there is no spilling, is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to achieve that goal (according to slightly optimistic heuristics). If there aren't enough such instructions, the scheduling stage stops here.
  • Rematerialization (rematerialize). Instructions collected in the first step are rematerialized one-by-one. At this point we can directly update the scheduler's state, since the occupancy analysis is already done and we know we won't have to roll back any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision. (A simplified sketch of this two-step flow follows this list.)
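
Below is a minimal, self-contained sketch of this two-step structure. All types, names, and numbers are illustrative stand-ins for this description only, not the actual GCNSchedStrategy data structures or APIs; the function names mirror the step names above.

```cpp
// Simplified model of the two-step PreRARematStage flow described above.
#include <cstdio>
#include <map>
#include <vector>

struct Region {
  unsigned ArchVGPRPressure; // max ArchVGPR pressure in the region
};

struct RematCandidate {
  unsigned DefRegion; // region containing the trivially remat'able def
  unsigned SavedRegs; // ArchVGPRs optimistically freed in DefRegion
};

// Step 1: compute how many ArchVGPRs each "optimizable" region must save and
// collect candidates until the goal is (optimistically) reachable.
static bool
canIncreaseOccupancyOrReduceSpill(const std::vector<Region> &Regions,
                                  unsigned MaxVGPRsForTarget,
                                  const std::vector<RematCandidate> &AllCands,
                                  std::vector<RematCandidate> &Selected) {
  std::map<unsigned, unsigned> OptRegions; // region index -> excess RP
  for (unsigned I = 0; I < Regions.size(); ++I)
    if (Regions[I].ArchVGPRPressure > MaxVGPRsForTarget)
      OptRegions[I] = Regions[I].ArchVGPRPressure - MaxVGPRsForTarget;
  if (OptRegions.empty())
    return false; // nothing to optimize

  for (const RematCandidate &Cand : AllCands) {
    auto It = OptRegions.find(Cand.DefRegion);
    if (It == OptRegions.end())
      continue;
    Selected.push_back(Cand);
    if (Cand.SavedRegs >= It->second)
      OptRegions.erase(It); // region's goal optimistically met
    else
      It->second -= Cand.SavedRegs;
    if (OptRegions.empty())
      return true; // enough candidates collected; stop early
  }
  return false; // not enough rematerializable instructions
}

// Step 2: commit the rematerializations; RP for impacted regions is
// recomputed once here rather than after every single sink.
static void rematerialize(std::vector<Region> &Regions,
                          const std::vector<RematCandidate> &Selected) {
  for (const RematCandidate &Cand : Selected)
    Regions[Cand.DefRegion].ArchVGPRPressure -= Cand.SavedRegs;
}

int main() {
  std::vector<Region> Regions = {{40}, {36}, {28}};
  std::vector<RematCandidate> Cands = {{0, 4}, {1, 2}, {1, 2}};
  std::vector<RematCandidate> Selected;
  if (canIncreaseOccupancyOrReduceSpill(Regions, /*MaxVGPRsForTarget=*/36,
                                        Cands, Selected))
    rematerialize(Regions, Selected);
  for (unsigned I = 0; I < Regions.size(); ++I)
    std::printf("region %u: %u ArchVGPRs\n", I, Regions[I].ArchVGPRPressure);
  return 0;
}
```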

In the case where the stage attempted to increase occupancy, if neither the rematerializations alone nor rescheduling afterwards were able to improve occupancy, then all rematerializations are rolled back.

This replaces #118722, which seems to be stuck in an infinite processing loop.

@llvmbot
Member

llvmbot commented Feb 5, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes


Patch is 264.49 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125885.diff

11 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineRegisterInfo.h (+4)
  • (modified) llvm/lib/CodeGen/MachineRegisterInfo.cpp (+5)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+412-273)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+83-29)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.cpp (+26-27)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+169-431)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp (+27-13)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h (+15)
  • (added) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-attr.mir (+221)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-debug.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (+1328-51)
diff --git a/llvm/include/llvm/CodeGen/MachineRegisterInfo.h b/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
index 4fddc2033b81b7c..4dd7b20a68a0a96 100644
--- a/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
@@ -23,6 +23,7 @@
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBundle.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/RegisterBank.h"
@@ -592,6 +593,9 @@ class MachineRegisterInfo {
   /// multiple uses.
   bool hasOneNonDBGUser(Register RegNo) const;
 
+  /// If the register has a single non-Debug instruction using the specified
+  /// register, returns it; otherwise returns nullptr.
+  MachineInstr *getOneNonDBGUser(Register RegNo) const;
 
   /// hasAtMostUses - Return true if the given register has at most \p MaxUsers
   /// non-debug user instructions.
diff --git a/llvm/lib/CodeGen/MachineRegisterInfo.cpp b/llvm/lib/CodeGen/MachineRegisterInfo.cpp
index 937f63f6c5e0046..b7135251781ad98 100644
--- a/llvm/lib/CodeGen/MachineRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/MachineRegisterInfo.cpp
@@ -432,6 +432,11 @@ bool MachineRegisterInfo::hasOneNonDBGUser(Register RegNo) const {
   return hasSingleElement(use_nodbg_instructions(RegNo));
 }
 
+MachineInstr *MachineRegisterInfo::getOneNonDBGUser(Register RegNo) const {
+  auto RegNoDbgUsers = use_nodbg_instructions(RegNo);
+  return hasSingleElement(RegNoDbgUsers) ? &*RegNoDbgUsers.begin() : nullptr;
+}
+
 bool MachineRegisterInfo::hasAtMostUserInstrs(Register Reg,
                                               unsigned MaxUsers) const {
   return hasNItemsOrLess(use_instr_nodbg_begin(Reg), use_instr_nodbg_end(),
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 6e693066de10b6b..ceaf710f5d47fd3 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -25,8 +25,13 @@
 
 #include "GCNSchedStrategy.h"
 #include "AMDGPUIGroupLP.h"
+#include "GCNRegPressure.h"
 #include "SIMachineFunctionInfo.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/STLExtras.h"
 #include "llvm/CodeGen/RegisterClassInfo.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Support/ErrorHandling.h"
 
 #define DEBUG_TYPE "machine-scheduler"
 
@@ -307,11 +312,11 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
     HasHighPressure = true;
     if (SGPRDelta > VGPRDelta) {
       Cand.RPDelta.CriticalMax =
-        PressureChange(AMDGPU::RegisterPressureSets::SReg_32);
+          PressureChange(AMDGPU::RegisterPressureSets::SReg_32);
       Cand.RPDelta.CriticalMax.setUnitInc(SGPRDelta);
     } else {
       Cand.RPDelta.CriticalMax =
-        PressureChange(AMDGPU::RegisterPressureSets::VGPR_32);
+          PressureChange(AMDGPU::RegisterPressureSets::VGPR_32);
       Cand.RPDelta.CriticalMax.setUnitInc(VGPRDelta);
     }
   }
@@ -324,7 +329,7 @@ void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
                                          const RegPressureTracker &RPTracker,
                                          SchedCandidate &Cand,
                                          bool IsBottomUp) {
-  const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo*>(TRI);
+  const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(TRI);
   ArrayRef<unsigned> Pressure = RPTracker.getRegSetPressureAtPos();
   unsigned SGPRPressure = 0;
   unsigned VGPRPressure = 0;
@@ -420,7 +425,7 @@ SUnit *GCNSchedStrategy::pickNodeBidirectional(bool &IsTopNode) {
       pickNodeFromQueue(Top, TopPolicy, DAG->getTopRPTracker(), TCand,
                         /*IsBottomUp=*/false);
       assert(TCand.SU == TopCand.SU &&
-           "Last pick result should correspond to re-picking right now");
+             "Last pick result should correspond to re-picking right now");
     }
 #endif
   }
@@ -890,13 +895,13 @@ GCNScheduleDAGMILive::getRegionLiveInMap() const {
   std::vector<MachineInstr *> RegionFirstMIs;
   RegionFirstMIs.reserve(Regions.size());
   auto I = Regions.rbegin(), E = Regions.rend();
-  auto *BB = I->first->getParent();
   do {
+    const MachineBasicBlock *MBB = I->first->getParent();
     auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
     RegionFirstMIs.push_back(MI);
     do {
       ++I;
-    } while (I != E && I->first->getParent() == BB);
+    } while (I != E && I->first->getParent() == MBB);
   } while (I != E);
   return getLiveRegMap(RegionFirstMIs, /*After=*/false, *LIS);
 }
@@ -1081,31 +1086,32 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
   return true;
 }
 
-bool PreRARematStage::initGCNSchedStage() {
-  if (!GCNSchedStage::initGCNSchedStage())
-    return false;
-
-  if (DAG.RegionsWithMinOcc.none() || DAG.Regions.size() == 1)
-    return false;
-
-  const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
-  // Rematerialization will not help if occupancy is not limited by reg usage.
-  if (ST.getOccupancyWithWorkGroupSizes(MF).second == DAG.MinOccupancy)
-    return false;
+/// Allows to easily filter for this stage's debug output.
+#define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << "[PreRARemat] "; X;)
 
-  // FIXME: This pass will invalidate cached MBBLiveIns for regions
-  // inbetween the defs and region we sinked the def to. Cached pressure
-  // for regions where a def is sinked from will also be invalidated. Will
-  // need to be fixed if there is another pass after this pass.
+bool PreRARematStage::initGCNSchedStage() {
+  // FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
+  // regions inbetween the defs and region we sinked the def to. Will need to be
+  // fixed if there is another pass after this pass.
   assert(!S.hasNextStage());
 
-  collectRematerializableInstructions();
-  if (RematerializableInsts.empty() || !sinkTriviallyRematInsts(ST, TII))
+  if (!GCNSchedStage::initGCNSchedStage() || DAG.RegionsWithMinOcc.none() ||
+      DAG.Regions.size() == 1 || !canIncreaseOccupancyOrReduceSpill())
     return false;
 
-  LLVM_DEBUG(
-      dbgs() << "Retrying function scheduling with improved occupancy of "
-             << DAG.MinOccupancy << " from rematerializing\n");
+  // Rematerialize identified instructions and update scheduler's state.
+  rematerialize();
+  if (GCNTrackers)
+    DAG.RegionLiveOuts.buildLiveRegMap();
+  REMAT_DEBUG(
+      dbgs() << "Retrying function scheduling with new min. occupancy of "
+             << AchievedOcc << " from rematerializing (original was "
+             << DAG.MinOccupancy << ", target was " << TargetOcc << ")\n");
+  if (AchievedOcc > DAG.MinOccupancy) {
+    DAG.MinOccupancy = AchievedOcc;
+    SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
+    MFI.increaseOccupancy(MF, DAG.MinOccupancy);
+  }
   return true;
 }
 
@@ -1397,8 +1403,7 @@ GCNSchedStage::getScheduleMetrics(const std::vector<SUnit> &InputSchedule) {
 #ifndef NDEBUG
   LLVM_DEBUG(
       printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
+      dbgs() << "\n\t" << "Metric: "
              << (SumBubbles
                      ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
                      : 1)
@@ -1433,8 +1438,7 @@ GCNSchedStage::getScheduleMetrics(const GCNScheduleDAGMILive &DAG) {
 #ifndef NDEBUG
   LLVM_DEBUG(
       printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
+      dbgs() << "\n\t" << "Metric: "
              << (SumBubbles
                      ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
                      : 1)
@@ -1482,8 +1486,7 @@ bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter) {
       dbgs()
       << "\n\t      *** In shouldRevertScheduling ***\n"
       << "      *********** BEFORE UnclusteredHighRPStage ***********\n");
-  ScheduleMetrics MBefore =
-      getScheduleMetrics(DAG.SUnits);
+  ScheduleMetrics MBefore = getScheduleMetrics(DAG.SUnits);
   LLVM_DEBUG(
       dbgs()
       << "\n      *********** AFTER UnclusteredHighRPStage ***********\n");
@@ -1516,13 +1519,9 @@ bool ClusteredLowOccStage::shouldRevertScheduling(unsigned WavesAfter) {
 }
 
 bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter) {
-  if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
-    return true;
-
-  if (mayCauseSpilling(WavesAfter))
-    return true;
-
-  return false;
+  return GCNSchedStage::shouldRevertScheduling(WavesAfter) ||
+         mayCauseSpilling(WavesAfter) ||
+         (IncreaseOccupancy && WavesAfter < TargetOcc);
 }
 
 bool ILPInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) {
@@ -1615,237 +1614,307 @@ void GCNSchedStage::revertScheduling() {
   DAG.Regions[RegionIdx] = std::pair(DAG.RegionBegin, DAG.RegionEnd);
 }
 
-void PreRARematStage::collectRematerializableInstructions() {
+bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
   const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(DAG.TRI);
-  for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
-    Register Reg = Register::index2VirtReg(I);
-    if (!DAG.LIS->hasInterval(Reg))
-      continue;
-
-    // TODO: Handle AGPR and SGPR rematerialization
-    if (!SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
-        !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
-      continue;
 
-    MachineOperand *Op = DAG.MRI.getOneDef(Reg);
-    MachineInstr *Def = Op->getParent();
-    if (Op->getSubReg() != 0 || !isTriviallyReMaterializable(*Def))
-      continue;
-
-    MachineInstr *UseI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
-    if (Def->getParent() == UseI->getParent())
-      continue;
-
-    // We are only collecting defs that are defined in another block and are
-    // live-through or used inside regions at MinOccupancy. This means that the
-    // register must be in the live-in set for the region.
-    bool AddedToRematList = false;
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      auto It = DAG.LiveIns[I].find(Reg);
-      if (It != DAG.LiveIns[I].end() && !It->second.none()) {
-        if (DAG.RegionsWithMinOcc[I]) {
-          RematerializableInsts[I][Def] = UseI;
-          AddedToRematList = true;
-        }
-
-        // Collect regions with rematerializable reg as live-in to avoid
-        // searching later when updating RP.
-        RematDefToLiveInRegions[Def].push_back(I);
+  REMAT_DEBUG(dbgs() << "Collecting rematerializable instructions in "
+                     << MF.getFunction().getName() << '\n');
+
+  // Maps optimizable regions (i.e., regions at minimum and VGPR-limited
+  // occupancy, or regions with VGPR spilling) to their excess RP.
+  DenseMap<unsigned, unsigned> OptRegions;
+
+  // Note that the maximum number of VGPRs to use to eliminate spill may be
+  // lower than the maximum number to increase occupancy when the function has
+  // the "amdgpu-num-vgpr" attribute.
+  const std::pair<unsigned, unsigned> OccBounds =
+      ST.getOccupancyWithWorkGroupSizes(MF);
+  // FIXME: we should be able to just call ST.getMaxNumArchVGPRs() but that
+  // would use the occupancy bounds as determined by
+  // MF.getFunction().getWavesPerEU(), which look incorrect in some cases.
+  const unsigned MaxVGPRsNoSpill =
+      ST.getBaseMaxNumVGPRs(MF.getFunction(),
+                            {ST.getMinNumArchVGPRs(OccBounds.second),
+                             ST.getMaxNumArchVGPRs(OccBounds.first)},
+                            false);
+  const unsigned MaxVGPRsIncOcc = ST.getMaxNumArchVGPRs(DAG.MinOccupancy + 1);
+  IncreaseOccupancy = OccBounds.second > DAG.MinOccupancy;
+
+  // Collect optimizable regions. If there is spilling in any region we will
+  // just try to reduce it. Otherwise we will try to increase occupancy by one.
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    GCNRegPressure &RP = DAG.Pressure[I];
+    unsigned NumVGPRs = RP.getArchVGPRNum();
+    unsigned ExcessRP = 0;
+    if (NumVGPRs > MaxVGPRsNoSpill) {
+      if (IncreaseOccupancy) {
+        // We won't try to increase occupancy.
+        IncreaseOccupancy = false;
+        OptRegions.clear();
+      }
+      // Region has VGPR spilling, we will try to reduce spilling as much as
+      // possible.
+      ExcessRP = NumVGPRs - MaxVGPRsNoSpill;
+      REMAT_DEBUG(dbgs() << "Region " << I << " is spilling VGPRs, save "
+                         << ExcessRP << " VGPR(s) to eliminate spilling\n");
+    } else if (IncreaseOccupancy) {
+      if (ST.getOccupancyWithNumSGPRs(RP.getSGPRNum()) == DAG.MinOccupancy) {
+        // Occupancy is SGPR-limited in the region, no point in trying to
+        // increase it through VGPR usage.
+        IncreaseOccupancy = false;
+        OptRegions.clear();
+      } else if (NumVGPRs > MaxVGPRsIncOcc) {
+        // Occupancy is VGPR-limited.
+        ExcessRP = NumVGPRs - MaxVGPRsIncOcc;
+        REMAT_DEBUG(dbgs() << "Region " << I << " has min. occupancy: save "
+                           << ExcessRP << " VGPR(s) to improve occupancy\n");
       }
     }
-    if (!AddedToRematList)
-      RematDefToLiveInRegions.erase(Def);
+    if (ExcessRP)
+      OptRegions.insert({I, ExcessRP});
   }
-}
-
-bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
-                                              const TargetInstrInfo *TII) {
-  // Temporary copies of cached variables we will be modifying and replacing if
-  // sinking succeeds.
-  SmallVector<
-      std::pair<MachineBasicBlock::iterator, MachineBasicBlock::iterator>, 32>
-      NewRegions;
-  DenseMap<unsigned, GCNRPTracker::LiveRegSet> NewLiveIns;
-  DenseMap<unsigned, GCNRegPressure> NewPressure;
-  BitVector NewRescheduleRegions;
-  LiveIntervals *LIS = DAG.LIS;
-
-  NewRegions.resize(DAG.Regions.size());
-  NewRescheduleRegions.resize(DAG.Regions.size());
-
-  // Collect only regions that has a rematerializable def as a live-in.
-  SmallSet<unsigned, 16> ImpactedRegions;
-  for (const auto &It : RematDefToLiveInRegions)
-    ImpactedRegions.insert(It.second.begin(), It.second.end());
-
-  // Make copies of register pressure and live-ins cache that will be updated
-  // as we rematerialize.
-  for (auto Idx : ImpactedRegions) {
-    NewPressure[Idx] = DAG.Pressure[Idx];
-    NewLiveIns[Idx] = DAG.LiveIns[Idx];
-  }
-  NewRegions = DAG.Regions;
-  NewRescheduleRegions.reset();
-
-  DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
-  bool Improved = false;
-  for (auto I : ImpactedRegions) {
-    if (!DAG.RegionsWithMinOcc[I])
-      continue;
+  if (OptRegions.empty())
+    return false;
 
-    Improved = false;
-    int VGPRUsage = NewPressure[I].getVGPRNum(ST.hasGFX90AInsts());
-    int SGPRUsage = NewPressure[I].getSGPRNum();
+  // When we are reducing spilling, the target is the minimum achievable
+  // occupancy implied by workgroup sizes.
+  TargetOcc = IncreaseOccupancy ? DAG.MinOccupancy + 1 : OccBounds.first;
+
+  // Accounts for a reduction in RP in an optimizable region. Returns whether we
+  // estimate that we have identified enough rematerialization opportunities to
+  // achieve our goal.
+  auto ReduceRPInRegion = [&](auto OptIt, LaneBitmask Mask) -> bool {
+    auto NumRegs = SIRegisterInfo::getNumCoveredRegs(Mask);
+    unsigned I = OptIt->getFirst();
+    unsigned &Excess = OptIt->getSecond();
+    if (NumRegs >= Excess)
+      OptRegions.erase(I);
+    else
+      Excess -= NumRegs;
+    return OptRegions.empty();
+  };
+
+  // We need up-to-date live-out info. to query live-out register masks in
+  // regions containing rematerializable instructions.
+  DAG.RegionLiveOuts.buildLiveRegMap();
+
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    auto Region = DAG.Regions[I];
+    for (auto MI = Region.first; MI != Region.second; ++MI) {
+      // The instruction must be trivially rematerializable.
+      MachineInstr &DefMI = *MI;
+      if (!isTriviallyReMaterializable(DefMI))
+        continue;
 
-    // TODO: Handle occupancy drop due to AGPR and SGPR.
-    // Check if cause of occupancy drop is due to VGPR usage and not SGPR.
-    if (ST.getOccupancyWithNumSGPRs(SGPRUsage) == DAG.MinOccupancy)
-      break;
+      // We only support rematerializing virtual VGPRs with one definition.
+      Register Reg = DefMI.getOperand(0).getReg();
+      if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
+          !SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
+          !DAG.MRI.hasOneDef(Reg))
+        continue;
 
-    // The occupancy of this region could have been improved by a previous
-    // iteration's sinking of defs.
-    if (NewPressure[I].getOccupancy(ST) > DAG.MinOccupancy) {
-      NewRescheduleRegions[I] = true;
-      Improved = true;
-      continue;
-    }
+      // We only care to rematerialize the instruction if it has a single
+      // non-debug user in a different block.
+      MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
+      if (!UseMI || DefMI.getParent() == UseMI->getParent())
+        continue;
 
-    // First check if we have enough trivially rematerializable instructions to
-    // improve occupancy. Optimistically assume all instructions we are able to
-    // sink decreased RP.
-    int TotalSinkableRegs = 0;
-    for (const auto &It : RematerializableInsts[I]) {
-      MachineInstr *Def = It.first;
-      Register DefReg = Def->getOperand(0).getReg();
-      TotalSinkableRegs +=
-          SIRegisterInfo::getNumCoveredRegs(NewLiveIns[I][DefReg]);
-    }
-    int VGPRsAfterSink = VGPRUsage - TotalSinkableRegs;
-    unsigned OptimisticOccupancy = ST.getOccupancyWithNumVGPRs(VGPRsAfterSink);
-    // If in the most optimistic scenario, we cannot improve occupancy, then do
-    // not attempt to sink any instructions.
-    if (OptimisticOccupancy <= DAG.MinOccupancy)
-      break;
+      REMAT_DEBUG(dbgs() << "In region " << I
+                         << ", rematerializable instruction " << DefMI);
+      auto &Remat = Rematerializations.emplace_back(&DefMI, I, UseMI);
+
+      bool RematUseful = false;
+      if (auto It = OptRegions.find(I); It != OptRegions.end()) {
+        // Optimistically consider that moving the instruction out of its
+        // defining region will reduce RP in the latter; this assumes that
+        // maximum RP in the region is reached somewhere between the defining
+        // instruction and the end of the region.
+        REMAT_DEBUG(
+            dbgs() << "  Instruction's defining region is optimizable\n");
+        RematUseful = true;
+        if (ReduceRPInRegion(
+                It, DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg]))
+          return true;
+      }
 
-    unsigned ImproveOccupancy = 0;
-    SmallVector<MachineInstr *, 4> SinkedDefs;
-    for (auto &It : RematerializableInsts[I]) {
-      MachineInstr *Def = It.first;
-      MachineBasicBlock::iterator InsertPos =
-          MachineBasicBlock::iterator(It.second);
-      Register Reg = Def->getOperand(0).getReg();
-      // Rematerialize MI to its use block. Since we are only rematerializing
-      // instructions that do not have any virtual reg uses, we do not need to
-      // call LiveRangeEdit::allUsesAvailableAt() and
-      // LiveRangeEdit::canRematerializeAt().
-      TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
-                         Def->getOperand(0).getSubReg(), *Def, *DAG.TRI);
-      MachineInstr *NewMI = &*std::prev(InsertPos);
-      LIS->InsertMachineInstrInMaps(*NewMI);
-      LIS->removeInterval(Reg);
-      LIS->createAndComputeVirtRegInterval(Reg);
-      InsertedMIToOldDef[NewMI] = Def;
-
-      // Update region boundaries in scheduling region we sinked from since we
-      // may sink an instruction that was at the beginning or end of its region
-      DAG.updateRegionBoundaries(NewRegions, Def, /*NewMI =*/nullptr,
-                                 /*Removing =*/true);
-
-      // Update region boundaries in region we sinked to.
-      DAG.updateRegionBoundaries(NewRegions, InsertPos, NewMI);
-
-      LaneBitmask PrevMask = NewLiveIns[I][Reg];
-      // FIXME: Also update cached pressure for where the def was sinked from.
-      // Update RP for all regions that has this reg as a live-in and remove
-      // the reg from all regions as a live-in.
-      ...
[truncated]

The typo in GCNSubtarget::getMaxNumArchVGPRs made the function return
the wrong value in all cases. The returned value is now correct; this
causes some unit tests to break in largely cosmetic ways, since that
value is used to calculate excess RP in the remat stage.
@lucas-rami force-pushed the pre-ra-remat-refactor branch from 69be687 to 9f49681 on February 10, 2025 at 12:47
@@ -6047,9 +7328,9 @@ body: |

bb.4:

S_NOP 0, implicit %52, implicit %62
Contributor Author

In a few cases here and below I moved this instruction to the top of the region to allow rematerializations to increase occupancy. Otherwise the instruction defining %60 is rematerialized at the beginning of the region which doesn't decrease maximum RP in the region, even though my optimistic heuristics estimate that it should have been enough. Ideally, I think the scheduler should be able to reorder instructions in such cases, but it does not. Alternatively, we could make the stage perform more rematerializations when it realizes pre-rescheduling that it did not achieve the expected goal.

- Properly account for AGPRs in excess RP calculations.
- Account for SGPR spilling when deciding whether to try to increase
  occupancy.
- Don't roll back when rescheduling managed to improve occupancy after
  remats.

github-actions bot commented Feb 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

This prevents rolling back rematerializations in case rescheduling
improved occupancy even further than the target.

Also fix format.
HasRematDependency = true;
break;
}
if (NumAGPRs > MaxVGPRs) {
Contributor

Can you set ExcessArchVGPRs = NumArchVGPRs -- we can spill AGPR to VGPR

Contributor Author

Didn't know that, thanks.

Contributor
@jrbyrnes Mar 11, 2025

Actually, on second thought, not sure about this...

I think our best bet is to assume perfect allocation -- meaning we need NumAGPRs - MaxVGPRs available ArchVGPRs.
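
A minimal, self-contained illustration of the reviewer's suggestion above; the function and variable names here are hypothetical stand-ins, not the actual variables in GCNSchedStrategy.cpp.

```cpp
#include <cstdio>

// Sketch of the "perfect allocation" assumption: AGPRs only add ArchVGPR
// pressure once they exceed the vector-register budget, so the extra
// ArchVGPRs needed would be NumAGPRs - MaxVGPRs, clamped at zero.
static unsigned excessArchVGPRsFromAGPRs(unsigned NumAGPRs, unsigned MaxVGPRs) {
  return NumAGPRs > MaxVGPRs ? NumAGPRs - MaxVGPRs : 0;
}

int main() {
  std::printf("%u\n", excessArchVGPRsFromAGPRs(72, 64)); // prints 8
  return 0;
}
```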

Now that single-MI regions are no longer filtered out by the machine
scheduler, we can greatly simplify the logic to update region boundaries
during rematerialization/rollback, at the cost of a single walk
through the MF's instructions at the beginning of the stage.

- Allow remat in the same block if the using MI is in a different region. Add
  unit test.
- Take into account excess ArchVGPR/AGPR pressure above addressable
  limits. Add unit tests to make sure this works.
- Fix unused register in existing unit test.
Also update one recently changed unit test from main.
This fails in some complex use cases, and it is not clear whether it is
meaningfully faster than just letting the LIS object recompute it from
scratch.

Also removed a stale comment in isTriviallyReMaterializable: we can now
remat MIs with virtual register uses.
@lucas-rami
Contributor Author

AMD internal CI/CD passed with the last patch.

// We may be able to save one more whole ArchVGPR allocation granule.
if (NumRegs >= ArchVGPRsToAlignment) {
NumSavedRegs += Granule;
ArchVGPRsToAlignment = Granule - (NumRegs - ArchVGPRsToAlignment);
Contributor

I think NumRegs - ArchVGPRsToAlignment can be greater than Granule

Contributor Author

From above, unless I am mistaken, we should get NumRegs := NumRegs - (NumRegs / Granule) * Granule == NumRegs % Granule, so NumRegs < Granule.

Additionally, we have NumRegs >= ArchVGPRsToAlignment from the if condition and Granule >= ArchVGPRsToAlignment > 0 by construction. Hence,

$$NumRegs < Granule \Longrightarrow NumRegs - ArchVGPRsToAlignment < Granule - ArchVGPRsToAlignment < Granule$$
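
A tiny stand-alone check of that bound; the granule value below is illustrative and the code only models the reduction argued above, not the actual scheduler helper.

```cpp
#include <cassert>

// Verifies: with NumRegs reduced modulo Granule (so NumRegs < Granule) and
// 0 < ArchVGPRsToAlignment <= Granule, the updated value
// Granule - (NumRegs - ArchVGPRsToAlignment) stays in (0, Granule].
int main() {
  const unsigned Granule = 4; // illustrative granule size
  for (unsigned NumRegs = 0; NumRegs < 64; ++NumRegs) {
    unsigned Rem = NumRegs % Granule; // Rem < Granule
    for (unsigned Align = 1; Align <= Granule; ++Align) {
      if (Rem >= Align) {
        unsigned Updated = Granule - (Rem - Align);
        assert(Updated > 0 && Updated <= Granule);
      }
    }
  }
  return 0;
}
```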

; GISEL-GFX942-NEXT: v_accvgpr_write_b32 a4, v13 ; Reload Reuse
; GISEL-GFX942-NEXT: buffer_store_dwordx4 v[2:5], v62, s[4:7], 0 offen
; GISEL-GFX942-NEXT: buffer_store_dwordx4 v[6:9], v62, s[4:7], 0 offen offset:16
; GISEL-GFX942-NEXT: v_mov_b32_e32 v1, 0x2000
Contributor

What happened here? Looks like our rematerialization did not help with spilling

Contributor Author

I checked the debug output. The upstream implementation identifies v_mov_b32_e32 v1, 0x2000 as a candidate for rematerialization but eventually does not remat anything because it determines it won't be able to improve occupancy. My implementation remats the instruction to reduce identified spilling (4 VGPRs) in the region below (starting on line 448) by 1 VGPR. That's not enough to eliminate spilling but it reduces it.

Contributor

Okay -- I took a look at this test. The scheduler sees that after the rematerialization, we have improved the VGPR pressure which should theoretically help with spilling. However, the heuristic which decides if we are spilling or not is wrong, which I have mentioned in my other comment.

This is a bit unclear from just looking at the resultant code because, during RA, we are doing a suboptimal job and thus have spilling. This is due to how RA partitions the register file between AGPRs and VGPRs. We find our minimum allowable wavesPerEU is 4, so we have 128 vector registers, but RA splits this into 64 AGPRs and 64 VGPRs. So, ironically, we have a limit of 64 ArchVGPRs, which is the same as the incorrect limit imposed by requiring an occupancy of 8 during scheduling. However, this is an accident and should be fixed, as I think Matt is working on improving this. Also, if we use attributes #0 = { "amdgpu-agpr-alloc"="0" } we avoid the RA issue (and thus have no spilling), but the remat heuristics still think we are going to spill.

Contributor Author

Thanks a lot for looking into this. With the spilling heuristic change you suggested the test now behaves as before.

namespace {
/// Models excess register pressure in a region and tracks our progress as we
/// identify rematerialization opportunities.
struct ExcessRP {
Contributor

It would be nice if we could build better ExcessRP queries into GCNRegPressure and rewrite this heuristic to use those queries in combination with GCNRegPressure.inc(). This will help with code maintenance -- maybe as a follow-up PR.

Comment on lines 2207 to 2208
// The boundaries of an empty region may not point to a valid MI from which we
// can get the parent MBB, so we have to iterate over all the MF's MBBs.
Contributor

Shouldn't need to do this, the region should know what blocks it covers directly?

Contributor Author

Just to clarify my rationale for this. When we end up emptying a region completely through rematerializations, the region boundaries end up as Region.first == Region.second (== BB.end() if the region is the last in the BB). AFAIU I cannot retrieve the BB from the boundaries alone in this case, hence this linear search.

To avoid this linear search I added PreRARematStage::RegionBB, which stores the parent MBB of each region before remats.
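
A minimal sketch of that idea with stand-in types; the real code uses MachineBasicBlock pointers and the stage's region boundary iterators, not the simplified structures below.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in types for illustration only.
struct MBB {};
using RegionBoundary = std::pair<int, int>; // [first, second) instruction idx

// Recording each region's parent block up front means a region that later
// becomes empty (first == second) can still be mapped back to its block
// without a linear walk over all blocks in the function.
struct RegionInfoCache {
  std::vector<RegionBoundary> Regions;
  std::vector<MBB *> RegionBB; // parent block of each region, recorded once

  MBB *getParentBlock(unsigned RegionIdx) const {
    assert(RegionIdx < RegionBB.size());
    return RegionBB[RegionIdx]; // valid even if the region is now empty
  }
};

int main() {
  MBB Block;
  RegionInfoCache Cache;
  Cache.Regions.push_back({3, 3}); // region emptied by rematerialization
  Cache.RegionBB.push_back(&Block);
  assert(Cache.getParentBlock(0) == &Block);
  return 0;
}
```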

lucas-rami added 2 commits May 5, 2025 22:11
Additionally fixes a bug where MIRegion would only be filled after
collecting rematerializable instructions; however, the collection phase
depends on MIRegion to determine whether the user of an MI is located in
a different region from that MI. This does not break any test.
Using the base versions was necessary when we were not using the
default range of waves/EU. We use the default range now so we can
directly use the "regular" methods.
@lucas-rami merged commit 067caaa into llvm:main May 8, 2025
6 of 9 checks passed
@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-devrel-x86-64 running on ml-opt-devrel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/18130

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-rel-x86-64 running on ml-opt-rel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/17917

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-rel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
+ /b/ml-opt-rel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
/b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-dev-x86-64 running on ml-opt-dev-x86-64-b1 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/18163

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-dev-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/ml-opt-dev-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/ml-opt-dev-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/ml-opt-dev-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@jplehr
Contributor

jplehr commented May 8, 2025

I reverted this patch locally and that got rid of the issue.
Not sure if this can be fixed forward or needs a revert.

Pierre-vh added a commit that referenced this pull request May 8, 2025
@Pierre-vh
Contributor

It's just a few instructions that moved; I regenerated the test and committed it directly: 382a085

@jplehr
Contributor

jplehr commented May 8, 2025

Thank you

@lucas-rami
Contributor Author

Thanks @Pierre-vh!

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-ubuntu-fast running on as-builder-4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/16108

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder llvm-x86_64-debian-dylib running on gribozavr4 while building llvm at step 7 "test-build-unified-tree-check-llvm".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/26826

Here is the relevant piece of the build log for the reference
Step 7 (test-build-unified-tree-check-llvm) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/llvm-x86_64-debian-dylib/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/1/llvm-x86_64-debian-dylib/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-expensive-checks-debian running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/18608

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder clang-x86_64-debian-fast running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/56/builds/25326

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/clang-x86_64-debian-fast/llvm.obj/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/1/clang-x86_64-debian-fast/llvm.obj/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/1/clang-x86_64-debian-fast/llvm.obj/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/1/clang-x86_64-debian-fast/llvm.obj/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@hubert-reinterpretcast
Collaborator

@lucas-rami, are you working on a fix or should this be reverted?

@lucas-rami
Contributor Author

Is this not the same issue solved by 382a085, reported by different builders that each checked out LLVM at my commit (before the fix)? Line 509 of mfma-loop.ll on main is currently not the one reported by the CI output (; GFX908-NEXT: s_mov_b32 s0, 16); it is now:

; GFX908-NEXT: v_mov_b32_e32 v0, 1.0

ninja check-llvm-codegen-amdgpu also reports no error when I build close to top-of-tree (d7987f1).

Apologies if I am missing something here.

@hubert-reinterpretcast
Collaborator

Is this not the same issue solved by 382a085 and reported by a different builder that checked out LLVM at my commit (before the fix) each time?

Sorry. I missed that the fix happened here: #125885 (comment)

@vitalybuka
Collaborator

@slackito
Collaborator

I get sanitizer errors when building at this commit with -DLLVM_USE_SANITIZER=Address and then running ninja check-llvm-codegen-amdgpu.

Here's the asan error:

==3331703==ERROR: AddressSanitizer: use-after-poison on address 0x7d39064ec8d4 at pc 0x55b90dc9ac25 bp 0x7ffff9c71550 sp 0x7ffff9c71548
READ of size 4 at 0x7d39064ec8d4 thread T0
    #0 0x55b90dc9ac24 in llvm::MachineInstr::getFlag(llvm::MachineInstr::MIFlag) const /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstr.h:409:12
    #1 0x55b90dc9ac24 in llvm::MachineInstr::isBundledWithPred() const /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstr.h:487:43
    #2 0x55b90dc9ac24 in llvm::SlotIndexes::removeMachineInstrFromMaps(llvm::MachineInstr&, bool) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SlotIndexes.cpp:129:3
    #3 0x55b909936baa in llvm::LiveIntervals::RemoveMachineInstrFromMaps(llvm::MachineInstr&) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/LiveIntervals.h:296:14
    #4 0x55b909936baa in llvm::PreRARematStage::rematerialize() /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:2090:14
    #5 0x55b9099314ef in llvm::PreRARematStage::initGCNSchedStage() /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:1117:3
    #6 0x55b90992e90a in llvm::GCNScheduleDAGMILive::runSchedStages() /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:966:17
    #7 0x55b90d79d811 in llvm::impl_detail::MachineSchedulerBase::scheduleRegions(llvm::ScheduleDAGInstrs&, bool) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineScheduler.cpp:825:13
    #8 0x55b90d79b8f3 in llvm::impl_detail::MachineSchedulerImpl::run(llvm::MachineFunction&, llvm::TargetMachine const&, llvm::impl_detail::MachineSchedulerImpl::RequiredAnalyses const&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineScheduler.cpp:484:3
    #9 0x55b90d7cdd43 in (anonymous namespace)::MachineSchedulerLegacy::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineScheduler.cpp:576:15
    #10 0x55b90d5d5377 in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineFunctionPass.cpp:108:10
    #11 0x55b90e2d524a in llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1406:27
    #12 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:180:25
    #13 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:469:9
    #14 0x55b90c933051 in (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:534:18
    #15 0x55b90e2d61be in (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1521:27
    #16 0x55b90e2d61be in llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:539:44
    #17 0x55b90870ff3e in compileModule(char**, llvm::LLVMContext&) /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:753:8
    #18 0x55b90870af9e in main /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:400:22
    #19 0x7f2907033ca7 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #20 0x7f2907033d64 in __libc_start_main csu/../csu/libc-start.c:360:3
    #21 0x55b908622930 in _start (/dev/shm/jgorbe-code/lldb-build/bin/llc+0x905b930)

0x7d39064ec8d4 is located 4052 bytes inside of 4096-byte region [0x7d39064eb900,0x7d39064ec900)
allocated by thread T0 here:
    #0 0x55b908709712 in operator new(unsigned long, std::align_val_t, std::nothrow_t const&) /usr/local/google/home/jgorbe/forks/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:104:3
    #1 0x55b90faa24af in llvm::allocate_buffer(unsigned long, unsigned long) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Support/MemAlloc.cpp:16:18
    #2 0x55b908bef425 in llvm::MallocAllocator::Allocate(unsigned long, unsigned long) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/AllocatorBase.h:92:12
    #3 0x55b908bef425 in llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>::StartNewSlab() /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Allocator.h:346:42
    #4 0x55b908bef1a9 in llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>::AllocateSlow(unsigned long, unsigned long, llvm::Align) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Allocator.h:202:5
    #5 0x55b90d5b6625 in llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>::Allocate(unsigned long, unsigned long) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Allocator.h:216:12
    #6 0x55b90d5b6625 in llvm::MachineInstr* llvm::Recycler<llvm::MachineInstr, 80ul, 8ul>::Allocate<llvm::MachineInstr, llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>>(llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>&) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Recycler.h:95:57
    #7 0x55b90d5b6625 in llvm::MachineFunction::CreateMachineInstr(llvm::MCInstrDesc const&, llvm::DebugLoc, bool) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineFunction.cpp:429:35
    #8 0x55b908ca3454 in llvm::BuildMI(llvm::MachineBasicBlock&, llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, llvm::MIMetadata const&, llvm::MCInstrDesc const&, llvm::Register) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstrBuilder.h:397:25
    #9 0x55b90f3f0a78 in llvm::BuildMI(llvm::MachineBasicBlock*, llvm::MIMetadata const&, llvm::MCInstrDesc const&, llvm::Register) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstrBuilder.h:498:10
    #10 0x55b90f3f0a78 in llvm::FunctionLoweringInfo::set(llvm::Function const&, llvm::MachineFunction&, llvm::SelectionDAG*) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp:307:11
    #11 0x55b90f6afb74 in llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:579:13
    #12 0x55b9094a470a in llvm::AMDGPUDAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:128:28
    #13 0x55b90f6a9adf in llvm::SelectionDAGISelLegacy::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:374:20
    #14 0x55b90d5d5377 in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineFunctionPass.cpp:108:10
    #15 0x55b90e2d524a in llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1406:27
    #16 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:180:25
    #17 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:469:9
    #18 0x55b90c933051 in (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:534:18
    #19 0x55b90e2d61be in (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1521:27
    #20 0x55b90e2d61be in llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:539:44
    #21 0x55b90870ff3e in compileModule(char**, llvm::LLVMContext&) /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:753:8
    #22 0x55b90870af9e in main /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:400:22
    #23 0x7f2907033ca7 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: use-after-poison /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstr.h:409:12 in llvm::MachineInstr::getFlag(llvm::MachineInstr::MIFlag) const
Shadow bytes around the buggy address:
  0x7d39064ec600: f7 00 00 00 00 00 00 00 00 00 00 f7 00 00 00 00
  0x7d39064ec680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7d39064ec700: 00 00 00 00 00 00 00 00 00 00 00 00 f7 00 00 00
  0x7d39064ec780: 00 00 00 00 00 00 00 f7 00 00 00 00 00 00 00 00
  0x7d39064ec800: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 00 00 00 00
=>0x7d39064ec880: 00 00 00 00 f7 f7 f7 f7 f7 f7[f7]f7 f7 f7 f7 f7
  0x7d39064ec900: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064ec980: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064eca00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064eca80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064ecb00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3331703==ABORTING
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /dev/shm/jgorbe-code/lldb-build/bin/FileCheck -check-prefix=GCN -check-prefix=SI /dev/shm/jgorbe-code/llvm/llvm/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll

@vitalybuka
Copy link
Collaborator

Yes, the same on bots.

vitalybuka added a commit that referenced this pull request May 10, 2025
…g scheduling (#125885)" (#139341)

And related "[AMDGPU] Regenerate mfma-loop.ll test"

These introduce a memory error detected by ASan (#125885).

This reverts commit 382a085.
This reverts commit 067caaa.
@vitalybuka
Copy link
Collaborator

@lucas-rami we had to revert to fix the bots.

@lucas-rami
Copy link
Contributor Author

Sounds good. Thanks for the ping, and sorry for the faulty commit; I am working on addressing the issue.

lucas-rami added a commit to lucas-rami/llvm-project that referenced this pull request May 12, 2025
…ling (llvm#125885)

lucas-rami added a commit that referenced this pull request May 13, 2025
…ng scheduling (#125885)" (#139548)

This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to
issues detected by the address sanitizer (MIs have to be removed from
live intervals before being removed from their parent MBB).
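
The parenthetical above describes an ordering constraint in the LiveIntervals update: the MachineInstr must be dropped from the SlotIndexes/LiveIntervals maps while it is still linked into its parent MachineBasicBlock, and only erased afterwards. Erasing it first lets its storage be recycled (and poisoned under ASan), which is what the use-after-poison trace above reports inside RemoveMachineInstrFromMaps. Below is a minimal sketch of the safe order, using a hypothetical helper name; this is not the code from the patch.

#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/MachineInstr.h"

// Hypothetical helper illustrating the deletion order required when
// LiveIntervals is being kept up to date.
static void eraseInstrWithLiveIntervals(llvm::LiveIntervals &LIS,
                                        llvm::MachineInstr &MI) {
  // 1. Remove the instruction from the slot-index/live-interval maps while
  //    it is still attached to its parent MachineBasicBlock.
  LIS.RemoveMachineInstrFromMaps(MI);
  // 2. Only then unlink and delete it; after this call the MachineInstr's
  //    storage may be recycled, so it must not be queried again.
  MI.eraseFromParent();
}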
