[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling #125885


Merged
merged 20 commits into llvm:main from pre-ra-remat-refactor on May 8, 2025

Conversation

lucas-rami
Contributor

The AMDGPU scheduler's PreRARematStage attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instructions next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it has, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks.

  • Analysis and collection (canIncreaseOccupancyOrReduceSpill). The number of ArchVGPRs to save in order to reduce spilling, or to increase function occupancy by one when there is no spilling, is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to achieve that goal (according to slightly optimistic heuristics). If there aren't enough such instructions, the scheduling stage stops here.
  • Rematerialization (rematerialize). Instructions collected in the first step are rematerialized one-by-one. At this point we can directly update the scheduler's state, since the occupancy analysis is already done and we know we won't have to roll back any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision. (A simplified sketch of this two-step flow follows this list.)
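
Below is a minimal, self-contained sketch of this two-step structure. All types, names, and numbers are illustrative stand-ins for this description only, not the actual GCNSchedStrategy data structures or APIs; the function names mirror the step names above.

```cpp
// Simplified model of the two-step PreRARematStage flow described above.
#include <cstdio>
#include <map>
#include <vector>

struct Region {
  unsigned ArchVGPRPressure; // max ArchVGPR pressure in the region
};

struct RematCandidate {
  unsigned DefRegion; // region containing the trivially remat'able def
  unsigned SavedRegs; // ArchVGPRs optimistically freed in DefRegion
};

// Step 1: compute how many ArchVGPRs each "optimizable" region must save and
// collect candidates until the goal is (optimistically) reachable.
static bool
canIncreaseOccupancyOrReduceSpill(const std::vector<Region> &Regions,
                                  unsigned MaxVGPRsForTarget,
                                  const std::vector<RematCandidate> &AllCands,
                                  std::vector<RematCandidate> &Selected) {
  std::map<unsigned, unsigned> OptRegions; // region index -> excess RP
  for (unsigned I = 0; I < Regions.size(); ++I)
    if (Regions[I].ArchVGPRPressure > MaxVGPRsForTarget)
      OptRegions[I] = Regions[I].ArchVGPRPressure - MaxVGPRsForTarget;
  if (OptRegions.empty())
    return false; // nothing to optimize

  for (const RematCandidate &Cand : AllCands) {
    auto It = OptRegions.find(Cand.DefRegion);
    if (It == OptRegions.end())
      continue;
    Selected.push_back(Cand);
    if (Cand.SavedRegs >= It->second)
      OptRegions.erase(It); // region's goal optimistically met
    else
      It->second -= Cand.SavedRegs;
    if (OptRegions.empty())
      return true; // enough candidates collected; stop early
  }
  return false; // not enough rematerializable instructions
}

// Step 2: commit the rematerializations; RP for impacted regions is
// recomputed once here rather than after every single sink.
static void rematerialize(std::vector<Region> &Regions,
                          const std::vector<RematCandidate> &Selected) {
  for (const RematCandidate &Cand : Selected)
    Regions[Cand.DefRegion].ArchVGPRPressure -= Cand.SavedRegs;
}

int main() {
  std::vector<Region> Regions = {{40}, {36}, {28}};
  std::vector<RematCandidate> Cands = {{0, 4}, {1, 2}, {1, 2}};
  std::vector<RematCandidate> Selected;
  if (canIncreaseOccupancyOrReduceSpill(Regions, /*MaxVGPRsForTarget=*/36,
                                        Cands, Selected))
    rematerialize(Regions, Selected);
  for (unsigned I = 0; I < Regions.size(); ++I)
    std::printf("region %u: %u ArchVGPRs\n", I, Regions[I].ArchVGPRPressure);
  return 0;
}
```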

In the case where the stage attempted to increase occupancy, if neither the rematerializations alone nor rescheduling afterwards were able to improve occupancy, then all rematerializations are rolled back.

This replaces #118722, which seems to be stuck in an infinite processing loop.

@llvmbot
Member

llvmbot commented Feb 5, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes


Patch is 264.49 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125885.diff

11 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineRegisterInfo.h (+4)
  • (modified) llvm/lib/CodeGen/MachineRegisterInfo.cpp (+5)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+412-273)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+83-29)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.cpp (+26-27)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+169-431)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp (+27-13)
  • (modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h (+15)
  • (added) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-attr.mir (+221)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-debug.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (+1328-51)
diff --git a/llvm/include/llvm/CodeGen/MachineRegisterInfo.h b/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
index 4fddc2033b81b7c..4dd7b20a68a0a96 100644
--- a/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/MachineRegisterInfo.h
@@ -23,6 +23,7 @@
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBundle.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/RegisterBank.h"
@@ -592,6 +593,9 @@ class MachineRegisterInfo {
   /// multiple uses.
   bool hasOneNonDBGUser(Register RegNo) const;
 
+  /// If the register has a single non-Debug instruction using the specified
+  /// register, returns it; otherwise returns nullptr.
+  MachineInstr *getOneNonDBGUser(Register RegNo) const;
 
   /// hasAtMostUses - Return true if the given register has at most \p MaxUsers
   /// non-debug user instructions.
diff --git a/llvm/lib/CodeGen/MachineRegisterInfo.cpp b/llvm/lib/CodeGen/MachineRegisterInfo.cpp
index 937f63f6c5e0046..b7135251781ad98 100644
--- a/llvm/lib/CodeGen/MachineRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/MachineRegisterInfo.cpp
@@ -432,6 +432,11 @@ bool MachineRegisterInfo::hasOneNonDBGUser(Register RegNo) const {
   return hasSingleElement(use_nodbg_instructions(RegNo));
 }
 
+MachineInstr *MachineRegisterInfo::getOneNonDBGUser(Register RegNo) const {
+  auto RegNoDbgUsers = use_nodbg_instructions(RegNo);
+  return hasSingleElement(RegNoDbgUsers) ? &*RegNoDbgUsers.begin() : nullptr;
+}
+
 bool MachineRegisterInfo::hasAtMostUserInstrs(Register Reg,
                                               unsigned MaxUsers) const {
   return hasNItemsOrLess(use_instr_nodbg_begin(Reg), use_instr_nodbg_end(),
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 6e693066de10b6b..ceaf710f5d47fd3 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -25,8 +25,13 @@
 
 #include "GCNSchedStrategy.h"
 #include "AMDGPUIGroupLP.h"
+#include "GCNRegPressure.h"
 #include "SIMachineFunctionInfo.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/STLExtras.h"
 #include "llvm/CodeGen/RegisterClassInfo.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Support/ErrorHandling.h"
 
 #define DEBUG_TYPE "machine-scheduler"
 
@@ -307,11 +312,11 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
     HasHighPressure = true;
     if (SGPRDelta > VGPRDelta) {
       Cand.RPDelta.CriticalMax =
-        PressureChange(AMDGPU::RegisterPressureSets::SReg_32);
+          PressureChange(AMDGPU::RegisterPressureSets::SReg_32);
       Cand.RPDelta.CriticalMax.setUnitInc(SGPRDelta);
     } else {
       Cand.RPDelta.CriticalMax =
-        PressureChange(AMDGPU::RegisterPressureSets::VGPR_32);
+          PressureChange(AMDGPU::RegisterPressureSets::VGPR_32);
       Cand.RPDelta.CriticalMax.setUnitInc(VGPRDelta);
     }
   }
@@ -324,7 +329,7 @@ void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
                                          const RegPressureTracker &RPTracker,
                                          SchedCandidate &Cand,
                                          bool IsBottomUp) {
-  const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo*>(TRI);
+  const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(TRI);
   ArrayRef<unsigned> Pressure = RPTracker.getRegSetPressureAtPos();
   unsigned SGPRPressure = 0;
   unsigned VGPRPressure = 0;
@@ -420,7 +425,7 @@ SUnit *GCNSchedStrategy::pickNodeBidirectional(bool &IsTopNode) {
       pickNodeFromQueue(Top, TopPolicy, DAG->getTopRPTracker(), TCand,
                         /*IsBottomUp=*/false);
       assert(TCand.SU == TopCand.SU &&
-           "Last pick result should correspond to re-picking right now");
+             "Last pick result should correspond to re-picking right now");
     }
 #endif
   }
@@ -890,13 +895,13 @@ GCNScheduleDAGMILive::getRegionLiveInMap() const {
   std::vector<MachineInstr *> RegionFirstMIs;
   RegionFirstMIs.reserve(Regions.size());
   auto I = Regions.rbegin(), E = Regions.rend();
-  auto *BB = I->first->getParent();
   do {
+    const MachineBasicBlock *MBB = I->first->getParent();
     auto *MI = &*skipDebugInstructionsForward(I->first, I->second);
     RegionFirstMIs.push_back(MI);
     do {
       ++I;
-    } while (I != E && I->first->getParent() == BB);
+    } while (I != E && I->first->getParent() == MBB);
   } while (I != E);
   return getLiveRegMap(RegionFirstMIs, /*After=*/false, *LIS);
 }
@@ -1081,31 +1086,32 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
   return true;
 }
 
-bool PreRARematStage::initGCNSchedStage() {
-  if (!GCNSchedStage::initGCNSchedStage())
-    return false;
-
-  if (DAG.RegionsWithMinOcc.none() || DAG.Regions.size() == 1)
-    return false;
-
-  const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
-  // Rematerialization will not help if occupancy is not limited by reg usage.
-  if (ST.getOccupancyWithWorkGroupSizes(MF).second == DAG.MinOccupancy)
-    return false;
+/// Allows to easily filter for this stage's debug output.
+#define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << "[PreRARemat] "; X;)
 
-  // FIXME: This pass will invalidate cached MBBLiveIns for regions
-  // inbetween the defs and region we sinked the def to. Cached pressure
-  // for regions where a def is sinked from will also be invalidated. Will
-  // need to be fixed if there is another pass after this pass.
+bool PreRARematStage::initGCNSchedStage() {
+  // FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
+  // regions inbetween the defs and region we sinked the def to. Will need to be
+  // fixed if there is another pass after this pass.
   assert(!S.hasNextStage());
 
-  collectRematerializableInstructions();
-  if (RematerializableInsts.empty() || !sinkTriviallyRematInsts(ST, TII))
+  if (!GCNSchedStage::initGCNSchedStage() || DAG.RegionsWithMinOcc.none() ||
+      DAG.Regions.size() == 1 || !canIncreaseOccupancyOrReduceSpill())
     return false;
 
-  LLVM_DEBUG(
-      dbgs() << "Retrying function scheduling with improved occupancy of "
-             << DAG.MinOccupancy << " from rematerializing\n");
+  // Rematerialize identified instructions and update scheduler's state.
+  rematerialize();
+  if (GCNTrackers)
+    DAG.RegionLiveOuts.buildLiveRegMap();
+  REMAT_DEBUG(
+      dbgs() << "Retrying function scheduling with new min. occupancy of "
+             << AchievedOcc << " from rematerializing (original was "
+             << DAG.MinOccupancy << ", target was " << TargetOcc << ")\n");
+  if (AchievedOcc > DAG.MinOccupancy) {
+    DAG.MinOccupancy = AchievedOcc;
+    SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
+    MFI.increaseOccupancy(MF, DAG.MinOccupancy);
+  }
   return true;
 }
 
@@ -1397,8 +1403,7 @@ GCNSchedStage::getScheduleMetrics(const std::vector<SUnit> &InputSchedule) {
 #ifndef NDEBUG
   LLVM_DEBUG(
       printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
+      dbgs() << "\n\t" << "Metric: "
              << (SumBubbles
                      ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
                      : 1)
@@ -1433,8 +1438,7 @@ GCNSchedStage::getScheduleMetrics(const GCNScheduleDAGMILive &DAG) {
 #ifndef NDEBUG
   LLVM_DEBUG(
       printScheduleModel(ReadyCyclesSorted);
-      dbgs() << "\n\t"
-             << "Metric: "
+      dbgs() << "\n\t" << "Metric: "
              << (SumBubbles
                      ? (SumBubbles * ScheduleMetrics::ScaleFactor) / CurrCycle
                      : 1)
@@ -1482,8 +1486,7 @@ bool UnclusteredHighRPStage::shouldRevertScheduling(unsigned WavesAfter) {
       dbgs()
       << "\n\t      *** In shouldRevertScheduling ***\n"
       << "      *********** BEFORE UnclusteredHighRPStage ***********\n");
-  ScheduleMetrics MBefore =
-      getScheduleMetrics(DAG.SUnits);
+  ScheduleMetrics MBefore = getScheduleMetrics(DAG.SUnits);
   LLVM_DEBUG(
       dbgs()
       << "\n      *********** AFTER UnclusteredHighRPStage ***********\n");
@@ -1516,13 +1519,9 @@ bool ClusteredLowOccStage::shouldRevertScheduling(unsigned WavesAfter) {
 }
 
 bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter) {
-  if (GCNSchedStage::shouldRevertScheduling(WavesAfter))
-    return true;
-
-  if (mayCauseSpilling(WavesAfter))
-    return true;
-
-  return false;
+  return GCNSchedStage::shouldRevertScheduling(WavesAfter) ||
+         mayCauseSpilling(WavesAfter) ||
+         (IncreaseOccupancy && WavesAfter < TargetOcc);
 }
 
 bool ILPInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) {
@@ -1615,237 +1614,307 @@ void GCNSchedStage::revertScheduling() {
   DAG.Regions[RegionIdx] = std::pair(DAG.RegionBegin, DAG.RegionEnd);
 }
 
-void PreRARematStage::collectRematerializableInstructions() {
+bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
   const SIRegisterInfo *SRI = static_cast<const SIRegisterInfo *>(DAG.TRI);
-  for (unsigned I = 0, E = DAG.MRI.getNumVirtRegs(); I != E; ++I) {
-    Register Reg = Register::index2VirtReg(I);
-    if (!DAG.LIS->hasInterval(Reg))
-      continue;
-
-    // TODO: Handle AGPR and SGPR rematerialization
-    if (!SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
-        !DAG.MRI.hasOneDef(Reg) || !DAG.MRI.hasOneNonDBGUse(Reg))
-      continue;
 
-    MachineOperand *Op = DAG.MRI.getOneDef(Reg);
-    MachineInstr *Def = Op->getParent();
-    if (Op->getSubReg() != 0 || !isTriviallyReMaterializable(*Def))
-      continue;
-
-    MachineInstr *UseI = &*DAG.MRI.use_instr_nodbg_begin(Reg);
-    if (Def->getParent() == UseI->getParent())
-      continue;
-
-    // We are only collecting defs that are defined in another block and are
-    // live-through or used inside regions at MinOccupancy. This means that the
-    // register must be in the live-in set for the region.
-    bool AddedToRematList = false;
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      auto It = DAG.LiveIns[I].find(Reg);
-      if (It != DAG.LiveIns[I].end() && !It->second.none()) {
-        if (DAG.RegionsWithMinOcc[I]) {
-          RematerializableInsts[I][Def] = UseI;
-          AddedToRematList = true;
-        }
-
-        // Collect regions with rematerializable reg as live-in to avoid
-        // searching later when updating RP.
-        RematDefToLiveInRegions[Def].push_back(I);
+  REMAT_DEBUG(dbgs() << "Collecting rematerializable instructions in "
+                     << MF.getFunction().getName() << '\n');
+
+  // Maps optimizable regions (i.e., regions at minimum and VGPR-limited
+  // occupancy, or regions with VGPR spilling) to their excess RP.
+  DenseMap<unsigned, unsigned> OptRegions;
+
+  // Note that the maximum number of VGPRs to use to eliminate spill may be
+  // lower than the maximum number to increase occupancy when the function has
+  // the "amdgpu-num-vgpr" attribute.
+  const std::pair<unsigned, unsigned> OccBounds =
+      ST.getOccupancyWithWorkGroupSizes(MF);
+  // FIXME: we should be able to just call ST.getMaxNumArchVGPRs() but that
+  // would use the occupancy bounds as determined by
+  // MF.getFunction().getWavesPerEU(), which look incorrect in some cases.
+  const unsigned MaxVGPRsNoSpill =
+      ST.getBaseMaxNumVGPRs(MF.getFunction(),
+                            {ST.getMinNumArchVGPRs(OccBounds.second),
+                             ST.getMaxNumArchVGPRs(OccBounds.first)},
+                            false);
+  const unsigned MaxVGPRsIncOcc = ST.getMaxNumArchVGPRs(DAG.MinOccupancy + 1);
+  IncreaseOccupancy = OccBounds.second > DAG.MinOccupancy;
+
+  // Collect optimizable regions. If there is spilling in any region we will
+  // just try to reduce it. Otherwise we will try to increase occupancy by one.
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    GCNRegPressure &RP = DAG.Pressure[I];
+    unsigned NumVGPRs = RP.getArchVGPRNum();
+    unsigned ExcessRP = 0;
+    if (NumVGPRs > MaxVGPRsNoSpill) {
+      if (IncreaseOccupancy) {
+        // We won't try to increase occupancy.
+        IncreaseOccupancy = false;
+        OptRegions.clear();
+      }
+      // Region has VGPR spilling, we will try to reduce spilling as much as
+      // possible.
+      ExcessRP = NumVGPRs - MaxVGPRsNoSpill;
+      REMAT_DEBUG(dbgs() << "Region " << I << " is spilling VGPRs, save "
+                         << ExcessRP << " VGPR(s) to eliminate spilling\n");
+    } else if (IncreaseOccupancy) {
+      if (ST.getOccupancyWithNumSGPRs(RP.getSGPRNum()) == DAG.MinOccupancy) {
+        // Occupancy is SGPR-limited in the region, no point in trying to
+        // increase it through VGPR usage.
+        IncreaseOccupancy = false;
+        OptRegions.clear();
+      } else if (NumVGPRs > MaxVGPRsIncOcc) {
+        // Occupancy is VGPR-limited.
+        ExcessRP = NumVGPRs - MaxVGPRsIncOcc;
+        REMAT_DEBUG(dbgs() << "Region " << I << " has min. occupancy: save "
+                           << ExcessRP << " VGPR(s) to improve occupancy\n");
       }
     }
-    if (!AddedToRematList)
-      RematDefToLiveInRegions.erase(Def);
+    if (ExcessRP)
+      OptRegions.insert({I, ExcessRP});
   }
-}
-
-bool PreRARematStage::sinkTriviallyRematInsts(const GCNSubtarget &ST,
-                                              const TargetInstrInfo *TII) {
-  // Temporary copies of cached variables we will be modifying and replacing if
-  // sinking succeeds.
-  SmallVector<
-      std::pair<MachineBasicBlock::iterator, MachineBasicBlock::iterator>, 32>
-      NewRegions;
-  DenseMap<unsigned, GCNRPTracker::LiveRegSet> NewLiveIns;
-  DenseMap<unsigned, GCNRegPressure> NewPressure;
-  BitVector NewRescheduleRegions;
-  LiveIntervals *LIS = DAG.LIS;
-
-  NewRegions.resize(DAG.Regions.size());
-  NewRescheduleRegions.resize(DAG.Regions.size());
-
-  // Collect only regions that has a rematerializable def as a live-in.
-  SmallSet<unsigned, 16> ImpactedRegions;
-  for (const auto &It : RematDefToLiveInRegions)
-    ImpactedRegions.insert(It.second.begin(), It.second.end());
-
-  // Make copies of register pressure and live-ins cache that will be updated
-  // as we rematerialize.
-  for (auto Idx : ImpactedRegions) {
-    NewPressure[Idx] = DAG.Pressure[Idx];
-    NewLiveIns[Idx] = DAG.LiveIns[Idx];
-  }
-  NewRegions = DAG.Regions;
-  NewRescheduleRegions.reset();
-
-  DenseMap<MachineInstr *, MachineInstr *> InsertedMIToOldDef;
-  bool Improved = false;
-  for (auto I : ImpactedRegions) {
-    if (!DAG.RegionsWithMinOcc[I])
-      continue;
+  if (OptRegions.empty())
+    return false;
 
-    Improved = false;
-    int VGPRUsage = NewPressure[I].getVGPRNum(ST.hasGFX90AInsts());
-    int SGPRUsage = NewPressure[I].getSGPRNum();
+  // When we are reducing spilling, the target is the minimum achievable
+  // occupancy implied by workgroup sizes.
+  TargetOcc = IncreaseOccupancy ? DAG.MinOccupancy + 1 : OccBounds.first;
+
+  // Accounts for a reduction in RP in an optimizable region. Returns whether we
+  // estimate that we have identified enough rematerialization opportunities to
+  // achieve our goal.
+  auto ReduceRPInRegion = [&](auto OptIt, LaneBitmask Mask) -> bool {
+    auto NumRegs = SIRegisterInfo::getNumCoveredRegs(Mask);
+    unsigned I = OptIt->getFirst();
+    unsigned &Excess = OptIt->getSecond();
+    if (NumRegs >= Excess)
+      OptRegions.erase(I);
+    else
+      Excess -= NumRegs;
+    return OptRegions.empty();
+  };
+
+  // We need up-to-date live-out info. to query live-out register masks in
+  // regions containing rematerializable instructions.
+  DAG.RegionLiveOuts.buildLiveRegMap();
+
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    auto Region = DAG.Regions[I];
+    for (auto MI = Region.first; MI != Region.second; ++MI) {
+      // The instruction must be trivially rematerializable.
+      MachineInstr &DefMI = *MI;
+      if (!isTriviallyReMaterializable(DefMI))
+        continue;
 
-    // TODO: Handle occupancy drop due to AGPR and SGPR.
-    // Check if cause of occupancy drop is due to VGPR usage and not SGPR.
-    if (ST.getOccupancyWithNumSGPRs(SGPRUsage) == DAG.MinOccupancy)
-      break;
+      // We only support rematerializing virtual VGPRs with one definition.
+      Register Reg = DefMI.getOperand(0).getReg();
+      if (!Reg.isVirtual() || !DAG.LIS->hasInterval(Reg) ||
+          !SRI->isVGPRClass(DAG.MRI.getRegClass(Reg)) ||
+          !DAG.MRI.hasOneDef(Reg))
+        continue;
 
-    // The occupancy of this region could have been improved by a previous
-    // iteration's sinking of defs.
-    if (NewPressure[I].getOccupancy(ST) > DAG.MinOccupancy) {
-      NewRescheduleRegions[I] = true;
-      Improved = true;
-      continue;
-    }
+      // We only care to rematerialize the instruction if it has a single
+      // non-debug user in a different block.
+      MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
+      if (!UseMI || DefMI.getParent() == UseMI->getParent())
+        continue;
 
-    // First check if we have enough trivially rematerializable instructions to
-    // improve occupancy. Optimistically assume all instructions we are able to
-    // sink decreased RP.
-    int TotalSinkableRegs = 0;
-    for (const auto &It : RematerializableInsts[I]) {
-      MachineInstr *Def = It.first;
-      Register DefReg = Def->getOperand(0).getReg();
-      TotalSinkableRegs +=
-          SIRegisterInfo::getNumCoveredRegs(NewLiveIns[I][DefReg]);
-    }
-    int VGPRsAfterSink = VGPRUsage - TotalSinkableRegs;
-    unsigned OptimisticOccupancy = ST.getOccupancyWithNumVGPRs(VGPRsAfterSink);
-    // If in the most optimistic scenario, we cannot improve occupancy, then do
-    // not attempt to sink any instructions.
-    if (OptimisticOccupancy <= DAG.MinOccupancy)
-      break;
+      REMAT_DEBUG(dbgs() << "In region " << I
+                         << ", rematerializable instruction " << DefMI);
+      auto &Remat = Rematerializations.emplace_back(&DefMI, I, UseMI);
+
+      bool RematUseful = false;
+      if (auto It = OptRegions.find(I); It != OptRegions.end()) {
+        // Optimistically consider that moving the instruction out of its
+        // defining region will reduce RP in the latter; this assumes that
+        // maximum RP in the region is reached somewhere between the defining
+        // instruction and the end of the region.
+        REMAT_DEBUG(
+            dbgs() << "  Instruction's defining region is optimizable\n");
+        RematUseful = true;
+        if (ReduceRPInRegion(
+                It, DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg]))
+          return true;
+      }
 
-    unsigned ImproveOccupancy = 0;
-    SmallVector<MachineInstr *, 4> SinkedDefs;
-    for (auto &It : RematerializableInsts[I]) {
-      MachineInstr *Def = It.first;
-      MachineBasicBlock::iterator InsertPos =
-          MachineBasicBlock::iterator(It.second);
-      Register Reg = Def->getOperand(0).getReg();
-      // Rematerialize MI to its use block. Since we are only rematerializing
-      // instructions that do not have any virtual reg uses, we do not need to
-      // call LiveRangeEdit::allUsesAvailableAt() and
-      // LiveRangeEdit::canRematerializeAt().
-      TII->reMaterialize(*InsertPos->getParent(), InsertPos, Reg,
-                         Def->getOperand(0).getSubReg(), *Def, *DAG.TRI);
-      MachineInstr *NewMI = &*std::prev(InsertPos);
-      LIS->InsertMachineInstrInMaps(*NewMI);
-      LIS->removeInterval(Reg);
-      LIS->createAndComputeVirtRegInterval(Reg);
-      InsertedMIToOldDef[NewMI] = Def;
-
-      // Update region boundaries in scheduling region we sinked from since we
-      // may sink an instruction that was at the beginning or end of its region
-      DAG.updateRegionBoundaries(NewRegions, Def, /*NewMI =*/nullptr,
-                                 /*Removing =*/true);
-
-      // Update region boundaries in region we sinked to.
-      DAG.updateRegionBoundaries(NewRegions, InsertPos, NewMI);
-
-      LaneBitmask PrevMask = NewLiveIns[I][Reg];
-      // FIXME: Also update cached pressure for where the def was sinked from.
-      // Update RP for all regions that has this reg as a live-in and remove
-      // the reg from all regions as a live-in.
-      ...
[truncated]

The typo in GCNSubtarget::getMaxNumArchVGPRs made the function return
the wrong value in all cases. The returned value is now correct; this
causes some unit tests to break in largely cosmetic ways, since that
value is used to calculate excess RP in the remat stage.
@lucas-rami force-pushed the pre-ra-remat-refactor branch from 69be687 to 9f49681 on February 10, 2025 at 12:47
@@ -6047,9 +7328,9 @@ body: |

bb.4:

S_NOP 0, implicit %52, implicit %62
Contributor Author

In a few cases here and below I moved this instruction to the top of the region to allow rematerializations to increase occupancy. Otherwise the instruction defining %60 is rematerialized at the beginning of the region which doesn't decrease maximum RP in the region, even though my optimistic heuristics estimate that it should have been enough. Ideally, I think the scheduler should be able to reorder instructions in such cases, but it does not. Alternatively, we could make the stage perform more rematerializations when it realizes pre-rescheduling that it did not achieve the expected goal.

- Properly account for AGPRs in excess RP calculations.
- Account for SGPR spilling when deciding whether to try to increase
  occupancy.
- Don't roll back when rescheduling managed to improve occupancy after
  remats.

github-actions bot commented Feb 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

This prevents rolling back rematerializations in case rescheduling
improved occupancy even further than the target.

Also fix format.
HasRematDependency = true;
break;
}
if (NumAGPRs > MaxVGPRs) {
Contributor

Can you set ExcessArchVGPRs = NumArchVGPRs -- we can spill AGPR to VGPR

Contributor Author

Didn't know that, thanks.

Contributor
@jrbyrnes Mar 11, 2025

Actually, on second thought, not sure about this...

I think our best bet is to assume perfect allocation -- meaning we need NumAGPRs - MaxVGPRs available ArchVGPRs.
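
A minimal, self-contained illustration of the reviewer's suggestion above; the function and variable names here are hypothetical stand-ins, not the actual variables in GCNSchedStrategy.cpp.

```cpp
#include <cstdio>

// Sketch of the "perfect allocation" assumption: AGPRs only add ArchVGPR
// pressure once they exceed the vector-register budget, so the extra
// ArchVGPRs needed would be NumAGPRs - MaxVGPRs, clamped at zero.
static unsigned excessArchVGPRsFromAGPRs(unsigned NumAGPRs, unsigned MaxVGPRs) {
  return NumAGPRs > MaxVGPRs ? NumAGPRs - MaxVGPRs : 0;
}

int main() {
  std::printf("%u\n", excessArchVGPRsFromAGPRs(72, 64)); // prints 8
  return 0;
}
```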

Now that single-MI regions are no longer filtered out by the machine
scheduler, we can greatly simplify the logic to update region boundaries
during rematerialization/rollback, at the cost of a single walk
through the MF's instructions at the beginning of the stage.

- Allow remat in the same block if the using MI is in a different region. Add
  unit test.
- Take into account excess ArchVGPR/AGPR pressure above addressable
  limits. Add unit tests to make sure this works.
- Fix unused register in existing unit test.
Also update one recently changed unit test from main.
This fails in some complex use cases, and it is not clear whether it is
meaningfully faster than just letting the LIS object recompute it from
scratch.

Also removed a stale comment in isTriviallyReMaterializable: we can now
remat MIs with virtual register uses.
@lucas-rami
Contributor Author

AMD internal CI/CD passed with the last patch.

// We may be able to save one more whole ArchVGPR allocation granule.
if (NumRegs >= ArchVGPRsToAlignment) {
NumSavedRegs += Granule;
ArchVGPRsToAlignment = Granule - (NumRegs - ArchVGPRsToAlignment);
Contributor

I think NumRegs - ArchVGPRsToAlignment can be greater than Granule

Contributor Author

From above, unless I am mistaken, we should get NumRegs := NumRegs - (NumRegs / Granule) * Granule == NumRegs % Granule, so NumRegs < Granule.

Additionally, we have NumRegs >= ArchVGPRsToAlignment from the if condition and Granule >= ArchVGPRsToAlignment > 0 by construction. Hence,

$$NumRegs < Granule \Longrightarrow NumRegs - ArchVGPRsToAlignment < Granule - ArchVGPRsToAlignment < Granule$$
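
A tiny stand-alone check of that bound; the granule value below is illustrative and the code only models the reduction argued above, not the actual scheduler helper.

```cpp
#include <cassert>

// Verifies: with NumRegs reduced modulo Granule (so NumRegs < Granule) and
// 0 < ArchVGPRsToAlignment <= Granule, the updated value
// Granule - (NumRegs - ArchVGPRsToAlignment) stays in (0, Granule].
int main() {
  const unsigned Granule = 4; // illustrative granule size
  for (unsigned NumRegs = 0; NumRegs < 64; ++NumRegs) {
    unsigned Rem = NumRegs % Granule; // Rem < Granule
    for (unsigned Align = 1; Align <= Granule; ++Align) {
      if (Rem >= Align) {
        unsigned Updated = Granule - (Rem - Align);
        assert(Updated > 0 && Updated <= Granule);
      }
    }
  }
  return 0;
}
```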

; GISEL-GFX942-NEXT: v_accvgpr_write_b32 a4, v13 ; Reload Reuse
; GISEL-GFX942-NEXT: buffer_store_dwordx4 v[2:5], v62, s[4:7], 0 offen
; GISEL-GFX942-NEXT: buffer_store_dwordx4 v[6:9], v62, s[4:7], 0 offen offset:16
; GISEL-GFX942-NEXT: v_mov_b32_e32 v1, 0x2000
Contributor

What happened here? Looks like our rematerialization did not help with spilling

Contributor Author

I checked the debug output. The upstream implementation identifies v_mov_b32_e32 v1, 0x2000 as a candidate for rematerialization but eventually does not remat anything because it determines it won't be able to improve occupancy. My implementation remats the instruction to reduce identified spilling (4 VGPRs) in the region below (starting on line 448) by 1 VGPR. That's not enough to eliminate spilling but it reduces it.

Contributor

Okay -- I took a look at this test. The scheduler sees that after the rematerialization, we have improved the VGPR pressure which should theoretically help with spilling. However, the heuristic which decides if we are spilling or not is wrong, which I have mentioned in my other comment.

This is a bit unclear from just looking at the resultant code because, during RA, we are doing a suboptimal job and thus have spilling. This is due to how RA partitions the register file between AGPRs and VGPRs. We find our minimum allowable wavesPerEU is 4, so we have 128 vector registers, but RA splits this into 64 AGPRs and 64 VGPRs. So, ironically, we have a limit of 64 ArchVGPRs, which is the same as the incorrect limit imposed by requiring an occupancy of 8 during scheduling. However, this is an accident and should be fixed, as I think Matt is working on improving this. Also, if we use attributes #0 = { "amdgpu-agpr-alloc"="0" } we avoid the RA issue (and thus have no spilling), but the remat heuristics still think we are going to spill.

Contributor Author

Thanks a lot for looking into this. With the spilling heuristic change you suggested the test now behaves as before.

namespace {
/// Models excess register pressure in a region and tracks our progress as we
/// identify rematerialization opportunities.
struct ExcessRP {
Contributor

It would be nice if we could build better ExcessRP queries into GCNRegPressure and rewrite this heuristic to use those queries in combination with GCNRegPressure.inc(). This will help with code maintenance -- maybe as a follow-up PR.

Comment on lines 2207 to 2208
// The boundaries of an empty region may not point to a valid MI from which we
// can get the parent MBB, so we have to iterate over all the MF's MBBs.
Contributor

Shouldn't need to do this, the region should know what blocks it covers directly?

Contributor Author

Just to clarify my rationale for this. When we end up emptying a region completely through rematerializations, the region boundaries end up as Region.first == Region.second (== BB.end() if the region is the last in the BB). AFAIU I cannot retrieve the BB from the boundaries alone in this case, hence this linear search.

To avoid this linear search I added PreRARematStage::RegionBB, which stores the parent MBB of each region before remats.
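
A minimal sketch of that idea with stand-in types; the real code uses MachineBasicBlock pointers and the stage's region boundary iterators, not the simplified structures below.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in types for illustration only.
struct MBB {};
using RegionBoundary = std::pair<int, int>; // [first, second) instruction idx

// Recording each region's parent block up front means a region that later
// becomes empty (first == second) can still be mapped back to its block
// without a linear walk over all blocks in the function.
struct RegionInfoCache {
  std::vector<RegionBoundary> Regions;
  std::vector<MBB *> RegionBB; // parent block of each region, recorded once

  MBB *getParentBlock(unsigned RegionIdx) const {
    assert(RegionIdx < RegionBB.size());
    return RegionBB[RegionIdx]; // valid even if the region is now empty
  }
};

int main() {
  MBB Block;
  RegionInfoCache Cache;
  Cache.Regions.push_back({3, 3}); // region emptied by rematerialization
  Cache.RegionBB.push_back(&Block);
  assert(Cache.getParentBlock(0) == &Block);
  return 0;
}
```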

lucas-rami added 2 commits May 5, 2025 22:11
Additionally fixes a bug where MIRegion would only be filled after
collecting rematerializable instructions; however, the collection phase
depends on MIRegion to determine whether the user of an MI is located in
a different region from that MI. This does not break any test.
Using the base versions was necessary when we were not using the
default range of waves/EU. We use the default range now so we can
directly use the "regular" methods.
@lucas-rami merged commit 067caaa into llvm:main May 8, 2025
6 of 9 checks passed
@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-devrel-x86-64 running on ml-opt-devrel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/18130

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-rel-x86-64 running on ml-opt-rel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/17917

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-rel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
+ /b/ml-opt-rel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
/b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-dev-x86-64 running on ml-opt-dev-x86-64-b1 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/18163

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/ml-opt-dev-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/ml-opt-dev-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/ml-opt-dev-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/ml-opt-dev-x86-64-b1/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@jplehr
Contributor

jplehr commented May 8, 2025

I reverted this patch locally and that got rid of the issue.
Not sure if this can be fixed forward or needs a revert.

Pierre-vh added a commit that referenced this pull request May 8, 2025
@Pierre-vh
Contributor

It's just a few instructions that moved; I regenerated the test and committed it directly: 382a085

@jplehr
Contributor

jplehr commented May 8, 2025

Thank you

@lucas-rami
Contributor Author

Thanks @Pierre-vh!

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-ubuntu-fast running on as-builder-4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/16108

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
+ /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
/home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /home/buildbot/worker/as-builder-4/ramdisk/lld-x86_64/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder llvm-x86_64-debian-dylib running on gribozavr4 while building llvm at step 7 "test-build-unified-tree-check-llvm".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/26826

Here is the relevant piece of the build log for the reference
Step 7 (test-build-unified-tree-check-llvm) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/llvm-x86_64-debian-dylib/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/1/llvm-x86_64-debian-dylib/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-expensive-checks-debian running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/18608

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@llvm-ci
Collaborator

llvm-ci commented May 8, 2025

LLVM Buildbot has detected a new failure on builder clang-x86_64-debian-fast running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/56/builds/25326

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/mfma-loop.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/b/1/clang-x86_64-debian-fast/llvm.obj/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs < /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll | /b/1/clang-x86_64-debian-fast/llvm.obj/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll # RUN: at line 2
+ /b/1/clang-x86_64-debian-fast/llvm.obj/bin/llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs
+ /b/1/clang-x86_64-debian-fast/llvm.obj/bin/FileCheck -enable-var-scope -check-prefixes=GFX908 /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
/b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll:509:16: error: GFX908-NEXT: is not on the line after the previous match
; GFX908-NEXT: s_mov_b32 s0, 16
               ^
<stdin>:338:2: note: 'next' match was here
 s_mov_b32 s0, 16
 ^
<stdin>:336:27: note: previous match ended here
 v_accvgpr_write_b32 a0, 0
                          ^
<stdin>:337:1: note: non-matching line after previous match is here
 v_mov_b32_e32 v0, 1.0
^

Input file: <stdin>
Check file: /b/1/clang-x86_64-debian-fast/llvm.src/llvm/test/CodeGen/AMDGPU/mfma-loop.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
          .
          .
          .
        333:  v_accvgpr_write_b32 a4, 0 
        334:  v_accvgpr_write_b32 a3, 0 
        335:  v_accvgpr_write_b32 a2, 0 
        336:  v_accvgpr_write_b32 a0, 0 
        337:  v_mov_b32_e32 v0, 1.0 
        338:  s_mov_b32 s0, 16 
next:509      !~~~~~~~~~~~~~~~  error: match on wrong line
        339:  v_mov_b32_e32 v1, 2.0 
        340: .LBB2_1: ; %for.cond.preheader 
        341:  ; =>This Inner Loop Header: Depth=1 
        342:  s_nop 1 
        343:  v_mfma_f32_32x32x1f32 a[0:31], v0, v1, a[0:31] 
          .
          .
          .
>>>>>>

--

...

@hubert-reinterpretcast
Collaborator

@lucas-rami, are you working on a fix or should this be reverted?

@lucas-rami
Contributor Author

Is this not the same issue solved by 382a085, reported by different builders that each checked out LLVM at my commit (before the fix)? Line 509 of mfma-loop.ll on main is currently not the one reported by the CI output (; GFX908-NEXT: s_mov_b32 s0, 16); it is now:

; GFX908-NEXT: v_mov_b32_e32 v0, 1.0

ninja check-llvm-codegen-amdgpu also reports no error when I build close to top-of-tree (d7987f1).

Apologies if I am missing something here.

@hubert-reinterpretcast
Collaborator

Is this not the same issue solved by 382a085 and reported by a different builder that checked out LLVM at my commit (before the fix) each time?

Sorry. I missed that the fix happened here: #125885 (comment)

@vitalybuka
Collaborator

@slackito
Collaborator

I get sanitizer errors when building at this commit with -DLLVM_USE_SANITIZER=Address and then running ninja check-llvm-codegen-amdgpu.

Here's the asan error:

==3331703==ERROR: AddressSanitizer: use-after-poison on address 0x7d39064ec8d4 at pc 0x55b90dc9ac25 bp 0x7ffff9c71550 sp 0x7ffff9c71548
READ of size 4 at 0x7d39064ec8d4 thread T0
    #0 0x55b90dc9ac24 in llvm::MachineInstr::getFlag(llvm::MachineInstr::MIFlag) const /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstr.h:409:12
    #1 0x55b90dc9ac24 in llvm::MachineInstr::isBundledWithPred() const /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstr.h:487:43
    #2 0x55b90dc9ac24 in llvm::SlotIndexes::removeMachineInstrFromMaps(llvm::MachineInstr&, bool) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SlotIndexes.cpp:129:3
    #3 0x55b909936baa in llvm::LiveIntervals::RemoveMachineInstrFromMaps(llvm::MachineInstr&) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/LiveIntervals.h:296:14
    #4 0x55b909936baa in llvm::PreRARematStage::rematerialize() /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:2090:14
    #5 0x55b9099314ef in llvm::PreRARematStage::initGCNSchedStage() /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:1117:3
    #6 0x55b90992e90a in llvm::GCNScheduleDAGMILive::runSchedStages() /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:966:17
    #7 0x55b90d79d811 in llvm::impl_detail::MachineSchedulerBase::scheduleRegions(llvm::ScheduleDAGInstrs&, bool) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineScheduler.cpp:825:13
    #8 0x55b90d79b8f3 in llvm::impl_detail::MachineSchedulerImpl::run(llvm::MachineFunction&, llvm::TargetMachine const&, llvm::impl_detail::MachineSchedulerImpl::RequiredAnalyses const&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineScheduler.cpp:484:3
    #9 0x55b90d7cdd43 in (anonymous namespace)::MachineSchedulerLegacy::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineScheduler.cpp:576:15
    #10 0x55b90d5d5377 in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineFunctionPass.cpp:108:10
    #11 0x55b90e2d524a in llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1406:27
    #12 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:180:25
    #13 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:469:9
    #14 0x55b90c933051 in (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:534:18
    #15 0x55b90e2d61be in (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1521:27
    #16 0x55b90e2d61be in llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:539:44
    #17 0x55b90870ff3e in compileModule(char**, llvm::LLVMContext&) /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:753:8
    #18 0x55b90870af9e in main /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:400:22
    #19 0x7f2907033ca7 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #20 0x7f2907033d64 in __libc_start_main csu/../csu/libc-start.c:360:3
    #21 0x55b908622930 in _start (/dev/shm/jgorbe-code/lldb-build/bin/llc+0x905b930)

0x7d39064ec8d4 is located 4052 bytes inside of 4096-byte region [0x7d39064eb900,0x7d39064ec900)
allocated by thread T0 here:
    #0 0x55b908709712 in operator new(unsigned long, std::align_val_t, std::nothrow_t const&) /usr/local/google/home/jgorbe/forks/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:104:3
    #1 0x55b90faa24af in llvm::allocate_buffer(unsigned long, unsigned long) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Support/MemAlloc.cpp:16:18
    #2 0x55b908bef425 in llvm::MallocAllocator::Allocate(unsigned long, unsigned long) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/AllocatorBase.h:92:12
    #3 0x55b908bef425 in llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>::StartNewSlab() /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Allocator.h:346:42
    #4 0x55b908bef1a9 in llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>::AllocateSlow(unsigned long, unsigned long, llvm::Align) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Allocator.h:202:5
    #5 0x55b90d5b6625 in llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>::Allocate(unsigned long, unsigned long) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Allocator.h:216:12
    #6 0x55b90d5b6625 in llvm::MachineInstr* llvm::Recycler<llvm::MachineInstr, 80ul, 8ul>::Allocate<llvm::MachineInstr, llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>>(llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul, 128ul>&) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/Support/Recycler.h:95:57
    #7 0x55b90d5b6625 in llvm::MachineFunction::CreateMachineInstr(llvm::MCInstrDesc const&, llvm::DebugLoc, bool) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineFunction.cpp:429:35
    #8 0x55b908ca3454 in llvm::BuildMI(llvm::MachineBasicBlock&, llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, llvm::MIMetadata const&, llvm::MCInstrDesc const&, llvm::Register) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstrBuilder.h:397:25
    #9 0x55b90f3f0a78 in llvm::BuildMI(llvm::MachineBasicBlock*, llvm::MIMetadata const&, llvm::MCInstrDesc const&, llvm::Register) /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstrBuilder.h:498:10
    #10 0x55b90f3f0a78 in llvm::FunctionLoweringInfo::set(llvm::Function const&, llvm::MachineFunction&, llvm::SelectionDAG*) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp:307:11
    #11 0x55b90f6afb74 in llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:579:13
    #12 0x55b9094a470a in llvm::AMDGPUDAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:128:28
    #13 0x55b90f6a9adf in llvm::SelectionDAGISelLegacy::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:374:20
    #14 0x55b90d5d5377 in llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/MachineFunctionPass.cpp:108:10
    #15 0x55b90e2d524a in llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1406:27
    #16 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:180:25
    #17 0x55b90c933051 in (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:469:9
    #18 0x55b90c933051 in (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/Analysis/CallGraphSCCPass.cpp:534:18
    #19 0x55b90e2d61be in (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:1521:27
    #20 0x55b90e2d61be in llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/jgorbe/code/llvm/llvm/lib/IR/LegacyPassManager.cpp:539:44
    #21 0x55b90870ff3e in compileModule(char**, llvm::LLVMContext&) /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:753:8
    #22 0x55b90870af9e in main /usr/local/google/home/jgorbe/code/llvm/llvm/tools/llc/llc.cpp:400:22
    #23 0x7f2907033ca7 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: use-after-poison /usr/local/google/home/jgorbe/code/llvm/llvm/include/llvm/CodeGen/MachineInstr.h:409:12 in llvm::MachineInstr::getFlag(llvm::MachineInstr::MIFlag) const
Shadow bytes around the buggy address:
  0x7d39064ec600: f7 00 00 00 00 00 00 00 00 00 00 f7 00 00 00 00
  0x7d39064ec680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7d39064ec700: 00 00 00 00 00 00 00 00 00 00 00 00 f7 00 00 00
  0x7d39064ec780: 00 00 00 00 00 00 00 f7 00 00 00 00 00 00 00 00
  0x7d39064ec800: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 00 00 00 00
=>0x7d39064ec880: 00 00 00 00 f7 f7 f7 f7 f7 f7[f7]f7 f7 f7 f7 f7
  0x7d39064ec900: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064ec980: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064eca00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064eca80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7d39064ecb00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3331703==ABORTING
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /dev/shm/jgorbe-code/lldb-build/bin/FileCheck -check-prefix=GCN -check-prefix=SI /dev/shm/jgorbe-code/llvm/llvm/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll

@vitalybuka
Copy link
Collaborator

Yes, the same on bots.

vitalybuka added a commit that referenced this pull request May 10, 2025
…g scheduling (#125885)" (#139341)

And related "[AMDGPU] Regenerate mfma-loop.ll test"

These introduce a memory error detected by ASan (#125885).

This reverts commit 382a085.
This reverts commit 067caaa.
@vitalybuka
Copy link
Collaborator

@lucas-rami we had to revert to fix the bots.

@lucas-rami
Copy link
Contributor Author

Sounds good. Thanks for the ping, and sorry for the faulty commit; I am working on addressing the issue.

lucas-rami added a commit to lucas-rami/llvm-project that referenced this pull request May 12, 2025
…ling (llvm#125885)

lucas-rami added a commit that referenced this pull request May 13, 2025
…ng scheduling (#125885)" (#139548)

This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to
issues detected by the address sanitizer (MIs have to be removed from
live intervals before being removed from their parent MBB).
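
The parenthetical above describes an ordering constraint in the LiveIntervals update: the MachineInstr must be dropped from the SlotIndexes/LiveIntervals maps while it is still linked into its parent MachineBasicBlock, and only erased afterwards. Erasing it first lets its storage be recycled (and poisoned under ASan), which is what the use-after-poison trace above reports inside RemoveMachineInstrFromMaps. Below is a minimal sketch of the safe order, using a hypothetical helper name; this is not the code from the patch.

#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/MachineInstr.h"

// Hypothetical helper illustrating the deletion order required when
// LiveIntervals is being kept up to date.
static void eraseInstrWithLiveIntervals(llvm::LiveIntervals &LIS,
                                        llvm::MachineInstr &MI) {
  // 1. Remove the instruction from the slot-index/live-interval maps while
  //    it is still attached to its parent MachineBasicBlock.
  LIS.RemoveMachineInstrFromMaps(MI);
  // 2. Only then unlink and delete it; after this call the MachineInstr's
  //    storage may be recycled, so it must not be queried again.
  MI.eraseFromParent();
}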
