[AMDGPU] Insert before and after instructions that always use GDS #131338

Conversation

stepthomas
Contributor

@stepthomas stepthomas commented Mar 14, 2025

It is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete.

Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait is followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).
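
For illustration, a minimal editor's sketch (assumed instruction sequence, not output copied from this patch) of the intended shape on GFX11:

  ds_add_u32 v1, v2 offset:12 gds         ; earlier GDS access, possibly outstanding
  s_waitcnt lgkmcnt(0)                    ; inserted before, only if a GDS op is pending
  ds_ordered_count v3, v0 offset:772 gds  ; the always GDS instruction
  s_waitcnt lgkmcnt(0)                    ; inserted after, unconditionally
  s_nop 0                                 ; added only because s_endpgm follows the wait
  s_endpgm

In practice the pass waits on the smallest count that covers the pending GDS operations; the MIR test in this patch checks a partial wait (lgkmcnt(1)) when an unrelated LDS operation is also in flight.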

@llvmbot
Member

llvmbot commented Mar 14, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Stephen Thomas (stepthomas)

Changes

On GFX11 it is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete.

Insert waits on LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait is followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).


Full diff: https://github.com/llvm/llvm-project/pull/131338.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+85-7)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+3-1)
  • (added) llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir (+26)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll (+4)
diff --git a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
index d0043bcc920b6..4802ed4bb53df 100644
--- a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
+++ b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
@@ -328,7 +328,9 @@ bool AMDGPUCustomBehaviour::isGWS(uint16_t Opcode) const {
 
 // taken from SIInstrInfo::isAlwaysGDS()
 bool AMDGPUCustomBehaviour::isAlwaysGDS(uint16_t Opcode) const {
-  return Opcode == AMDGPU::DS_ORDERED_COUNT || isGWS(Opcode);
+  return Opcode == AMDGPU::DS_ORDERED_COUNT ||
+         Opcode == AMDGPU::DS_ADD_GS_REG_RTN ||
+         Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
 }
 
 } // namespace llvm::mca
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 239f2664f59f3..9fbb79fe3d6aa 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -349,6 +349,16 @@ class WaitcntBrackets {
     LastFlat[DS_CNT] = ScoreUBs[DS_CNT];
   }
 
+  bool hasPendingGDS() const {
+    return LastGDS > ScoreLBs[DS_CNT] && LastGDS <= ScoreUBs[DS_CNT];
+  }
+
+  unsigned getPendingGDSWait() const {
+    return std::min(getScoreUB(DS_CNT) - LastGDS, getWaitCountMax(DS_CNT) - 1);
+  }
+
+  void setPendingGDS() { LastGDS = ScoreUBs[DS_CNT]; }
+
   // Return true if there might be pending writes to the vgpr-interval by VMEM
   // instructions with types different from V.
   bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
@@ -427,6 +437,8 @@ class WaitcntBrackets {
   unsigned PendingEvents = 0;
   // Remember the last flat memory operation.
   unsigned LastFlat[NUM_INST_CNTS] = {0};
+  // Remember the last GDS operation.
+  unsigned LastGDS = 0;
   // wait_cnt scores for every vgpr.
   // Keep track of the VgprUB and SgprUB to make merge at join efficient.
   int VgprUB = -1;
@@ -729,6 +741,10 @@ class SIInsertWaitcnts : public MachineFunctionPass {
                        MachineInstr *OldWaitcntInstr);
   void updateEventWaitcntAfter(MachineInstr &Inst,
                                WaitcntBrackets *ScoreBrackets);
+  bool isNextENDPGM(MachineBasicBlock::instr_iterator It,
+                    MachineBasicBlock *Block) const;
+  bool insertForcedWaitAfter(MachineInstr &Inst, MachineBasicBlock &Block,
+                             WaitcntBrackets &ScoreBrackets);
   bool insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block,
                             WaitcntBrackets &ScoreBrackets);
 };
@@ -1678,6 +1694,11 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
       }
     }
 
+    // GFX11 must wait for any pending GDS instruction to complete before
+    // any "Always GDS" instruction.
+    if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(MI.getOpcode()) &&
+        ScoreBrackets.hasPendingGDS())
+      addWait(Wait, DS_CNT, ScoreBrackets.getPendingGDSWait());
+
     if (MI.isCall() && callWaitsOnFunctionEntry(MI)) {
       // The function is going to insert a wait on everything in its prolog.
       // This still needs to be careful if the call target is a load (e.g. a GOT
@@ -1982,6 +2003,65 @@ static bool isCacheInvOrWBInst(MachineInstr &Inst) {
          Opc == AMDGPU::GLOBAL_WBINV;
 }
 
+// Return true if the next instruction is S_ENDPGM, following fallthrough
+// blocks if necessary.
+bool SIInsertWaitcnts::isNextENDPGM(MachineBasicBlock::instr_iterator It,
+                                    MachineBasicBlock *Block) const {
+  auto E = Block->instr_end();
+
+  while (true) {
+    if (It == E) {
+      if (auto FallThrough = Block->getFallThrough(false)) {
+        Block = FallThrough;
+        It = Block->instr_begin();
+        E = Block->instr_end();
+        continue;
+      }
+
+      return false;
+    }
+
+    if (!It->isMetaInstruction())
+      break;
+
+    It++;
+  }
+
+  assert(It != E);
+
+  return It->getOpcode() == AMDGPU::S_ENDPGM;
+}
+
+// Add a wait after an instruction if architecture requirements mandate one.
+bool SIInsertWaitcnts::insertForcedWaitAfter(MachineInstr &Inst,
+                                             MachineBasicBlock &Block,
+                                             WaitcntBrackets &ScoreBrackets) {
+  AMDGPU::Waitcnt Wait;
+  bool NeedsEndPGMCheck = false;
+
+  if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore())
+    Wait = WCG->getAllZeroWaitcnt(Inst.mayStore() &&
+                                  !SIInstrInfo::isAtomicRet(Inst));
+
+  if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(Inst.getOpcode())) {
+    Wait.DsCnt = 0;
+    NeedsEndPGMCheck = true;
+  }
+
+  ScoreBrackets.simplifyWaitcnt(Wait);
+
+  auto SuccessorIt = std::next(Inst.getIterator());
+  bool Result = generateWaitcnt(Wait, SuccessorIt, Block, ScoreBrackets,
+                                /*OldWaitcntInstr=*/nullptr);
+
+  if (Result && NeedsEndPGMCheck && isNextENDPGM(SuccessorIt, &Block)) {
+    BuildMI(Block, SuccessorIt, Inst.getDebugLoc(), TII->get(AMDGPU::S_NOP))
+        .addImm(0);
+  }
+
+  return Result;
+}
+
 void SIInsertWaitcnts::updateEventWaitcntAfter(MachineInstr &Inst,
                                                WaitcntBrackets *ScoreBrackets) {
   // Now look at the instruction opcode. If it is a memory access
@@ -1994,6 +2074,7 @@ void SIInsertWaitcnts::updateEventWaitcntAfter(MachineInstr &Inst,
         TII->hasModifiersSet(Inst, AMDGPU::OpName::gds)) {
       ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_ACCESS, Inst);
       ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_GPR_LOCK, Inst);
+      ScoreBrackets->setPendingGDS();
     } else {
       ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);
     }
@@ -2124,6 +2205,9 @@ bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {
 
     StrictDom |= mergeScore(M, LastFlat[T], Other.LastFlat[T]);
 
+    if (T == DS_CNT)
+      StrictDom |= mergeScore(M, LastGDS, Other.LastGDS);
+
     for (int J = 0; J <= VgprUB; J++)
       StrictDom |= mergeScore(M, VgprScores[T][J], Other.VgprScores[T][J]);
 
@@ -2249,13 +2333,7 @@ bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
 
     updateEventWaitcntAfter(Inst, &ScoreBrackets);
 
-    if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore()) {
-      AMDGPU::Waitcnt Wait = WCG->getAllZeroWaitcnt(
-          Inst.mayStore() && !SIInstrInfo::isAtomicRet(Inst));
-      ScoreBrackets.simplifyWaitcnt(Wait);
-      Modified |= generateWaitcnt(Wait, std::next(Inst.getIterator()), Block,
-                                  ScoreBrackets, /*OldWaitcntInstr=*/nullptr);
-    }
+    Modified |= insertForcedWaitAfter(Inst, Block, ScoreBrackets);
 
     LLVM_DEBUG({
       Inst.print(dbgs());
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 1e025f481ffa9..e383f6f2aef48 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -4239,7 +4239,9 @@ bool SIInstrInfo::isSchedulingBoundary(const MachineInstr &MI,
 }
 
 bool SIInstrInfo::isAlwaysGDS(uint16_t Opcode) const {
-  return Opcode == AMDGPU::DS_ORDERED_COUNT || isGWS(Opcode);
+  return Opcode == AMDGPU::DS_ORDERED_COUNT ||
+         Opcode == AMDGPU::DS_ADD_GS_REG_RTN ||
+         Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
 }
 
 bool SIInstrInfo::modifiesModeRegister(const MachineInstr &MI) {
diff --git a/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir b/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir
new file mode 100644
index 0000000000000..c5b8491e52a41
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir
@@ -0,0 +1,26 @@
+# RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -run-pass si-insert-waitcnts %s -o - | FileCheck -check-prefix=GCN %s
+
+---
+# GCN-LABEL: name: test_ordered_count
+# GCN:       bb.0
+# GCN:       DS_ADD_U32
+# GCN:       DS_SUB_U32
+# GCN-NEXT:  S_WAITCNT 64535
+# GCN-NEXT:  $vgpr3 = DS_ORDERED_COUNT
+# GCN-NEXT:  S_WAITCNT 64519
+# GCN-NEXT:  $vgpr4_vgpr5 = DS_ADD_GS_REG_RTN
+# GCN-NEXT:  S_WAITCNT 64519
+# GCN-NEXT:  S_NOP 0
+
+name:   test_ordered_count
+body:    |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+
+    DS_ADD_U32 $vgpr1, $vgpr2, 12, -1, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+    DS_SUB_U32 $vgpr1, $vgpr2, 12, 0, implicit $m0, implicit $exec :: (load store (s32), addrspace 2)
+    $vgpr3 = DS_ORDERED_COUNT $vgpr0, 772, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+    $vgpr4_vgpr5 = DS_ADD_GS_REG_RTN $vgpr0, 32, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+    S_ENDPGM 0
+
+...
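
(Editor's note, not part of the patch: decoding the raw S_WAITCNT immediates above with the GFX11 field layout vmcnt[15:10], lgkmcnt[9:4], expcnt[2:0], 64535 = 0xFC17 leaves vmcnt and expcnt at their no-wait values and encodes lgkmcnt(1), allowing the unrelated LDS DS_SUB_U32 to remain outstanding while the GDS DS_ADD_U32 drains, and 64519 = 0xFC07 encodes lgkmcnt(0).)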
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
index 081727c3b5e14..9aedaaec94e7e 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
@@ -9,6 +9,8 @@ define amdgpu_gs void @test_add_32(i32 %arg) {
 ; CHECK-LABEL: test_add_32:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_add_gs_reg_rtn v[0:1], v0 offset:16 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i32 @llvm.amdgcn.ds.add.gs.reg.rtn.i32(i32 %arg, i32 16)
   ret void
@@ -30,6 +32,8 @@ define amdgpu_gs void @test_add_64(i32 %arg) {
 ; CHECK-LABEL: test_add_64:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_add_gs_reg_rtn v[0:1], v0 offset:32 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i64 @llvm.amdgcn.ds.add.gs.reg.rtn.i64(i32 %arg, i32 32)
   ret void
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
index 63d4bac20b354..bb1c4604dd9d2 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
@@ -9,6 +9,8 @@ define amdgpu_gs void @test_sub_32(i32 %arg) {
 ; CHECK-LABEL: test_sub_32:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_sub_gs_reg_rtn v[0:1], v0 offset:16 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i32 @llvm.amdgcn.ds.sub.gs.reg.rtn.i32(i32 %arg, i32 16)
   ret void
@@ -30,6 +32,8 @@ define amdgpu_gs void @test_sub_64(i32 %arg) {
 ; CHECK-LABEL: test_sub_64:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_sub_gs_reg_rtn v[0:1], v0 offset:32 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i64 @llvm.amdgcn.ds.sub.gs.reg.rtn.i64(i32 %arg, i32 32)
   ret void


github-actions bot commented Mar 14, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@stepthomas stepthomas requested a review from ScottEgerton March 14, 2025 14:11
Comment on lines 1697 to 1699
// GFX11 must wait for any pending GDS instruction to complete before
// any "Always GDS" instruction.
if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(MI.getOpcode()) &&
Contributor

Don't restrict this to GFX11. The rule definitely applies to GFX9 and GFX10 too, so I'm guessing it applies to any architecture that supports GWS and this code can be unconditional.

auto E = Block->instr_end();

while (true) {
if (It == E) {
Contributor

I think you can test It.isEnd() here, which means you don't need Block or E in this function.

Contributor Author

@stepthomas stepthomas Mar 18, 2025

It looks like It.isEnd() does exist, but It->getParent() fails if It is at the end of the iterator, so it's not possible to get the next block without already knowing the current block.

Contributor

But if you use isEnd then you don't need E for anything.

Contributor Author

Yes, but it looks like I'll still need Block, which is annoying.

Contributor

@arsenm arsenm left a comment

Should this be treated as a type of hazard, rather than something for waitcnt insertion to deal with?


while (true) {
if (It == E) {
if (auto FallThrough = Block->getFallThrough(false)) {
Contributor

Suggested change:
-      if (auto FallThrough = Block->getFallThrough(false)) {
+      if (MachineBasicBlock *FallThrough = Block->getFallThrough(false)) {

Contributor

On a related note, getFallThrough is far more powerful than we need here. We could just use std::next. But maybe that would be less readable.

Contributor Author

@stepthomas stepthomas Mar 19, 2025

  auto EndBlock = Block->getParent()->end();

  while (true) {
    if (It.isEnd()) {
      auto NextBlock = std::next(Block->getIterator());
      if (NextBlock != EndBlock) {
        It = NextBlock->instr_begin();
        Block = &*NextBlock;
        continue;
      }

      return false;
    }

    if (!It->isMetaInstruction())
      break;

    It++;
  }

  assert(!It.isEnd());

This is what I came up with if I want to use std::next(). I think it could get a bit more readable if I had an iterator over blocks starting from Block, advancing it as the instruction iterator encountered block ends. But is ++ on iterators the same as using std::next()?

Contributor

but is ++ on iterators the same as using std::next()?

Yes.

// blocks if necessary.
bool SIInsertWaitcnts::isNextENDPGM(MachineBasicBlock::instr_iterator It,
MachineBasicBlock *Block) const {
auto E = Block->instr_end();
Contributor

Is this intentionally looking through bundles?

Contributor Author

No. I guess we can be fairly sure S_ENDPGM won't be in a bundle?

@jayfoad
Contributor

jayfoad commented Mar 18, 2025

Should this be treated as a type of hazard, rather than something for waitcnt insertion to deal with?

No. The waitcnt before a GWS instruction definitely belongs in this pass, since we need to track whether there is an outstanding GDS operation. For the waitcnt after a GWS instruction it is less clear, but I think it fits fairly naturally into this pass.

Contributor

@jayfoad jayfoad left a comment

LGTM, thanks.

@stepthomas stepthomas merged commit 2e3fa4b into llvm:main Mar 21, 2025
11 checks passed