[AMDGPU] Insert before and after instructions that always use GDS #131338

Conversation

stepthomas
Contributor

@stepthomas stepthomas commented Mar 14, 2025

It is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete.

Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait is followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).
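
For illustration, a minimal editor's sketch (assumed instruction sequence, not output copied from this patch) of the intended shape on GFX11:

  ds_add_u32 v1, v2 offset:12 gds         ; earlier GDS access, possibly outstanding
  s_waitcnt lgkmcnt(0)                    ; inserted before, only if a GDS op is pending
  ds_ordered_count v3, v0 offset:772 gds  ; the always GDS instruction
  s_waitcnt lgkmcnt(0)                    ; inserted after, unconditionally
  s_nop 0                                 ; added only because s_endpgm follows the wait
  s_endpgm

In practice the pass waits on the smallest count that covers the pending GDS operations; the MIR test in this patch checks a partial wait (lgkmcnt(1)) when an unrelated LDS operation is also in flight.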

@llvmbot
Member

llvmbot commented Mar 14, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Stephen Thomas (stepthomas)

Changes

On GFX11 it is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete.

Insert waits on LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait is followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).


Full diff: https://github.com/llvm/llvm-project/pull/131338.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+85-7)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+3-1)
  • (added) llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir (+26)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll (+4)
diff --git a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
index d0043bcc920b6..4802ed4bb53df 100644
--- a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
+++ b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
@@ -328,7 +328,9 @@ bool AMDGPUCustomBehaviour::isGWS(uint16_t Opcode) const {
 
 // taken from SIInstrInfo::isAlwaysGDS()
 bool AMDGPUCustomBehaviour::isAlwaysGDS(uint16_t Opcode) const {
-  return Opcode == AMDGPU::DS_ORDERED_COUNT || isGWS(Opcode);
+  return Opcode == AMDGPU::DS_ORDERED_COUNT ||
+         Opcode == AMDGPU::DS_ADD_GS_REG_RTN ||
+         Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
 }
 
 } // namespace llvm::mca
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 239f2664f59f3..9fbb79fe3d6aa 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -349,6 +349,16 @@ class WaitcntBrackets {
     LastFlat[DS_CNT] = ScoreUBs[DS_CNT];
   }
 
+  bool hasPendingGDS() const {
+    return LastGDS > ScoreLBs[DS_CNT] && LastGDS <= ScoreUBs[DS_CNT];
+  }
+
+  unsigned getPendingGDSWait() const {
+    return std::min(getScoreUB(DS_CNT) - LastGDS, getWaitCountMax(DS_CNT) - 1);
+  }
+
+  void setPendingGDS() { LastGDS = ScoreUBs[DS_CNT]; }
+
   // Return true if there might be pending writes to the vgpr-interval by VMEM
   // instructions with types different from V.
   bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
@@ -427,6 +437,8 @@ class WaitcntBrackets {
   unsigned PendingEvents = 0;
   // Remember the last flat memory operation.
   unsigned LastFlat[NUM_INST_CNTS] = {0};
+  // Remember the last GDS operation.
+  unsigned LastGDS = 0;
   // wait_cnt scores for every vgpr.
   // Keep track of the VgprUB and SgprUB to make merge at join efficient.
   int VgprUB = -1;
@@ -729,6 +741,10 @@ class SIInsertWaitcnts : public MachineFunctionPass {
                        MachineInstr *OldWaitcntInstr);
   void updateEventWaitcntAfter(MachineInstr &Inst,
                                WaitcntBrackets *ScoreBrackets);
+  bool isNextENDPGM(MachineBasicBlock::instr_iterator It,
+                    MachineBasicBlock *Block) const;
+  bool insertForcedWaitAfter(MachineInstr &Inst, MachineBasicBlock &Block,
+                             WaitcntBrackets &ScoreBrackets);
   bool insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block,
                             WaitcntBrackets &ScoreBrackets);
 };
@@ -1678,6 +1694,11 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
       }
     }
 
+    // GFX11 must wait for any pending GDS instruction to complete before
+    // any "Always GDS" instruction.
+    if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(MI.getOpcode()) &&
+        ScoreBrackets.hasPendingGDS())
+      addWait(Wait, DS_CNT, ScoreBrackets.getPendingGDSWait());
+
     if (MI.isCall() && callWaitsOnFunctionEntry(MI)) {
       // The function is going to insert a wait on everything in its prolog.
       // This still needs to be careful if the call target is a load (e.g. a GOT
@@ -1982,6 +2003,65 @@ static bool isCacheInvOrWBInst(MachineInstr &Inst) {
          Opc == AMDGPU::GLOBAL_WBINV;
 }
 
+// Return true if the next instruction is S_ENDPGM, following fallthrough
+// blocks if necessary.
+bool SIInsertWaitcnts::isNextENDPGM(MachineBasicBlock::instr_iterator It,
+                                    MachineBasicBlock *Block) const {
+  auto E = Block->instr_end();
+
+  while (true) {
+    if (It == E) {
+      if (auto FallThrough = Block->getFallThrough(false)) {
+        Block = FallThrough;
+        It = Block->instr_begin();
+        E = Block->instr_end();
+        continue;
+      }
+
+      return false;
+    }
+
+    if (!It->isMetaInstruction())
+      break;
+
+    It++;
+  }
+
+  assert(It != E);
+
+  return It->getOpcode() == AMDGPU::S_ENDPGM;
+}
+
+// Add a wait after an instruction if architecture requirements mandate one.
+bool SIInsertWaitcnts::insertForcedWaitAfter(MachineInstr &Inst,
+                                             MachineBasicBlock &Block,
+                                             WaitcntBrackets &ScoreBrackets) {
+  AMDGPU::Waitcnt Wait;
+  bool NeedsEndPGMCheck = false;
+
+  if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore())
+    Wait = WCG->getAllZeroWaitcnt(Inst.mayStore() &&
+                                  !SIInstrInfo::isAtomicRet(Inst));
+
+  if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(Inst.getOpcode())) {
+    Wait.DsCnt = 0;
+    NeedsEndPGMCheck = true;
+  }
+
+  ScoreBrackets.simplifyWaitcnt(Wait);
+
+  auto SuccessorIt = std::next(Inst.getIterator());
+  bool Result = generateWaitcnt(Wait, SuccessorIt, Block, ScoreBrackets,
+                                /*OldWaitcntInstr=*/nullptr);
+
+  if (Result && NeedsEndPGMCheck && isNextENDPGM(SuccessorIt, &Block)) {
+    BuildMI(Block, SuccessorIt, Inst.getDebugLoc(), TII->get(AMDGPU::S_NOP))
+        .addImm(0);
+  }
+
+  return Result;
+}
+
 void SIInsertWaitcnts::updateEventWaitcntAfter(MachineInstr &Inst,
                                                WaitcntBrackets *ScoreBrackets) {
   // Now look at the instruction opcode. If it is a memory access
@@ -1994,6 +2074,7 @@ void SIInsertWaitcnts::updateEventWaitcntAfter(MachineInstr &Inst,
         TII->hasModifiersSet(Inst, AMDGPU::OpName::gds)) {
       ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_ACCESS, Inst);
       ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_GPR_LOCK, Inst);
+      ScoreBrackets->setPendingGDS();
     } else {
       ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);
     }
@@ -2124,6 +2205,9 @@ bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {
 
     StrictDom |= mergeScore(M, LastFlat[T], Other.LastFlat[T]);
 
+    if (T == DS_CNT)
+      StrictDom |= mergeScore(M, LastGDS, Other.LastGDS);
+
     for (int J = 0; J <= VgprUB; J++)
       StrictDom |= mergeScore(M, VgprScores[T][J], Other.VgprScores[T][J]);
 
@@ -2249,13 +2333,7 @@ bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
 
     updateEventWaitcntAfter(Inst, &ScoreBrackets);
 
-    if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore()) {
-      AMDGPU::Waitcnt Wait = WCG->getAllZeroWaitcnt(
-          Inst.mayStore() && !SIInstrInfo::isAtomicRet(Inst));
-      ScoreBrackets.simplifyWaitcnt(Wait);
-      Modified |= generateWaitcnt(Wait, std::next(Inst.getIterator()), Block,
-                                  ScoreBrackets, /*OldWaitcntInstr=*/nullptr);
-    }
+    Modified |= insertForcedWaitAfter(Inst, Block, ScoreBrackets);
 
     LLVM_DEBUG({
       Inst.print(dbgs());
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 1e025f481ffa9..e383f6f2aef48 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -4239,7 +4239,9 @@ bool SIInstrInfo::isSchedulingBoundary(const MachineInstr &MI,
 }
 
 bool SIInstrInfo::isAlwaysGDS(uint16_t Opcode) const {
-  return Opcode == AMDGPU::DS_ORDERED_COUNT || isGWS(Opcode);
+  return Opcode == AMDGPU::DS_ORDERED_COUNT ||
+         Opcode == AMDGPU::DS_ADD_GS_REG_RTN ||
+         Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
 }
 
 bool SIInstrInfo::modifiesModeRegister(const MachineInstr &MI) {
diff --git a/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir b/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir
new file mode 100644
index 0000000000000..c5b8491e52a41
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir
@@ -0,0 +1,26 @@
+# RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -run-pass si-insert-waitcnts %s -o - | FileCheck -check-prefix=GCN %s
+
+---
+# GCN-LABEL: name: test_ordered_count
+# GCN:       bb.0
+# GCN:       DS_ADD_U32
+# GCN:       DS_SUB_U32
+# GCN-NEXT:  S_WAITCNT 64535
+# GCN-NEXT:  $vgpr3 = DS_ORDERED_COUNT
+# GCN-NEXT:  S_WAITCNT 64519
+# GCN-NEXT:  $vgpr4_vgpr5 = DS_ADD_GS_REG_RTN
+# GCN-NEXT:  S_WAITCNT 64519
+# GCN-NEXT:  S_NOP 0
+
+name:   test_ordered_count
+body:    |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+
+    DS_ADD_U32 $vgpr1, $vgpr2, 12, -1, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+    DS_SUB_U32 $vgpr1, $vgpr2, 12, 0, implicit $m0, implicit $exec :: (load store (s32), addrspace 2)
+    $vgpr3 = DS_ORDERED_COUNT $vgpr0, 772, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+    $vgpr4_vgpr5 = DS_ADD_GS_REG_RTN $vgpr0, 32, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+    S_ENDPGM 0
+
+...
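
(Editor's note, not part of the patch: decoding the raw S_WAITCNT immediates above with the GFX11 field layout vmcnt[15:10], lgkmcnt[9:4], expcnt[2:0], 64535 = 0xFC17 leaves vmcnt and expcnt at their no-wait values and encodes lgkmcnt(1), allowing the unrelated LDS DS_SUB_U32 to remain outstanding while the GDS DS_ADD_U32 drains, and 64519 = 0xFC07 encodes lgkmcnt(0).)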
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
index 081727c3b5e14..9aedaaec94e7e 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
@@ -9,6 +9,8 @@ define amdgpu_gs void @test_add_32(i32 %arg) {
 ; CHECK-LABEL: test_add_32:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_add_gs_reg_rtn v[0:1], v0 offset:16 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i32 @llvm.amdgcn.ds.add.gs.reg.rtn.i32(i32 %arg, i32 16)
   ret void
@@ -30,6 +32,8 @@ define amdgpu_gs void @test_add_64(i32 %arg) {
 ; CHECK-LABEL: test_add_64:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_add_gs_reg_rtn v[0:1], v0 offset:32 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i64 @llvm.amdgcn.ds.add.gs.reg.rtn.i64(i32 %arg, i32 32)
   ret void
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
index 63d4bac20b354..bb1c4604dd9d2 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
@@ -9,6 +9,8 @@ define amdgpu_gs void @test_sub_32(i32 %arg) {
 ; CHECK-LABEL: test_sub_32:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_sub_gs_reg_rtn v[0:1], v0 offset:16 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i32 @llvm.amdgcn.ds.sub.gs.reg.rtn.i32(i32 %arg, i32 16)
   ret void
@@ -30,6 +32,8 @@ define amdgpu_gs void @test_sub_64(i32 %arg) {
 ; CHECK-LABEL: test_sub_64:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    ds_sub_gs_reg_rtn v[0:1], v0 offset:32 gds
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_nop 0
 ; CHECK-NEXT:    s_endpgm
   %unused = call i64 @llvm.amdgcn.ds.sub.gs.reg.rtn.i64(i32 %arg, i32 32)
   ret void


github-actions bot commented Mar 14, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@stepthomas stepthomas requested a review from ScottEgerton March 14, 2025 14:11
Comment on lines 1697 to 1699
// GFX11 must wait for any pending GDS instruction to complete before
// any "Always GDS" instruction.
if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(MI.getOpcode()) &&
Contributor

Don't restrict this to GFX11. The rule definitely applies to GFX9 and GFX10 too, so I'm guessing it applies to any architecture that supports GWS and this code can be unconditional.

auto E = Block->instr_end();

while (true) {
if (It == E) {
Contributor

I think you can test It.isEnd() here, which means you don't need Block or E in this function.

Contributor Author

@stepthomas stepthomas Mar 18, 2025

It looks like It.isEnd() does exist, but It->getParent() fails if It is at the end of the iterator, so it's not possible to get the next block without already knowing the current block.

Contributor

But if you use isEnd then you don't need E for anything.

Contributor Author

Yes, but it looks like I'll still need Block, which is annoying.

Contributor

@arsenm arsenm left a comment

Should this be treated as a type of hazard, rather than something for waitcnt insertion to deal with?


while (true) {
if (It == E) {
if (auto FallThrough = Block->getFallThrough(false)) {
Contributor

Suggested change:
-      if (auto FallThrough = Block->getFallThrough(false)) {
+      if (MachineBasicBlock *FallThrough = Block->getFallThrough(false)) {

Contributor

On a related note, getFallThrough is far more powerful than we need here. We could just use std::next. But maybe that would be less readable.

Contributor Author

@stepthomas stepthomas Mar 19, 2025

  auto EndBlock = Block->getParent()->end();

  while (true) {
    if (It.isEnd()) {
      auto NextBlock = std::next(Block->getIterator());
      if (NextBlock != EndBlock) {
        It = NextBlock->instr_begin();
        Block = &*NextBlock;
        continue;
      }

      return false;
    }

    if (!It->isMetaInstruction())
      break;

    It++;
  }

  assert(!It.isEnd());

This is what I came up with if I want to use std::next(). I think it could get a bit more readable if I had an iterator over blocks starting from Block, advancing it as the instruction iterator encountered block ends. But is ++ on iterators the same as using std::next()?

Contributor

but is ++ on iterators the same as using std::next()?

Yes.

// blocks if necessary.
bool SIInsertWaitcnts::isNextENDPGM(MachineBasicBlock::instr_iterator It,
MachineBasicBlock *Block) const {
auto E = Block->instr_end();
Contributor

Is this intentionally looking through bundles?

Contributor Author

No. I guess we can be fairly sure S_ENDPGM won't be in a bundle?

@jayfoad
Contributor

jayfoad commented Mar 18, 2025

Should this be treated as a type of hazard, rather than something for waitcnt insertion to deal with?

No. The waitcnt before a GWS instruction definitely belongs in this pass, since we need to track whether there is an outstanding GDS operation. For the waitcnt after a GWS instruction it is less clear, but I think it fits fairly naturally into this pass.

Contributor

@jayfoad jayfoad left a comment

LGTM, thanks.

@stepthomas stepthomas merged commit 2e3fa4b into llvm:main Mar 21, 2025
11 checks passed