[AMDGPU] Insert waits before and after instructions that always use GDS #131338
Conversation
On GFX11 it is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete. Insert waits on LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait is followed by S_ENDPGM. Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).
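Conceptually, the patch tracks the most recent GDS operation inside the existing DS_CNT score bracket. Below is a minimal standalone sketch of that bookkeeping, with illustrative names and an assumed counter capacity; it is not the LLVM implementation, just the idea behind the WaitcntBrackets changes in the diff (applyWait is a hypothetical helper added for the demo):

```cpp
#include <algorithm>
#include <cstdio>

// Sketch of a DS_CNT score bracket: each DS operation bumps the upper
// bound; a wait on N means "stall until at most N DS ops are outstanding".
// The last GDS op is pending while its score lies in the (LB, UB] window.
struct DsCntBracket {
  unsigned ScoreLB = 0, ScoreUB = 0, LastGDS = 0;
  static constexpr unsigned WaitCountMax = 64; // assumed hardware capacity

  void issueDsOp(bool IsGDS) {
    ++ScoreUB;
    if (IsGDS)
      LastGDS = ScoreUB;
  }
  bool hasPendingGDS() const {
    return LastGDS > ScoreLB && LastGDS <= ScoreUB;
  }
  // Count to wait on so the last GDS op completes: the number of DS ops
  // issued after it, clamped to what the hardware counter can represent.
  unsigned pendingGDSWait() const {
    return std::min(ScoreUB - LastGDS, WaitCountMax - 1);
  }
  // Hypothetical helper: model the effect of executing a wait on Cnt.
  void applyWait(unsigned Cnt) {
    ScoreLB = std::max(ScoreLB, ScoreUB - Cnt);
  }
};

int main() {
  DsCntBracket B;
  B.issueDsOp(/*IsGDS=*/false); // LDS op
  B.issueDsOp(/*IsGDS=*/true);  // GDS op
  B.issueDsOp(/*IsGDS=*/false); // another LDS op
  std::printf("pending=%d wait=%u\n", B.hasPendingGDS(), B.pendingGDSWait());
  B.applyWait(B.pendingGDSWait()); // after this wait the GDS op is done
  std::printf("pending=%d\n", B.hasPendingGDS());
}
```

Waiting on pendingGDSWait() instead of zero still lets the LDS operations issued after the GDS one remain outstanding.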
@llvm/pr-subscribers-backend-amdgpu

Author: Stephen Thomas (stepthomas)

Full diff: https://github.com/llvm/llvm-project/pull/131338.diff

6 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
index d0043bcc920b6..4802ed4bb53df 100644
--- a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
+++ b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
@@ -328,7 +328,9 @@ bool AMDGPUCustomBehaviour::isGWS(uint16_t Opcode) const {
// taken from SIInstrInfo::isAlwaysGDS()
bool AMDGPUCustomBehaviour::isAlwaysGDS(uint16_t Opcode) const {
- return Opcode == AMDGPU::DS_ORDERED_COUNT || isGWS(Opcode);
+ return Opcode == AMDGPU::DS_ORDERED_COUNT ||
+ Opcode == AMDGPU::DS_ADD_GS_REG_RTN ||
+ Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
}
} // namespace llvm::mca
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 239f2664f59f3..9fbb79fe3d6aa 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -349,6 +349,16 @@ class WaitcntBrackets {
LastFlat[DS_CNT] = ScoreUBs[DS_CNT];
}
+ bool hasPendingGDS() const {
+ return LastGDS > ScoreLBs[DS_CNT] && LastGDS <= ScoreUBs[DS_CNT];
+ }
+
+ unsigned getPendingGDSWait() const {
+ return std::min(getScoreUB(DS_CNT) - LastGDS, getWaitCountMax(DS_CNT) - 1);
+ }
+
+ void setPendingGDS() { LastGDS = ScoreUBs[DS_CNT]; }
+
// Return true if there might be pending writes to the vgpr-interval by VMEM
// instructions with types different from V.
bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
@@ -427,6 +437,8 @@ class WaitcntBrackets {
unsigned PendingEvents = 0;
// Remember the last flat memory operation.
unsigned LastFlat[NUM_INST_CNTS] = {0};
+ // Remember the last GDS operation.
+ unsigned LastGDS = 0;
// wait_cnt scores for every vgpr.
// Keep track of the VgprUB and SgprUB to make merge at join efficient.
int VgprUB = -1;
@@ -729,6 +741,10 @@ class SIInsertWaitcnts : public MachineFunctionPass {
MachineInstr *OldWaitcntInstr);
void updateEventWaitcntAfter(MachineInstr &Inst,
WaitcntBrackets *ScoreBrackets);
+ bool isNextENDPGM(MachineBasicBlock::instr_iterator It,
+ MachineBasicBlock *Block) const;
+ bool insertForcedWaitAfter(MachineInstr &Inst, MachineBasicBlock &Block,
+ WaitcntBrackets &ScoreBrackets);
bool insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block,
WaitcntBrackets &ScoreBrackets);
};
@@ -1678,6 +1694,11 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
}
}
+ // GFX11 must wait for any pending GDS instruction to complete before
+ // any "Always GDS" instruction.
+ if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(MI.getOpcode()) &&
+     ScoreBrackets.hasPendingGDS())
+   addWait(Wait, DS_CNT, ScoreBrackets.getPendingGDSWait());
+
if (MI.isCall() && callWaitsOnFunctionEntry(MI)) {
// The function is going to insert a wait on everything in its prolog.
// This still needs to be careful if the call target is a load (e.g. a GOT
@@ -1982,6 +2003,65 @@ static bool isCacheInvOrWBInst(MachineInstr &Inst) {
Opc == AMDGPU::GLOBAL_WBINV;
}
+// Return true if the next instruction is S_ENDPGM, following fallthrough
+// blocks if necessary.
+bool SIInsertWaitcnts::isNextENDPGM(MachineBasicBlock::instr_iterator It,
+ MachineBasicBlock *Block) const {
+ auto E = Block->instr_end();
+
+ while (true) {
+ if (It == E) {
+ if (auto FallThrough = Block->getFallThrough(false)) {
+ Block = FallThrough;
+ It = Block->instr_begin();
+ E = Block->instr_end();
+ continue;
+ }
+
+ return false;
+ }
+
+ if (!It->isMetaInstruction())
+ break;
+
+ It++;
+ }
+
+ assert(It != E);
+
+ return It->getOpcode() == AMDGPU::S_ENDPGM;
+}
+
+// Add a wait after an instruction if architecture requirements mandate one.
+bool SIInsertWaitcnts::insertForcedWaitAfter(MachineInstr &Inst,
+ MachineBasicBlock &Block,
+ WaitcntBrackets &ScoreBrackets) {
+ AMDGPU::Waitcnt Wait;
+ bool NeedsEndPGMCheck = false;
+
+ if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore())
+ Wait = WCG->getAllZeroWaitcnt(Inst.mayStore() &&
+ !SIInstrInfo::isAtomicRet(Inst));
+
+ if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(Inst.getOpcode())) {
+ Wait.DsCnt = 0;
+ NeedsEndPGMCheck = true;
+ }
+
+ ScoreBrackets.simplifyWaitcnt(Wait);
+
+ auto SuccessorIt = std::next(Inst.getIterator());
+ bool Result = generateWaitcnt(Wait, SuccessorIt, Block, ScoreBrackets,
+ /*OldWaitcntInstr=*/nullptr);
+
+ if (Result && NeedsEndPGMCheck && isNextENDPGM(SuccessorIt, &Block)) {
+ BuildMI(Block, SuccessorIt, Inst.getDebugLoc(), TII->get(AMDGPU::S_NOP))
+ .addImm(0);
+ }
+
+ return Result;
+}
+
void SIInsertWaitcnts::updateEventWaitcntAfter(MachineInstr &Inst,
WaitcntBrackets *ScoreBrackets) {
// Now look at the instruction opcode. If it is a memory access
@@ -1994,6 +2074,7 @@ void SIInsertWaitcnts::updateEventWaitcntAfter(MachineInstr &Inst,
TII->hasModifiersSet(Inst, AMDGPU::OpName::gds)) {
ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_ACCESS, Inst);
ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_GPR_LOCK, Inst);
+ ScoreBrackets->setPendingGDS();
} else {
ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);
}
@@ -2124,6 +2205,9 @@ bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {
StrictDom |= mergeScore(M, LastFlat[T], Other.LastFlat[T]);
+ if (T == DS_CNT)
+ StrictDom |= mergeScore(M, LastGDS, Other.LastGDS);
+
for (int J = 0; J <= VgprUB; J++)
StrictDom |= mergeScore(M, VgprScores[T][J], Other.VgprScores[T][J]);
@@ -2249,13 +2333,7 @@ bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
updateEventWaitcntAfter(Inst, &ScoreBrackets);
- if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore()) {
- AMDGPU::Waitcnt Wait = WCG->getAllZeroWaitcnt(
- Inst.mayStore() && !SIInstrInfo::isAtomicRet(Inst));
- ScoreBrackets.simplifyWaitcnt(Wait);
- Modified |= generateWaitcnt(Wait, std::next(Inst.getIterator()), Block,
- ScoreBrackets, /*OldWaitcntInstr=*/nullptr);
- }
+ Modified |= insertForcedWaitAfter(Inst, Block, ScoreBrackets);
LLVM_DEBUG({
Inst.print(dbgs());
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 1e025f481ffa9..e383f6f2aef48 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -4239,7 +4239,9 @@ bool SIInstrInfo::isSchedulingBoundary(const MachineInstr &MI,
}
bool SIInstrInfo::isAlwaysGDS(uint16_t Opcode) const {
- return Opcode == AMDGPU::DS_ORDERED_COUNT || isGWS(Opcode);
+ return Opcode == AMDGPU::DS_ORDERED_COUNT ||
+ Opcode == AMDGPU::DS_ADD_GS_REG_RTN ||
+ Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
}
bool SIInstrInfo::modifiesModeRegister(const MachineInstr &MI) {
diff --git a/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir b/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir
new file mode 100644
index 0000000000000..c5b8491e52a41
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/force-wait-after-always-gds.mir
@@ -0,0 +1,26 @@
+# RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs -run-pass si-insert-waitcnts %s -o - | FileCheck -check-prefix=GCN %s
+
+---
+# GCN-LABEL: name: test_ordered_count
+# GCN: bb.0
+# GCN: DS_ADD_U32
+# GCN: DS_SUB_U32
+# GCN-NEXT: S_WAITCNT 64535
+# GCN-NEXT: $vgpr3 = DS_ORDERED_COUNT
+# GCN-NEXT: S_WAITCNT 64519
+# GCN-NEXT: $vgpr4_vgpr5 = DS_ADD_GS_REG_RTN
+# GCN-NEXT: S_WAITCNT 64519
+# GCN-NEXT: S_NOP 0
+
+name: test_ordered_count
+body: |
+ bb.0:
+ liveins: $vgpr0, $vgpr1, $vgpr2
+
+ DS_ADD_U32 $vgpr1, $vgpr2, 12, -1, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+ DS_SUB_U32 $vgpr1, $vgpr2, 12, 0, implicit $m0, implicit $exec :: (load store (s32), addrspace 2)
+ $vgpr3 = DS_ORDERED_COUNT $vgpr0, 772, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+ $vgpr4_vgpr5 = DS_ADD_GS_REG_RTN $vgpr0, 32, implicit $m0, implicit $exec :: (load store (s32), addrspace 3)
+ S_ENDPGM 0
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
index 081727c3b5e14..9aedaaec94e7e 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll
@@ -9,6 +9,8 @@ define amdgpu_gs void @test_add_32(i32 %arg) {
; CHECK-LABEL: test_add_32:
; CHECK: ; %bb.0:
; CHECK-NEXT: ds_add_gs_reg_rtn v[0:1], v0 offset:16 gds
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_nop 0
; CHECK-NEXT: s_endpgm
%unused = call i32 @llvm.amdgcn.ds.add.gs.reg.rtn.i32(i32 %arg, i32 16)
ret void
@@ -30,6 +32,8 @@ define amdgpu_gs void @test_add_64(i32 %arg) {
; CHECK-LABEL: test_add_64:
; CHECK: ; %bb.0:
; CHECK-NEXT: ds_add_gs_reg_rtn v[0:1], v0 offset:32 gds
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_nop 0
; CHECK-NEXT: s_endpgm
%unused = call i64 @llvm.amdgcn.ds.add.gs.reg.rtn.i64(i32 %arg, i32 32)
ret void
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
index 63d4bac20b354..bb1c4604dd9d2 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll
@@ -9,6 +9,8 @@ define amdgpu_gs void @test_sub_32(i32 %arg) {
; CHECK-LABEL: test_sub_32:
; CHECK: ; %bb.0:
; CHECK-NEXT: ds_sub_gs_reg_rtn v[0:1], v0 offset:16 gds
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_nop 0
; CHECK-NEXT: s_endpgm
%unused = call i32 @llvm.amdgcn.ds.sub.gs.reg.rtn.i32(i32 %arg, i32 16)
ret void
@@ -30,6 +32,8 @@ define amdgpu_gs void @test_sub_64(i32 %arg) {
; CHECK-LABEL: test_sub_64:
; CHECK: ; %bb.0:
; CHECK-NEXT: ds_sub_gs_reg_rtn v[0:1], v0 offset:32 gds
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_nop 0
; CHECK-NEXT: s_endpgm
%unused = call i64 @llvm.amdgcn.ds.sub.gs.reg.rtn.i64(i32 %arg, i32 32)
ret void
✅ With the latest revision this PR passed the C/C++ code formatter.
// GFX11 must wait for any pending GDS instruction to complete before
// any "Always GDS" instruction.
if (AMDGPU::isGFX11(*ST) && TII->isAlwaysGDS(MI.getOpcode()) &&
Don't restrict this to GFX11. The rule definitely applies to GFX9 and GFX10 too, so I'm guessing it applies to any architecture that supports GWS, and this code can be unconditional.
auto E = Block->instr_end();

while (true) {
  if (It == E) {
I think you can test It.isEnd() here, which means you don't need Block or E in this function.
It looks like It.isEnd() does exist, but It->getParent() fails if It is at the end of the iterator, so it's not possible to get the next block without already knowing the current block.
But if you use isEnd then you don't need E for anything.
Yes, but it looks like I'll still need Block, which is annoying.
Should this be treated as a type of hazard, rather than something for waitcnt insertion to deal with?
while (true) {
  if (It == E) {
    if (auto FallThrough = Block->getFallThrough(false)) {
Suggested change:
-    if (auto FallThrough = Block->getFallThrough(false)) {
+    if (MachineBasicBlock *FallThrough = Block->getFallThrough(false)) {
On a related note, getFallThrough is far more powerful than we need here. We could just use std::next. But maybe that would be less readable.
auto EndBlock = Block->getParent()->end();
while (true) {
  if (It.isEnd()) {
    auto NextBlock = std::next(Block->getIterator());
    if (NextBlock != EndBlock) {
      It = NextBlock->instr_begin();
      Block = &*NextBlock;
      continue;
    }
    return false;
  }
  if (!It->isMetaInstruction())
    break;
  It++;
}
assert(!It.isEnd());
This is what I came up with if I want to use std::next(). I think it could get a bit more readable if I had an iterator over blocks starting from Block, advancing it as the instruction iterator encountered block ends, but is ++ on iterators the same as using std::next()?
"but is ++ on iterators the same as using std::next()?"

Yes.
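As a tiny standalone illustration (plain C++, nothing LLVM-specific): ++It advances the iterator in place, while std::next(It) returns an advanced copy and leaves It unchanged, so the It++ in the loop above visits the same instructions a std::next-based rewrite would:

```cpp
#include <cassert>
#include <iterator>
#include <vector>

int main() {
  std::vector<int> V = {1, 2, 3};
  auto A = V.begin();
  auto B = std::next(A); // advanced copy; A still points at 1
  ++A;                   // in-place advance; now A == B
  assert(A == B && *A == 2);
}
```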
// blocks if necessary.
bool SIInsertWaitcnts::isNextENDPGM(MachineBasicBlock::instr_iterator It,
                                    MachineBasicBlock *Block) const {
  auto E = Block->instr_end();
Is this intentionally looking through bundles?
No. I guess we can be fairly sure S_ENDPGM won't be in a bundle?
No. The waitcnt before a GWS instruction definitely belongs in this pass, since we need to track whether there is an outstanding GDS operation. For the waitcnt after a GWS instruction it is less clear, but I think it fits fairly naturally into this pass.
LGTM, thanks.
It is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete.
Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait was followed by S_ENDPGM.
Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).