
[AMDGPU] SelectionDAG support for vector type set 0 to multiple sgpr64 #128017


Closed
wants to merge 1 commit

Conversation

Contributor

@JanekvO JanekvO commented Feb 20, 2025

Zeros materialize into a sequence of s_mov_b32 instructions even when s_mov_b64 is possible. This is a first step toward fixing a worse case in which a zero that should be materialized into vector-typed VGPRs is first materialized into SGPRs and then moved into VGPRs.
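
For illustration, a minimal MIR-level sketch of the intended effect (mirroring the adjust-writemask-cse.ll checks further down; virtual register numbers are illustrative):

  ; before: every 32-bit zero lane is materialized on its own
  %0:sreg_32 = S_MOV_B32 0
  %1:sgpr_256 = REG_SEQUENCE %0, %subreg.sub0, %0, %subreg.sub1, ..., %0, %subreg.sub7
  ; after: zero lanes are paired into 64-bit moves
  %2:sreg_64 = S_MOV_B64_IMM_PSEUDO 0
  %3:sgpr_256 = REG_SEQUENCE %2, %subreg.sub0_sub1, %2, %subreg.sub2_sub3, %2, %subreg.sub4_sub5, %2, %subreg.sub6_sub7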

Member

llvmbot commented Feb 20, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Janek van Oirschot (JanekvO)

Changes

Zeros materialize into a sequence of s_mov_b32 instructions even when s_mov_b64 is possible. This is a first step toward fixing a worse case in which a zero that should be materialized into vector-typed VGPRs is first materialized into SGPRs and then moved into VGPRs.


Patch is 205.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128017.diff

40 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+55)
  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+20-5)
  • (modified) llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+135-139)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse-inline-immediates.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll (+23-32)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-cselect-b64.ll (+6-9)
  • (modified) llvm/test/CodeGen/AMDGPU/cluster_stores.ll (+9-8)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+14-27)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.ll (+16-18)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-nondeterminism.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+153-259)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+5-9)
  • (modified) llvm/test/CodeGen/AMDGPU/hazard-recognizer-src-shared-base.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+173-322)
  • (modified) llvm/test/CodeGen/AMDGPU/imm.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+96-97)
  • (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/issue98474-need-live-out-undef-subregister-def.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.ll (+3-1)
  • (modified) llvm/test/CodeGen/AMDGPU/max-hard-clause-length.ll (+8-14)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+6-2)
  • (modified) llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll (+8-6)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+15-19)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar-float-sop2.ll (+92-14)
  • (modified) llvm/test/CodeGen/AMDGPU/scheduler-rp-calc-one-successor-two-predecessors-bug.ll (+19-25)
  • (modified) llvm/test/CodeGen/AMDGPU/set_kill_i1_for_floation_point_comparison.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.ll (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/si-scheduler-exports.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/smfmac_no_agprs.ll (+11-14)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+18-17)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+32-40)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+9-11)
  • (modified) llvm/test/CodeGen/AMDGPU/vopc_dpp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/waterfall_kills_scc.ll (+10-11)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+18-18)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 8e90754103ff1..c990df622175f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -634,6 +634,61 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
   case ISD::BUILD_VECTOR: {
     EVT VT = N->getValueType(0);
     unsigned NumVectorElts = VT.getVectorNumElements();
+
+    auto IsSplatAllZeros = [this](SDNode *N) -> bool {
+      if (ISD::isConstantSplatVectorAllZeros(N))
+        return true;
+
+      // Types may have been legalized by splitting the 16-bit multi-element vector
+      // into multiple BUILD_VECTORs. Peek through and see if it is all zeros
+      // regardless of what the legalizer did. Assumes cases along the lines of:
+      //   v8i16 build_vector 0, 0, 0, 0, 0, 0, 0, 0
+      //     -> legalizer ->
+      //   t0 = v2i16 build_vector 0, 0
+      //   t1 = bitcast t0 to i32
+      //   v4i32 build_vector t1, t1, t1, t1
+      if (CurDAG->isSplatValue(SDValue(N, 0))) {
+        SDValue Op = peekThroughBitcasts(N->getOperand(0));
+        EVT InnerVT = Op.getValueType();
+        if (InnerVT.isVector() && Op.getOpcode() == ISD::BUILD_VECTOR &&
+            InnerVT.getVectorNumElements() == 2)
+          return ISD::isConstantSplatVectorAllZeros(Op.getNode());
+      }
+      return false;
+    };
+    if (IsSplatAllZeros(N)) {
+      unsigned FixedBitSize = VT.getFixedSizeInBits();
+      SDLoc DL(N);
+      if (FixedBitSize == 64) {
+        SDValue Set0 = {
+            CurDAG->getMachineNode(AMDGPU::S_MOV_B64_IMM_PSEUDO, DL, MVT::i64,
+                                   CurDAG->getTargetConstant(0, DL, MVT::i64)),
+            0};
+        CurDAG->SelectNodeTo(N, AMDGPU::COPY, VT, Set0);
+        return;
+      } else if (NumVectorElts <= 32 && (FixedBitSize % 64 == 0)) {
+        SmallVector<SDValue, 32 * 2 + 1> Ops((FixedBitSize / 64) * 2 + 1);
+        SDValue Set0 = {
+            CurDAG->getMachineNode(AMDGPU::S_MOV_B64_IMM_PSEUDO, DL, MVT::i64,
+                                   CurDAG->getTargetConstant(0, DL, MVT::i64)),
+            0};
+        unsigned RCID =
+            SIRegisterInfo::getSGPRClassForBitWidth(FixedBitSize)->getID();
+        Ops[0] = CurDAG->getTargetConstant(RCID, DL, MVT::i32);
+
+        for (unsigned i = 0, CurrentBitSize = FixedBitSize; CurrentBitSize != 0;
+             ++i, CurrentBitSize -= 64) {
+          unsigned SubRegs =
+              SIRegisterInfo::getSubRegFromChannel(i * 2, /*NumRegs=*/2);
+          Ops[i * 2 + 1] = Set0;
+          Ops[i * 2 + 2] = CurDAG->getTargetConstant(SubRegs, DL, MVT::i32);
+        }
+
+        CurDAG->SelectNodeTo(N, AMDGPU::REG_SEQUENCE, VT, Ops);
+        return;
+      }
+    }
+
     if (VT.getScalarSizeInBits() == 16) {
       if (Opc == ISD::BUILD_VECTOR && NumVectorElts == 2) {
         if (SDNode *Packed = packConstantV2I16(N, *CurDAG)) {
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index fa15e73bc31d5..3e7129f7bcd8f 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1035,6 +1035,9 @@ void SIFoldOperandsImpl::foldOperand(
           getRegSeqInit(Defs, UseReg, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
         const DebugLoc &DL = UseMI->getDebugLoc();
         MachineBasicBlock &MBB = *UseMI->getParent();
+        unsigned SubRegSize =
+            Defs.empty() ? 4
+                         : TRI->getNumChannelsFromSubReg(Defs[0].second) * 4;
 
         UseMI->setDesc(TII->get(AMDGPU::REG_SEQUENCE));
         for (unsigned I = UseMI->getNumOperands() - 1; I > 0; --I)
@@ -1043,17 +1046,29 @@ void SIFoldOperandsImpl::foldOperand(
         MachineInstrBuilder B(*MBB.getParent(), UseMI);
         DenseMap<TargetInstrInfo::RegSubRegPair, Register> VGPRCopies;
         SmallSetVector<TargetInstrInfo::RegSubRegPair, 32> SeenAGPRs;
-        for (unsigned I = 0; I < Size / 4; ++I) {
+        for (unsigned I = 0; I < Size / SubRegSize; ++I) {
           MachineOperand *Def = Defs[I].first;
           TargetInstrInfo::RegSubRegPair CopyToVGPR;
           if (Def->isImm() &&
               TII->isInlineConstant(*Def, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
             int64_t Imm = Def->getImm();
 
-            auto Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
-            BuildMI(MBB, UseMI, DL,
-                    TII->get(AMDGPU::V_ACCVGPR_WRITE_B32_e64), Tmp).addImm(Imm);
-            B.addReg(Tmp);
+            // Adding multiple immediates in case of 64 bit MOVs to agpr.
+            unsigned SubWordSize = SubRegSize / 4;
+            for (unsigned ImmIt = 0; ImmIt < SubWordSize; ++ImmIt) {
+              auto Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
+              unsigned SubRegs =
+                  SIRegisterInfo::getSubRegFromChannel(I * SubWordSize + ImmIt);
+              BuildMI(MBB, UseMI, DL, TII->get(AMDGPU::V_ACCVGPR_WRITE_B32_e64),
+                      Tmp)
+                  .addImm(Imm);
+              B.addReg(Tmp);
+              B.addImm(SubRegs);
+            }
+            // Avoid fallthrough that adds 'Defs[I].second' to B as we're
+            // technically pulling the SubReg apart from sgpr64 to
+            // agpr32:agpr32.
+            continue;
           } else if (Def->isReg() && TRI->isAGPR(*MRI, Def->getReg())) {
             auto Src = getRegSubRegPair(*Def);
             Def->setIsKill(false);
diff --git a/llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll b/llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll
index 23a7bb6ece488..1d0a9f9585123 100644
--- a/llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll
+++ b/llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll
@@ -4,15 +4,14 @@
 define float @test() {
   ; GFX10-LABEL: name: test
   ; GFX10: bb.0.bb:
-  ; GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
-  ; GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_256 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_]], %subreg.sub1, [[S_MOV_B32_]], %subreg.sub2, [[S_MOV_B32_]], %subreg.sub3, [[S_MOV_B32_]], %subreg.sub4, [[S_MOV_B32_]], %subreg.sub5, [[S_MOV_B32_]], %subreg.sub6, [[S_MOV_B32_]], %subreg.sub7
-  ; GFX10-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
-  ; GFX10-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
-  ; GFX10-NEXT:   [[IMAGE_LOAD_V2_V2_nsa_gfx10_:%[0-9]+]]:vreg_64 = IMAGE_LOAD_V2_V2_nsa_gfx10 [[COPY]], [[COPY1]], killed [[REG_SEQUENCE]], 3, 1, -1, 0, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 8)
-  ; GFX10-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[IMAGE_LOAD_V2_V2_nsa_gfx10_]].sub1
-  ; GFX10-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[IMAGE_LOAD_V2_V2_nsa_gfx10_]].sub0
-  ; GFX10-NEXT:   [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[COPY2]], 0, killed [[COPY3]], 0, 0, implicit $mode, implicit $exec
-  ; GFX10-NEXT:   [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[V_ADD_F32_e64_]], 0, [[COPY2]], 0, 0, implicit $mode, implicit $exec
+  ; GFX10-NEXT:   [[S_MOV_B64_:%[0-9]+]]:sreg_64 = S_MOV_B64_IMM_PSEUDO 0
+  ; GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_256 = REG_SEQUENCE [[S_MOV_B64_]], %subreg.sub0_sub1, [[S_MOV_B64_]], %subreg.sub2_sub3, [[S_MOV_B64_]], %subreg.sub4_sub5, [[S_MOV_B64_]], %subreg.sub6_sub7
+  ; GFX10-NEXT:   [[V_MOV_B32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0
+  ; GFX10-NEXT:   [[IMAGE_LOAD_V2_V2_nsa_gfx10_:%[0-9]+]]:vreg_64 = IMAGE_LOAD_V2_V2_nsa_gfx10 [[V_MOV_B32_]], [[V_MOV_B32_]], killed [[REG_SEQUENCE]], 3, 1, -1, 0, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s96), align 16, addrspace 8)
+  ; GFX10-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY [[IMAGE_LOAD_V2_V2_nsa_gfx10_]].sub1
+  ; GFX10-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[IMAGE_LOAD_V2_V2_nsa_gfx10_]].sub0
+  ; GFX10-NEXT:   [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[COPY]], 0, killed [[COPY1]], 0, 0, implicit $mode, implicit $exec
+  ; GFX10-NEXT:   [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[V_ADD_F32_e64_]], 0, [[COPY]], 0, 0, implicit $mode, implicit $exec
   ; GFX10-NEXT:   $vgpr0 = COPY [[V_ADD_F32_e64_1]]
   ; GFX10-NEXT:   SI_RETURN implicit $vgpr0
 bb:
diff --git a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
index 4ce46bbaf45ac..02b17e6229b15 100644
--- a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
+++ b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
@@ -515,48 +515,47 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
 ; GFX908-NEXT:    global_load_ushort v16, v[0:1], off glc
 ; GFX908-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
 ; GFX908-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x10
-; GFX908-NEXT:    s_load_dword s7, s[8:9], 0x18
-; GFX908-NEXT:    s_mov_b32 s6, 0
-; GFX908-NEXT:    s_mov_b32 s9, s6
+; GFX908-NEXT:    s_mov_b32 s7, 0
+; GFX908-NEXT:    s_load_dword s8, s[8:9], 0x18
+; GFX908-NEXT:    v_mov_b32_e32 v19, 0
 ; GFX908-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX908-NEXT:    v_cvt_f32_u32_e32 v0, s3
-; GFX908-NEXT:    s_sub_i32 s8, 0, s3
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v17, s7
-; GFX908-NEXT:    v_mov_b32_e32 v19, 0
-; GFX908-NEXT:    v_rcp_iflag_f32_e32 v2, v0
+; GFX908-NEXT:    s_sub_i32 s6, 0, s3
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v17, s8
+; GFX908-NEXT:    v_rcp_iflag_f32_e32 v0, v0
+; GFX908-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
+; GFX908-NEXT:    v_cvt_u32_f32_e32 v2, v0
 ; GFX908-NEXT:    v_mov_b32_e32 v0, 0
 ; GFX908-NEXT:    v_mov_b32_e32 v1, 0
-; GFX908-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
-; GFX908-NEXT:    v_cvt_u32_f32_e32 v2, v2
-; GFX908-NEXT:    v_readfirstlane_b32 s10, v2
-; GFX908-NEXT:    s_mul_i32 s8, s8, s10
-; GFX908-NEXT:    s_mul_hi_u32 s8, s10, s8
-; GFX908-NEXT:    s_add_i32 s10, s10, s8
-; GFX908-NEXT:    s_mul_hi_u32 s8, s2, s10
-; GFX908-NEXT:    s_mul_i32 s10, s8, s3
-; GFX908-NEXT:    s_sub_i32 s2, s2, s10
-; GFX908-NEXT:    s_add_i32 s11, s8, 1
-; GFX908-NEXT:    s_sub_i32 s10, s2, s3
+; GFX908-NEXT:    v_readfirstlane_b32 s9, v2
+; GFX908-NEXT:    s_mul_i32 s6, s6, s9
+; GFX908-NEXT:    s_mul_hi_u32 s6, s9, s6
+; GFX908-NEXT:    s_add_i32 s9, s9, s6
+; GFX908-NEXT:    s_mul_hi_u32 s6, s2, s9
+; GFX908-NEXT:    s_mul_i32 s9, s6, s3
+; GFX908-NEXT:    s_sub_i32 s2, s2, s9
+; GFX908-NEXT:    s_add_i32 s10, s6, 1
+; GFX908-NEXT:    s_sub_i32 s9, s2, s3
 ; GFX908-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX908-NEXT:    s_cselect_b32 s8, s11, s8
-; GFX908-NEXT:    s_cselect_b32 s2, s10, s2
-; GFX908-NEXT:    s_add_i32 s10, s8, 1
+; GFX908-NEXT:    s_cselect_b32 s6, s10, s6
+; GFX908-NEXT:    s_cselect_b32 s2, s9, s2
+; GFX908-NEXT:    s_add_i32 s9, s6, 1
 ; GFX908-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX908-NEXT:    s_cselect_b32 s8, s10, s8
-; GFX908-NEXT:    s_lshr_b32 s7, s7, 16
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v18, s7
+; GFX908-NEXT:    s_cselect_b32 s6, s9, s6
+; GFX908-NEXT:    s_lshr_b32 s10, s8, 16
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v18, s10
+; GFX908-NEXT:    s_lshl_b64 s[10:11], s[6:7], 5
 ; GFX908-NEXT:    s_lshl_b64 s[2:3], s[0:1], 5
-; GFX908-NEXT:    s_lshl_b64 s[12:13], s[8:9], 5
-; GFX908-NEXT:    s_lshl_b64 s[10:11], s[4:5], 5
-; GFX908-NEXT:    s_or_b32 s10, s10, 28
+; GFX908-NEXT:    s_lshl_b64 s[8:9], s[4:5], 5
+; GFX908-NEXT:    s_or_b32 s8, s8, 28
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
 ; GFX908-NEXT:    v_readfirstlane_b32 s7, v16
 ; GFX908-NEXT:    s_and_b32 s7, 0xffff, s7
 ; GFX908-NEXT:    s_mul_i32 s1, s1, s7
-; GFX908-NEXT:    s_mul_hi_u32 s9, s0, s7
+; GFX908-NEXT:    s_mul_hi_u32 s12, s0, s7
 ; GFX908-NEXT:    s_mul_i32 s0, s0, s7
-; GFX908-NEXT:    s_add_i32 s1, s9, s1
-; GFX908-NEXT:    s_lshl_b64 s[14:15], s[0:1], 5
+; GFX908-NEXT:    s_add_i32 s1, s12, s1
+; GFX908-NEXT:    s_lshl_b64 s[12:13], s[0:1], 5
 ; GFX908-NEXT:    s_branch .LBB3_2
 ; GFX908-NEXT:  .LBB3_1: ; %Flow20
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
@@ -565,59 +564,58 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
 ; GFX908-NEXT:  .LBB3_2: ; %bb9
 ; GFX908-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX908-NEXT:    ; Child Loop BB3_5 Depth 2
-; GFX908-NEXT:    s_mov_b64 s[16:17], -1
+; GFX908-NEXT:    s_mov_b64 s[14:15], -1
 ; GFX908-NEXT:    s_cbranch_scc0 .LBB3_10
 ; GFX908-NEXT:  ; %bb.3: ; %bb14
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
 ; GFX908-NEXT:    global_load_dwordx2 v[2:3], v[0:1], off
 ; GFX908-NEXT:    v_cmp_gt_i64_e64 s[0:1], s[4:5], -1
-; GFX908-NEXT:    s_mov_b32 s7, s6
-; GFX908-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[0:1]
-; GFX908-NEXT:    v_mov_b32_e32 v4, s6
-; GFX908-NEXT:    v_cmp_ne_u32_e64 s[0:1], 1, v6
-; GFX908-NEXT:    v_mov_b32_e32 v6, s6
-; GFX908-NEXT:    v_mov_b32_e32 v9, s7
-; GFX908-NEXT:    v_mov_b32_e32 v5, s7
-; GFX908-NEXT:    v_mov_b32_e32 v7, s7
-; GFX908-NEXT:    v_mov_b32_e32 v8, s6
-; GFX908-NEXT:    v_cmp_lt_i64_e64 s[16:17], s[4:5], 0
-; GFX908-NEXT:    v_mov_b32_e32 v11, v5
-; GFX908-NEXT:    s_mov_b64 s[18:19], s[10:11]
-; GFX908-NEXT:    v_mov_b32_e32 v10, v4
+; GFX908-NEXT:    v_cmp_lt_i64_e64 s[14:15], s[4:5], 0
+; GFX908-NEXT:    v_mov_b32_e32 v4, 0
+; GFX908-NEXT:    v_cndmask_b32_e64 v12, 0, 1, s[0:1]
+; GFX908-NEXT:    v_mov_b32_e32 v6, 0
+; GFX908-NEXT:    v_mov_b32_e32 v8, 0
+; GFX908-NEXT:    v_mov_b32_e32 v10, 0
+; GFX908-NEXT:    v_mov_b32_e32 v5, 0
+; GFX908-NEXT:    v_mov_b32_e32 v7, 0
+; GFX908-NEXT:    v_mov_b32_e32 v9, 0
+; GFX908-NEXT:    v_mov_b32_e32 v11, 0
+; GFX908-NEXT:    v_cmp_ne_u32_e64 s[0:1], 1, v12
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
 ; GFX908-NEXT:    v_readfirstlane_b32 s7, v2
-; GFX908-NEXT:    v_readfirstlane_b32 s9, v3
+; GFX908-NEXT:    v_readfirstlane_b32 s16, v3
 ; GFX908-NEXT:    s_add_u32 s7, s7, 1
-; GFX908-NEXT:    s_addc_u32 s9, s9, 0
-; GFX908-NEXT:    s_mul_hi_u32 s20, s2, s7
-; GFX908-NEXT:    s_mul_i32 s9, s2, s9
-; GFX908-NEXT:    s_mul_i32 s21, s3, s7
-; GFX908-NEXT:    s_add_i32 s9, s20, s9
+; GFX908-NEXT:    s_addc_u32 s16, s16, 0
+; GFX908-NEXT:    s_mul_hi_u32 s17, s2, s7
+; GFX908-NEXT:    s_mul_i32 s16, s2, s16
+; GFX908-NEXT:    s_mul_i32 s18, s3, s7
+; GFX908-NEXT:    s_add_i32 s16, s17, s16
 ; GFX908-NEXT:    s_mul_i32 s7, s2, s7
-; GFX908-NEXT:    s_add_i32 s9, s9, s21
+; GFX908-NEXT:    s_add_i32 s22, s16, s18
+; GFX908-NEXT:    s_mov_b64 s[16:17], s[8:9]
 ; GFX908-NEXT:    s_branch .LBB3_5
 ; GFX908-NEXT:  .LBB3_4: ; %bb58
 ; GFX908-NEXT:    ; in Loop: Header=BB3_5 Depth=2
 ; GFX908-NEXT:    v_add_co_u32_sdwa v2, vcc, v2, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
 ; GFX908-NEXT:    v_addc_co_u32_e32 v3, vcc, 0, v3, vcc
-; GFX908-NEXT:    s_add_u32 s18, s18, s14
-; GFX908-NEXT:    v_cmp_lt_i64_e64 s[22:23], -1, v[2:3]
-; GFX908-NEXT:    s_addc_u32 s19, s19, s15
-; GFX908-NEXT:    s_mov_b64 s[20:21], 0
-; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[22:23]
+; GFX908-NEXT:    s_add_u32 s16, s16, s12
+; GFX908-NEXT:    v_cmp_lt_i64_e64 s[20:21], -1, v[2:3]
+; GFX908-NEXT:    s_addc_u32 s17, s17, s13
+; GFX908-NEXT:    s_mov_b64 s[18:19], 0
+; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[20:21]
 ; GFX908-NEXT:    s_cbranch_vccz .LBB3_9
 ; GFX908-NEXT:  .LBB3_5: ; %bb16
 ; GFX908-NEXT:    ; Parent Loop BB3_2 Depth=1
 ; GFX908-NEXT:    ; => This Inner Loop Header: Depth=2
-; GFX908-NEXT:    s_add_u32 s20, s18, s7
-; GFX908-NEXT:    s_addc_u32 s21, s19, s9
-; GFX908-NEXT:    global_load_dword v21, v19, s[20:21] offset:-12 glc
+; GFX908-NEXT:    s_add_u32 s18, s16, s7
+; GFX908-NEXT:    s_addc_u32 s19, s17, s22
+; GFX908-NEXT:    global_load_dword v21, v19, s[18:19] offset:-12 glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    global_load_dword v20, v19, s[20:21] offset:-8 glc
+; GFX908-NEXT:    global_load_dword v20, v19, s[18:19] offset:-8 glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    global_load_dword v12, v19, s[20:21] offset:-4 glc
+; GFX908-NEXT:    global_load_dword v12, v19, s[18:19] offset:-4 glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    global_load_dword v12, v19, s[20:21] glc
+; GFX908-NEXT:    global_load_dword v12, v19, s[18:19] glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
 ; GFX908-NEXT:    ds_read_b64 v[12:13], v19
 ; GFX908-NEXT:    ds_read_b64 v[14:15], v0
@@ -646,29 +644,29 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
 ; GFX908-NEXT:    v_add_f32_e32 v9, v9, v15
 ; GFX908-NEXT:    v_add_f32_e32 v10, v10, v12
 ; GFX908-NEXT:    v_add_f32_e32 v11, v11, v13
-; GFX908-NEXT:    s_mov_b64 s[20:21], -1
+; GFX908-NEXT:    s_mov_b64 s[18:19], -1
 ; GFX908-NEXT:    s_branch .LBB3_4
 ; GFX908-NEXT:  .LBB3_7: ; in Loop: Header=BB3_5 Depth=2
-; GFX908-NEXT:    s_mov_b64 s[20:21], s[16:17]
-; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[20:21]
+; GFX908-NEXT:    s_mov_b64 s[18:19], s[14:15]
+; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[18:19]
 ; GFX908-NEXT:    s_cbranch_vccz .LBB3_4
 ; GFX908-NEXT:  ; %bb.8: ; in Loop: Header=BB3_2 Depth=1
 ; GFX908-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX908-NEXT:    ; implicit-def: $sgpr18_sgpr19
+; GFX908-NEXT:    ; implicit-def: $sgpr16_sgpr17
 ; GFX908-NEXT:  .LBB3_9: ; %loop.exit.guard
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_xor_b64 s[16:17], s[20:21], -1
+; GFX908-NEXT:    s_xor_b64 s[14:15], s[18:19], -1
 ; GFX908-NEXT:  .LBB3_10: ; %Flow19
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
 ; GFX908-NEXT:    s_mov_b64 s[0:1], -1
-; GFX908-NEXT:    s_and_b64 vcc, exec, s[16:17]
+; GFX908-NEXT:    s_and_b64 vcc, exec, s[14:15]
 ; GFX908-NEXT:    s_cbranch_vccz .LBB3_1
 ; GFX908-NEXT:  ; %bb.11: ; %bb12
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_add_u32 s4, s4, s8
+; GFX908-NEXT:    s_add_u32 s4, s4, s6
 ; GFX908-NEXT:    s_addc_u32 s5, s5, 0
-; GFX908-NEXT:    s_add_u32 s10, s10, s12
-; GFX908-NEXT:    s_addc_u32 s11, s11, s13
+; GFX908-NEXT:    s_add_u32 s8, s8, s10
+; GFX908-NEXT:    s_addc_u32 s9, s9, s11
 ; GFX908-NEXT:    s_mov_b64 s[0:1], 0
 ; GFX908-NEXT:    s_branch .LBB3_1
 ; GFX908-NEXT:  .LBB3_12: ; %DummyReturnBlock
@@ -679,47 +677,46 @@ define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg
 ; GFX90A-NEXT:    global_load_ushort v18, v[0:1], off glc
 ; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
 ; GFX90A-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x10
-; GFX90A-NEXT:    s_load_dword s7, s[8:9], 0x18
-; GFX90A-NEXT:    s_mov_b32 s6, 0
-; GFX90A-NEXT:    s_mov_b32 s9, s6
+; GFX90A-NEXT:    s_mov_b32 s7, 0
+; GFX90A-NEXT:    s_load_dword s8, s[8:9], 0x18
+; GFX90A-NEXT:    v_mov_b32_e32 v19, 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_cvt_f32_u32_e32 v0, s3
-; GFX90A-NEXT:    s_sub_i32 s8, 0, s3
-; GFX90A-NEXT:    v_mov_b32_e32 v19, 0
-; GFX90A-NEXT:    v_rcp_iflag_f32_e32 v2, v0
+; GFX90A-NEXT:    s_sub_i32 s6, 0, s3
+; GFX90A-NEXT:    v_cvt_f32_f16_e32 v2, s8
+; GFX90A-NEXT:    v_rcp_iflag_f32_e32 v0, v0
+; GFX90A-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
+; GFX90A-NEXT:    v_cvt_u32_f32_e32 v3, v0
 ; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], 0, 0
-; GFX90A-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
-; GFX90A-NEXT:    v_cvt_u32_f32_e32 v3, v2
-; GFX90A-NEXT:    v_cvt_f32_f16_e32 v2, s7
-; GFX90A-NEXT:    v_readfirstlane_b32 s10, v3
-; GFX90A-NEXT:    s_mul_i32 s8, s8, s10
-; GFX90A-NEXT:    s_mul_hi_u32 s8, s10, s8
-; GFX90A-NEXT:    s_add_i32 s10, s10, s8
-; GFX90A-NEXT:    s_mul_hi_u32 s8, s2, s10
-; GFX90A-NEXT:    s_mul_i32 s10, s8, s3
-; GFX90A-NEXT:    s_sub_i32 s2, s2, s10
-; GFX90A-NEXT:    s_add_i32 s11, s8, 1
-; GFX90A-NEXT:    s_sub_i32 s10, s2, s3
+; GFX90A-NEXT:    v_readfirstlane_b32 s9, v3
+; GFX90A-NEXT:    s_mul_i32 s6, s6, s9
+; GFX90A-NEXT:    s_mul_hi_u32 s6, s9, s6
+; GFX90A-NEXT:    s_add_i32 s9, s9, s6
+; GFX90A-NEXT:    s_mul_hi_u32 s6, s2, s9
+; GFX90A-NEXT:    s_mul_i32 s9, s6, s3
+; GFX90A-NEXT:    s_sub_i32 s2, s2, s9
+; GFX90A-NEXT:    s_add_i32 s10, s6, 1
+; GFX90A-NEXT:    s_sub_i32 s9, s2, s3
 ; GFX90A-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX90A-NEXT:    s_cselect_b32 s8, s11, s8
-; GFX90A-NEXT:    s_cselect_b32 s2, s10, s2
-; GFX90A-NEXT:    s_add_i32 s10, s8, 1
+; GFX90A-NEXT:    s_cselect_b32 s6, s10, s6
+; GFX90A-NEXT:    s_cselect_b32 s2, s9, s2
+; GFX90A-NEXT:    s_add_i32 s9, s6, 1
 ; GFX90A-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX90A-NEXT:    s_cselect_b32 s8, s10, s8
-; GFX9...
[truncated]


github-actions bot commented Feb 20, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 553da9634dc4bae215e6c850d2de3186d09f9da5 f7e6f7b7f3bd16943a7874a717dbec2e04b5156f llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp llvm/test/CodeGen/AMDGPU/adjust-writemask-cse.ll llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll llvm/test/CodeGen/AMDGPU/bitreverse-inline-immediates.ll llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll llvm/test/CodeGen/AMDGPU/bug-cselect-b64.ll llvm/test/CodeGen/AMDGPU/cluster_stores.ll llvm/test/CodeGen/AMDGPU/collapse-endcf.ll llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll llvm/test/CodeGen/AMDGPU/fcanonicalize.ll llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-nondeterminism.ll llvm/test/CodeGen/AMDGPU/flat-scratch.ll llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll llvm/test/CodeGen/AMDGPU/hazard-recognizer-src-shared-base.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/imm.ll llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll llvm/test/CodeGen/AMDGPU/issue98474-need-live-out-undef-subregister-def.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.ll llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll llvm/test/CodeGen/AMDGPU/max-hard-clause-length.ll llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll llvm/test/CodeGen/AMDGPU/mfma-loop.ll llvm/test/CodeGen/AMDGPU/module-lds-false-sharing.ll llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll llvm/test/CodeGen/AMDGPU/scalar-float-sop2.ll llvm/test/CodeGen/AMDGPU/scheduler-rp-calc-one-successor-two-predecessors-bug.ll llvm/test/CodeGen/AMDGPU/set_kill_i1_for_floation_point_comparison.ll llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.ll llvm/test/CodeGen/AMDGPU/si-scheduler-exports.ll llvm/test/CodeGen/AMDGPU/smfmac_no_agprs.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll llvm/test/CodeGen/AMDGPU/vopc_dpp.ll llvm/test/CodeGen/AMDGPU/waterfall_kills_scc.ll llvm/test/CodeGen/AMDGPU/wqm.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

Contributor

@arsenm arsenm left a comment


The set of peephole opt patches I'm working on improves some of these types of situations.

You also have a number of regressions in the VALU cases. You should do something about those (which may mean not doing this for VALU values), and you should also consider v_mov_b64 on gfx940.

if (ISD::isConstantSplatVectorAllZeros(N))
return true;

// Types may have been legalized by splitting the 16-bit multi-element vector
Contributor


It would be better to adjust the legalizer to do this in the first place

Contributor Author


As in, allow vector types as long as they splat 0?

Contributor


Even if they don't splat 0

Contributor


The splatness isn't significant, and if we're going to have 64-bit vector elements we should have made them legal throughout the DAG, not just during selection

@@ -1374,29 +1374,28 @@ define amdgpu_kernel void @test_fold_canonicalize_p0_f64(ptr addrspace(1) %out)
; GFX9: ; %bb.0:
; GFX9-NEXT: s_load_dwordx2 s[0:1], s[8:9], 0x0
; GFX9-NEXT: v_mov_b32_e32 v0, 0
; GFX9-NEXT: v_mov_b32_e32 v1, v0
; GFX9-NEXT: v_mov_b32_e32 v1, 0
; GFX9-NEXT: v_mov_b32_e32 v2, 0
Contributor


Regression

@@ -1585,29 +1584,28 @@ define amdgpu_kernel void @test_no_denormals_fold_canonicalize_denormal0_f64(ptr
; GFX9: ; %bb.0:
; GFX9-NEXT: s_load_dwordx2 s[0:1], s[8:9], 0x0
; GFX9-NEXT: v_mov_b32_e32 v0, 0
; GFX9-NEXT: v_mov_b32_e32 v1, v0
; GFX9-NEXT: v_mov_b32_e32 v1, 0
; GFX9-NEXT: v_mov_b32_e32 v2, 0
Contributor


Regression

Contributor Author

JanekvO commented Feb 20, 2025

You also have a number of regressions in the VALU cases. Should do something about those (which may mean not doing this for VALU values).

Looking into it

Also should consider v_mov_b64 on gfx940

That was going to be a follow-up patch that folds the sgpr64 set-0 cases into vgpr64 in SIFoldOperands (perhaps it should be a peephole opt?).

Contributor

arsenm commented Feb 21, 2025

You also have a number of regressions in the VALU cases. Should do something about those (which may mean not doing this for VALU values).

Looking into it

I am speculating that there are several things going on, some of which are going to be improved by a set of patches I'm working on, which is taking significantly longer than I expected. Related to that, I was thinking about other ways we can improve the immediate folding.

I've been working on enabling MachineCSE to work for subregister extracts, and on improving folds of copies through subregisters. In doing this, I see a number of places where we get better coalescing and can reassemble materializations of 64-bit inline immediates.

I've been thinking we should teach foldImmediate and/or SIFoldOperands to reconstruct 64-bit moves. For example, foldImmediate can see that the use instruction is a reg_sequence with the same register used twice and replace it with an s_mov_b64. It also currently skips any constant with multiple uses, but the correct heuristic is probably more refined (e.g. only skip multiple uses for non-inline immediates).
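
A rough MIR sketch of that fold (hypothetical; register classes and values are illustrative):

  %0:sreg_32 = S_MOV_B32 0
  %1:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %0, %subreg.sub1
  ; foldImmediate could spot the reg_sequence built from the same zero twice and rewrite it as
  %1:sreg_64 = S_MOV_B64 0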

Contributor Author

JanekvO commented Feb 21, 2025

I've been working on enabling MachineCSE to work for subregister extracts, and improving folds of copies through sub registers. In doing this, I see a number of places where we get better coalescing and re-assemble materialize of 64-bit inline immediates.

This is exactly what would fix the aforementioned regressions. Before the s_mov_b64 materialization, MachineCSE eliminates the duplicate s_mov_b32, but it doesn't have the logic to CSE immediates that are materialized into subregisters.
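
As a hypothetical sketch of the missed CSE (register names are illustrative): once the vector zero is built from a 64-bit move, a separate 32-bit zero in a VGPR can no longer be CSE'd or copied from it, since that would require reusing a subregister of the wider materialization:

  %0:sreg_64 = S_MOV_B64_IMM_PSEUDO 0           ; feeds the vector REG_SEQUENCE
  %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec  ; separate zero that previously reused the shared S_MOV_B32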

I've been thinking we should teach foldImmediate and/or SIFoldOperands to reconstruct 64-bit moves. e.g. foldImmediate can see the use instruction is a reg_sequence with the same register used twice, and replace it with an s_mov_b64. It also currently skips any constants with multiple uses, but the correct heuristic is probably more refined (like only skip multiple uses for non inline-immediate)

One of the reasons I tried to solve this in the SelectionDAG is that it has non-opaque, well-described helpers for checking whether a vector is a splat. Additionally, my plan was to add s_mov_b64 -> v_mov_b64 folding (for gfx942+) in SIFoldOperands, so I wouldn't have to fit both s_mov_b32 x2 -> s_mov_b64 and s_mov_b64 -> v_mov_b64 into the same pass.

Anyway, I'll look into some of the mentioned alternatives, since this approach also requires a separate change for GlobalISel to do the same.

Contributor Author

JanekvO commented Mar 3, 2025

Rebase

Contributor Author

JanekvO commented Mar 3, 2025

I've been looking at resolving the regression but am having a hard time finding a good MachineCSE alternative. I tried to convince the complex pattern to associate the added v_mov_b32 0 explicitly with a vector element of i64 = build_vector i32 0, i32 0, where possible. Handling this in any of the constant-folding mechanisms (PeepholeOpt, SIFoldOperands) is a bit awkward, since the fold may be between two set-0 expressions that don't share a common ancestor expression. A blanket ban on any VGPR destination isn't preferred either, as I have a follow-up patch that helps fold those into v_mov_b64 0 where applicable. I'll look into resolving the specific regression shown by associating a fold (in SIFoldOperands?) with a common use in the mem op that generates the possibly superfluous v_mov_b32 0 in its complex expression, but I acknowledge this wouldn't work for non-mem-op cases with multiple v_mov_bXY 0 instructions of different sizes.

Additionally, I bailed out of moving this selection code into foldImmediate, as the resulting code seemed less clear in its intent than the selection-code equivalent. I'll leave it to the reviewers' discretion whether I should move it into foldImmediate or SIFoldOperands anyway (I admit I haven't looked into moving this into SIFoldOperands).

@JanekvO JanekvO requested a review from arsenm March 3, 2025 13:48
BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::V_ACCVGPR_WRITE_B32_e64), Tmp)
.addImm(Imm);
B.addReg(Tmp);
// Adding multiple immediates in case of 64 bit MOVs to agpr.
Contributor


This part is obsolete with #129463

Contributor Author


I'll have to rebase wrt your changes and add a similar split-b64-for-agpr

Contributor

@arsenm arsenm left a comment


Do any of these test changes include the example you are trying to fix? I have several patches in flight which largely address similar issues

Contributor Author

JanekvO commented Mar 3, 2025

Do any of these test changes include the example you are trying to fix? I have several patches in flight which largely address similar issues

I've added https://github.com/llvm/llvm-project/pull/128017/files#diff-a073dcad4b3da7b790117b5218e1e7d7bd24eb963ac0d54b601f7f6fb001e453
If you'd like, I can pre-commit a version so you can compare the codegen more easily, but basically all of these s_mov_b64s were s_mov_b32s before.

Additionally, my follow-up patch will fold s_mov_b64 0 into v_mov_b64 0.
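
Roughly, the shape of that follow-up fold, as a sketch assuming gfx942+ where a 64-bit VALU move is available (V_MOV_B64_PSEUDO stands in for whichever v_mov_b64 form the fold would emit):

  %0:sreg_64 = S_MOV_B64_IMM_PSEUDO 0
  %1:vreg_64 = COPY %0
  ; would become:
  %1:vreg_64 = V_MOV_B64_PSEUDO 0, implicit $exec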

Contributor

@arsenm arsenm left a comment


can you separately pre-commit the test

@@ -0,0 +1,272 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -O3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 < %s | FileCheck --check-prefix=GFX942 %s
Contributor


don't need -O3, probably

%0 = insertelement <2 x i1> poison, i1 %mask, i64 0
%1 = shufflevector <2 x i1> %0, <2 x i1> poison, <2 x i32> zeroinitializer
%result = tail call <2 x i32> @llvm.masked.load.v2i32.p1(ptr addrspace(1) %ptr, i32 2, <2 x i1> %1, <2 x i32> zeroinitializer)
ret <2 x i32> %result
Contributor


Use named values in tests

ret <8 x bfloat> %result
}

define <16 x i8> @masked_load_v16i8(ptr addrspace(1) inreg nocapture readonly %ptr, i1 %mask) {
Contributor


Can you add something to the name to indicate the value kinds? These are all uniform SGPR ptr + divergent mask

ret <16 x i8> %result
}

; Declare the masked load intrinsics
Contributor


Don't need the comment (don't actually need the declarations either)

Contributor Author

JanekvO commented Mar 4, 2025

can you separately pre-commit the test

I've put up #129703, and yesterday's PR #129464 has already helped remove the intermediate s_mov_b32 materializations. My aim with this plus follow-up PR(s) was to get v_mov_b64 emitted for gfx942+, but do let me know if you have anything in flight at the moment for folding into v_mov_b64.

Contributor Author

JanekvO commented Mar 10, 2025

I've been looking at resolving the regression but am having a hard time finding a good MachineCSE alternative. I tried to convince the complex pattern to associate the added v_mov_b32 0 explicitly with a vector element of i64 = build_vector i32 0, i32 0, where possible. Handling this in any of the constant-folding mechanisms (PeepholeOpt, SIFoldOperands) is a bit awkward, since the fold may be between two set-0 expressions that don't share a common ancestor expression. A blanket ban on any VGPR destination isn't preferred either, as I have a follow-up patch that helps fold those into v_mov_b64 0 where applicable.

Any ideas for alternatives to making MachineCSE subregister-aware, to address the regression of an additional VGPR use in cases like fcanonicalize.ll? I can't seem to find an appropriate alternative that avoids it (I'm wondering whether it's worth continuing this PR if the regression cannot be fixed; would it be an acceptable regression?).

Contributor Author

JanekvO commented Mar 12, 2025

Rebase

Contributor Author

JanekvO commented Apr 23, 2025

This shouldn't be resolved in selection; instead, move b64 immediate materialization (through v_mov_b64, if available) to the AMDGPU fold mechanism.
