
[InitUndef] Enable the InitUndef pass on non-AMDGPU targets #108353


Merged
nikic merged 5 commits into llvm:main from init-undef-general on Sep 16, 2024

Conversation

nikic
Contributor

@nikic nikic commented Sep 12, 2024

The InitUndef pass works around a register allocation issue, where undef operands can be allocated to the same register as early-clobber result operands. This may lead to ISA constraint violations, where certain input and output registers are not allowed to overlap.

Originally this pass was implemented for RISCV, and then extended to ARM in #77770. I've since removed the target-specific parts of the pass in #106744 and #107885. This PR reduces the pass to use a single requiresDisjointEarlyClobberAndUndef() target hook and enables it by default. The hook is disabled for AMDGPU, because overlapping early-clobber and undef operands are known to be safe for that target, and we get significant codegen diffs otherwise.

The motivating case is the one in arm64-ldxr-stxr.ll, where we were previously incorrectly allocating a stxp input and output to the same register.
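For orientation, here is a minimal, self-contained sketch of the hook shape described above. This only models the idea: the stand-in classes below are simplified assumptions, not the actual LLVM interfaces, and the exact placement of the hook in the class hierarchy is inferred from the files this patch touches.

#include <iostream>

// Simplified stand-in for the LLVM target interface; only the hook added by
// this PR is modeled. The default is the conservatively correct answer:
// assume early-clobber results and undef inputs must not share a register.
struct TargetSubtargetInfoStub {
  virtual ~TargetSubtargetInfoStub() = default;
  virtual bool requiresDisjointEarlyClobberAndUndef() const { return true; }
};

// AMDGPU opts out: overlapping early-clobber and undef operands are known to
// be safe there, and forcing them apart causes significant codegen diffs.
struct AMDGPUSubtargetStub : TargetSubtargetInfoStub {
  bool requiresDisjointEarlyClobberAndUndef() const override { return false; }
};

// InitUndef then gates itself on the single hook instead of per-target
// opt-ins like the old supportsInitUndef().
bool shouldRunInitUndef(const TargetSubtargetInfoStub &ST) {
  return ST.requiresDisjointEarlyClobberAndUndef();
}

int main() {
  TargetSubtargetInfoStub Generic;
  AMDGPUSubtargetStub AMDGPU;
  std::cout << "generic target runs InitUndef: " << shouldRunInitUndef(Generic)
            << "\nAMDGPU runs InitUndef: " << shouldRunInitUndef(AMDGPU)
            << '\n';
}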

@llvmbot
Member

llvmbot commented Sep 12, 2024

@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-backend-arm

Author: Nikita Popov (nikic)

Changes

The InitUndef pass works around a register allocation issue, where undef operands can be allocated to the same register as early-clobber result operands. This may lead to ISA constraint violations, where certain input and output registers are not allowed to overlap.

Originally this pass was implemented for RISCV, and then extended to ARM in #77770. I've since removed the target-specific parts of the pass in #106744 and #107885. This PR now enables the pass for all targets.

The motivating case is the one in arm64-ldxr-stxr.ll for the AArch64 target, where we were previously incorrectly allocating a stxp input and output to the same register.


Patch is 58.71 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108353.diff

17 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetRegisterInfo.h (-13)
  • (modified) llvm/include/llvm/CodeGen/TargetSubtargetInfo.h (-6)
  • (modified) llvm/lib/CodeGen/InitUndef.cpp (-14)
  • (modified) llvm/lib/Target/ARM/ARMBaseRegisterInfo.h (-14)
  • (modified) llvm/lib/Target/ARM/ARMSubtarget.h (-7)
  • (modified) llvm/lib/Target/RISCV/RISCVRegisterInfo.h (-5)
  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.h (-2)
  • (modified) llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+28-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+12-11)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+163-159)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (+12-10)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_u64_u32.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_modifiers.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/regalloc-last-chance-recoloring-failure.ll (+2-2)
diff --git a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
index ebf06bc57948f2..1a2f31e199336a 100644
--- a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
@@ -1203,19 +1203,6 @@ class TargetRegisterInfo : public MCRegisterInfo {
   virtual bool isNonallocatableRegisterCalleeSave(MCRegister Reg) const {
     return false;
   }
-
-  /// Returns if the architecture being targeted has the required Pseudo
-  /// Instructions for initializing the register. By default this returns false,
-  /// but where it is overriden for an architecture, the behaviour will be
-  /// different. This can either be a check to ensure the Register Class is
-  /// present, or to return true as an indication the architecture supports the
-  /// pass. If using the method that does not check for the Register Class, it
-  /// is imperative to ensure all required Pseudo Instructions are implemented,
-  /// otherwise compilation may fail with an `Unexpected register class` error.
-  virtual bool
-  doesRegClassHavePseudoInitUndef(const TargetRegisterClass *RC) const {
-    return false;
-  }
 };
 
 //===----------------------------------------------------------------------===//
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index b4b018f080914a..707842f896b2fe 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -332,12 +332,6 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
 
   /// Get the list of MacroFusion predicates.
   virtual std::vector<MacroFusionPredTy> getMacroFusions() const { return {}; };
-
-  /// supportsInitUndef is used to determine if an architecture supports
-  /// the Init Undef Pass. By default, it is assumed that it will not support
-  /// the pass, with architecture specific overrides providing the information
-  /// where they are implemented.
-  virtual bool supportsInitUndef() const { return false; }
 };
 
 } // end namespace llvm
diff --git a/llvm/lib/CodeGen/InitUndef.cpp b/llvm/lib/CodeGen/InitUndef.cpp
index d6f7c0d7cf0f5f..4081cb17b3cfdd 100644
--- a/llvm/lib/CodeGen/InitUndef.cpp
+++ b/llvm/lib/CodeGen/InitUndef.cpp
@@ -120,8 +120,6 @@ bool InitUndef::handleReg(MachineInstr *MI) {
       continue;
     if (!UseMO.getReg().isVirtual())
       continue;
-    if (!TRI->doesRegClassHavePseudoInitUndef(MRI->getRegClass(UseMO.getReg())))
-      continue;
 
     if (UseMO.isUndef() || findImplictDefMIFromReg(UseMO.getReg(), MRI))
       Changed |= fixupIllOperand(MI, UseMO);
@@ -140,8 +138,6 @@ bool InitUndef::handleSubReg(MachineFunction &MF, MachineInstr &MI,
       continue;
     if (UseMO.isTied())
       continue;
-    if (!TRI->doesRegClassHavePseudoInitUndef(MRI->getRegClass(UseMO.getReg())))
-      continue;
 
     Register Reg = UseMO.getReg();
     if (NewRegs.count(Reg))
@@ -245,16 +241,6 @@ bool InitUndef::processBasicBlock(MachineFunction &MF, MachineBasicBlock &MBB,
 
 bool InitUndef::runOnMachineFunction(MachineFunction &MF) {
   ST = &MF.getSubtarget();
-
-  // supportsInitUndef is implemented to reflect if an architecture has support
-  // for the InitUndef pass. Support comes from having the relevant Pseudo
-  // instructions that can be used to initialize the register. The function
-  // returns false by default so requires an implementation per architecture.
-  // Support can be added by overriding the function in a way that best fits
-  // the architecture.
-  if (!ST->supportsInitUndef())
-    return false;
-
   MRI = &MF.getRegInfo();
   TII = ST->getInstrInfo();
   TRI = MRI->getTargetRegisterInfo();
diff --git a/llvm/lib/Target/ARM/ARMBaseRegisterInfo.h b/llvm/lib/Target/ARM/ARMBaseRegisterInfo.h
index 58b5e98fd30b14..926d702b4092a5 100644
--- a/llvm/lib/Target/ARM/ARMBaseRegisterInfo.h
+++ b/llvm/lib/Target/ARM/ARMBaseRegisterInfo.h
@@ -240,20 +240,6 @@ class ARMBaseRegisterInfo : public ARMGenRegisterInfo {
                             unsigned SrcSubReg) const override;
 
   int getSEHRegNum(unsigned i) const { return getEncodingValue(i); }
-
-  bool doesRegClassHavePseudoInitUndef(
-      const TargetRegisterClass *RC) const override {
-    (void)RC;
-    // For the ARM Architecture we want to always return true because all
-    // required PseudoInitUndef types have been added. If compilation fails due
-    // to `Unexpected register class`, this is likely to be because the specific
-    // register being used is not support by Init Undef and needs the Pseudo
-    // Instruction adding to ARMInstrInfo.td. If this is implemented as a
-    // conditional check, this could create a false positive where Init Undef is
-    // not running, skipping the instruction and moving to the next. This could
-    // lead to illegal instructions being generated by the register allocator.
-    return true;
-  }
 };
 
 } // end namespace llvm
diff --git a/llvm/lib/Target/ARM/ARMSubtarget.h b/llvm/lib/Target/ARM/ARMSubtarget.h
index 00239ff94b7ba5..fa20f4b590bea5 100644
--- a/llvm/lib/Target/ARM/ARMSubtarget.h
+++ b/llvm/lib/Target/ARM/ARMSubtarget.h
@@ -209,13 +209,6 @@ class ARMSubtarget : public ARMGenSubtargetInfo {
     return &InstrInfo->getRegisterInfo();
   }
 
-  /// The correct instructions have been implemented to initialize undef
-  /// registers, therefore the ARM Architecture is supported by the Init Undef
-  /// Pass. This will return true as the pass needs to be supported for all
-  /// types of instructions. The pass will then perform more checks to ensure it
-  /// should be applying the Pseudo Instructions.
-  bool supportsInitUndef() const override { return true; }
-
   const CallLowering *getCallLowering() const override;
   InstructionSelector *getInstructionSelector() const override;
   const LegalizerInfo *getLegalizerInfo() const override;
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
index 98a712af085399..cb0bb77d1fcbcb 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
@@ -130,11 +130,6 @@ struct RISCVRegisterInfo : public RISCVGenRegisterInfo {
                              const MachineFunction &MF, const VirtRegMap *VRM,
                              const LiveRegMatrix *Matrix) const override;
 
-  bool doesRegClassHavePseudoInitUndef(
-      const TargetRegisterClass *RC) const override {
-    return isVRRegClass(RC);
-  }
-
   static bool isVRRegClass(const TargetRegisterClass *RC) {
     return RISCVRI::isVRegClass(RC->TSFlags) &&
            RISCVRI::getNF(RC->TSFlags) == 1;
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index ea54ff1df0b7cb..bf9ed3f3d71655 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -306,8 +306,6 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
   unsigned getTailDupAggressiveThreshold() const {
     return TuneInfo->TailDupAggressiveThreshold;
   }
-
-  bool supportsInitUndef() const override { return hasVInstructions(); }
 };
 } // End llvm namespace
 
diff --git a/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll b/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll
index 1a60f8752dd571..4fb0c2775a7a7a 100644
--- a/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll
@@ -354,11 +354,10 @@ define dso_local i32 @test_store_release_i64(i32, i64 %val, ptr %addr) {
 }
 
 ; The stxp result cannot be allocated to the same register as the inputs.
-; FIXME: This is a miscompile.
 define dso_local i32 @test_stxp_undef(ptr %p, i64 %x) nounwind {
 ; CHECK-LABEL: test_stxp_undef:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    stxp w8, x8, x1, [x0]
+; CHECK-NEXT:    stxp w8, x9, x1, [x0]
 ; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %res = call i32 @llvm.aarch64.stxp(i64 undef, i64 %x, ptr %p)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
index 489f46d1237a36..ae1b54f2aaf9c8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
@@ -33,11 +33,9 @@ define amdgpu_kernel void @v_mul_i64_no_zext(ptr addrspace(1) %out, ptr addrspac
 ; GFX11-NEXT:    global_load_b64 v[2:3], v9, s[2:3]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    v_mad_u64_u32 v[4:5], null, v0, v2, 0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[6:7], null, v0, v3, v[5:6]
-; GFX11-NEXT:    v_mad_u64_u32 v[7:8], null, v1, v2, v[6:7]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v5, v7
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_mad_u64_u32 v[7:8], null, v0, v3, v[5:6]
+; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v1, v2, v[7:8]
 ; GFX11-NEXT:    global_store_b64 v9, v[4:5], s[2:3]
 ; GFX11-NEXT:    s_nop 0
 ; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -85,13 +83,13 @@ define amdgpu_kernel void @v_mul_i64_zext_src1(ptr addrspace(1) %out, ptr addrsp
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    global_load_b64 v[0:1], v1, s[6:7]
-; GFX11-NEXT:    global_load_b32 v5, v2, s[0:1]
+; GFX11-NEXT:    global_load_b32 v7, v2, s[0:1]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v0, v5, 0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, v3
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v1, v5, v[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v0, v7, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v1, v7, v[3:4]
+; GFX11-NEXT:    v_mov_b32_e32 v3, v5
 ; GFX11-NEXT:    global_store_b64 v0, v[2:3], s[4:5]
 ; GFX11-NEXT:    s_nop 0
 ; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -138,14 +136,14 @@ define amdgpu_kernel void @v_mul_i64_zext_src0(ptr addrspace(1) %out, ptr addrsp
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v0
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    global_load_b32 v5, v1, s[6:7]
+; GFX11-NEXT:    global_load_b32 v7, v1, s[6:7]
 ; GFX11-NEXT:    global_load_b64 v[0:1], v0, s[0:1]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v5, v0, 0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, v3
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, v1, v[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v7, v0, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v7, v1, v[3:4]
+; GFX11-NEXT:    v_mov_b32_e32 v3, v5
 ; GFX11-NEXT:    global_store_b64 v0, v[2:3], s[4:5]
 ; GFX11-NEXT:    s_nop 0
 ; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -241,14 +239,14 @@ define amdgpu_kernel void @v_mul_i64_masked_src0_hi(ptr addrspace(1) %out, ptr a
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_clause 0x1
-; GFX11-NEXT:    global_load_b32 v5, v0, s[6:7]
+; GFX11-NEXT:    global_load_b32 v7, v0, s[6:7]
 ; GFX11-NEXT:    global_load_b64 v[0:1], v0, s[0:1]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v5, v0, 0
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, v3
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, v1, v[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v7, v0, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v7, v1, v[3:4]
+; GFX11-NEXT:    v_mov_b32_e32 v3, v5
 ; GFX11-NEXT:    global_store_b64 v0, v[2:3], s[4:5]
 ; GFX11-NEXT:    s_nop 0
 ; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -436,16 +434,14 @@ define amdgpu_kernel void @v_mul_i64_partially_masked_src0(ptr addrspace(1) %out
 ; GFX11-NEXT:    global_load_b64 v[0:1], v2, s[6:7]
 ; GFX11-NEXT:    global_load_b64 v[2:3], v2, s[0:1]
 ; GFX11-NEXT:    s_waitcnt vmcnt(1)
-; GFX11-NEXT:    v_and_b32_e32 v7, 0xfff00000, v0
+; GFX11-NEXT:    v_and_b32_e32 v0, 0xfff00000, v0
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[4:5], null, v7, v2, 0
-; GFX11-NEXT:    v_mov_b32_e32 v0, v5
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v7, v3, v[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[4:5], null, v0, v2, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[7:8], null, v0, v3, v[5:6]
 ; GFX11-NEXT:    v_and_b32_e32 v3, 0xf00f, v1
-; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v3, v2, v[5:6]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v3, v2, v[7:8]
 ; GFX11-NEXT:    v_dual_mov_b32 v5, v0 :: v_dual_mov_b32 v0, 0
 ; GFX11-NEXT:    global_store_b64 v0, v[4:5], s[4:5]
 ; GFX11-NEXT:    s_nop 0
@@ -568,10 +564,12 @@ define amdgpu_kernel void @v_mul64_masked_before_and_in_branch(ptr addrspace(1)
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v2, v4, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v2, v5, v[1:2]
-; GFX11-NEXT:    ; implicit-def: $vgpr4_vgpr5
-; GFX11-NEXT:    v_mov_b32_e32 v1, v3
+; GFX11-NEXT:    v_mov_b32_e32 v3, v1
+; GFX11-NEXT:    v_mad_u64_u32 v[6:7], null, v2, v5, v[3:4]
 ; GFX11-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; GFX11-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_mov_b32_e32 v1, v6
 ; GFX11-NEXT:  .LBB10_2: ; %Flow
 ; GFX11-NEXT:    s_and_not1_saveexec_b32 s0, s0
 ; GFX11-NEXT:    s_cbranch_execz .LBB10_4
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
index 42f1bf84c04207..a7df4e6e75103a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
@@ -667,8 +667,8 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
 ; GFX11-NEXT:    v_mul_lo_u32 v0, v6, v5
 ; GFX11-NEXT:    v_mad_u64_u32 v[8:9], null, v7, v4, v[0:1]
 ; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v6, v3, 0
-; GFX11-NEXT:    v_mad_u64_u32 v[9:10], null, v2, v3, v[8:9]
-; GFX11-NEXT:    v_mov_b32_e32 v2, v9
+; GFX11-NEXT:    v_mad_u64_u32 v[10:11], null, v2, v3, v[8:9]
+; GFX11-NEXT:    v_mov_b32_e32 v2, v10
 ; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v6, v4, v[1:2]
 ; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v7, v3, v[1:2]
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
@@ -2691,9 +2691,9 @@ define amdgpu_ps void @s_mul_u64_sext_with_vregs(ptr addrspace(1) %out, ptr addr
 ; GFX11-NEXT:    global_load_b32 v4, v[2:3], off
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, 0x50, v4, 0
-; GFX11-NEXT:    v_ashrrev_i32_e32 v6, 31, v4
-; GFX11-NEXT:    v_mad_u64_u32 v[4:5], null, 0x50, v6, v[3:4]
-; GFX11-NEXT:    v_mov_b32_e32 v3, v4
+; GFX11-NEXT:    v_ashrrev_i32_e32 v7, 31, v4
+; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, 0x50, v7, v[3:4]
+; GFX11-NEXT:    v_mov_b32_e32 v3, v5
 ; GFX11-NEXT:    global_store_b64 v[0:1], v[2:3], off
 ; GFX11-NEXT:    s_nop 0
 ; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index b17dfc7c3754a1..5262247c755b3f 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -550,7 +550,6 @@ define amdgpu_kernel void @add_i32_uniform(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1164-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX1164-NEXT:    s_mov_b32 s7, 0x31016000
 ; GFX1164-NEXT:    s_mov_b32 s6, -1
-; GFX1164-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX1164-NEXT:    v_mad_u64_u32 v[1:2], null, s2, v0, s[0:1]
 ; GFX1164-NEXT:    buffer_store_b32 v1, off, s[4:7], 0
 ; GFX1164-NEXT:    s_nop 0
@@ -588,7 +587,6 @@ define amdgpu_kernel void @add_i32_uniform(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1132-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX1132-NEXT:    s_mov_b32 s7, 0x31016000
 ; GFX1132-NEXT:    s_mov_b32 s6, -1
-; GFX1132-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX1132-NEXT:    v_mad_u64_u32 v[1:2], null, s0, v0, s[2:3]
 ; GFX1132-NEXT:    buffer_store_b32 v1, off, s[4:7], 0
 ; GFX1132-NEXT:    s_nop 0
@@ -2219,11 +2217,12 @@ define amdgpu_kernel void @add_i64_uniform(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1164-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX1164-NEXT:    s_mov_b32 s7, 0x31016000
 ; GFX1164-NEXT:    s_mov_b32 s6, -1
-; GFX1164-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX1164-NEXT:    v_mad_u64_u32 v[0:1], null, s0, v2, s[2:3]
-; GFX1164-NEXT:    v_mad_u64_u32 v[3:4], null, s1, v2, v[1:2]
+; GFX1164-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164-NEXT:    v_mov_b32_e32 v3, v1
+; GFX1164-NEXT:    v_mad_u64_u32 v[5:6], null, s1, v2, v[3:4]
 ; GFX1164-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164-NEXT:    v_mov_b32_e32 v1, v3
+; GFX1164-NEXT:    v_mov_b32_e32 v1, v5
 ; GFX1164-NEXT:    buffer_store_b64 v[0:1], off, s[4:7], 0
 ; GFX1164-NEXT:    s_nop 0
 ; GFX1164-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -2265,11 +2264,12 @@ define amdgpu_kernel void @add_i64_uniform(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1132-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX1132-NEXT:    s_mov_b32 s7, 0x31016000
 ; GFX1132-NEXT:    s_mov_b32 s6, -1
-; GFX1132-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX1132-NEXT:    v_mad_u64_u32 v[0:1], null, s0, v2, s[2:3]
-; GFX1132-NEXT:    v_mad_u64_u32 v[3:4], null, s1, v2, v[1:2]
+; GFX1132-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132-NEXT:    v_mov_b32_e32 v3, v1
+; GFX1132-NEXT:    v_mad_u64_u32 v[5:6], null, s1, v2, v[3:4]
 ; GFX1132-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132-NEXT:    v_mov_b32_e32 v1, v3
+; GFX1132-NEXT:    v_mov_b32_e32 v1, v5
 ; GFX1132-NEXT:    buffer_store_b64 v[0:1], off, s[4:7], 0
 ; GFX1132-NEXT:    s_nop 0
 ; GFX1132-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
@@ -5918,11 +5918,11 @@ define amdgpu_kernel void @sub_i64_uniform(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1164-NEXT:    s_mov_b32 s7, 0x31016000
 ; GFX1164-NEXT:    s_mov_b32 s6, -1
 ; GFX1164-NEXT:    s_waitcnt_depctr 0xfff
-; GFX1164-NEXT:    v_mad_u64_u32 v[5:6], null, s1, v2, v[4:5]
+; GFX1164-NEXT:    v_mad_u64_u32 v[6:7], null, s1, v2, v[4:5]
 ; GFX1164-NEXT:    v_readfirstlane_b32 s1, v1
 ; GFX1164-NEXT:    v_sub_co_u32 v0, vcc, s0, v3
 ; GFX1164-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164-NEXT:    v_mov_b32_e32 v1, v5
+; GFX1164-NEXT:    v_mov_b32_e32 v1, v6
 ; GFX1164-NEXT:    v_sub_co_ci_u32_e32 v1, vcc, s1, v1, vcc
 ; GFX1164-NEXT:    buffer_store_b64 v[0:1], off, s[4:7], 0
 ; GFX1164-NEXT:    s_nop 0
@@ -5966,11 +5966,11 @@ define amdgpu_kernel void @sub_i64_uniform(ptr addrspace(1) %out, ptr addrspace(
 ; GFX1132-NEXT:    s_mov_b32 s7, 0x31016000
 ; GFX1132-NEXT:    s_mov_b32 s6, -1
 ; GFX1132-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_4)
-; GFX1132-NEXT:    v_mad_u64_u32 v[5:6], null, s1, v2, v[4:5]
+; GFX1132-NEXT:    v_mad_u64_u32 v[6:7], null, s1, v2, v[4:5]
 ; GFX1132-NEXT:    v_readfirstlane_b32 s1, v1
 ; GFX1132-NEXT:    v_sub_co_u32 v0, vcc_lo, s0, v3
 ; GFX1132-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132-NEXT:    v_mov_b32_e32 v1, v5
+; GFX1132-NEXT:    v_mov_b32_e32 v1, v6
 ; GFX1132-NEX...
[truncated]

@jayfoad
Contributor

jayfoad commented Sep 12, 2024

I don't think this has any benefit for AMDGPU. We use early-clobber to avoid some outputs overlapping with defined inputs, but it's OK for them to overlap with undef inputs. One specific example:

v_mad_u64_u32 v[9:10], null, v2, v3, v[8:9]

Here we generated a 64-bit integer multiply-add (v[9:10] = zext(v2) * zext(v3) + v[8:9]), but we're only using the low part of the result (v9 = v2 * v3 + v8). This means that v9 (the high part of the addend input v[8:9]) is an undef input, so it's OK for it to overlap with the output, even though the output is early-clobber.

@jayfoad
Contributor

jayfoad commented Sep 12, 2024

The AMDGPU usage is a very literal interpretation of what "early clobber" means, i.e. (some part of) the output might get written before (some parts of) the inputs are read.

For the architectural constraints in ARM and RISCV, is the output forbidden from overlapping with any input, or with one specific input? If the latter, maybe a better way to model it would be with something like Constraints = "$dst != $src2" in the tablegen definition of the instruction -- but that would require support in all register allocators, and I don't know how much effort anyone wants to put into better handling for this minor codegen issue.

@Stylie777
Contributor

@jayfoad I can only speak for Arm and AArch64, but there the implementation is meant to never reuse registers, unlike what is done in AMDGPU as you described.

I think it would be easier here to just ensure AMDGPU is disabled, by keeping the supportsInitUndef function and adding an implementation for AArch64. There do not seem to be any other architectures affected by this change.

@nikic
Contributor Author

nikic commented Sep 12, 2024

If this is undesirable for AMDGPU, I can restore the target hook. A caveat is that I think this is always required for inline asm (because we have no knowledge of whether it's safe or not in that case), but if needed we could explicitly check for that case separately from the rest.
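A hypothetical sketch of that split, with simplified stand-in types — nothing here is existing LLVM code; only the decision logic is modeled, not the actual fixup:

// Hypothetical sketch: inline asm is always fixed up, since the compiler
// cannot prove that overlapping an undef input with an early-clobber output
// is safe there; other instructions are fixed up only if the target asks.
struct MachineInstrStub {
  bool IsInlineAsm;        // register constraints come from user asm strings
  bool HasEarlyClobberDef; // target ISA may or may not require disjointness
};

bool needsInitUndefFixup(const MachineInstrStub &MI,
                         bool TargetRequiresDisjointEarlyClobberAndUndef) {
  if (MI.IsInlineAsm)
    return true;
  return MI.HasEarlyClobberDef && TargetRequiresDisjointEarlyClobberAndUndef;
}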

@BeMg
Contributor

BeMg commented Sep 12, 2024

> The AMDGPU usage is a very literal interpretation of what "early clobber" means, i.e. (some part of) the output might get written before (some parts of) the inputs are read.
>
> For the architectural constraints in ARM and RISCV, is the output forbidden from overlapping with any input, or with one specific input? If the latter, maybe a better way to model it would be with something like Constraints = "$dst != $src2" in the tablegen definition of the instruction -- but that would require support in all register allocators, and I don't know how much effort anyone wants to put into better handling for this minor codegen issue.

In RISC-V, we are trying to use InitUndef to resolve a case involving the vrgather.vv instruction, where the destination register cannot overlap any source register; otherwise, it is considered an illegal instruction.

@nikic
Contributor Author

nikic commented Sep 12, 2024

I've updated the PR to restore the hook and to additionally enable it only on AArch64. I've also renamed the hook to make it clearer when you are supposed to enable it.

Though I'm wondering if it wouldn't be better to enable it by default and only disable it on AMDGPU, as that is the conservatively correct behavior, and it doesn't seem like this affects register allocation/scheduling on other targets anyway.

@jayfoad
Contributor

jayfoad commented Sep 12, 2024

> > The AMDGPU usage is a very literal interpretation of what "early clobber" means, i.e. (some part of) the output might get written before (some parts of) the inputs are read.
> > For the architectural constraints in ARM and RISCV, is the output forbidden from overlapping with any input, or with one specific input? If the latter, maybe a better way to model it would be with something like Constraints = "$dst != $src2" in the tablegen definition of the instruction -- but that would require support in all register allocators, and I don't know how much effort anyone wants to put into better handling for this minor codegen issue.
>
> In RISC-V, we are trying to use InitUndef to resolve a case involving the vrgather.vv instruction, where the destination register cannot overlap any source register; otherwise, it is considered an illegal instruction.

Thanks. But I also see cases like this:

lib/Target/RISCV/RISCVInstrInfoV.td
1114-// Set earlyclobber for following instructions for second and mask operands.
1115:// This has the downside that the earlyclobber constraint is too coarse and
1116-// will impose unnecessary restrictions by not allowing the destination to
1117-// overlap with the first (wide) operand.

which sound like they could benefit from a != constraint.

@BeMg
Contributor

BeMg commented Sep 12, 2024

> Thanks. But I also see cases like this:
>
> lib/Target/RISCV/RISCVInstrInfoV.td
> 1114-// Set earlyclobber for following instructions for second and mask operands.
> 1115:// This has the downside that the earlyclobber constraint is too coarse and
> 1116-// will impose unnecessary restrictions by not allowing the destination to
> 1117-// overlap with the first (wide) operand.
>
> which sound like they could benefit from a != constraint.

You're right. For some RISC-V instructions, the early-clobber constraint we currently use is too strong. There are many instructions where the destination register could overlap part of the source register (dst != src:sub_1). With the != constraint, we could reduce register pressure, but this would only be beneficial in very compact situations.

I think that even if we implement partial early-clobber constraints, we would still need to use InitUndef for any operand that should be constrained but is marked as undefined.

@nikic nikic changed the title [InitUndef] Enable the InitUndef pass on all targets [InitUndef] Enable the InitUndef pass on AArch64 Sep 12, 2024
define dso_local i32 @test_stxp_undef(ptr %p, i64 %x) nounwind {
; CHECK-LABEL: test_stxp_undef:
; CHECK: // %bb.0:
-; CHECK-NEXT: stxp w8, x8, x1, [x0]
+; CHECK-NEXT: stxp w8, x9, x1, [x0]
Contributor


This could use an additional MIR test for something so specific

Contributor Author

@nikic nikic Sep 12, 2024


The pass itself already has MIR tests. I don't think adding another MIR test for this case would be useful, because we care about the end-to-end behavior (from the interaction of three different passes) here.

What would be quite nice is if the old code would emit an error, as it's hard to spot the issue otherwise. Currently it only generates an error when parsed back as inline assembly. Not sure whether there is a good way to share that validation, I'm not familiar with this part of the backend...

Collaborator


AArch64InstrInfo::verifyInstruction is the place for target-specific verification of an instruction... but currently it only gets called with -verify-machineinstrs. (It looks like AMDGPUAsmPrinter::emitInstruction calls verifyInstruction, but other targets currently don't do that.)
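As an illustration of the kind of check being discussed — this is not existing LLVM code; it assumes the usual MachineInstr APIs and the AArch64 STXPW/STXPX opcodes, and the helper name is made up:

// Illustrative sketch: reject an STXP whose status result was allocated to
// the same register as one of its inputs, mirroring the ISA constraint.
static bool checkStxpStatusDisjoint(const MachineInstr &MI,
                                    StringRef &ErrInfo) {
  if (MI.getOpcode() != AArch64::STXPW && MI.getOpcode() != AArch64::STXPX)
    return true;
  Register Status = MI.getOperand(0).getReg();
  for (const MachineOperand &MO : MI.uses())
    if (MO.isReg() && MO.getReg() == Status) {
      ErrInfo = "stxp status register overlaps an input operand";
      return false; // verifyInstruction convention: false means invalid
    }
  return true;
}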

Contributor


I only see 3 places using run-pass=init-undef, all in RISCV. You could just dump the MIR before init-undef, extract the one instruction, and make sure the pass isn't a no-op on AArch64

Contributor Author


Okay, if all we want to test is "the pass does something", then a MIR test makes sense. I've added one.

@arsenm
Contributor

arsenm commented Sep 13, 2024

> Though I'm wondering if it wouldn't be better to enable it by default and only disable it on AMDGPU, as that is the conservatively correct behavior, and it doesn't seem like this affects register allocation/scheduling on other targets anyway.

Yes, the default should always be conservatively correct. This type of stuff is very undiscoverable

@nikic
Contributor Author

nikic commented Sep 13, 2024

> > Though I'm wondering if it wouldn't be better to enable it by default and only disable it on AMDGPU, as that is the conservatively correct behavior, and it doesn't seem like this affects register allocation/scheduling on other targets anyway.
>
> Yes, the default should always be conservatively correct. This type of stuff is very undiscoverable

Done.

@nikic nikic changed the title [InitUndef] Enable the InitUndef pass on AArch64 [InitUndef] Enable the InitUndef pass on non-AMDGPU targets Sep 13, 2024
# RUN: llc -mtriple=aarch64-- -run-pass=init-undef -o - %s | FileCheck %s

--- |
define dso_local i32 @test_stxp_undef(ptr %p, i64 %x) #0 {
Contributor


Don't need the IR section

Comment on lines 15 to 19
registers:
- { id: 0, class: gpr64common }
- { id: 1, class: gpr64 }
- { id: 2, class: gpr32 }
- { id: 3, class: gpr64 }
Contributor


Suggested change
registers:
- { id: 0, class: gpr64common }
- { id: 1, class: gpr64 }
- { id: 2, class: gpr32 }
- { id: 3, class: gpr64 }

- { reg: '$x0', virtual-reg: '%0' }
- { reg: '$x1', virtual-reg: '%1' }
body: |
bb.0 (%ir-block.0):
Contributor


Suggested change
bb.0 (%ir-block.0):
bb.0:

%1:gpr64 = COPY $x1
%0:gpr64common = COPY $x0
%3:gpr64 = IMPLICIT_DEF
early-clobber %2:gpr32 = STXPX killed %3, %1, %0 :: (volatile store (s128) into %ir.p)
Contributor


Suggested change
early-clobber %2:gpr32 = STXPX killed %3, %1, %0 :: (volatile store (s128) into %ir.p)
early-clobber %2:gpr32 = STXPX killed %3, %1, %0 :: (volatile store (s128))

@nikic nikic merged commit dfa5429 into llvm:main Sep 16, 2024
8 checks passed
@nikic nikic deleted the init-undef-general branch September 16, 2024 07:48