
[AMDGPU] Codegen support for constrained multi-dword sloads #96163


Merged
8 commits merged on Jul 23, 2024

Conversation

cdevadas
Collaborator

For targets that support the XNACK replay feature (gfx8+),
multi-dword scalar loads must not clobber any register that
holds the source address. The constrained versions of these
scalar loads carry the early-clobber flag on the destination
operand, which prevents the register allocator from reusing
any of the source registers for the destination.

Collaborator Author

cdevadas commented Jun 20, 2024

Member

llvmbot commented Jun 20, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Christudasan Devadasan (cdevadas)

Changes

For targets that support the XNACK replay feature (gfx8+),
multi-dword scalar loads must not clobber any register that
holds the source address. The constrained versions of these
scalar loads carry the early-clobber flag on the destination
operand, which prevents the register allocator from reusing
any of the source registers for the destination.
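To illustrate the hazard (the register choices below are hypothetical, not taken from the patch): under XNACK, a scalar load that faults can be replayed by hardware, so the source address registers must still hold the address when the replay happens. If the destination overlaps the source, the first (partial) execution can overwrite the address before the replay:

```asm
; Hypothetical sketch of the hazard, not output from this patch:
s_load_dwordx2 s[0:1], s[0:1], 0x0  ; dst overlaps the src address; on an
                                    ; XNACK replay the load would re-read
                                    ; from a clobbered address pair
s_load_dwordx2 s[2:3], s[0:1], 0x0  ; the early-clobber constrained form
                                    ; forces RA to pick a dst disjoint
                                    ; from the s[0:1] address
```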


Patch is 7.42 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96163.diff

265 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SMInstructions.td (+99-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/bool-legalization.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+122-122)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll (+117-117)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-fract.f64.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir (+36-36)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/lds-zero-initializer.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.scale.ll (+215-190)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+75-128)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mov.dpp.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.set.inactive.ll (+90-90)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.update.dpp.ll (+64-64)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant.96.ll (+15-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+33-33)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+82-82)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll (+139-139)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll (+263-264)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/widen-i8-i16-scalar-loads.ll (+146-146)
  • (modified) llvm/test/CodeGen/AMDGPU/add.ll (+273-272)
  • (modified) llvm/test/CodeGen/AMDGPU/add.v2i16.ll (+134-134)
  • (modified) llvm/test/CodeGen/AMDGPU/amd.endpgm.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn-load-offset-from-reg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+614-611)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/anyext.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+271-253)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+1000-1005)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+1095-1060)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+236-220)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+272-254)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics_cond_sub.ll (+28-28)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-combine.ll (+28-28)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+36-36)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_int.ll (+68-68)
  • (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+70-70)
  • (modified) llvm/test/CodeGen/AMDGPU/br_cc.f16.ll (+37-37)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/bswap.ll (+78-78)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+22-22)
  • (modified) llvm/test/CodeGen/AMDGPU/calling-conventions.ll (+162-162)
  • (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (+426-424)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp-modifier.ll (+209-209)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp.ll (+667-667)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll (+127-130)
  • (modified) llvm/test/CodeGen/AMDGPU/copy_to_scc.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+43-43)
  • (modified) llvm/test/CodeGen/AMDGPU/ctpop16.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/ctpop64.ll (+38-38)
  • (modified) llvm/test/CodeGen/AMDGPU/cttz.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+154-151)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+41-41)
  • (modified) llvm/test/CodeGen/AMDGPU/ds-alignment.ll (+135-135)
  • (modified) llvm/test/CodeGen/AMDGPU/ds-combine-with-dependence.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_write2.ll (+23-23)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_dynelt.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-f16.ll (+60-62)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fadd.f16.ll (+211-211)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll (+361-361)
  • (modified) llvm/test/CodeGen/AMDGPU/fcmp.f16.ll (+319-319)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+208-207)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+178-178)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f64.ll (+155-155)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.f16.ll (+199-199)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+55-52)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll (+92-92)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics.ll (+168-168)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+60-60)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64.ll (+1117-1117)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+162-162)
  • (modified) llvm/test/CodeGen/AMDGPU/fma-combine.ll (+81-81)
  • (modified) llvm/test/CodeGen/AMDGPU/fmax3.ll (+224-224)
  • (modified) llvm/test/CodeGen/AMDGPU/fmax_legacy.f64.ll (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/fmaximum.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin3.ll (+332-332)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.f64.ll (+40-40)
  • (modified) llvm/test/CodeGen/AMDGPU/fminimum.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul.f16.ll (+171-171)
  • (modified) llvm/test/CodeGen/AMDGPU/fmuladd.f16.ll (+376-376)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+45-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.f64.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+137-137)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx940.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-classify.ll (+141-141)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-buffer-atomics.ll (+50-50)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-buffer-ptr-atomics.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+131-131)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-min-max-buffer-atomics.ll (+28-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-min-max-buffer-ptr-atomics.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+63-63)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+52-52)
  • (modified) llvm/test/CodeGen/AMDGPU/fpext.f16.ll (+289-297)
  • (modified) llvm/test/CodeGen/AMDGPU/fptosi.f16.ll (+133-133)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoui.f16.ll (+133-133)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll (+486-486)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.ll (+406-406)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+98-98)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+60-60)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/fsub.f16.ll (+126-126)
  • (modified) llvm/test/CodeGen/AMDGPU/fused-bitlogic.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp-wrong-subtarget.ll (+6-5)
  • (modified) llvm/test/CodeGen/AMDGPU/global-i16-load-store.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+325-325)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+90-90)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64.ll (+1323-1323)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+282-282)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1022-1022)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+594-594)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+594-594)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+806-806)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+257-250)
  • (modified) llvm/test/CodeGen/AMDGPU/idot2.ll (+193-174)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (+312-295)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+450-415)
  • (modified) llvm/test/CodeGen/AMDGPU/idot8s.ll (+128-130)
  • (modified) llvm/test/CodeGen/AMDGPU/idot8u.ll (+130-129)
  • (modified) llvm/test/CodeGen/AMDGPU/imm.ll (+321-321)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-term.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+222-220)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+48-48)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+282-282)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-atomic-fmin-fmax.ll (+190-190)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-zero-initializer.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.pkrtz.ll (+62-62)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fcmp.w32.ll (+536-536)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fcmp.w64.ll (+930-930)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll (+19-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.tr-w32.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.tr-w64.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.icmp.w32.ll (+368-368)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.icmp.w64.ll (+634-634)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+36-30)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane.ll (+112-112)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane16.var.ll (+96-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.store.d16.ll (+50-50)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.d16.ll (+75-75)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+155-155)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+155-155)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.wait.ll (+167-167)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.rtn.ll (+48-48)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.set.inactive.ll (+107-107)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll (+298-298)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.ceil.f16.ll (+60-60)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.cos.f16.ll (+28-28)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+107-102)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+107-102)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+21-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.floor.f16.ll (+60-60)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll (+268-268)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.get.fpmode.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.is.fpclass.bf16.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.is.fpclass.f16.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.is.fpclass.ll (+41-49)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+62-62)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+62-62)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+78-78)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll (+262-262)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll (+260-260)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.mulo.ll (+212-212)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.r600.read.local.size.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.rint.f16.ll (+50-50)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.round.ll (+175-251)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sin.f16.ll (+28-28)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sqrt.f16.ll (+40-40)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.trunc.f16.ll (+40-40)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-f64.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+1747-1747)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i16.ll (+793-792)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (+206-207)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i64.ll (+44-44)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (+1267-1266)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i16.ll (+291-291)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i32.ll (+224-224)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+39-39)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-lds-struct-aa.ll (+5-4)
  • (modified) llvm/test/CodeGen/AMDGPU/lshl-add-u64.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (+70-70)
  • (modified) llvm/test/CodeGen/AMDGPU/mad.u16.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/madak.ll (+38-38)
  • (modified) llvm/test/CodeGen/AMDGPU/max-hard-clause-length.ll (+210-210)
  • (modified) llvm/test/CodeGen/AMDGPU/max.i16.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/memory_clause.ll (+13-12)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+66-66)
  • (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-addsubu64.ll (+17-17)
  • (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-pseudo-scalar-trans.ll (+30-30)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+757-753)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_int24.ll (+47-47)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+73-73)
  • (modified) llvm/test/CodeGen/AMDGPU/offset-split-flat.ll (+445-445)
  • (modified) llvm/test/CodeGen/AMDGPU/offset-split-global.ll (+361-361)
  • (modified) llvm/test/CodeGen/AMDGPU/omod.ll (+44-44)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-compare.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+135-135)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-op-sel.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/post-ra-soft-clause-dbg-info.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-vect3-load.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/ptr-buffer-alias-scheduling.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/rcp-pattern.ll (+61-61)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/saddo.ll (+129-129)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar_to_vector.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv.ll (+139-139)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+185-185)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+315-315)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.ll (+193-193)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (+70-70)
  • (modified) llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll (+682-682)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sign_extend.ll (+92-92)
  • (modified) llvm/test/CodeGen/AMDGPU/simple-indirect-call.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/sint_to_fp.i64.ll (+177-177)
  • (modified) llvm/test/CodeGen/AMDGPU/sitofp.f16.ll (+91-91)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll (+2585-2585)
  • (modified) llvm/test/CodeGen/AMDGPU/sra.ll (+137-137)
  • (modified) llvm/test/CodeGen/AMDGPU/srl.ll (+35-35)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+118-118)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+152-152)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/trunc-combine.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/twoaddr-constrain.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/uaddo.ll (+89-89)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+115-115)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+74-72)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.i64.ll (+142-142)
  • (modified) llvm/test/CodeGen/AMDGPU/uitofp.f16.ll (+91-91)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+75-75)
  • (modified) llvm/test/CodeGen/AMDGPU/usubo.ll (+89-89)
  • (modified) llvm/test/CodeGen/AMDGPU/v_add_u64_pseudo_sdwa.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/v_cndmask.ll (+125-125)
  • (modified) llvm/test/CodeGen/AMDGPU/v_madak_f16.ll (+29-27)
  • (modified) llvm/test/CodeGen/AMDGPU/v_pack.ll (+41-41)
  • (modified) llvm/test/CodeGen/AMDGPU/v_sat_pk_u8_i16.ll (+45-45)
  • (modified) llvm/test/CodeGen/AMDGPU/v_sub_u64_pseudo_sdwa.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+142-140)
  • (modified) llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll (+45-45)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+87-87)
  • (modified) llvm/test/CodeGen/AMDGPU/zero_extend.ll (+1-1)
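The core of the change is a pair of alignment predicates in SMInstructions.td (see the diff below): a load is "naturally aligned" when it is at most one dword, or its alignment reaches the next power of two of its size. As a rough Python sketch of that predicate logic (function names here are illustrative, not LLVM APIs):

```python
def power_of_2_ceil(n: int) -> int:
    # Smallest power of two >= n, mirroring llvm::PowerOf2Ceil.
    p = 1
    while p < n:
        p *= 2
    return p

def is_naturally_aligned_smrd(size_bytes: int, align_bytes: int) -> bool:
    # Single-dword loads (<= 4 bytes) never need the constrained
    # (_ec, early-clobber) form; larger loads do unless the alignment
    # covers the whole access.
    return size_bytes <= 4 or align_bytes >= power_of_2_ceil(size_bytes)
```

Under this predicate an 8-byte load with 4-byte alignment selects the `_ec` pseudo, while the same load with 8-byte alignment selects the ordinary one.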
diff --git a/llvm/lib/Target/AMDGPU/SMInstructions.td b/llvm/lib/Target/AMDGPU/SMInstructions.td
index 4551a3a615b15..9fbedce554a53 100644
--- a/llvm/lib/Target/AMDGPU/SMInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SMInstructions.td
@@ -867,13 +867,104 @@ def SMRDBufferImm   : ComplexPattern<iPTR, 1, "SelectSMRDBufferImm">;
 def SMRDBufferImm32 : ComplexPattern<iPTR, 1, "SelectSMRDBufferImm32">;
 def SMRDBufferSgprImm : ComplexPattern<iPTR, 2, "SelectSMRDBufferSgprImm">;
 
+class SMRDAlignedLoadPat<PatFrag Op> : PatFrag <(ops node:$ptr), (Op node:$ptr), [{
+  // Returns true if it is a naturally aligned multi-dword load.
+  LoadSDNode *Ld = cast<LoadSDNode>(N);
+  unsigned Size = Ld->getMemoryVT().getStoreSize();
+  return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size));
+}]> {
+  let GISelPredicateCode = [{
+    auto &Ld = cast<GLoad>(MI);
+    TypeSize Size = Ld.getMMO().getSize().getValue();
+    return (Size <= 4) || (Ld.getMMO().getAlign().value() >= PowerOf2Ceil(Size));
+  }];
+}
+
+class SMRDUnalignedLoadPat<PatFrag Op> : PatFrag <(ops node:$ptr), (Op node:$ptr), [{
+  // Returns true if it is an under aligned multi-dword load.
+  LoadSDNode *Ld = cast<LoadSDNode>(N);
+  unsigned Size = Ld->getMemoryVT().getStoreSize();
+  return (Size > 4) && (Ld->getAlign().value() < PowerOf2Ceil(Size));
+}]> {
+  let GISelPredicateCode = [{
+    auto &Ld = cast<GLoad>(MI);
+    TypeSize Size = Ld.getMMO().getSize().getValue();
+    return (Size > 4) && (Ld.getMMO().getAlign().value() < PowerOf2Ceil(Size));
+  }];
+}
+
+def alignedmultidwordload : SMRDAlignedLoadPat<smrd_load>;
+def unalignedmultidwordload : SMRDUnalignedLoadPat<smrd_load>;
+
+multiclass SMRD_Align_Pattern <string Instr, ValueType vt> {
+
+  // 1. IMM offset
+  def : GCNPat <
+    (alignedmultidwordload (SMRDImm i64:$sbase, i32:$offset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_IMM") $sbase, $offset, 0))> {
+    let OtherPredicates = [isGFX8Plus];
+  }
+  def : GCNPat <
+    (unalignedmultidwordload (SMRDImm i64:$sbase, i32:$offset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_IMM_ec") $sbase, $offset, 0))> {
+    let OtherPredicates = [isGFX8Plus];
+  }
+
+  // 2. SGPR offset
+  def : GCNPat <
+    (alignedmultidwordload (SMRDSgpr i64:$sbase, i32:$soffset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_SGPR") $sbase, $soffset, 0))> {
+    let OtherPredicates = [isGFX8Only];
+  }
+  def : GCNPat <
+    (unalignedmultidwordload (SMRDSgpr i64:$sbase, i32:$soffset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_ec") $sbase, $soffset, 0))> {
+    let OtherPredicates = [isGFX8Only];
+  }
+  def : GCNPat <
+    (alignedmultidwordload (SMRDSgpr i64:$sbase, i32:$soffset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_IMM") $sbase, $soffset, 0, 0))> {
+    let OtherPredicates = [isGFX9Plus];
+  }
+  def : GCNPat <
+    (unalignedmultidwordload (SMRDSgpr i64:$sbase, i32:$soffset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_IMM_ec") $sbase, $soffset, 0, 0))> {
+    let OtherPredicates = [isGFX9Plus];
+  }
+
+  // 3. SGPR+IMM offset
+  def : GCNPat <
+    (alignedmultidwordload (SMRDSgprImm i64:$sbase, i32:$soffset, i32:$offset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_IMM") $sbase, $soffset, $offset, 0))> {
+    let OtherPredicates = [isGFX9Plus];
+  }
+  def : GCNPat <
+    (unalignedmultidwordload (SMRDSgprImm i64:$sbase, i32:$soffset, i32:$offset)),
+    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_IMM_ec") $sbase, $soffset, $offset, 0))> {
+    let OtherPredicates = [isGFX9Plus];
+  }
+
+  // 4. No offset
+  def : GCNPat <
+    (vt (alignedmultidwordload (i64 SReg_64:$sbase))),
+    (vt (!cast<SM_Pseudo>(Instr#"_IMM") i64:$sbase, 0, 0))> {
+    let OtherPredicates = [isGFX8Plus];
+  }
+  def : GCNPat <
+    (vt (unalignedmultidwordload (i64 SReg_64:$sbase))),
+    (vt (!cast<SM_Pseudo>(Instr#"_IMM_ec") i64:$sbase, 0, 0))> {
+    let OtherPredicates = [isGFX8Plus];
+  }
+}
+
 multiclass SMRD_Pattern <string Instr, ValueType vt, bit immci = true> {
 
   // 1. IMM offset
   def : GCNPat <
     (smrd_load (SMRDImm i64:$sbase, i32:$offset)),
-    (vt (!cast<SM_Pseudo>(Instr#"_IMM") $sbase, $offset, 0))
-  >;
+    (vt (!cast<SM_Pseudo>(Instr#"_IMM") $sbase, $offset, 0))> {
+    let OtherPredicates = [isGFX6GFX7];
+  }
 
   // 2. 32-bit IMM offset on CI
   if immci then def : GCNPat <
@@ -886,26 +977,17 @@ multiclass SMRD_Pattern <string Instr, ValueType vt, bit immci = true> {
   def : GCNPat <
     (smrd_load (SMRDSgpr i64:$sbase, i32:$soffset)),
     (vt (!cast<SM_Pseudo>(Instr#"_SGPR") $sbase, $soffset, 0))> {
-    let OtherPredicates = [isNotGFX9Plus];
-  }
-  def : GCNPat <
-    (smrd_load (SMRDSgpr i64:$sbase, i32:$soffset)),
-    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_IMM") $sbase, $soffset, 0, 0))> {
-    let OtherPredicates = [isGFX9Plus];
+    let OtherPredicates = [isGFX6GFX7];
   }
 
-  // 4. SGPR+IMM offset
+  // 4. No offset
   def : GCNPat <
-    (smrd_load (SMRDSgprImm i64:$sbase, i32:$soffset, i32:$offset)),
-    (vt (!cast<SM_Pseudo>(Instr#"_SGPR_IMM") $sbase, $soffset, $offset, 0))> {
-    let OtherPredicates = [isGFX9Plus];
+    (vt (smrd_load (i64 SReg_64:$sbase))),
+    (vt (!cast<SM_Pseudo>(Instr#"_IMM") i64:$sbase, 0, 0))> {
+    let OtherPredicates = [isGFX6GFX7];
   }
 
-  // 5. No offset
-  def : GCNPat <
-    (vt (smrd_load (i64 SReg_64:$sbase))),
-    (vt (!cast<SM_Pseudo>(Instr#"_IMM") i64:$sbase, 0, 0))
-  >;
+  defm : SMRD_Align_Pattern<Instr, vt>;
 }
 
 multiclass SMLoad_Pattern <string Instr, ValueType vt, bit immci = true> {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll
index a38b6e3263882..9a8672dba5357 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll
@@ -7,11 +7,11 @@ define amdgpu_kernel void @s_add_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX11:       ; %bb.0: ; %entry
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    s_load_b128 s[4:7], s[0:1], 0x24
-; GFX11-NEXT:    s_load_b64 s[0:1], s[0:1], 0x34
+; GFX11-NEXT:    s_load_b64 s[2:3], s[0:1], 0x34
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    s_add_u32 s0, s6, s0
-; GFX11-NEXT:    s_addc_u32 s1, s7, s1
+; GFX11-NEXT:    s_add_u32 s0, s6, s2
+; GFX11-NEXT:    s_addc_u32 s1, s7, s3
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
@@ -23,10 +23,10 @@ define amdgpu_kernel void @s_add_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX12:       ; %bb.0: ; %entry
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    s_load_b128 s[4:7], s[0:1], 0x24
-; GFX12-NEXT:    s_load_b64 s[0:1], s[0:1], 0x34
+; GFX12-NEXT:    s_load_b64 s[2:3], s[0:1], 0x34
 ; GFX12-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    s_add_nc_u64 s[0:1], s[6:7], s[0:1]
+; GFX12-NEXT:    s_add_nc_u64 s[0:1], s[6:7], s[2:3]
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX12-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
@@ -59,11 +59,11 @@ define amdgpu_kernel void @s_sub_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX11:       ; %bb.0: ; %entry
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    s_load_b128 s[4:7], s[0:1], 0x24
-; GFX11-NEXT:    s_load_b64 s[0:1], s[0:1], 0x34
+; GFX11-NEXT:    s_load_b64 s[2:3], s[0:1], 0x34
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    s_sub_u32 s0, s6, s0
-; GFX11-NEXT:    s_subb_u32 s1, s7, s1
+; GFX11-NEXT:    s_sub_u32 s0, s6, s2
+; GFX11-NEXT:    s_subb_u32 s1, s7, s3
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
@@ -75,10 +75,10 @@ define amdgpu_kernel void @s_sub_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX12:       ; %bb.0: ; %entry
 ; GFX12-NEXT:    s_clause 0x1
 ; GFX12-NEXT:    s_load_b128 s[4:7], s[0:1], 0x24
-; GFX12-NEXT:    s_load_b64 s[0:1], s[0:1], 0x34
+; GFX12-NEXT:    s_load_b64 s[2:3], s[0:1], 0x34
 ; GFX12-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
-; GFX12-NEXT:    s_sub_nc_u64 s[0:1], s[6:7], s[0:1]
+; GFX12-NEXT:    s_sub_nc_u64 s[0:1], s[6:7], s[2:3]
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX12-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/bool-legalization.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/bool-legalization.ll
index bb5ccc3657dc4..57a8bbbb7d185 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/bool-legalization.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/bool-legalization.ll
@@ -113,9 +113,9 @@ bb1:
 define amdgpu_kernel void @brcond_sgpr_trunc_and(i32 %cond0, i32 %cond1) {
 ; WAVE64-LABEL: brcond_sgpr_trunc_and:
 ; WAVE64:       ; %bb.0: ; %entry
-; WAVE64-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; WAVE64-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
 ; WAVE64-NEXT:    s_waitcnt lgkmcnt(0)
-; WAVE64-NEXT:    s_and_b32 s0, s0, s1
+; WAVE64-NEXT:    s_and_b32 s0, s2, s3
 ; WAVE64-NEXT:    s_xor_b32 s0, s0, 1
 ; WAVE64-NEXT:    s_and_b32 s0, s0, 1
 ; WAVE64-NEXT:    s_cmp_lg_u32 s0, 0
@@ -131,9 +131,9 @@ define amdgpu_kernel void @brcond_sgpr_trunc_and(i32 %cond0, i32 %cond1) {
 ;
 ; WAVE32-LABEL: brcond_sgpr_trunc_and:
 ; WAVE32:       ; %bb.0: ; %entry
-; WAVE32-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; WAVE32-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
 ; WAVE32-NEXT:    s_waitcnt lgkmcnt(0)
-; WAVE32-NEXT:    s_and_b32 s0, s0, s1
+; WAVE32-NEXT:    s_and_b32 s0, s2, s3
 ; WAVE32-NEXT:    s_xor_b32 s0, s0, 1
 ; WAVE32-NEXT:    s_and_b32 s0, s0, 1
 ; WAVE32-NEXT:    s_cmp_lg_u32 s0, 0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
index 3f034eaca4997..9cabe0c0ae9de 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
@@ -1400,11 +1400,11 @@ define amdgpu_kernel void @cvt_ubyte0_or_multiuse(ptr addrspace(1) %in, ptr addr
 ;
 ; VI-LABEL: cvt_ubyte0_or_multiuse:
 ; VI:       ; %bb.0: ; %bb
-; VI-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; VI-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; VI-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
-; VI-NEXT:    v_mov_b32_e32 v0, s0
-; VI-NEXT:    v_mov_b32_e32 v1, s1
+; VI-NEXT:    v_mov_b32_e32 v0, s4
+; VI-NEXT:    v_mov_b32_e32 v1, s5
 ; VI-NEXT:    v_add_u32_e32 v0, vcc, v0, v2
 ; VI-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
 ; VI-NEXT:    flat_load_dword v0, v[0:1]
@@ -1412,8 +1412,8 @@ define amdgpu_kernel void @cvt_ubyte0_or_multiuse(ptr addrspace(1) %in, ptr addr
 ; VI-NEXT:    v_or_b32_e32 v0, 0x80000001, v0
 ; VI-NEXT:    v_cvt_f32_ubyte0_sdwa v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0
 ; VI-NEXT:    v_add_f32_e32 v2, v0, v1
-; VI-NEXT:    v_mov_b32_e32 v0, s2
-; VI-NEXT:    v_mov_b32_e32 v1, s3
+; VI-NEXT:    v_mov_b32_e32 v0, s6
+; VI-NEXT:    v_mov_b32_e32 v1, s7
 ; VI-NEXT:    flat_store_dword v[0:1], v2
 ; VI-NEXT:    s_endpgm
 bb:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll
index a018ea5bf18f1..ce0d9c3c5365e 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll
@@ -27,10 +27,10 @@ define amdgpu_kernel void @flat_atomic_fadd_f32_noret(ptr %ptr, float %data) {
 define amdgpu_kernel void @flat_atomic_fadd_f32_noret_pat(ptr %ptr) {
 ; GFX940-LABEL: flat_atomic_fadd_f32_noret_pat:
 ; GFX940:       ; %bb.0:
-; GFX940-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
 ; GFX940-NEXT:    v_mov_b32_e32 v2, 4.0
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[0:1]
+; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[2:3]
 ; GFX940-NEXT:    buffer_wbl2 sc0 sc1
 ; GFX940-NEXT:    flat_atomic_add_f32 v[0:1], v2 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -43,10 +43,10 @@ define amdgpu_kernel void @flat_atomic_fadd_f32_noret_pat(ptr %ptr) {
 define amdgpu_kernel void @flat_atomic_fadd_f32_noret_pat_ieee(ptr %ptr) #0 {
 ; GFX940-LABEL: flat_atomic_fadd_f32_noret_pat_ieee:
 ; GFX940:       ; %bb.0:
-; GFX940-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
 ; GFX940-NEXT:    v_mov_b32_e32 v2, 4.0
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[0:1]
+; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[2:3]
 ; GFX940-NEXT:    buffer_wbl2 sc0 sc1
 ; GFX940-NEXT:    flat_atomic_add_f32 v[0:1], v2 sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
index 4e94a646f6da5..081e25708c067 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
@@ -1021,20 +1021,20 @@ main_body:
 define amdgpu_kernel void @global_atomic_fadd_f64_noret(ptr addrspace(1) %ptr, double %data) {
 ; GFX90A-LABEL: global_atomic_fadd_f64_noret:
 ; GFX90A:       ; %bb.0: ; %main_body
-; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; GFX90A-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
-; GFX90A-NEXT:    global_atomic_add_f64 v2, v[0:1], s[0:1]
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], s[6:7], s[6:7] op_sel:[0,1]
+; GFX90A-NEXT:    global_atomic_add_f64 v2, v[0:1], s[4:5]
 ; GFX90A-NEXT:    s_endpgm
 ;
 ; GFX940-LABEL: global_atomic_fadd_f64_noret:
 ; GFX940:       ; %bb.0: ; %main_body
-; GFX940-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; GFX940-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[2:3]
-; GFX940-NEXT:    global_atomic_add_f64 v2, v[0:1], s[0:1]
+; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[6:7]
+; GFX940-NEXT:    global_atomic_add_f64 v2, v[0:1], s[4:5]
 ; GFX940-NEXT:    s_endpgm
 main_body:
   %ret = call double @llvm.amdgcn.global.atomic.fadd.f64.p1.f64(ptr addrspace(1) %ptr, double %data)
@@ -1044,20 +1044,20 @@ main_body:
 define amdgpu_kernel void @global_atomic_fmin_f64_noret(ptr addrspace(1) %ptr, double %data) {
 ; GFX90A-LABEL: global_atomic_fmin_f64_noret:
 ; GFX90A:       ; %bb.0: ; %main_body
-; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; GFX90A-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
-; GFX90A-NEXT:    global_atomic_min_f64 v2, v[0:1], s[0:1]
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], s[6:7], s[6:7] op_sel:[0,1]
+; GFX90A-NEXT:    global_atomic_min_f64 v2, v[0:1], s[4:5]
 ; GFX90A-NEXT:    s_endpgm
 ;
 ; GFX940-LABEL: global_atomic_fmin_f64_noret:
 ; GFX940:       ; %bb.0: ; %main_body
-; GFX940-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; GFX940-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[2:3]
-; GFX940-NEXT:    global_atomic_min_f64 v2, v[0:1], s[0:1]
+; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[6:7]
+; GFX940-NEXT:    global_atomic_min_f64 v2, v[0:1], s[4:5]
 ; GFX940-NEXT:    s_endpgm
 main_body:
   %ret = call double @llvm.amdgcn.global.atomic.fmin.f64.p1.f64(ptr addrspace(1) %ptr, double %data)
@@ -1067,20 +1067,20 @@ main_body:
 define amdgpu_kernel void @global_atomic_fmax_f64_noret(ptr addrspace(1) %ptr, double %data) {
 ; GFX90A-LABEL: global_atomic_fmax_f64_noret:
 ; GFX90A:       ; %bb.0: ; %main_body
-; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; GFX90A-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
-; GFX90A-NEXT:    global_atomic_max_f64 v2, v[0:1], s[0:1]
+; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], s[6:7], s[6:7] op_sel:[0,1]
+; GFX90A-NEXT:    global_atomic_max_f64 v2, v[0:1], s[4:5]
 ; GFX90A-NEXT:    s_endpgm
 ;
 ; GFX940-LABEL: global_atomic_fmax_f64_noret:
 ; GFX940:       ; %bb.0: ; %main_body
-; GFX940-NEXT:    s_load_dwordx4 s[0:3], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
 ; GFX940-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[2:3]
-; GFX940-NEXT:    global_atomic_max_f64 v2, v[0:1], s[0:1]
+; GFX940-NEXT:    v_mov_b64_e32 v[0:1], s[6:7]
+; GFX940-NEXT:    global_atomic_max_f64 v2, v[0:1], s[4:5]
 ; GFX940-NEXT:    s_endpgm
 main_body:
   %ret = call double @llvm.amdgcn.global.atomic.fmax.f64.p1.f64(ptr addrspace(1) %ptr, double %data)
@@ -1090,21 +1090,21 @@ main_body:
 define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(ptr addrspace(1) %ptr) #1 {
 ; GFX90A-LABEL: global_atomic_fadd_f64_noret_pat:
 ; GFX90A:       ; %bb.0: ; %main_body
-; GFX90A-NEXT:    s_mov_b64 s[2:3], exec
-; GFX90A-NEXT:    s_mov_b32 s4, s3
-; GFX90A-NEXT:    v_mbcnt_lo_u32_b32 v0, s2, 0
-; GFX90A-NEXT:    v_mbcnt_hi_u32_b32 v0, s4, v0
+; GFX90A-NEXT:    s_mov_b64 s[4:5], exec
+; GFX90A-NEXT:    s_mov_b32 s2, s5
+; GFX90A-NEXT:    v_mbcnt_lo_u32_b32 v0, s4, 0
+; GFX90A-NEXT:    v_mbcnt_hi_u32_b32 v0, s2, v0
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
-; GFX90A-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX90A-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX90A-NEXT:    s_cbranch_execz .LBB39_3
 ; GFX90A-NEXT:  ; %bb.1:
-; GFX90A-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX90A-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
-; GFX90A-NEXT:    v_cvt_f64_u32_e32 v[0:1], s2
+; GFX90A-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24
+; GFX90A-NEXT:    s_bcnt1_i32_b64 s0, s[4:5]
+; GFX90A-NEXT:    v_cvt_f64_u32_e32 v[0:1], s0
 ; GFX90A-NEXT:    v_mul_f64 v[4:5], v[0:1], 4.0
-; GFX90A-NEXT:    s_mov_b64 s[2:3], 0
+; GFX90A-NEXT:    s_mov_b64 s[0:1], 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x0
+; GFX90A-NEXT:    s_load_dwordx2 s[4:5], s[2:3], 0x0
 ; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], s[4:5], s[4:5] op_sel:[0,1]
@@ -1112,14 +1112,14 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(ptr addrspace(1) %pt
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], v[4:5]
 ; GFX90A-NEXT:    buffer_wbl2
-; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[0:1], v6, v[0:3], s[0:1] glc
+; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[0:1], v6, v[0:3], s[2:3] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
-; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
+; GFX90A-NEXT:    s_or_b64 s[0:1], vcc, s[0:1]
 ; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], v[0:1], v[0:1] op_sel:[0,1]
-; GFX90A-NEXT:    s_andn2_b64 exec, exec, s[2:3]
+; GFX90A-NEXT:    s_andn2_b64 exec, exec, s[0:1]
 ; GFX90A-NEXT:    s_cbranch_execnz .LBB39_2
 ; GFX90A-NEXT:  .LBB39_3:
 ; GFX90A-NEXT:    s_endpgm
@@ -1134,14 +1134,14 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(ptr addrspace(1) %pt
 ; GFX940-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX940-NEXT:    s_cbranch_execz .LBB39_2
 ; GFX940-NEXT:  ; %bb.1:
-; GFX940-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
-; GFX940-NEXT:    s_bcnt1_i32_b64 s2, s[2:3]
-; GFX940-NEXT:    v_cvt_f64_u32_e32 v[0:1], s2
+; GFX940-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x24
+; GFX940-NEXT:    s_bcnt1_i32_b64 s0, s[2:3]
+; GFX940-NEXT:    v_cvt_f64_u32_e32 v[0:1], s0
 ; GFX940-NEXT:    v_mul_f64 v[0:1], v[0:1], 4.0
 ; GFX940-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX940-NEXT:    buffer_wbl2 sc0 sc1
 ; GFX940-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX940-NEXT:    global_atomic_add_f64 v2, v[0:1], s[0:1] sc1
+; GFX940-NEXT:    global_atomic_add_f64 v2, v[0:1], s[4:5] sc1
 ; GFX940-NEXT:    s_waitcnt vmcnt(0)
 ; G...
[truncated]

arsenm (Contributor) commented:
In a separate patch, we should add a verifier check that you used the correct tied version depending on whether xnack is enabled or not

arsenm (Contributor) commented:

Another strategy that might simplify the patterns is to always select the _ec versions, and then later swap to the non-ec versions if xnack is disabled

@cdevadas cdevadas force-pushed the users/cdevadas/ldstopt-constrained-sloads branch from 65eb443 to 26e0864 Compare July 1, 2024 06:00
@cdevadas cdevadas force-pushed the users/cdevadas/enable-codegen-for-constrained-sloads branch 2 times, most recently from f79b902 to 65ccf6d Compare July 1, 2024 06:27
@cdevadas cdevadas force-pushed the users/cdevadas/ldstopt-constrained-sloads branch from 786a670 to e7e6cbc Compare July 3, 2024 13:16
@cdevadas cdevadas force-pushed the users/cdevadas/enable-codegen-for-constrained-sloads branch 3 times, most recently from 77efd76 to c31f853 Compare July 8, 2024 07:56
cdevadas (Collaborator, Author) commented:
Ping

cdevadas (Collaborator, Author) commented:
Ping

jayfoad (Contributor) commented:
LGTM.

cdevadas (Collaborator, Author) commented:
The latest patch further optimizes the PatFrag and the patterns by using OtherPredicates. The lit test changes in the latest patch stem from a missed optimization I had incorrectly introduced earlier in this PR for GFX7; it is now fixed, and the output matches the default behavior of the current compiler.
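As a rough illustration of the predicate split described above (the instruction, pattern, and predicate names here are hypothetical placeholders, not the actual SMInstructions.td definitions from this PR), guarding the constrained and unconstrained selections on an XNACK predicate via OtherPredicates might look like:

```tablegen
// Sketch only: pick the early-clobber ("_ec") load variant when XNACK
// replay is possible, and the unconstrained variant otherwise.
let OtherPredicates = [HasXNACKEnabled] in
def : GCNPat <(smrd_load (SMRDImm i64:$sbase, i32:$offset)),
              (S_LOAD_DWORDX2_IMM_ec $sbase, $offset)>;

let OtherPredicates = [NotHasXNACKEnabled] in
def : GCNPat <(smrd_load (SMRDImm i64:$sbase, i32:$offset)),
              (S_LOAD_DWORDX2_IMM $sbase, $offset)>;
```

Keying both pattern sets on a single predicate pair keeps the PatFrags shared and avoids duplicating the address-matching logic per variant.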

@cdevadas cdevadas force-pushed the users/cdevadas/ldstopt-constrained-sloads branch from a1b2665 to a8394e2 Compare July 22, 2024 21:06
@cdevadas cdevadas force-pushed the users/cdevadas/enable-codegen-for-constrained-sloads branch 2 times, most recently from 07b01b6 to c73011b Compare July 22, 2024 22:30
@cdevadas cdevadas force-pushed the users/cdevadas/ldstopt-constrained-sloads branch from b004bd6 to 4f5de35 Compare July 23, 2024 06:41
@cdevadas cdevadas force-pushed the users/cdevadas/enable-codegen-for-constrained-sloads branch from c73011b to 21f3849 Compare July 23, 2024 06:41
cdevadas (Collaborator, Author) commented on Jul 23, 2024:

Merge activity

  • Jul 23, 4:02 AM EDT: @cdevadas started a stack merge that includes this pull request via Graphite.
  • Jul 23, 4:22 AM EDT: Graphite rebased this pull request as part of a merge.
  • Jul 23, 4:25 AM EDT: Graphite rebased this pull request as part of a merge.
  • Jul 23, 4:29 AM EDT: @cdevadas merged this pull request with Graphite.

@cdevadas cdevadas force-pushed the users/cdevadas/ldstopt-constrained-sloads branch 4 times, most recently from 9ab38a2 to 276fb59 Compare July 23, 2024 08:17
Base automatically changed from users/cdevadas/ldstopt-constrained-sloads to main July 23, 2024 08:20
@cdevadas cdevadas force-pushed the users/cdevadas/enable-codegen-for-constrained-sloads branch from 21f3849 to ca9ae49 Compare July 23, 2024 08:22
cdevadas added 8 commits July 23, 2024 08:25
For targets that support the XNACK replay feature (gfx8+), a
multi-dword scalar load must not clobber any register that
holds its source address. The constrained versions of the scalar
loads have the early-clobber flag attached to the dst operand,
which prevents RA from re-allocating any of the src regs for the
dst operand.
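The test diffs above show exactly this effect; a minimal sketch of the hazard the early-clobber flag prevents (register numbers are illustrative):

```asm
; Without the early-clobber constraint, RA may allocate the destination
; on top of the source address:
;
;   s_load_dwordx2 s[0:1], s[0:1], 0x24   ; dst overlaps the base address
;
; On a gfx8+ target with XNACK enabled, a page fault causes the load to
; be replayed after the fault is serviced. By then s[0:1] may already
; hold partially written load data instead of the address, so the replay
; reads from a garbage pointer. Marking the destination early-clobber
; forces a disjoint allocation:
;
;   s_load_dwordx2 s[2:3], s[0:1], 0x24   ; dst and src cannot overlap
```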
@cdevadas cdevadas force-pushed the users/cdevadas/enable-codegen-for-constrained-sloads branch from ca9ae49 to b006c30 Compare July 23, 2024 08:25
@cdevadas cdevadas merged commit 229e118 into main Jul 23, 2024
4 of 7 checks passed
@cdevadas cdevadas deleted the users/cdevadas/enable-codegen-for-constrained-sloads branch July 23, 2024 08:29
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
For targets that support the XNACK replay feature (gfx8+), a
multi-dword scalar load must not clobber any register that
holds its source address. The constrained versions of the scalar
loads have the early-clobber flag attached to the dst operand,
which prevents RA from re-allocating any of the src regs for the
dst operand.

Differential Revision: https://phabricator.intern.facebook.com/D60251360