[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU #96185
Conversation
In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 32 performed well in microbenchmarks; values of 8, 16, or 64 would perform similarly. The testcase updates are autogenerated by utils/update_llc_test_checks.py.
See also:
- PR introducing this optimization: https://reviews.llvm.org/D46477
Part of SWDEV-455845.
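For illustration (not part of the patch), the affected calls are fixed-size memcpy intrinsics that SelectionDAG expands inline into loads and stores. A minimal IR sketch with a hypothetical function name, copying 40 bytes from global to scratch memory as in the memcpy-fixed-align.ll test:

declare void @llvm.memcpy.p5.p1.i64(ptr addrspace(5), ptr addrspace(1), i64, i1)

define void @copy_40_bytes_to_stack(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
entry:
  ; With MaxGluedStoresPerMemcpy set, the loads expanded from this call are
  ; chained in fixed-size groups, and each store only waits for its own load.
  call void @llvm.memcpy.p5.p1.i64(ptr addrspace(5) align 4 %dst, ptr addrspace(1) align 4 %src, i64 40, i1 false)
  ret void
}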
@llvm/pr-subscribers-backend-amdgpu
Author: Fabian Ritter (ritter-x2a)
Changes
In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 32 performed well in microbenchmarks; values of 8, 16, or 64 would perform similarly. See also:
Part of SWDEV-455845. Patch is 257.38 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96185.diff 4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 522b3a34161cd..63561ec3c77f3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;
+ // Enable ganging up loads and stores in the memcpy DAG lowering.
+ MaxGluedStoresPerMemcpy = 32;
+
// Lower floating point store/load to integer store/load to reduce the number
// of patterns in tablegen.
setOperationAction(ISD::LOAD, MVT::f32, Promote);
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
index a118fa388f86d..645e48f1bb1ab 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -9074,8 +9074,8 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
; GFX9-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:24 ; 4-byte Folded Spill
; GFX9-NEXT: s_mov_b64 exec, s[4:5]
-; GFX9-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:20
-; GFX9-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:16
+; GFX9-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:16
+; GFX9-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:20
; GFX9-NEXT: buffer_load_dword v31, off, s[0:3], s33
; GFX9-NEXT: v_writelane_b32 v40, s30, 0
; GFX9-NEXT: v_writelane_b32 v40, s31, 1
@@ -9113,9 +9113,9 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX9-NEXT: s_mov_b32 s4, byval_align16_f64_arg@abs32@lo
; GFX9-NEXT: v_writelane_b32 v40, s63, 31
; GFX9-NEXT: s_waitcnt vmcnt(2)
-; GFX9-NEXT: buffer_store_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT: buffer_store_dword v32, off, s[0:3], s32
; GFX9-NEXT: s_waitcnt vmcnt(2)
-; GFX9-NEXT: buffer_store_dword v33, off, s[0:3], s32
+; GFX9-NEXT: buffer_store_dword v33, off, s[0:3], s32 offset:4
; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
; GFX9-NEXT: v_readlane_b32 s63, v40, 31
; GFX9-NEXT: v_readlane_b32 s62, v40, 30
@@ -9167,17 +9167,17 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX10-NEXT: s_waitcnt_depctr 0xffe3
; GFX10-NEXT: s_mov_b32 exec_lo, s4
; GFX10-NEXT: s_clause 0x2
-; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:20
-; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:16
+; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:16
+; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:20
; GFX10-NEXT: buffer_load_dword v31, off, s[0:3], s33
; GFX10-NEXT: v_writelane_b32 v40, s30, 0
; GFX10-NEXT: s_addk_i32 s32, 0x400
; GFX10-NEXT: s_mov_b32 s5, byval_align16_f64_arg@abs32@hi
; GFX10-NEXT: s_mov_b32 s4, byval_align16_f64_arg@abs32@lo
; GFX10-NEXT: s_waitcnt vmcnt(2)
-; GFX10-NEXT: buffer_store_dword v32, off, s[0:3], s32 offset:4
+; GFX10-NEXT: buffer_store_dword v32, off, s[0:3], s32
; GFX10-NEXT: s_waitcnt vmcnt(1)
-; GFX10-NEXT: buffer_store_dword v33, off, s[0:3], s32
+; GFX10-NEXT: buffer_store_dword v33, off, s[0:3], s32 offset:4
; GFX10-NEXT: v_writelane_b32 v40, s31, 1
; GFX10-NEXT: v_writelane_b32 v40, s34, 2
; GFX10-NEXT: v_writelane_b32 v40, s35, 3
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll b/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
index a5e0ceaa6b329..343925528a520 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
@@ -8,22 +8,22 @@ define void @memcpy_fixed_align(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
; MUBUF: ; %bb.0:
; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; MUBUF-NEXT: global_load_dwordx2 v[11:12], v[1:2], off offset:32
-; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
-; MUBUF-NEXT: global_load_dwordx4 v[7:10], v[1:2], off
+; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off
+; MUBUF-NEXT: global_load_dwordx4 v[7:10], v[1:2], off offset:16
; MUBUF-NEXT: v_lshrrev_b32_e64 v0, 6, s32
; MUBUF-NEXT: s_waitcnt vmcnt(2)
-; MUBUF-NEXT: buffer_store_dword v12, off, s[0:3], s32 offset:36
; MUBUF-NEXT: buffer_store_dword v11, off, s[0:3], s32 offset:32
+; MUBUF-NEXT: buffer_store_dword v12, off, s[0:3], s32 offset:36
; MUBUF-NEXT: s_waitcnt vmcnt(3)
-; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:28
-; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:24
-; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:20
-; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:16
+; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:12
+; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:8
+; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:4
+; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32
; MUBUF-NEXT: s_waitcnt vmcnt(6)
-; MUBUF-NEXT: buffer_store_dword v10, off, s[0:3], s32 offset:12
-; MUBUF-NEXT: buffer_store_dword v9, off, s[0:3], s32 offset:8
-; MUBUF-NEXT: buffer_store_dword v8, off, s[0:3], s32 offset:4
-; MUBUF-NEXT: buffer_store_dword v7, off, s[0:3], s32
+; MUBUF-NEXT: buffer_store_dword v10, off, s[0:3], s32 offset:28
+; MUBUF-NEXT: buffer_store_dword v9, off, s[0:3], s32 offset:24
+; MUBUF-NEXT: buffer_store_dword v8, off, s[0:3], s32 offset:20
+; MUBUF-NEXT: buffer_store_dword v7, off, s[0:3], s32 offset:16
; MUBUF-NEXT: ;;#ASMSTART
; MUBUF-NEXT: ; use v0
; MUBUF-NEXT: ;;#ASMEND
@@ -33,16 +33,16 @@ define void @memcpy_fixed_align(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
; FLATSCR-LABEL: memcpy_fixed_align:
; FLATSCR: ; %bb.0:
; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off
+; FLATSCR-NEXT: global_load_dwordx4 v[7:10], v[1:2], off offset:16
; FLATSCR-NEXT: global_load_dwordx2 v[11:12], v[1:2], off offset:32
-; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
-; FLATSCR-NEXT: global_load_dwordx4 v[7:10], v[1:2], off
; FLATSCR-NEXT: v_mov_b32_e32 v0, s32
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
-; FLATSCR-NEXT: scratch_store_dwordx2 off, v[11:12], s32 offset:32
+; FLATSCR-NEXT: scratch_store_dwordx4 off, v[3:6], s32
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
-; FLATSCR-NEXT: scratch_store_dwordx4 off, v[3:6], s32 offset:16
+; FLATSCR-NEXT: scratch_store_dwordx4 off, v[7:10], s32 offset:16
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
-; FLATSCR-NEXT: scratch_store_dwordx4 off, v[7:10], s32
+; FLATSCR-NEXT: scratch_store_dwordx2 off, v[11:12], s32 offset:32
; FLATSCR-NEXT: ;;#ASMSTART
; FLATSCR-NEXT: ; use v0
; FLATSCR-NEXT: ;;#ASMEND
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
index 358f42dfe8dd5..166bd90d098d0 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
@@ -13,148 +13,104 @@ define amdgpu_kernel void @memcpy_p0_p0_minsize(ptr %dest, ptr readonly %src) #0
; CHECK-NEXT: v_mov_b32_e32 v0, s2
; CHECK-NEXT: v_mov_b32_e32 v1, s3
; CHECK-NEXT: flat_load_ubyte v4, v[0:1]
+; CHECK-NEXT: flat_load_ubyte v5, v[0:1] offset:1
+; CHECK-NEXT: flat_load_ubyte v6, v[0:1] offset:2
+; CHECK-NEXT: flat_load_ubyte v7, v[0:1] offset:3
+; CHECK-NEXT: flat_load_ubyte v8, v[0:1] offset:4
+; CHECK-NEXT: flat_load_ubyte v9, v[0:1] offset:5
+; CHECK-NEXT: flat_load_ubyte v10, v[0:1] offset:6
+; CHECK-NEXT: flat_load_ubyte v11, v[0:1] offset:7
+; CHECK-NEXT: flat_load_ubyte v12, v[0:1] offset:8
+; CHECK-NEXT: flat_load_ubyte v13, v[0:1] offset:9
+; CHECK-NEXT: flat_load_ubyte v14, v[0:1] offset:10
+; CHECK-NEXT: flat_load_ubyte v15, v[0:1] offset:11
+; CHECK-NEXT: flat_load_ubyte v16, v[0:1] offset:12
+; CHECK-NEXT: flat_load_ubyte v17, v[0:1] offset:13
+; CHECK-NEXT: flat_load_ubyte v18, v[0:1] offset:14
; CHECK-NEXT: v_mov_b32_e32 v3, s1
; CHECK-NEXT: v_mov_b32_e32 v2, s0
; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; CHECK-NEXT: flat_store_byte v[2:3], v4
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:1
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:1
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:2
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:2
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:3
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:3
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:4
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:4
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:5
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:5
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:6
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:6
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:7
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:7
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:8
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:8
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:9
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:9
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:10
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:10
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:11
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:11
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:12
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:12
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:13
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:13
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:14
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:14
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:15
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:15
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:16
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:16
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:17
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:17
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:18
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:18
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:19
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:19
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:20
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:20
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:21
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:21
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:22
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:22
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:23
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:23
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:24
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:24
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:25
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:25
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:26
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:26
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:27
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:27
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:28
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:28
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:29
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:29
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:30
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:30
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:31
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:31
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:32
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:32
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:33
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:33
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:34
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:34
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:35
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:35
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:36
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:36
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:37
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:37
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:38
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:38
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:39
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:39
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:40
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:40
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:41
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:41
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:42
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:42
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:43
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:43
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:44
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:44
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:45
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:45
-; CHECK-NEXT: flat_load_ubyte v0, v[0:1] offset:46
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v0 offset:46
+; CHECK-NEXT: flat_store_byte v[2:3], v5 offset:1
+; CHECK-NEXT: flat_store_byte v[2:3], v6 offset:2
+; CHECK-NEXT: flat_store_byte v[2:3], v7 offset:3
+; CHECK-NEXT: flat_store_byte v[2:3], v8 offset:4
+; CHECK-NEXT: flat_store_byte v[2:3], v9 offset:5
+; CHECK-NEXT: flat_store_byte v[2:3], v10 offset:6
+; CHECK-NEXT: flat_store_byte v[2:3], v11 offset:7
+; CHECK-NEXT: flat_store_byte v[2:3], v12 offset:8
+; CHECK-NEXT: flat_store_byte v[2:3], v13 offset:9
+; CHECK-NEXT: flat_store_byte v[2:3], v14 offset:10
+; CHECK-NEXT: flat_store_byte v[2:3], v15 offset:11
+; CHECK-NEXT: flat_store_byte v[2:3], v16 offset:12
+; CHECK-NEXT: flat_store_byte v[2:3], v17 offset:13
+; CHECK-NEXT: flat_store_byte v[2:3], v18 offset:14
+; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:46
+; CHECK-NEXT: flat_load_ubyte v5, v[0:1] offset:45
+; CHECK-NEXT: flat_load_ubyte v6, v[0:1] offset:44
+; CHECK-NEXT: flat_load_ubyte v7, v[0:1] offset:43
+; CHECK-NEXT: flat_load_ubyte v8, v[0:1] offset:42
+; CHECK-NEXT: flat_load_ubyte v9, v[0:1] offset:41
+; CHECK-NEXT: flat_load_ubyte v10, v[0:1] offset:40
+; CHECK-NEXT: flat_load_ubyte v11, v[0:1] offset:39
+; CHECK-NEXT: flat_load_ubyte v12, v[0:1] offset:38
+; CHECK-NEXT: flat_load_ubyte v13, v[0:1] offset:37
+; CHECK-NEXT: flat_load_ubyte v14, v[0:1] offset:36
+; CHECK-NEXT: flat_load_ubyte v15, v[0:1] offset:35
+; CHECK-NEXT: flat_load_ubyte v16, v[0:1] offset:34
+; CHECK-NEXT: flat_load_ubyte v17, v[0:1] offset:33
+; CHECK-NEXT: flat_load_ubyte v18, v[0:1] offset:32
+; CHECK-NEXT: flat_load_ubyte v19, v[0:1] offset:31
+; CHECK-NEXT: flat_load_ubyte v20, v[0:1] offset:30
+; CHECK-NEXT: flat_load_ubyte v21, v[0:1] offset:29
+; CHECK-NEXT: flat_load_ubyte v22, v[0:1] offset:28
+; CHECK-NEXT: flat_load_ubyte v23, v[0:1] offset:27
+; CHECK-NEXT: flat_load_ubyte v24, v[0:1] offset:26
+; CHECK-NEXT: flat_load_ubyte v25, v[0:1] offset:25
+; CHECK-NEXT: flat_load_ubyte v26, v[0:1] offset:24
+; CHECK-NEXT: flat_load_ubyte v27, v[0:1] offset:23
+; CHECK-NEXT: flat_load_ubyte v28, v[0:1] offset:22
+; CHECK-NEXT: flat_load_ubyte v29, v[0:1] offset:21
+; CHECK-NEXT: flat_load_ubyte v30, v[0:1] offset:20
+; CHECK-NEXT: flat_load_ubyte v31, v[0:1] offset:19
+; CHECK-NEXT: flat_load_ubyte v32, v[0:1] offset:18
+; CHECK-NEXT: flat_load_ubyte v33, v[0:1] offset:17
+; CHECK-NEXT: flat_load_ubyte v34, v[0:1] offset:16
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: flat_load_ubyte v0, v[0:1] offset:15
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:46
+; CHECK-NEXT: flat_store_byte v[2:3], v5 offset:45
+; CHECK-NEXT: flat_store_byte v[2:3], v6 offset:44
+; CHECK-NEXT: flat_store_byte v[2:3], v7 offset:43
+; CHECK-NEXT: flat_store_byte v[2:3], v8 offset:42
+; CHECK-NEXT: flat_store_byte v[2:3], v9 offset:41
+; CHECK-NEXT: flat_store_byte v[2:3], v10 offset:40
+; CHECK-NEXT: flat_store_byte v[2:3], v11 offset:39
+; CHECK-NEXT: flat_store_byte v[2:3], v12 offset:38
+; CHECK-NEXT: flat_store_byte v[2:3], v13 offset:37
+; CHECK-NEXT: flat_store_byte v[2:3], v14 offset:36
+; CHECK-NEXT: flat_store_byte v[2:3], v15 offset:35
+; CHECK-NEXT: flat_store_byte v[2:3], v16 offset:34
+; CHECK-NEXT: flat_store_byte v[2:3], v17 offset:33
+; CHECK-NEXT: flat_store_byte v[2:3], v18 offset:32
+; CHECK-NEXT: flat_store_byte v[2:3], v19 offset:31
+; CHECK-NEXT: flat_store_byte v[2:3], v20 offset:30
+; CHECK-NEXT: flat_store_byte v[2:3], v21 offset:29
+; CHECK-NEXT: flat_store_byte v[2:3], v22 offset:28
+; CHECK-NEXT: flat_store_byte v[2:3], v23 offset:27
+; CHECK-NEXT: flat_store_byte v[2:3], v24 offset:26
+; CHECK-NEXT: flat_store_byte v[2:3], v25 offset:25
+; CHECK-NEXT: flat_store_byte v[2:3], v26 offset:24
+; CHECK-NEXT: flat_store_byte v[2:3], v27 offset:23
+; CHECK-NEXT: flat_store_byte v[2:3], v28 offset:22
+; CHECK-NEXT: flat_store_byte v[2:3], v29 offset:21
+; CHECK-NEXT: flat_store_byte v[2:3], v30 offset:20
+; CHECK-NEXT: flat_store_byte v[2:3], v31 offset:19
+; CHECK-NEXT: flat_store_byte v[2:3], v32 offset:18
+; CHECK-NEXT: flat_store_byte v[2:3], v33 offset:17
+; CHECK-NEXT: flat_store_byte v[2:3], v34 offset:16
+; CHECK-NEXT: flat_store_byte v[2:3], v0 offset:15
; CHECK-NEXT: s_endpgm
entry:
tail call void @llvm.memcpy.p0.p0.i64(ptr %dest, ptr %src, i64 47, i1 false)
@@ -165,20 +121,20 @@ define amdgpu_kernel void @memcpy_p1_p1_minsize(ptr addrspace(1) %dest, ptr addr
; CHECK-LABEL: memcpy_p1_p1_minsize:
; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: s_load_dwordx4 s[0:3], s[4:5]...
[truncated]
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;

+  // Enable ganging up loads and stores in the memcpy DAG lowering.
Why does this need to be a target option? Why isn't the default just some number?
The current default is 0, disabling this transformation. The original author only enabled it for AArch64 and kept the previous behavior for the other targets.
In general, I don't think that it is obvious that this "optimization" (in a loose sense) improves performance for every target; it effectively limits the later stages of codegen in its choices.
We would need to benchmark the other targets to be sure that we don't decrease their performance if we want it to be enabled by default.
I mean it's a problem when things get added for only a single target out of fear of regressing anything else. Also I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API
We'd probably want different numbers based on the address spaces involved
> I mean it's a problem when things get added for only a single target out of fear of regressing anything else. Also I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API

Another argument for that: the limit as implemented does not actually concern only stores, but loads and stores, so 32 here means that packets of 16 loads and 16 stores are ganged up. edit: That's incorrect, sorry!
I take it that you suggest that we improve the API before using it?
I think the corresponding hardware property for AMDGPU would be something like the number of memory accesses that can be started before the first one finishes.
In AArch64, it is the number of memory operations that can be merged into one (i.e. 2 for ldp and stp).
For architectures with traditional vector operations, it might be the preferred number of vector lanes.
> We'd probably want different numbers based on the address spaces involved
I'll do some benchmarks for other address spaces to find out, then.
Turning it on first is fine. We should also expand the test coverage for different sizes, alignments and address space combinations. I'm surprised we don't have more test updates from this as-is
@arsenm I ran some benchmarks for the other address spaces; enabling this optimization either improved or did not change the performance compared to the original memcpy implementation in all cases.
Which exact parameter value among {8, 16, 32, 64} we choose does not make much of a difference with different address spaces either; I now changed the parameter value in the PR to 16 instead of the previous 32 since that performed a bit better when copying from the generic to the global address space.
At least in my benchmarks on gfx1030, different parameter values for different address spaces seem unnecessary.
Regarding test coverage, I can easily generate llc tests for various parameter combinations; the resulting test can however grow quite large. For combinations of the following parameters, a test for memmove and memcpy with auto-generated llc check lines reaches ~60k lines (ca. 1.5x the length of the largest AMDGPU codegen test so far):
dst_address_spaces = [0, 1, 3, 5]
src_address_spaces = [0, 1, 3, 4, 5]
dst_alignments = [1, 2, 4]
src_alignments = [1, 2, 4]
sizes = [16, 31, 32]
Would that be a useful addition to our testing?
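For illustration, one generated case (hypothetical name, with parameters drawn from the lists above) could look like this:

declare void @llvm.memcpy.p1.p3.i64(ptr addrspace(1), ptr addrspace(3), i64, i1)

define amdgpu_kernel void @memcpy_p1_p3_sz31_align_2_2(ptr addrspace(1) align 2 %dst, ptr addrspace(3) align 2 readonly %src) {
entry:
  ; 31-byte copy from LDS (addrspace 3) to global (addrspace 1), both align 2.
  tail call void @llvm.memcpy.p1.p3.i64(ptr addrspace(1) align 2 %dst, ptr addrspace(3) align 2 %src, i64 31, i1 false)
  ret void
}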
Testing 8 and 16 alignments will at least show some LDS differences. We can probably get away without the addrspace 4 cases, unless they are specially crafted to use uniform source pointers
I just added a generated test (with alignments 8 and 16 instead of 4, leading to 97k lines) to the PR, so that you can see what it looks like. Running it alone costs ca. 5 seconds on my workstation.
Looking more into the code, I noticed that the DAG lowering of memcpy only takes the smaller one when different alignments are specified for source and destination. So, there is no point in testing combinations of different alignments, which leaves us with ~20k lines of test cases (the now updated state in this PR).
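For example (an illustrative call, not taken from the test file), a copy like the following is expanded as if both pointers were only align 2, since the lowering uses the smaller of the two alignments:

tail call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 16 %dst, ptr addrspace(1) align 2 %src, i64 32, i1 false)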
Do you have objections against merging the PR like this, @arsenm ?
[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: Ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 16 performed well in microbenchmarks; values of 8, 32, or 64 would perform similarly, trading off performance in different scenarios. The testcase updates are autogenerated by utils/update_llc_test_checks.py. See also: - PR introducing this optimization: https://reviews.llvm.org/D46477 Part of SWDEV-455845.
…for AMDGPU Add a test for combinations of memcpy parameters
…stores for AMDGPU Only include equal alignments in the test.
I've never seen "this diff is too large to display" before.
These test functions are all kernels, making the pointer arguments all uniform. It would be somewhat more representative to make these regular functions with VGPR pointer arguments. The cases where known uniform pointers are interesting should then use explicit inreg
The test functions are now normal functions instead of kernels, and I added another test with inreg arguments.
The differences in the generated code between those are not too fundamental, so I skipped most address space combinations and sizes for the sreg test to avoid making it too big.
It's only interesting for the SMRD candidate loads from p1/p4 from the sources
Which really means p4, without an invariant annotation we don't handle now. You would need to use a kernel for the p1->p4 promotion, so just test the constant address space uniform source case
Like this, in memcpy-sreg.ll? The previous, more verbose memcpy-param-combinations-sreg.ll is removed again.
… loads/stores for AMDGPU Adjust the test to pass arguments in vregs, add another one that passes arguments explicitly in sregs.
… memcpy loads/stores for AMDGPU Only test scalar operands for copies/moves from the constant address space 4.
lgtm with nit
; Testing codegen for memcpy and memmove with scalar reads.

define void @memcpy_p1_p4_sz16_align_4_4(ptr addrspace(1) align 4 %dst, ptr addrspace(4) align 4 readonly inreg %src) {
Extra space between align 4 and %st throughout
@@ -0,0 +1,138 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Better name would be scalar-load instead of sreg
…g up of memcpy loads/stores for AMDGPU Clean up whitespace and rename memcpy-sreg.ll -> memcpy-scalar-load.ll.
Only other nit is I would have split the memcpy and memmove cases to separate files
… ganging up of memcpy loads/stores for AMDGPU Separate files for memcpy and memmove tests.
…vm#96185) In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: Ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 16 performed well in microbenchmarks; values of 8, 32, or 64 would perform similarly. The testcase updates are autogenerated by utils/update_llc_test_checks.py. See also: - PR introducing this optimization: https://reviews.llvm.org/D46477 Part of SWDEV-455845.