[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU #96185
Conversation
In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 32 performed well in microbenchmarks; values of 8, 16, or 64 would perform similarly. The testcase updates are autogenerated by utils/update_llc_test_checks.py.
See also:
- PR introducing this optimization: https://reviews.llvm.org/D46477
Part of SWDEV-455845.
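For illustration (not part of the patch), the affected calls are fixed-size memcpy intrinsics that SelectionDAG expands inline into loads and stores. A minimal IR sketch with a hypothetical function name, copying 40 bytes from global to scratch memory as in the memcpy-fixed-align.ll test:

declare void @llvm.memcpy.p5.p1.i64(ptr addrspace(5), ptr addrspace(1), i64, i1)

define void @copy_40_bytes_to_stack(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
entry:
  ; With MaxGluedStoresPerMemcpy set, the loads expanded from this call are
  ; chained in fixed-size groups, and each store only waits for its own load.
  call void @llvm.memcpy.p5.p1.i64(ptr addrspace(5) align 4 %dst, ptr addrspace(1) align 4 %src, i64 40, i1 false)
  ret void
}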
@llvm/pr-subscribers-backend-amdgpu
Author: Fabian Ritter (ritter-x2a)
Changes
In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 32 performed well in microbenchmarks; values of 8, 16, or 64 would perform similarly. See also:
Part of SWDEV-455845. Patch is 257.38 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96185.diff 4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 522b3a34161cd..63561ec3c77f3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;
+ // Enable ganging up loads and stores in the memcpy DAG lowering.
+ MaxGluedStoresPerMemcpy = 32;
+
// Lower floating point store/load to integer store/load to reduce the number
// of patterns in tablegen.
setOperationAction(ISD::LOAD, MVT::f32, Promote);
diff --git a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
index a118fa388f86d..645e48f1bb1ab 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -9074,8 +9074,8 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
; GFX9-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:24 ; 4-byte Folded Spill
; GFX9-NEXT: s_mov_b64 exec, s[4:5]
-; GFX9-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:20
-; GFX9-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:16
+; GFX9-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:16
+; GFX9-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:20
; GFX9-NEXT: buffer_load_dword v31, off, s[0:3], s33
; GFX9-NEXT: v_writelane_b32 v40, s30, 0
; GFX9-NEXT: v_writelane_b32 v40, s31, 1
@@ -9113,9 +9113,9 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX9-NEXT: s_mov_b32 s4, byval_align16_f64_arg@abs32@lo
; GFX9-NEXT: v_writelane_b32 v40, s63, 31
; GFX9-NEXT: s_waitcnt vmcnt(2)
-; GFX9-NEXT: buffer_store_dword v32, off, s[0:3], s32 offset:4
+; GFX9-NEXT: buffer_store_dword v32, off, s[0:3], s32
; GFX9-NEXT: s_waitcnt vmcnt(2)
-; GFX9-NEXT: buffer_store_dword v33, off, s[0:3], s32
+; GFX9-NEXT: buffer_store_dword v33, off, s[0:3], s32 offset:4
; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
; GFX9-NEXT: v_readlane_b32 s63, v40, 31
; GFX9-NEXT: v_readlane_b32 s62, v40, 30
@@ -9167,17 +9167,17 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX10-NEXT: s_waitcnt_depctr 0xffe3
; GFX10-NEXT: s_mov_b32 exec_lo, s4
; GFX10-NEXT: s_clause 0x2
-; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:20
-; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:16
+; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:16
+; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:20
; GFX10-NEXT: buffer_load_dword v31, off, s[0:3], s33
; GFX10-NEXT: v_writelane_b32 v40, s30, 0
; GFX10-NEXT: s_addk_i32 s32, 0x400
; GFX10-NEXT: s_mov_b32 s5, byval_align16_f64_arg@abs32@hi
; GFX10-NEXT: s_mov_b32 s4, byval_align16_f64_arg@abs32@lo
; GFX10-NEXT: s_waitcnt vmcnt(2)
-; GFX10-NEXT: buffer_store_dword v32, off, s[0:3], s32 offset:4
+; GFX10-NEXT: buffer_store_dword v32, off, s[0:3], s32
; GFX10-NEXT: s_waitcnt vmcnt(1)
-; GFX10-NEXT: buffer_store_dword v33, off, s[0:3], s32
+; GFX10-NEXT: buffer_store_dword v33, off, s[0:3], s32 offset:4
; GFX10-NEXT: v_writelane_b32 v40, s31, 1
; GFX10-NEXT: v_writelane_b32 v40, s34, 2
; GFX10-NEXT: v_writelane_b32 v40, s35, 3
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll b/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
index a5e0ceaa6b329..343925528a520 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
@@ -8,22 +8,22 @@ define void @memcpy_fixed_align(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
; MUBUF: ; %bb.0:
; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; MUBUF-NEXT: global_load_dwordx2 v[11:12], v[1:2], off offset:32
-; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
-; MUBUF-NEXT: global_load_dwordx4 v[7:10], v[1:2], off
+; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off
+; MUBUF-NEXT: global_load_dwordx4 v[7:10], v[1:2], off offset:16
; MUBUF-NEXT: v_lshrrev_b32_e64 v0, 6, s32
; MUBUF-NEXT: s_waitcnt vmcnt(2)
-; MUBUF-NEXT: buffer_store_dword v12, off, s[0:3], s32 offset:36
; MUBUF-NEXT: buffer_store_dword v11, off, s[0:3], s32 offset:32
+; MUBUF-NEXT: buffer_store_dword v12, off, s[0:3], s32 offset:36
; MUBUF-NEXT: s_waitcnt vmcnt(3)
-; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:28
-; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:24
-; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:20
-; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:16
+; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:12
+; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:8
+; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:4
+; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32
; MUBUF-NEXT: s_waitcnt vmcnt(6)
-; MUBUF-NEXT: buffer_store_dword v10, off, s[0:3], s32 offset:12
-; MUBUF-NEXT: buffer_store_dword v9, off, s[0:3], s32 offset:8
-; MUBUF-NEXT: buffer_store_dword v8, off, s[0:3], s32 offset:4
-; MUBUF-NEXT: buffer_store_dword v7, off, s[0:3], s32
+; MUBUF-NEXT: buffer_store_dword v10, off, s[0:3], s32 offset:28
+; MUBUF-NEXT: buffer_store_dword v9, off, s[0:3], s32 offset:24
+; MUBUF-NEXT: buffer_store_dword v8, off, s[0:3], s32 offset:20
+; MUBUF-NEXT: buffer_store_dword v7, off, s[0:3], s32 offset:16
; MUBUF-NEXT: ;;#ASMSTART
; MUBUF-NEXT: ; use v0
; MUBUF-NEXT: ;;#ASMEND
@@ -33,16 +33,16 @@ define void @memcpy_fixed_align(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
; FLATSCR-LABEL: memcpy_fixed_align:
; FLATSCR: ; %bb.0:
; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off
+; FLATSCR-NEXT: global_load_dwordx4 v[7:10], v[1:2], off offset:16
; FLATSCR-NEXT: global_load_dwordx2 v[11:12], v[1:2], off offset:32
-; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
-; FLATSCR-NEXT: global_load_dwordx4 v[7:10], v[1:2], off
; FLATSCR-NEXT: v_mov_b32_e32 v0, s32
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
-; FLATSCR-NEXT: scratch_store_dwordx2 off, v[11:12], s32 offset:32
+; FLATSCR-NEXT: scratch_store_dwordx4 off, v[3:6], s32
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
-; FLATSCR-NEXT: scratch_store_dwordx4 off, v[3:6], s32 offset:16
+; FLATSCR-NEXT: scratch_store_dwordx4 off, v[7:10], s32 offset:16
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
-; FLATSCR-NEXT: scratch_store_dwordx4 off, v[7:10], s32
+; FLATSCR-NEXT: scratch_store_dwordx2 off, v[11:12], s32 offset:32
; FLATSCR-NEXT: ;;#ASMSTART
; FLATSCR-NEXT: ; use v0
; FLATSCR-NEXT: ;;#ASMEND
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
index 358f42dfe8dd5..166bd90d098d0 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll
@@ -13,148 +13,104 @@ define amdgpu_kernel void @memcpy_p0_p0_minsize(ptr %dest, ptr readonly %src) #0
; CHECK-NEXT: v_mov_b32_e32 v0, s2
; CHECK-NEXT: v_mov_b32_e32 v1, s3
; CHECK-NEXT: flat_load_ubyte v4, v[0:1]
+; CHECK-NEXT: flat_load_ubyte v5, v[0:1] offset:1
+; CHECK-NEXT: flat_load_ubyte v6, v[0:1] offset:2
+; CHECK-NEXT: flat_load_ubyte v7, v[0:1] offset:3
+; CHECK-NEXT: flat_load_ubyte v8, v[0:1] offset:4
+; CHECK-NEXT: flat_load_ubyte v9, v[0:1] offset:5
+; CHECK-NEXT: flat_load_ubyte v10, v[0:1] offset:6
+; CHECK-NEXT: flat_load_ubyte v11, v[0:1] offset:7
+; CHECK-NEXT: flat_load_ubyte v12, v[0:1] offset:8
+; CHECK-NEXT: flat_load_ubyte v13, v[0:1] offset:9
+; CHECK-NEXT: flat_load_ubyte v14, v[0:1] offset:10
+; CHECK-NEXT: flat_load_ubyte v15, v[0:1] offset:11
+; CHECK-NEXT: flat_load_ubyte v16, v[0:1] offset:12
+; CHECK-NEXT: flat_load_ubyte v17, v[0:1] offset:13
+; CHECK-NEXT: flat_load_ubyte v18, v[0:1] offset:14
; CHECK-NEXT: v_mov_b32_e32 v3, s1
; CHECK-NEXT: v_mov_b32_e32 v2, s0
; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; CHECK-NEXT: flat_store_byte v[2:3], v4
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:1
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:1
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:2
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:2
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:3
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:3
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:4
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:4
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:5
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:5
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:6
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:6
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:7
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:7
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:8
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:8
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:9
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:9
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:10
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:10
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:11
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:11
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:12
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:12
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:13
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:13
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:14
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:14
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:15
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:15
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:16
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:16
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:17
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:17
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:18
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:18
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:19
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:19
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:20
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:20
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:21
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:21
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:22
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:22
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:23
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:23
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:24
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:24
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:25
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:25
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:26
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:26
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:27
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:27
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:28
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:28
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:29
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:29
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:30
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:30
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:31
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:31
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:32
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:32
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:33
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:33
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:34
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:34
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:35
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:35
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:36
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:36
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:37
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:37
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:38
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:38
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:39
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:39
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:40
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:40
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:41
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:41
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:42
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:42
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:43
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:43
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:44
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:44
-; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:45
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:45
-; CHECK-NEXT: flat_load_ubyte v0, v[0:1] offset:46
-; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; CHECK-NEXT: flat_store_byte v[2:3], v0 offset:46
+; CHECK-NEXT: flat_store_byte v[2:3], v5 offset:1
+; CHECK-NEXT: flat_store_byte v[2:3], v6 offset:2
+; CHECK-NEXT: flat_store_byte v[2:3], v7 offset:3
+; CHECK-NEXT: flat_store_byte v[2:3], v8 offset:4
+; CHECK-NEXT: flat_store_byte v[2:3], v9 offset:5
+; CHECK-NEXT: flat_store_byte v[2:3], v10 offset:6
+; CHECK-NEXT: flat_store_byte v[2:3], v11 offset:7
+; CHECK-NEXT: flat_store_byte v[2:3], v12 offset:8
+; CHECK-NEXT: flat_store_byte v[2:3], v13 offset:9
+; CHECK-NEXT: flat_store_byte v[2:3], v14 offset:10
+; CHECK-NEXT: flat_store_byte v[2:3], v15 offset:11
+; CHECK-NEXT: flat_store_byte v[2:3], v16 offset:12
+; CHECK-NEXT: flat_store_byte v[2:3], v17 offset:13
+; CHECK-NEXT: flat_store_byte v[2:3], v18 offset:14
+; CHECK-NEXT: flat_load_ubyte v4, v[0:1] offset:46
+; CHECK-NEXT: flat_load_ubyte v5, v[0:1] offset:45
+; CHECK-NEXT: flat_load_ubyte v6, v[0:1] offset:44
+; CHECK-NEXT: flat_load_ubyte v7, v[0:1] offset:43
+; CHECK-NEXT: flat_load_ubyte v8, v[0:1] offset:42
+; CHECK-NEXT: flat_load_ubyte v9, v[0:1] offset:41
+; CHECK-NEXT: flat_load_ubyte v10, v[0:1] offset:40
+; CHECK-NEXT: flat_load_ubyte v11, v[0:1] offset:39
+; CHECK-NEXT: flat_load_ubyte v12, v[0:1] offset:38
+; CHECK-NEXT: flat_load_ubyte v13, v[0:1] offset:37
+; CHECK-NEXT: flat_load_ubyte v14, v[0:1] offset:36
+; CHECK-NEXT: flat_load_ubyte v15, v[0:1] offset:35
+; CHECK-NEXT: flat_load_ubyte v16, v[0:1] offset:34
+; CHECK-NEXT: flat_load_ubyte v17, v[0:1] offset:33
+; CHECK-NEXT: flat_load_ubyte v18, v[0:1] offset:32
+; CHECK-NEXT: flat_load_ubyte v19, v[0:1] offset:31
+; CHECK-NEXT: flat_load_ubyte v20, v[0:1] offset:30
+; CHECK-NEXT: flat_load_ubyte v21, v[0:1] offset:29
+; CHECK-NEXT: flat_load_ubyte v22, v[0:1] offset:28
+; CHECK-NEXT: flat_load_ubyte v23, v[0:1] offset:27
+; CHECK-NEXT: flat_load_ubyte v24, v[0:1] offset:26
+; CHECK-NEXT: flat_load_ubyte v25, v[0:1] offset:25
+; CHECK-NEXT: flat_load_ubyte v26, v[0:1] offset:24
+; CHECK-NEXT: flat_load_ubyte v27, v[0:1] offset:23
+; CHECK-NEXT: flat_load_ubyte v28, v[0:1] offset:22
+; CHECK-NEXT: flat_load_ubyte v29, v[0:1] offset:21
+; CHECK-NEXT: flat_load_ubyte v30, v[0:1] offset:20
+; CHECK-NEXT: flat_load_ubyte v31, v[0:1] offset:19
+; CHECK-NEXT: flat_load_ubyte v32, v[0:1] offset:18
+; CHECK-NEXT: flat_load_ubyte v33, v[0:1] offset:17
+; CHECK-NEXT: flat_load_ubyte v34, v[0:1] offset:16
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: flat_load_ubyte v0, v[0:1] offset:15
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT: flat_store_byte v[2:3], v4 offset:46
+; CHECK-NEXT: flat_store_byte v[2:3], v5 offset:45
+; CHECK-NEXT: flat_store_byte v[2:3], v6 offset:44
+; CHECK-NEXT: flat_store_byte v[2:3], v7 offset:43
+; CHECK-NEXT: flat_store_byte v[2:3], v8 offset:42
+; CHECK-NEXT: flat_store_byte v[2:3], v9 offset:41
+; CHECK-NEXT: flat_store_byte v[2:3], v10 offset:40
+; CHECK-NEXT: flat_store_byte v[2:3], v11 offset:39
+; CHECK-NEXT: flat_store_byte v[2:3], v12 offset:38
+; CHECK-NEXT: flat_store_byte v[2:3], v13 offset:37
+; CHECK-NEXT: flat_store_byte v[2:3], v14 offset:36
+; CHECK-NEXT: flat_store_byte v[2:3], v15 offset:35
+; CHECK-NEXT: flat_store_byte v[2:3], v16 offset:34
+; CHECK-NEXT: flat_store_byte v[2:3], v17 offset:33
+; CHECK-NEXT: flat_store_byte v[2:3], v18 offset:32
+; CHECK-NEXT: flat_store_byte v[2:3], v19 offset:31
+; CHECK-NEXT: flat_store_byte v[2:3], v20 offset:30
+; CHECK-NEXT: flat_store_byte v[2:3], v21 offset:29
+; CHECK-NEXT: flat_store_byte v[2:3], v22 offset:28
+; CHECK-NEXT: flat_store_byte v[2:3], v23 offset:27
+; CHECK-NEXT: flat_store_byte v[2:3], v24 offset:26
+; CHECK-NEXT: flat_store_byte v[2:3], v25 offset:25
+; CHECK-NEXT: flat_store_byte v[2:3], v26 offset:24
+; CHECK-NEXT: flat_store_byte v[2:3], v27 offset:23
+; CHECK-NEXT: flat_store_byte v[2:3], v28 offset:22
+; CHECK-NEXT: flat_store_byte v[2:3], v29 offset:21
+; CHECK-NEXT: flat_store_byte v[2:3], v30 offset:20
+; CHECK-NEXT: flat_store_byte v[2:3], v31 offset:19
+; CHECK-NEXT: flat_store_byte v[2:3], v32 offset:18
+; CHECK-NEXT: flat_store_byte v[2:3], v33 offset:17
+; CHECK-NEXT: flat_store_byte v[2:3], v34 offset:16
+; CHECK-NEXT: flat_store_byte v[2:3], v0 offset:15
; CHECK-NEXT: s_endpgm
entry:
tail call void @llvm.memcpy.p0.p0.i64(ptr %dest, ptr %src, i64 47, i1 false)
@@ -165,20 +121,20 @@ define amdgpu_kernel void @memcpy_p1_p1_minsize(ptr addrspace(1) %dest, ptr addr
; CHECK-LABEL: memcpy_p1_p1_minsize:
; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: s_load_dwordx4 s[0:3], s[4:5]...
[truncated]
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;

+  // Enable ganging up loads and stores in the memcpy DAG lowering.
Why does this need to be a target option? Why isn't the default just some number?
The current default is 0, disabling this transformation. The original author only enabled it for AArch64 and kept the previous behavior for the other targets.
In general, I don't think that it is obvious that this "optimization" (in a loose sense) improves performance for every target; it effectively limits the later stages of codegen in its choices.
We would need to benchmark the other targets to be sure that we don't decrease their performance if we want it to be enabled by default.
I mean it's a problem when things get added for only a single target out of fear of regressing anything else. Also I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API
We'd probably want different numbers based on the address spaces involved
> I mean it's a problem when things get added for only a single target out of fear of regressing anything else. Also I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API

Another argument for that: the limit as implemented does not actually concern only stores, but loads and stores, so 32 here means that packets of 16 loads and 16 stores are ganged up. edit: That's incorrect, sorry!
I take it that you suggest that we improve the API before using it?
I think the corresponding hardware property for AMDGPU would be something like the number of memory accesses that can be started before the first one finishes.
In AArch64, it is the number of memory operations that can be merged into one (i.e. 2 for ldp and stp).
For architectures with traditional vector operations, it might be the preferred number of vector lanes.
> We'd probably want different numbers based on the address spaces involved
I'll do some benchmarks for other address spaces to find out, then.
Turning it on first is fine. We should also expand the test coverage for different sizes, alignments and address space combinations. I'm surprised we don't have more test updates from this as-is
@arsenm I ran some benchmarks for the other address spaces; enabling this optimization either improved or did not change the performance compared to the original memcpy implementation in all cases.
Which exact parameter value among {8, 16, 32, 64} we choose does not make much of a difference with different address spaces either; I now changed the parameter value in the PR to 16 instead of the previous 32 since that performed a bit better when copying from the generic to the global address space.
At least in my benchmarks on gfx1030, different parameter values for different address spaces seem unnecessary.
Regarding test coverage, I can easily generate llc tests for various parameter combinations; the resulting test can however grow quite large. For combinations of the following parameters, a test for memmove and memcpy with auto-generated llc check lines reaches ~60k lines (ca. 1.5x the length of the largest AMDGPU codegen test so far):
dst_address_spaces = [0, 1, 3, 5]
src_address_spaces = [0, 1, 3, 4, 5]
dst_alignments = [1, 2, 4]
src_alignments = [1, 2, 4]
sizes = [16, 31, 32]
Would that be a useful addition to our testing?
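For illustration, one generated case (hypothetical name, with parameters drawn from the lists above) could look like this:

declare void @llvm.memcpy.p1.p3.i64(ptr addrspace(1), ptr addrspace(3), i64, i1)

define amdgpu_kernel void @memcpy_p1_p3_sz31_align_2_2(ptr addrspace(1) align 2 %dst, ptr addrspace(3) align 2 readonly %src) {
entry:
  ; 31-byte copy from LDS (addrspace 3) to global (addrspace 1), both align 2.
  tail call void @llvm.memcpy.p1.p3.i64(ptr addrspace(1) align 2 %dst, ptr addrspace(3) align 2 %src, i64 31, i1 false)
  ret void
}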
Testing 8 and 16 alignments will at least show some LDS differences. We can probably get away without the addrspace 4 cases, unless they are specially crafted to use uniform source pointers
I just added a generated test (with alignments 8 and 16 instead of 4, leading to 97k lines) to the PR, so that you can see what it looks like. Running it alone costs ca. 5 seconds on my workstation.
Looking more into the code, I noticed that the DAG lowering of memcpy only takes the smaller one when different alignments are specified for source and destination. So, there is no point in testing combinations of different alignments, which leaves us with ~20k lines of test cases (the now updated state in this PR).
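For example (an illustrative call, not taken from the test file), a copy like the following is expanded as if both pointers were only align 2, since the lowering uses the smaller of the two alignments:

tail call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 16 %dst, ptr addrspace(1) align 2 %src, i64 32, i1 false)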
Do you have objections against merging the PR like this, @arsenm ?
[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: Ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 16 performed well in microbenchmarks; values of 8, 32, or 64 would perform similarly, trading off performance in different scenarios. The testcase updates are autogenerated by utils/update_llc_test_checks.py. See also: - PR introducing this optimization: https://reviews.llvm.org/D46477 Part of SWDEV-455845.
…for AMDGPU Add a test for combinations of memcpy parameters
…stores for AMDGPU Only include equal alignments in the test.
I've never seen "this diff is too large to display" before.
These test functions are all kernels, making the pointer arguments all uniform. It would be somewhat more representative to make these regular functions with VGPR pointer arguments. The cases where known uniform pointers are interesting should then use explicit inreg
The test functions are now normal functions instead of kernels, and I added another test with inreg arguments.
The differences in the generated code between those are not too fundamental, so I skipped most address space combinations and sizes for the sreg test to avoid making it too big.
It's only interesting for the SMRD candidate loads from p1/p4 from the sources
Which really means p4, without an invariant annotation we don't handle now. You would need to use a kernel for the p1->p4 promotion, so just test the constant address space uniform source case
Like this, in memcpy-sreg.ll? The previous, more verbose memcpy-param-combinations-sreg.ll is removed again.
… loads/stores for AMDGPU Adjust the test to pass arguments in vregs, add another one that passes arguments explicitly in sregs.
… memcpy loads/stores for AMDGPU Only test scalar operands for copies/moves from the constant address space 4.
lgtm with nit
; Testing codegen for memcpy and memmove with scalar reads.

define void @memcpy_p1_p4_sz16_align_4_4(ptr addrspace(1) align 4 %dst, ptr addrspace(4) align 4 readonly inreg %src) {
Extra space between align 4 and %st throughout
@@ -0,0 +1,138 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Better name would be scalar-load instead of sreg
…g up of memcpy loads/stores for AMDGPU Clean up whitespace and rename memcpy-sreg.ll -> memcpy-scalar-load.ll.
Only other nit is I would have split the memcpy and memmove cases to separate files
… ganging up of memcpy loads/stores for AMDGPU Separate files for memcpy and memmove tests.
…vm#96185) In the SelectionDAG lowering of the memcpy intrinsic, this optimization introduces additional chains between fixed-size groups of loads and the corresponding stores. While initially introduced to ensure that wider load/store-pair instructions are generated on AArch64, this optimization also improves code generation for AMDGPU: Ganged loads are scheduled into a clause; stores only await completion of their corresponding load. The chosen value of 16 performed well in microbenchmarks; values of 8, 32, or 64 would perform similarly. The testcase updates are autogenerated by utils/update_llc_test_checks.py. See also: - PR introducing this optimization: https://reviews.llvm.org/D46477 Part of SWDEV-455845.