
[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU #96185

Merged · 8 commits · Jul 3, 2024
3 changes: 3 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;

// Enable ganging up loads and stores in the memcpy DAG lowering.
Contributor:
Why does this need to be a target option? Why isn't the default just some number?

Member Author:
The current default is 0, disabling this transformation. The original author only enabled it for AArch64 and kept the previous behavior for the other targets.

In general, I don't think that it is obvious that this "optimization" (in a loose sense) improves performance for every target; it effectively limits the later stages of codegen in its choices.
We would need to benchmark the other targets to be sure that we don't decrease their performance if we want it to be enabled by default.
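
To make the discussion concrete, here is a minimal host-side sketch of the two emission orders, under the assumption that this captures the intent of the limit; it is not LLVM's DAG-lowering code, and BatchSize is a hypothetical stand-in for TargetLowering::MaxGluedStoresPerMemcpy (this PR ends up picking 16 for AMDGPU).

  // Illustrative sketch only, not LLVM's implementation: without the limit,
  // later codegen stages are free to pair each load with its store; with
  // "ganging up", a batch of loads is kept together and emitted before the
  // matching batch of stores.
  #include <algorithm>
  #include <cstddef>

  constexpr std::size_t BatchSize = 16; // hypothetical stand-in for the limit

  void expandInterleaved(char *Dst, const char *Src, std::size_t N) {
    for (std::size_t I = 0; I != N; ++I)
      Dst[I] = Src[I]; // each load immediately followed by its store
  }

  void expandGanged(char *Dst, const char *Src, std::size_t N) {
    char InFlight[BatchSize];
    for (std::size_t Base = 0; Base < N; Base += BatchSize) {
      std::size_t Count = std::min(BatchSize, N - Base);
      for (std::size_t I = 0; I != Count; ++I)
        InFlight[I] = Src[Base + I]; // issue up to BatchSize loads first ...
      for (std::size_t I = 0; I != Count; ++I)
        Dst[Base + I] = InFlight[I]; // ... then the matching stores
    }
  }

With the ganged order, several loads can be in flight concurrently before the first dependent store has to wait for its data, which matters most on targets with long memory latencies.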

Contributor:
I mean, it's a problem when things get added for only a single target out of fear of regressing anything else. Also, I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API.

Contributor:
We'd probably want different numbers based on the address spaces involved.

Member Author (@ritter-x2a), Jun 20, 2024:
> I mean, it's a problem when things get added for only a single target out of fear of regressing anything else. Also, I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API.

Another argument for that: the limit as implemented does not actually concern only stores, but loads and stores. So 32 here means that packets of 16 loads and 16 stores are ganged up. (Edit: that's incorrect, sorry!)
I take it you're suggesting that we improve the API before using it?

I think the corresponding hardware property for AMDGPU would be something like the number of memory accesses that can be started before the first one finishes.
In AArch64, it is the number of memory operations that can be merged into one (i.e. 2 for ldp and stp).
For architectures with traditional vector operations, it might be the preferred number of vector lanes.

> We'd probably want different numbers based on the address spaces involved.

I'll do some benchmarks for other address spaces to find out, then.

Contributor:

Turning it on first is fine. We should also expand the test coverage for different sizes, alignments, and address space combinations. I'm surprised we don't have more test updates from this as-is.

Member Author:

@arsenm I ran some benchmarks for the other address spaces; enabling this optimization either improved or did not change the performance compared to the original memcpy implementation in all cases.
Which exact parameter value among {8, 16, 32, 64} we choose does not make much of a difference with different address spaces either; I now changed the parameter value in the PR to 16 instead of the previous 32 since that performed a bit better when copying from the generic to the global address space.

At least in my benchmarks on gfx1030, different parameter values for different address spaces seem unnecessary.

Regarding test coverage, I can easily generate llc tests for various parameter combinations; the resulting test can, however, grow quite large. For combinations of the following parameters, a test for memmove and memcpy with auto-generated llc check lines reaches ~60k lines (about 1.5x the length of the largest AMDGPU codegen test so far):

  dst_address_spaces = [0, 1, 3, 5]
  src_address_spaces = [0, 1, 3, 4, 5]
  dst_alignments = [1, 2, 4]
  src_alignments = [1, 2, 4]
  sizes = [16, 31, 32]

Would that be a useful addition to our testing?
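
As an illustration of how such a test could be produced, here is a hypothetical generator sketch (not the script actually used for this PR); it emits one memcpy test function per combination, with a single alignment applied to both operands for brevity, and the check lines would then come from llvm/utils/update_llc_test_checks.py. Function names and the exact IR shape are illustrative.

  // Hypothetical generator sketch: emits one IR test function per
  // (dst addrspace, src addrspace, alignment, size) combination. The matching
  // "declare void @llvm.memcpy.*" lines are omitted and would also have to be
  // emitted for the output to be valid IR.
  #include <cstdio>
  #include <string>

  // addrspace(0) is the default and is spelled plainly as "ptr".
  static std::string ptrTy(int AS) {
    return AS == 0 ? "ptr" : "ptr addrspace(" + std::to_string(AS) + ")";
  }

  int main() {
    const int DstAS[]  = {0, 1, 3, 5};
    const int SrcAS[]  = {0, 1, 3, 4, 5};
    const int Aligns[] = {1, 2, 4}; // used for both source and destination here
    const int Sizes[]  = {16, 31, 32};
    for (int D : DstAS)
      for (int S : SrcAS)
        for (int A : Aligns)
          for (int Sz : Sizes)
            std::printf(
                "define void @memcpy_p%d_p%d_sz%d_a%d(%s %%dst, %s %%src) {\n"
                "  call void @llvm.memcpy.p%d.p%d.i64(%s align %d %%dst, "
                "%s align %d %%src, i64 %d, i1 false)\n"
                "  ret void\n}\n\n",
                D, S, Sz, A, ptrTy(D).c_str(), ptrTy(S).c_str(),
                D, S, ptrTy(D).c_str(), A, ptrTy(S).c_str(), A, Sz);
    return 0;
  }

Even this reduced sketch yields 4 × 5 × 3 × 3 = 180 functions; with memmove as well and independently varied source/destination alignments, the combination count grows several times larger, which is where the tens of thousands of check lines come from.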

Contributor:

Testing alignments of 8 and 16 will at least show some LDS differences. We can probably get away without the addrspace 4 cases, unless they are specially crafted to use uniform source pointers.

Member Author:

I just added a generated test to the PR (with alignments 8 and 16 instead of 4, leading to 97k lines) so that you can see what it looks like. Running it alone takes about 5 seconds on my workstation.

Member Author:

Looking further into the code, I noticed that the DAG lowering of memcpy only uses the smaller of the two alignments when different alignments are specified for source and destination (e.g., copying from an align-4 source to an align-2 destination is expanded the same way as an align-2-to-align-2 copy). So there is no point in testing combinations of differing alignments, which leaves us with ~20k lines of test cases (the now-updated state of this PR).

Do you have any objections to merging the PR like this, @arsenm?

MaxGluedStoresPerMemcpy = 16;

// Lower floating point store/load to integer store/load to reduce the number
// of patterns in tablegen.
setOperationAction(ISD::LOAD, MVT::f32, Promote);
16 changes: 8 additions & 8 deletions llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll
@@ -9074,8 +9074,8 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
; GFX9-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:24 ; 4-byte Folded Spill
; GFX9-NEXT: s_mov_b64 exec, s[4:5]
; GFX9-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:20
; GFX9-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:16
; GFX9-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:16
; GFX9-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:20
; GFX9-NEXT: buffer_load_dword v31, off, s[0:3], s33
; GFX9-NEXT: v_writelane_b32 v40, s30, 0
; GFX9-NEXT: v_writelane_b32 v40, s31, 1
@@ -9113,9 +9113,9 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX9-NEXT: s_mov_b32 s4, byval_align16_f64_arg@abs32@lo
; GFX9-NEXT: v_writelane_b32 v40, s63, 31
; GFX9-NEXT: s_waitcnt vmcnt(2)
; GFX9-NEXT: buffer_store_dword v32, off, s[0:3], s32 offset:4
; GFX9-NEXT: buffer_store_dword v32, off, s[0:3], s32
; GFX9-NEXT: s_waitcnt vmcnt(2)
; GFX9-NEXT: buffer_store_dword v33, off, s[0:3], s32
; GFX9-NEXT: buffer_store_dword v33, off, s[0:3], s32 offset:4
; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
; GFX9-NEXT: v_readlane_b32 s63, v40, 31
; GFX9-NEXT: v_readlane_b32 s62, v40, 30
@@ -9167,17 +9167,17 @@ define void @tail_call_byval_align16(<32 x i32> %val, double %tmp) #0 {
; GFX10-NEXT: s_waitcnt_depctr 0xffe3
; GFX10-NEXT: s_mov_b32 exec_lo, s4
; GFX10-NEXT: s_clause 0x2
; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:20
; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:16
; GFX10-NEXT: buffer_load_dword v32, off, s[0:3], s33 offset:16
; GFX10-NEXT: buffer_load_dword v33, off, s[0:3], s33 offset:20
; GFX10-NEXT: buffer_load_dword v31, off, s[0:3], s33
; GFX10-NEXT: v_writelane_b32 v40, s30, 0
; GFX10-NEXT: s_addk_i32 s32, 0x400
; GFX10-NEXT: s_mov_b32 s5, byval_align16_f64_arg@abs32@hi
; GFX10-NEXT: s_mov_b32 s4, byval_align16_f64_arg@abs32@lo
; GFX10-NEXT: s_waitcnt vmcnt(2)
; GFX10-NEXT: buffer_store_dword v32, off, s[0:3], s32 offset:4
; GFX10-NEXT: buffer_store_dword v32, off, s[0:3], s32
; GFX10-NEXT: s_waitcnt vmcnt(1)
; GFX10-NEXT: buffer_store_dword v33, off, s[0:3], s32
; GFX10-NEXT: buffer_store_dword v33, off, s[0:3], s32 offset:4
; GFX10-NEXT: v_writelane_b32 v40, s31, 1
; GFX10-NEXT: v_writelane_b32 v40, s34, 2
; GFX10-NEXT: v_writelane_b32 v40, s35, 3
32 changes: 16 additions & 16 deletions llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll
@@ -8,22 +8,22 @@ define void @memcpy_fixed_align(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
; MUBUF: ; %bb.0:
; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; MUBUF-NEXT: global_load_dwordx2 v[11:12], v[1:2], off offset:32
; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
; MUBUF-NEXT: global_load_dwordx4 v[7:10], v[1:2], off
; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off
; MUBUF-NEXT: global_load_dwordx4 v[7:10], v[1:2], off offset:16
; MUBUF-NEXT: v_lshrrev_b32_e64 v0, 6, s32
; MUBUF-NEXT: s_waitcnt vmcnt(2)
; MUBUF-NEXT: buffer_store_dword v12, off, s[0:3], s32 offset:36
; MUBUF-NEXT: buffer_store_dword v11, off, s[0:3], s32 offset:32
; MUBUF-NEXT: buffer_store_dword v12, off, s[0:3], s32 offset:36
; MUBUF-NEXT: s_waitcnt vmcnt(3)
; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:28
; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:24
; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:20
; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:16
; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:12
; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:8
; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:4
; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32
; MUBUF-NEXT: s_waitcnt vmcnt(6)
; MUBUF-NEXT: buffer_store_dword v10, off, s[0:3], s32 offset:12
; MUBUF-NEXT: buffer_store_dword v9, off, s[0:3], s32 offset:8
; MUBUF-NEXT: buffer_store_dword v8, off, s[0:3], s32 offset:4
; MUBUF-NEXT: buffer_store_dword v7, off, s[0:3], s32
; MUBUF-NEXT: buffer_store_dword v10, off, s[0:3], s32 offset:28
; MUBUF-NEXT: buffer_store_dword v9, off, s[0:3], s32 offset:24
; MUBUF-NEXT: buffer_store_dword v8, off, s[0:3], s32 offset:20
; MUBUF-NEXT: buffer_store_dword v7, off, s[0:3], s32 offset:16
; MUBUF-NEXT: ;;#ASMSTART
; MUBUF-NEXT: ; use v0
; MUBUF-NEXT: ;;#ASMEND
@@ -33,16 +33,16 @@ define void @memcpy_fixed_align(ptr addrspace(5) %dst, ptr addrspace(1) %src) {
; FLATSCR-LABEL: memcpy_fixed_align:
; FLATSCR: ; %bb.0:
; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off
; FLATSCR-NEXT: global_load_dwordx4 v[7:10], v[1:2], off offset:16
; FLATSCR-NEXT: global_load_dwordx2 v[11:12], v[1:2], off offset:32
; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
; FLATSCR-NEXT: global_load_dwordx4 v[7:10], v[1:2], off
; FLATSCR-NEXT: v_mov_b32_e32 v0, s32
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[11:12], s32 offset:32
; FLATSCR-NEXT: scratch_store_dwordx4 off, v[3:6], s32
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
; FLATSCR-NEXT: scratch_store_dwordx4 off, v[3:6], s32 offset:16
; FLATSCR-NEXT: scratch_store_dwordx4 off, v[7:10], s32 offset:16
; FLATSCR-NEXT: s_waitcnt vmcnt(2)
; FLATSCR-NEXT: scratch_store_dwordx4 off, v[7:10], s32
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[11:12], s32 offset:32
; FLATSCR-NEXT: ;;#ASMSTART
; FLATSCR-NEXT: ; use v0
; FLATSCR-NEXT: ;;#ASMEND