[AMDGPU] Change SGPR layout to striped caller/callee saved #127353
Conversation
This stack of pull requests is managed by Graphite.
@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout.

Fixes #113782.

Patch is 2.57 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/127353.diff

60 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td b/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
index 80969fce3d77f..e3861a7d06c3d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
@@ -91,7 +91,11 @@ def CSR_AMDGPU_AGPRs : CalleeSavedRegs<
>;
def CSR_AMDGPU_SGPRs : CalleeSavedRegs<
- (sequence "SGPR%u", 30, 105)
+ (add (sequence "SGPR%u", 30, 37),
+ (sequence "SGPR%u", 46, 53),
+ (sequence "SGPR%u", 62, 69),
+ (sequence "SGPR%u", 78, 85),
+ (sequence "SGPR%u", 94, 105))
>;
def CSR_AMDGPU_SI_Gfx_SGPRs : CalleeSavedRegs<
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
index ab2363860af9d..905d0deacab35 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
@@ -125,35 +125,35 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
; CHECK-NEXT: v_mov_b32_e32 v42, v1
-; CHECK-NEXT: v_writelane_b32 v43, s44, 12
+; CHECK-NEXT: v_writelane_b32 v43, s52, 12
; CHECK-NEXT: v_and_b32_e32 v1, 0x7fffffff, v42
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT: v_writelane_b32 v43, s45, 13
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT: v_writelane_b32 v43, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v40, v31
; CHECK-NEXT: v_mov_b32_e32 v41, v2
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_cvt_f64_i32_e32 v[2:3], v41
@@ -161,15 +161,15 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
@@ -179,14 +179,14 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
; CHECK-NEXT: v_or_b32_e32 v1, v2, v1
-; CHECK-NEXT: v_readlane_b32 s45, v43, 13
-; CHECK-NEXT: v_readlane_b32 s44, v43, 12
-; CHECK-NEXT: v_readlane_b32 s43, v43, 11
-; CHECK-NEXT: v_readlane_b32 s42, v43, 10
-; CHECK-NEXT: v_readlane_b32 s41, v43, 9
-; CHECK-NEXT: v_readlane_b32 s40, v43, 8
-; CHECK-NEXT: v_readlane_b32 s39, v43, 7
-; CHECK-NEXT: v_readlane_b32 s38, v43, 6
+; CHECK-NEXT: v_readlane_b32 s53, v43, 13
+; CHECK-NEXT: v_readlane_b32 s52, v43, 12
+; CHECK-NEXT: v_readlane_b32 s51, v43, 11
+; CHECK-NEXT: v_readlane_b32 s50, v43, 10
+; CHECK-NEXT: v_readlane_b32 s49, v43, 9
+; CHECK-NEXT: v_readlane_b32 s48, v43, 8
+; CHECK-NEXT: v_readlane_b32 s47, v43, 7
+; CHECK-NEXT: v_readlane_b32 s46, v43, 6
; CHECK-NEXT: v_readlane_b32 s37, v43, 5
; CHECK-NEXT: v_readlane_b32 s36, v43, 4
; CHECK-NEXT: v_readlane_b32 s35, v43, 3
@@ -266,34 +266,34 @@ define double @test_powr_fast_f64(double %x, double %y) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
-; CHECK-NEXT: v_writelane_b32 v43, s44, 12
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
+; CHECK-NEXT: v_writelane_b32 v43, s52, 12
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s45, 13
+; CHECK-NEXT: v_writelane_b32 v43, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v42, v31
; CHECK-NEXT: v_mov_b32_e32 v41, v3
; CHECK-NEXT: v_mov_b32_e32 v40, v2
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_mul_f64 v[0:1], v[40:41], v[0:1]
@@ -301,28 +301,28 @@ define double @test_powr_fast_f64(double %x, double %y) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v42
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
-; CHECK-NEXT: v_readlane_b32 s45, v43, 13
-; CHECK-NEXT: v_readlane_b32 s44, v43, 12
-; CHECK-NEXT: v_readlane_b32 s43, v43, 11
-; CHECK-NEXT: v_readlane_b32 s42, v43, 10
-; CHECK-NEXT: v_readlane_b32 s41, v43, 9
-; CHECK-NEXT: v_readlane_b32 s40, v43, 8
-; CHECK-NEXT: v_readlane_b32 s39, v43, 7
-; CHECK-NEXT: v_readlane_b32 s38, v43, 6
+; CHECK-NEXT: v_readlane_b32 s53, v43, 13
+; CHECK-NEXT: v_readlane_b32 s52, v43, 12
+; CHECK-NEXT: v_readlane_b32 s51, v43, 11
+; CHECK-NEXT: v_readlane_b32 s50, v43, 10
+; CHECK-NEXT: v_readlane_b32 s49, v43, 9
+; CHECK-NEXT: v_readlane_b32 s48, v43, 8
+; CHECK-NEXT: v_readlane_b32 s47, v43, 7
+; CHECK-NEXT: v_readlane_b32 s46, v43, 6
; CHECK-NEXT: v_readlane_b32 s37, v43, 5
; CHECK-NEXT: v_readlane_b32 s36, v43, 4
; CHECK-NEXT: v_readlane_b32 s35, v43, 3
@@ -409,35 +409,35 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
; CHECK-NEXT: v_mov_b32_e32 v42, v1
-; CHECK-NEXT: v_writelane_b32 v43, s44, 12
+; CHECK-NEXT: v_writelane_b32 v43, s52, 12
; CHECK-NEXT: v_and_b32_e32 v1, 0x7fffffff, v42
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT: v_writelane_b32 v43, s45, 13
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT: v_writelane_b32 v43, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v40, v31
; CHECK-NEXT: v_mov_b32_e32 v41, v2
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_cvt_f64_i32_e32 v[2:3], v41
@@ -445,15 +445,15 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
@@ -463,14 +463,14 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
; CHECK-NEXT: v_or_b32_e32 v1, v2, v1
-; CHECK-NEXT: v_readlane_b32 s45, v43, 13
-; CHECK-NEXT: v_readlane_b32 s44, v43, 12
-; CHECK-NEXT: v_readlane_b32 s43, v43, 11
-; CHECK-NEXT: v_readlane_b32 s42, v43, 10
-; CHECK-NEXT: v_readlane_b32 s41, v43, 9
-; CHECK-NEXT: v_readlane_b32 s40, v43, 8
-; CHECK-NEXT: v_readlane_b32 s39, v43, 7
-; CHECK-NEXT: v_readlane_b32 s38, v43, 6
+; CHECK-NEXT: v_readlane_b32 s53, v43, 13
+; CHECK-NEXT: v_readlane_b32 s52, v43, 12
+; CHECK-NEXT: v_readlane_b32 s51, v43, 11
+; CHECK-NEXT: v_readlane_b32 s50, v43, 10
+; CHECK-NEXT: v_readlane_b32 s49, v43, 9
+; CHECK-NEXT: v_readlane_b32 s48, v43, 8
+; CHECK-NEXT: v_readlane_b32 s47, v43, 7
+; CHECK-NEXT: v_readlane_b32 s46, v43, 6
; CHECK-NEXT: v_readlane_b32 s37, v43, 5
; CHECK-NEXT: v_readlane_b32 s36, v43, 4
; CHECK-NEXT: v_readlane_b32 s35, v43, 3
@@ -552,32 +552,32 @@ define double @test_pown_fast_f64_known_even(double %x, i32 %y.arg) {
; CHECK-NEXT: v_writelane_b32 v42, s35, 3
; CHECK-NEXT: v_writelane_b32 v42, s36, 4
; CHECK-NEXT: v_writelane_b32 v42, s37, 5
-; CHECK-NEXT: v_writelane_b32 v42, s38, 6
-; CHECK-NEXT: v_writelane_b32 v42, s39, 7
+; CHECK-NEXT: v_writelane_b32 v42, s46, 6
+; CHECK-NEXT: v_writelane_b32 v42, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x400
-; CHECK-NEXT: v_writelane_b32 v42, s40, 8
-; CHECK-NEXT: v_writelane_b32 v42, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v42, s48, 8
+; CHECK-NEXT: v_writelane_b32 v42, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v42, s42, 10
-; CHECK-NEXT: v_writelane_b32 v42, s43, 11
-; CHECK-NEXT: v_writelane_b32 v42, s44, 12
+; CHECK-NEXT: v_writelane_b32 v42, s50, 10
+; CHECK-NEXT: v_writelane_b32 v42, s51, 11
+; CHECK-NEXT: v_writelane_b32 v42, s52, 12
; CHECK-NEXT: v_and_b32_e32 v1, 0x7fffffff, v1
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v42, s45, 13
+; CHECK-NEXT: v_writelane_b32 v42, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v40, v31
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: v_lshlrev_b32_e32 v41, 1, v2
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
@@ -586,28 +586,28 @@ define double @test_pown_fast_f64_known_even(double %x, i32 %y.arg) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
-; CHECK-NEXT: v_readlane_b32 s45, v42, 13
-; CHECK-NEXT: v_readlane_b32 s44, v42, 12
-; CHECK-NEXT: v_readlane_b32 s43, v42, 11
-; CHECK-NEXT: v_readlane_b32 s42, v42, 10
-; CHECK-NEXT: v_readlane_b32 s41, v42, 9
-; CHECK-NEXT: v_readlane_b32 s40, v42, 8
-; CHECK-NEXT: v_readlane_b32 s39, v42, 7
-; CHECK-NEXT: v_readlane_b32 s38, v42, 6
+; CHECK-NEXT: v_readlane_b32 s53, v42, 13
+; CHECK-NEXT: v_readlane_b32 s52, v42, 12
+; CHECK-NEXT: v_readlane_b32 s51, v42, 11
+; CHECK-NEXT: v_readlane_b32 s50, v42, 10
+; CHECK-NEXT: v_readlane_b32 s49, v42, 9
+; CHECK-NEXT: v_readlane_b32 s48, v42, 8
+; CHECK-NEXT: v_readlane_b32 s47, v42, 7
+; CHECK-NEXT: v_readlane_b32 s46, v42, 6
; CHECK-NEXT: v_readlane_b32 s37, v42, 5
; CHECK-NEXT: v_readlane_b32 s36, v42, 4
; CHECK-NEXT: v_readlane_b32 s35, v42, 3
@@ -694,34 +694,34 @@ define double @test_pown_fast_f64_known_odd(double %x, i32 %y.arg) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
; CHECK-NEXT: v_mov_b32_e32 v41, v1
-; CHECK-NEXT: ...
[truncated]
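For readers skimming the (truncated) test diff above, the churn follows one mechanical pattern. Here is a minimal sketch of it, inferred from the diff itself rather than stated anywhere in the PR:

```python
# s38-s45 are caller-saved under the new layout, so every callee-saved use
# in the tests shifts by +8 into the s46-s53 stripe.
remap = {f"s{n}": f"s{n + 8}" for n in range(38, 46)}
print(remap)  # {'s38': 's46', 's39': 's47', ..., 's45': 's53'}
```

The 64-bit pairs (e.g. s[38:39] becoming s[46:47]) and the writelane/readlane spill slots move the same way.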
This has passed internal PSDB (except the one test case that I have not updated yet).
Force-pushed from 025e58c to 178dd48.
Sounds good to me
This patch should improve codegen, with fewer SGPR spills for heavy workloads involving device calls. Maybe run the perf PSDB as well? That would give us some initial numbers. You can CP this PR to the staging compiler and then launch the perf PSDB.
Striping SGPRs serves no purpose on GFX10+, where all waves get the full allocation of SGPRs. But hopefully it doesn't do any harm either.
(sequence "SGPR%u", 46, 53), | ||
(sequence "SGPR%u", 62, 69), | ||
(sequence "SGPR%u", 78, 85), | ||
(sequence "SGPR%u", 94, 105)) | ||
>; | ||
|
||
def CSR_AMDGPU_SI_Gfx_SGPRs : CalleeSavedRegs< |
@Flakebi should we make a similar change here for amdgpu_gfx?
I think both options are fine (changing it or leaving it for now). amdgpu_gfx already has caller-saves that are not used for arguments, so it's not hit by this bug.
The important part is that amdgpu_gfx wants the SGPR arguments to be in callee-save registers. I assume compute would likely benefit from having SGPR args in callee-saves as well, as they usually contain constant data, but it’s not there yet.
Once the C calling convention does that, we can probably ditch amdgpu_gfx and switch to the C calling conv for graphics.
> I assume compute would likely benefit from having SGPR args in callee-saves as well, as they usually contain constant data

I'll include this part in the next step.

This isn't intended as a performance improvement anyway. I'll request a performance cycle.
The only reason for doing striping is to get roughly the same ratio of callee-saves to non-callee-saves at different occupancies. Why would you want to keep that ratio constant, if not for performance? Anyway, that reason does not apply on GFX10+.
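For concreteness, here is a rough sketch of that ratio argument (mine, not from the PR; the per-wave SGPR budgets below are illustrative assumptions, and the stripe ranges are taken from the diff above):

```python
contiguous = set(range(30, 106))  # old layout: s30-s105 all callee-saved
striped = set()
for lo, hi in ((30, 37), (46, 53), (62, 69), (78, 85), (94, 105)):
    striped |= set(range(lo, hi + 1))  # new callee-saved stripes

for budget in (48, 64, 80, 106):  # hypothetical SGPRs available per wave
    avail = set(range(budget))
    print(f"budget {budget:3}: contiguous CSRs {len(contiguous & avail):2}, "
          f"striped CSRs {len(striped & avail):2}")
```

With the contiguous layout, a smaller budget leaves almost only callee-saved registers above the argument range; striping keeps a mix of both kinds at every budget.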
Force-pushed from 178dd48 to 5478409.
Will do.
Is the test from #113782 buried somewhere in this giant test diff?
Yes. The check lines of that test have been updated: it no longer crashes, but another error is emitted. I'll fix the new issue in a follow-up.
A full testing cycle has been requested. Will comment here afterwards.
Just got the results from a full cycle. There are no correctness issues and no performance regressions. I'm not sure if there's any performance improvement, though. That being said, this PR should be in good shape to go.
This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. The stripe width is set to 8. Fixes #113782.
Force-pushed from 5478409 to 1bde981.
You can test this locally with the following command:

```shell
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' d08cf7900d2aaff9e7483ea74a58871edbdc45f2 1bde981f60a8014728012b4b19dd73072a41bd48 llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/branch-relax-spill.ll llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll llvm/test/CodeGen/AMDGPU/call-args-inreg.ll llvm/test/CodeGen/AMDGPU/call-argument-types.ll llvm/test/CodeGen/AMDGPU/call-preserved-registers.ll llvm/test/CodeGen/AMDGPU/callee-frame-setup.ll llvm/test/CodeGen/AMDGPU/ds_read2.ll llvm/test/CodeGen/AMDGPU/dwarf-multi-register-use-crash.ll llvm/test/CodeGen/AMDGPU/function-args-inreg.ll llvm/test/CodeGen/AMDGPU/function-resource-usage.ll llvm/test/CodeGen/AMDGPU/gfx-call-non-gfx-func.ll llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/indirect-call.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll llvm/test/CodeGen/AMDGPU/mcexpr-knownbits-assign-crash-gh-issue-110930.ll llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll llvm/test/CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg.ll llvm/test/CodeGen/AMDGPU/select.f16.ll llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v8i64.ll llvm/test/CodeGen/AMDGPU/sibling-call.ll llvm/test/CodeGen/AMDGPU/spill_more_than_wavesize_csr_sgprs.ll llvm/test/CodeGen/AMDGPU/stack-realign.ll llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll llvm/test/CodeGen/AMDGPU/vgpr-large-tuple-alloc-error.ll
```

The following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

```llvm
define void @fn() {
  ...
  br i1 undef, ...
}
```

Please use the following instead:

```llvm
define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}
```

Please refer to the Undefined Behavior Manual for more information.
Hi @shiltian, I think this change introduced some failures on this buildbot, which went unnoticed because it was already red. Could you please have a look? :)
So, with this patch, e.g. llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll fails with a machine verifier error (also seen if you compile with EXPENSIVE_CHECKS and run the lit tests, as the failed buildbot shows).
@mikaelholmen @rovka Thanks for the information. I'll take a look right away.
Right before this PR,
This PR fixes test failures introduced in #127353 when expensive checks are enabled.
This PR fixes test failures introduced in #127353 when expensive checks are enabled. For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59` is no longer in live-ins because it is caller saved. Switch to `s55` in this PR.
#130644)" As suggested on 5ec884e#commitcomment-153707488 this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON: LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll > This PR fixes test failures introduced in #127353 when expensive checks > are enabled. > > For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and > `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59` > is no longer in live-ins because it is caller saved. Switch to `s55` in > this PR.
llvm#130644)" As suggested on llvm@5ec884e#commitcomment-153707488 this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON: LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll > This PR fixes test failures introduced in llvm#127353 when expensive checks > are enabled. > > For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and > `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59` > is no longer in live-ins because it is caller saved. Switch to `s55` in > this PR.
This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and s34 (base pointer) remain callee-saved, the striped layout starts from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to avoid ending with a 2-wide stripe. Fixes llvm#113782. (cherry picked from commit a779af3)
This PR updates the SGPR layout to a striped caller/callee-saved design, similar
to the VGPR layout.
To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout starts
from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to
avoid ending with a 2-wide stripe.
Fixes #113782.
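A small sketch of the resulting callee-saved set as I read that description (an assumption about the final layout, not code taken from the PR): s30-s39 remain callee-saved, 8-wide caller/callee stripes alternate starting at s40, and the final callee-saved stripe is widened to 10 so it ends at s105.

```python
stripes = [(30, 39), (48, 55), (64, 71), (80, 87), (96, 105)]  # callee-saved
callee_saved = [f"s{n}" for lo, hi in stripes for n in range(lo, hi + 1)]
assert len(callee_saved) == 44  # 10 + 8 + 8 + 8 + 10
print(", ".join(f"s{lo}-s{hi}" for lo, hi in stripes))
```

The gaps between those stripes (s40-s47, s56-s63, s72-s79, s88-s95) are the caller-saved runs that replace the old contiguous s30-s105 callee-saved block.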