[AMDGPU] Change SGPR layout to striped caller/callee saved #127353


Merged: 3 commits from users/shiltian/striped-sgpr-cc into main on Mar 8, 2025

Conversation

@shiltian (Contributor) commented Feb 15, 2025

This PR updates the SGPR layout to a striped caller/callee-saved design, similar
to the VGPR layout.

To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout starts
from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to
avoid ending with a 2-wide stripe.

Fixes #113782.
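
For reference, here is a minimal Python sketch, not part of the patch, that expands the callee-saved SGPR ranges exactly as written in the updated CSR_AMDGPU_SGPRs definition below and checks that the registers listed above stay callee-saved; the range endpoints come from the diff, everything else is illustrative.

# Callee-saved SGPR ranges from the updated CSR_AMDGPU_SGPRs definition
# in AMDGPUCallingConv.td (inclusive endpoints, taken from the diff below).
CSR_RANGES = [(30, 37), (46, 53), (62, 69), (78, 85), (94, 105)]

callee_saved = {n for lo, hi in CSR_RANGES for n in range(lo, hi + 1)}

# s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and
# s34 (base pointer) must remain callee-saved.
assert {30, 31, 32, 33, 34} <= callee_saved

# Every other SGPR in the striped region s30-s105 falls into a caller-saved stripe.
caller_saved_stripes = [n for n in range(30, 106) if n not in callee_saved]
print("callee-saved SGPRs:", sorted(callee_saved))
print("caller-saved SGPRs in the striped region:", caller_saved_stripes)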

@shiltian (Contributor, author) commented Feb 15, 2025

@llvmbot (Member) commented Feb 15, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

This PR updates the SGPR layout to a striped caller/callee-saved design, similar
to the VGPR layout. The stripe width is set to 8.

Fixes #113782.


Patch is 2.57 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127353.diff

60 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll (+145-145)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+90-245)
  • (modified) llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+203-201)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relax-spill.ll (+73-140)
  • (modified) llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/call-args-inreg.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+1256-1256)
  • (modified) llvm/test/CodeGen/AMDGPU/call-preserved-registers.ll (+20-14)
  • (modified) llvm/test/CodeGen/AMDGPU/callee-frame-setup.ll (+788-1549)
  • (modified) llvm/test/CodeGen/AMDGPU/csr-sgpr-spill-live-ins.mir (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/dwarf-multi-register-use-crash.ll (+36-36)
  • (modified) llvm/test/CodeGen/AMDGPU/eliminate-frame-index-s-mov-b32.mir (+26-27)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args-inreg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/function-resource-usage.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-call-non-gfx-func.ll (+66-2)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+80-208)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1834-1834)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1554-1554)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1554-1554)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1834-1834)
  • (modified) llvm/test/CodeGen/AMDGPU/greedy-alloc-fail-sgpr1024-spill.mir (+64-62)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+55-91)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+492-748)
  • (modified) llvm/test/CodeGen/AMDGPU/issue48473.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll (+6-39)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+18-63)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll (+6-39)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+18-63)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-work-group-id-intrinsics-hsa.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+160-160)
  • (modified) llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll (+68-774)
  • (modified) llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll (+416-1095)
  • (modified) llvm/test/CodeGen/AMDGPU/mcexpr-knownbits-assign-crash-gh-issue-110930.ll (+13-13)
  • (modified) llvm/test/CodeGen/AMDGPU/pei-scavenge-sgpr-carry-out.mir (+28-58)
  • (modified) llvm/test/CodeGen/AMDGPU/pei-scavenge-sgpr-gfx9.mir (+17-39)
  • (modified) llvm/test/CodeGen/AMDGPU/pei-scavenge-sgpr.mir (+9-21)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+223-223)
  • (modified) llvm/test/CodeGen/AMDGPU/ran-out-of-sgprs-allocation-failure.mir (+120-86)
  • (modified) llvm/test/CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+2-13)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-spill-update-only-slot-indexes.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v8i64.ll (+672-1568)
  • (modified) llvm/test/CodeGen/AMDGPU/sibling-call.ll (+120-120)
  • (modified) llvm/test/CodeGen/AMDGPU/snippet-copy-bundle-regression.mir (+38-17)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-sgpr-to-virtual-vgpr.mir (+11-27)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-sgpr-used-for-exec-copy.mir (+3-8)
  • (modified) llvm/test/CodeGen/AMDGPU/spill_more_than_wavesize_csr_sgprs.ll (+132-264)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-copy-bundle.mir (+107-93)
  • (modified) llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/stack-realign.ll (+7-13)
  • (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+189-144)
  • (modified) llvm/test/CodeGen/AMDGPU/unallocatable-bundle-regression.mir (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll (+106-106)
  • (modified) llvm/test/CodeGen/AMDGPU/use_restore_frame_reg.mir (+25-51)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-large-tuple-alloc-error.ll (+112-240)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/spill-phys-vgprs.mir (+1-2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td b/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
index 80969fce3d77f..e3861a7d06c3d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
@@ -91,7 +91,11 @@ def CSR_AMDGPU_AGPRs : CalleeSavedRegs<
 >;
 
 def CSR_AMDGPU_SGPRs : CalleeSavedRegs<
-  (sequence "SGPR%u", 30, 105)
+  (add (sequence "SGPR%u", 30, 37),
+       (sequence "SGPR%u", 46, 53),
+       (sequence "SGPR%u", 62, 69),
+       (sequence "SGPR%u", 78, 85),
+       (sequence "SGPR%u", 94, 105))
 >;
 
 def CSR_AMDGPU_SI_Gfx_SGPRs : CalleeSavedRegs<
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
index ab2363860af9d..905d0deacab35 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
@@ -125,35 +125,35 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
 ; CHECK-NEXT:    v_writelane_b32 v43, s35, 3
 ; CHECK-NEXT:    v_writelane_b32 v43, s36, 4
 ; CHECK-NEXT:    v_writelane_b32 v43, s37, 5
-; CHECK-NEXT:    v_writelane_b32 v43, s38, 6
-; CHECK-NEXT:    v_writelane_b32 v43, s39, 7
+; CHECK-NEXT:    v_writelane_b32 v43, s46, 6
+; CHECK-NEXT:    v_writelane_b32 v43, s47, 7
 ; CHECK-NEXT:    s_addk_i32 s32, 0x800
-; CHECK-NEXT:    v_writelane_b32 v43, s40, 8
-; CHECK-NEXT:    v_writelane_b32 v43, s41, 9
-; CHECK-NEXT:    s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT:    v_writelane_b32 v43, s48, 8
+; CHECK-NEXT:    v_writelane_b32 v43, s49, 9
+; CHECK-NEXT:    s_mov_b64 s[48:49], s[4:5]
 ; CHECK-NEXT:    s_getpc_b64 s[4:5]
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    v_writelane_b32 v43, s42, 10
+; CHECK-NEXT:    v_writelane_b32 v43, s50, 10
 ; CHECK-NEXT:    buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT:    v_writelane_b32 v43, s43, 11
+; CHECK-NEXT:    v_writelane_b32 v43, s51, 11
 ; CHECK-NEXT:    v_mov_b32_e32 v42, v1
-; CHECK-NEXT:    v_writelane_b32 v43, s44, 12
+; CHECK-NEXT:    v_writelane_b32 v43, s52, 12
 ; CHECK-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v42
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT:    v_writelane_b32 v43, s45, 13
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT:    v_writelane_b32 v43, s53, 13
 ; CHECK-NEXT:    v_mov_b32_e32 v40, v31
 ; CHECK-NEXT:    v_mov_b32_e32 v41, v2
-; CHECK-NEXT:    s_mov_b32 s42, s15
-; CHECK-NEXT:    s_mov_b32 s43, s14
-; CHECK-NEXT:    s_mov_b32 s44, s13
-; CHECK-NEXT:    s_mov_b32 s45, s12
+; CHECK-NEXT:    s_mov_b32 s50, s15
+; CHECK-NEXT:    s_mov_b32 s51, s14
+; CHECK-NEXT:    s_mov_b32 s52, s13
+; CHECK-NEXT:    s_mov_b32 s53, s12
 ; CHECK-NEXT:    s_mov_b64 s[34:35], s[10:11]
 ; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT:    s_mov_b64 s[46:47], s[6:7]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_cvt_f64_i32_e32 v[2:3], v41
@@ -161,15 +161,15 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
 ; CHECK-NEXT:    v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT:    s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[6:7], s[46:47]
 ; CHECK-NEXT:    s_mov_b64 s[8:9], s[36:37]
 ; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT:    s_mov_b32 s12, s45
-; CHECK-NEXT:    s_mov_b32 s13, s44
-; CHECK-NEXT:    s_mov_b32 s14, s43
-; CHECK-NEXT:    s_mov_b32 s15, s42
+; CHECK-NEXT:    s_mov_b32 s12, s53
+; CHECK-NEXT:    s_mov_b32 s13, s52
+; CHECK-NEXT:    s_mov_b32 s14, s51
+; CHECK-NEXT:    s_mov_b32 s15, s50
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
@@ -179,14 +179,14 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
 ; CHECK-NEXT:    buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
 ; CHECK-NEXT:    v_or_b32_e32 v1, v2, v1
-; CHECK-NEXT:    v_readlane_b32 s45, v43, 13
-; CHECK-NEXT:    v_readlane_b32 s44, v43, 12
-; CHECK-NEXT:    v_readlane_b32 s43, v43, 11
-; CHECK-NEXT:    v_readlane_b32 s42, v43, 10
-; CHECK-NEXT:    v_readlane_b32 s41, v43, 9
-; CHECK-NEXT:    v_readlane_b32 s40, v43, 8
-; CHECK-NEXT:    v_readlane_b32 s39, v43, 7
-; CHECK-NEXT:    v_readlane_b32 s38, v43, 6
+; CHECK-NEXT:    v_readlane_b32 s53, v43, 13
+; CHECK-NEXT:    v_readlane_b32 s52, v43, 12
+; CHECK-NEXT:    v_readlane_b32 s51, v43, 11
+; CHECK-NEXT:    v_readlane_b32 s50, v43, 10
+; CHECK-NEXT:    v_readlane_b32 s49, v43, 9
+; CHECK-NEXT:    v_readlane_b32 s48, v43, 8
+; CHECK-NEXT:    v_readlane_b32 s47, v43, 7
+; CHECK-NEXT:    v_readlane_b32 s46, v43, 6
 ; CHECK-NEXT:    v_readlane_b32 s37, v43, 5
 ; CHECK-NEXT:    v_readlane_b32 s36, v43, 4
 ; CHECK-NEXT:    v_readlane_b32 s35, v43, 3
@@ -266,34 +266,34 @@ define double @test_powr_fast_f64(double %x, double %y) {
 ; CHECK-NEXT:    v_writelane_b32 v43, s35, 3
 ; CHECK-NEXT:    v_writelane_b32 v43, s36, 4
 ; CHECK-NEXT:    v_writelane_b32 v43, s37, 5
-; CHECK-NEXT:    v_writelane_b32 v43, s38, 6
-; CHECK-NEXT:    v_writelane_b32 v43, s39, 7
+; CHECK-NEXT:    v_writelane_b32 v43, s46, 6
+; CHECK-NEXT:    v_writelane_b32 v43, s47, 7
 ; CHECK-NEXT:    s_addk_i32 s32, 0x800
-; CHECK-NEXT:    v_writelane_b32 v43, s40, 8
-; CHECK-NEXT:    v_writelane_b32 v43, s41, 9
-; CHECK-NEXT:    s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT:    v_writelane_b32 v43, s48, 8
+; CHECK-NEXT:    v_writelane_b32 v43, s49, 9
+; CHECK-NEXT:    s_mov_b64 s[48:49], s[4:5]
 ; CHECK-NEXT:    s_getpc_b64 s[4:5]
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    v_writelane_b32 v43, s42, 10
-; CHECK-NEXT:    v_writelane_b32 v43, s43, 11
-; CHECK-NEXT:    v_writelane_b32 v43, s44, 12
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT:    v_writelane_b32 v43, s50, 10
+; CHECK-NEXT:    v_writelane_b32 v43, s51, 11
+; CHECK-NEXT:    v_writelane_b32 v43, s52, 12
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
 ; CHECK-NEXT:    buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT:    v_writelane_b32 v43, s45, 13
+; CHECK-NEXT:    v_writelane_b32 v43, s53, 13
 ; CHECK-NEXT:    v_mov_b32_e32 v42, v31
 ; CHECK-NEXT:    v_mov_b32_e32 v41, v3
 ; CHECK-NEXT:    v_mov_b32_e32 v40, v2
-; CHECK-NEXT:    s_mov_b32 s42, s15
-; CHECK-NEXT:    s_mov_b32 s43, s14
-; CHECK-NEXT:    s_mov_b32 s44, s13
-; CHECK-NEXT:    s_mov_b32 s45, s12
+; CHECK-NEXT:    s_mov_b32 s50, s15
+; CHECK-NEXT:    s_mov_b32 s51, s14
+; CHECK-NEXT:    s_mov_b32 s52, s13
+; CHECK-NEXT:    s_mov_b32 s53, s12
 ; CHECK-NEXT:    s_mov_b64 s[34:35], s[10:11]
 ; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT:    s_mov_b64 s[46:47], s[6:7]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_mul_f64 v[0:1], v[40:41], v[0:1]
@@ -301,28 +301,28 @@ define double @test_powr_fast_f64(double %x, double %y) {
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT:    s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT:    s_mov_b64 s[6:7], s[46:47]
 ; CHECK-NEXT:    s_mov_b64 s[8:9], s[36:37]
 ; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT:    s_mov_b32 s12, s45
-; CHECK-NEXT:    s_mov_b32 s13, s44
-; CHECK-NEXT:    s_mov_b32 s14, s43
-; CHECK-NEXT:    s_mov_b32 s15, s42
+; CHECK-NEXT:    s_mov_b32 s12, s53
+; CHECK-NEXT:    s_mov_b32 s13, s52
+; CHECK-NEXT:    s_mov_b32 s14, s51
+; CHECK-NEXT:    s_mov_b32 s15, s50
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v42
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
-; CHECK-NEXT:    v_readlane_b32 s45, v43, 13
-; CHECK-NEXT:    v_readlane_b32 s44, v43, 12
-; CHECK-NEXT:    v_readlane_b32 s43, v43, 11
-; CHECK-NEXT:    v_readlane_b32 s42, v43, 10
-; CHECK-NEXT:    v_readlane_b32 s41, v43, 9
-; CHECK-NEXT:    v_readlane_b32 s40, v43, 8
-; CHECK-NEXT:    v_readlane_b32 s39, v43, 7
-; CHECK-NEXT:    v_readlane_b32 s38, v43, 6
+; CHECK-NEXT:    v_readlane_b32 s53, v43, 13
+; CHECK-NEXT:    v_readlane_b32 s52, v43, 12
+; CHECK-NEXT:    v_readlane_b32 s51, v43, 11
+; CHECK-NEXT:    v_readlane_b32 s50, v43, 10
+; CHECK-NEXT:    v_readlane_b32 s49, v43, 9
+; CHECK-NEXT:    v_readlane_b32 s48, v43, 8
+; CHECK-NEXT:    v_readlane_b32 s47, v43, 7
+; CHECK-NEXT:    v_readlane_b32 s46, v43, 6
 ; CHECK-NEXT:    v_readlane_b32 s37, v43, 5
 ; CHECK-NEXT:    v_readlane_b32 s36, v43, 4
 ; CHECK-NEXT:    v_readlane_b32 s35, v43, 3
@@ -409,35 +409,35 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
 ; CHECK-NEXT:    v_writelane_b32 v43, s35, 3
 ; CHECK-NEXT:    v_writelane_b32 v43, s36, 4
 ; CHECK-NEXT:    v_writelane_b32 v43, s37, 5
-; CHECK-NEXT:    v_writelane_b32 v43, s38, 6
-; CHECK-NEXT:    v_writelane_b32 v43, s39, 7
+; CHECK-NEXT:    v_writelane_b32 v43, s46, 6
+; CHECK-NEXT:    v_writelane_b32 v43, s47, 7
 ; CHECK-NEXT:    s_addk_i32 s32, 0x800
-; CHECK-NEXT:    v_writelane_b32 v43, s40, 8
-; CHECK-NEXT:    v_writelane_b32 v43, s41, 9
-; CHECK-NEXT:    s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT:    v_writelane_b32 v43, s48, 8
+; CHECK-NEXT:    v_writelane_b32 v43, s49, 9
+; CHECK-NEXT:    s_mov_b64 s[48:49], s[4:5]
 ; CHECK-NEXT:    s_getpc_b64 s[4:5]
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    v_writelane_b32 v43, s42, 10
+; CHECK-NEXT:    v_writelane_b32 v43, s50, 10
 ; CHECK-NEXT:    buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT:    v_writelane_b32 v43, s43, 11
+; CHECK-NEXT:    v_writelane_b32 v43, s51, 11
 ; CHECK-NEXT:    v_mov_b32_e32 v42, v1
-; CHECK-NEXT:    v_writelane_b32 v43, s44, 12
+; CHECK-NEXT:    v_writelane_b32 v43, s52, 12
 ; CHECK-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v42
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT:    v_writelane_b32 v43, s45, 13
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT:    v_writelane_b32 v43, s53, 13
 ; CHECK-NEXT:    v_mov_b32_e32 v40, v31
 ; CHECK-NEXT:    v_mov_b32_e32 v41, v2
-; CHECK-NEXT:    s_mov_b32 s42, s15
-; CHECK-NEXT:    s_mov_b32 s43, s14
-; CHECK-NEXT:    s_mov_b32 s44, s13
-; CHECK-NEXT:    s_mov_b32 s45, s12
+; CHECK-NEXT:    s_mov_b32 s50, s15
+; CHECK-NEXT:    s_mov_b32 s51, s14
+; CHECK-NEXT:    s_mov_b32 s52, s13
+; CHECK-NEXT:    s_mov_b32 s53, s12
 ; CHECK-NEXT:    s_mov_b64 s[34:35], s[10:11]
 ; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT:    s_mov_b64 s[46:47], s[6:7]
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_cvt_f64_i32_e32 v[2:3], v41
@@ -445,15 +445,15 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
 ; CHECK-NEXT:    v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT:    s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[6:7], s[46:47]
 ; CHECK-NEXT:    s_mov_b64 s[8:9], s[36:37]
 ; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT:    s_mov_b32 s12, s45
-; CHECK-NEXT:    s_mov_b32 s13, s44
-; CHECK-NEXT:    s_mov_b32 s14, s43
-; CHECK-NEXT:    s_mov_b32 s15, s42
+; CHECK-NEXT:    s_mov_b32 s12, s53
+; CHECK-NEXT:    s_mov_b32 s13, s52
+; CHECK-NEXT:    s_mov_b32 s14, s51
+; CHECK-NEXT:    s_mov_b32 s15, s50
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
@@ -463,14 +463,14 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
 ; CHECK-NEXT:    buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
 ; CHECK-NEXT:    v_or_b32_e32 v1, v2, v1
-; CHECK-NEXT:    v_readlane_b32 s45, v43, 13
-; CHECK-NEXT:    v_readlane_b32 s44, v43, 12
-; CHECK-NEXT:    v_readlane_b32 s43, v43, 11
-; CHECK-NEXT:    v_readlane_b32 s42, v43, 10
-; CHECK-NEXT:    v_readlane_b32 s41, v43, 9
-; CHECK-NEXT:    v_readlane_b32 s40, v43, 8
-; CHECK-NEXT:    v_readlane_b32 s39, v43, 7
-; CHECK-NEXT:    v_readlane_b32 s38, v43, 6
+; CHECK-NEXT:    v_readlane_b32 s53, v43, 13
+; CHECK-NEXT:    v_readlane_b32 s52, v43, 12
+; CHECK-NEXT:    v_readlane_b32 s51, v43, 11
+; CHECK-NEXT:    v_readlane_b32 s50, v43, 10
+; CHECK-NEXT:    v_readlane_b32 s49, v43, 9
+; CHECK-NEXT:    v_readlane_b32 s48, v43, 8
+; CHECK-NEXT:    v_readlane_b32 s47, v43, 7
+; CHECK-NEXT:    v_readlane_b32 s46, v43, 6
 ; CHECK-NEXT:    v_readlane_b32 s37, v43, 5
 ; CHECK-NEXT:    v_readlane_b32 s36, v43, 4
 ; CHECK-NEXT:    v_readlane_b32 s35, v43, 3
@@ -552,32 +552,32 @@ define double @test_pown_fast_f64_known_even(double %x, i32 %y.arg) {
 ; CHECK-NEXT:    v_writelane_b32 v42, s35, 3
 ; CHECK-NEXT:    v_writelane_b32 v42, s36, 4
 ; CHECK-NEXT:    v_writelane_b32 v42, s37, 5
-; CHECK-NEXT:    v_writelane_b32 v42, s38, 6
-; CHECK-NEXT:    v_writelane_b32 v42, s39, 7
+; CHECK-NEXT:    v_writelane_b32 v42, s46, 6
+; CHECK-NEXT:    v_writelane_b32 v42, s47, 7
 ; CHECK-NEXT:    s_addk_i32 s32, 0x400
-; CHECK-NEXT:    v_writelane_b32 v42, s40, 8
-; CHECK-NEXT:    v_writelane_b32 v42, s41, 9
-; CHECK-NEXT:    s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT:    v_writelane_b32 v42, s48, 8
+; CHECK-NEXT:    v_writelane_b32 v42, s49, 9
+; CHECK-NEXT:    s_mov_b64 s[48:49], s[4:5]
 ; CHECK-NEXT:    s_getpc_b64 s[4:5]
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    v_writelane_b32 v42, s42, 10
-; CHECK-NEXT:    v_writelane_b32 v42, s43, 11
-; CHECK-NEXT:    v_writelane_b32 v42, s44, 12
+; CHECK-NEXT:    v_writelane_b32 v42, s50, 10
+; CHECK-NEXT:    v_writelane_b32 v42, s51, 11
+; CHECK-NEXT:    v_writelane_b32 v42, s52, 12
 ; CHECK-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
 ; CHECK-NEXT:    buffer_store_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT:    v_writelane_b32 v42, s45, 13
+; CHECK-NEXT:    v_writelane_b32 v42, s53, 13
 ; CHECK-NEXT:    v_mov_b32_e32 v40, v31
-; CHECK-NEXT:    s_mov_b32 s42, s15
-; CHECK-NEXT:    s_mov_b32 s43, s14
-; CHECK-NEXT:    s_mov_b32 s44, s13
-; CHECK-NEXT:    s_mov_b32 s45, s12
+; CHECK-NEXT:    s_mov_b32 s50, s15
+; CHECK-NEXT:    s_mov_b32 s51, s14
+; CHECK-NEXT:    s_mov_b32 s52, s13
+; CHECK-NEXT:    s_mov_b32 s53, s12
 ; CHECK-NEXT:    s_mov_b64 s[34:35], s[10:11]
 ; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT:    s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT:    s_mov_b64 s[46:47], s[6:7]
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v41, 1, v2
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
@@ -586,28 +586,28 @@ define double @test_pown_fast_f64_known_even(double %x, i32 %y.arg) {
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[48:49]
 ; CHECK-NEXT:    v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT:    s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[6:7], s[46:47]
 ; CHECK-NEXT:    s_mov_b64 s[8:9], s[36:37]
 ; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT:    s_mov_b32 s12, s45
-; CHECK-NEXT:    s_mov_b32 s13, s44
-; CHECK-NEXT:    s_mov_b32 s14, s43
-; CHECK-NEXT:    s_mov_b32 s15, s42
+; CHECK-NEXT:    s_mov_b32 s12, s53
+; CHECK-NEXT:    s_mov_b32 s13, s52
+; CHECK-NEXT:    s_mov_b32 s14, s51
+; CHECK-NEXT:    s_mov_b32 s15, s50
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
-; CHECK-NEXT:    v_readlane_b32 s45, v42, 13
-; CHECK-NEXT:    v_readlane_b32 s44, v42, 12
-; CHECK-NEXT:    v_readlane_b32 s43, v42, 11
-; CHECK-NEXT:    v_readlane_b32 s42, v42, 10
-; CHECK-NEXT:    v_readlane_b32 s41, v42, 9
-; CHECK-NEXT:    v_readlane_b32 s40, v42, 8
-; CHECK-NEXT:    v_readlane_b32 s39, v42, 7
-; CHECK-NEXT:    v_readlane_b32 s38, v42, 6
+; CHECK-NEXT:    v_readlane_b32 s53, v42, 13
+; CHECK-NEXT:    v_readlane_b32 s52, v42, 12
+; CHECK-NEXT:    v_readlane_b32 s51, v42, 11
+; CHECK-NEXT:    v_readlane_b32 s50, v42, 10
+; CHECK-NEXT:    v_readlane_b32 s49, v42, 9
+; CHECK-NEXT:    v_readlane_b32 s48, v42, 8
+; CHECK-NEXT:    v_readlane_b32 s47, v42, 7
+; CHECK-NEXT:    v_readlane_b32 s46, v42, 6
 ; CHECK-NEXT:    v_readlane_b32 s37, v42, 5
 ; CHECK-NEXT:    v_readlane_b32 s36, v42, 4
 ; CHECK-NEXT:    v_readlane_b32 s35, v42, 3
@@ -694,34 +694,34 @@ define double @test_pown_fast_f64_known_odd(double %x, i32 %y.arg) {
 ; CHECK-NEXT:    v_writelane_b32 v43, s35, 3
 ; CHECK-NEXT:    v_writelane_b32 v43, s36, 4
 ; CHECK-NEXT:    v_writelane_b32 v43, s37, 5
-; CHECK-NEXT:    v_writelane_b32 v43, s38, 6
-; CHECK-NEXT:    v_writelane_b32 v43, s39, 7
+; CHECK-NEXT:    v_writelane_b32 v43, s46, 6
+; CHECK-NEXT:    v_writelane_b32 v43, s47, 7
 ; CHECK-NEXT:    s_addk_i32 s32, 0x800
-; CHECK-NEXT:    v_writelane_b32 v43, s40, 8
-; CHECK-NEXT:    v_writelane_b32 v43, s41, 9
-; CHECK-NEXT:    s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT:    v_writelane_b32 v43, s48, 8
+; CHECK-NEXT:    v_writelane_b32 v43, s49, 9
+; CHECK-NEXT:    s_mov_b64 s[48:49], s[4:5]
 ; CHECK-NEXT:    s_getpc_b64 s[4:5]
 ; CHECK-NEXT:    s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
 ; CHECK-NEXT:    s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT:    v_writelane_b32 v43, s42, 10
+; CHECK-NEXT:    v_writelane_b32 v43, s50, 10
 ; CHECK-NEXT:    buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
 ; CHECK-NEXT:    buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT:    v_writelane_b32 v43, s43, 11
+; CHECK-NEXT:    v_writelane_b32 v43, s51, 11
 ; CHECK-NEXT:    v_mov_b32_e32 v41, v1
-; CHECK-NEXT:   ...
[truncated]

@shiltian (Contributor, author):

This has passed internal PSDB (except for the one test case that I have not updated yet).

Base automatically changed from users/shiltian/autogen-tests-for-for-striped-sgrp-cc to main on February 17, 2025 16:22
@shiltian force-pushed the users/shiltian/striped-sgpr-cc branch from 025e58c to 178dd48 on February 17, 2025 16:23
@Flakebi (Member) left a comment:

Sounds good to me

@cdevadas (Collaborator):

This patch would improve codegen by reducing the number of SGPR spills for heavy workloads involving device calls. Maybe run the perf PSDB as well? That would give us some initial numbers. You can CP this PR to the staging compiler and then launch the perf PSDB.

@jayfoad (Contributor) left a comment:

Striping SGPRs serves no purpose on GFX10+ where all waves get the full allocation of SGPRs. But hopefully it doesn't do any harm either.

(sequence "SGPR%u", 46, 53),
(sequence "SGPR%u", 62, 69),
(sequence "SGPR%u", 78, 85),
(sequence "SGPR%u", 94, 105))
>;

def CSR_AMDGPU_SI_Gfx_SGPRs : CalleeSavedRegs<
Comment from a Contributor:

@Flakebi should we make some similar change here for amdgpu_gfx?

Reply from a Member:

I think both options are fine (changing it or leaving it for now). amdgpu_gfx already has caller-saves that are not used for arguments, so it’s not hit by this bug.

The important part is that amdgpu_gfx wants the SGPR arguments to be in callee-save registers. I assume compute would likely benefit from having SGPR args in callee-saves as well, as they usually contain constant data, but it’s not there yet.
Once the C calling convention does that, we can probably ditch amdgpu_gfx and switch to the C calling conv for graphics.

Reply from @shiltian (Contributor, author):

I assume compute would likely benefit from having SGPR args in callee-saves as well, as they usually contain constant data

I'll include this part in the next step.

@shiltian (Contributor, author):

Striping SGPRs serves no purpose on GFX10+ where all waves get the full allocation of SGPRs. But hopefully it doesn't do any harm either.

This doesn't serve as a performance improvement anyway. I'll request a performance cycle.

@jayfoad (Contributor) commented Feb 18, 2025

Striping SGPRs serves no purpose on GFX10+ where all waves get the full allocation of SGPRs. But hopefully it doesn't do any harm either.

This doesn't serve as a performance improvement anyway. I'll request a performance cycle.

The only reason for doing striping is to get roughly the same ratio of callee-saves / non-callee-saves at different occupancies. Why would you want to keep that ratio constant, if not for performance?

Anyway, that reason does not apply on GFX10+.
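
To make the ratio argument concrete, here is a rough, editor-added Python sketch: the callee-saved ranges come from the patch, while the per-wave SGPR budgets used below are only illustrative placeholders for different occupancy levels, not exact hardware figures.

# Callee-saved SGPR stripes from this patch (inclusive endpoints).
STRIPED_CSR = [(30, 37), (46, 53), (62, 69), (78, 85), (94, 105)]
# The previous layout: one contiguous callee-saved block s30-s105.
CONTIGUOUS_CSR = [(30, 105)]

def csr_count(ranges, budget):
    # Number of callee-saved SGPRs among s0 .. s<budget-1>.
    return sum(max(0, min(hi, budget - 1) - lo + 1) for lo, hi in ranges)

# Illustrative per-wave SGPR budgets (placeholders, not hardware-exact);
# a smaller budget stands in for higher occupancy on pre-GFX10 parts.
for budget in (80, 96, 106):
    for name, ranges in (("striped", STRIPED_CSR), ("contiguous", CONTIGUOUS_CSR)):
        csr = csr_count(ranges, budget)
        print(f"budget={budget:3} {name:10}: callee-saved={csr:2}, caller-saved={budget - csr:2}")

# With the contiguous block, the caller-saved pool is pinned at s0-s29 no
# matter how many SGPRs a wave gets; with striping, both classes scale
# with the budget, which is the ratio point made above.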

@shiltian force-pushed the users/shiltian/striped-sgpr-cc branch from 178dd48 to 5478409 on February 18, 2025 17:48
@shiltian (Contributor, author):

This patch would improve codegen by reducing the number of SGPR spills for heavy workloads involving device calls. Maybe run the perf PSDB as well? That would give us some initial numbers. You can CP this PR to the staging compiler and then launch the perf PSDB.

Will do.

@arsenm (Contributor) commented Feb 20, 2025

Is the test from #113782 buried somewhere in this giant test diff?

@shiltian (Contributor, author):

Is the test from #113782 buried somewhere in this giant test diff?

Yes. The check line of the test has been updated because it no longer crashes, but another error is emitted instead. I'll fix the new issue in a follow-up.

@shiltian (Contributor, author):

A full testing cycle has been requested. Will comment here afterwards.

@shiltian (Contributor, author) commented Mar 8, 2025

Just got the results from a full cycle. There are no correctness issues and no performance regressions. I'm not sure if there's any performance improvement, though. That being said, this PR should be in good shape to go.

This PR updates the SGPR layout to a striped caller/callee-saved design, similar
to the VGPR layout. The stripe width is set to 8.

Fixes #113782.
@shiltian force-pushed the users/shiltian/striped-sgpr-cc branch from 5478409 to 1bde981 on March 8, 2025 05:14
github-actions bot commented Mar 8, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' d08cf7900d2aaff9e7483ea74a58871edbdc45f2 1bde981f60a8014728012b4b19dd73072a41bd48 llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/branch-relax-spill.ll llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll llvm/test/CodeGen/AMDGPU/call-args-inreg.ll llvm/test/CodeGen/AMDGPU/call-argument-types.ll llvm/test/CodeGen/AMDGPU/call-preserved-registers.ll llvm/test/CodeGen/AMDGPU/callee-frame-setup.ll llvm/test/CodeGen/AMDGPU/ds_read2.ll llvm/test/CodeGen/AMDGPU/dwarf-multi-register-use-crash.ll llvm/test/CodeGen/AMDGPU/function-args-inreg.ll llvm/test/CodeGen/AMDGPU/function-resource-usage.ll llvm/test/CodeGen/AMDGPU/gfx-call-non-gfx-func.ll llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/indirect-call.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll llvm/test/CodeGen/AMDGPU/mcexpr-knownbits-assign-crash-gh-issue-110930.ll llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll llvm/test/CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg.ll llvm/test/CodeGen/AMDGPU/select.f16.ll llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v8i64.ll llvm/test/CodeGen/AMDGPU/sibling-call.ll llvm/test/CodeGen/AMDGPU/spill_more_than_wavesize_csr_sgprs.ll llvm/test/CodeGen/AMDGPU/stack-realign.ll llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll llvm/test/CodeGen/AMDGPU/vgpr-large-tuple-alloc-error.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@shiltian merged commit a779af3 into main on Mar 8, 2025
10 of 11 checks passed
@shiltian deleted the users/shiltian/striped-sgpr-cc branch on March 8, 2025 14:28
@rovka (Collaborator) commented Mar 10, 2025

Hi @shiltian, I think this change introduced some failures on this buildbot, which went unnoticed because it was already red. Could you please have a look? :)

@mikaelholmen (Collaborator):

So, with this patch e.g.

llc -verify-machineinstrs -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll

fails with

# After Greedy Register Allocator
********** INTERVALS **********
SGPR59_LO16 [96r,96d:0) 0@96r
SGPR59_HI16 [96r,96d:0) 0@96r
%8 [16r,32r:0) 0@16r  weight:INF
%15 [80r,80d:0) 0@80r  weight:INF
RegMasks:
********** MACHINEINSTRS **********
# Machine code for function scalar_mov_materializes_frame_index_unavailable_scc: NoPHIs, TracksLiveness, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=16384, align=64, at location [SP]
  fi#1: size=4, align=4, at location [SP]

0B	bb.0 (%ir-block.0):
16B	  %8:vgpr_32 = V_MOV_B32_e32 %stack.0.alloca0, implicit $exec
32B	  INLINEASM &"; use alloca0 $0" [sideeffect] [attdialect], $0:[reguse:VGPR_32], %8:vgpr_32
80B	  dead renamable $sgpr4_sgpr5 = S_AND_B64 0, $exec, implicit-def $scc
96B	  $sgpr59 = S_MOV_B32 %stack.1.alloca1
112B	  INLINEASM &"; use $0, $1" [sideeffect] [attdialect], $0:[reguse], $sgpr59, $1:[reguse], killed $scc
128B	  SI_RETURN

# End machine code for function scalar_mov_materializes_frame_index_unavailable_scc.

*** Bad machine code: No live segment at use ***
- function:    scalar_mov_materializes_frame_index_unavailable_scc
- basic block: %bb.0  (0x55ed7ac21b50) [0B;144B)
- instruction: 112B	INLINEASM &"; use $0, $1" [sideeffect] [attdialect], $0:[reguse], $sgpr59, $1:[reguse], killed $scc
- operand 3:   $sgpr59
- liverange:   [96r,96d:0) 0@96r
- regunit:     SGPR59_LO16
- at:          112B

*** Bad machine code: No live segment at use ***
- function:    scalar_mov_materializes_frame_index_unavailable_scc
- basic block: %bb.0  (0x55ed7ac21b50) [0B;144B)
- instruction: 112B	INLINEASM &"; use $0, $1" [sideeffect] [attdialect], $0:[reguse], $sgpr59, $1:[reguse], killed $scc
- operand 3:   $sgpr59
- liverange:   [96r,96d:0) 0@96r
- regunit:     SGPR59_HI16
- at:          112B
LLVM ERROR: Found 2 machine code errors.

(also seen if you build with EXPENSIVE_CHECKS enabled and run the lit tests, as the failing buildbot shows)

@shiltian (Contributor, author):

@mikaelholmen @rovka Thanks for the information. I'll take a look right away.

@shiltian (Contributor, author) commented Mar 10, 2025

Right before this PR, CodeGen/AMDGPU/shufflevector-physreg-copy.ll was already failing with expensive checks. I'll fix the three new failures in #130644.

shiltian added a commit that referenced this pull request Mar 10, 2025
This PR fixes test failures introduced in #127353 when expensive checks are
enabled.
shiltian added a commit that referenced this pull request Mar 11, 2025
This PR fixes test failures introduced in #127353 when expensive checks
are enabled.

For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and
`llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59`
is no longer in live-ins because it is caller saved. Switch to `s55` in
this PR.
zmodem pushed a commit that referenced this pull request Mar 14, 2025
#130644)"

As suggested on
5ec884e#commitcomment-153707488
this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON:

  LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll
  LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll
  LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll

> This PR fixes test failures introduced in #127353 when expensive checks
> are enabled.
>
> For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and
> `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59`
> is no longer in live-ins because it is caller saved. Switch to `s55` in
> this PR.
frederik-h pushed a commit to frederik-h/llvm-project that referenced this pull request Mar 18, 2025
llvm#130644)"

As suggested on
llvm@5ec884e#commitcomment-153707488
this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON:

  LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll
  LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll
  LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll

> This PR fixes test failures introduced in llvm#127353 when expensive checks
> are enabled.
>
> For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and
> `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59`
> is no longer in live-ins because it is caller saved. Switch to `s55` in
> this PR.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 20, 2025
This PR updates the SGPR layout to a striped caller/callee-saved design,
similar to the VGPR layout.

To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout
starts from s40, with a stripe width of 8. The last stripe is 10 wide instead
of 8 to avoid ending with a 2-wide stripe.

Fixes llvm#113782.

(cherry picked from commit a779af3)
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 20, 2025
Issue closed by this PR: [AMDGPU] No available SGPR for CSR spill stores (#113782)