[AMDGPU] Change SGPR layout to striped caller/callee saved #127353
Conversation
This stack of pull requests is managed by Graphite.
@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout.

Fixes #113782.

Patch is 2.57 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/127353.diff

60 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td b/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
index 80969fce3d77f..e3861a7d06c3d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
@@ -91,7 +91,11 @@ def CSR_AMDGPU_AGPRs : CalleeSavedRegs<
>;
def CSR_AMDGPU_SGPRs : CalleeSavedRegs<
- (sequence "SGPR%u", 30, 105)
+ (add (sequence "SGPR%u", 30, 37),
+ (sequence "SGPR%u", 46, 53),
+ (sequence "SGPR%u", 62, 69),
+ (sequence "SGPR%u", 78, 85),
+ (sequence "SGPR%u", 94, 105))
>;
def CSR_AMDGPU_SI_Gfx_SGPRs : CalleeSavedRegs<
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
index ab2363860af9d..905d0deacab35 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll
@@ -125,35 +125,35 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
; CHECK-NEXT: v_mov_b32_e32 v42, v1
-; CHECK-NEXT: v_writelane_b32 v43, s44, 12
+; CHECK-NEXT: v_writelane_b32 v43, s52, 12
; CHECK-NEXT: v_and_b32_e32 v1, 0x7fffffff, v42
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT: v_writelane_b32 v43, s45, 13
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT: v_writelane_b32 v43, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v40, v31
; CHECK-NEXT: v_mov_b32_e32 v41, v2
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_cvt_f64_i32_e32 v[2:3], v41
@@ -161,15 +161,15 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
@@ -179,14 +179,14 @@ define double @test_pow_fast_f64__integral_y(double %x, i32 %y.i) {
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
; CHECK-NEXT: v_or_b32_e32 v1, v2, v1
-; CHECK-NEXT: v_readlane_b32 s45, v43, 13
-; CHECK-NEXT: v_readlane_b32 s44, v43, 12
-; CHECK-NEXT: v_readlane_b32 s43, v43, 11
-; CHECK-NEXT: v_readlane_b32 s42, v43, 10
-; CHECK-NEXT: v_readlane_b32 s41, v43, 9
-; CHECK-NEXT: v_readlane_b32 s40, v43, 8
-; CHECK-NEXT: v_readlane_b32 s39, v43, 7
-; CHECK-NEXT: v_readlane_b32 s38, v43, 6
+; CHECK-NEXT: v_readlane_b32 s53, v43, 13
+; CHECK-NEXT: v_readlane_b32 s52, v43, 12
+; CHECK-NEXT: v_readlane_b32 s51, v43, 11
+; CHECK-NEXT: v_readlane_b32 s50, v43, 10
+; CHECK-NEXT: v_readlane_b32 s49, v43, 9
+; CHECK-NEXT: v_readlane_b32 s48, v43, 8
+; CHECK-NEXT: v_readlane_b32 s47, v43, 7
+; CHECK-NEXT: v_readlane_b32 s46, v43, 6
; CHECK-NEXT: v_readlane_b32 s37, v43, 5
; CHECK-NEXT: v_readlane_b32 s36, v43, 4
; CHECK-NEXT: v_readlane_b32 s35, v43, 3
@@ -266,34 +266,34 @@ define double @test_powr_fast_f64(double %x, double %y) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
-; CHECK-NEXT: v_writelane_b32 v43, s44, 12
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
+; CHECK-NEXT: v_writelane_b32 v43, s52, 12
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s45, 13
+; CHECK-NEXT: v_writelane_b32 v43, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v42, v31
; CHECK-NEXT: v_mov_b32_e32 v41, v3
; CHECK-NEXT: v_mov_b32_e32 v40, v2
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_mul_f64 v[0:1], v[40:41], v[0:1]
@@ -301,28 +301,28 @@ define double @test_powr_fast_f64(double %x, double %y) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v42
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: buffer_load_dword v42, off, s[0:3], s33 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
-; CHECK-NEXT: v_readlane_b32 s45, v43, 13
-; CHECK-NEXT: v_readlane_b32 s44, v43, 12
-; CHECK-NEXT: v_readlane_b32 s43, v43, 11
-; CHECK-NEXT: v_readlane_b32 s42, v43, 10
-; CHECK-NEXT: v_readlane_b32 s41, v43, 9
-; CHECK-NEXT: v_readlane_b32 s40, v43, 8
-; CHECK-NEXT: v_readlane_b32 s39, v43, 7
-; CHECK-NEXT: v_readlane_b32 s38, v43, 6
+; CHECK-NEXT: v_readlane_b32 s53, v43, 13
+; CHECK-NEXT: v_readlane_b32 s52, v43, 12
+; CHECK-NEXT: v_readlane_b32 s51, v43, 11
+; CHECK-NEXT: v_readlane_b32 s50, v43, 10
+; CHECK-NEXT: v_readlane_b32 s49, v43, 9
+; CHECK-NEXT: v_readlane_b32 s48, v43, 8
+; CHECK-NEXT: v_readlane_b32 s47, v43, 7
+; CHECK-NEXT: v_readlane_b32 s46, v43, 6
; CHECK-NEXT: v_readlane_b32 s37, v43, 5
; CHECK-NEXT: v_readlane_b32 s36, v43, 4
; CHECK-NEXT: v_readlane_b32 s35, v43, 3
@@ -409,35 +409,35 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
; CHECK-NEXT: v_mov_b32_e32 v42, v1
-; CHECK-NEXT: v_writelane_b32 v43, s44, 12
+; CHECK-NEXT: v_writelane_b32 v43, s52, 12
; CHECK-NEXT: v_and_b32_e32 v1, 0x7fffffff, v42
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
-; CHECK-NEXT: v_writelane_b32 v43, s45, 13
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
+; CHECK-NEXT: v_writelane_b32 v43, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v40, v31
; CHECK-NEXT: v_mov_b32_e32 v41, v2
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_cvt_f64_i32_e32 v[2:3], v41
@@ -445,15 +445,15 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
@@ -463,14 +463,14 @@ define double @test_pown_fast_f64(double %x, i32 %y) {
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload
; CHECK-NEXT: v_or_b32_e32 v1, v2, v1
-; CHECK-NEXT: v_readlane_b32 s45, v43, 13
-; CHECK-NEXT: v_readlane_b32 s44, v43, 12
-; CHECK-NEXT: v_readlane_b32 s43, v43, 11
-; CHECK-NEXT: v_readlane_b32 s42, v43, 10
-; CHECK-NEXT: v_readlane_b32 s41, v43, 9
-; CHECK-NEXT: v_readlane_b32 s40, v43, 8
-; CHECK-NEXT: v_readlane_b32 s39, v43, 7
-; CHECK-NEXT: v_readlane_b32 s38, v43, 6
+; CHECK-NEXT: v_readlane_b32 s53, v43, 13
+; CHECK-NEXT: v_readlane_b32 s52, v43, 12
+; CHECK-NEXT: v_readlane_b32 s51, v43, 11
+; CHECK-NEXT: v_readlane_b32 s50, v43, 10
+; CHECK-NEXT: v_readlane_b32 s49, v43, 9
+; CHECK-NEXT: v_readlane_b32 s48, v43, 8
+; CHECK-NEXT: v_readlane_b32 s47, v43, 7
+; CHECK-NEXT: v_readlane_b32 s46, v43, 6
; CHECK-NEXT: v_readlane_b32 s37, v43, 5
; CHECK-NEXT: v_readlane_b32 s36, v43, 4
; CHECK-NEXT: v_readlane_b32 s35, v43, 3
@@ -552,32 +552,32 @@ define double @test_pown_fast_f64_known_even(double %x, i32 %y.arg) {
; CHECK-NEXT: v_writelane_b32 v42, s35, 3
; CHECK-NEXT: v_writelane_b32 v42, s36, 4
; CHECK-NEXT: v_writelane_b32 v42, s37, 5
-; CHECK-NEXT: v_writelane_b32 v42, s38, 6
-; CHECK-NEXT: v_writelane_b32 v42, s39, 7
+; CHECK-NEXT: v_writelane_b32 v42, s46, 6
+; CHECK-NEXT: v_writelane_b32 v42, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x400
-; CHECK-NEXT: v_writelane_b32 v42, s40, 8
-; CHECK-NEXT: v_writelane_b32 v42, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v42, s48, 8
+; CHECK-NEXT: v_writelane_b32 v42, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v42, s42, 10
-; CHECK-NEXT: v_writelane_b32 v42, s43, 11
-; CHECK-NEXT: v_writelane_b32 v42, s44, 12
+; CHECK-NEXT: v_writelane_b32 v42, s50, 10
+; CHECK-NEXT: v_writelane_b32 v42, s51, 11
+; CHECK-NEXT: v_writelane_b32 v42, s52, 12
; CHECK-NEXT: v_and_b32_e32 v1, 0x7fffffff, v1
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v42, s45, 13
+; CHECK-NEXT: v_writelane_b32 v42, s53, 13
; CHECK-NEXT: v_mov_b32_e32 v40, v31
-; CHECK-NEXT: s_mov_b32 s42, s15
-; CHECK-NEXT: s_mov_b32 s43, s14
-; CHECK-NEXT: s_mov_b32 s44, s13
-; CHECK-NEXT: s_mov_b32 s45, s12
+; CHECK-NEXT: s_mov_b32 s50, s15
+; CHECK-NEXT: s_mov_b32 s51, s14
+; CHECK-NEXT: s_mov_b32 s52, s13
+; CHECK-NEXT: s_mov_b32 s53, s12
; CHECK-NEXT: s_mov_b64 s[34:35], s[10:11]
; CHECK-NEXT: s_mov_b64 s[36:37], s[8:9]
-; CHECK-NEXT: s_mov_b64 s[38:39], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[46:47], s[6:7]
; CHECK-NEXT: v_lshlrev_b32_e32 v41, 1, v2
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
@@ -586,28 +586,28 @@ define double @test_pown_fast_f64_known_even(double %x, i32 %y.arg) {
; CHECK-NEXT: s_add_u32 s4, s4, _Z4exp2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4exp2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: s_mov_b64 s[4:5], s[40:41]
+; CHECK-NEXT: s_mov_b64 s[4:5], s[48:49]
; CHECK-NEXT: v_mul_f64 v[0:1], v[0:1], v[2:3]
-; CHECK-NEXT: s_mov_b64 s[6:7], s[38:39]
+; CHECK-NEXT: s_mov_b64 s[6:7], s[46:47]
; CHECK-NEXT: s_mov_b64 s[8:9], s[36:37]
; CHECK-NEXT: s_mov_b64 s[10:11], s[34:35]
-; CHECK-NEXT: s_mov_b32 s12, s45
-; CHECK-NEXT: s_mov_b32 s13, s44
-; CHECK-NEXT: s_mov_b32 s14, s43
-; CHECK-NEXT: s_mov_b32 s15, s42
+; CHECK-NEXT: s_mov_b32 s12, s53
+; CHECK-NEXT: s_mov_b32 s13, s52
+; CHECK-NEXT: s_mov_b32 s14, s51
+; CHECK-NEXT: s_mov_b32 s15, s50
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
-; CHECK-NEXT: v_readlane_b32 s45, v42, 13
-; CHECK-NEXT: v_readlane_b32 s44, v42, 12
-; CHECK-NEXT: v_readlane_b32 s43, v42, 11
-; CHECK-NEXT: v_readlane_b32 s42, v42, 10
-; CHECK-NEXT: v_readlane_b32 s41, v42, 9
-; CHECK-NEXT: v_readlane_b32 s40, v42, 8
-; CHECK-NEXT: v_readlane_b32 s39, v42, 7
-; CHECK-NEXT: v_readlane_b32 s38, v42, 6
+; CHECK-NEXT: v_readlane_b32 s53, v42, 13
+; CHECK-NEXT: v_readlane_b32 s52, v42, 12
+; CHECK-NEXT: v_readlane_b32 s51, v42, 11
+; CHECK-NEXT: v_readlane_b32 s50, v42, 10
+; CHECK-NEXT: v_readlane_b32 s49, v42, 9
+; CHECK-NEXT: v_readlane_b32 s48, v42, 8
+; CHECK-NEXT: v_readlane_b32 s47, v42, 7
+; CHECK-NEXT: v_readlane_b32 s46, v42, 6
; CHECK-NEXT: v_readlane_b32 s37, v42, 5
; CHECK-NEXT: v_readlane_b32 s36, v42, 4
; CHECK-NEXT: v_readlane_b32 s35, v42, 3
@@ -694,34 +694,34 @@ define double @test_pown_fast_f64_known_odd(double %x, i32 %y.arg) {
; CHECK-NEXT: v_writelane_b32 v43, s35, 3
; CHECK-NEXT: v_writelane_b32 v43, s36, 4
; CHECK-NEXT: v_writelane_b32 v43, s37, 5
-; CHECK-NEXT: v_writelane_b32 v43, s38, 6
-; CHECK-NEXT: v_writelane_b32 v43, s39, 7
+; CHECK-NEXT: v_writelane_b32 v43, s46, 6
+; CHECK-NEXT: v_writelane_b32 v43, s47, 7
; CHECK-NEXT: s_addk_i32 s32, 0x800
-; CHECK-NEXT: v_writelane_b32 v43, s40, 8
-; CHECK-NEXT: v_writelane_b32 v43, s41, 9
-; CHECK-NEXT: s_mov_b64 s[40:41], s[4:5]
+; CHECK-NEXT: v_writelane_b32 v43, s48, 8
+; CHECK-NEXT: v_writelane_b32 v43, s49, 9
+; CHECK-NEXT: s_mov_b64 s[48:49], s[4:5]
; CHECK-NEXT: s_getpc_b64 s[4:5]
; CHECK-NEXT: s_add_u32 s4, s4, _Z4log2d@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s5, s5, _Z4log2d@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
-; CHECK-NEXT: v_writelane_b32 v43, s42, 10
+; CHECK-NEXT: v_writelane_b32 v43, s50, 10
; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 offset:8 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v41, off, s[0:3], s33 offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: buffer_store_dword v42, off, s[0:3], s33 ; 4-byte Folded Spill
-; CHECK-NEXT: v_writelane_b32 v43, s43, 11
+; CHECK-NEXT: v_writelane_b32 v43, s51, 11
; CHECK-NEXT: v_mov_b32_e32 v41, v1
-; CHECK-NEXT: ...
[truncated]
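For readers skimming the (truncated) test diff above, the churn follows one mechanical pattern. Here is a minimal sketch of it, inferred from the diff itself rather than stated anywhere in the PR:

```python
# s38-s45 are caller-saved under the new layout, so every callee-saved use
# in the tests shifts by +8 into the s46-s53 stripe.
remap = {f"s{n}": f"s{n + 8}" for n in range(38, 46)}
print(remap)  # {'s38': 's46', 's39': 's47', ..., 's45': 's53'}
```

The 64-bit pairs (e.g. s[38:39] becoming s[46:47]) and the writelane/readlane spill slots move the same way.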
This has passed internal PSDB (except the one test case that I have not updated yet).
Force-pushed from 025e58c to 178dd48.
Sounds good to me
This patch should improve codegen, with fewer SGPR spills for heavy workloads involving device calls. Maybe run the perf PSDB as well? That would give us some initial numbers. You can CP this PR to the staging compiler and then launch the perf PSDB.
Striping SGPRs serves no purpose on GFX10+, where all waves get the full allocation of SGPRs. But hopefully it doesn't do any harm either.
(sequence "SGPR%u", 46, 53), | ||
(sequence "SGPR%u", 62, 69), | ||
(sequence "SGPR%u", 78, 85), | ||
(sequence "SGPR%u", 94, 105)) | ||
>; | ||
|
||
def CSR_AMDGPU_SI_Gfx_SGPRs : CalleeSavedRegs< |
@Flakebi should we make a similar change here for amdgpu_gfx?
I think both options are fine (changing it or leaving it for now). amdgpu_gfx already has caller-saves that are not used for arguments, so it's not hit by this bug.
The important part is that amdgpu_gfx wants the SGPR arguments to be in callee-save registers. I assume compute would likely benefit from having SGPR args in callee-saves as well, as they usually contain constant data, but it’s not there yet.
Once the C calling convention does that, we can probably ditch amdgpu_gfx and switch to the C calling conv for graphics.
> I assume compute would likely benefit from having SGPR args in callee-saves as well, as they usually contain constant data

I'll include this part in the next step.

This isn't intended as a performance improvement anyway. I'll request a performance cycle.
The only reason for doing striping is to get roughly the same ratio of callee-saves to non-callee-saves at different occupancies. Why would you want to keep that ratio constant, if not for performance? Anyway, that reason does not apply on GFX10+.
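For concreteness, here is a rough sketch of that ratio argument (mine, not from the PR; the per-wave SGPR budgets below are illustrative assumptions, and the stripe ranges are taken from the diff above):

```python
contiguous = set(range(30, 106))  # old layout: s30-s105 all callee-saved
striped = set()
for lo, hi in ((30, 37), (46, 53), (62, 69), (78, 85), (94, 105)):
    striped |= set(range(lo, hi + 1))  # new callee-saved stripes

for budget in (48, 64, 80, 106):  # hypothetical SGPRs available per wave
    avail = set(range(budget))
    print(f"budget {budget:3}: contiguous CSRs {len(contiguous & avail):2}, "
          f"striped CSRs {len(striped & avail):2}")
```

With the contiguous layout, a smaller budget leaves almost only callee-saved registers above the argument range; striping keeps a mix of both kinds at every budget.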
Force-pushed from 178dd48 to 5478409.
Will do.
Is the test from #113782 buried somewhere in this giant test diff?
Yes. The check lines of that test have been updated: it no longer crashes, but another error is emitted. I'll fix the new issue in a follow-up.
A full testing cycle has been requested. Will comment here afterwards.
Just got the results from a full cycle. There are no correctness issues and no performance regressions. I'm not sure if there's any performance improvement, though. That being said, this PR should be in good shape to go.
This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. The stripe width is set to 8. Fixes #113782.
Force-pushed from 5478409 to 1bde981.
You can test this locally with the following command:

```shell
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' d08cf7900d2aaff9e7483ea74a58871edbdc45f2 1bde981f60a8014728012b4b19dd73072a41bd48 llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/blender-no-live-segment-at-def-implicit-def.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/branch-relax-spill.ll llvm/test/CodeGen/AMDGPU/call-args-inreg-no-sgpr-for-csrspill-xfail.ll llvm/test/CodeGen/AMDGPU/call-args-inreg.ll llvm/test/CodeGen/AMDGPU/call-argument-types.ll llvm/test/CodeGen/AMDGPU/call-preserved-registers.ll llvm/test/CodeGen/AMDGPU/callee-frame-setup.ll llvm/test/CodeGen/AMDGPU/ds_read2.ll llvm/test/CodeGen/AMDGPU/dwarf-multi-register-use-crash.ll llvm/test/CodeGen/AMDGPU/function-args-inreg.ll llvm/test/CodeGen/AMDGPU/function-resource-usage.ll llvm/test/CodeGen/AMDGPU/gfx-call-non-gfx-func.ll llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/indirect-call.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll llvm/test/CodeGen/AMDGPU/mcexpr-knownbits-assign-crash-gh-issue-110930.ll llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll llvm/test/CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg.ll llvm/test/CodeGen/AMDGPU/select.f16.ll llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v8i64.ll llvm/test/CodeGen/AMDGPU/sibling-call.ll llvm/test/CodeGen/AMDGPU/spill_more_than_wavesize_csr_sgprs.ll llvm/test/CodeGen/AMDGPU/stack-realign.ll llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll llvm/test/CodeGen/AMDGPU/vgpr-large-tuple-alloc-error.ll
```

The following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

```llvm
define void @fn() {
  ...
  br i1 undef, ...
}
```

Please use the following instead:

```llvm
define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}
```

Please refer to the Undefined Behavior Manual for more information.
Hi @shiltian, I think this change introduced some failures on this buildbot, which went unnoticed because it was already red. Could you please have a look? :)
So, with this patch, e.g. llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll fails with a machine verifier error (also seen if you compile with EXPENSIVE_CHECKS and run the lit tests, as the failed buildbot shows).
@mikaelholmen @rovka Thanks for the information. I'll take a look right away.
Right before this PR,
This PR fixes test failures introduced in #127353 when expensive checks are enabled.
This PR fixes test failures introduced in #127353 when expensive checks are enabled. For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59` is no longer in live-ins because it is caller saved. Switch to `s55` in this PR.
#130644)" As suggested on 5ec884e#commitcomment-153707488 this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON: LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll > This PR fixes test failures introduced in #127353 when expensive checks > are enabled. > > For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and > `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59` > is no longer in live-ins because it is caller saved. Switch to `s55` in > this PR.
llvm#130644)" As suggested on llvm@5ec884e#commitcomment-153707488 this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON: LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll > This PR fixes test failures introduced in llvm#127353 when expensive checks > are enabled. > > For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and > `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59` > is no longer in live-ins because it is caller saved. Switch to `s55` in > this PR.
This PR updates the SGPR layout to a striped caller/callee-saved design, similar to the VGPR layout. To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame pointer), and s34 (base pointer) remain callee-saved, the striped layout starts from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to avoid ending with a 2-wide stripe. Fixes llvm#113782. (cherry picked from commit a779af3)
This PR updates the SGPR layout to a striped caller/callee-saved design, similar
to the VGPR layout.
To ensure that s30-s31 (return address), s32 (stack pointer), s33 (frame
pointer), and s34 (base pointer) remain callee-saved, the striped layout starts
from s40, with a stripe width of 8. The last stripe is 10 wide instead of 8 to
avoid ending with a 2-wide stripe.
Fixes #113782.
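A small sketch of the resulting callee-saved set as I read that description (an assumption about the final layout, not code taken from the PR): s30-s39 remain callee-saved, 8-wide caller/callee stripes alternate starting at s40, and the final callee-saved stripe is widened to 10 so it ends at s105.

```python
stripes = [(30, 39), (48, 55), (64, 71), (80, 87), (96, 105)]  # callee-saved
callee_saved = [f"s{n}" for lo, hi in stripes for n in range(lo, hi + 1)]
assert len(callee_saved) == 44  # 10 + 8 + 8 + 8 + 10
print(", ".join(f"s{lo}-s{hi}" for lo, hi in stripes))
```

The gaps between those stripes (s40-s47, s56-s63, s72-s79, s88-s95) are the caller-saved runs that replace the old contiguous s30-s105 callee-saved block.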