Closed
Description
Test case:
define amdgpu_ps i32 @_amdgpu_ps_main(i32 inreg %arg) {
bb:
  %i = icmp eq i32 %arg, 0
  %i1 = zext i1 %i to i64
  %i2 = getelementptr i8, ptr addrspace(4) null, i64 %i1
  %i3 = load i32, ptr addrspace(4) %i2, align 8
  ret i32 %i3
}
Compiling with llc -march=amdgcn -mcpu=gfx900 produces:
_amdgpu_ps_main:                        ; @_amdgpu_ps_main
; %bb.0:                                ; %bb
  s_cmp_eq_u32 s0, 0
  s_cselect_b64 s[2:3], -1, 0
  v_cndmask_b32_e64 v0, 0, 1, s[2:3]
  s_mov_b32 s1, 0
  v_readfirstlane_b32 s0, v0
  s_load_dword s0, s[0:1], 0x0
  s_waitcnt lgkmcnt(0)
  ; return to shader part epilog
All computations are uniform (the only input is an inreg SGPR argument), so the round trip through a VGPR via v_cndmask_b32_e64 and v_readfirstlane_b32 is unnecessary.
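For illustration, here is a hand-written sketch of the kind of purely scalar sequence I would hope for. This is an assumption about what better codegen could look like, not output from any compiler build. It reuses the register assignment from the output above and selects the zero-extended compare result directly on the scalar unit with s_cselect_b32:

```
  s_cmp_eq_u32 s0, 0              ; SCC = (arg == 0)
  s_cselect_b32 s0, 1, 0          ; s0 = SCC ? 1 : 0 (zext of the i1), no VGPR involved
  s_mov_b32 s1, 0                 ; high half of the 64-bit address
  s_load_dword s0, s[0:1], 0x0
  s_waitcnt lgkmcnt(0)
```

This would replace the s_cselect_b64 / v_cndmask_b32_e64 / v_readfirstlane_b32 chain with a single SALU instruction.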