Description
Consider the following IR (input.ll). In particular, I'm focusing on what zeroinitializer
gets lowered to by llc:
define amdgpu_kernel void @main(ptr addrspace(1) %out_ptr) {
entry:
  br label %loop

loop: ; preds = %loop, %entry
  %vec = phi <32 x float> [ zeroinitializer, %entry ], [ %vec.next, %loop ]
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %element = extractelement <32 x float> %vec, i32 %i
  %add = fadd float %element, 1.0
  %vec.next = insertelement <32 x float> %vec, float %add, i32 %i
  %i.next = add nuw nsw i32 %i, 1
  %exitcond = icmp eq i32 %i.next, 16
  br i1 %exitcond, label %store_result, label %loop

store_result: ; preds = %loop
  %ptr = getelementptr float, ptr addrspace(1) %out_ptr, i64 0
  store <32 x float> %vec.next, ptr addrspace(1) %ptr, align 64
  ret void
}
Running
llc -mtriple=amdgcn -mcpu=gfx942 input.ll
the generated assembly is:
  .text
  .globl main ; -- Begin function main
  .p2align 8
  .type main,@function
main: ; @main
; %bb.0: ; %entry
  v_mov_b32_e32 v0, 0
  s_mov_b32 s0, 0
  v_mov_b32_e32 v1, v0
  v_mov_b32_e32 v2, v0
  v_mov_b32_e32 v3, v0
  v_mov_b32_e32 v4, v0
  v_mov_b32_e32 v5, v0
  [...]
  v_mov_b32_e32 v26, v0
  v_mov_b32_e32 v27, v0
  v_mov_b32_e32 v28, v0
  v_mov_b32_e32 v29, v0
  v_mov_b32_e32 v30, v0
  v_mov_b32_e32 v31, v0
.LBB0_1: ; %loop
; =>This Inner Loop Header: Depth=1
  s_set_gpr_idx_on s0, gpr_idx(SRC0)
  v_mov_b32_e32 v32, v0
  s_set_gpr_idx_off
  v_add_f32_e32 v32, 1.0, v32
  s_set_gpr_idx_on s0, gpr_idx(DST)
The optimization I have in mind is to combine consecutive v_mov_b32_e32 instructions to arrive at something like:
[...]
v_mov_b64_e32 v[2:3], v[0:1]
v_mov_b64_e32 v[4:5], v[0:1]
v_mov_b64_e32 v[6:7], v[0:1]
[...]
v_mov_b64_e32 v[30:31], v[0:1]
This makes use of the 2-register move instruction on MI300 (search for "V_MOV_B64" in https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf).
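To make the idea concrete, here is a minimal toy model of the pairing logic in Python. It is only a sketch and not LLVM code: `combine_moves`, `render`, and `NUM_VGPRS` are hypothetical names, and the model assumes that a 64-bit move needs an even-aligned register pair for both source and destination. It also ignores everything a real pass would have to check (liveness, EXEC, scheduling, subtarget features). The point it illustrates is that `v2 = v0; v3 = v0` can only become `v[2:3] = v[0:1]` once `v1` is known to hold the same value as `v0`.

```python
NUM_VGPRS = 256  # assumed size of the toy register file


def combine_moves(moves):
    """Greedily pair adjacent 32-bit register copies into 64-bit copies.

    `moves` is a list of (dst, src) VGPR indices in program order.
    Returns a list of ("b32", dst, src) or ("b64", dst_lo, src_lo) tuples.
    """
    holds = {}  # register -> token for the value it is known to hold

    def value(reg):
        # An unwritten register holds its own (unknown) initial value.
        return holds.get(reg, ("init", reg))

    def find_source_pair(lo_val, hi_val, dst_lo):
        # Look for an even-aligned pair that already holds the wanted values
        # and does not overlap the destination pair.
        for a in range(0, NUM_VGPRS, 2):
            if a != dst_lo and value(a) == lo_val and value(a + 1) == hi_val:
                return a
        return None

    out = []
    i = 0
    while i < len(moves):
        dst, src = moves[i]
        lo_val = value(src)
        if i + 1 < len(moves) and dst % 2 == 0 and moves[i + 1][0] == dst + 1:
            src2 = moves[i + 1][1]
            # The second copy may read the register the first one just wrote.
            hi_val = lo_val if src2 == dst else value(src2)
            src_lo = find_source_pair(lo_val, hi_val, dst)
            if src_lo is not None:
                out.append(("b64", dst, src_lo))
                holds[dst], holds[dst + 1] = lo_val, hi_val
                i += 2
                continue
        out.append(("b32", dst, src))
        holds[dst] = lo_val
        i += 1
    return out


def render(op):
    """Format one combined operation in the assembly syntax used above."""
    kind, dst, src = op
    if kind == "b64":
        return f"v_mov_b64_e32 v[{dst}:{dst + 1}], v[{src}:{src + 1}]"
    return f"v_mov_b32_e32 v{dst}, v{src}"


# The splat from the listing above: v1..v31 are all copied from v0.
splat = [(d, 0) for d in range(1, 32)]
for op in combine_moves(splat):
    print(render(op))
```

On the splat above, this model keeps the single 32-bit copy into v1 and then emits 64-bit copies for v[2:3] through v[30:31], which is the shape of the output I was hoping for.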
I wonder where such an optimization might live. Would it be a standalone pass like GCNDPPCombine.cpp, or should it be a pattern in AMDGPUPostLegalizerCombiner.cpp? If I can get some guidance on this, I'd be happy to give it a try.
[Please let me know if this task doesn't make sense. I'm quite new here and would like to learn.]