AMDGPU missed opportunity: 2 x v_mov_b32 -> v_mov_b64 #139198

Open
@newling

Consider the following IR (input.ll). I'm focusing specifically on how llc lowers the zeroinitializer.

define amdgpu_kernel void @main(ptr addrspace(1) %out_ptr) {
entry:
  br label %loop

loop:                                              ; preds = %loop, %entry
  %vec = phi <32 x float> [ zeroinitializer, %entry ], [ %vec.next, %loop ]
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %element = extractelement <32 x float> %vec, i32 %i
  %add = fadd float %element, 1.0
  %vec.next = insertelement <32 x float> %vec, float %add, i32 %i
  %i.next = add nuw nsw i32 %i, 1
  %exitcond = icmp eq i32 %i.next, 16
  br i1 %exitcond, label %store_result, label %loop

store_result:                                     ; preds = %loop
  %ptr = getelementptr float, ptr addrspace(1) %out_ptr, i64 0
  store <32 x float> %vec.next, ptr addrspace(1) %ptr, align 64
  ret void
}

Running

llc -mtriple=amdgcn -mcpu=gfx942 input.ll

the generated assembly is:

	.text
	.globl	main                            ; -- Begin function main
	.p2align	8
	.type	main,@function
main:                                   ; @main
; %bb.0:                                ; %entry
	v_mov_b32_e32 v0, 0
	s_mov_b32 s0, 0
	v_mov_b32_e32 v1, v0
	v_mov_b32_e32 v2, v0
	v_mov_b32_e32 v3, v0
	v_mov_b32_e32 v4, v0
	v_mov_b32_e32 v5, v0
[...]
	v_mov_b32_e32 v26, v0
	v_mov_b32_e32 v27, v0
	v_mov_b32_e32 v28, v0
	v_mov_b32_e32 v29, v0
	v_mov_b32_e32 v30, v0
	v_mov_b32_e32 v31, v0
.LBB0_1:                                ; %loop
                                        ; =>This Inner Loop Header: Depth=1
	s_set_gpr_idx_on s0, gpr_idx(SRC0)
	v_mov_b32_e32 v32, v0
	s_set_gpr_idx_off
	v_add_f32_e32 v32, 1.0, v32
	s_set_gpr_idx_on s0, gpr_idx(DST)

The optimization I have in mind is to combine consecutive v_mov_b32_e32 instructions, to arrive at something like

[...]
	v_mov_b64_e32 v[2:3], v[0:1]
	v_mov_b64_e32 v[4:5], v[0:1]
	v_mov_b64_e32 v[6:7], v[0:1]
[...]
	v_mov_b64_e32 v[30:31], v[0:1]

This makes use of the 64-bit (two-register) move instruction on MI300 (search for "V_MOV_B64" in https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf).
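
To make the intended pairing rule concrete, here is a toy Python model that works on textual assembly lines. It is only a sketch, not an LLVM pass: `combine_movs` and `_record_copy` are hypothetical names, the requirement that register pairs be even-aligned is an assumption of the model, and anything other than a register-to-register v_mov_b32_e32 is conservatively treated as clobbering everything. Note that rewriting `v2 = v0; v3 = v0` as a move from `v[0:1]` is only valid when v1 is known to hold the same value as v0, so the model tracks simple copies.

```python
import re

# Matches a register-to-register 32-bit VGPR move, e.g. "v_mov_b32_e32 v2, v0".
MOV32 = re.compile(r"^\s*v_mov_b32_e32\s+v(\d+),\s*v(\d+)\s*$")


def _record_copy(copy_of, dst, src):
    """Note that dst now holds the same value as src (straight-line code only)."""
    root = copy_of.get(src, src)
    # dst is overwritten: forget what it held and every fact that relied on it.
    for reg in [r for r, s in copy_of.items() if r == dst or s == dst]:
        del copy_of[reg]
    if dst != root:
        copy_of[dst] = root


def combine_movs(lines):
    """Fuse adjacent 32-bit moves into v_mov_b64_e32 where the toy rules allow."""
    out, copy_of, i = [], {}, 0
    while i < len(lines):
        first = MOV32.match(lines[i])
        if first is None:
            # Unknown instruction: conservatively drop all tracked copies.
            copy_of.clear()
            out.append(lines[i])
            i += 1
            continue
        d0, s0 = map(int, first.groups())
        second = MOV32.match(lines[i + 1]) if i + 1 < len(lines) else None
        if second is not None:
            d1, s1 = map(int, second.groups())
            root = lambda r: copy_of.get(r, r)
            fusable = (
                d0 % 2 == 0 and d1 == d0 + 1   # even-aligned destination pair (assumed)
                and s0 % 2 == 0 and d0 != s0   # even-aligned, non-overlapping source pair
                and s1 != d0                   # second move must not read the first's result
                and root(s0 + 1) == root(s1)   # v[s0+1] already holds the value of v[s1]
            )
            if fusable:
                out.append(f"\tv_mov_b64_e32 v[{d0}:{d1}], v[{s0}:{s0 + 1}]")
                _record_copy(copy_of, d0, s0)
                _record_copy(copy_of, d1, s1)
                i += 2
                continue
        out.append(lines[i])
        _record_copy(copy_of, d0, s0)
        i += 1
    return out
```

Fed the start of the sequence above (v0 set to 0, v1 copied from v0, then v2 through v5 copied from v0), the model keeps the first two moves and emits `v_mov_b64_e32 v[2:3], v[0:1]` and `v_mov_b64_e32 v[4:5], v[0:1]`. A real implementation would operate on MachineInstrs and would need to check the target's actual operand and alignment constraints.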

I wonder where such an optimization might live. Would it be a standalone pass like GCNDPPCombine.cpp, or should it be a pattern in AMDGPUPostLegalizerCombiner.cpp? With some guidance on this, I'd be happy to give it a try.

[Please let me know if this task doesn't make sense. I'm quite new here and would like to learn.]
