Description
llvm-project git hash: c9eebc7af440dc012c94d25351eaba92e6a57910
command: llc --mtriple=aarch64 aarch64_reprod.ll
Reproducible provided for aarch64, but am seeing it for amdgpu as well (I assume it can happen for any target using greedy).
From the AArch64 reproducible provided (aarch64_reprod.s):
...
fmov d16, d19
fmul d7, d6, d7
ldr d6, [sp, #40] // 8-byte Folded Reload
fnmul d13, d1, d18
fmul d24, d3, d24
fmul d11, d30, d22
mul w11, w1, w11
fmadd d26, d9, d22, d26
fmul d17, d8, d17
cmp w8, #3
fmadd d31, d6, d18, d15
fmul d27, d27, d28
ldr d6, [sp, #40] // 8-byte Folded Reload
fmadd d9, d0, d18, d14
fmul d8, d4, d23
...
Offset #40 is reloaded from the stack into d6 multiple times. Between the two reloads there is a use, fmadd d31, d6, d18, d15, but no def of d6 that would overwrite the initial reload. Looking into it, this appears to be an unfortunate split+spill combination (though the split in the aarch64 example is particularly confusing; the amdgpu case I've been looking at shows a more direct split -> spill -> superfluous stack reload sequence) where spillAroundUses indiscriminately reloads for every use within an interval, regardless of whether a previous reload is (or could be) still live. In the aarch64_reprod_regalloc.txt dump, the virtregs in question are %170 and %171.
amdgpu reproducible:
llvm-project git hash: 3d7e56fd28cd2195e7f330f933d491530e274401
command: llc -mcpu=gfx942 amdgpu_reprod.ll -O3
Will produce a similar sequence (amdgpu_reprod.s):
...
global_store_dwordx2 v12, v[2:3], s[22:23]
scratch_load_dwordx2 v[2:3], off, off ; 8-byte Folded Reload
v_accvgpr_write_b32 a20, v28
v_mul_f64 v[58:59], s[10:11], v[24:25]
v_accvgpr_write_b32 a21, v29
v_add_f64 v[28:29], v[58:59], 0
v_accvgpr_write_b32 a14, v22
v_accvgpr_write_b32 a43, v15
v_accvgpr_write_b32 a15, v23
v_accvgpr_write_b32 a42, v14
v_mul_f64 v[14:15], s[36:37], 0
v_mov_b64_e32 v[44:45], s[12:13]
v_accvgpr_write_b32 a0, v20
v_accvgpr_write_b32 a1, v21
v_mov_b32_e32 v13, 0x3ff00000
v_add_f64 v[16:17], v[54:55], v[16:17]
s_lshl_b64 s[2:3], s[42:43], 3
s_add_u32 s2, s22, s2
s_addc_u32 s3, s23, s3
s_waitcnt vmcnt(0)
v_mul_f64 v[32:33], v[2:3], 0
scratch_load_dwordx2 v[2:3], off, off ; 8-byte Folded Reload
v_mul_f64 v[30:31], v[32:33], 0
...
Where v[2:3] is loaded from the stack, used (but not overwritten/redefined), and immediately loaded again for a subsequent use.
I'm personally not too familiar with the greedy regalloc internals, so I'm not sure how easy a fix will be. Regalloc changes are (understandably) under a bit more scrutiny, so I'm not convinced it will be trivial, but I'd like to hear from the people more familiar with the greedy regalloc, if possible. Perhaps there is some low-hanging fruit I'm overlooking.
aarch64_reprod_regalloc.txt
aarch64_reprod.ll.txt
aarch64_reprod.s.txt
amdgpu_reprod_regalloc.txt
amdgpu_reprod.ll.txt
amdgpu_reprod.s.txt