[GreedyRegAlloc] Multiple spill reloads into same register without intermediate def/overwrite #135639

Open

Description

@JanekvO

llvm-project git hash: c9eebc7af440dc012c94d25351eaba92e6a57910
command: llc --mtriple=aarch64 aarch64_reprod.ll

A reproducer is provided for aarch64, but I'm seeing it for amdgpu as well (I assume it can happen for any target using the greedy allocator).

From the AArch64 reproducer provided (aarch64_reprod.s):

...
fmov    d16, d19
fmul    d7, d6, d7
ldr     d6, [sp, #40]                   // 8-byte Folded Reload
fnmul   d13, d1, d18
fmul    d24, d3, d24
fmul    d11, d30, d22
mul     w11, w1, w11
fmadd   d26, d9, d22, d26
fmul    d17, d8, d17
cmp     w8, #3
fmadd   d31, d6, d18, d15
fmul    d27, d27, d28
ldr     d6, [sp, #40]                   // 8-byte Folded Reload
fmadd   d9, d0, d18, d14
fmul    d8, d4, d23
...

Where offset #40 is reloaded from the stack into d6 multiple times. Between the two reloads there is a use (fmadd d31, d6, d18, d15) but no def of d6 that would overwrite the initial reload, so the second reload is redundant. Looking into it, this seems to be an unfortunate split+spill combination (albeit the split in the aarch64 example is particularly confusing; the amdgpu case I've been looking at shows a more direct split -> spill -> superfluous stack reload sequence) where spillAroundUses indiscriminately inserts a reload for every use within an interval, regardless of whether a previous reload is (or could be) still live. In the aarch64_reprod_regalloc.txt dump, the virtregs in question are %170 and %171.
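For illustration only, here is a minimal standalone sketch (plain C++, not actual InlineSpiller/LLVM code; all types and names are hypothetical) of the kind of bookkeeping that would let the spiller skip the second reload: track whether the value reloaded from the stack slot is still live in a register, and only insert a reload when it is not.

// Hypothetical sketch, not LLVM code: walk uses of the spilled value in
// program order, remember whether the reloaded value is still live in a
// register, and elide reloads that would be redundant.
#include <cstdio>
#include <string>
#include <vector>

struct Inst {
  std::string Text;
  bool UsesSpilledValue;   // reads the value that lives in the stack slot
  bool ClobbersReloadReg;  // overwrites the register holding the reload
};

// ElideRedundant=false mimics the reported behavior (reload before every
// use); ElideRedundant=true skips reloads while the previous one is live.
static void emitReloads(const std::vector<Inst> &Block, bool ElideRedundant) {
  bool ReloadLive = false; // is the slot's value currently in a register?
  for (const Inst &I : Block) {
    if (I.UsesSpilledValue && (!ElideRedundant || !ReloadLive)) {
      std::printf("  ldr d6, [sp, #40]   // reload\n");
      ReloadLive = true;
    }
    std::printf("  %s\n", I.Text.c_str());
    if (I.ClobbersReloadReg)
      ReloadLive = false; // register clobbered; the next use must reload
  }
}

int main() {
  std::vector<Inst> Block = {
      {"<use #1 of reloaded value>", true, false},
      {"<unrelated instruction>", false, false},
      {"<use #2 of reloaded value>", true, false},
  };
  std::puts("Current (reload per use):");
  emitReloads(Block, /*ElideRedundant=*/false);
  std::puts("With redundant-reload elision:");
  emitReloads(Block, /*ElideRedundant=*/true);
  return 0;
}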

amdgpu reproducer:

llvm-project git hash: 3d7e56fd28cd2195e7f330f933d491530e274401
command: llc -mcpu=gfx942 amdgpu_reprod.ll -O3

This produces a similar sequence (amdgpu_reprod.s):

...
global_store_dwordx2 v12, v[2:3], s[22:23]
scratch_load_dwordx2 v[2:3], off, off   ; 8-byte Folded Reload
v_accvgpr_write_b32 a20, v28
v_mul_f64 v[58:59], s[10:11], v[24:25]
v_accvgpr_write_b32 a21, v29
v_add_f64 v[28:29], v[58:59], 0
v_accvgpr_write_b32 a14, v22
v_accvgpr_write_b32 a43, v15
v_accvgpr_write_b32 a15, v23
v_accvgpr_write_b32 a42, v14
v_mul_f64 v[14:15], s[36:37], 0
v_mov_b64_e32 v[44:45], s[12:13]
v_accvgpr_write_b32 a0, v20
v_accvgpr_write_b32 a1, v21
v_mov_b32_e32 v13, 0x3ff00000
v_add_f64 v[16:17], v[54:55], v[16:17]
s_lshl_b64 s[2:3], s[42:43], 3
s_add_u32 s2, s22, s2
s_addc_u32 s3, s23, s3
s_waitcnt vmcnt(0)
v_mul_f64 v[32:33], v[2:3], 0
scratch_load_dwordx2 v[2:3], off, off   ; 8-byte Folded Reload
v_mul_f64 v[30:31], v[32:33], 0
...

Where v[2:3] is reloaded from the stack, used (but not overwritten or redefined), and then immediately reloaded again for a subsequent use.

I'm personally not too familiar with the greedy regalloc internals, so I'm not sure how involved a fix would be. Regalloc changes are (understandably) under extra scrutiny, so I don't expect this to be trivial, but I'd like to hear from people more familiar with the greedy allocator, if possible. Perhaps there is some low-hanging fruit I'm overlooking.

aarch64_reprod_regalloc.txt
aarch64_reprod.ll.txt
aarch64_reprod.s.txt

amdgpu_reprod_regalloc.txt
amdgpu_reprod.ll.txt
amdgpu_reprod.s.txt
