Description
Currently, clang always sets max_flat_workgroup_size to 1024, which causes register spills:
```
.max_flat_workgroup_size: 1024
.name: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
.private_segment_fixed_size: 264
.sgpr_count: 60
.sgpr_spill_count: 0
.symbol: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
.vgpr_count: 128
.vgpr_spill_count: 66
.wavefront_size: 64
```
I tested overriding the default with `ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))` and got a 2x kernel speed-up:
```
.max_flat_workgroup_size: 256
.name: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
.private_segment_fixed_size: 0
.sgpr_count: 58
.sgpr_spill_count: 0
.symbol: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
.vgpr_count: 166
.vgpr_spill_count: 0
.wavefront_size: 64
```
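For completeness, a minimal, self-contained sketch of the kind of override I applied. The loop and names are illustrative only (not the einspline kernel above), and I am assuming the clause is accepted on the combined target directive:

```c++
#include <vector>

// Illustrative offload loop. The ompx_attribute clause caps the kernel at
// 256 work-items per workgroup, so the compiler emits
// .max_flat_workgroup_size: 256 instead of the default 1024.
void axpy(std::vector<float> &y, const std::vector<float> &x, float a) {
  float *yp = y.data();
  const float *xp = x.data();
  const int n = static_cast<int>(y.size());
#pragma omp target teams distribute parallel for                              \
    map(tofrom : yp[:n]) map(to : xp[:n])                                      \
    ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
  for (int i = 0; i < n; ++i)
    yp[i] += a * xp[i];
}
```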
The default of 1024 is clearly very bad in this case. When I write CUDA, even though 1024 is supported, I rarely use 1024 and mostly use 128 or 256.
- Can max_flat_workgroup_size be chosen at link time, once the needed VGPR count has been figured out?
- When I specify a `thread_limit(192)` clause, can the compiler take advantage of it? (See the sketch below.)
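A sketch of what I mean by the second question, again with an illustrative loop rather than the real kernel: thread_limit(192) bounds the team size at run time, and the question is whether the backend can also use that bound, instead of assuming 1024 work-items, when choosing max_flat_workgroup_size and the register budget.

```c++
#include <cstddef>

// Illustrative loop only. thread_limit(192) restricts each team to at most
// 192 threads at run time; ideally the compiler would propagate this bound
// to the generated kernel's flat workgroup size.
void scale(float *data, std::size_t n, float a) {
#pragma omp target teams distribute parallel for thread_limit(192)            \
    map(tofrom : data[:n])
  for (std::size_t i = 0; i < n; ++i)
    data[i] *= a;
}
```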