
[OpenMP] amdgpu bad choice of max_flat_workgroup_size #64816

Closed
@ye-luo

Description


Currently clang always sets max_flat_workgroup_size to 1024, which causes register spills:

    .max_flat_workgroup_size: 1024
    .name:           __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 264
    .sgpr_count:     60
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     128
    .vgpr_spill_count: 66
    .wavefront_size: 64

I tested overriding the default using ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
and got a 2x kernel speed-up:

    .max_flat_workgroup_size: 256
    .name:           __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 0
    .sgpr_count:     58
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     166
    .vgpr_spill_count: 0
    .wavefront_size: 64
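
For reference, this is roughly how I attach the override to the target region. It is only a sketch: the axpy kernel below is a placeholder, not the actual multi_evaluate_vgh code from QMCPACK.

    // Placeholder kernel, assuming clang's ompx_attribute extension (the clause
    // used above). The attribute tells the backend that this kernel will only be
    // launched with 128-256 work-items per workgroup, so it can budget registers
    // for 256 threads instead of 1024.
    void axpy(float *x, const float *y, float a, int n) {
    #pragma omp target teams distribute parallel for                              \
        map(to: y[0:n]) map(tofrom: x[0:n])                                       \
        ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
      for (int i = 0; i < n; ++i)
        x[i] += a * y[i];
    }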

The default of 1024 is clearly very bad in this case. When I write CUDA, even though 1024 is supported, I rarely use 1024 and mostly use 128 or 256.

  1. Can max_flat_workgroup_size be chosen at link time, once the required VGPR count has been figured out?
  2. When I specify a thread_limit(192) clause, can the compiler take advantage of it? (See the sketch after this list.)
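
For question 2, this is the kind of code I have in mind (again a sketch with a placeholder kernel): the thread_limit(192) clause already bounds the team size, so in principle the backend could emit a max_flat_workgroup_size of 192 instead of 1024.

    // Placeholder kernel: thread_limit(192) caps the number of threads per team,
    // which in principle also bounds the flat workgroup size the kernel needs.
    void scale(float *x, float a, int n) {
    #pragma omp target teams distribute parallel for thread_limit(192)            \
        map(tofrom: x[0:n])
      for (int i = 0; i < n; ++i)
        x[i] *= a;
    }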
