[AMDGPU] int64 modulo-constant x % 3 and divide-by-constant x / 3 compile to 80 instructions. #100383

Closed
@bjacob

Description

This is observed with -xhip targeting AMD MI300 (gfx942).

Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are -O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942.

Testcase:

#include <cstdint>  // for int64_t

__attribute__((device))
int64_t a(int64_t i) {
    return i % 3;
}

This compiles to 80 instructions.

By contrast, the same testcase with int64_t replaced by int32_t compiles to just 8 instructions.

I was expecting the int64 variant to generate slightly over 2x more instructions than the int32 variant (since the target requires rewriting int64 ops into pairs of int32 ops), not 10x.

The above Compiler Explorer link shows the same happening with i / 3 instead of i % 3.
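For reference, here is a minimal host-side sketch of the usual multiply-by-reciprocal trick for signed division by 3, which is the kind of short sequence I had in mind. It is not a claim about what LLVM emits or should emit for gfx942. The helper names (`mulhi_s64`, `div3`, `mod3`) are made up for illustration, and the code assumes the `__int128` extension (GCC/Clang) with arithmetic right shift of negative values.

```cpp
#include <cstdint>

// Hypothetical helper: high 64 bits of the signed 64x64 -> 128-bit product.
// Assumes the __int128 extension and arithmetic right shift on negatives.
static inline int64_t mulhi_s64(int64_t a, int64_t b) {
    return static_cast<int64_t>((static_cast<__int128>(a) * b) >> 64);
}

// Signed division by 3, truncating toward zero like the C++ `/` operator.
static inline int64_t div3(int64_t x) {
    const int64_t magic = 0x5555555555555556LL;  // ceil(2^64 / 3)
    int64_t q = mulhi_s64(x, magic);             // floor(x * magic / 2^64)
    // Add 1 when x is negative so the result rounds toward zero.
    q += static_cast<int64_t>(static_cast<uint64_t>(x) >> 63);
    return q;
}

// Remainder derived from the quotient; has the sign of x, like `%`.
static inline int64_t mod3(int64_t x) {
    return x - 3 * div3(x);
}
```

On a target without a native 64-bit multiply-high, `mulhi_s64` would itself have to be built from 32-bit multiplies and adds with carry, so some growth over the int32 case is expected.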
