Closed
Description
This is observed with -x hip targeting AMD MI300 (gfx942).
Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are -O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942.
Testcase:
#include <cstdint>  // for int64_t (no GPU wrapper headers with -nogpuinc)

__attribute__((device))
int64_t a(int64_t i) {
    return i % 3;
}
This compiles to 80 instructions.
By contrast, the same testcase with int64_t replaced by int32_t compiles to just 8 instructions.
I was expecting the int64 variant to generate slightly more than 2x as many instructions as the int32 variant (since the target requires rewriting int64 ops into pairs of int32 ops), not 10x.
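To make that expectation concrete, here is a minimal host-side C++ sketch of how a signed 64-bit divide or remainder by 3 can be done with a multiply by a "magic" constant, using only 32x32-bit partial products. It assumes the usual multiply-high technique for division by a constant; the helper names (mulhi_s64, div3, rem3) are hypothetical, and this is not a claim about what the AMDGPU backend actually emits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: high 64 bits of the signed 128-bit product u * v,
// built from 32x32-bit partial products only.
// Assumes >> on a negative value is an arithmetic shift (guaranteed since C++20).
int64_t mulhi_s64(int64_t u, int64_t v) {
    int64_t u0 = static_cast<uint32_t>(u), u1 = u >> 32;  // low (unsigned) and high (signed) halves
    int64_t v0 = static_cast<uint32_t>(v), v1 = v >> 32;
    uint64_t w0 = static_cast<uint64_t>(u0) * static_cast<uint64_t>(v0);  // low x low
    int64_t t = u1 * v0 + static_cast<int64_t>(w0 >> 32);                 // high x low + carry
    int64_t w1 = t & 0xFFFFFFFF;
    int64_t w2 = t >> 32;
    w1 = u0 * v1 + w1;                                                    // low x high
    return u1 * v1 + w2 + (w1 >> 32);                                     // high x high + carries
}

// Signed division by 3, truncating toward zero like the built-in operator.
int64_t div3(int64_t n) {
    const int64_t magic = 0x5555555555555556;  // (2^64 + 2) / 3
    int64_t q = mulhi_s64(magic, n);
    q += static_cast<int64_t>(static_cast<uint64_t>(n) >> 63);  // add 1 when n is negative
    return q;
}

// Remainder recovered from the quotient with one multiply and one subtract.
int64_t rem3(int64_t n) {
    return n - div3(n) * 3;
}

// Small self-check against the built-in operators.
void self_check() {
    const int64_t samples[] = {0, 1, 2, 3, -1, -2, -3, 1000000007, INT64_MAX, INT64_MIN};
    for (int64_t n : samples) {
        assert(div3(n) == n / 3);
        assert(rem3(n) == n % 3);
    }
}
```

The 64-bit high multiply is the step that fans out into several 32-bit multiplies and carry additions, so somewhat more than 2x growth over the int32 case seems plausible under this approach, but the sketch alone does not account for a 10x difference.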
The above Compiler Explorer link shows the same happening with i / 3 instead of i % 3.