These functions:
void shl_u8(uint8_t* dst, uint64_t c) {
  *dst = 1 << (c&7);
}
void shr_u8(uint8_t* dst, uint64_t c) {
  *dst = 0xaa >> (c&7);
}
compiled by clang with -O3 -march=haswell, produce:
shl_u8:
        mov     rcx, rsi
        and     cl, 7
        mov     al, 1
        shl     al, cl
        mov     byte ptr [rdi], al
        ret
shr_u8:
        mov     rcx, rsi
        and     cl, 7
        mov     al, -86
        shr     al, cl
        mov     byte ptr [rdi], al
        ret
but they could use shlx and shrx, as gcc does, e.g.:
shl_u8:
        and     esi, 7
        mov     eax, 1
        shlx    esi, eax, esi
        mov     BYTE PTR [rdi], sil
        ret
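This matters even more in a loop. The legacy shl/shr overwrite their operand, so clang's version ends up reloading the constant every iteration, whereas shlx/shrx write to a separate destination register and can reuse a constant loaded outside the loop. As a result, clang's version takes 4 uops per iteration on Haswell, versus 1 uop per iteration for gcc's.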
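For reference, a minimal sketch of the kind of loop where this might show up. The function names `shl_u8_loop` and `shr_u8_loop` and the array-based signature are illustrative assumptions, not the code the measurement above was taken from:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop variants of shl_u8/shr_u8 (names are illustrative).
 * The shifted constant is loop-invariant; only the count changes, so a
 * three-operand shift could keep the constant in one register. */
void shl_u8_loop(uint8_t *dst, const uint64_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(1 << (c[i] & 7));
    }
}

void shr_u8_loop(uint8_t *dst, const uint64_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(0xaa >> (c[i] & 7));
    }
}
```

Note that at -O3 a compiler may vectorize or otherwise transform a loop this simple, so the scalar shift may not appear in the output. The generated assembly needs to be inspected to confirm it exercises the case described here.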