Description
When we know that a shift amount is at least 32, we can translate a 64-bit shift into a single 32-bit shift and set the lower register to 0.
Clang fails to do this optimization in most cases. This happens at least on ARM, RISC-V and x86, and probably on other targets as well.
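For reference, this is roughly the single-shift lowering I would expect when the shift amount is known to be at least 32 (my own sketch, not clang output; the name expected_shift_ge32 and the 32 <= s < 64 assumption are mine):
#include <cstdint>

uint64_t expected_shift_ge32(uint64_t x, int s) {
    // Assumes 32 <= s < 64: the low half of the result is always 0 and the
    // high half is a single 32-bit shift of the low input word.
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = lo << (s - 32);
    return static_cast<uint64_t>(hi) << 32;
}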
Here are some examples (godbolt):
#include <cstdint>

uint64_t shift_assume_ge32(uint64_t x, int s) {
    __builtin_assume(s >= 32);
    return x << s;
}

uint64_t shift_assume_bit5(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    return x << s;
}

uint64_t shift_set_bit5(uint64_t x, int s) {
    s |= 32;
    return x << s;
}

uint64_t shift_set_bit5_assume_set(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    s |= 32;
    return x << s;
}

uint64_t shift_32_plus(uint64_t x, int s) {
    __builtin_assume(s >= 0);
    s += 32;
    return x << s;
}

uint64_t shift_lt32(uint64_t x, int s) {
    __builtin_assume(s < 32);
    return x << s;
}

uint64_t shift_mask31(uint64_t x, int s) {
    s &= 31;
    return x << s;
}

uint64_t shift_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    return x << s;
}

uint64_t shift_mask31_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    s &= 31;
    return x << s;
}
Assuming that the shift amount is at least 32 or that the 5th bit is set has no effect (shift_assume_ge32() and shift_assume_bit5()). Similarly, assuming it is less than 32 or that the bit is unset also has no effect (shift_lt32() and shift_assume_masked()). Actually setting the 5th bit or masking by 31 (shift_set_bit5() and shift_mask31()) results in correctly simplified output, unless we additionally assume the value is already in range (shift_set_bit5_assume_set() and shift_mask31_assume_masked()): then the bit setting/masking is correctly removed as redundant, but clang forgets the range info and emits a general shift again.
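For the in-range case, the correctly simplified output corresponds to the usual branch-free two-word shift, roughly this (again my own sketch; the name and the 0 <= s < 32 assumption are mine):
uint64_t expected_shift_lt32(uint64_t x, int s) {
    // Assumes 0 <= s < 32: no range check is needed, and writing the carry
    // term as (lo >> 1) >> (31 - s) avoids an undefined 32-bit shift by 32
    // when s == 0.
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t new_hi = (hi << s) | ((lo >> 1) >> (31 - s));
    uint32_t new_lo = lo << s;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}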
Sometimes we have a conditional shift by 32, which on a 32-bit target is just a conditional move, but clang replaces it with a general 64-bit shift that sometimes gets only partially simplified. In particular, on RISC-V it fails to realize that the left shifts are no-ops.
uint64_t conditional_shift32(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return b ? x << 32 : x;
}

uint64_t shift_shifted_condition(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return x << (32 * b);
}

uint64_t conditional_shift_bit5(uint64_t x, unsigned b) {
    return (b & 32) ? x << 32 : x;
}

uint64_t conditional_shift_bit5_assume(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return (b & 32) ? x << 32 : x;
}

uint64_t conditional_shift_assume_bit5(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return b ? x << 32 : x;
}
For conditional_shift_bit5() and conditional_shift_bit5_assume() clang sees that it is just a shift by b, but in conditional_shift_assume_bit5() it fails, even though the only difference is the missing & 32, which gets removed anyway.
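What I would expect for conditional_shift32() and friends is just a pair of selects between the two halves, with no shifter involved at all (my own sketch; the name is illustrative and b is assumed to be 0 or 1):
uint64_t expected_conditional_shift32(uint64_t x, unsigned b) {
    // Assumes b <= 1: selecting between x and x << 32 only moves the low
    // word into the high word and zeroes the low word, i.e. two conditional
    // moves on a 32-bit target.
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t new_hi = b ? lo : hi;
    uint32_t new_lo = b ? 0u : lo;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}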
The above also made me realize that the 64-bit shift lowering for 32-bit RISC-V is super weird:
shift(unsigned long long, int):
        addi    a4, a2, -32
        sll     a3, a0, a2
        bltz    a4, .LBB0_2
        mv      a1, a3
        srai    a0, a4, 31
        and     a0, a0, a3
        ret
.LBB0_2:
        sll     a1, a1, a2
        not     a2, a2
        srli    a0, a0, 1
        srl     a0, a0, a2
        or      a1, a1, a0
        srai    a0, a4, 31
        and     a0, a0, a3
        ret
The srai + and before each ret are totally useless: on the first path a4 is non-negative, so the srai produces 0 and the result is always 0; on the second path the srai produces all ones, so the and just copies a3.
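Transcribed to C the lowering reads roughly as follows (my own transcription, assuming 0 <= s < 64 and srai-style arithmetic right shift of negative values), which makes the redundancy visible:
uint64_t riscv_branchy_shift(uint64_t x, int s) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    int a4 = s - 32;
    uint32_t a3 = lo << (s & 31);
    uint32_t new_hi;
    if (a4 >= 0) {
        // s >= 32: the high word is just the low word shifted left.
        new_hi = a3;
    } else {
        // s < 32: combine the shifted high word with the carried-out low bits.
        new_hi = (hi << (s & 31)) | ((lo >> 1) >> (~s & 31));
    }
    // The questionable common tail: a4 >> 31 is 0 when s >= 32 and all ones
    // when s < 32, so this could simply be li a0, 0 / mv a0, a3 on the two paths.
    uint32_t new_lo = static_cast<uint32_t>(a4 >> 31) & a3;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}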
With the Zicond extension it is slightly better:
shift(unsigned long long, int):
        sll     a1, a1, a2
        not     a3, a2
        srli    a4, a0, 1
        sll     a0, a0, a2
        addi    a2, a2, -32
        srl     a3, a4, a3
        slti    a2, a2, 0
        or      a1, a1, a3
        czero.nez a3, a0, a2
        czero.eqz a1, a1, a2
        or      a1, a1, a3
        czero.eqz a0, a0, a2
        ret
The addi is still there, even though it could have been combined with the slti. Also, a minor nitpick: if the srli were placed after the srl, we could avoid using a4 and use a compressible srli.
This is how I would write it:
shift(unsigned long long, int):
        sll     a1, a1, a2
        not     a3, a2
        srl     a3, a0, a3
        sll     a0, a0, a2
        srli    a3, a3, 1
        slti    a2, a2, 32
        or      a1, a1, a3
        czero.nez a3, a0, a2
        czero.eqz a1, a1, a2
        or      a1, a1, a3
        czero.eqz a0, a0, a2
(The slti could also be replaced with srli a2, a2, 5, which has a compressed version whereas, if I'm not mistaken, slti does not; that computes s >= 32 instead of s < 32, so the czero conditions would have to be swapped.)
This can be rewritten to use only base instructions at the cost of just two more instructions:
shift(unsigned long long, int):
        sll     a1, a1, a2
        not     a3, a2
        srl     a3, a0, a3
        sll     a0, a0, a2
        srli    a3, a3, 1
        addi    a2, a2, -32
        srai    a2, a2, 31
        or      a1, a1, a3
        and     a1, a1, a2
        not     a3, a2
        and     a3, a3, a0
        or      a1, a1, a3
        and     a0, a0, a2
(The addi could be replaced with slli a2, a2, 26, which moves bit 5 into the sign position; that inverts the sense of the mask, so the subsequent and/not uses would swap, and I'm not sure which variant is better.)
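In C the same branchless computation looks like this (my sketch; 0 <= s < 64 assumed, and the mask relies on arithmetic right shift of a negative value, like srai):
uint64_t riscv_branchless_shift(uint64_t x, int s) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t hi_if_lt32 = (hi << (s & 31)) | ((lo >> (~s & 31)) >> 1);
    uint32_t lo_shifted = lo << (s & 31);
    uint32_t lt32_mask  = static_cast<uint32_t>((s - 32) >> 31); // all ones iff s < 32
    uint32_t new_hi = (hi_if_lt32 & lt32_mask) | (lo_shifted & ~lt32_mask);
    uint32_t new_lo = lo_shifted & lt32_mask;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}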
To me this is preferable to a branchy version from a side-channel perspective alone, since it is highly unintuitive that a shift introduces a branch.