Description
When we know that a shift amount is at least 32, we can translate a 64-bit shift into a single 32-bit shift and set the lower register to 0.
Clang fails to do this optimization in most cases. This happens at least on ARM, RISC-V and x86, and probably on other targets as well.
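For reference, this is roughly the single-shift lowering I would expect when the shift amount is known to be at least 32 (my own sketch, not clang output; the name expected_shift_ge32 and the 32 <= s < 64 assumption are mine):
#include <cstdint>

uint64_t expected_shift_ge32(uint64_t x, int s) {
    // Assumes 32 <= s < 64: the low half of the result is always 0 and the
    // high half is a single 32-bit shift of the low input word.
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = lo << (s - 32);
    return static_cast<uint64_t>(hi) << 32;
}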
Here are some examples (godbolt):
#include <cstdint>

uint64_t shift_assume_ge32(uint64_t x, int s) {
    __builtin_assume(s >= 32);
    return x << s;
}

uint64_t shift_assume_bit5(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    return x << s;
}

uint64_t shift_set_bit5(uint64_t x, int s) {
    s |= 32;
    return x << s;
}

uint64_t shift_set_bit5_assume_set(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    s |= 32;
    return x << s;
}

uint64_t shift_32_plus(uint64_t x, int s) {
    __builtin_assume(s >= 0);
    s += 32;
    return x << s;
}

uint64_t shift_lt32(uint64_t x, int s) {
    __builtin_assume(s < 32);
    return x << s;
}

uint64_t shift_mask31(uint64_t x, int s) {
    s &= 31;
    return x << s;
}

uint64_t shift_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    return x << s;
}

uint64_t shift_mask31_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    s &= 31;
    return x << s;
}
Assuming that the shift amount is at least 32 or that the 5th bit is set has no effect (shift_assume_ge32() and shift_assume_bit5()). Similarly, assuming it is less than 32 or that the bit is unset also has no effect (shift_lt32() and shift_assume_masked()). Actually setting the 5th bit or masking by 31 (shift_set_bit5() and shift_mask31()) results in correctly simplified output, unless we additionally assume the value is already in range (shift_set_bit5_assume_set() and shift_mask31_assume_masked()): then the bit setting/masking is correctly removed as redundant, but clang forgets the range info and emits a general shift again.
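For the in-range case, the correctly simplified output corresponds to the usual branch-free two-word shift, roughly this (again my own sketch; the name and the 0 <= s < 32 assumption are mine):
uint64_t expected_shift_lt32(uint64_t x, int s) {
    // Assumes 0 <= s < 32: no range check is needed, and writing the carry
    // term as (lo >> 1) >> (31 - s) avoids an undefined 32-bit shift by 32
    // when s == 0.
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t new_hi = (hi << s) | ((lo >> 1) >> (31 - s));
    uint32_t new_lo = lo << s;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}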
Sometimes we have a conditional shift by 32, which on a 32-bit target is just a conditional move, but clang replaces it with a general 64-bit shift that sometimes gets only partially simplified. In particular, on RISC-V it fails to realize that the left shifts are no-ops.
uint64_t conditional_shift32(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return b ? x << 32 : x;
}

uint64_t shift_shifted_condition(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return x << (32 * b);
}

uint64_t conditional_shift_bit5(uint64_t x, unsigned b) {
    return (b & 32) ? x << 32 : x;
}

uint64_t conditional_shift_bit5_assume(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return (b & 32) ? x << 32 : x;
}

uint64_t conditional_shift_assume_bit5(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return b ? x << 32 : x;
}
For conditional_shift_bit5() and conditional_shift_bit5_assume() clang sees that it is just a shift by b, but in conditional_shift_assume_bit5() it fails, even though the only difference is the missing & 32, which gets removed anyway.
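What I would expect for conditional_shift32() and friends is just a pair of selects between the two halves, with no shifter involved at all (my own sketch; the name is illustrative and b is assumed to be 0 or 1):
uint64_t expected_conditional_shift32(uint64_t x, unsigned b) {
    // Assumes b <= 1: selecting between x and x << 32 only moves the low
    // word into the high word and zeroes the low word, i.e. two conditional
    // moves on a 32-bit target.
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t new_hi = b ? lo : hi;
    uint32_t new_lo = b ? 0u : lo;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}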
The above also made me realize that the 64-bit shift lowering for 32-bit RISC-V is super weird:
shift(unsigned long long, int):
        addi    a4, a2, -32
        sll     a3, a0, a2
        bltz    a4, .LBB0_2
        mv      a1, a3
        srai    a0, a4, 31
        and     a0, a0, a3
        ret
.LBB0_2:
        sll     a1, a1, a2
        not     a2, a2
        srli    a0, a0, 1
        srl     a0, a0, a2
        or      a1, a1, a0
        srai    a0, a4, 31
        and     a0, a0, a3
        ret
The srai + and before each ret are totally useless: on the first path a4 is non-negative, so the srai produces 0 and the result is always 0; on the second path the srai produces all ones, so the and just copies a3.
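Transcribed to C the lowering reads roughly as follows (my own transcription, assuming 0 <= s < 64 and srai-style arithmetic right shift of negative values), which makes the redundancy visible:
uint64_t riscv_branchy_shift(uint64_t x, int s) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    int a4 = s - 32;
    uint32_t a3 = lo << (s & 31);
    uint32_t new_hi;
    if (a4 >= 0) {
        // s >= 32: the high word is just the low word shifted left.
        new_hi = a3;
    } else {
        // s < 32: combine the shifted high word with the carried-out low bits.
        new_hi = (hi << (s & 31)) | ((lo >> 1) >> (~s & 31));
    }
    // The questionable common tail: a4 >> 31 is 0 when s >= 32 and all ones
    // when s < 32, so this could simply be li a0, 0 / mv a0, a3 on the two paths.
    uint32_t new_lo = static_cast<uint32_t>(a4 >> 31) & a3;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}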
With the Zicond extension it is slightly better:
shift(unsigned long long, int):
        sll     a1, a1, a2
        not     a3, a2
        srli    a4, a0, 1
        sll     a0, a0, a2
        addi    a2, a2, -32
        srl     a3, a4, a3
        slti    a2, a2, 0
        or      a1, a1, a3
        czero.nez a3, a0, a2
        czero.eqz a1, a1, a2
        or      a1, a1, a3
        czero.eqz a0, a0, a2
        ret
The addi is still there, even though it could have been combined with the slti. Also, a minor nitpick: if the srli were placed after the srl, we could avoid using a4 and use a compressible srli.
This is how I would write it:
shift(unsigned long long, int):
        sll     a1, a1, a2
        not     a3, a2
        srl     a3, a0, a3
        sll     a0, a0, a2
        srli    a3, a3, 1
        slti    a2, a2, 32
        or      a1, a1, a3
        czero.nez a3, a0, a2
        czero.eqz a1, a1, a2
        or      a1, a1, a3
        czero.eqz a0, a0, a2
(The slti could also be replaced with srli a2, a2, 5, which has a compressed version whereas, if I'm not mistaken, slti does not; that computes s >= 32 instead of s < 32, so the czero conditions would have to be swapped.)
This can be rewritten to use only base instructions at the cost of just two more instructions:
shift(unsigned long long, int):
        sll     a1, a1, a2
        not     a3, a2
        srl     a3, a0, a3
        sll     a0, a0, a2
        srli    a3, a3, 1
        addi    a2, a2, -32
        srai    a2, a2, 31
        or      a1, a1, a3
        and     a1, a1, a2
        not     a3, a2
        and     a3, a3, a0
        or      a1, a1, a3
        and     a0, a0, a2
(The addi could be replaced with slli a2, a2, 26, which moves bit 5 into the sign position; that inverts the sense of the mask, so the subsequent and/not uses would swap, and I'm not sure which variant is better.)
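In C the same branchless computation looks like this (my sketch; 0 <= s < 64 assumed, and the mask relies on arithmetic right shift of a negative value, like srai):
uint64_t riscv_branchless_shift(uint64_t x, int s) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t hi_if_lt32 = (hi << (s & 31)) | ((lo >> (~s & 31)) >> 1);
    uint32_t lo_shifted = lo << (s & 31);
    uint32_t lt32_mask  = static_cast<uint32_t>((s - 32) >> 31); // all ones iff s < 32
    uint32_t new_hi = (hi_if_lt32 & lt32_mask) | (lo_shifted & ~lt32_mask);
    uint32_t new_lo = lo_shifted & lt32_mask;
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}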
To me this is preferable to a branchy version from a side-channel perspective alone, since it is highly unintuitive that a shift introduces a branch.