64 bit shifts on 32 bit targets fail to utilize shift range information #142046

Open
@tom-rein

Description

When the shift amount is known to be at least 32, a 64-bit shift can be lowered to a single 32-bit shift, with the lower register set to 0.
Clang fails to perform this optimization in most cases. This happens at least on ARM, RISC-V and x86, and probably on other 32-bit targets as well.
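For reference, the rewrite I would expect can be expressed in C++ on the two 32-bit halves roughly as follows (a hand-written sketch, not anything clang emits; the function name and the lo/hi split are just for illustration):

#include <cstdint>

// Sketch only: a 64-bit left shift whose amount is known to be >= 32
// needs just one 32-bit shift; the low half of the result is always 0.
uint64_t shift64_known_ge32(uint64_t x, int s) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = lo << (s - 32);            // single 32-bit shift
    return static_cast<uint64_t>(hi) << 32;  // low word is 0
}
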
Here are some examples (godbolt):

#include <cstdint>

uint64_t shift_assume_ge32(uint64_t x, int s) {
    __builtin_assume(s >= 32);
    return x << s;
}

uint64_t shift_assume_bit5(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    return x << s;
}

uint64_t shift_set_bit5(uint64_t x, int s) {
    s |= 32;
    return x << s;
}

uint64_t shift_set_bit5_assume_set(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    s |= 32;
    return x << s;
}

uint64_t shift_32_plus(uint64_t x, int s) {
    __builtin_assume(s >= 0);
    s += 32;
    return x << s;
}

uint64_t shift_lt32(uint64_t x, int s) {
    __builtin_assume(s < 32);
    return x << s;
}

uint64_t shift_mask31(uint64_t x, int s) {
    s &= 31;
    return x << s;
}

uint64_t shift_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    return x << s;
}

uint64_t shift_mask31_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    s &= 31;
    return x << s;
}

Assuming that the shift amount is at least 32, or that bit 5 is set, has no effect (shift_assume_ge32() and shift_assume_bit5()).
Similarly, assuming it is less than 32, or that bit 5 is clear, also has no effect (shift_lt32() and shift_assume_masked()).
Actually setting bit 5 or masking with 31 (shift_set_bit5() and shift_mask31()) results in correctly simplified output, unless we additionally assume the value is already in range (shift_set_bit5_assume_set() and shift_mask31_assume_masked()): then the now-redundant or/and is correctly removed, but clang forgets the range information and emits a general shift again.

Sometimes we have a conditional shift by 32, which on a 32-bit target is just a conditional move, but clang lowers it to a general 64-bit shift, which sometimes gets only partially simplified. In particular, on RISC-V it fails to realize that the left shifts are no-ops.

uint64_t conditional_shift32(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return b ? x << 32 : x;
}

uint64_t shift_shifted_condition(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return x << (32 * b);
}

uint64_t conditional_shift_bit5(uint64_t x, unsigned b) {
    return (b & 32) ? x << 32 : x;
}

uint64_t conditional_shift_bit5_assume(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return (b & 32) ? x << 32 : x;
}

uint64_t conditional_shift_assume_bit5(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return b ? x << 32 : x;
}

For conditional_shift_bit5() and conditional_shift_bit5_assume(), clang sees that this is just a shift by b & 32, but for conditional_shift_assume_bit5() it fails, even though the only thing missing is the & 32 in the condition, which gets removed anyway.
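
On the 32-bit halves, the lowering I would expect for conditional_shift32 is just two selects / conditional moves, roughly as follows (again a hand-written sketch with illustrative names, not actual clang output):

#include <cstdint>

// Sketch only: b is 0 or 1, so b ? x << 32 : x is two conditional moves
// on the 32-bit halves; no general 64-bit shift is needed.
uint64_t conditional_shift32_cmov(uint64_t x, unsigned b) {
    uint32_t lo = static_cast<uint32_t>(x);
    uint32_t hi = static_cast<uint32_t>(x >> 32);
    uint32_t new_hi = b ? lo : hi;  // cmov
    uint32_t new_lo = b ? 0u : lo;  // cmov
    return (static_cast<uint64_t>(new_hi) << 32) | new_lo;
}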

The above also made me realize that the generic 64-bit shift on 32-bit RISC-V is super weird:

shift(unsigned long long, int):            # a0 = lo, a1 = hi, a2 = s
        addi    a4, a2, -32                # a4 = s - 32
        sll     a3, a0, a2                 # a3 = lo << (s & 31)
        bltz    a4, .LBB0_2                # branch if s < 32
        mv      a1, a3                     # s >= 32: hi = lo << (s - 32)
        srai    a0, a4, 31                 # always 0 here (a4 >= 0)
        and     a0, a0, a3                 # lo = 0
        ret
.LBB0_2:                                   # s < 32
        sll     a1, a1, a2                 # hi << s
        not     a2, a2                     # ~s
        srli    a0, a0, 1                  # lo >> 1
        srl     a0, a0, a2                 # lo >> (32 - s)
        or      a1, a1, a0                 # hi = (hi << s) | (lo >> (32 - s))
        srai    a0, a4, 31                 # always -1 here (a4 < 0)
        and     a0, a0, a3                 # lo = a3 = lo << s
        ret

The srai + and before each ret are completely useless: the result of the first pair is always 0, and the result of the second is always just a3 (the low word already computed by the sll).
With the Zicond extension it is slightly better:

shift(unsigned long long, int):            # a0 = lo, a1 = hi, a2 = s
        sll     a1, a1, a2                 # hi << s
        not     a3, a2                     # ~s
        srli    a4, a0, 1                  # lo >> 1
        sll     a0, a0, a2                 # lo << (s & 31)
        addi    a2, a2, -32                # s - 32
        srl     a3, a4, a3                 # lo >> (32 - s)  (for s < 32)
        slti    a2, a2, 0                  # 1 if s < 32, else 0
        or      a1, a1, a3                 # (hi << s) | (lo >> (32 - s))
        czero.nez       a3, a0, a2         # s < 32 ? 0 : lo << (s - 32)
        czero.eqz       a1, a1, a2         # s < 32 ? hi result : 0
        or      a1, a1, a3                 # select hi word
        czero.eqz       a0, a0, a2         # s < 32 ? lo << s : 0
        ret

The addi is still there, even though it could have been folded into the slti. Also, a minor nitpick: if the srli were placed after the srl, we could avoid using a4 and use a compressible srli.
This is how I would write it:

shift(unsigned long long, int):
        sll a1, a1, a2
        not a3, a2
        srl a3, a0, a3
        sll a0, a0, a2
        srli a3, a3, 1
        slti a2, a2, 32
        or a1, a1, a3
        czero.nez a3, a0, a2
        czero.eqz a1, a1, a2
        or a1, a1, a3
        czero.eqz a0, a0, a2

(The slti could also be replaced with srli a2, a2, 5, which has a compressed version, whereas if I'm not mistaken slti does not)
This can be rewritten to use only base instructions at the cost of just two more instructions:

shift(unsigned long long, int):
        sll a1, a1, a2
        not a3, a2
        srl a3, a0, a3
        sll a0, a0, a2
        srli a3, a3, 1
        addi a2, a2, -32
        srai a2, a2, 31
        or a1, a1, a3
        and a1, a1, a2
        not a3, a2
        and a3, a3, a0
        or a1, a1, a3
        and a0, a0, a2

(The addi could be replaced with slli a2, a2, 26, but I'm not sure which is better)
To me this is preferable to a branchy version from a side-channel perspective alone, since it is highly unintuitive that a shift introduces a branch.
