
portable_simd bitmask + select generates poor code on ARM64 #122376

Open
@orlp

Description

Consider the following piece of code:

#![feature(portable_simd)]

use core::simd::*;
type T = u32;

fn if_then_else64(mask: u64, if_true: &[T; 64], if_false: &[T; 64]) -> [T; 64] {
    let tv = Simd::<T, 64>::from_slice(if_true);
    let fv = Simd::<T, 64>::from_slice(if_false);
    let mv = Mask::<<T as SimdElement>::Mask, 64>::from_bitmask(mask);
    mv.select(tv, fv).to_array()
}

On Intel this generates decent code. With AVX-512 it uses masked moves; with only AVX2 it broadcasts the bitmask across a vector register, then uses vpand and vpcmpeqd against constant vectors like [1, 2, 4, 8, 16, 32, 64, 128] to build per-lane masks and blends if_true and if_false with vblendvps.

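For concreteness, here is that AVX2 strategy sketched as intrinsics for a single group of 8 lanes. This is my own illustration of the approach rather than the literal codegen, and the name blend8_avx2 is made up:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn blend8_avx2(mask_bits: u32, if_true: &[u32; 8], if_false: &[u32; 8]) -> [u32; 8] {
    use core::arch::x86_64::*;
    // Broadcast the 8 relevant mask bits into every 32-bit lane.
    let m = _mm256_set1_epi32(mask_bits as i32);
    // One distinct bit per lane.
    let bits = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);
    // vpand + vpcmpeqd: lanes whose bit is set become all-ones, the rest all-zeros.
    let lane_mask = _mm256_cmpeq_epi32(_mm256_and_si256(m, bits), bits);
    let tv = _mm256_loadu_si256(if_true.as_ptr() as *const __m256i);
    let fv = _mm256_loadu_si256(if_false.as_ptr() as *const __m256i);
    // vblendvps: pick each lane from tv where its mask is set, from fv otherwise.
    let blended = _mm256_castps_si256(_mm256_blendv_ps(
        _mm256_castsi256_ps(fv),
        _mm256_castsi256_ps(tv),
        _mm256_castsi256_ps(lane_mask),
    ));
    let mut out = [0u32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, blended);
    out
}
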
On ARM it's a different story. The compiler moves the mask one bit at a time into registers before finishing off with a series of non-trivial shuffle/comparison instructions and bsl.16b to finally do the blends. It is possible to apply essentially the same strategy as the Intel codegen by hand; here it is, coded manually for u32:

// Same imports as above, plus the element-wise comparison trait for `simd_eq`
// (e.g. `use core::simd::cmp::SimdPartialEq;` or `use core::simd::prelude::*;`).
fn if_then_else64_manual_u32(mut mask: u64, if_true: &[u32; 64], if_false: &[u32; 64]) -> [u32; 64] {
    let mut out = [0; 64];
    let mut offset = 0;
    // Handle the low 32 bits of the mask, then the high 32 bits.
    for _ in 0..2 {
        // One distinct bit per lane; shifted left by 4 after each group of lanes.
        let mut bit = Simd::<u32, 4>::from_array([1, 2, 4, 8]);
        let mv = Simd::<u32, 4>::splat(mask as u32);
        // Eight groups of 4 lanes cover 32 elements per outer iteration.
        for _ in 0..8 {
            let tv = Simd::<u32, 4>::from_slice(&if_true[offset..offset+4]);
            let fv = Simd::<u32, 4>::from_slice(&if_false[offset..offset+4]);
            // Lanes whose mask bit is set become all-ones, the rest all-zeros.
            let mv_full = (mv & bit).simd_eq(bit);
            let ret = mv_full.select(tv, fv);
            out[offset..offset+4].copy_from_slice(&ret[..]);
            bit = bit << 4;
            offset += 4;
        }
        mask >>= 32;
    }
    out
}

Note that the above isn't some novel trick; it's almost entirely a 1:1 translation of what the compiler already generates for AVX2, just with 4-wide instead of 8-wide registers. You can see for yourself how similar the assembly is between if_then_else64 on Intel and if_then_else64_manual_u32 on ARM.

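As a quick sanity check (a test of my own, with arbitrary inputs) that the manual version computes the same result as if_then_else64:

#[test]
fn manual_matches_portable_simd() {
    // Arbitrary but deterministic inputs.
    let if_true: [u32; 64] = core::array::from_fn(|i| i as u32);
    let if_false: [u32; 64] = core::array::from_fn(|i| 1000 + i as u32);
    let mask = 0xDEAD_BEEF_0123_4567u64;
    assert_eq!(
        if_then_else64(mask, &if_true, &if_false),
        if_then_else64_manual_u32(mask, &if_true, &if_false)
    );
}
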
On my Apple M1 machine the manual version is ~3.2x faster than if_then_else64, assuming all data is in cache. I would really like the compiler to generate this kind of code automatically on ARM, just like it does on Intel.
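
For reference, the inner 4-lane step of if_then_else64_manual_u32 corresponds roughly to the following NEON intrinsics sequence (a sketch of mine; the name blend4_neon is made up):

#[cfg(target_arch = "aarch64")]
fn blend4_neon(mask_bits: u32, if_true: &[u32; 4], if_false: &[u32; 4]) -> [u32; 4] {
    use core::arch::aarch64::*;
    // SAFETY: NEON is a baseline feature on AArch64 targets.
    unsafe {
        // Broadcast the mask bits and test one distinct bit per lane.
        let m = vdupq_n_u32(mask_bits);
        let bits = vld1q_u32([1u32, 2, 4, 8].as_ptr());
        let lane_mask = vceqq_u32(vandq_u32(m, bits), bits);
        let tv = vld1q_u32(if_true.as_ptr());
        let fv = vld1q_u32(if_false.as_ptr());
        // bsl: take bits from tv where lane_mask is set, from fv otherwise.
        let blended = vbslq_u32(lane_mask, tv, fv);
        let mut out = [0u32; 4];
        vst1q_u32(out.as_mut_ptr(), blended);
        out
    }
}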

Labels: A-LLVM, A-SIMD, A-codegen, C-bug, C-optimization, I-heavy, I-slow, O-AArch64, PG-portable-simd, T-compiler, T-libs
