Description
Consider the following piece of code:
#![feature(portable_simd)]
use core::simd::*;

type T = u32;

fn if_then_else64(mask: u64, if_true: &[T; 64], if_false: &[T; 64]) -> [T; 64] {
    let tv = Simd::<T, 64>::from_slice(if_true);
    let fv = Simd::<T, 64>::from_slice(if_false);
    let mv = Mask::<<T as SimdElement>::Mask, 64>::from_bitmask(mask);
    mv.select(tv, fv).to_array()
}
On Intel this generates decent code. If AVX-512 is available, it uses masked moves. If only AVX2 is available, it spreads the mask along a vector register and uses vpand and vpcmpeqd with vectors like [1, 2, 4, 8, 16, 32, 64, 128] to generate masks, which it then uses to blend if_true and if_false with vblendvps.
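To make the AVX2 strategy concrete, here is a minimal scalar sketch of one 8-lane step, written for stable Rust. It is only a model of the spread / AND / compare / blend sequence described above, not the code the compiler emits; the helper name blend8_model and the BITS constant are hypothetical names introduced for illustration.

```rust
/// Scalar model of one 8-lane blend step (hypothetical helper, not compiler output).
/// Lane `i` takes `if_true[i]` when bit `i` of `mask_byte` is set, else `if_false[i]`.
fn blend8_model(mask_byte: u8, if_true: &[u32; 8], if_false: &[u32; 8]) -> [u32; 8] {
    // Per-lane bit constants, the analogue of the [1, 2, 4, ..., 128] vector.
    const BITS: [u32; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    // "Spread" the mask: every lane holds a copy of the same mask byte.
    let spread = [mask_byte as u32; 8];
    let mut out = [0u32; 8];
    for lane in 0..8 {
        // AND + compare-equal analogue: all-ones if this lane's bit is set, else zero.
        let lane_mask = if spread[lane] & BITS[lane] == BITS[lane] {
            u32::MAX
        } else {
            0
        };
        // Blend analogue: an all-ones lane keeps if_true, an all-zeros lane keeps if_false.
        out[lane] = (if_true[lane] & lane_mask) | (if_false[lane] & !lane_mask);
    }
    out
}
```

For example, calling blend8_model with the mask byte 0b1000_0101 picks lanes 0, 2 and 7 from if_true and every other lane from if_false.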
On ARM it's a different story. It moves the mask one bit at a time into registers before finishing off with a bunch of non-trivial shuffle/comparison instructions and bsl.16b to finally do the blends. It is possible to apply essentially the exact same strategy as the code generated on Intel, here coded manually for u32:
fn if_then_else64_manual_u32(mut mask: u64, if_true: &[u32; 64], if_false: &[u32; 64]) -> [u32; 64] {
    let mut out = [0; 64];
    let mut offset = 0;
    // Process the 64-bit mask as two 32-bit halves.
    for _ in 0..2 {
        // One bit per lane; shifted left by 4 after every 4-element chunk.
        let mut bit = Simd::<u32, 4>::from_array([1, 2, 4, 8]);
        // Broadcast the low 32 mask bits to every lane.
        let mv = Simd::<u32, 4>::splat(mask as u32);
        for _ in 0..8 {
            let tv = Simd::<u32, 4>::from_slice(&if_true[offset..offset+4]);
            let fv = Simd::<u32, 4>::from_slice(&if_false[offset..offset+4]);
            // A lane is selected iff its own bit is set in the broadcast mask.
            let mv_full = (mv & bit).simd_eq(bit);
            let ret = mv_full.select(tv, fv);
            out[offset..offset+4].copy_from_slice(&ret[..]);
            bit = bit << 4;
            offset += 4;
        }
        mask >>= 32;
    }
    out
}
Note that the above isn't some novel trick or anything; it's almost entirely a 1:1 translation of what the compiler generates on AVX2, just using 4-wide instead of 8-wide registers. You can see for yourself how similar the Intel assembly for if_then_else64 and the ARM assembly for if_then_else64_manual_u32 are.
The above is ~3.2x faster on my Apple M1 machine than if_then_else64, assuming all data is in cache. I would really like it if the compiler could generate this code automatically, just like it does on Intel.
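When comparing the two versions, a plain scalar reference can help confirm that they agree. The following is a minimal sketch for stable Rust; the name if_then_else64_scalar is hypothetical, and it assumes the same bit order as the code above, where bit i of the mask controls element i.

```rust
/// Scalar reference (hypothetical helper): element `i` comes from `if_true`
/// when bit `i` of `mask` is set, and from `if_false` otherwise.
fn if_then_else64_scalar(mask: u64, if_true: &[u32; 64], if_false: &[u32; 64]) -> [u32; 64] {
    std::array::from_fn(|i| {
        if (mask >> i) & 1 == 1 {
            if_true[i]
        } else {
            if_false[i]
        }
    })
}
```

On a nightly toolchain, the output of this function could be compared against if_then_else64 and if_then_else64_manual_u32 over a range of masks to check that all three produce the same array.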