
portable_simd bitmask + select generates poor code on ARM64 #122376

Open
@orlp

Description

Consider the following piece of code:

#![feature(portable_simd)]

use core::simd::*;
type T = u32;

fn if_then_else64(mask: u64, if_true: &[T; 64], if_false: &[T; 64]) -> [T; 64] {
    let tv = Simd::<T, 64>::from_slice(if_true);
    let fv = Simd::<T, 64>::from_slice(if_false);
    let mv = Mask::<<T as SimdElement>::Mask, 64>::from_bitmask(mask);
    mv.select(tv, fv).to_array()
}

On Intel this generates decent code. With AVX-512 it uses masked moves; with only AVX2 it broadcasts the bitmask across a vector register, then uses vpand and vpcmpeqd against constant vectors like [1, 2, 4, 8, 16, 32, 64, 128] to build per-lane masks and blends if_true and if_false with vblendvps.

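For concreteness, here is that AVX2 strategy sketched as intrinsics for a single group of 8 lanes. This is my own illustration of the approach rather than the literal codegen, and the name blend8_avx2 is made up:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn blend8_avx2(mask_bits: u32, if_true: &[u32; 8], if_false: &[u32; 8]) -> [u32; 8] {
    use core::arch::x86_64::*;
    // Broadcast the 8 relevant mask bits into every 32-bit lane.
    let m = _mm256_set1_epi32(mask_bits as i32);
    // One distinct bit per lane.
    let bits = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);
    // vpand + vpcmpeqd: lanes whose bit is set become all-ones, the rest all-zeros.
    let lane_mask = _mm256_cmpeq_epi32(_mm256_and_si256(m, bits), bits);
    let tv = _mm256_loadu_si256(if_true.as_ptr() as *const __m256i);
    let fv = _mm256_loadu_si256(if_false.as_ptr() as *const __m256i);
    // vblendvps: pick each lane from tv where its mask is set, from fv otherwise.
    let blended = _mm256_castps_si256(_mm256_blendv_ps(
        _mm256_castsi256_ps(fv),
        _mm256_castsi256_ps(tv),
        _mm256_castsi256_ps(lane_mask),
    ));
    let mut out = [0u32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, blended);
    out
}
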
On ARM it's a different story. The compiler moves the mask one bit at a time into registers before finishing off with a series of non-trivial shuffle/comparison instructions and bsl.16b to finally do the blends. It is possible to apply essentially the same strategy as the Intel codegen by hand; here it is, coded manually for u32:

// Same imports as above, plus the element-wise comparison trait for `simd_eq`
// (e.g. `use core::simd::cmp::SimdPartialEq;` or `use core::simd::prelude::*;`).
fn if_then_else64_manual_u32(mut mask: u64, if_true: &[u32; 64], if_false: &[u32; 64]) -> [u32; 64] {
    let mut out = [0; 64];
    let mut offset = 0;
    // Handle the low 32 bits of the mask, then the high 32 bits.
    for _ in 0..2 {
        // One distinct bit per lane; shifted left by 4 after each group of lanes.
        let mut bit = Simd::<u32, 4>::from_array([1, 2, 4, 8]);
        let mv = Simd::<u32, 4>::splat(mask as u32);
        // Eight groups of 4 lanes cover 32 elements per outer iteration.
        for _ in 0..8 {
            let tv = Simd::<u32, 4>::from_slice(&if_true[offset..offset+4]);
            let fv = Simd::<u32, 4>::from_slice(&if_false[offset..offset+4]);
            // Lanes whose mask bit is set become all-ones, the rest all-zeros.
            let mv_full = (mv & bit).simd_eq(bit);
            let ret = mv_full.select(tv, fv);
            out[offset..offset+4].copy_from_slice(&ret[..]);
            bit = bit << 4;
            offset += 4;
        }
        mask >>= 32;
    }
    out
}

Note that the above isn't some novel trick; it's almost entirely a 1:1 translation of what the compiler already generates for AVX2, just with 4-wide instead of 8-wide registers. You can see for yourself how similar the assembly is between if_then_else64 on Intel and if_then_else64_manual_u32 on ARM.

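As a quick sanity check (a test of my own, with arbitrary inputs) that the manual version computes the same result as if_then_else64:

#[test]
fn manual_matches_portable_simd() {
    // Arbitrary but deterministic inputs.
    let if_true: [u32; 64] = core::array::from_fn(|i| i as u32);
    let if_false: [u32; 64] = core::array::from_fn(|i| 1000 + i as u32);
    let mask = 0xDEAD_BEEF_0123_4567u64;
    assert_eq!(
        if_then_else64(mask, &if_true, &if_false),
        if_then_else64_manual_u32(mask, &if_true, &if_false)
    );
}
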
On my Apple M1 machine the manual version is ~3.2x faster than if_then_else64, assuming all data is in cache. I would really like the compiler to generate this kind of code automatically on ARM, just like it does on Intel.
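
For reference, the inner 4-lane step of if_then_else64_manual_u32 corresponds roughly to the following NEON intrinsics sequence (a sketch of mine; the name blend4_neon is made up):

#[cfg(target_arch = "aarch64")]
fn blend4_neon(mask_bits: u32, if_true: &[u32; 4], if_false: &[u32; 4]) -> [u32; 4] {
    use core::arch::aarch64::*;
    // SAFETY: NEON is a baseline feature on AArch64 targets.
    unsafe {
        // Broadcast the mask bits and test one distinct bit per lane.
        let m = vdupq_n_u32(mask_bits);
        let bits = vld1q_u32([1u32, 2, 4, 8].as_ptr());
        let lane_mask = vceqq_u32(vandq_u32(m, bits), bits);
        let tv = vld1q_u32(if_true.as_ptr());
        let fv = vld1q_u32(if_false.as_ptr());
        // bsl: take bits from tv where lane_mask is set, from fv otherwise.
        let blended = vbslq_u32(lane_mask, tv, fv);
        let mut out = [0u32; 4];
        vst1q_u32(out.as_mut_ptr(), blended);
        out
    }
}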

Labels: A-LLVM, A-SIMD, A-codegen, C-bug, C-optimization, I-heavy, I-slow, O-AArch64, PG-portable-simd, T-compiler, T-libs
