Clang example: https://godbolt.org/z/ec4P4j78b, flags: -O3 -march=x86-64-v2. Not Clang-specific; the same behaviour occurs on Rust nightly.
#include <immintrin.h>
extern "C" __m128i shuffle_or(__m128i bytes, __m128i idxs) {
    return _mm_shuffle_epi8(bytes, _mm_or_si128(idxs, _mm_set1_epi8(112)));
}
The por of xmm1 with 112 (0b0111_0000) is a no-op and should be optimized out, as pshufb ignores bits 4-6 (zero-indexed) of each mask byte; only bits 0-3 select the source byte, and bit 7 zeroes the result:
.LCPI0_0:
        .zero   16,112
shuffle_or:
        por     xmm1, xmmword ptr [rip + .LCPI0_0]
        pshufb  xmm0, xmm1
        ret
Writing _mm_shuffle_epi8(bytes, _mm_set1_epi8(127)) in the source emits a pshufb with 15 in the assembly, so LLVM seems to be aware of this optimization on some level but fails to apply it here.
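For reference, the no-op claim can be checked without SSSE3 hardware using a scalar model of the 128-bit pshufb semantics. This is only a sketch: pshufb_model and or_with_112_is_noop are hypothetical helper names made up for this illustration, not anything from LLVM or the intrinsics headers, and the model assumes the documented per-byte behaviour (bit 7 zeroes the lane, bits 0-3 select the source byte).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 16>;

// Hypothetical scalar model of the 128-bit pshufb, one lane at a time:
// bit 7 of the index byte zeroes the output lane, otherwise bits 0-3
// select a source byte. Bits 4-6 never influence the result.
Bytes pshufb_model(const Bytes& src, const Bytes& idx) {
    Bytes out{};
    for (std::size_t i = 0; i < 16; ++i) {
        if (idx[i] & 0x80) {
            out[i] = 0;
        } else {
            out[i] = src[idx[i] & 0x0F];
        }
    }
    return out;
}

// Hypothetical check: for every possible index byte value, OR-ing the
// index with 112 (0b0111_0000) leaves the modelled shuffle unchanged.
bool or_with_112_is_noop() {
    Bytes src{};
    for (std::size_t i = 0; i < 16; ++i) {
        src[i] = static_cast<std::uint8_t>(0xA0 + i);  // distinct values per lane
    }
    for (int v = 0; v < 256; ++v) {
        Bytes idx{};
        Bytes idx_or{};
        idx.fill(static_cast<std::uint8_t>(v));
        idx_or.fill(static_cast<std::uint8_t>(v | 112));
        if (pshufb_model(src, idx) != pshufb_model(src, idx_or)) {
            return false;
        }
    }
    return true;
}
```

Under this model or_with_112_is_noop() should return true, and an index byte of 127 selects the same source byte as 15, which is consistent with the constant seen in the assembly for the _mm_set1_epi8(127) case.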