Same thing as #106256, but it also happens for the (avx2/avx512) permute[x]var intrinsics, while PR #106377 seems to fix it only for (v)pshufb specifically.
Godbolt examples: https://godbolt.org/z/MsTcx7qYc
The vector permute intrinsics ignore all index bits except the low bits needed to address an element of the vector, e.g.:
- vpermb only uses the low 4, 5, or 6 bits of each index byte for 128-, 256-, and 512-bit vectors respectively
- vpermw only uses the low 3, 4, or 5 bits of each 16-bit index element
- etc.
OR operations that only set these ignored index bits should be optimized out.
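To illustrate the masking described above, here is a minimal scalar model of vpermb/vpermw index handling. It is a sketch for illustration only: the `model_*` and `vpermb_or_is_redundant` helpers are hypothetical names, not LLVM or Intel APIs. It checks that OR-ing a constant whose bits all lie above the used index bits leaves the permute result unchanged, which is why such an OR is dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vpermb. nbytes is 16, 32 or 64
   (128/256/512-bit vectors); only the low 4, 5 or 6 bits of each
   index byte are used. dst must not alias src. */
static void model_vpermb(uint8_t *dst, const uint8_t *idx,
                         const uint8_t *src, size_t nbytes)
{
    assert(nbytes == 16 || nbytes == 32 || nbytes == 64);
    const uint8_t mask = (uint8_t)(nbytes - 1);
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = src[idx[i] & mask];
}

/* Hypothetical scalar model of vpermw. nwords is 8, 16 or 32;
   only the low 3, 4 or 5 bits of each 16-bit index are used. */
static void model_vpermw(uint16_t *dst, const uint16_t *idx,
                         const uint16_t *src, size_t nwords)
{
    assert(nwords == 8 || nwords == 16 || nwords == 32);
    const uint16_t mask = (uint16_t)(nwords - 1);
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[idx[i] & mask];
}

/* Returns true if OR-ing `extra` into every index byte leaves the
   vpermb result unchanged. This always holds when
   (extra & (nbytes - 1)) == 0, i.e. the OR only touches ignored bits. */
static bool vpermb_or_is_redundant(const uint8_t *idx, const uint8_t *src,
                                   size_t nbytes, uint8_t extra)
{
    uint8_t before[64], after[64], idx_or[64];
    for (size_t i = 0; i < nbytes; i++)
        idx_or[i] = (uint8_t)(idx[i] | extra);
    model_vpermb(before, idx, src, nbytes);
    model_vpermb(after, idx_or, src, nbytes);
    for (size_t i = 0; i < nbytes; i++)
        if (before[i] != after[i])
            return false;
    return true;
}
```

For example, with 512-bit vpermb an OR with 0xC0 is redundant, while an OR with 0x01 changes which element is selected and must be kept.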
This probably also applies to vpermt2 (e.g. _mm512_permutex2var_epi16), which uses one more index bit since it selects from two concatenated vectors.
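The two-source case can be modeled the same way. The sketch below is a hypothetical scalar model (the name `model_vpermt2w` is illustrative, not an existing API) of the word variant, following the documented behavior of _mm512_permutex2var_epi16(a, idx, b): the low bits pick an element and the next bit picks `b` over `a`. So, assuming this model, a 512-bit vpermt2w uses the low 6 bits of each index, one more than vpermw, and any OR that only sets bits 6..15 could be dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the two-source word permute
   (vpermt2w; _mm512_permutex2var_epi16(a, idx, b) in the 512-bit case).
   n is the number of words per source: 8, 16 or 32. Each index selects
   from the 2*n-word concatenation of a and b, so only its low 4, 5 or
   6 bits are used; the highest used bit chooses b instead of a.
   dst must not alias a or b. */
static void model_vpermt2w(uint16_t *dst, const uint16_t *a,
                           const uint16_t *idx, const uint16_t *b, size_t n)
{
    assert(n == 8 || n == 16 || n == 32);
    for (size_t i = 0; i < n; i++) {
        size_t sel = idx[i] & (2 * n - 1); /* keep only the used bits */
        dst[i] = (sel < n) ? a[sel] : b[sel - n];
    }
}
```

Running it with the same indices before and after OR-ing 0xFFC0 into each one gives identical results for n == 32, mirroring the single-source case.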
cc @RKSimon