Description
Given the following code:
define <16 x i16> @mulbyconst(<16 x i16> %"a") #0 {
top:
  %0 = mul <16 x i16> %"a", <i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4, i16 8, i16 4>
  ret <16 x i16> %0
}
LLVM compiles this to a single vpsllvw instruction with AVX512, but in the absence of AVX512 it instead compiles to two vpsllw instructions and a vpblendw (as shown in https://godbolt.org/z/PMehWerEd).
The issue is that although AVX2 CPUs are missing the vpsllvw instruction (because AVX2 is a bit of a mess), they do have the vpmullw instruction, so this could have compiled to a single vpmullw by an alternating vector of 8 and 4, the same constants that appear in the original mul. This missed optimization is especially annoying because LLVM went through a bunch of work to canonicalize the per-element multiplication by powers of 2 into a variable shift left, even though just leaving it as a multiply would have been more efficient.
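
To illustrate why a single lane-wise multiply is a valid replacement, here is a minimal scalar sketch in Python that models the 16 i16 lanes as a list. The helper names (`mul_lanes`, `shift_blend_lanes`, `check`) are hypothetical and only model the arithmetic; they are not LLVM or intrinsic APIs. The shift amounts 3 and 2 are assumed from 8 = 2^3 and 4 = 2^2.

```python
MASK16 = 0xFFFF  # each lane is an i16, so results wrap modulo 2**16

# Assumption: the same alternating constants as the mul in the IR above.
FACTORS = [8, 4] * 8


def mul_lanes(a, factors):
    """Model a lane-wise 16-bit multiply keeping the low 16 bits (vpmullw-style)."""
    return [(x * f) & MASK16 for x, f in zip(a, factors)]


def shift_blend_lanes(a, shift_even, shift_odd):
    """Model two uniform shifts followed by a blend of even and odd lanes."""
    shifted_even = [(x << shift_even) & MASK16 for x in a]
    shifted_odd = [(x << shift_odd) & MASK16 for x in a]
    return [
        shifted_even[i] if i % 2 == 0 else shifted_odd[i]
        for i in range(len(a))
    ]


def check(a):
    """Return True if both lowerings agree on every lane of the input."""
    return mul_lanes(a, FACTORS) == shift_blend_lanes(a, 3, 2)
```

Calling `check` on any list of 16 values in the range 0..65535 should return True, since multiplying by a power of two and shifting left by its exponent agree modulo 2^16.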