[VectorCombine][X86] Poor handling of compare-select patterns with AVX2 spoofing on AVX1 targets #67803

https://godbolt.org/z/Waonx44Mj

On AVX1-only targets we often encounter 'fake-AVX2' code for integer math, such as:
```c
#include <immintrin.h>

#if !defined(__AVX2__)
#define _mm256_cmpgt_epi32( a, b ) \
 _mm256_setr_m128i( \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ) ), \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ) ) )

#define _mm256_blendv_epi8( a, b, c ) \
 _mm256_setr_m128i( \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ), _mm256_extractf128_si256( (c), 0 ) ), \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ), _mm256_extractf128_si256( (c), 1 ) ) )
#endif

__m256i cmpsel_epi8(__m256i x, __m256i y, __m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpgt_epi32(x,y);
    return _mm256_blendv_epi8(a,b,cmp);
}
```
This is really poorly optimized, mainly due to all the bitcasts to/from the __m128i (<2 x i64>) types.

In particular we see this pattern a lot:
```ll
  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
  %shuffle.i.i = shufflevector <2 x i64> %3, <2 x i64> %4, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %7 = bitcast <4 x i64> %shuffle.i.i to <8 x i32>
```
We should be able to get VectorCombine to fold this to an <8 x i32> shufflevector instead. In fact, VectorCombine::foldBitcastShuf might handle this if we extend it to binary shuffles, with improved cost handling.
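
To illustrate why that fold is safe, here is a minimal scalar sketch in plain C (not VectorCombine code). It models each bitcast as a byte-wise reinterpretation via memcpy and checks that concatenating through <2 x i64> gives the same lanes as concatenating the <4 x i32> values directly. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Path A: the pattern seen in the IR today (hypothetical helper name). */
static void concat_via_i64(const int32_t lo[4], const int32_t hi[4],
                           int32_t out[8]) {
    int64_t lo64[2], hi64[2], wide[4];
    memcpy(lo64, lo, sizeof lo64);   /* bitcast <4 x i32> to <2 x i64> */
    memcpy(hi64, hi, sizeof hi64);   /* bitcast <4 x i32> to <2 x i64> */
    /* shufflevector with mask <0, 1, 2, 3> over the two <2 x i64> inputs */
    wide[0] = lo64[0];
    wide[1] = lo64[1];
    wide[2] = hi64[0];
    wide[3] = hi64[1];
    memcpy(out, wide, sizeof wide);  /* bitcast <4 x i64> to <8 x i32> */
}

/* Path B: the folded form, a single <8 x i32> concatenating shuffle. */
static void concat_direct(const int32_t lo[4], const int32_t hi[4],
                          int32_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = lo[i];
        out[i + 4] = hi[i];
    }
}

/* Returns 1 when both paths produce identical lanes. */
static int concat_paths_agree(const int32_t lo[4], const int32_t hi[4]) {
    int32_t a[8], b[8];
    concat_via_i64(lo, hi, a);
    concat_direct(lo, hi, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

This only demonstrates the value equivalence; whether the fold is profitable still depends on the cost model mentioned above.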

We also see:
```ll
  %2 = icmp sgt <8 x i32> %0, %1
  %cmp.i = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %sext.i = sext <4 x i1> %cmp.i to <4 x i32>
  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %cmp.i20 = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sext.i21 = sext <4 x i1> %cmp.i20 to <4 x i32>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
```
We've managed to combine to a single <8 x i32> icmp, but failed to rejoin the sign extensions of the compare result. We should be able to handle this in VectorCombine if we handle concatenation of casts (based on what we do in VectorCombine::foldShuffleOfBinops).
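
As a rough scalar model of the cast concatenation (again plain C with hypothetical helper names, not the actual transform), the sketch below checks that sign-extending each <4 x i1> half and concatenating gives the same result as a single sext of the whole <8 x i1> compare mask.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* sext i1 to i32: true becomes all ones (-1), false becomes 0. */
static int32_t sext_i1(bool bit) { return bit ? -1 : 0; }

/* Current IR shape: split the mask, sext each half, then concatenate. */
static void sext_halves_then_concat(const bool cmp[8], int32_t out[8]) {
    int32_t lo[4], hi[4];
    for (int i = 0; i < 4; ++i) {
        lo[i] = sext_i1(cmp[i]);      /* lanes 0..3 */
        hi[i] = sext_i1(cmp[i + 4]);  /* lanes 4..7 */
    }
    memcpy(out, lo, sizeof lo);
    memcpy(out + 4, hi, sizeof hi);
}

/* Desired shape: one sext of <8 x i1> to <8 x i32>. */
static void sext_whole(const bool cmp[8], int32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = sext_i1(cmp[i]);
}

/* Returns 1 when both shapes agree for the mask of x > y (icmp sgt). */
static int sext_paths_agree(const int32_t x[8], const int32_t y[8]) {
    bool cmp[8];
    int32_t a[8], b[8];
    for (int i = 0; i < 8; ++i)
        cmp[i] = x[i] > y[i];
    sext_halves_then_concat(cmp, a);
    sext_whole(cmp, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Because sext acts on each lane independently, moving it across a concatenating shuffle cannot change any lane, which is what makes the fold legal.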

- [x] Extend VectorCombine::foldBitcastShuf to handle length changing shuffles
- [x] Extend VectorCombine::foldBitcastShuf to handle binary shuffles
- [x] Add a VectorCombine::foldShuffleOfCasts similar to VectorCombine::foldShuffleOfBinops
