[VectorCombine][X86] Poor handling of compare-select patterns with AVX2 spoofing on AVX1 targets  #67803

Closed
@RKSimon

https://godbolt.org/z/Waonx44Mj

On AVX1-only targets we often encounter 'fake-AVX2' code for integer math, like:

#include <immintrin.h>

#if !defined(__AVX2__)
#define _mm256_cmpgt_epi32( a, b ) \
 _mm256_setr_m128i( \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ) ), \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ) ) )

#define _mm256_blendv_epi8( a, b, c ) \
 _mm256_setr_m128i( \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ), _mm256_extractf128_si256( (c), 0 ) ), \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ), _mm256_extractf128_si256( (c), 1 ) ) )
#endif

__m256i cmpsel_epi8(__m256i x, __m256i y, __m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpgt_epi32(x,y);
    return _mm256_blendv_epi8(a,b,cmp);
}

This is really poorly optimized, mainly due to all the bitcasts to/from the __m128i (<2 x i64>) types.

In particular we see this pattern a lot:

  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
  %shuffle.i.i = shufflevector <2 x i64> %3, <2 x i64> %4, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %7 = bitcast <4 x i64> %shuffle.i.i to <8 x i32>

We should be able to get VectorCombine to fold this to a single <8 x i32> shufflevector instead. In fact, VectorCombine::foldBitcastShuf might handle this if we extend it to binary shuffles, with improved cost handling.
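
To illustrate why that fold is value-preserving, here is a minimal scalar sketch in plain C (not LLVM code). It models the bitcast / concatenate / bitcast chain with memcpy and compares it against a direct <8 x i32> concatenation. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the pattern above: two <4 x i32> halves are bitcast to
   <2 x i64>, concatenated into <4 x i64>, then bitcast to <8 x i32>.
   (Hypothetical helper, for illustration only.) */
void concat_via_i64(const int32_t lo[4], const int32_t hi[4], int32_t out[8]) {
    int64_t lo64[2], hi64[2], wide[4];
    memcpy(lo64, lo, sizeof lo64);   /* bitcast <4 x i32> -> <2 x i64> */
    memcpy(hi64, hi, sizeof hi64);
    wide[0] = lo64[0];               /* shufflevector with mask <0,1,2,3> */
    wide[1] = lo64[1];
    wide[2] = hi64[0];
    wide[3] = hi64[1];
    memcpy(out, wide, sizeof wide);  /* bitcast <4 x i64> -> <8 x i32> */
}

/* Folded form: concatenate directly at i32 granularity, i.e. a single
   <8 x i32> shufflevector with mask <0,1,2,3,4,5,6,7>. */
void concat_direct(const int32_t lo[4], const int32_t hi[4], int32_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = lo[i];
        out[i + 4] = hi[i];
    }
}

/* Returns 1 when both forms produce identical bytes, 0 otherwise. */
int concat_forms_agree(const int32_t lo[4], const int32_t hi[4]) {
    int32_t a[8], b[8];
    concat_via_i64(lo, hi, a);
    concat_direct(lo, hi, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Because each bitcast only reinterprets the same bytes and the shuffle moves whole 64-bit pairs in order, the two forms agree for any input. The sketch says nothing about the cost-model question, which is the part that would need real work in VectorCombine.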

We also see:

  %2 = icmp sgt <8 x i32> %0, %1
  %cmp.i = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %sext.i = sext <4 x i1> %cmp.i to <4 x i32>
  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %cmp.i20 = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sext.i21 = sext <4 x i1> %cmp.i20 to <4 x i32>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>

We've managed to combine to a single <8 x i32> icmp, but failed to rejoin the sign extensions of the compare result. We should be able to handle this in VectorCombine if we handle concatenation of casts (based on what we do in VectorCombine::foldShuffleOfBinops).
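
As a rough scalar sketch of why concatenating the two casts is safe, the plain C below (not LLVM code) compares "split the compare, sign-extend each half, concatenate" against a single 8-lane sign extension. All names are hypothetical and used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* sext i1 -> i32: true becomes all-ones (-1), false becomes 0. */
static int32_t sext_i1(int bit) { return bit ? -1 : 0; }

/* Current form: one <8 x i1> compare, split into two <4 x i1> halves,
   each half sign-extended to <4 x i32>, then concatenated. */
void sext_per_half(const int32_t x[8], const int32_t y[8], int32_t out[8]) {
    int cmp[8];
    int32_t lo[4], hi[4];
    for (int i = 0; i < 8; ++i)
        cmp[i] = x[i] > y[i];            /* icmp sgt <8 x i32> */
    for (int i = 0; i < 4; ++i) {
        lo[i] = sext_i1(cmp[i]);         /* lanes 0..3 */
        hi[i] = sext_i1(cmp[i + 4]);     /* lanes 4..7 */
    }
    memcpy(out, lo, sizeof lo);
    memcpy(out + 4, hi, sizeof hi);
}

/* Desired form: a single sext <8 x i1> -> <8 x i32>. */
void sext_whole(const int32_t x[8], const int32_t y[8], int32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = sext_i1(x[i] > y[i]);
}

/* Returns 1 when both forms produce identical lanes, 0 otherwise. */
int sext_forms_agree(const int32_t x[8], const int32_t y[8]) {
    int32_t a[8], b[8];
    sext_per_half(x, y, a);
    sext_whole(x, y, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The cast is applied lane by lane, so doing it before or after the concatenation gives the same result. A real fold would presumably still need to check that both casts share an opcode and source type, and that the wider cast is not more expensive on the target.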

  • Extend VectorCombine::foldBitcastShuf to handle length changing shuffles
  • Extend VectorCombine::foldBitcastShuf to handle binary shuffles
  • Add a VectorCombine::foldShuffleOfCasts similar to VectorCombine::foldShuffleOfBinops
