Commit 0f32303

[SLP] Better estimate cost of no-op extracts on target vectors.
The motivation for this patch is to better estimate the cost of
extractelement instructions in cases where they are going to be free,
because the source vector can be used directly.

A simple example is

    %v1.lane.0 = extractelement <2 x double> %v.1, i32 0
    %v1.lane.1 = extractelement <2 x double> %v.1, i32 1

    %a.lane.0 = fmul double %v1.lane.0, %x
    %a.lane.1 = fmul double %v1.lane.1, %y

Currently we only consider the extracts free if there are no other users.

In this particular case, on AArch64, which can fit <2 x double> in a
vector register, the extracts should be free regardless of other users:
the source vector of the extracts will already be in a vector register,
so it can be used directly.

The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on
certain AArch64 CPUs.

It looks like this does not impact any code in
SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.

This originally regressed after D80773, so if there's a better
alternative to explore, I'd be more than happy to do that.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D99719
1 parent 3b48d84 commit 0f32303

4 files changed: +243 -79 lines changed

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Lines changed: 53 additions & 1 deletion
@@ -3450,6 +3450,58 @@ getVectorCallCosts(CallInst *CI, FixedVectorType *VecTy,
   return {IntrinsicCost, LibCost};
 }
 
+/// Compute the cost of creating a vector of type \p VecTy containing the
+/// extracted values from \p VL.
+static InstructionCost
+computeExtractCost(ArrayRef<Value *> VL, FixedVectorType *VecTy,
+                   TargetTransformInfo::ShuffleKind ShuffleKind,
+                   ArrayRef<int> Mask, TargetTransformInfo &TTI) {
+  unsigned NumOfParts = TTI.getNumberOfParts(VecTy);
+
+  if (ShuffleKind != TargetTransformInfo::SK_PermuteSingleSrc || !NumOfParts)
+    return TTI.getShuffleCost(ShuffleKind, VecTy, Mask);
+
+  bool AllConsecutive = true;
+  unsigned EltsPerVector = VecTy->getNumElements() / NumOfParts;
+  unsigned Idx = -1;
+  InstructionCost Cost = 0;
+
+  // Process extracts in blocks of EltsPerVector to check if the source vector
+  // operand can be re-used directly. If not, add the cost of creating a shuffle
+  // to extract the values into a vector register.
+  for (auto *V : VL) {
+    ++Idx;
+
+    // Reached the start of a new vector registers.
+    if (Idx % EltsPerVector == 0) {
+      AllConsecutive = true;
+      continue;
+    }
+
+    // Check all extracts for a vector register on the target directly
+    // extract values in order.
+    unsigned CurrentIdx = *getExtractIndex(cast<Instruction>(V));
+    unsigned PrevIdx = *getExtractIndex(cast<Instruction>(VL[Idx - 1]));
+    AllConsecutive &= PrevIdx + 1 == CurrentIdx &&
+                      CurrentIdx % EltsPerVector == Idx % EltsPerVector;
+
+    if (AllConsecutive)
+      continue;
+
+    // Skip all indices, except for the last index per vector block.
+    if ((Idx + 1) % EltsPerVector != 0 && Idx + 1 != VL.size())
+      continue;
+
+    // If we have a series of extracts which are not consecutive and hence
+    // cannot re-use the source vector register directly, compute the shuffle
+    // cost to extract the a vector with EltsPerVector elements.
+    Cost += TTI.getShuffleCost(
+        TargetTransformInfo::SK_PermuteSingleSrc,
+        FixedVectorType::get(VecTy->getElementType(), EltsPerVector));
+  }
+  return Cost;
+}
+
 InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {
   ArrayRef<Value*> VL = E->Scalars;
 

@@ -3490,7 +3542,7 @@ InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {
           isShuffle(VL, Mask);
       if (ShuffleKind.hasValue()) {
         InstructionCost Cost =
-            TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask);
+            computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI);
         for (auto *V : VL) {
           // If all users of instruction are going to be vectorized and this
           // instruction itself is not going to be vectorized, consider this