Commit 9e36df9

[SLP] Better estimate cost of no-op extracts on target vectors.
The motivation for this patch is to better estimate the cost of extractelement instructions in cases where they are going to be free, because the source vector can be used directly.

A simple example is

  %v1.lane.0 = extractelement <2 x double> %v.1, i32 0
  %v1.lane.1 = extractelement <2 x double> %v.1, i32 1

  %a.lane.0 = fmul double %v1.lane.0, %x
  %a.lane.1 = fmul double %v1.lane.1, %y

Currently we only consider the extracts free if there are no other users.

In this particular case, on AArch64 which can fit <2 x double> in a vector register, the extracts should be free, independently of other users, because the source vector of the extracts will be in a vector register directly, so it should be free to use the vector directly.

The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on certain AArch64 CPUs.

It looks like this does not impact any code in SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.

This originally regressed after D80773, so if there's a better alternative to explore, I'd be more than happy to do that.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D99719

(cherry-picked from 0f32303)
1 parent e7ab2c4 commit 9e36df9

4 files changed, +243 -79 lines changed
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

+53 -1
@@ -3418,6 +3418,58 @@ getVectorCallCosts(CallInst *CI, FixedVectorType *VecTy,
   return {IntrinsicCost, LibCost};
 }

+/// Compute the cost of creating a vector of type \p VecTy containing the
+/// extracted values from \p VL.
+static InstructionCost
+computeExtractCost(ArrayRef<Value *> VL, FixedVectorType *VecTy,
+                   TargetTransformInfo::ShuffleKind ShuffleKind,
+                   TargetTransformInfo &TTI) {
+  unsigned NumOfParts = TTI.getNumberOfParts(VecTy);
+
+  if (ShuffleKind != TargetTransformInfo::SK_PermuteSingleSrc || !NumOfParts)
+    return TTI.getShuffleCost(ShuffleKind, VecTy);
+
+  bool AllConsecutive = true;
+  unsigned EltsPerVector = VecTy->getNumElements() / NumOfParts;
+  unsigned Idx = -1;
+  InstructionCost Cost = 0;
+
+  // Process extracts in blocks of EltsPerVector to check if the source vector
+  // operand can be re-used directly. If not, add the cost of creating a shuffle
+  // to extract the values into a vector register.
+  for (auto *V : VL) {
+    ++Idx;
+
+    // Reached the start of a new vector registers.
+    if (Idx % EltsPerVector == 0) {
+      AllConsecutive = true;
+      continue;
+    }
+
+    // Check all extracts for a vector register on the target directly
+    // extract values in order.
+    unsigned CurrentIdx = *getExtractIndex(cast<Instruction>(V));
+    unsigned PrevIdx = *getExtractIndex(cast<Instruction>(VL[Idx - 1]));
+    AllConsecutive &= PrevIdx + 1 == CurrentIdx &&
+                      CurrentIdx % EltsPerVector == Idx % EltsPerVector;
+
+    if (AllConsecutive)
+      continue;
+
+    // Skip all indices, except for the last index per vector block.
+    if ((Idx + 1) % EltsPerVector != 0 && Idx + 1 != VL.size())
+      continue;
+
+    // If we have a series of extracts which are not consecutive and hence
+    // cannot re-use the source vector register directly, compute the shuffle
+    // cost to extract the a vector with EltsPerVector elements.
+    Cost += TTI.getShuffleCost(
+        TargetTransformInfo::SK_PermuteSingleSrc,
+        FixedVectorType::get(VecTy->getElementType(), EltsPerVector));
+  }
+  return Cost;
+}
+
 InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {
   ArrayRef<Value*> VL = E->Scalars;

@@ -3454,7 +3506,7 @@ InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {
       Optional<TargetTransformInfo::ShuffleKind> ShuffleKind = isShuffle(VL);
       if (ShuffleKind.hasValue()) {
         InstructionCost Cost =
-            TTI->getShuffleCost(ShuffleKind.getValue(), VecTy);
+            computeExtractCost(VL, VecTy, *ShuffleKind, *TTI);
         for (auto *V : VL) {
           // If all users of instruction are going to be vectorized and this
           // instruction itself is not going to be vectorized, consider this
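
To make the block-wise check in computeExtractCost easier to follow, here is a minimal standalone sketch of the same loop on plain integers. It is only a model, not LLVM code: the function name modelExtractCost, the Indices vector (standing in for the results of getExtractIndex on each element of VL) and the ShuffleCostPerBlock parameter (standing in for the TTI.getShuffleCost query) are all assumptions introduced for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of the loop in computeExtractCost.
// Indices[i] plays the role of *getExtractIndex(VL[i]).
// ShuffleCostPerBlock is an assumed constant replacing TTI.getShuffleCost.
unsigned modelExtractCost(const std::vector<unsigned> &Indices,
                          unsigned EltsPerVector,
                          unsigned ShuffleCostPerBlock) {
  bool AllConsecutive = true;
  unsigned Cost = 0;

  for (std::size_t Idx = 0; Idx < Indices.size(); ++Idx) {
    // First lane of a new register-sized block: reset the tracking flag.
    if (Idx % EltsPerVector == 0) {
      AllConsecutive = true;
      continue;
    }

    // The block stays free only while each extract index follows the
    // previous one and sits in the matching lane of the register.
    unsigned CurrentIdx = Indices[Idx];
    unsigned PrevIdx = Indices[Idx - 1];
    AllConsecutive &= PrevIdx + 1 == CurrentIdx &&
                      CurrentIdx % EltsPerVector == Idx % EltsPerVector;

    if (AllConsecutive)
      continue;

    // Charge at most once per block, at its last lane (or at the last
    // element overall when the final block is only partially filled).
    if ((Idx + 1) % EltsPerVector != 0 && Idx + 1 != Indices.size())
      continue;

    Cost += ShuffleCostPerBlock;
  }
  return Cost;
}
```

Under these assumptions, with EltsPerVector = 2 and a per-block cost of 1, the index sequence 0, 1 costs nothing because the source register can be used as is, while 1, 0 is charged one block shuffle. A sequence such as 0, 1, 3, 2 is charged only for its second block, which reflects the patch's intent of pricing each target register separately instead of pricing one shuffle over the whole vector type.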
