Commit 7f6bbb3

[RISCV][TTI] Reduce cost of a build_vector pattern (#108419)
This change is actually two related changes, but they are very hard to separate meaningfully: the second balances the first, yet doesn't do much good on its own.

First, we can reduce the cost of a build_vector pattern. Our current costing for this defers to generic insertelement costing, which isn't unreasonable but also isn't correct. While inserting N elements requires N-1 slides and N vmv.s.x, doing the full build_vector only requires N vslide1down. (Note there are other cases that our build vector lowering can do more cheaply; this is simply the easiest upper bound, and it appears to be "good enough" for SLP costing purposes.)

Second, we need to tell SLP that calls don't preserve vector registers. Without this, SLP will vectorize scalar code which performs e.g. 4 x float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the costing works out that this is in fact the optimal choice, except that we don't actually have a <2 x float> @exp, and we unroll during DAG. This would be fine (or at least cost neutral) except that the libcall for the scalar @exp clobbers all vector registers. So the net effect is that we added a bunch of spills that SLP had no idea about. Thankfully, AArch64 has a similar problem, and has taught SLP how to reason about spill cost once the right TTI hook is implemented.

Now, for some implications...

The SLP solution for spill costing has some inaccuracies. In particular, it basically just guesses whether an intrinsic will be lowered to a call or not, and can be wrong in both directions. It also has no mechanism to differentiate on calling convention.

This has the effect of making partial vectorization (i.e. starting in scalar) more profitable. In practice, the major effect of this is to make it more likely that SLP will vectorize part of a tree in an intersecting forest, and then vectorize the remaining tree once those uses have been removed.

This has the effect of biasing us slightly away from strided or indexed loads during vectorization, because the scalar cost is more accurately modeled and these instructions look relatively less profitable.
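
To make the instruction-count argument in the message concrete, here is a minimal sketch under a simplifying assumption that every vector instruction costs one unit. The function names are hypothetical and are not part of the patch; the real hook asks the target for the cost of vslide1down.vx rather than assuming 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumption: unit cost per instruction. This is an illustration only,
// not the cost table used by RISCVTTIImpl.

// Element-by-element insertion as described in the message:
// N vmv.s.x plus N-1 slides.
constexpr std::uint64_t insertSequenceCost(std::uint64_t NumElts) {
  return NumElts == 0 ? 0 : NumElts + (NumElts - 1);
}

// Whole build_vector: one vslide1down per element.
constexpr std::uint64_t buildVectorCost(std::uint64_t NumElts) {
  return NumElts;
}

// The patch only ever lowers the generic cost, so take the minimum.
constexpr std::uint64_t cappedInsertCost(std::uint64_t NumElts) {
  return std::min(insertSequenceCost(NumElts), buildVectorCost(NumElts));
}
```

Under this model a 4-element build costs 4 instead of 7, so the saving is N-1 units for an N-element vector.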
1 parent 02071a8 commit 7f6bbb3

14 files changed: +568 -434 lines changed

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

Lines changed: 51 additions & 0 deletions
@@ -616,6 +616,40 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
   return BaseT::getShuffleCost(Kind, Tp, Mask, CostKind, Index, SubTp);
 }

+static unsigned isM1OrSmaller(MVT VT) {
+  RISCVII::VLMUL LMUL = RISCVTargetLowering::getLMUL(VT);
+  return (LMUL == RISCVII::VLMUL::LMUL_F8 || LMUL == RISCVII::VLMUL::LMUL_F4 ||
+          LMUL == RISCVII::VLMUL::LMUL_F2 || LMUL == RISCVII::VLMUL::LMUL_1);
+}
+
+InstructionCost RISCVTTIImpl::getScalarizationOverhead(
+    VectorType *Ty, const APInt &DemandedElts, bool Insert, bool Extract,
+    TTI::TargetCostKind CostKind) {
+  if (isa<ScalableVectorType>(Ty))
+    return InstructionCost::getInvalid();
+
+  // A build_vector (which is m1 sized or smaller) can be done in no
+  // worse than one vslide1down.vx per element in the type. We could
+  // in theory do an explode_vector in the inverse manner, but our
+  // lowering today does not have a first class node for this pattern.
+  InstructionCost Cost = BaseT::getScalarizationOverhead(
+      Ty, DemandedElts, Insert, Extract, CostKind);
+  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
+  if (Insert && !Extract && LT.first.isValid() && LT.second.isVector() &&
+      Ty->getScalarSizeInBits() != 1) {
+    assert(LT.second.isFixedLengthVector());
+    MVT ContainerVT = TLI->getContainerForFixedLengthVector(LT.second);
+    if (isM1OrSmaller(ContainerVT)) {
+      InstructionCost BV =
+          cast<FixedVectorType>(Ty)->getNumElements() *
+          getRISCVInstructionCost(RISCV::VSLIDE1DOWN_VX, LT.second, CostKind);
+      if (BV < Cost)
+        Cost = BV;
+    }
+  }
+  return Cost;
+}
+
 InstructionCost
 RISCVTTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src, Align Alignment,
                                     unsigned AddressSpace,
@@ -767,6 +801,23 @@ InstructionCost RISCVTTIImpl::getStridedMemoryOpCost(
   return NumLoads * MemOpCost;
 }

+InstructionCost
+RISCVTTIImpl::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) {
+  // FIXME: This is a property of the default vector convention, not
+  // all possible calling conventions. Fixing that will require
+  // some TTI API and SLP rework.
+  InstructionCost Cost = 0;
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  for (auto *Ty : Tys) {
+    if (!Ty->isVectorTy())
+      continue;
+    Align A = DL.getPrefTypeAlign(Ty);
+    Cost += getMemoryOpCost(Instruction::Store, Ty, A, 0, CostKind) +
+            getMemoryOpCost(Instruction::Load, Ty, A, 0, CostKind);
+  }
+  return Cost;
+}
+
 // Currently, these represent both throughput and codesize costs
 // for the respective intrinsics. The costs in this table are simply
 // instruction counts with the following adjustments made:
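
The spill-cost hook in this file charges one store plus one reload for each vector value that is live across a call, and nothing for non-vector values. The following standalone sketch mirrors that shape without depending on LLVM. `LiveValue` and its cost fields are hypothetical stand-ins for `Type *` and for the results of `getMemoryOpCost`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an IR type plus its memory-op costs.
// In the patch these costs come from getMemoryOpCost, not from fields.
struct LiveValue {
  bool IsVector;
  unsigned StoreCost; // assumed cost of spilling before the call
  unsigned LoadCost;  // assumed cost of reloading after the call
};

// Sum spill + reload cost over the vector values live across a call.
unsigned costOfKeepingLiveOverCall(const std::vector<LiveValue> &Values) {
  unsigned Cost = 0;
  for (const LiveValue &V : Values) {
    if (!V.IsVector)
      continue; // non-vector values are skipped, as in the hook
    Cost += V.StoreCost + V.LoadCost;
  }
  return Cost;
}
```

With two live vectors whose store and load costs are (1, 1) and (2, 2), plus one scalar, the sketch returns 6; the scalar contributes nothing.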

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

Lines changed: 7 additions & 0 deletions
@@ -149,6 +149,11 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                  ArrayRef<const Value *> Args = {},
                                  const Instruction *CxtI = nullptr);

+  InstructionCost getScalarizationOverhead(VectorType *Ty,
+                                           const APInt &DemandedElts,
+                                           bool Insert, bool Extract,
+                                           TTI::TargetCostKind CostKind);
+
   InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                         TTI::TargetCostKind CostKind);

@@ -169,6 +174,8 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                  TTI::TargetCostKind CostKind,
                                  const Instruction *I);

+  InstructionCost getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys);
+
   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                    TTI::CastContextHint CCH,
                                    TTI::TargetCostKind CostKind,

llvm/test/Analysis/CostModel/RISCV/arith-fp.ll

Lines changed: 12 additions & 12 deletions
@@ -361,8 +361,8 @@ define void @frem() {
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F32 = frem float undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F64 = frem double undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1F32 = frem <1 x float> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V2F32 = frem <2 x float> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %V4F32 = frem <4 x float> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V2F32 = frem <2 x float> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V4F32 = frem <4 x float> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V8F32 = frem <8 x float> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 63 for instruction: %V16F32 = frem <16 x float> undef, undef
 ; CHECK-NEXT: Cost Model: Invalid cost for instruction: %NXV1F32 = frem <vscale x 1 x float> undef, undef
@@ -371,7 +371,7 @@ define void @frem() {
 ; CHECK-NEXT: Cost Model: Invalid cost for instruction: %NXV8F32 = frem <vscale x 8 x float> undef, undef
 ; CHECK-NEXT: Cost Model: Invalid cost for instruction: %NXV16F32 = frem <vscale x 16 x float> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1F64 = frem <1 x double> undef, undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V2F64 = frem <2 x double> undef, undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V2F64 = frem <2 x double> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %V4F64 = frem <4 x double> undef, undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V8F64 = frem <8 x double> undef, undef
 ; CHECK-NEXT: Cost Model: Invalid cost for instruction: %NXV1F64 = frem <vscale x 1 x double> undef, undef
@@ -412,9 +412,9 @@ define void @frem_f16() {
 ; ZVFH-LABEL: 'frem_f16'
 ; ZVFH-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F16 = frem half undef, undef
 ; ZVFH-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1F16 = frem <1 x half> undef, undef
-; ZVFH-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V2F16 = frem <2 x half> undef, undef
-; ZVFH-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %V4F16 = frem <4 x half> undef, undef
-; ZVFH-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V8F16 = frem <8 x half> undef, undef
+; ZVFH-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V2F16 = frem <2 x half> undef, undef
+; ZVFH-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V4F16 = frem <4 x half> undef, undef
+; ZVFH-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V8F16 = frem <8 x half> undef, undef
 ; ZVFH-NEXT: Cost Model: Found an estimated cost of 63 for instruction: %V16F16 = frem <16 x half> undef, undef
 ; ZVFH-NEXT: Cost Model: Found an estimated cost of 127 for instruction: %V32F16 = frem <32 x half> undef, undef
 ; ZVFH-NEXT: Cost Model: Invalid cost for instruction: %NXV1F16 = frem <vscale x 1 x half> undef, undef
@@ -428,9 +428,9 @@ define void @frem_f16() {
 ; ZVFHMIN-LABEL: 'frem_f16'
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %F16 = frem half undef, undef
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1F16 = frem <1 x half> undef, undef
-; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V2F16 = frem <2 x half> undef, undef
-; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %V4F16 = frem <4 x half> undef, undef
-; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %V8F16 = frem <8 x half> undef, undef
+; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V2F16 = frem <2 x half> undef, undef
+; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V4F16 = frem <4 x half> undef, undef
+; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V8F16 = frem <8 x half> undef, undef
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 63 for instruction: %V16F16 = frem <16 x half> undef, undef
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 127 for instruction: %V32F16 = frem <32 x half> undef, undef
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %NXV1F16 = frem <vscale x 1 x half> undef, undef
@@ -620,9 +620,9 @@ define void @fcopysign_f16() {
 ; ZVFHMIN-LABEL: 'fcopysign_f16'
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %F16 = call half @llvm.copysign.f16(half undef, half undef)
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V1F16 = call <1 x half> @llvm.copysign.v1f16(<1 x half> undef, <1 x half> undef)
-; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2F16 = call <2 x half> @llvm.copysign.v2f16(<2 x half> undef, <2 x half> undef)
-; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %V4F16 = call <4 x half> @llvm.copysign.v4f16(<4 x half> undef, <4 x half> undef)
-; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 23 for instruction: %V8F16 = call <8 x half> @llvm.copysign.v8f16(<8 x half> undef, <8 x half> undef)
+; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V2F16 = call <2 x half> @llvm.copysign.v2f16(<2 x half> undef, <2 x half> undef)
+; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V4F16 = call <4 x half> @llvm.copysign.v4f16(<4 x half> undef, <4 x half> undef)
+; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V8F16 = call <8 x half> @llvm.copysign.v8f16(<8 x half> undef, <8 x half> undef)
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 47 for instruction: %V16F16 = call <16 x half> @llvm.copysign.v16f16(<16 x half> undef, <16 x half> undef)
 ; ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 95 for instruction: %V32F16 = call <32 x half> @llvm.copysign.v32f16(<32 x half> undef, <32 x half> undef)
 ; ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %NXV1F16 = call <vscale x 1 x half> @llvm.copysign.nxv1f16(<vscale x 1 x half> undef, <vscale x 1 x half> undef)
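
As a sanity check on the updated CHECK lines, every row that changed in this file drops by exactly N-1 for an N-element vector. That matches replacing an assumed (2N-1)-unit insert sequence with N vslide1down.vx at unit cost. The sketch below copies the old and new costs from the diff; the struct and function names are hypothetical. Rows that did not change (for example `<8 x float>` at 31) are presumably types whose container is larger than m1 under the test's configuration, so the new cap does not apply to them.

```cpp
#include <array>
#include <cassert>

// One changed CHECK row: element count, old cost, new cost.
struct CostRow {
  unsigned NumElts;
  unsigned OldCost;
  unsigned NewCost;
};

// Values copied from the changed lines in arith-fp.ll.
const std::array<CostRow, 6> ChangedRows = {{
    {2, 7, 6}, {4, 15, 12}, {8, 31, 24}, // frem
    {2, 5, 4}, {4, 11, 8},  {8, 23, 16}, // copysign (ZVFHMIN)
}};

// (2N-1) - N == N-1 under the unit-cost assumption.
bool savesNMinusOne(const CostRow &R) {
  return R.OldCost - R.NewCost == R.NumElts - 1;
}

bool allChangedRowsMatch() {
  for (const CostRow &R : ChangedRows)
    if (!savesNMinusOne(R))
      return false;
  return true;
}
```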
