[SDAG][RISCV] Don't promote VP_REDUCE_{FADD,FMUL} #111000
Conversation
In https://reviews.llvm.org/D153848, promotion was added for a variety of f16 ops with zvfhmin, including VP reductions. However, I don't believe it's correct to promote f16 fadd or fmul reductions to f32, since we need to round the intermediate results. Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two different results depending on whether we compiled with +zvfh or +zvfhmin, for example with a 3 element reduction:

; v9 = [0.1563, 5.97e-8, 0.00006104]

; zvfh
vsetivli x0, 3, e16, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v9, v8
vfmv.f.s fa0, v8
; fa0 = 0.1563

; zvfhmin
vsetivli x0, 3, e16, m1, ta, ma
vfwcvt.f.f.v v10, v9
vsetivli x0, 3, e32, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v10, v8
vfmv.f.s fa0, v8
fcvt.h.s fa0, fa0
; fa0 = 0.1564

The same thing happens with reassociative reductions, e.g. vfredusum.vs, and this also applies to bf16. I couldn't find anything in the LangRef for reductions that suggests the excess precision is allowed. There may be something we can do in Clang with -fexcess-precision=fast, but I haven't looked into this yet.

I presume the same precision issue occurs with fmul, but not with fmin/fmax/fminimum/fmaximum.

I can't think of another way of lowering these other than scalarizing, and we can't scalarize scalable vectors, so this patch just removes the promotion and adjusts the cost model to return an invalid cost. (It looks like we also don't currently cost fmul reductions, so presumably they already have an invalid cost?) I think this should be enough to stop the loop vectorizer or SLP from emitting these intrinsics.
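To see the double rounding concretely outside of RISC-V, here is a small Python sketch (purely an illustration, not part of the patch) that mimics the two lowerings on the three-element example above, using struct's binary16 conversion to round to f16:

```python
import struct

def round_f16(x):
    # Round a Python float to the nearest IEEE 754 binary16 value.
    # struct's 'e' format performs the round-to-nearest-even conversion.
    return struct.unpack('<e', struct.pack('<e', x))[0]

# The three f16 elements from the reduction example above.
v = [round_f16(0.1563), round_f16(5.97e-8), round_f16(0.00006104)]

# zvfh-style lowering: an ordered f16 reduction, rounding to f16
# after every addition.
acc_f16 = round_f16(0.0)
for x in v:
    acc_f16 = round_f16(acc_f16 + x)

# zvfhmin-style promotion: accumulate in wider precision and round to
# f16 only once at the end. (Python's double stands in for f32 here;
# every intermediate sum in this example is exact in both formats, so
# the final rounding behaves the same.)
acc_wide = 0.0
for x in v:
    acc_wide += x
acc_promoted = round_f16(acc_wide)

print(acc_f16, acc_promoted)  # 0.15625 vs. 0.1563720703125
```

The per-element rounding drops the tiny 2^-24 addend and then ties-to-even on the final add, while the promoted sum retains the addend and tips the final conversion over the halfway point — hence the 0.1563 vs. 0.1564 results quoted above.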
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Patch is 47.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111000.diff

6 Files Affected:
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index 04eb891f719d28..42d031310d5e02 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -5851,13 +5851,10 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
DAG.getIntPtrConstant(0, dl, /*isTarget=*/true)));
break;
}
- case ISD::VP_REDUCE_FADD:
- case ISD::VP_REDUCE_FMUL:
case ISD::VP_REDUCE_FMAX:
case ISD::VP_REDUCE_FMIN:
case ISD::VP_REDUCE_FMAXIMUM:
case ISD::VP_REDUCE_FMINIMUM:
- case ISD::VP_REDUCE_SEQ_FADD:
Results.push_back(PromoteReduction(Node));
break;
}
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 4f77bba2968988..9febf3e82ba154 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -957,8 +957,6 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
ISD::VP_FMUL,
ISD::VP_FDIV,
ISD::VP_FMA,
- ISD::VP_REDUCE_FADD,
- ISD::VP_REDUCE_SEQ_FADD,
ISD::VP_REDUCE_FMIN,
ISD::VP_REDUCE_FMAX,
ISD::VP_SQRT,
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index c042782389f188..a61461681f79ed 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1531,6 +1531,11 @@ RISCVTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
Opcodes = {RISCV::VMV_S_X, RISCV::VREDAND_VS, RISCV::VMV_X_S};
break;
case ISD::FADD:
+ // We can't promote f16/bf16 fadd reductions.
+ if ((LT.second.getVectorElementType() == MVT::f16 &&
+ !ST->hasVInstructionsF16()) ||
+ LT.second.getVectorElementType() == MVT::bf16)
+ return InstructionCost::getInvalid();
SplitOp = RISCV::VFADD_VV;
Opcodes = {RISCV::VFMV_S_F, RISCV::VFREDUSUM_VS, RISCV::VFMV_F_S};
break;
diff --git a/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll b/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll
index 71f614d7003fad..afb2b644645218 100644
--- a/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll
@@ -1,18 +1,30 @@
; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
-; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefix=FP-REDUCE
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=FP-REDUCE,FP-REDUCE-ZVFH
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfhmin -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=FP-REDUCE,FP-REDUCE-ZVFHMIN
; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=code-size 2>&1 -disable-output | FileCheck %s --check-prefix=SIZE
define void @reduce_fadd_half() {
-; FP-REDUCE-LABEL: 'reduce_fadd_half'
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+; FP-REDUCE-ZVFH-LABEL: 'reduce_fadd_half'
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; FP-REDUCE-ZVFHMIN-LABEL: 'reduce_fadd_half'
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; SIZE-LABEL: 'reduce_fadd_half'
; SIZE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll
index 793e8eb5aee6a7..6d5be7f14bf70b 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll
@@ -1,63 +1,33 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+v -target-abi=ilp32d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFH
+; RUN: -verify-machineinstrs < %s | FileCheck %s
; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+v -target-abi=lp64d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfhmin,+v -target-abi=ilp32d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFHMIN
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfhmin,+v -target-abi=lp64d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFHMIN
+; RUN: -verify-machineinstrs < %s | FileCheck %s
declare half @llvm.vp.reduce.fadd.v2f16(half, <2 x half>, <2 x i1>, i32)
define half @vpreduce_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_fadd_v2f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; ZVFH-NEXT: vfredusum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_fadd_v2f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_fadd_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT: vfredusum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call reassoc half @llvm.vp.reduce.fadd.v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 %evl)
ret half %r
}
define half @vpreduce_ord_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_ord_fadd_v2f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; ZVFH-NEXT: vfredosum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_ord_fadd_v2f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_ord_fadd_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT: vfredosum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call half @llvm.vp.reduce.fadd.v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 %evl)
ret half %r
}
@@ -65,53 +35,27 @@ define half @vpreduce_ord_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 ze
declare half @llvm.vp.reduce.fadd.v4f16(half, <4 x half>, <4 x i1>, i32)
define half @vpreduce_fadd_v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_fadd_v4f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; ZVFH-NEXT: vfredusum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_fadd_v4f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_fadd_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vfredusum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call reassoc half @llvm.vp.reduce.fadd.v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 %evl)
ret half %r
}
define half @vpreduce_ord_fadd_v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_ord_fadd_v4f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; ZVFH-NEXT: vfredosum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_ord_fadd_v4f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_ord_fadd_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vfredosum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call half @llvm.vp.reduce.fadd.v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 %evl)
ret half %r
}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll b/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll
index 0f8e74942d58d0..f3ccf74019bb58 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll
@@ -1,339 +1,33 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=ilp32d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=lp64d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfhmin,+zvfhmin,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=ilp32d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFHMIN
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfhmin,+zvfhmin,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=lp64d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFHMIN
-
-declare bfloat @llvm.vp.reduce.fadd.nxv1bf16(bfloat, <vscale x 1 x bfloat>, <vscale x 1 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_fadd_nxv1bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, mf2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call reassoc bfloat @llvm.vp.reduce.fadd.nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-define bfloat @vpreduce_ord_fadd_nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_ord_fadd_nxv1bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, mf2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; CHECK-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call bfloat @llvm.vp.reduce.fadd.nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-declare bfloat @llvm.vp.reduce.fadd.nxv2bf16(bfloat, <vscale x 2 x bfloat>, <vscale x 2 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_fadd_nxv2bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call reassoc bfloat @llvm.vp.reduce.fadd.nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-define bfloat @vpreduce_ord_fadd_nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_ord_fadd_nxv2bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call bfloat @llvm.vp.reduce.fadd.nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-declare bfloat @llvm.vp.reduce.fadd.nxv4bf16(bfloat, <vscale x 4 x bfloat>, <vscale x 4 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_fadd_nxv4bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v10, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v10, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call reassoc bfloat @llvm.vp.reduce.fadd.nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-define bfloat @vpreduce_ord_fadd_nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_ord_fadd_nxv4bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v10, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT: vfredosum.vs v8, v10, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call bfloat @llvm.vp.reduce.fadd.nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-declare bfloat @llvm.vp.reduce.fadd.nxv64bf16(bfloat, <vscale x 64 x bfloat>, <vscale x 64 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv64bf16(bflo...
[truncated]
You can test this locally with the following command:

git-clang-format --diff 496187b3b81ea76a8b67d796609d7f09992cf96d 394a66ea6e90710443c96909d7781fd3ad0c4af5 --extensions cpp -- llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

View the diff from clang-format here:

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 9febf3e82b..40e9c148df 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -952,27 +952,13 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
// TODO: support more vp ops.
static const unsigned ZvfhminZvfbfminPromoteVPOps[] = {
- ISD::VP_FADD,
- ISD::VP_FSUB,
- ISD::VP_FMUL,
- ISD::VP_FDIV,
- ISD::VP_FMA,
- ISD::VP_REDUCE_FMIN,
- ISD::VP_REDUCE_FMAX,
- ISD::VP_SQRT,
- ISD::VP_FMINNUM,
- ISD::VP_FMAXNUM,
- ISD::VP_FCEIL,
- ISD::VP_FFLOOR,
- ISD::VP_FROUND,
- ISD::VP_FROUNDEVEN,
- ISD::VP_FROUNDTOZERO,
- ISD::VP_FRINT,
- ISD::VP_FNEARBYINT,
- ISD::VP_SETCC,
- ISD::VP_FMINIMUM,
- ISD::VP_FMAXIMUM,
- ISD::VP_REDUCE_FMINIMUM,
+ ISD::VP_FADD, ISD::VP_FSUB, ISD::VP_FMUL,
+ ISD::VP_FDIV, ISD::VP_FMA, ISD::VP_REDUCE_FMIN,
+ ISD::VP_REDUCE_FMAX, ISD::VP_SQRT, ISD::VP_FMINNUM,
+ ISD::VP_FMAXNUM, ISD::VP_FCEIL, ISD::VP_FFLOOR,
+ ISD::VP_FROUND, ISD::VP_FROUNDEVEN, ISD::VP_FROUNDTOZERO,
+ ISD::VP_FRINT, ISD::VP_FNEARBYINT, ISD::VP_SETCC,
+ ISD::VP_FMINIMUM, ISD::VP_FMAXIMUM, ISD::VP_REDUCE_FMINIMUM,
ISD::VP_REDUCE_FMAXIMUM};
// Sets common operation actions on RVV floating-point vector types.
Luke and I discussed this a fair bit and this seems like the right solution to me.
LGTM
When trying to add RISC-V fadd reduction cost model tests for bf16, I noticed a crash when the vector type was <1 x bfloat>. It turns out this was being scalarized because, unlike f16/f32/f64, there's no v1bf16 value type, and the existing cost model code assumed that the legalized type would always be a vector. This adds v1bf16 to bring bf16 in line with the other fp types. It also adds some more RISC-V bf16 reduction tests which previously crashed, including tests to ensure that SLP won't emit fadd/fmul reductions for bf16, or for f16 with zvfhmin, after #111000.
In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16 without zvfh, and marked the cost as invalid to prevent the vectorizers from emitting them. However, that change inadvertently left the cost for ordered reductions unchanged, so this moves the check earlier to fix it. This also uses BasicTTIImpl instead, which now assigns a valid but expensive cost for fixed-length vectors, reflecting how codegen will actually scalarize them.