[SDAG][RISCV] Don't promote VP_REDUCE_{FADD,FMUL} #111000
Conversation
In https://reviews.llvm.org/D153848, promotion was added for a variety of f16 ops with zvfhmin, including VP reductions. However, I don't believe it's correct to promote f16 fadd or fmul reductions to f32, since we need to round the intermediate results. Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two different results depending on whether we compiled with +zvfh or +zvfhmin, for example with a 3 element reduction:

; v9 = [0.1563, 5.97e-8, 0.00006104]

; zvfh
vsetivli x0, 3, e16, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v9, v8
vfmv.f.s fa0, v8
; fa0 = 0.1563

; zvfhmin
vsetivli x0, 3, e16, m1, ta, ma
vfwcvt.f.f.v v10, v9
vsetivli x0, 3, e32, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v10, v8
vfmv.f.s fa0, v8
fcvt.h.s fa0, fa0
; fa0 = 0.1564

The same thing happens with reassociative reductions, e.g. vfredusum.vs, and this also applies to bf16. I couldn't find anything in the LangRef for reductions that suggests the excess precision is allowed. There may be something we can do in Clang with -fexcess-precision=fast, but I haven't looked into this yet.

I presume the same precision issue occurs with fmul, but not with fmin/fmax/fminimum/fmaximum.

I can't think of another way of lowering these other than scalarizing, and we can't scalarize scalable vectors, so this patch just removes the promotion and adjusts the cost model to return an invalid cost. (It looks like we also don't currently cost fmul reductions, so presumably they already have an invalid cost?) I think this should be enough to stop the loop vectorizer or SLP from emitting these intrinsics.
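To see the double rounding concretely outside of RISC-V, here is a small Python sketch (purely an illustration, not part of the patch) that mimics the two lowerings on the three-element example above, using struct's binary16 conversion to round to f16:

```python
import struct

def round_f16(x):
    # Round a Python float to the nearest IEEE 754 binary16 value.
    # struct's 'e' format performs the round-to-nearest-even conversion.
    return struct.unpack('<e', struct.pack('<e', x))[0]

# The three f16 elements from the reduction example above.
v = [round_f16(0.1563), round_f16(5.97e-8), round_f16(0.00006104)]

# zvfh-style lowering: an ordered f16 reduction, rounding to f16
# after every addition.
acc_f16 = round_f16(0.0)
for x in v:
    acc_f16 = round_f16(acc_f16 + x)

# zvfhmin-style promotion: accumulate in wider precision and round to
# f16 only once at the end. (Python's double stands in for f32 here;
# every intermediate sum in this example is exact in both formats, so
# the final rounding behaves the same.)
acc_wide = 0.0
for x in v:
    acc_wide += x
acc_promoted = round_f16(acc_wide)

print(acc_f16, acc_promoted)  # 0.15625 vs. 0.1563720703125
```

The per-element rounding drops the tiny 2^-24 addend and then ties-to-even on the final add, while the promoted sum retains the addend and tips the final conversion over the halfway point — hence the 0.1563 vs. 0.1564 results quoted above.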
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Patch is 47.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111000.diff

6 Files Affected:
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index 04eb891f719d28..42d031310d5e02 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -5851,13 +5851,10 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
DAG.getIntPtrConstant(0, dl, /*isTarget=*/true)));
break;
}
- case ISD::VP_REDUCE_FADD:
- case ISD::VP_REDUCE_FMUL:
case ISD::VP_REDUCE_FMAX:
case ISD::VP_REDUCE_FMIN:
case ISD::VP_REDUCE_FMAXIMUM:
case ISD::VP_REDUCE_FMINIMUM:
- case ISD::VP_REDUCE_SEQ_FADD:
Results.push_back(PromoteReduction(Node));
break;
}
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 4f77bba2968988..9febf3e82ba154 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -957,8 +957,6 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
ISD::VP_FMUL,
ISD::VP_FDIV,
ISD::VP_FMA,
- ISD::VP_REDUCE_FADD,
- ISD::VP_REDUCE_SEQ_FADD,
ISD::VP_REDUCE_FMIN,
ISD::VP_REDUCE_FMAX,
ISD::VP_SQRT,
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index c042782389f188..a61461681f79ed 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1531,6 +1531,11 @@ RISCVTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
Opcodes = {RISCV::VMV_S_X, RISCV::VREDAND_VS, RISCV::VMV_X_S};
break;
case ISD::FADD:
+ // We can't promote f16/bf16 fadd reductions.
+ if ((LT.second.getVectorElementType() == MVT::f16 &&
+ !ST->hasVInstructionsF16()) ||
+ LT.second.getVectorElementType() == MVT::bf16)
+ return InstructionCost::getInvalid();
SplitOp = RISCV::VFADD_VV;
Opcodes = {RISCV::VFMV_S_F, RISCV::VFREDUSUM_VS, RISCV::VFMV_F_S};
break;
diff --git a/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll b/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll
index 71f614d7003fad..afb2b644645218 100644
--- a/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/reduce-fadd.ll
@@ -1,18 +1,30 @@
; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
-; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefix=FP-REDUCE
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=FP-REDUCE,FP-REDUCE-ZVFH
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfhmin -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=FP-REDUCE,FP-REDUCE-ZVFHMIN
; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+zfh,+zvfh -passes="print<cost-model>" -cost-kind=code-size 2>&1 -disable-output | FileCheck %s --check-prefix=SIZE
define void @reduce_fadd_half() {
-; FP-REDUCE-LABEL: 'reduce_fadd_half'
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
-; FP-REDUCE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+; FP-REDUCE-ZVFH-LABEL: 'reduce_fadd_half'
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
+; FP-REDUCE-ZVFH-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; FP-REDUCE-ZVFHMIN-LABEL: 'reduce_fadd_half'
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V2 = call fast half @llvm.vector.reduce.fadd.v2f16(half 0xH0000, <2 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V4 = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V8 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V16 = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH0000, <16 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %v32 = call fast half @llvm.vector.reduce.fadd.v32f16(half 0xH0000, <32 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V64 = call fast half @llvm.vector.reduce.fadd.v64f16(half 0xH0000, <64 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Invalid cost for instruction: %V128 = call fast half @llvm.vector.reduce.fadd.v128f16(half 0xH0000, <128 x half> undef)
+; FP-REDUCE-ZVFHMIN-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; SIZE-LABEL: 'reduce_fadd_half'
; SIZE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %V1 = call fast half @llvm.vector.reduce.fadd.v1f16(half 0xH0000, <1 x half> undef)
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll
index 793e8eb5aee6a7..6d5be7f14bf70b 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp-vp.ll
@@ -1,63 +1,33 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+v -target-abi=ilp32d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFH
+; RUN: -verify-machineinstrs < %s | FileCheck %s
; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+v -target-abi=lp64d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfhmin,+v -target-abi=ilp32d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFHMIN
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfhmin,+v -target-abi=lp64d \
-; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,ZVFHMIN
+; RUN: -verify-machineinstrs < %s | FileCheck %s
declare half @llvm.vp.reduce.fadd.v2f16(half, <2 x half>, <2 x i1>, i32)
define half @vpreduce_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_fadd_v2f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; ZVFH-NEXT: vfredusum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_fadd_v2f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_fadd_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT: vfredusum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call reassoc half @llvm.vp.reduce.fadd.v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 %evl)
ret half %r
}
define half @vpreduce_ord_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_ord_fadd_v2f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; ZVFH-NEXT: vfredosum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_ord_fadd_v2f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; ZVFHMIN-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_ord_fadd_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
+; CHECK-NEXT: vfredosum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call half @llvm.vp.reduce.fadd.v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 %evl)
ret half %r
}
@@ -65,53 +35,27 @@ define half @vpreduce_ord_fadd_v2f16(half %s, <2 x half> %v, <2 x i1> %m, i32 ze
declare half @llvm.vp.reduce.fadd.v4f16(half, <4 x half>, <4 x i1>, i32)
define half @vpreduce_fadd_v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_fadd_v4f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; ZVFH-NEXT: vfredusum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_fadd_v4f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_fadd_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vfredusum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call reassoc half @llvm.vp.reduce.fadd.v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 %evl)
ret half %r
}
define half @vpreduce_ord_fadd_v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 zeroext %evl) {
-; ZVFH-LABEL: vpreduce_ord_fadd_v4f16:
-; ZVFH: # %bb.0:
-; ZVFH-NEXT: vsetivli zero, 1, e16, m1, ta, ma
-; ZVFH-NEXT: vfmv.s.f v9, fa0
-; ZVFH-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; ZVFH-NEXT: vfredosum.vs v9, v8, v9, v0.t
-; ZVFH-NEXT: vfmv.f.s fa0, v9
-; ZVFH-NEXT: ret
-;
-; ZVFHMIN-LABEL: vpreduce_ord_fadd_v4f16:
-; ZVFHMIN: # %bb.0:
-; ZVFHMIN-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; ZVFHMIN-NEXT: vfwcvt.f.f.v v9, v8
-; ZVFHMIN-NEXT: fcvt.s.h fa5, fa0
-; ZVFHMIN-NEXT: vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfmv.s.f v8, fa5
-; ZVFHMIN-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; ZVFHMIN-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; ZVFHMIN-NEXT: vfmv.f.s fa5, v8
-; ZVFHMIN-NEXT: fcvt.h.s fa0, fa5
-; ZVFHMIN-NEXT: ret
+; CHECK-LABEL: vpreduce_ord_fadd_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
+; CHECK-NEXT: vfmv.s.f v9, fa0
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vfredosum.vs v9, v8, v9, v0.t
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: ret
%r = call half @llvm.vp.reduce.fadd.v4f16(half %s, <4 x half> %v, <4 x i1> %m, i32 %evl)
ret half %r
}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll b/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll
index 0f8e74942d58d0..f3ccf74019bb58 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll
@@ -1,339 +1,33 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfh,+zvfh,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=ilp32d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfh,+zvfh,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=lp64d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFH
-; RUN: llc -mtriple=riscv32 -mattr=+d,+zfhmin,+zvfhmin,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=ilp32d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFHMIN
-; RUN: llc -mtriple=riscv64 -mattr=+d,+zfhmin,+zvfhmin,+zfbfmin,+zvfbfmin,+v \
-; RUN: -target-abi=lp64d -verify-machineinstrs < %s | FileCheck %s \
-; RUN: --check-prefixes=CHECK,ZVFHMIN
-
-declare bfloat @llvm.vp.reduce.fadd.nxv1bf16(bfloat, <vscale x 1 x bfloat>, <vscale x 1 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_fadd_nxv1bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, mf2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call reassoc bfloat @llvm.vp.reduce.fadd.nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-define bfloat @vpreduce_ord_fadd_nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_ord_fadd_nxv1bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf4, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, mf2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
-; CHECK-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call bfloat @llvm.vp.reduce.fadd.nxv1bf16(bfloat %s, <vscale x 1 x bfloat> %v, <vscale x 1 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-declare bfloat @llvm.vp.reduce.fadd.nxv2bf16(bfloat, <vscale x 2 x bfloat>, <vscale x 2 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_fadd_nxv2bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call reassoc bfloat @llvm.vp.reduce.fadd.nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-define bfloat @vpreduce_ord_fadd_nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_ord_fadd_nxv2bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v9, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vfredosum.vs v8, v9, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call bfloat @llvm.vp.reduce.fadd.nxv2bf16(bfloat %s, <vscale x 2 x bfloat> %v, <vscale x 2 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-declare bfloat @llvm.vp.reduce.fadd.nxv4bf16(bfloat, <vscale x 4 x bfloat>, <vscale x 4 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_fadd_nxv4bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v10, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT: vfredusum.vs v8, v10, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call reassoc bfloat @llvm.vp.reduce.fadd.nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-define bfloat @vpreduce_ord_fadd_nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 zeroext %evl) {
-; CHECK-LABEL: vpreduce_ord_fadd_nxv4bf16:
-; CHECK: # %bb.0:
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vfwcvtbf16.f.f.v v10, v8
-; CHECK-NEXT: fcvt.s.bf16 fa5, fa0
-; CHECK-NEXT: vsetivli zero, 1, e32, m2, ta, ma
-; CHECK-NEXT: vfmv.s.f v8, fa5
-; CHECK-NEXT: vsetvli zero, a0, e32, m2, ta, ma
-; CHECK-NEXT: vfredosum.vs v8, v10, v8, v0.t
-; CHECK-NEXT: vfmv.f.s fa5, v8
-; CHECK-NEXT: fcvt.bf16.s fa0, fa5
-; CHECK-NEXT: ret
- %r = call bfloat @llvm.vp.reduce.fadd.nxv4bf16(bfloat %s, <vscale x 4 x bfloat> %v, <vscale x 4 x i1> %m, i32 %evl)
- ret bfloat %r
-}
-
-declare bfloat @llvm.vp.reduce.fadd.nxv64bf16(bfloat, <vscale x 64 x bfloat>, <vscale x 64 x i1>, i32)
-
-define bfloat @vpreduce_fadd_nxv64bf16(bflo...
[truncated]
You can test this locally with the following command:

git-clang-format --diff 496187b3b81ea76a8b67d796609d7f09992cf96d 394a66ea6e90710443c96909d7781fd3ad0c4af5 --extensions cpp -- llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

View the diff from clang-format here:

diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 9febf3e82b..40e9c148df 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -952,27 +952,13 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
// TODO: support more vp ops.
static const unsigned ZvfhminZvfbfminPromoteVPOps[] = {
- ISD::VP_FADD,
- ISD::VP_FSUB,
- ISD::VP_FMUL,
- ISD::VP_FDIV,
- ISD::VP_FMA,
- ISD::VP_REDUCE_FMIN,
- ISD::VP_REDUCE_FMAX,
- ISD::VP_SQRT,
- ISD::VP_FMINNUM,
- ISD::VP_FMAXNUM,
- ISD::VP_FCEIL,
- ISD::VP_FFLOOR,
- ISD::VP_FROUND,
- ISD::VP_FROUNDEVEN,
- ISD::VP_FROUNDTOZERO,
- ISD::VP_FRINT,
- ISD::VP_FNEARBYINT,
- ISD::VP_SETCC,
- ISD::VP_FMINIMUM,
- ISD::VP_FMAXIMUM,
- ISD::VP_REDUCE_FMINIMUM,
+ ISD::VP_FADD, ISD::VP_FSUB, ISD::VP_FMUL,
+ ISD::VP_FDIV, ISD::VP_FMA, ISD::VP_REDUCE_FMIN,
+ ISD::VP_REDUCE_FMAX, ISD::VP_SQRT, ISD::VP_FMINNUM,
+ ISD::VP_FMAXNUM, ISD::VP_FCEIL, ISD::VP_FFLOOR,
+ ISD::VP_FROUND, ISD::VP_FROUNDEVEN, ISD::VP_FROUNDTOZERO,
+ ISD::VP_FRINT, ISD::VP_FNEARBYINT, ISD::VP_SETCC,
+ ISD::VP_FMINIMUM, ISD::VP_FMAXIMUM, ISD::VP_REDUCE_FMINIMUM,
ISD::VP_REDUCE_FMAXIMUM};
// Sets common operation actions on RVV floating-point vector types.
Luke and I discussed this a fair bit and this seems like the right solution to me.
LGTM
When trying to add RISC-V fadd reduction cost model tests for bf16, I noticed a crash when the vector type was <1 x bfloat>. It turns out this was being scalarized because, unlike f16/f32/f64, there's no v1bf16 value type, and the existing cost model code assumed that the legalized type would always be a vector. This adds v1bf16 to bring bf16 in line with the other fp types. It also adds some more RISC-V bf16 reduction tests which previously crashed, including tests to ensure that SLP won't emit fadd/fmul reductions for bf16, or for f16 with zvfhmin, after #111000.
In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16 without zvfh, and marked the cost as invalid to prevent the vectorizers from emitting them. However, that change inadvertently left the cost for ordered reductions unchanged, so this moves the check earlier to fix it. This also uses BasicTTIImpl instead, which now assigns a valid but expensive cost for fixed-length vectors, reflecting how codegen will actually scalarize them.