Skip to content

DAG: Improve fminimum/fmaximum vector expansion logic #93579

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 6, 2024

Conversation

arsenm
Copy link
Contributor

@arsenm arsenm commented May 28, 2024

First, expandFMINIMUM_FMAXIMUM should be a never-fail API. The client
wanted it expanded, and it can always be expanded. This logic was tied
up with what the VectorLegalizer wanted.

Prefer using the min/max opcodes, and unrolling if we don't have a vselect.
This seems to produce better code in all the changed tests.

@llvmbot
Copy link
Member

llvmbot commented May 28, 2024

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-selectiondag

Author: Matt Arsenault (arsenm)

Changes

First, expandFMINIMUM_FMAXIMUM should be a never-fail API. The client
wanted it expanded, and it can always be expanded. This logic was tied
up with what the VectorLegalizer wanted.

Prefer using the min/max opcodes, and unrolling if we don't have a vselect.
This seems to produce better code in all the changed tests.


Patch is 1.19 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/93579.diff

12 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/SelectionDAG.h (+2-2)
  • (modified) llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp (+2-5)
  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fmaximum3.ll (+120-480)
  • (modified) llvm/test/CodeGen/AMDGPU/fminimum3.ll (+120-480)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll (+17-71)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll (+670-2779)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+1559-4772)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll (+17-71)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll (+670-2779)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+1559-4772)
  • (modified) llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll (+21-58)
diff --git a/llvm/include/llvm/CodeGen/SelectionDAG.h b/llvm/include/llvm/CodeGen/SelectionDAG.h
index 96a6270690468..0dc237301abb4 100644
--- a/llvm/include/llvm/CodeGen/SelectionDAG.h
+++ b/llvm/include/llvm/CodeGen/SelectionDAG.h
@@ -1241,11 +1241,11 @@ class SelectionDAG {
   /// Helper function to make it easier to build Select's if you just have
   /// operands and don't want to check for vector.
   SDValue getSelect(const SDLoc &DL, EVT VT, SDValue Cond, SDValue LHS,
-                    SDValue RHS) {
+                    SDValue RHS, SDNodeFlags Flags = SDNodeFlags()) {
     assert(LHS.getValueType() == VT && RHS.getValueType() == VT &&
            "Cannot use select on differing types");
     auto Opcode = Cond.getValueType().isVector() ? ISD::VSELECT : ISD::SELECT;
-    return getNode(Opcode, DL, VT, Cond, LHS, RHS);
+    return getNode(Opcode, DL, VT, Cond, LHS, RHS, Flags);
   }
 
   /// Helper function to make it easier to build SelectCC's if you just have an
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
index 6acbc044d6731..503e09dd2077f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
@@ -990,11 +990,8 @@ void VectorLegalizer::Expand(SDNode *Node, SmallVectorImpl<SDValue> &Results) {
     break;
   case ISD::FMINIMUM:
   case ISD::FMAXIMUM:
-    if (SDValue Expanded = TLI.expandFMINIMUM_FMAXIMUM(Node, DAG)) {
-      Results.push_back(Expanded);
-      return;
-    }
-    break;
+    Results.push_back(TLI.expandFMINIMUM_FMAXIMUM(Node, DAG));
+    return;
   case ISD::SMIN:
   case ISD::SMAX:
   case ISD::UMIN:
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 4e47f50ee4289..18ecd86c4739d 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -8428,10 +8428,7 @@ SDValue TargetLowering::expandFMINIMUM_FMAXIMUM(SDNode *N,
   EVT VT = N->getValueType(0);
   EVT CCVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
   bool IsMax = Opc == ISD::FMAXIMUM;
-
-  if (VT.isVector() &&
-      isOperationLegalOrCustomOrPromote(Opc, VT.getScalarType()))
-    return SDValue();
+  SDNodeFlags Flags = N->getFlags();
 
   // First, implement comparison not propagating NaN. If no native fmin or fmax
   // available, use plain select with setcc instead.
@@ -8444,15 +8441,18 @@ SDValue TargetLowering::expandFMINIMUM_FMAXIMUM(SDNode *N,
   bool MinMaxMustRespectOrderedZero = false;
 
   if (isOperationLegalOrCustom(CompOpcIeee, VT)) {
-    MinMax = DAG.getNode(CompOpcIeee, DL, VT, LHS, RHS);
+    MinMax = DAG.getNode(CompOpcIeee, DL, VT, LHS, RHS, Flags);
     MinMaxMustRespectOrderedZero = true;
   } else if (isOperationLegalOrCustom(CompOpc, VT)) {
-    MinMax = DAG.getNode(CompOpc, DL, VT, LHS, RHS);
+    MinMax = DAG.getNode(CompOpc, DL, VT, LHS, RHS, Flags);
   } else {
+    if (VT.isVector() && !isOperationLegalOrCustom(ISD::VSELECT, VT))
+      return DAG.UnrollVectorOp(N);
+
     // NaN (if exists) will be propagated later, so orderness doesn't matter.
     SDValue Compare =
         DAG.getSetCC(DL, CCVT, LHS, RHS, IsMax ? ISD::SETGT : ISD::SETLT);
-    MinMax = DAG.getSelect(DL, VT, Compare, LHS, RHS);
+    MinMax = DAG.getSelect(DL, VT, Compare, LHS, RHS, Flags);
   }
 
   // Propagate any NaN of both operands
@@ -8461,7 +8461,7 @@ SDValue TargetLowering::expandFMINIMUM_FMAXIMUM(SDNode *N,
     ConstantFP *FPNaN = ConstantFP::get(
         *DAG.getContext(), APFloat::getNaN(DAG.EVTToAPFloatSemantics(VT)));
     MinMax = DAG.getSelect(DL, VT, DAG.getSetCC(DL, CCVT, LHS, RHS, ISD::SETUO),
-                           DAG.getConstantFP(*FPNaN, DL, VT), MinMax);
+                           DAG.getConstantFP(*FPNaN, DL, VT), MinMax, Flags);
   }
 
   // fminimum/fmaximum requires -0.0 less than +0.0
@@ -8473,11 +8473,11 @@ SDValue TargetLowering::expandFMINIMUM_FMAXIMUM(SDNode *N,
         DAG.getTargetConstant(IsMax ? fcPosZero : fcNegZero, DL, MVT::i32);
     SDValue LCmp = DAG.getSelect(
         DL, VT, DAG.getNode(ISD::IS_FPCLASS, DL, CCVT, LHS, TestZero), LHS,
-        MinMax);
+        MinMax, Flags);
     SDValue RCmp = DAG.getSelect(
         DL, VT, DAG.getNode(ISD::IS_FPCLASS, DL, CCVT, RHS, TestZero), RHS,
-        LCmp);
-    MinMax = DAG.getSelect(DL, VT, IsZero, RCmp, MinMax);
+        LCmp, Flags);
+    MinMax = DAG.getSelect(DL, VT, IsZero, RCmp, MinMax, Flags);
   }
 
   return MinMax;
diff --git a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
index 3ec36f03a48aa..9ce1ba3316dd5 100644
--- a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
@@ -497,47 +497,19 @@ define <2 x float> @v_fmaximum3_v2f32(<2 x float> %a, <2 x float> %b, <2 x float
 ; GFX9-LABEL: v_fmaximum3_v2f32:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v6, v3, v1, vcc
+; GFX9-NEXT:    v_max_f32_e32 v6, v1, v3
 ; GFX9-NEXT:    v_mov_b32_e32 v7, 0x7fc00000
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v6, v7, v6, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v6, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v3, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v6
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v6, v1, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v0, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX9-NEXT:    v_max_f32_e32 v3, v0, v2
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v7, v3, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v2, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v4, v0
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v0, v4, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v3, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, v4, v0
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v4, v0
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v4, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v4, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v5, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v1, v5, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v2, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, v5, v1
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v5, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v5, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v5, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v3, v1, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
   %max0 = call <2 x float> @llvm.maximum.v2f32(<2 x float> %a, <2 x float> %b)
   %max1 = call <2 x float> @llvm.maximum.v2f32(<2 x float> %c, <2 x float> %max0)
@@ -559,47 +531,19 @@ define <2 x float> @v_fmaximum3_v2f32_commute(<2 x float> %a, <2 x float> %b, <2
 ; GFX9-LABEL: v_fmaximum3_v2f32_commute:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v6, v3, v1, vcc
+; GFX9-NEXT:    v_max_f32_e32 v6, v1, v3
 ; GFX9-NEXT:    v_mov_b32_e32 v7, 0x7fc00000
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v6, v7, v6, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v6, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v3, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v6
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v6, v1, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v0, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX9-NEXT:    v_max_f32_e32 v3, v0, v2
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v7, v3, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v2, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v4
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v4, v0, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v3, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, v0, v4
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v4
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v4, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v4, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v1, v5
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v5, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v2, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, v1, v5
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v5
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v5, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v1, v5, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
   %max0 = call <2 x float> @llvm.maximum.v2f32(<2 x float> %a, <2 x float> %b)
   %max1 = call <2 x float> @llvm.maximum.v2f32(<2 x float> %max0, <2 x float> %c)
@@ -621,47 +565,19 @@ define <2 x float> @v_fmaximum3_v2f32__fabs_all(<2 x float> %a, <2 x float> %b,
 ; GFX9-LABEL: v_fmaximum3_v2f32__fabs_all:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, |v3|
-; GFX9-NEXT:    v_cndmask_b32_e32 v6, v3, v1, vcc
+; GFX9-NEXT:    v_max_f32_e64 v6, |v1|, |v3|
 ; GFX9-NEXT:    v_mov_b32_e32 v7, 0x7fc00000
-; GFX9-NEXT:    v_cmp_o_f32_e64 s[4:5], |v1|, |v3|
-; GFX9-NEXT:    v_cndmask_b32_e64 v6, v7, |v6|, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], |v1|, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v1, v6, |v1|, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], |v3|, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v1, v1, |v3|, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v6
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v6, v1, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e64 vcc, |v0|, |v2|
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_o_f32_e64 s[4:5], |v0|, |v2|
-; GFX9-NEXT:    v_cndmask_b32_e64 v3, v7, |v3|, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], |v0|, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v0, v3, |v0|, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], |v2|, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v0, v0, |v2|, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e64 s[4:5], v0, |v4|
-; GFX9-NEXT:    v_cndmask_b32_e64 v2, |v4|, v0, s[4:5]
+; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, |v1|, |v3|
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX9-NEXT:    v_max_f32_e64 v3, |v0|, |v2|
+; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, |v0|, |v2|
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v3, vcc
+; GFX9-NEXT:    v_max_f32_e64 v2, v0, |v4|
 ; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, v0, |v4|
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], |v4|, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v0, v0, |v4|, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cmp_gt_f32_e64 s[4:5], v1, |v5|
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cndmask_b32_e64 v2, |v5|, v1, s[4:5]
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v2, vcc
+; GFX9-NEXT:    v_max_f32_e64 v2, v1, |v5|
 ; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, v1, |v5|
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], |v5|, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v1, v1, |v5|, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
   %a.fabs = call <2 x float> @llvm.fabs.v2f32(<2 x float> %a)
   %b.fabs = call <2 x float> @llvm.fabs.v2f32(<2 x float> %b)
@@ -686,47 +602,19 @@ define <2 x float> @v_fmaximum3_v2f32__fneg_all(<2 x float> %a, <2 x float> %b,
 ; GFX9-LABEL: v_fmaximum3_v2f32__fneg_all:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_cmp_gt_f32_e64 vcc, -v1, -v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v6, v3, v1, vcc
+; GFX9-NEXT:    v_max_f32_e64 v6, -v1, -v3
 ; GFX9-NEXT:    v_mov_b32_e32 v7, 0x7fc00000
-; GFX9-NEXT:    v_cmp_o_f32_e64 s[4:5], -v1, -v3
-; GFX9-NEXT:    v_cndmask_b32_e64 v6, v7, -v6, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], -v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v1, v6, -v1, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], -v3, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v1, v1, -v3, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v6
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v6, v1, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e64 vcc, -v0, -v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_o_f32_e64 s[4:5], -v0, -v2
-; GFX9-NEXT:    v_cndmask_b32_e64 v3, v7, -v3, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], -v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v0, v3, -v0, s[4:5]
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], -v2, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v0, v0, -v2, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e64 s[4:5], v0, -v4
-; GFX9-NEXT:    v_cndmask_b32_e64 v2, -v4, v0, s[4:5]
+; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, -v1, -v3
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX9-NEXT:    v_max_f32_e64 v3, -v0, -v2
+; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, -v0, -v2
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v3, vcc
+; GFX9-NEXT:    v_max_f32_e64 v2, v0, -v4
 ; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, v0, -v4
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], -v4, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v0, v0, -v4, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cmp_gt_f32_e64 s[4:5], v1, -v5
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
-; GFX9-NEXT:    v_cndmask_b32_e64 v2, -v5, v1, s[4:5]
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v7, v2, vcc
+; GFX9-NEXT:    v_max_f32_e64 v2, v1, -v5
 ; GFX9-NEXT:    v_cmp_o_f32_e64 vcc, v1, -v5
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v7, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 s[4:5], -v5, 64
-; GFX9-NEXT:    v_cndmask_b32_e64 v1, v1, -v5, s[4:5]
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
   %a.fneg = fneg <2 x float> %a
   %b.fneg = fneg <2 x float> %b
@@ -751,35 +639,19 @@ define <2 x float> @v_fmaximum3_v2f32__inlineimm1(<2 x float> %a, <2 x float> %c
 ; GFX9-LABEL: v_fmaximum3_v2f32__inlineimm1:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_cmp_lt_f32_e32 vcc, 2.0, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v4, 2.0, v1, vcc
+; GFX9-NEXT:    v_max_f32_e32 v4, 2.0, v1
 ; GFX9-NEXT:    v_mov_b32_e32 v5, 0x7fc00000
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v1
 ; GFX9-NEXT:    v_cndmask_b32_e32 v1, v5, v4, vcc
-; GFX9-NEXT:    v_cmp_lt_f32_e32 vcc, 2.0, v0
-; GFX9-NEXT:    v_cndmask_b32_e32 v4, 2.0, v0, vcc
+; GFX9-NEXT:    v_max_f32_e32 v4, 2.0, v0
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v5, v4, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v4, v2, v0, vcc
+; GFX9-NEXT:    v_max_f32_e32 v4, v0, v2
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v4, v5, v4, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v2, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v4
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v3, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v5, v4, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, v1, v3
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, v5, v2, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v3, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v2, v1, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
   %max0 = call <2 x float> @llvm.maximum.v2f32(<2 x float> %a, <2 x float> <float 2.0, float 2.0>)
   %max1 = call <2 x float> @llvm.maximum.v2f32(<2 x float> %max0, <2 x float> %c)
@@ -801,33 +673,17 @@ define <2 x float> @v_fmaximum3_v2f32__inlineimm2(<2 x float> %a, <2 x float> %b
 ; GFX9-LABEL: v_fmaximum3_v2f32__inlineimm2:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v4, v3, v1, vcc
+; GFX9-NEXT:    v_max_f32_e32 v4, v1, v3
 ; GFX9-NEXT:    v_mov_b32_e32 v5, 0x7fc00000
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v4, v5, v4, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v1, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v4, v1, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v3, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v1, v3, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v4
-; GFX9-NEXT:    v_cndmask_b32_e32 v1, v4, v1, vcc
-; GFX9-NEXT:    v_cmp_gt_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v2, v0, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v1, v5, v4, vcc
+; GFX9-NEXT:    v_max_f32_e32 v3, v0, v2
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v2
-; GFX9-NEXT:    v_cndmask_b32_e32 v3, v5, v3, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v0, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_class_f32_e64 vcc, v2, 64
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
-; GFX9-NEXT:    v_cmp_eq_f32_e32 vcc, 0, v3
-; GFX9-NEXT:    v_cndmask_b32_e32 v0, v3, v0, vcc
-; GFX9-NEXT:    v_cmp_lt_f32_e32 vcc, 4.0, v0
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, 4.0, v0, vcc
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v5, v3, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, 4.0, v0
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v0, v0
 ; GFX9-NEXT:    v_cndmask_b32_e32 v0, v5, v2, vcc
-; GFX9-NEXT:    v_cmp_lt_f32_e32 vcc, 4.0, v1
-; GFX9-NEXT:    v_cndmask_b32_e32 v2, 4.0, v1, vcc
+; GFX9-NEXT:    v_max_f32_e32 v2, 4.0, v1
 ; GFX9-NEXT:    v_cmp_o_f32_e32 vcc, v1, v1
 ; GFX9-NEXT:    v_cndmask_b32_e32 v1, v5, v2, vcc
 ; GFX9-NEXT:    s_setpc_b64 s[30:31]
@@ -852,67 +708,25 @@ define <3 x float> @v_fmaximum3_v3f32(<3 x float> %a, <3 x float> %b, <3 x float
 ; GFX9-L...
[truncated]

@arsenm
Copy link
Contributor Author

arsenm commented May 28, 2024

First, expandFMINIMUM_FMAXIMUM should be a never-fail API. The client wanted it expanded, and it can always be expanded. This logic was tied up with what the VectorLegalizer wanted.

Prefer using the min/max opcodes, and unrolling if we don't have a vselect. This seems to produce better code in all the changed tests.

On top of #93550

@arsenm arsenm force-pushed the dag-improve-fminimum-fmaximum-expand-logic branch from 7499e03 to 0ec84b9 Compare May 28, 2024 20:59
@@ -28,35 +28,17 @@ define half @test_fminimum(half %x, half %y) {
define <8 x half> @test_fminimum_scalarize(<8 x half> %x, <8 x half> %y) "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" {
Copy link
Contributor

@phoebewang phoebewang May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _scalarize should be removed now.

@e-kud What's these tests testing for? If in a NNan, Nsz situation, we can generate VMIN/MAXPH directly. Or you actually wanted "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"? But they are the default value, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you've said we test that only VMIN/VMAX is enough. Otherwise we need to deal with +/-0 using fpclass and NaNs with vcmpunord. It's called scalarize because already scalarized version of FMINIMUM comes to lowering. Probably this is the reason why it has no much support. Let me see what's going wrong and why we stop vectorizing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean should be removed? Based on the test name, the attributes should not be here. Also, should migrate to using the instruction flags. In any case this patch shouldn't do anything with this function, other than update it

Copy link
Contributor

@phoebewang phoebewang May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the _scalarize in the function name. The codegen was scalarized for some reason, and it's not ture with this patch. So keeping this name is confusing. It's not important since @e-kud kindly help investigating it.

Copy link
Contributor

@e-kud e-kud May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the situation is the following: FMINIMUM and FMAXIMUM are set to Expand through setF16Action lambda. And I didn't set v8f16 separately to Custom because I saw SELECT inside setF16Action. And since FMINIMUM lowering requires support of SELECT I decided that we need to support SELECT for v8f16 first. Now I see, that despite of SELECT inside setF16Action, we overwrite it later with Custom for v8f16.

When I've set Custom for FMINIMUM I've seen the desired result

test_fminimum_scalarize:                # @test_fminimum_scalarize 
        .cfi_startproc                                             
# %bb.0:                                                           
        vminph  %xmm1, %xmm0, %xmm0                                
        retq                                                       

Let me support it with fixing test naming to illustrate the attributes

@ecnelises
Copy link
Member

Changes to cases are from SDFlags, not related to vector unroll?

@arsenm
Copy link
Contributor Author

arsenm commented May 29, 2024

Changes to cases are from SDFlags, not related to vector unroll?

Only look at the top commit

First, expandFMINIMUM_FMAXIMUM should be a never-fail API. The client
wanted it expanded, and it can always be expanded. This logic was tied
up with what the VectorLegalizer wanted.

Prefer using the min/max opcodes, and unrolling if we don't have a vselect.
This seems to produce better code in all the changed tests.
@arsenm arsenm force-pushed the dag-improve-fminimum-fmaximum-expand-logic branch from 0ec84b9 to f1296cd Compare May 29, 2024 10:30
@arsenm
Copy link
Contributor Author

arsenm commented Jun 6, 2024

ping

Copy link
Contributor

@phoebewang phoebewang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

return;
}
break;
Results.push_back(TLI.expandFMINIMUM_FMAXIMUM(Node, DAG));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add an assert in case expandFMINIMUM_FMAXIMUM may change to return SDValue() someday?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will hit an assert anyway as it is later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@arsenm arsenm merged commit 212b78a into llvm:main Jun 6, 2024
5 of 6 checks passed
@arsenm arsenm deleted the dag-improve-fminimum-fmaximum-expand-logic branch June 6, 2024 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants