Skip to content

[NVPTX] Support copysign PTX instruction #107800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

AlexMaclean
Copy link
Member

Lower fcopysign SDNodes into copysign PTX instructions where possible. See PTX ISA: 9.7.3.2. Floating Point Instructions: copysign.

@AlexMaclean AlexMaclean requested a review from Artem-B September 9, 2024 02:33
@AlexMaclean AlexMaclean self-assigned this Sep 9, 2024
@llvmbot
Copy link
Member

llvmbot commented Sep 9, 2024

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

Lower fcopysign SDNodes into copysign PTX instructions where possible. See PTX ISA: 9.7.3.2. Floating Point Instructions: copysign.


Full diff: https://github.com/llvm/llvm-project/pull/107800.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+2-2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+14)
  • (added) llvm/test/CodeGen/NVPTX/copysign.ll (+39)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index bb76ffdfd99d7b..6a046f5cd7c4d7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -838,8 +838,8 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   setOperationAction(ISD::FCOPYSIGN, MVT::v2f16, Expand);
   setOperationAction(ISD::FCOPYSIGN, MVT::bf16, Expand);
   setOperationAction(ISD::FCOPYSIGN, MVT::v2bf16, Expand);
-  setOperationAction(ISD::FCOPYSIGN, MVT::f32, Expand);
-  setOperationAction(ISD::FCOPYSIGN, MVT::f64, Expand);
+  setOperationAction(ISD::FCOPYSIGN, MVT::f32, Legal);
+  setOperationAction(ISD::FCOPYSIGN, MVT::f64, Legal);
 
   // These map to corresponding instructions for f32/f64. f16 must be
   // promoted to f32. v2f16 is expanded to f16, which is then promoted
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 0c883093dd0a54..1b817ec07ce3cf 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -977,6 +977,20 @@ def INT_NVVM_FABS_F : F_MATH_1<"abs.f32 \t$dst, $src0;", Float32Regs,
 def INT_NVVM_FABS_D : F_MATH_1<"abs.f64 \t$dst, $src0;", Float64Regs,
   Float64Regs, int_nvvm_fabs_d>;
 
+//
+// copysign
+//
+
+def COPYSIGN_F :
+    NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$src0, Float32Regs:$src1),
+              "copysign.f32 \t$dst, $src0, $src1;",
+              [(set Float32Regs:$dst, (fcopysign Float32Regs:$src0, Float32Regs:$src1))]>;
+
+def COPYSIGN_D :
+    NVPTXInst<(outs Float64Regs:$dst), (ins Float64Regs:$src0, Float64Regs:$src1),
+              "copysign.f64 \t$dst, $src0, $src1;",
+              [(set Float64Regs:$dst, (fcopysign Float64Regs:$src0, Float64Regs:$src1))]>;
+
 //
 // Abs, Neg bf16, bf16x2
 //
diff --git a/llvm/test/CodeGen/NVPTX/copysign.ll b/llvm/test/CodeGen/NVPTX/copysign.ll
new file mode 100644
index 00000000000000..1b2061060a3dd9
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/copysign.ll
@@ -0,0 +1,39 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
+
+target triple = "nvptx64-nvidia-cuda"
+target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
+
+define float @fcopysign_f(float %a, float %b) {
+; CHECK-LABEL: fcopysign_f(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f32 %f<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f32 %f1, [fcopysign_f_param_0];
+; CHECK-NEXT:    ld.param.f32 %f2, [fcopysign_f_param_1];
+; CHECK-NEXT:    copysign.f32 %f3, %f1, %f2;
+; CHECK-NEXT:    st.param.f32 [func_retval0+0], %f3;
+; CHECK-NEXT:    ret;
+  %val = call float @llvm.copysign.f32(float %a, float %b)
+  ret float %val
+}
+
+define double @fcopysign_d(double %a, double %b) {
+; CHECK-LABEL: fcopysign_d(
+; CHECK:       {
+; CHECK-NEXT:    .reg .f64 %fd<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.f64 %fd1, [fcopysign_d_param_0];
+; CHECK-NEXT:    ld.param.f64 %fd2, [fcopysign_d_param_1];
+; CHECK-NEXT:    copysign.f64 %fd3, %fd1, %fd2;
+; CHECK-NEXT:    st.param.f64 [func_retval0+0], %fd3;
+; CHECK-NEXT:    ret;
+  %val = call double @llvm.copysign.f64(double %a, double %b)
+  ret double %val
+}
+
+declare float @llvm.copysign.f32(float, float)
+declare double @llvm.copysign.f64(double, double)

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream-copysign-nvptx branch from ce7afff to bd78827 Compare September 9, 2024 02:59
Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

For my own education -- is there hardware support for this instruction in SASS? Or would the copysign get expanded to the bit twiddling LLVM does?

@AlexMaclean
Copy link
Member Author

LGTM.

For my own education -- is there hardware support for this instruction in SASS? Or would the copysign get expanded to the bit twiddling LLVM does?

Here is a quick comparison of the SASS before/after this change: https://godbolt.org/z/snWTGEG81. copysign is still getting expanded but it looks like ptxas is doing a better job then the SelectionDAG (2 vs. 3 instructions).

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream-copysign-nvptx branch from bd78827 to d5e0ff5 Compare September 9, 2024 23:33
@Artem-B
Copy link
Member

Artem-B commented Sep 9, 2024

it looks like ptxas is doing a better job then the SelectionDAG (2 vs. 3 instructions).

2x LOP3.LUT look like a likely win over SHF, ISETP, FSEL.

We should probably start using lop3 more often, as it could potentially handle other common operations a bit better than the discrete logical ops we use now.

@AlexMaclean AlexMaclean merged commit b0d2411 into llvm:main Sep 10, 2024
6 of 7 checks passed
@pranavk
Copy link
Contributor

pranavk commented Sep 10, 2024

This is making some of our internal XLA tests fail. I talked to @Artem-B offline and it seems we missed case of copysign from a constant. I will have to revert this to unblock us.

@pranavk
Copy link
Contributor

pranavk commented Sep 10, 2024

I am working on some reproducers for you but that may take few hours.

@AlexMaclean
Copy link
Member Author

Sounds good, Apologies! Some reproducers would be great!

pranavk added a commit that referenced this pull request Sep 10, 2024
This reverts commit b0d2411.

Reverting because the original commit misses case of copysign from a
constant.
@pranavk
Copy link
Contributor

pranavk commented Sep 10, 2024

@AlexMaclean I have this I can share:

LLVM ERROR: Cannot select: t31: f32 = fcopysign ConstantFP:f32<5.000000e-01>, t17
  t19: f32 = ConstantFP<5.000000e-01>
  t17: f16,ch = load<(invariant load (s16) from %ir.7, addrspace 1)> t0, t16, undef:i64
    t16: i64 = add t10, t34
      t10: i64 = addrspacecast[0 -> 1] t35
        t35: i64,ch = load<(dereferenceable invariant load (s64) from `ptr addrspace(101) null`, addrspace 101)> t0, TargetExternalSymbol:i64'foo', undef:i64
          t1: i64 = TargetExternalSymbol'foo'
          t3: i64 = undef
      t34: i64 = NVPTXISD::MUL_WIDE_UNSIGNED t12, Constant:i32<2>
        t12: i32 = llvm.nvvm.read.ptx.sreg.tid.x TargetConstant:i64<8487>
          t11: i64 = TargetConstant<8487>
        t33: i32 = Constant<2>
    t3: i64 = undef

I will be working on getting the LLVM IR for you unless above is sufficient. Please let me know.

@AlexMaclean
Copy link
Member Author

Thanks, this is enough, it seems the problem is not the constant but that the second arg is of f16 type

@Artem-B
Copy link
Member

Artem-B commented Sep 10, 2024

Huh. I'm surprised to see copysign used with mixed FP types.

AlexMaclean added a commit to AlexMaclean/llvm-project that referenced this pull request Sep 11, 2024
Lower `fcopysign` SDNodes into `copysign` PTX instructions where
possible. See [PTX ISA: 9.7.3.2. Floating Point Instructions: copysign]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#floating-point-instructions-copysign).
@AlexMaclean
Copy link
Member Author

Yea it seems like many targets have had to work around this property of FCOPYSIGN nodes. One option would be to update ISel to handle these cases but I went ahead and added a TLI hook to prevent this sort of folding all together. #108125

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants