[Clang][AArch64] Add fp8 variants for untyped NEON intrinsics #128019
Conversation
@llvm/pr-subscribers-clang-codegen Author: None (Lukacma) Changes
Patch is 7.06 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128019.diff 53 Files Affected:
diff --git a/clang/include/clang/Basic/TargetBuiltins.h b/clang/include/clang/Basic/TargetBuiltins.h
index 4781054240b5b..c1ba65064f159 100644
--- a/clang/include/clang/Basic/TargetBuiltins.h
+++ b/clang/include/clang/Basic/TargetBuiltins.h
@@ -263,6 +263,10 @@ namespace clang {
EltType ET = getEltType();
return ET == Poly8 || ET == Poly16 || ET == Poly64;
}
+ bool isFloatingPoint() const {
+ EltType ET = getEltType();
+ return ET == Float16 || ET == Float32 || ET == Float64 || ET == BFloat16;
+ }
bool isUnsigned() const { return (Flags & UnsignedFlag) != 0; }
bool isQuad() const { return (Flags & QuadFlag) != 0; }
unsigned getEltSizeInBits() const {
diff --git a/clang/include/clang/Basic/arm_neon.td b/clang/include/clang/Basic/arm_neon.td
index 3e73dd054933f..90f0e90e4a7f8 100644
--- a/clang/include/clang/Basic/arm_neon.td
+++ b/clang/include/clang/Basic/arm_neon.td
@@ -31,8 +31,8 @@ def OP_MLAL : Op<(op "+", $p0, (call "vmull", $p1, $p2))>;
def OP_MULLHi : Op<(call "vmull", (call "vget_high", $p0),
(call "vget_high", $p1))>;
def OP_MULLHi_P64 : Op<(call "vmull",
- (cast "poly64_t", (call "vget_high", $p0)),
- (cast "poly64_t", (call "vget_high", $p1)))>;
+ (bitcast "poly64_t", (call "vget_high", $p0)),
+ (bitcast "poly64_t", (call "vget_high", $p1)))>;
def OP_MULLHi_N : Op<(call "vmull_n", (call "vget_high", $p0), $p1)>;
def OP_MLALHi : Op<(call "vmlal", $p0, (call "vget_high", $p1),
(call "vget_high", $p2))>;
@@ -95,11 +95,11 @@ def OP_TRN2 : Op<(shuffle $p0, $p1, (interleave
def OP_ZIP2 : Op<(shuffle $p0, $p1, (highhalf (interleave mask0, mask1)))>;
def OP_UZP2 : Op<(shuffle $p0, $p1, (add (decimate (rotl mask0, 1), 2),
(decimate (rotl mask1, 1), 2)))>;
-def OP_EQ : Op<(cast "R", (op "==", $p0, $p1))>;
-def OP_GE : Op<(cast "R", (op ">=", $p0, $p1))>;
-def OP_LE : Op<(cast "R", (op "<=", $p0, $p1))>;
-def OP_GT : Op<(cast "R", (op ">", $p0, $p1))>;
-def OP_LT : Op<(cast "R", (op "<", $p0, $p1))>;
+def OP_EQ : Op<(bitcast "R", (op "==", $p0, $p1))>;
+def OP_GE : Op<(bitcast "R", (op ">=", $p0, $p1))>;
+def OP_LE : Op<(bitcast "R", (op "<=", $p0, $p1))>;
+def OP_GT : Op<(bitcast "R", (op ">", $p0, $p1))>;
+def OP_LT : Op<(bitcast "R", (op "<", $p0, $p1))>;
def OP_NEG : Op<(op "-", $p0)>;
def OP_NOT : Op<(op "~", $p0)>;
def OP_AND : Op<(op "&", $p0, $p1)>;
@@ -108,20 +108,20 @@ def OP_XOR : Op<(op "^", $p0, $p1)>;
def OP_ANDN : Op<(op "&", $p0, (op "~", $p1))>;
def OP_ORN : Op<(op "|", $p0, (op "~", $p1))>;
def OP_CAST : LOp<[(save_temp $promote, $p0),
- (cast "R", $promote)]>;
+ (bitcast "R", $promote)]>;
def OP_HI : Op<(shuffle $p0, $p0, (highhalf mask0))>;
def OP_LO : Op<(shuffle $p0, $p0, (lowhalf mask0))>;
def OP_CONC : Op<(shuffle $p0, $p1, (add mask0, mask1))>;
def OP_DUP : Op<(dup $p0)>;
def OP_DUP_LN : Op<(call_mangled "splat_lane", $p0, $p1)>;
-def OP_SEL : Op<(cast "R", (op "|",
- (op "&", $p0, (cast $p0, $p1)),
- (op "&", (op "~", $p0), (cast $p0, $p2))))>;
+def OP_SEL : Op<(bitcast "R", (op "|",
+ (op "&", $p0, (bitcast $p0, $p1)),
+ (op "&", (op "~", $p0), (bitcast $p0, $p2))))>;
def OP_REV16 : Op<(shuffle $p0, $p0, (rev 16, mask0))>;
def OP_REV32 : Op<(shuffle $p0, $p0, (rev 32, mask0))>;
def OP_REV64 : Op<(shuffle $p0, $p0, (rev 64, mask0))>;
def OP_XTN : Op<(call "vcombine", $p0, (call "vmovn", $p1))>;
-def OP_SQXTUN : Op<(call "vcombine", (cast $p0, "U", $p0),
+def OP_SQXTUN : Op<(call "vcombine", (bitcast $p0, "U", $p0),
(call "vqmovun", $p1))>;
def OP_QXTN : Op<(call "vcombine", $p0, (call "vqmovn", $p1))>;
def OP_VCVT_NA_HI_F16 : Op<(call "vcombine", $p0, (call "vcvt_f16_f32", $p1))>;
@@ -129,12 +129,12 @@ def OP_VCVT_NA_HI_F32 : Op<(call "vcombine", $p0, (call "vcvt_f32_f64", $p1))>;
def OP_VCVT_EX_HI_F32 : Op<(call "vcvt_f32_f16", (call "vget_high", $p0))>;
def OP_VCVT_EX_HI_F64 : Op<(call "vcvt_f64_f32", (call "vget_high", $p0))>;
def OP_VCVTX_HI : Op<(call "vcombine", $p0, (call "vcvtx_f32", $p1))>;
-def OP_REINT : Op<(cast "R", $p0)>;
+def OP_REINT : Op<(bitcast "R", $p0)>;
def OP_ADDHNHi : Op<(call "vcombine", $p0, (call "vaddhn", $p1, $p2))>;
def OP_RADDHNHi : Op<(call "vcombine", $p0, (call "vraddhn", $p1, $p2))>;
def OP_SUBHNHi : Op<(call "vcombine", $p0, (call "vsubhn", $p1, $p2))>;
def OP_RSUBHNHi : Op<(call "vcombine", $p0, (call "vrsubhn", $p1, $p2))>;
-def OP_ABDL : Op<(cast "R", (call "vmovl", (cast $p0, "U",
+def OP_ABDL : Op<(bitcast "R", (call "vmovl", (bitcast $p0, "U",
(call "vabd", $p0, $p1))))>;
def OP_ABDLHi : Op<(call "vabdl", (call "vget_high", $p0),
(call "vget_high", $p1))>;
@@ -152,15 +152,15 @@ def OP_QDMLSLHi : Op<(call "vqdmlsl", $p0, (call "vget_high", $p1),
(call "vget_high", $p2))>;
def OP_QDMLSLHi_N : Op<(call "vqdmlsl_n", $p0, (call "vget_high", $p1), $p2)>;
def OP_DIV : Op<(op "/", $p0, $p1)>;
-def OP_LONG_HI : Op<(cast "R", (call (name_replace "_high_", "_"),
+def OP_LONG_HI : Op<(bitcast "R", (call (name_replace "_high_", "_"),
(call "vget_high", $p0), $p1))>;
-def OP_NARROW_HI : Op<(cast "R", (call "vcombine",
- (cast "R", "H", $p0),
- (cast "R", "H",
+def OP_NARROW_HI : Op<(bitcast "R", (call "vcombine",
+ (bitcast "R", "H", $p0),
+ (bitcast "R", "H",
(call (name_replace "_high_", "_"),
$p1, $p2))))>;
def OP_MOVL_HI : LOp<[(save_temp $a1, (call "vget_high", $p0)),
- (cast "R",
+ (bitcast "R",
(call "vshll_n", $a1, (literal "int32_t", "0")))]>;
def OP_COPY_LN : Op<(call "vset_lane", (call "vget_lane", $p2, $p3), $p0, $p1)>;
def OP_SCALAR_MUL_LN : Op<(op "*", $p0, (call "vget_lane", $p1, $p2))>;
@@ -221,18 +221,18 @@ def OP_FMLSL_LN_Hi : Op<(call "vfmlsl_high", $p0, $p1,
def OP_USDOT_LN
: Op<(call "vusdot", $p0, $p1,
- (cast "8", "S", (call_mangled "splat_lane", (bitcast "int32x2_t", $p2), $p3)))>;
+ (bitcast "8", "S", (call_mangled "splat_lane", (bitcast "int32x2_t", $p2), $p3)))>;
def OP_USDOT_LNQ
: Op<(call "vusdot", $p0, $p1,
- (cast "8", "S", (call_mangled "splat_lane", (bitcast "int32x4_t", $p2), $p3)))>;
+ (bitcast "8", "S", (call_mangled "splat_lane", (bitcast "int32x4_t", $p2), $p3)))>;
// sudot splats the second vector and then calls vusdot
def OP_SUDOT_LN
: Op<(call "vusdot", $p0,
- (cast "8", "U", (call_mangled "splat_lane", (bitcast "int32x2_t", $p2), $p3)), $p1)>;
+ (bitcast "8", "U", (call_mangled "splat_lane", (bitcast "int32x2_t", $p2), $p3)), $p1)>;
def OP_SUDOT_LNQ
: Op<(call "vusdot", $p0,
- (cast "8", "U", (call_mangled "splat_lane", (bitcast "int32x4_t", $p2), $p3)), $p1)>;
+ (bitcast "8", "U", (call_mangled "splat_lane", (bitcast "int32x4_t", $p2), $p3)), $p1)>;
def OP_BFDOT_LN
: Op<(call "vbfdot", $p0, $p1,
@@ -263,7 +263,7 @@ def OP_VCVT_BF16_F32_A32
: Op<(call "__a32_vcvt_bf16", $p0)>;
def OP_VCVT_BF16_F32_LO_A32
- : Op<(call "vcombine", (cast "bfloat16x4_t", (literal "uint64_t", "0ULL")),
+ : Op<(call "vcombine", (bitcast "bfloat16x4_t", (literal "uint64_t", "0ULL")),
(call "__a32_vcvt_bf16", $p0))>;
def OP_VCVT_BF16_F32_HI_A32
: Op<(call "vcombine", (call "__a32_vcvt_bf16", $p1),
@@ -924,12 +924,12 @@ def CFMLE : SOpInst<"vcle", "U..", "lUldQdQlQUl", OP_LE>;
def CFMGT : SOpInst<"vcgt", "U..", "lUldQdQlQUl", OP_GT>;
def CFMLT : SOpInst<"vclt", "U..", "lUldQdQlQUl", OP_LT>;
-def CMEQ : SInst<"vceqz", "U.",
+def CMEQ : SInst<"vceqz", "U(.!)",
"csilfUcUsUiUlPcPlQcQsQiQlQfQUcQUsQUiQUlQPcdQdQPl">;
-def CMGE : SInst<"vcgez", "U.", "csilfdQcQsQiQlQfQd">;
-def CMLE : SInst<"vclez", "U.", "csilfdQcQsQiQlQfQd">;
-def CMGT : SInst<"vcgtz", "U.", "csilfdQcQsQiQlQfQd">;
-def CMLT : SInst<"vcltz", "U.", "csilfdQcQsQiQlQfQd">;
+def CMGE : SInst<"vcgez", "U(.!)", "csilfdQcQsQiQlQfQd">;
+def CMLE : SInst<"vclez", "U(.!)", "csilfdQcQsQiQlQfQd">;
+def CMGT : SInst<"vcgtz", "U(.!)", "csilfdQcQsQiQlQfQd">;
+def CMLT : SInst<"vcltz", "U(.!)", "csilfdQcQsQiQlQfQd">;
////////////////////////////////////////////////////////////////////////////////
// Max/Min Integer
@@ -1667,11 +1667,11 @@ let TargetGuard = "fullfp16,neon" in {
// ARMv8.2-A FP16 one-operand vector intrinsics.
// Comparison
- def CMEQH : SInst<"vceqz", "U.", "hQh">;
- def CMGEH : SInst<"vcgez", "U.", "hQh">;
- def CMGTH : SInst<"vcgtz", "U.", "hQh">;
- def CMLEH : SInst<"vclez", "U.", "hQh">;
- def CMLTH : SInst<"vcltz", "U.", "hQh">;
+ def CMEQH : SInst<"vceqz", "U(.!)", "hQh">;
+ def CMGEH : SInst<"vcgez", "U(.!)", "hQh">;
+ def CMGTH : SInst<"vcgtz", "U(.!)", "hQh">;
+ def CMLEH : SInst<"vclez", "U(.!)", "hQh">;
+ def CMLTH : SInst<"vcltz", "U(.!)", "hQh">;
// Vector conversion
def VCVT_F16 : SInst<"vcvt_f16", "F(.!)", "sUsQsQUs">;
@@ -2090,17 +2090,17 @@ let ArchGuard = "defined(__aarch64__) || defined(__arm64ec__)", TargetGuard = "r
// Lookup table read with 2-bit/4-bit indices
let ArchGuard = "defined(__aarch64__)", TargetGuard = "lut" in {
- def VLUTI2_B : SInst<"vluti2_lane", "Q.(qU)I", "cUcPcQcQUcQPc",
+ def VLUTI2_B : SInst<"vluti2_lane", "Q.(qU)I", "cUcPcmQcQUcQPcQm",
[ImmCheck<2, ImmCheck0_1>]>;
- def VLUTI2_B_Q : SInst<"vluti2_laneq", "Q.(QU)I", "cUcPcQcQUcQPc",
+ def VLUTI2_B_Q : SInst<"vluti2_laneq", "Q.(QU)I", "cUcPcmQcQUcQPcQm",
[ImmCheck<2, ImmCheck0_3>]>;
def VLUTI2_H : SInst<"vluti2_lane", "Q.(<qU)I", "sUsPshQsQUsQPsQh",
[ImmCheck<2, ImmCheck0_3>]>;
def VLUTI2_H_Q : SInst<"vluti2_laneq", "Q.(<QU)I", "sUsPshQsQUsQPsQh",
[ImmCheck<2, ImmCheck0_7>]>;
- def VLUTI4_B : SInst<"vluti4_lane", "..(qU)I", "QcQUcQPc",
+ def VLUTI4_B : SInst<"vluti4_lane", "..(qU)I", "QcQUcQPcQm",
[ImmCheck<2, ImmCheck0_0>]>;
- def VLUTI4_B_Q : SInst<"vluti4_laneq", "..UI", "QcQUcQPc",
+ def VLUTI4_B_Q : SInst<"vluti4_laneq", "..UI", "QcQUcQPcQm",
[ImmCheck<2, ImmCheck0_1>]>;
def VLUTI4_H_X2 : SInst<"vluti4_lane_x2", ".2(<qU)I", "QsQUsQPsQh",
[ImmCheck<3, ImmCheck0_1>]>;
@@ -2194,4 +2194,70 @@ let ArchGuard = "defined(__aarch64__)", TargetGuard = "fp8,neon" in {
// fscale
def FSCALE_V128 : WInst<"vscale", "..(.S)", "QdQfQh">;
def FSCALE_V64 : WInst<"vscale", "(.q)(.q)(.qS)", "fh">;
+}
+
+//FP8 versions of untyped intrinsics
+let ArchGuard = "defined(__aarch64__)" in {
+ def VGET_LANE_MF8 : IInst<"vget_lane", "1.I", "mQm", [ImmCheck<1, ImmCheckLaneIndex, 0>]>;
+ def SPLAT_MF8 : WInst<"splat_lane", ".(!q)I", "mQm", [ImmCheck<1, ImmCheckLaneIndex, 0>]>;
+ def SPLATQ_MF8 : WInst<"splat_laneq", ".(!Q)I", "mQm", [ImmCheck<1, ImmCheckLaneIndex, 0>]>;
+ def VSET_LANE_MF8 : IInst<"vset_lane", ".1.I", "mQm", [ImmCheck<2, ImmCheckLaneIndex, 1>]>;
+ def VCREATE_MF8 : NoTestOpInst<"vcreate", ".(IU>)", "m", OP_CAST> { let BigEndianSafe = 1; }
+ let InstName = "vmov" in {
+ def VDUP_N_MF8 : WOpInst<"vdup_n", ".1", "mQm", OP_DUP>;
+ def VMOV_N_MF8 : WOpInst<"vmov_n", ".1", "mQm", OP_DUP>;
+ }
+ let InstName = "" in
+ def VDUP_LANE_MF8: WOpInst<"vdup_lane", ".qI", "mQm", OP_DUP_LN>;
+ def VCOMBINE_MF8 : NoTestOpInst<"vcombine", "Q..", "m", OP_CONC>;
+ let InstName = "vmov" in {
+ def VGET_HIGH_MF8 : NoTestOpInst<"vget_high", ".Q", "m", OP_HI>;
+ def VGET_LOW_MF8 : NoTestOpInst<"vget_low", ".Q", "m", OP_LO>;
+ }
+ let InstName = "vtbl" in {
+ def VTBL1_MF8 : WInst<"vtbl1", "..p", "m">;
+ def VTBL2_MF8 : WInst<"vtbl2", ".2p", "m">;
+ def VTBL3_MF8 : WInst<"vtbl3", ".3p", "m">;
+ def VTBL4_MF8 : WInst<"vtbl4", ".4p", "m">;
+ }
+ let InstName = "vtbx" in {
+ def VTBX1_MF8 : WInst<"vtbx1", "...p", "m">;
+ def VTBX2_MF8 : WInst<"vtbx2", "..2p", "m">;
+ def VTBX3_MF8 : WInst<"vtbx3", "..3p", "m">;
+ def VTBX4_MF8 : WInst<"vtbx4", "..4p", "m">;
+ }
+ def VEXT_MF8 : WInst<"vext", "...I", "mQm", [ImmCheck<2, ImmCheckLaneIndex, 0>]>;
+ def VREV64_MF8 : WOpInst<"vrev64", "..", "mQm", OP_REV64>;
+ def VREV32_MF8 : WOpInst<"vrev32", "..", "mQm", OP_REV32>;
+ def VREV16_MF8 : WOpInst<"vrev16", "..", "mQm", OP_REV16>;
+ let isHiddenLInst = 1 in
+ def VBSL_MF8 : SInst<"vbsl", ".U..", "mQm">;
+ def VTRN_MF8 : WInst<"vtrn", "2..", "mQm">;
+ def VZIP_MF8 : WInst<"vzip", "2..", "mQm">;
+ def VUZP_MF8 : WInst<"vuzp", "2..", "mQm">;
+ def COPY_LANE_MF8 : IOpInst<"vcopy_lane", "..I.I", "m", OP_COPY_LN>;
+ def COPYQ_LANE_MF8 : IOpInst<"vcopy_lane", "..IqI", "Qm", OP_COPY_LN>;
+ def COPY_LANEQ_MF8 : IOpInst<"vcopy_laneq", "..IQI", "m", OP_COPY_LN>;
+ def COPYQ_LANEQ_MF8 : IOpInst<"vcopy_laneq", "..I.I", "Qm", OP_COPY_LN>;
+ def VDUP_LANE2_MF8 : WOpInst<"vdup_laneq", ".QI", "mQm", OP_DUP_LN>;
+ def VTRN1_MF8 : SOpInst<"vtrn1", "...", "mQm", OP_TRN1>;
+ def VZIP1_MF8 : SOpInst<"vzip1", "...", "mQm", OP_ZIP1>;
+ def VUZP1_MF8 : SOpInst<"vuzp1", "...", "mQm", OP_UZP1>;
+ def VTRN2_MF8 : SOpInst<"vtrn2", "...", "mQm", OP_TRN2>;
+ def VZIP2_MF8 : SOpInst<"vzip2", "...", "mQm", OP_ZIP2>;
+ def VUZP2_MF8 : SOpInst<"vuzp2", "...", "mQm", OP_UZP2>;
+ let InstName = "vtbl" in {
+ def VQTBL1_A64_MF8 : WInst<"vqtbl1", ".QU", "mQm">;
+ def VQTBL2_A64_MF8 : WInst<"vqtbl2", ".(2Q)U", "mQm">;
+ def VQTBL3_A64_MF8 : WInst<"vqtbl3", ".(3Q)U", "mQm">;
+ def VQTBL4_A64_MF8 : WInst<"vqtbl4", ".(4Q)U", "mQm">;
+ }
+ let InstName = "vtbx" in {
+ def VQTBX1_A64_MF8 : WInst<"vqtbx1", "..QU", "mQm">;
+ def VQTBX2_A64_MF8 : WInst<"vqtbx2", "..(2Q)U", "mQm">;
+ def VQTBX3_A64_MF8 : WInst<"vqtbx3", "..(3Q)U", "mQm">;
+ def VQTBX4_A64_MF8 : WInst<"vqtbx4", "..(4Q)U", "mQm">;
+ }
+ def SCALAR_VDUP_LANE_MF8 : IInst<"vdup_lane", "1.I", "Sm", [ImmCheck<1, ImmCheckLaneIndex, 0>]>;
+ def SCALAR_VDUP_LANEQ_MF8 : IInst<"vdup_laneq", "1QI", "Sm", [ImmCheck<1, ImmCheckLaneIndex, 0>]>;
}
\ No newline at end of file
diff --git a/clang/lib/AST/ExprConstant.cpp b/clang/lib/AST/ExprConstant.cpp
index 5c6ca4c9ee4de..655abd0ecdb50 100644
--- a/clang/lib/AST/ExprConstant.cpp
+++ b/clang/lib/AST/ExprConstant.cpp
@@ -11172,6 +11172,11 @@ VectorExprEvaluator::VisitInitListExpr(const InitListExpr *E) {
QualType EltTy = VT->getElementType();
SmallVector<APValue, 4> Elements;
+ // MFloat8 type doesn't have constants and thus constant folding
+ // is impossible.
+ if (EltTy->isMFloat8Type())
+ return false;
+
// The number of initializers can be less than the number of
// vector elements. For OpenCL, this can be due to nested vector
// initialization. For GCC compatibility, missing trailing elements
diff --git a/clang/lib/AST/Type.cpp b/clang/lib/AST/Type.cpp
index 8c11ec2e1fe24..2f7a3a5688973 100644
--- a/clang/lib/AST/Type.cpp
+++ b/clang/lib/AST/Type.cpp
@@ -2777,6 +2777,11 @@ static bool isTriviallyCopyableTypeImpl(const QualType &type,
if (CanonicalType->isScalarType() || CanonicalType->isVectorType())
return true;
+ // Mfloat8 type is a special case as it not scalar, but is still trivially
+ // copyable.
+ if (CanonicalType->isMFloat8Type())
+ return true;
+
if (const auto *RT = CanonicalType->getAs<RecordType>()) {
if (const auto *ClassDecl = dyn_cast<CXXRecordDecl>(RT->getDecl())) {
if (IsCopyConstructible) {
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 361e4c4bf2e2e..03062f01907d1 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -8189,8 +8189,9 @@ Value *CodeGenFunction::EmitCommonNeonBuiltinExpr(
// Determine the type of this overloaded NEON intrinsic.
NeonTypeFlags Type(NeonTypeConst->getZExtValue());
- bool Usgn = Type.isUnsigned();
- bool Quad = Type.isQuad();
+ const bool Usgn = Type.isUnsigned();
+ const bool Quad = Type.isQuad();
+ const bool Floating = Type.isFloatingPoint();
const bool HasLegalHalfType = getTarget().hasLegalHalfType();
const bool AllowBFloatArgsAndRet =
getTargetHooks().getABIInfo().allowBFloatArgsAndRet();
@@ -8291,24 +8292,28 @@ Value *CodeGenFunction::EmitCommonNeonBuiltinExpr(
}
case NEON::BI__builtin_neon_vceqz_v:
case NEON::BI__builtin_neon_vceqzq_v:
- return EmitAArch64CompareBuiltinExpr(Ops[0], Ty, ICmpInst::FCMP_OEQ,
- ICmpInst::ICMP_EQ, "vceqz");
+ return EmitAArch64CompareBuiltinExpr(
+ Ops[0], Ty, Floating ? ICmpInst::FCMP_OEQ : ICmpInst::ICMP_EQ, "vceqz");
case NEON::BI__builtin_neon_vcgez_v:
case NEON::BI__builtin_neon_vcgezq_v:
- return EmitAArch64CompareBuiltinExpr(Ops[0], Ty, ICmpInst::FCMP_OGE,
- ICmpInst::ICMP_SGE, "vcgez");
+ return EmitAArch64CompareBuiltinExpr(
+ Ops[0], Ty, Floating ? ICmpInst::FCMP_OGE : ICmpInst::ICMP_SGE,
+ "vcgez");
case NEON::BI__builtin_neon_vclez_v:
case NEON::BI__builtin_neon_vclezq_v:
- return EmitAArch64CompareBuiltinExpr(Ops[0], Ty, ICmpInst::FCMP_OLE,
- ICmpInst::ICMP_SLE, "vclez");
+ return EmitAArch64CompareBuiltinExpr(
+ Ops[0], Ty, Floating ? ICmpInst::FCMP_OLE : ICmpInst::ICMP_SLE,
+ "vclez");
case NEON::BI__builtin_neon_vcgtz_v:
case NEON::BI__builtin_neon_vcgtzq_v:
- return EmitAArch64CompareBuiltinExpr(Ops[0], Ty, ICmpInst::FCMP_OGT,
- ICmpInst::ICMP_SGT, "vcgtz");
+ return EmitAArch64CompareBuiltinExpr(
+ Ops[0], Ty, Floating ? ICmpInst::FCMP_OGT : ICmpInst::ICMP_SGT,
+ "vcgtz");
case NEON::BI__builtin_neon_vcltz_v:
case NEON::BI__builtin_neon_vcltzq_v:
- return EmitAArch64CompareBuiltinExpr(Ops[0], Ty, ICmpInst::FCMP_OLT,
- ICmpInst::ICMP_SLT, "vcltz");
+ return EmitAArch64CompareBuiltinExpr(
+ Ops[0], Ty, Floating ? ICmpInst::FCMP_OLT : ICmpInst::ICMP_SLT,
+ "vcltz");
case NEON::BI__builtin_neon_vclz_v:
case NEON::BI__builtin_neon_vclzq_v:
// We generate target-independent intrinsic, which needs a second argument
@@ -8871,28 +8876,32 @@ Value *CodeGenFunction::EmitCommonNeonBuiltinExpr(
return Builder.CreateBitCast(Result, ResultType, NameHint);
}
-Value *CodeGenFunction::EmitAArch64CompareBuiltinExpr(
- Value *Op, llvm::Type *Ty, const CmpInst::Predicate Fp,
- const CmpInst::Predicate Ip, const Twine &Name) {
- llvm::Type *OTy = Op->getType();
-
- // FIXME: this is utterly horrific. We should not be looking at previous
- // codegen context to find out what needs doing. Unfortunately TableGen
- // currently gives us exactly the same calls for vceqz_f32 and vceqz_s32
- // (etc).
- if (BitCastInst *BI = dyn_cast<BitCastInst>(Op))
- OTy = BI->getOperand(0)->getType();
-
- Op = Builder.CreateBitCast(Op, OTy);
- if (OTy->getScalarType()->isFloatingPointTy()) {
- if (Fp == CmpInst::FCMP_OEQ)
- Op = Builder.CreateFCmp(Fp, Op, Constant::getNullValue(OTy));
+Value *
+CodeGenFunction::EmitAArch64CompareBuiltinExpr(Value *Op, llvm::Type *Ty,
+ const CmpInst::Predicate Pred,
+ const Twine &Name) {
+
+ if (isa<FixedVectorType>(Ty)) {
+ // Vector types are cast to i8 vectors. Recover original type.
+ Op = Builder.CreateBitCast(Op, Ty);
+ }
+
+ if (CmpInst::isFPPredicate(Pred)) {
+...
[truncated]
✅ With the latest revision this PR passed the undef deprecator.
✅ With the latest revision this PR passed the C/C++ code formatter.
Hi Marian,
There are some tests failing with the codegen. Can you rebase this patch and fix the failing tests?
This patch adds fp8 variants to existing intrinsics, whose operation doesn't depend on arguments being a specific type.
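As a purely illustrative sketch (not part of the patch description or its tests), user code enabled by these variants could look like the snippet below. The intrinsic names are assumed from the usual ACLE naming scheme applied to the new mfloat8 definitions (vdup_n_mf8, vcombine_mf8, vextq_mf8), so treat them as hypothetical until checked against the arm_neon.h generated by this change.

#include <arm_neon.h>

// Hedged example: intrinsic spellings assumed from ACLE conventions for the
// new mfloat8 variants; verify against the arm_neon.h produced by this patch.
mfloat8x16_t rotate_mf8(mfloat8_t x, mfloat8x8_t lo) {
  mfloat8x8_t hi = vdup_n_mf8(x);         // splat one fp8 scalar into a 64-bit vector
  mfloat8x16_t v = vcombine_mf8(lo, hi);  // concatenate the two 64-bit halves
  return vextq_mf8(v, v, 1);              // byte-wise extract/rotate by one lane
}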
1dae495 to c331c4c
Failures should be fixed now.
clang/lib/CodeGen/CGCall.cpp (Outdated)
// Mfloat8 type is loaded as scalar type, but is treated as single
// vector type for other operations. We need to bitcast it to the vector
// type here.
if (auto *EltTy =
I am not sure if this is the best way to solve this issue, so I would appreciate your feedback on this.
I don't see an issue here. That is exactly what should happen regardless of the target architecture any time the ABI for that architecture says values of type T
are passed as <1 x T>
.
Does the ABI say this? My understanding is that values of type __mfp8 are floating-point 8-bit values that are passed as __mfp8. Pretending it's an i8 in some cases and <1 x i8> in others is purely an implementation detail within clang.
This is not to say the code is invalid, but we should be cautious with how far down the rabbit hole we go.
FYI: As part of @MacDue's work to improve streaming-mode code generation I asked him to add the MVT aarch64mfp8 along with support to load and store it. I expect over time we'll migrate away from using i8 as our scalar type.
Not sure what the fallout will be from this, but I think the problem here is that we should not have loaded a scalar in the first place. Looking at CodeGenTypes::ConvertTypeForMem() I can see that we're using a different type for the memory representation than the normal one, which I think is a mistake.
Changing this so the types are consistent will remove the need for this code, but I suspect it'll prompt further work elsewhere. My hope is that the work sits in target-specific areas relating to modelling the builtin, so it seems reasonable. Please shout though if it starts to get out of control.
Done
Does the ABI say this?
It doesn't. Unfortunately this discussion was split and I didn't replicate all my comments here.
Momchil Velikov 15 Apr at 16:11
The ABI spec (naturally) does not say anything about <1 x i8>. It says (in a somewhat obscure way) that the value is passed in an FPR.
And then clang/llvm decide to implement the ABI by mapping to <1 x T>.
I consider the "natural" mapping of __mfp8 to LLVM types to be i8, and <1 x i8> to be merely a hack coming from the peculiar way of implementing ABIs in clang/llvm (by implicit contracts and "mutual understanding"). As such, <1 x i8> ought to be applicable only for values that are arguments passed in registers.
I'm not yet confident in my understanding of the trade-offs between the two approaches, besides that one impacts target-specific code while the other affects target-independent code. As such, I don't feel well-positioned to contribute meaningfully to this discussion. That said, I'd appreciate it if we could reach alignment here, as I'd like to merge this patch soon.
The underlying storage for __mfp8 is an FPR, and until we decide whether to use a dedicated target type, or LLVM gains an opaque 8-bit floating point type, our only option is to represent it as an i8 vector type.
The reason for using i8 was for some specific code reuse, but as this PR showed, that reuse is not total, and so I'd rather we just be honest and insert the relevant bitcasts when necessary. This will put us in good stead if we decide to go the target type route.
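To make the bitcast point concrete, below is a hand-written sketch of the kind of fix-up being discussed. It is an assumption about the shape of such a helper, not code from this PR, and the name mf8VectorToScalar is invented for illustration.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/DerivedTypes.h"

// Hypothetical helper (not from the patch): narrow an mfloat8 value carried in
// its <1 x i8> in-register form to the plain i8 expected by some operations.
static llvm::Value *mf8VectorToScalar(llvm::IRBuilder<> &B, llvm::Value *V) {
  auto *VT = llvm::dyn_cast<llvm::FixedVectorType>(V->getType());
  if (VT && VT->getNumElements() == 1 && VT->getElementType()->isIntegerTy(8))
    return B.CreateBitCast(V, B.getInt8Ty(), "mf8.scalar"); // <1 x i8> -> i8
  return V; // already scalar; nothing to do
}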
For my education, can you explain why the fp8 variants are broken out into their own definitions? Taking …
That's a good question. It's been a while since I implemented this patch, so I have forgotten my reasoning behind this, but I think this might be because I was originally not sure if we wanted to target-guard these behind the fp8 feature flag. Since it looks like we are not doing that, I can merge them back into their original intrinsics.
@paulwalker-arm the reasoning behind creating separate records is that the mfloat type is not available for aarch32 architectures, and therefore all intrinsics using it need to be gated behind …
I see. How practical would it be for NEONEmitter to infer the ArchGuard based on the type? I'm assuming ArchGuard is either unset or set to what we need for all the cases we care about. This is not a firm ask, but it would be nice to reuse the existing definitions if possible.
…hitecture checks on AArch64
I have adjusted NeonEmitter to automatically emit the correct attribute for mfloat8 intrinsics and merged them into the original records.
Change-Id: I6c4d9d98fbe46fb3ee115532a9432709c6a86e10
I've not verified every line of the test files but what I've seen looks good, as do the code changes. Other than a few stylistic suggestions this looks good to me.
Change-Id: I9ee2f41ec8879bd631c6ef64e9dc721ef22cf2a1
Change-Id: Ic460c0e6afdccdc37ef31f78cde9933cdcb3c544
…28019) This patch adds fp8 variants to existing intrinsics, whose operation doesn't depend on arguments being a specific type. It also changes mfloat8 type representation in memory from `i8` to `<1xi8>`
This patch adds fp8 variants to existing intrinsics, whose operation doesn't depend on arguments being a specific type.
It also changes mfloat8 type representation in memory from i8 to <1xi8>.