[SDag][ARM][RISCV] Allow lowering CTPOP into a libcall #101786

s-barannikov · 2024-08-03T03:57:44Z

This is a reland of #99752 with the bug fixed (see test diff in the third commit in this PR).
All popcount libcalls return `int`, but ISD::CTPOP returns the type of its argument, which can be wider than `int`. The fix is to make the DAG legalizer pass the correct return type to makeLibCall and sign-extend the result afterwards.

The main change is adding CTPOP to `RuntimeLibcalls.def` to allow targets to use LibCall action for CTPOP. DAG legalizers are changed accordingly.
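
To illustrate the return-type mismatch described above, here is a minimal standalone sketch in plain C++, not the legalizer code itself. `popcount_di2_model` is a hypothetical stand-in for `__popcountdi2` (the real routine lives in the runtime library), and `ctpop_i64_via_libcall` is an illustrative name for the "call with an `int` result, then sign-extend to the node's type" step. It assumes `int` is narrower than 64 bits, as on the targets touched here.

```cpp
#include <cstdint>

// Hypothetical model of the runtime routine: like the real popcount
// libcalls, it takes an unsigned 64-bit argument and returns `int`.
int popcount_di2_model(std::uint64_t X) {
  int Count = 0;
  while (X != 0) {
    X &= X - 1; // Clear the lowest set bit.
    ++Count;
  }
  return Count;
}

// Illustrative model of the fixed lowering: the call is made with an
// `int` result, and only afterwards widened to the 64-bit result type
// that a CTPOP on i64 is expected to produce.
std::int64_t ctpop_i64_via_libcall(std::uint64_t X) {
  int Narrow = popcount_di2_model(X);       // Libcall result: `int`.
  return static_cast<std::int64_t>(Narrow); // Sign-extend to i64.
}
```

Since a population count is never negative, sign extension and zero extension give the same value here; the point of the sketch is only that the widening happens after the call rather than being assumed in the call's return type.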

llvmbot · 2024-08-03T03:58:17Z

@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-llvm-ir

Author: Sergei Barannikov (s-barannikov)

Changes

This is a reland of #99752 with the bug fixed.

Original commit message:

The main change is adding CTPOP to RuntimeLibcalls.def to allow
targets to use LibCall action for CTPOP. DAG legalizers are changed
accordingly.

Patch is 118.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/101786.diff

20 Files Affected:

(modified) llvm/include/llvm/IR/RuntimeLibcalls.def (+3)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (+64-21)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp (+27-7)
(modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+3-2)
(modified) llvm/lib/Target/ARM/ARMISelLowering.cpp (+2-1)
(modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+6-1)
(modified) llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll (+8-8)
(modified) llvm/test/CodeGen/ARM/popcnt.ll (+8-62)
(modified) llvm/test/CodeGen/RISCV/ctlz-cttz-ctpop.ll (+328-886)
(modified) llvm/test/CodeGen/RISCV/ctz_zero_return_test.ll (+24-74)
(modified) llvm/test/CodeGen/RISCV/pr56457.ll (+5-26)
(modified) llvm/test/CodeGen/RISCV/pr95271.ll (+1-22)
(modified) llvm/test/CodeGen/RISCV/rv32xtheadbb.ll (+23-65)
(modified) llvm/test/CodeGen/RISCV/rv32zbb.ll (+76-237)
(modified) llvm/test/CodeGen/RISCV/rv64-legal-i32/rv64xtheadbb.ll (+5-29)
(modified) llvm/test/CodeGen/RISCV/rv64-legal-i32/rv64zbb.ll (+50-74)
(modified) llvm/test/CodeGen/RISCV/rv64xtheadbb.ll (+5-29)
(modified) llvm/test/CodeGen/RISCV/rv64zbb.ll (+65-120)
(modified) llvm/test/CodeGen/RISCV/sextw-removal.ll (+6-40)
(modified) llvm/test/CodeGen/Thumb2/mve-ctpop.ll (+13-50)

diff --git a/llvm/include/llvm/IR/RuntimeLibcalls.def b/llvm/include/llvm/IR/RuntimeLibcalls.def
index 89aaf6d1ad83f..3dd75622b8e43 100644
--- a/llvm/include/llvm/IR/RuntimeLibcalls.def
+++ b/llvm/include/llvm/IR/RuntimeLibcalls.def
@@ -85,6 +85,9 @@ HANDLE_LIBCALL(NEG_I64, "__negdi2")
 HANDLE_LIBCALL(CTLZ_I32, "__clzsi2")
 HANDLE_LIBCALL(CTLZ_I64, "__clzdi2")
 HANDLE_LIBCALL(CTLZ_I128, "__clzti2")
+HANDLE_LIBCALL(CTPOP_I32, "__popcountsi2")
+HANDLE_LIBCALL(CTPOP_I64, "__popcountdi2")
+HANDLE_LIBCALL(CTPOP_I128, "__popcountti2")
 
 // Floating-point
 HANDLE_LIBCALL(ADD_F32, "__addsf3")
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index bdb7917073020..8a10f95ca20f6 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -129,7 +129,8 @@ class SelectionDAGLegalize {
                                      ArrayRef<int> Mask) const;
 
   std::pair<SDValue, SDValue> ExpandLibCall(RTLIB::Libcall LC, SDNode *Node,
-                        TargetLowering::ArgListTy &&Args, bool isSigned);
+                                            TargetLowering::ArgListTy &&Args,
+                                            bool IsSigned, EVT RetVT);
   std::pair<SDValue, SDValue> ExpandLibCall(RTLIB::Libcall LC, SDNode *Node, bool isSigned);
 
   void ExpandFrexpLibCall(SDNode *Node, SmallVectorImpl<SDValue> &Results);
@@ -151,6 +152,9 @@ class SelectionDAGLegalize {
                           RTLIB::Libcall Call_F80, RTLIB::Libcall Call_F128,
                           RTLIB::Libcall Call_PPCF128,
                           SmallVectorImpl<SDValue> &Results);
+  SDValue ExpandBitCountingLibCall(SDNode *Node, RTLIB::Libcall CallI32,
+                                   RTLIB::Libcall CallI64,
+                                   RTLIB::Libcall CallI128);
   void ExpandDivRemLibCall(SDNode *Node, SmallVectorImpl<SDValue> &Results);
   void ExpandSinCosLibCall(SDNode *Node, SmallVectorImpl<SDValue> &Results);
 
@@ -2059,9 +2063,10 @@ SDValue SelectionDAGLegalize::ExpandSPLAT_VECTOR(SDNode *Node) {
 // register, return the lo part and set the hi part to the by-reg argument in
 // the first.  If it does fit into a single register, return the result and
 // leave the Hi part unset.
-std::pair<SDValue, SDValue> SelectionDAGLegalize::ExpandLibCall(RTLIB::Libcall LC, SDNode *Node,
-                                            TargetLowering::ArgListTy &&Args,
-                                            bool isSigned) {
+std::pair<SDValue, SDValue>
+SelectionDAGLegalize::ExpandLibCall(RTLIB::Libcall LC, SDNode *Node,
+                                    TargetLowering::ArgListTy &&Args,
+                                    bool IsSigned, EVT RetVT) {
   EVT CodePtrTy = TLI.getPointerTy(DAG.getDataLayout());
   SDValue Callee;
   if (const char *LibcallName = TLI.getLibcallName(LC))
@@ -2072,7 +2077,6 @@ std::pair<SDValue, SDValue> SelectionDAGLegalize::ExpandLibCall(RTLIB::Libcall L
                                 Node->getOperationName(&DAG));
   }
 
-  EVT RetVT = Node->getValueType(0);
   Type *RetTy = RetVT.getTypeForEVT(*DAG.getContext());
 
   // By default, the input chain to this libcall is the entry node of the
@@ -2092,7 +2096,7 @@ std::pair<SDValue, SDValue> SelectionDAGLegalize::ExpandLibCall(RTLIB::Libcall L
     InChain = TCChain;
 
   TargetLowering::CallLoweringInfo CLI(DAG);
-  bool signExtend = TLI.shouldSignExtendTypeInLibCall(RetVT, isSigned);
+  bool signExtend = TLI.shouldSignExtendTypeInLibCall(RetVT, IsSigned);
   CLI.setDebugLoc(SDLoc(Node))
       .setChain(InChain)
       .setLibCallee(TLI.getLibcallCallingConv(LC), RetTy, Callee,
@@ -2128,7 +2132,8 @@ std::pair<SDValue, SDValue> SelectionDAGLegalize::ExpandLibCall(RTLIB::Libcall L
     Args.push_back(Entry);
   }
 
-  return ExpandLibCall(LC, Node, std::move(Args), isSigned);
+  return ExpandLibCall(LC, Node, std::move(Args), isSigned,
+                       Node->getValueType(0));
 }
 
 void SelectionDAGLegalize::ExpandFrexpLibCall(
@@ -2155,7 +2160,8 @@ void SelectionDAGLegalize::ExpandFrexpLibCall(
   TargetLowering::ArgListTy Args = {FPArgEntry, PtrArgEntry};
 
   RTLIB::Libcall LC = RTLIB::getFREXP(VT);
-  auto [Call, Chain] = ExpandLibCall(LC, Node, std::move(Args), false);
+  auto [Call, Chain] = ExpandLibCall(LC, Node, std::move(Args),
+                                     /*IsSigned=*/false, VT);
 
   // FIXME: Get type of int for libcall declaration and cast
 
@@ -2243,6 +2249,50 @@ void SelectionDAGLegalize::ExpandArgFPLibCall(SDNode* Node,
   ExpandFPLibCall(Node, LC, Results);
 }
 
+SDValue SelectionDAGLegalize::ExpandBitCountingLibCall(
+    SDNode *Node, RTLIB::Libcall CallI32, RTLIB::Libcall CallI64,
+    RTLIB::Libcall CallI128) {
+  RTLIB::Libcall LC;
+  switch (Node->getSimpleValueType(0).SimpleTy) {
+  default:
+    llvm_unreachable("Unexpected request for libcall!");
+  case MVT::i32:
+    LC = CallI32;
+    break;
+  case MVT::i64:
+    LC = CallI64;
+    break;
+  case MVT::i128:
+    LC = CallI128;
+    break;
+  }
+
+  // Bit-counting libcalls have one unsigned argument and return `int`.
+  // Note that `int` may be illegal on this target; ExpandLibCall will
+  // take care of promoting it to a legal type.
+  SDValue Op = Node->getOperand(0);
+  EVT IntVT =
+      EVT::getIntegerVT(*DAG.getContext(), DAG.getLibInfo().getIntSize());
+
+  TargetLowering::ArgListEntry Arg;
+  EVT ArgVT = Op.getValueType();
+  Type *ArgTy = ArgVT.getTypeForEVT(*DAG.getContext());
+  Arg.Node = Op;
+  Arg.Ty = ArgTy;
+  Arg.IsSExt = TLI.shouldSignExtendTypeInLibCall(ArgVT, /*IsSigned=*/false);
+  Arg.IsZExt = !Arg.IsSExt;
+
+  SDValue Res = ExpandLibCall(LC, Node, TargetLowering::ArgListTy{Arg},
+                              /*IsSigned=*/true, IntVT)
+                    .first;
+
+  // If ExpandLibCall created a tail call, the result was already
+  // of the correct type. Otherwise, we need to sign extend it.
+  if (Res.getValueType() != MVT::Other)
+    Res = DAG.getSExtOrTrunc(Res, SDLoc(Node), Node->getValueType(0));
+  return Res;
+}
+
 /// Issue libcalls to __{u}divmod to compute div / rem pairs.
 void
 SelectionDAGLegalize::ExpandDivRemLibCall(SDNode *Node,
@@ -5000,19 +5050,12 @@ void SelectionDAGLegalize::ConvertNodeToLibcall(SDNode *Node) {
                                        RTLIB::MUL_I64, RTLIB::MUL_I128));
     break;
   case ISD::CTLZ_ZERO_UNDEF:
-    switch (Node->getSimpleValueType(0).SimpleTy) {
-    default:
-      llvm_unreachable("LibCall explicitly requested, but not available");
-    case MVT::i32:
-      Results.push_back(ExpandLibCall(RTLIB::CTLZ_I32, Node, false).first);
-      break;
-    case MVT::i64:
-      Results.push_back(ExpandLibCall(RTLIB::CTLZ_I64, Node, false).first);
-      break;
-    case MVT::i128:
-      Results.push_back(ExpandLibCall(RTLIB::CTLZ_I128, Node, false).first);
-      break;
-    }
+    Results.push_back(ExpandBitCountingLibCall(
+        Node, RTLIB::CTLZ_I32, RTLIB::CTLZ_I64, RTLIB::CTLZ_I128));
+    break;
+  case ISD::CTPOP:
+    Results.push_back(ExpandBitCountingLibCall(
+        Node, RTLIB::CTPOP_I32, RTLIB::CTPOP_I64, RTLIB::CTPOP_I128));
     break;
   case ISD::RESET_FPENV: {
     // It is legalized to call 'fesetenv(FE_DFL_ENV)'. On most targets
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index b1ada66aa9aeb..0b0789b75dd50 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -3850,15 +3850,35 @@ void DAGTypeLegalizer::ExpandIntRes_CTLZ(SDNode *N,
   Hi = DAG.getConstant(0, dl, NVT);
 }
 
-void DAGTypeLegalizer::ExpandIntRes_CTPOP(SDNode *N,
-                                          SDValue &Lo, SDValue &Hi) {
-  SDLoc dl(N);
+void DAGTypeLegalizer::ExpandIntRes_CTPOP(SDNode *N, SDValue &Lo, SDValue &Hi) {
+  SDValue Op = N->getOperand(0);
+  EVT VT = N->getValueType(0);
+  SDLoc DL(N);
+
+  if (TLI.getOperationAction(ISD::CTPOP, VT) == TargetLoweringBase::LibCall) {
+    RTLIB::Libcall LC = RTLIB::UNKNOWN_LIBCALL;
+    if (VT == MVT::i32)
+      LC = RTLIB::CTPOP_I32;
+    else if (VT == MVT::i64)
+      LC = RTLIB::CTPOP_I64;
+    else if (VT == MVT::i128)
+      LC = RTLIB::CTPOP_I128;
+    assert(LC != RTLIB::UNKNOWN_LIBCALL && TLI.getLibcallName(LC) &&
+           "LibCall explicitly requested, but not available");
+    TargetLowering::MakeLibCallOptions CallOptions;
+    EVT IntVT =
+        EVT::getIntegerVT(*DAG.getContext(), DAG.getLibInfo().getIntSize());
+    SDValue Res = TLI.makeLibCall(DAG, LC, IntVT, Op, CallOptions, DL).first;
+    SplitInteger(DAG.getSExtOrTrunc(Res, DL, VT), Lo, Hi);
+    return;
+  }
+
   // ctpop(HiLo) -> ctpop(Hi)+ctpop(Lo)
-  GetExpandedInteger(N->getOperand(0), Lo, Hi);
+  GetExpandedInteger(Op, Lo, Hi);
   EVT NVT = Lo.getValueType();
-  Lo = DAG.getNode(ISD::ADD, dl, NVT, DAG.getNode(ISD::CTPOP, dl, NVT, Lo),
-                   DAG.getNode(ISD::CTPOP, dl, NVT, Hi));
-  Hi = DAG.getConstant(0, dl, NVT);
+  Lo = DAG.getNode(ISD::ADD, DL, NVT, DAG.getNode(ISD::CTPOP, DL, NVT, Lo),
+                   DAG.getNode(ISD::CTPOP, DL, NVT, Hi));
+  Hi = DAG.getConstant(0, DL, NVT);
 }
 
 void DAGTypeLegalizer::ExpandIntRes_CTTZ(SDNode *N,
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 7fa83a5999dfe..0989b79771c03 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9161,8 +9161,9 @@ SDValue TargetLowering::expandCTTZ(SDNode *Node, SelectionDAG &DAG) const {
                         !isOperationLegalOrCustomOrPromote(ISD::XOR, VT)))
     return SDValue();
 
-  // Emit Table Lookup if ISD::CTLZ and ISD::CTPOP are not legal.
-  if (!VT.isVector() && isOperationExpand(ISD::CTPOP, VT) &&
+  // Emit Table Lookup if ISD::CTPOP used in the fallback path below is going
+  // to be expanded or converted to a libcall.
+  if (!VT.isVector() && !isOperationLegalOrCustomOrPromote(ISD::CTPOP, VT) &&
       !isOperationLegal(ISD::CTLZ, VT))
     if (SDValue V = CTTZTableLookup(Node, DAG, dl, VT, Op, NumBitsPerElt))
       return V;
diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 75d16a42d0205..1d1ea22f6aac4 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -1204,7 +1204,8 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ROTR, VT, Expand);
   }
   setOperationAction(ISD::CTTZ,  MVT::i32, Custom);
-  setOperationAction(ISD::CTPOP, MVT::i32, Expand);
+  setOperationAction(ISD::CTPOP, MVT::i32, LibCall);
+  setOperationAction(ISD::CTPOP, MVT::i64, LibCall);
   if (!Subtarget->hasV5TOps() || Subtarget->isThumb1Only()) {
     setOperationAction(ISD::CTLZ, MVT::i32, Expand);
     setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i32, LibCall);
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 9ee60b9db2837..c16b1ca742d7b 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -393,7 +393,12 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
         setOperationAction({ISD::CTTZ, ISD::CTTZ_ZERO_UNDEF}, MVT::i32, Custom);
     }
   } else {
-    setOperationAction({ISD::CTTZ, ISD::CTPOP}, XLenVT, Expand);
+    setOperationAction(ISD::CTTZ, XLenVT, Expand);
+    if (Subtarget.is64Bit())
+      setOperationAction(ISD::CTPOP, MVT::i128, LibCall);
+    else
+      setOperationAction(ISD::CTPOP, MVT::i32, LibCall);
+    setOperationAction(ISD::CTPOP, MVT::i64, LibCall);
     if (RV64LegalI32 && Subtarget.is64Bit())
       setOperationAction({ISD::CTTZ, ISD::CTPOP}, MVT::i32, Expand);
   }
diff --git a/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll b/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll
index 380f65b19b8fa..c0ecc63b82dca 100644
--- a/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll
@@ -159,7 +159,7 @@ define void @bitreverse() {
 
 define void @ctpop() {
 ; NOZVBB-LABEL: 'ctpop'
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %2 = call <2 x i8> @llvm.ctpop.v2i8(<2 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %3 = call <4 x i8> @llvm.ctpop.v4i8(<4 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %4 = call <8 x i8> @llvm.ctpop.v8i8(<8 x i8> undef)
@@ -169,7 +169,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %8 = call <vscale x 4 x i8> @llvm.ctpop.nxv4i8(<vscale x 4 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %9 = call <vscale x 8 x i8> @llvm.ctpop.nxv8i8(<vscale x 8 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %10 = call <vscale x 16 x i8> @llvm.ctpop.nxv16i8(<vscale x 16 x i8> undef)
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %12 = call <2 x i16> @llvm.ctpop.v2i16(<2 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %13 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %14 = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> undef)
@@ -179,7 +179,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %18 = call <vscale x 4 x i16> @llvm.ctpop.nxv4i16(<vscale x 4 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %19 = call <vscale x 8 x i16> @llvm.ctpop.nxv8i16(<vscale x 8 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %20 = call <vscale x 16 x i16> @llvm.ctpop.nxv16i16(<vscale x 16 x i16> undef)
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %22 = call <2 x i32> @llvm.ctpop.v2i32(<2 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %23 = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %24 = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> undef)
@@ -189,7 +189,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %28 = call <vscale x 4 x i32> @llvm.ctpop.nxv4i32(<vscale x 4 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %29 = call <vscale x 8 x i32> @llvm.ctpop.nxv8i32(<vscale x 8 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %30 = call <vscale x 16 x i32> @llvm.ctpop.nxv16i32(<vscale x 16 x i32> undef)
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %32 = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %33 = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %34 = call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> undef)
@@ -202,7 +202,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; ZVBB-LABEL: 'ctpop'
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %2 = call <2 x i8> @llvm.ctpop.v2i8(<2 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %3 = call <4 x i8> @llvm.ctpop.v4i8(<4 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %4 = call <8 x i8> @llvm.ctpop.v8i8(<8 x i8> undef)
@@ -212,7 +212,7 @@ define void @ctpop() {
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %8 = call <vscale x 4 x i8> @llvm.ctpop.nxv4i8(<vscale x 4 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %9 = call <vscale x 8 x i8> @llvm.ctpop.nxv8i8(<vscale x 8 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %10 = call <vscale x 16 x i8> @llvm.ctpop.nxv16i8(<vscale x 16 x i8> undef)
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %12 = call <2 x i16> @llvm.ctpop.v2i16(<2 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %13 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %14 = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> undef)
@@ -222,7 +222,7 @@ define void @ctpop() {
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %18 = call <vscale x 4 x i16> @llvm.ctpop.nxv4i16(<vscale x 4 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %19 = call <vscale x 8 x i16> @llvm.ctpop.nxv8i16(<vscale x 8 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %20 = call <vscale x 16 x i16> @llvm.ctpop.nxv16i16(<vscale x 16 x i16> undef)
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %22 = call <2 x i32> @llvm.ctpop.v2i32(<2 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %23 = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %24 = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> undef)
@@ -232,7 +232,7 @@ define void @ctpop() {
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %28 = call <vscale x 4 x i32> @llvm.ctpop.nxv4i32(<vscale x 4 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %29 = call <vscale x 8 x i32> @llvm.ctpop.nxv8i32(<vscale x 8 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %30 = call <vscale x 16 x i32> @llvm.ctpop.nxv16i32(<vscale x 16 x i32> undef)
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %32 = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %33 = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %34 = call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> undef)
diff --git a/llvm/test/CodeGen/ARM/popcnt.ll b/llvm/test/CodeGen/ARM/popcnt.ll
index edcae5e141e73..fc4387320ef77...
[truncated]

llvmbot · 2024-08-03T03:58:17Z

@llvm/pr-subscribers-llvm-analysis

index 7fa83a5999dfe..0989b79771c03 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -9161,8 +9161,9 @@ SDValue TargetLowering::expandCTTZ(SDNode *Node, SelectionDAG &DAG) const {
                         !isOperationLegalOrCustomOrPromote(ISD::XOR, VT)))
     return SDValue();
 
-  // Emit Table Lookup if ISD::CTLZ and ISD::CTPOP are not legal.
-  if (!VT.isVector() && isOperationExpand(ISD::CTPOP, VT) &&
+  // Emit Table Lookup if ISD::CTPOP used in the fallback path below is going
+  // to be expanded or converted to a libcall.
+  if (!VT.isVector() && !isOperationLegalOrCustomOrPromote(ISD::CTPOP, VT) &&
       !isOperationLegal(ISD::CTLZ, VT))
     if (SDValue V = CTTZTableLookup(Node, DAG, dl, VT, Op, NumBitsPerElt))
       return V;
diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 75d16a42d0205..1d1ea22f6aac4 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -1204,7 +1204,8 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ROTR, VT, Expand);
   }
   setOperationAction(ISD::CTTZ,  MVT::i32, Custom);
-  setOperationAction(ISD::CTPOP, MVT::i32, Expand);
+  setOperationAction(ISD::CTPOP, MVT::i32, LibCall);
+  setOperationAction(ISD::CTPOP, MVT::i64, LibCall);
   if (!Subtarget->hasV5TOps() || Subtarget->isThumb1Only()) {
     setOperationAction(ISD::CTLZ, MVT::i32, Expand);
     setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i32, LibCall);
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 9ee60b9db2837..c16b1ca742d7b 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -393,7 +393,12 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
         setOperationAction({ISD::CTTZ, ISD::CTTZ_ZERO_UNDEF}, MVT::i32, Custom);
     }
   } else {
-    setOperationAction({ISD::CTTZ, ISD::CTPOP}, XLenVT, Expand);
+    setOperationAction(ISD::CTTZ, XLenVT, Expand);
+    if (Subtarget.is64Bit())
+      setOperationAction(ISD::CTPOP, MVT::i128, LibCall);
+    else
+      setOperationAction(ISD::CTPOP, MVT::i32, LibCall);
+    setOperationAction(ISD::CTPOP, MVT::i64, LibCall);
     if (RV64LegalI32 && Subtarget.is64Bit())
       setOperationAction({ISD::CTTZ, ISD::CTPOP}, MVT::i32, Expand);
   }
diff --git a/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll b/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll
index 380f65b19b8fa..c0ecc63b82dca 100644
--- a/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/int-bit-manip.ll
@@ -159,7 +159,7 @@ define void @bitreverse() {
 
 define void @ctpop() {
 ; NOZVBB-LABEL: 'ctpop'
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %2 = call <2 x i8> @llvm.ctpop.v2i8(<2 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %3 = call <4 x i8> @llvm.ctpop.v4i8(<4 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %4 = call <8 x i8> @llvm.ctpop.v8i8(<8 x i8> undef)
@@ -169,7 +169,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %8 = call <vscale x 4 x i8> @llvm.ctpop.nxv4i8(<vscale x 4 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %9 = call <vscale x 8 x i8> @llvm.ctpop.nxv8i8(<vscale x 8 x i8> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %10 = call <vscale x 16 x i8> @llvm.ctpop.nxv16i8(<vscale x 16 x i8> undef)
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %12 = call <2 x i16> @llvm.ctpop.v2i16(<2 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %13 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %14 = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> undef)
@@ -179,7 +179,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %18 = call <vscale x 4 x i16> @llvm.ctpop.nxv4i16(<vscale x 4 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %19 = call <vscale x 8 x i16> @llvm.ctpop.nxv8i16(<vscale x 8 x i16> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %20 = call <vscale x 16 x i16> @llvm.ctpop.nxv16i16(<vscale x 16 x i16> undef)
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %22 = call <2 x i32> @llvm.ctpop.v2i32(<2 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %23 = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %24 = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> undef)
@@ -189,7 +189,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %28 = call <vscale x 4 x i32> @llvm.ctpop.nxv4i32(<vscale x 4 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %29 = call <vscale x 8 x i32> @llvm.ctpop.nxv8i32(<vscale x 8 x i32> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %30 = call <vscale x 16 x i32> @llvm.ctpop.nxv16i32(<vscale x 16 x i32> undef)
-; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
+; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %32 = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %33 = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> undef)
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %34 = call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> undef)
@@ -202,7 +202,7 @@ define void @ctpop() {
 ; NOZVBB-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; ZVBB-LABEL: 'ctpop'
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = call i8 @llvm.ctpop.i8(i8 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %2 = call <2 x i8> @llvm.ctpop.v2i8(<2 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %3 = call <4 x i8> @llvm.ctpop.v4i8(<4 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %4 = call <8 x i8> @llvm.ctpop.v8i8(<8 x i8> undef)
@@ -212,7 +212,7 @@ define void @ctpop() {
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %8 = call <vscale x 4 x i8> @llvm.ctpop.nxv4i8(<vscale x 4 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %9 = call <vscale x 8 x i8> @llvm.ctpop.nxv8i8(<vscale x 8 x i8> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %10 = call <vscale x 16 x i8> @llvm.ctpop.nxv16i8(<vscale x 16 x i8> undef)
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %11 = call i16 @llvm.ctpop.i16(i16 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %12 = call <2 x i16> @llvm.ctpop.v2i16(<2 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %13 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %14 = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> undef)
@@ -222,7 +222,7 @@ define void @ctpop() {
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %18 = call <vscale x 4 x i16> @llvm.ctpop.nxv4i16(<vscale x 4 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %19 = call <vscale x 8 x i16> @llvm.ctpop.nxv8i16(<vscale x 8 x i16> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %20 = call <vscale x 16 x i16> @llvm.ctpop.nxv16i16(<vscale x 16 x i16> undef)
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %21 = call i32 @llvm.ctpop.i32(i32 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %22 = call <2 x i32> @llvm.ctpop.v2i32(<2 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %23 = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %24 = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> undef)
@@ -232,7 +232,7 @@ define void @ctpop() {
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %28 = call <vscale x 4 x i32> @llvm.ctpop.nxv4i32(<vscale x 4 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %29 = call <vscale x 8 x i32> @llvm.ctpop.nxv8i32(<vscale x 8 x i32> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %30 = call <vscale x 16 x i32> @llvm.ctpop.nxv16i32(<vscale x 16 x i32> undef)
-; ZVBB-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
+; ZVBB-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %31 = call i64 @llvm.ctpop.i64(i64 undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %32 = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %33 = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> undef)
 ; ZVBB-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %34 = call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> undef)
diff --git a/llvm/test/CodeGen/ARM/popcnt.ll b/llvm/test/CodeGen/ARM/popcnt.ll
index edcae5e141e73..fc4387320ef77...
[truncated]

nathanchance · 2024-08-05T02:57:50Z

For what it's worth, not everything links against the compiler runtime, such as the Linux kernel. This change breaks building the kernel for RISC-V, as noticed by some CI infrastructure that built with the original change (#99752):

ERROR: modpost: "__popcountdi2" [fs/ext4/ext4.ko] undefined!

https://lore.kernel.org/all/?q=__popcountdi2

topperc · 2024-08-05T03:47:33Z

What about x86 CPUs that don't have POPCNT?

nathanchance · 2024-08-05T05:27:25Z

What about x86 CPUs that don't have POPCNT?

It appears the kernel handles this with an alternative that detects the feature and falls back to an out of line asm routine if it is not supported (perhaps because of something like this?):

https://elixir.bootlin.com/linux/v6.10.3/source/arch/x86/include/asm/arch_hweight.h

https://elixir.bootlin.com/linux/v6.10.3/source/arch/x86/lib/hweight.S

s-barannikov · 2024-08-05T11:06:26Z

For what it's worth, not everything links against the compiler runtime, such as the Linux kernel.

gcc does generate a libcall for __builtin_popcount and somehow manages to build the kernel. Or does the kernel not use builtins, and it is the compiler that emitted a libcall even though it wasn't in the source?
Is it possible that there is a compiler flag that should suppress recognition of libcall patterns?

Note that popcount is not the only libcall that the compiler can generate, but it appears to be the only one that breaks the build.

s-barannikov · 2024-08-05T11:16:27Z

Some interesting discussions for a similar gcc bug(?):
https://bugzilla.kernel.org/show_bug.cgi?id=200671
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105253

s-barannikov · 2024-08-05T11:55:52Z

It looks like LoopIdiomRecognize already bails out if popcount isn't supported by hardware, and the hook is correctly implemented on RISC-V.
However, this hook isn't called from AggressiveInstCombine, and this might be the reason we see the libcall. Another possible reason is that it is created by expandCTTZ / expandCTLZ in the backend.

It would be nice to have a reproducer.

nathanchance · 2024-08-05T14:14:43Z

It would be nice to have a reproducer.

I'll see if I can reduce one down from the kernel sources.

nathanchance · 2024-08-05T23:05:57Z

cvise spits out:

long is_power_of_2_n, inode_readahead_blks_store_t;
enum { attr_reserved_clusters } inode_readahead_blks_store_count;
long inode_readahead_blks_store() {
  _Bool __trans_tmp_1 = is_power_of_2_n & (is_power_of_2_n - 1);
  if (inode_readahead_blks_store_t &&
      (__trans_tmp_1 || inode_readahead_blks_store_t > 0))
    return 2;
  return inode_readahead_blks_store_count;
}

which shows __popcountdi2 after this change but not before this change or with GCC.

$ clang --target=riscv64-linux-gnu -O2 -c sysfs.i

$ llvm-nm sysfs.o &| grep popcount
                 U __popcountdi2

topperc · 2024-08-05T23:10:24Z

That's just checking if something is a power of 2 or 0, which shouldn't require a popcnt. The middle end likes to canonicalize it to popcnt but the backend is supposed to undo it when popcnt isn't supported.
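To make the equivalence concrete, here is a minimal C sketch of the two forms of the check. The function names are hypothetical, and `__builtin_popcountll` is the GCC/Clang builtin, used here only to stand in for the canonicalized `ctpop` form; this is an illustration, not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Canonicalized form: "more than one bit set" written as popcount(x) > 1.
 * __builtin_popcountll is a GCC/Clang builtin (an assumption about the
 * toolchain); without hardware popcount it may become math or a libcall. */
static bool more_than_one_bit_popcount(uint64_t x)
{
    return __builtin_popcountll(x) > 1;
}

/* Equivalent form without popcount: x & (x - 1) clears the lowest set bit,
 * so the result is nonzero exactly when at least two bits were set. */
static bool more_than_one_bit_mask(uint64_t x)
{
    return (x & (x - 1)) != 0;
}
```

Both functions return false for 0 and for any power of 2, and true otherwise, which is why the second form is the one the backend would ideally recover.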

s-barannikov · 2024-08-05T23:57:32Z

That's just checking if something is a power of 2 or 0, which shouldn't require a popcnt. The middle end likes to canonicalize it to popcnt but the backend is supposed to undo it when popcnt isn't supported.

This is how the backend undoes it:

	srli	a2, a1, 1
	lui	a3, 349525
	addiw	a3, a3, 1365
	slli	a4, a3, 32
	add	a3, a3, a4
	and	a2, a2, a3
	sub	a1, a1, a2
	lui	a2, 209715
	addiw	a2, a2, 819
	slli	a3, a2, 32
	add	a2, a2, a3
	and	a3, a1, a2
	srli	a1, a1, 2
	and	a1, a1, a2
	add	a1, a1, a3
	srli	a2, a1, 4
	add	a1, a1, a2
	lui	a2, 61681
	addiw	a2, a2, -241
	slli	a3, a2, 32
	add	a2, a2, a3
	and	a1, a1, a2
	lui	a2, 4112
	addiw	a2, a2, 257
	slli	a3, a2, 32
	add	a2, a2, a3
	mul	a1, a1, a2
	srli	a1, a1, 56
	li	a2, 1
	bltu	a2, a1, .LBB0_4

Without this canonicalization:

	addi	a2, a1, -1
	and	a1, a1, a2
	bnez	a1, .LBB0_4

I would argue that this canonicalization is harmful without hardware popcount.
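For readers who do not want to decode the assembly: the long sequence appears to be the usual bit-parallel popcount, and its constants (0x5555..., 0x3333..., 0x0F0F..., 0x0101...) can be read off the lui/addiw pairs. A rough C equivalent, with a hypothetical function name, would be:

```c
#include <stdint.h>

/* Branch-free 64-bit popcount using the same masks and the same
 * multiply-and-shift finish as the expanded assembly sequence. */
static unsigned popcount64_swar(uint64_t x)
{
    /* Each 2-bit field now holds the count of its two bits. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Each 4-bit field holds the count of its four bits. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Each byte holds the count of its eight bits. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply sums all bytes into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

Comparing `popcount64_swar(x) > 1` against `(x & (x - 1)) != 0` gives the same answer for every input, which is what makes the three-instruction version a valid replacement.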

@nikic

topperc · 2024-08-06T00:05:44Z

The problem seems to be that the popcount and branch end up in different basic blocks so the backend can't undo it. The code to undo it is in simplifySetCCWithCTPOP in TargetLowering.cpp but it doesn't trigger on this example. Similar to #94829

s-barannikov · 2024-08-06T00:39:16Z

The problem seems to be that the popcount and branch end up in different basic blocks so the backend can't undo it. The code to undo it is in simplifySetCCWithCTPOP in TargetLowering.cpp but it doesn't trigger on this example. Similar to #94829

It seems so. This time SelectionDAGBuilder is to blame: it transforms the or of two icmps into two branches.

  %2 = tail call range(i64 0, 65) i64 @llvm.ctpop.i64(i64 %1)
  %tobool = icmp ugt i64 %2, 1
  %cmp = icmp sgt i64 %0, 0
  %or.cond = or i1 %cmp, %tobool
  br i1 %or.cond, label %cleanup, label %if.end

The outcome is the same: the popcount becomes liveout of its basic block.

nathanchance · 2024-08-06T01:56:51Z

Just in case it matters, this broke ARM kernel builds too: https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/10240673671/job/28328987353

ERROR: modpost: "__popcountsi2" [arch/arm/crypto/aes-arm-bs.ko] undefined!

I assume this is the same root cause (since the reproducer exhibits the same behavior with --target=arm-linux-gnueabi), but I can reduce something out if it is not.

s-barannikov · 2024-08-06T16:41:23Z

I assume this is the same root cause (since the reproducer exhibits the same behavior with --target=arm-linux-gnueabi), but I can reduce something out if it is not.

There is no need for this just yet. It is possible that there are other cases when the call is emitted, but the core issue should be the same.
Thank you for your cooperation.

s-barannikov · 2025-04-23T06:20:04Z

Should we expand the ctpop+icmp idiom during CodeGenPrepare?

Done in #102731.

s-barannikov · 2025-04-23T09:41:24Z

@nathanchance I'm landing this now. Please let me know if the issue persists.

nathanchance · 2025-04-24T00:59:23Z

@s-barannikov Thanks for the heads up. Unfortunately, even with 5080a02, I still see __popcount instructions appear with 11a3de7. Testing at ff6a23d:

$ make -skj"$(nproc)" ARCH=arm LLVM=1 mrproper allmodconfig all
ERROR: modpost: "__popcountsi2" [arch/arm/crypto/aes-arm-bs.ko] undefined!
ERROR: modpost: "__popcountsi2" [arch/arm/crypto/aes-arm-ce.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/ext4/ext4.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/ext4/ext4.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/ext2/ext2.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/jbd2/jbd2.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/fat/fat.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/hfsplus/hfsplus.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/hfs/hfs.ko] undefined!
ERROR: modpost: "__popcountsi2" [fs/nfs/nfs.ko] undefined!
WARNING: modpost: suppressed 203 unresolved symbol warnings because there were too many)
...

$ make -skj"$(nproc)" ARCH=riscv LLVM=1 mrproper defconfig all
ld.lld: error: undefined symbol: __popcountdi2
>>> referenced by hweight.c
>>>               lib/hweight.o:(__sw_hweight64) in archive vmlinux.a
>>> referenced by div64.c
>>>               lib/math/div64.o:(mul_u64_u64_div_u64) in archive vmlinux.a
>>> referenced by div64.c
>>>               lib/math/div64.o:(mul_u64_u64_div_u64) in archive vmlinux.a
>>> referenced 1 more times
...

$ make -skj"$(nproc)" ARCH=riscv LLVM=1 mrproper allmodconfig all
ERROR: modpost: "__popcountdi2" [fs/ext4/ext4.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/jbd2/jbd2.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/exfat/exfat.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/nfs/nfs.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/nfs/nfsv4.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/ntfs3/ntfs3.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/ubifs/ubifs.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/xfs/xfs.ko] undefined!
ERROR: modpost: "__popcountdi2" [fs/btrfs/btrfs.ko] undefined!
WARNING: modpost: suppressed 73 unresolved symbol warnings because there were too many)
...

I will try to reduce something out tonight or tomorrow.

s-barannikov · 2025-04-24T04:30:01Z

Thanks,
Looking at lib/math/div64.c, I see calls to __builtin_ctzll and __builtin_clzll. They should be expanded into some math and a __popcountdi2 call. Are these calls supposed to be inlined by the compiler or does linux kernel provide __ctzdi2 / __clzdi2?

s-barannikov · 2025-04-24T04:50:28Z

does linux kernel provide __ctzdi2 / __clzdi2?

Looks like it does: https://github.com/torvalds/linux/blob/master/lib/clz_ctz.c
I guess it would be much easier if __popcountsi2/__popcountdi2 were also defined.
It is easy to make clang generate __clz call instead of expanding to math + __popcount (a little more difficult with __ctz), but I'm not sure that "this is necessary to build the Linux kernel" would be enough justification for the change.

s-barannikov · 2025-04-24T05:32:32Z

@koachan you might be interested in this as well.

The change as is breaks the Linux kernel build as pointed out in the comments.

mstorsjo · 2025-04-24T10:41:20Z

FYI, this showed up as issues while building for armv7 windows in MSVC style environments; in an MSVC style environment, we don't have the regular libgcc/compiler-rt builtins, only the ones shipped with MSVC.

I can work around it with a change like this:

diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index bdebd842b011..d65d48a6611f 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -1221,8 +1221,13 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ROTR, VT, Expand);
   }
   setOperationAction(ISD::CTTZ,  MVT::i32, Custom);
-  setOperationAction(ISD::CTPOP, MVT::i32, LibCall);
-  setOperationAction(ISD::CTPOP, MVT::i64, LibCall);
+  if (Subtarget->isTargetMSVC()) {
+    setOperationAction(ISD::CTPOP, MVT::i32, Expand);
+    setOperationAction(ISD::CTPOP, MVT::i64, Expand);
+  } else {
+    setOperationAction(ISD::CTPOP, MVT::i32, LibCall);
+    setOperationAction(ISD::CTPOP, MVT::i64, LibCall);
+  }
   if (!Subtarget->hasV5TOps() || Subtarget->isThumb1Only()) {
     setOperationAction(ISD::CTLZ, MVT::i32, Expand);
     setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i32, LibCall);
diff --git a/llvm/lib/Target/ARM/ARMSubtarget.h b/llvm/lib/Target/ARM/ARMSubtarget.h
index 7329d3f2055f..9a9a9b45adb2 100644
--- a/llvm/lib/Target/ARM/ARMSubtarget.h
+++ b/llvm/lib/Target/ARM/ARMSubtarget.h
@@ -340,6 +340,7 @@ public:
   bool isTargetWatchABI() const { return TargetTriple.isWatchABI(); }
   bool isTargetDriverKit() const { return TargetTriple.isDriverKit(); }
   bool isTargetLinux() const { return TargetTriple.isOSLinux(); }
+  bool isTargetMSVC() const { return TargetTriple.isWindowsMSVCEnvironment(); }
   bool isTargetNaCl() const { return TargetTriple.isOSNaCl(); }
   bool isTargetNetBSD() const { return TargetTriple.isOSNetBSD(); }
   bool isTargetWindows() const { return TargetTriple.isOSWindows(); }

It's probably fine to just use the existing isTargetWindows() as well; however, for an armv7-windows-gnu target we can use libgcc/compiler-rt style libcalls if we want to, unlike for armv7-windows-msvc.

I see this was already somewhat turned off again, but I'd appreciate if you can fold this in before reenabling!

s-barannikov · 2025-04-24T11:06:52Z

@mstorsjo
Thanks for the information!
isTargetWindows() seems to be preferred for setting up libcall names, so I'll probably go with that interface just to be on the safe side.

koachan · 2025-04-24T16:30:29Z

@koachan you might be interested in this as well.

Hmmm so what is the takeaway here? I'm having difficulties following the discussion.
Lowering CTTZ/CTLZ to libcalls is fine but lowering CTPOP to libcall makes Linux kernel fail to compile?

s-barannikov · 2025-04-24T16:54:25Z

@koachan you might be interested in this as well.

Lowering CTTZ/CTLZ to libcalls is fine but lowering CTPOP to libcall makes Linux kernel fail to compile?

This is my understanding, yes.
However, we don't currently lower cttz/ctlz into libcalls and instead expand them, and this may also introduce a call to __popcountXi2. For instance, the generic expansion for cttz(x) is popcount(~x & (x - 1)).
Before this patch it was not possible to generate a libcall for popcount using the generic mechanism. Now it is possible, but apparently breaks the Linux kernel build because the call is generated when it is not expected (as a result of lowering __builtin_clz/__builtin_ctz).
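As an illustration of why a popcount call can appear where the source only used ctz/clz builtins, here is a small C sketch of the expansions. The cttz identity is the one quoted above; the ctlz variant (smear the top bit downwards, then count the remaining zeros) is my understanding of the analogous generic expansion and should be read as an assumption. All function names are hypothetical.

```c
#include <stdint.h>

/* Simple reference popcount; stands in for whatever ISD::CTPOP becomes
 * (inline math before the patch, possibly a __popcountXi2 call after it). */
static unsigned popcount64_loop(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* cttz(x) = popcount(~x & (x - 1)): the mask keeps exactly the zero bits
 * below the lowest set bit. Yields 64 for x == 0. */
static unsigned cttz64_via_popcount(uint64_t x)
{
    return popcount64_loop(~x & (x - 1));
}

/* ctlz(x): propagate the highest set bit into every lower position, then
 * count the zero bits that are left. Yields 64 for x == 0. */
static unsigned ctlz64_via_popcount(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return popcount64_loop(~x);
}
```

With this shape, any target that marks CTPOP as LibCall but still expands CTTZ/CTLZ would end up calling the popcount routine from code that never mentioned popcount.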

nathanchance · 2025-04-24T17:21:45Z

I guess it would be much easier if __popcountsi2/__popcountdi2 were also defined.

I could try to push for something like this in the kernel:

diff --git a/lib/Makefile b/lib/Makefile
index f07b24ce1b3f..0240fa7d6b5b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -52,7 +52,7 @@ obj-y += lockref.o
 
 obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
         bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
-        list_sort.o uuid.o iov_iter.o clz_ctz.o \
+        list_sort.o uuid.o iov_iter.o clz_ctz.o popcount.o \
         bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
         percpu-refcount.o rhashtable.o base64.o \
         once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
diff --git a/lib/popcount.c b/lib/popcount.c
new file mode 100644
index 000000000000..0234961cc35e
--- /dev/null
+++ b/lib/popcount.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * The functions in this file aren't called directly, but may be emitted
+ * by the compiler.
+ */
+
+#include <linux/bitops.h>
+#include <linux/export.h>
+
+int __popcountsi2(unsigned int val);
+int __popcountsi2(unsigned int val)
+{
+       return __arch_hweight32(val);
+}
+EXPORT_SYMBOL(__popcountsi2);
+
+int __popcountdi2(unsigned long long val);
+int __popcountdi2(unsigned long long val)
+{
+       return __arch_hweight64(val);
+}
+EXPORT_SYMBOL(__popcountdi2);

which would at least use existing architecture-optimized popcount implementations if they exist. That would hopefully overcome Linus’s recent complaint about compiler optimizations inserting libcalls (which resulted in kbuild: Add '-fno-builtin-wcslen'), but even on the GCC Bugzilla links above, there are other developers who do not like this either. If that is not palatable in the kernel, is the only way to avoid this optimization -fno-builtin-popcount?

It is easy to make clang generate __clz call instead of expanding to math + __popcount (a little more difficult with __ctz), but I'm not sure that "this is necessary to build the Linux kernel" would be enough justification for the change.

I wonder if that matches what GCC does, which could be additional justification? I do understand the kernel is being special here so I can pursue the addition of __popcount{d,s}i2() in the kernel first and come back if that is not successful.

s-barannikov · 2025-04-24T18:26:59Z

I could try to push for something like this in the kernel:

I don't mind sharing the headache 😅. And, who knows, maybe this patch will be welcomed?

If that is not palatable in the kernel, is the only way to avoid this optimization -fno-builtin-popcount?

I think this option only applies to libc functions, not to libgcc ones.

I wonder if that matches what GCC does, which could be additional justification? I do understand the kernel is being special here so I can pursue the addition of __popcount{d,s}i2() in the kernel first and come back if that is not successful.

GCC generates calls to __clzXi2/__ctzXi2 for the corresponding builtins. This makes sense to me, but might not be a weighty argument for others. We'll see, I guess; there was a precedent once.
A much stronger argument would be "it is faster (and produces less code)". While reduced code size is trivial to prove, performance measurements require having the hardware at hand.

I do understand the kernel is being special here so I can pursue the addition of __popcount{d,s}i2() in the kernel first and come back if that is not successful.

Will be much appreciated, thank you! I'll continue trying to fix the issue on the compiler side.

nathanchance · 2025-04-25T14:00:40Z

I could try to push for something like this in the kernel:

I don't mind sharing the headache 😅. And, who knows, maybe this patch will be welcomed?

I do understand the kernel is being special here so I can pursue the addition of __popcount{d,s}i2() in the kernel first and come back if that is not successful.

Will be much appreciated, thank you! I'll continue trying to fix the issue on the compiler side.

Heh, I sent a message to Linus yesterday and the reaction was generally what I expected.

His suggestion was effectively

diff --git a/lib/hweight.c b/lib/hweight.c
index c94586b62551..00b2ff970203 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -66,3 +66,8 @@ unsigned long __sw_hweight64(__u64 w)
 #endif
 }
 EXPORT_SYMBOL(__sw_hweight64);
+
+int __popcountsi2(unsigned int w) __alias(__sw_hweight32);
+EXPORT_SYMBOL(__popcountsi2);
+int __popcountdi2(__u64 w) __alias(__sw_hweight64);
+EXPORT_SYMBOL(__popcountdi2);

which does also resolve the error, but I realized that neither of the suggestions I have proposed will work in its current form: at least for the RISC-V configuration that I checked, __sw_hweight64 gets transformed into a call to __popcountdi2, so trying to reuse either __arch_hweight64 or __sw_hweight64 to define __popcountdi2 will just result in an infinite loop…

00000000000000be <__sw_hweight64>:
      be: 1141          addi    sp, sp, -0x10
      c0: e406          sd  ra, 0x8(sp)
      c2: e022          sd  s0, 0x0(sp)
      c4: 0800          addi    s0, sp, 0x10
      c6: 00000097      auipc   ra, 0x0
        00000000000000c6:  R_RISCV_CALL_PLT __popcountdi2
        00000000000000c6:  R_RISCV_RELAX    *ABS*
      ca: 000080e7      jalr    ra <__sw_hweight64+0x8>
      ce: 60a2          ld  ra, 0x8(sp)
      d0: 6402          ld  s0, 0x0(sp)
      d2: 0141          addi    sp, sp, 0x10
      d4: 8082          ret
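
To illustrate the recursion described above, here is a minimal sketch of a software popcount written in the usual parallel bit-counting style. The name `sketch_hweight64` is hypothetical and this is not the kernel's actual source. The assumption is that a compiler which recognizes this pattern as a population count may, on a target without a native popcount instruction, replace the whole body with a call to `__popcountdi2`. If `__popcountdi2` is itself an alias of such a function, the call would end up targeting the function it came from.

```c
#include <assert.h>  /* only needed for the sanity checks mentioned below */
#include <stdint.h>

/* Hypothetical stand-in for a software popcount such as __sw_hweight64. */
unsigned long sketch_hweight64(uint64_t w)
{
    /* Count bits within each 2-bit field. */
    w -= (w >> 1) & 0x5555555555555555ULL;
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit counts into bytes. */
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* The multiply accumulates all byte counts into the top byte. */
    return (unsigned long)((w * 0x0101010101010101ULL) >> 56);
}
```

A quick check such as `assert(sketch_hweight64(0xF0) == 4);` confirms the arithmetic. Whether a given compiler and target actually turn the body into a libcall depends on the optimization level and the target features.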

s-barannikov · 2025-04-25T14:50:12Z

If I adjust __popcountsi2 implementation a bit, I get the same issue :) It is recognized by tryToRecognizePopcount.

For some functions this issue is worked around by comparing the optimized function's name with the name of the new callee and bailing out if they are the same. I can't find right away where I saw it; it wasn't done generically, only for some (or maybe just a couple) functions. We could do the same for __popcountXi2, but that won't help because the caller's name is different (__sw_hweightXX).
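
The name-comparison workaround can be sketched roughly as follows. This is a toy model in C, not LLVM code, and `may_emit_libcall` is a hypothetical helper introduced only for illustration. It shows why a guard keyed on the function's own name would catch `__popcountdi2` calling itself but not `__sw_hweight64` calling `__popcountdi2`.

```c
#include <assert.h>  /* only needed for the sanity checks mentioned below */
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical guard: refuse to emit a libcall when the function being
 * compiled has the same name as the libcall it would call.
 */
bool may_emit_libcall(const char *caller, const char *callee)
{
    return strcmp(caller, callee) != 0;
}
```

With this guard, `may_emit_libcall("__popcountdi2", "__popcountdi2")` is false, so direct self-recursion is avoided. `may_emit_libcall("__sw_hweight64", "__popcountdi2")` is true, so the aliased case from the kernel discussion would still go through.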

Even if the compiler stops generating calls to popcount as part of ctz/clz expansion, it will still recognize __sw_hweightXX as popcount and generate the libcall. If we want to prevent the compiler from doing that, we need a handle similar to -fno-builtin or a function attribute or ...

s-barannikov added 3 commits August 3, 2024 06:51

[SDag][ARM][RISCV] Allow lowering CTPOP into a libcall (llvm#99752)

e6b2fa6

The main change is adding CTPOP to `RuntimeLibcalls.def` to allow targets to use LibCall action for CTPOP. DAG legalizers are changed accordingly.

Tests

ae98e96

Fixes

14340fe

s-barannikov requested review from arsenm, DavidSpickett, RKSimon, dtcxzyw and topperc August 3, 2024 03:57

llvmbot added backend:ARM backend:RISC-V llvm:SelectionDAG llvm:ir llvm:analysis labels Aug 3, 2024

arsenm approved these changes Aug 4, 2024

s-barannikov added 2 commits April 21, 2025 09:28

Fix cost model check

7abb0d1

Merge remote-tracking branch 'upstream/main' into popcount-libcall-1

6870a9a

s-barannikov marked this pull request as ready for review April 23, 2025 06:20

s-barannikov merged commit 11a3de7 into llvm:main Apr 23, 2025
9 checks passed

s-barannikov deleted the popcount-libcall-1 branch April 23, 2025 09:43

s-barannikov added a commit to s-barannikov/llvm-project that referenced this pull request Apr 24, 2025

[ARM][RISCV] Partially revert llvm#101786

77f6508

The change as is breaks the Linux kernel build as pointed out in the comments.

s-barannikov added a commit that referenced this pull request Apr 24, 2025

[ARM][RISCV] Partially revert #101786 (#137120)

7af555e

The change as is breaks the Linux kernel build as pointed out in the comments.

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025

[ARM][RISCV] Partially revert llvm#101786 (llvm#137120)

0ea2aa1

The change as is breaks the Linux kernel build as pointed out in the comments.

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025

[ARM][RISCV] Partially revert llvm#101786 (llvm#137120)

64eb62d

The change as is breaks the Linux kernel build as pointed out in the comments.

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025

[ARM][RISCV] Partially revert llvm#101786 (llvm#137120)

11f4274

The change as is breaks the Linux kernel build as pointed out in the comments.

Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025

[ARM][RISCV] Partially revert llvm#101786 (llvm#137120)

31028d4

The change as is breaks the Linux kernel build as pointed out in the comments.
