[SLP]Add subvector vectorization for non-load nodes #108430


Conversation

alexey-bataev
Member

Previously, the SLP vectorizer supported clustered vectorization for
loads only. This patch adds support for "clustered" vectorization of
other instructions.
If a buildvector node contains "clusters" that can be vectorized
separately and then inserted into the resulting buildvector, it is
better to do so, since it may reduce the cost of the vector graph and
produce better vector code.
The patch performs some analysis to decide whether this kind of extra
vectorization is profitable: it checks the scalar instructions and
their operands and vectorizes them only if they result in a better
graph.
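
As a rough sketch of the idea (hand-written IR with made-up value names, not taken from the patch or its tests), a gather/buildvector node whose trailing lanes form such a cluster can be emitted as a narrow vector op plus an llvm.vector.insert into the wider buildvector:

  ; before: every lane is inserted as a scalar; lanes 2-3 come from two
  ; similar adds (%add0 = add i32 %x0, %y0 and %add1 = add i32 %x1, %y1)
  %v0 = insertelement <4 x i32> poison, i32 %a, i32 0
  %v1 = insertelement <4 x i32> %v0, i32 %b, i32 1
  %v2 = insertelement <4 x i32> %v1, i32 %add0, i32 2
  %v3 = insertelement <4 x i32> %v2, i32 %add1, i32 3

  ; after: the add cluster is vectorized separately as <2 x i32> and then
  ; inserted into the buildvector result as a subvector
  %vadd = add <2 x i32> %x, %y   ; %x = <%x0, %x1>, %y = <%y0, %y1>
  %w0 = insertelement <4 x i32> poison, i32 %a, i32 0
  %w1 = insertelement <4 x i32> %w0, i32 %b, i32 1
  %res = call <4 x i32> @llvm.vector.insert.v4i32.v2i32(<4 x i32> %w1, <2 x i32> %vadd, i64 2)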

Created using spr 1.3.5
@llvmbot
Member

llvmbot commented Sep 12, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes

Patch is 34.98 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108430.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+140-24)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+12-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+7-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/landing_pad.ll (+9-10)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/phi.ll (+17-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+6-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/resched.ll (+19-24)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index c89a50fc7bd429..d2a3ffe3343a7e 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1350,6 +1350,7 @@ class BoUpSLP {
     }
     MinBWs.clear();
     ReductionBitWidth = 0;
+    BaseGraphSize = 1;
     CastMaxMinBWSizes.reset();
     ExtraBitWidthNodes.clear();
     InstrElementSize.clear();
@@ -1360,6 +1361,9 @@ class BoUpSLP {
 
   unsigned getTreeSize() const { return VectorizableTree.size(); }
 
+  /// Returns the base graph size, before any transformations.
+  unsigned getCanonicalGraphSize() const { return BaseGraphSize; }
+
   /// Perform LICM and CSE on the newly generated gather sequences.
   void optimizeGatherSequence();
 
@@ -2886,7 +2890,8 @@ class BoUpSLP {
   /// Create a new vector from a list of scalar values.  Produces a sequence
   /// which exploits values reused across lanes, and arranges the inserts
   /// for ease of later optimization.
-  Value *createBuildVector(const TreeEntry *E, Type *ScalarTy);
+  Value *createBuildVector(const TreeEntry *E, Type *ScalarTy,
+                           bool PostponedPHIs);
 
   /// Returns the instruction in the bundle, which can be used as a base point
   /// for scheduling. Usually it is the last instruction in the bundle, except
@@ -4144,6 +4149,9 @@ class BoUpSLP {
   /// reduction.
   unsigned ReductionBitWidth = 0;
 
+  /// Canonical graph size before the transformations.
+  unsigned BaseGraphSize = 1;
+
   /// If the tree contains any zext/sext/trunc nodes, contains max-min pair of
   /// type sizes, used in the tree.
   std::optional<std::pair<unsigned, unsigned>> CastMaxMinBWSizes;
@@ -8446,47 +8454,147 @@ getGEPCosts(const TargetTransformInfo &TTI, ArrayRef<Value *> Ptrs,
 
 void BoUpSLP::transformNodes() {
   constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  BaseGraphSize = VectorizableTree.size();
+  // Operands are profitable if all of them are:
+  // 1. Constants
+  // or
+  // 2. Splats
+  // or
+  // 3. Results in good vectorization opportunity, i.e. may generate vector
+  // nodes and reduce cost of the graph.
+  auto CheckOperandsProfitability = [this](Instruction *I1, Instruction *I2,
+                                           const InstructionsState &S) {
+    SmallVector<SmallVector<std::pair<Value *, Value *>>> Candidates;
+    for (unsigned Op : seq<unsigned>(S.MainOp->getNumOperands()))
+      Candidates.emplace_back().emplace_back(I1->getOperand(Op),
+                                             I2->getOperand(Op));
+    return all_of(Candidates, [this](
+                                  ArrayRef<std::pair<Value *, Value *>> Cand) {
+      return all_of(Cand,
+                    [](const std::pair<Value *, Value *> &P) {
+                      return isa<Constant>(P.first) ||
+                             isa<Constant>(P.second) || (P.first == P.second);
+                    }) ||
+             findBestRootPair(Cand, LookAheadHeuristics::ScoreSplatLoads);
+    });
+  };
   // The tree may grow here, so iterate over nodes, built before.
-  for (unsigned Idx : seq<unsigned>(VectorizableTree.size())) {
+  for (unsigned Idx : seq<unsigned>(BaseGraphSize)) {
     TreeEntry &E = *VectorizableTree[Idx];
     if (E.isGather()) {
       ArrayRef<Value *> VL = E.Scalars;
       const unsigned Sz = getVectorElementSize(VL.front());
       unsigned MinVF = getMinVF(2 * Sz);
+      // Do not try partial vectorization for small nodes (<= 2), nodes with the
+      // same opcode and same parent block or all constants.
       if (VL.size() <= 2 ||
-          (E.getOpcode() &&
-           (E.isAltShuffle() || E.getOpcode() != Instruction::Load)))
+          !(!E.getOpcode() || E.getOpcode() == Instruction::Load ||
+            E.isAltShuffle() || !allSameBlock(VL)) ||
+          allConstant(VL) || isSplat(VL))
         continue;
       // Try to find vectorizable sequences and transform them into a series of
       // insertvector instructions.
       unsigned StartIdx = 0;
       unsigned End = VL.size();
-      for (unsigned VF = VL.size() / 2; VF >= MinVF; VF /= 2) {
+      for (unsigned VF = VL.size() / 2; VF >= MinVF; VF = bit_ceil(VF) / 2) {
+        SmallVector<unsigned> Slices;
         for (unsigned Cnt = StartIdx; Cnt + VF <= End; Cnt += VF) {
           ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
           // If any instruction is vectorized already - do not try again.
-          if (getTreeEntry(Slice.front()) || getTreeEntry(Slice.back()))
+          // Reuse the existing node, if it fully matches the slice.
+          if (const TreeEntry *SE = getTreeEntry(Slice.front());
+              SE || getTreeEntry(Slice.back())) {
+            if (!SE)
+              continue;
+            if (VF != SE->getVectorFactor() || !SE->isSame(Slice))
+              continue;
+          }
+          // Constant already handled effectively - skip.
+          if (allConstant(Slice))
             continue;
-          InstructionsState S = getSameOpcode(Slice, *TLI);
-          if (!S.getOpcode() || S.isAltShuffle() ||
-              (S.getOpcode() != Instruction::Load &&
-               any_of(Slice, [&](Value *V) {
-                 return !areAllUsersVectorized(cast<Instruction>(V),
-                                               UserIgnoreList);
-               })))
+          // Do not try to vectorize small splats (less than vector register and
+          // only with the single non-undef element).
+          bool IsSplat = isSplat(Slice);
+          if (Slices.empty() || !IsSplat ||
+              (VF <= 2 && 2 * std::clamp(TTI->getNumberOfParts(getWidenedType(
+                                             Slice.front()->getType(), VF)),
+                                         1U, VF - 1) !=
+                              std::clamp(TTI->getNumberOfParts(getWidenedType(
+                                             Slice.front()->getType(), 2 * VF)),
+                                         1U, 2 * VF)) ||
+              count(Slice, Slice.front()) ==
+                  (isa<UndefValue>(Slice.front()) ? VF - 1 : 1)) {
+            if (IsSplat)
+              continue;
+            InstructionsState S = getSameOpcode(Slice, *TLI);
+            if (!S.getOpcode() || S.isAltShuffle() || !allSameBlock(Slice))
+              continue;
+            if (VF == 2) {
+              // Try to vectorize reduced values or if all users are vectorized.
+              // For expensive instructions extra extracts might be profitable.
+              if ((!UserIgnoreList || E.Idx != 0) &&
+                  TTI->getInstructionCost(cast<Instruction>(Slice.front()),
+                                          CostKind) < TTI::TCC_Expensive &&
+                  !all_of(Slice, [&](Value *V) {
+                    return areAllUsersVectorized(cast<Instruction>(V),
+                                                 UserIgnoreList);
+                  }))
+                continue;
+              if (S.getOpcode() == Instruction::Load) {
+                OrdersType Order;
+                SmallVector<Value *> PointerOps;
+                LoadsState Res =
+                    canVectorizeLoads(Slice, Slice.front(), Order, PointerOps);
+                // Do not vectorize gathers.
+                if (Res == LoadsState::ScatterVectorize ||
+                    Res == LoadsState::Gather)
+                  continue;
+              } else if (S.getOpcode() == Instruction::ExtractElement ||
+                         (TTI->getInstructionCost(
+                              cast<Instruction>(Slice.front()), CostKind) <
+                              TTI::TCC_Expensive &&
+                          !CheckOperandsProfitability(
+                              cast<Instruction>(Slice.front()),
+                              cast<Instruction>(Slice.back()), S))) {
+                // Do not vectorize extractelements (handled effectively
+                // alread). Do not vectorize non-profitable instructions (with
+                // low cost and non-vectorizable operands.)
+                continue;
+              }
+            }
+          }
+          Slices.emplace_back(Cnt);
+        }
+        auto AddCombinedNode = [&](unsigned Idx, unsigned Cnt) {
+          E.CombinedEntriesWithIndices.emplace_back(Idx, Cnt);
+          if (StartIdx == Cnt)
+            StartIdx = Cnt + VF;
+          if (End == Cnt + VF)
+            End = Cnt;
+        };
+        for (unsigned Cnt : Slices) {
+          ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
+          // If any instruction is vectorized already - do not try again.
+          if (const TreeEntry *SE = getTreeEntry(Slice.front());
+              SE || getTreeEntry(Slice.back())) {
+            if (!SE)
+              continue;
+            if (VF != SE->getVectorFactor() || !SE->isSame(Slice))
+              continue;
+            AddCombinedNode(SE->Idx, Cnt);
             continue;
+          }
           unsigned PrevSize = VectorizableTree.size();
           buildTree_rec(Slice, 0, EdgeInfo(&E, UINT_MAX));
           if (PrevSize + 1 == VectorizableTree.size() &&
-              VectorizableTree[PrevSize]->isGather()) {
+              VectorizableTree[PrevSize]->isGather() &&
+              VectorizableTree[PrevSize]->getOpcode() !=
+                  Instruction::ExtractElement &&
+              !isSplat(Slice)) {
             VectorizableTree.pop_back();
             continue;
           }
-          E.CombinedEntriesWithIndices.emplace_back(PrevSize, Cnt);
-          if (StartIdx == Cnt)
-            StartIdx = Cnt + VF;
-          if (End == Cnt + VF)
-            End = Cnt;
+          AddCombinedNode(PrevSize, Cnt);
         }
       }
     }
@@ -11690,6 +11798,13 @@ BoUpSLP::isGatherShuffledEntry(
          "Expected only single user of the gather node.");
   assert(VL.size() % NumParts == 0 &&
          "Number of scalars must be divisible by NumParts.");
+  if (TE->UserTreeIndices.front().UserTE->isGather() &&
+      TE->UserTreeIndices.front().EdgeIdx == UINT_MAX) {
+    assert((TE->getOpcode() == Instruction::ExtractElement ||
+            isSplat(TE->Scalars)) &&
+           "Expected splat or extractelements only node.");
+    return {};
+  }
   unsigned SliceSize = getPartNumElems(VL.size(), NumParts);
   SmallVector<std::optional<TTI::ShuffleKind>> Res;
   for (unsigned Part : seq<unsigned>(NumParts)) {
@@ -13197,9 +13312,10 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
   return Res;
 }
 
-Value *BoUpSLP::createBuildVector(const TreeEntry *E, Type *ScalarTy) {
+Value *BoUpSLP::createBuildVector(const TreeEntry *E, Type *ScalarTy,
+                                  bool PostponedPHIs) {
   for (auto [EIdx, _] : E->CombinedEntriesWithIndices)
-    (void)vectorizeTree(VectorizableTree[EIdx].get(), /*PostponedPHIs=*/false);
+    (void)vectorizeTree(VectorizableTree[EIdx].get(), PostponedPHIs);
   return processBuildVector<ShuffleInstructionBuilder, Value *>(E, ScalarTy,
                                                                 Builder, *this);
 }
@@ -13230,7 +13346,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry *E, bool PostponedPHIs) {
     // Set insert point for non-reduction initial nodes.
     if (E->getMainOp() && E->Idx == 0 && !UserIgnoreList)
       setInsertPointAfterBundle(E);
-    Value *Vec = createBuildVector(E, ScalarTy);
+    Value *Vec = createBuildVector(E, ScalarTy, PostponedPHIs);
     E->VectorizedValue = Vec;
     return Vec;
   }
@@ -16452,7 +16568,7 @@ SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
     if (R.isGathered(Chain.front()) ||
         R.isNotScheduled(cast<StoreInst>(Chain.front())->getValueOperand()))
       return std::nullopt;
-    Size = R.getTreeSize();
+    Size = R.getCanonicalGraphSize();
     return false;
   }
   R.reorderTopToBottom();
@@ -16462,7 +16578,7 @@ SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
 
   R.computeMinimumValueSizes();
 
-  Size = R.getTreeSize();
+  Size = R.getCanonicalGraphSize();
   if (S.getOpcode() == Instruction::Load)
     Size = 2; // cut off masked gather small trees
   InstructionCost Cost = R.getTreeCost();
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
index 5b878108af59af..5f0b16048d40c8 100644
--- a/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
@@ -685,10 +685,10 @@ define void @store_blockstrided3(ptr nocapture noundef readonly %x, ptr nocaptur
 ; CHECK-NEXT:    [[MUL:%.*]] = shl nsw i32 [[STRIDE]], 1
 ; CHECK-NEXT:    [[IDXPROM11:%.*]] = sext i32 [[MUL]] to i64
 ; CHECK-NEXT:    [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, ptr [[X]], i64 [[IDXPROM11]]
-; CHECK-NEXT:    [[ADD18:%.*]] = add nsw i32 [[MUL]], 2
-; CHECK-NEXT:    [[IDXPROM19:%.*]] = sext i32 [[ADD18]] to i64
-; CHECK-NEXT:    [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, ptr [[X]], i64 [[IDXPROM19]]
-; CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[ARRAYIDX20]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[ARRAYIDX12]], align 4
+; CHECK-NEXT:    [[ADD14:%.*]] = or disjoint i32 [[MUL]], 1
+; CHECK-NEXT:    [[IDXPROM15:%.*]] = sext i32 [[ADD14]] to i64
+; CHECK-NEXT:    [[ARRAYIDX16:%.*]] = getelementptr inbounds i32, ptr [[X]], i64 [[IDXPROM15]]
 ; CHECK-NEXT:    [[MUL21:%.*]] = mul nsw i32 [[STRIDE]], 3
 ; CHECK-NEXT:    [[IDXPROM23:%.*]] = sext i32 [[MUL21]] to i64
 ; CHECK-NEXT:    [[ARRAYIDX24:%.*]] = getelementptr inbounds i32, ptr [[X]], i64 [[IDXPROM23]]
@@ -700,8 +700,8 @@ define void @store_blockstrided3(ptr nocapture noundef readonly %x, ptr nocaptur
 ; CHECK-NEXT:    [[TMP3:%.*]] = load i32, ptr [[ARRAYIDX35]], align 4
 ; CHECK-NEXT:    [[ARRAYIDX41:%.*]] = getelementptr inbounds i32, ptr [[Y]], i64 [[IDXPROM5]]
 ; CHECK-NEXT:    [[ARRAYIDX48:%.*]] = getelementptr inbounds i32, ptr [[Y]], i64 [[IDXPROM11]]
-; CHECK-NEXT:    [[ARRAYIDX56:%.*]] = getelementptr inbounds i32, ptr [[Y]], i64 [[IDXPROM19]]
-; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX56]], align 4
+; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX48]], align 4
+; CHECK-NEXT:    [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, ptr [[Y]], i64 [[IDXPROM15]]
 ; CHECK-NEXT:    [[ARRAYIDX60:%.*]] = getelementptr inbounds i32, ptr [[Y]], i64 [[IDXPROM23]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = load i32, ptr [[ARRAYIDX60]], align 4
 ; CHECK-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, ptr [[Y]], i64 [[IDXPROM27]]
@@ -715,12 +715,12 @@ define void @store_blockstrided3(ptr nocapture noundef readonly %x, ptr nocaptur
 ; CHECK-NEXT:    [[TMP10:%.*]] = mul nsw <2 x i32> [[TMP8]], [[TMP6]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = mul nsw <2 x i32> [[TMP9]], [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP10]], <2 x i32> [[TMP11]], <4 x i32> <i32 1, i32 0, i32 3, i32 2>
-; CHECK-NEXT:    [[ARRAYIDX84:%.*]] = getelementptr inbounds i8, ptr [[Z]], i64 28
-; CHECK-NEXT:    [[TMP13:%.*]] = load <2 x i32>, ptr [[ARRAYIDX12]], align 4
-; CHECK-NEXT:    [[TMP14:%.*]] = load <2 x i32>, ptr [[ARRAYIDX48]], align 4
+; CHECK-NEXT:    [[MUL81:%.*]] = mul nsw i32 [[TMP4]], [[TMP1]]
+; CHECK-NEXT:    [[ARRAYIDX82:%.*]] = getelementptr inbounds i8, ptr [[Z]], i64 32
+; CHECK-NEXT:    [[TMP13:%.*]] = load <2 x i32>, ptr [[ARRAYIDX16]], align 4
+; CHECK-NEXT:    [[TMP14:%.*]] = load <2 x i32>, ptr [[ARRAYIDX52]], align 4
 ; CHECK-NEXT:    [[TMP15:%.*]] = mul nsw <2 x i32> [[TMP14]], [[TMP13]]
 ; CHECK-NEXT:    [[TMP16:%.*]] = shufflevector <2 x i32> [[TMP15]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
-; CHECK-NEXT:    [[MUL85:%.*]] = mul nsw i32 [[TMP4]], [[TMP1]]
 ; CHECK-NEXT:    [[MUL87:%.*]] = mul nsw i32 [[TMP5]], [[TMP2]]
 ; CHECK-NEXT:    [[ARRAYIDX88:%.*]] = getelementptr inbounds i8, ptr [[Z]], i64 44
 ; CHECK-NEXT:    [[ARRAYIDX92:%.*]] = getelementptr inbounds i8, ptr [[Z]], i64 36
@@ -728,8 +728,8 @@ define void @store_blockstrided3(ptr nocapture noundef readonly %x, ptr nocaptur
 ; CHECK-NEXT:    [[TMP18:%.*]] = load <2 x i32>, ptr [[ARRAYIDX64]], align 4
 ; CHECK-NEXT:    store i32 [[MUL73]], ptr [[Z]], align 4
 ; CHECK-NEXT:    store <4 x i32> [[TMP12]], ptr [[ARRAYIDX72]], align 4
-; CHECK-NEXT:    store <2 x i32> [[TMP16]], ptr [[ARRAYIDX84]], align 4
-; CHECK-NEXT:    store i32 [[MUL85]], ptr [[ARRAYIDX76]], align 4
+; CHECK-NEXT:    store i32 [[MUL81]], ptr [[ARRAYIDX82]], align 4
+; CHECK-NEXT:    store <2 x i32> [[TMP16]], ptr [[ARRAYIDX76]], align 4
 ; CHECK-NEXT:    store i32 [[MUL87]], ptr [[ARRAYIDX88]], align 4
 ; CHECK-NEXT:    [[TMP19:%.*]] = mul nsw <2 x i32> [[TMP18]], [[TMP17]]
 ; CHECK-NEXT:    [[TMP20:%.*]] = shufflevector <2 x i32> [[TMP19]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll
index f04c359b432b5e..71d6308b9dbbeb 100644
--- a/llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll
@@ -255,24 +255,19 @@ define void @select_uniform_ugt_16xi8(ptr %ptr, i8 %x) {
 ; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <8 x i8> [[TMP0]], i32 0
 ; CHECK-NEXT:    [[S_8:%.*]] = select i1 [[CMP_8]], i8 [[TMP1]], i8 [[X:%.*]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i8>, ptr [[GEP_12]], align 1
-; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <8 x i8> [[TMP0]], <8 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <8 x i8> [[TMP0]], <8 x i8> poison, <16 x i32> <i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <16 x i8> [[TMP3]], i8 [[L_9]], i32 9
 ; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <16 x i8> [[TMP4]], i8 [[L_10]], i32 10
 ; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <16 x i8> [[TMP5]], i8 [[L_11]], i32 11
 ; CHECK-NEXT:    [[TMP7:%.*]] = call <16 x i8> @llvm.vector.insert.v16i8.v8i8(<16 x i8> [[TMP6]], <8 x i8> [[TMP0]], i64 0)
 ; CHECK-NEXT:    [[TMP8:%.*]] = call <16 x i8> @llvm.vector.insert.v16i8.v4i8(<16 x i8> [[TMP7]], <4 x i8> [[TMP2]], i64 12)
 ; CHECK-NEXT:    [[TMP9:%.*]] = icmp ugt <16 x i8> [[TMP8]], <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
-; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <4 x i8> [[TMP2]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <4 x i8> [[TMP2]], <4 x i8> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
-; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <8 x i8> [[TMP0]], <8 x i8> [[TMP11]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison, i32 8, i32 9, i32 10, i32 11>
-; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <16 x i8> [[TMP12]], i8 [[L_9]], i32 9
-; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <16 x i8> [[TMP13]], i8 [[L_10]], i32 10
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <16 x i8> [[TMP14]], i8 [[L_11]], i32 11
-; CHECK-NEXT:    [[TMP16:%.*]] = shufflevector <16 x i8> [[TMP15]], <16 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
-; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <16 x i8> poison, i8 [[X]], i32 0
-; CHECK-NEXT:    [[TMP18:%.*]] = shufflevector <16 x i8> [[TMP17]], <16 x i8> poison, <16 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP19:%.*]] = select <16 x i1> [[TMP9]], <16 x i8> [[TMP16]], <16 x i8> [[TMP18]]
-; CHECK-NEXT:    store <16 x i8> [[TMP19]], ptr [[PTR]], align 2
+; CHECK-NEXT:    [[TMP10:%.*]] = call <16 x i8> @llvm.vector.insert.v16i8.v8i8(<16 x i8> [[TMP6]], <8 x i8> [[TMP0]], i64 0)
...
[truncated]

Created using spr 1.3.5
      return all_of(Cand,
                    [](const std::pair<Value *, Value *> &P) {
                      return isa<Constant>(P.first) ||
                             isa<Constant>(P.second) || (P.first == P.second);
Collaborator

Should this be (isa<Constant>(P.first) && isa<Constant>(P.second)) || (P.first == P.second) ?

Member Author

No, the logic is correct here

Member Author

I will update the comment and remove extra parens

Collaborator

@RKSimon RKSimon left a comment

LGTM

Created using spr 1.3.5
@alexey-bataev alexey-bataev merged commit 3469db8 into main Sep 25, 2024
5 of 7 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpadd-subvector-vectorization-for-non-load-nodes branch September 25, 2024 14:23
@aeubanks
Contributor

aeubanks commented Oct 1, 2024

I believe we're seeing a miscompile with this change; still reducing.

@aeubanks
Contributor

aeubanks commented Oct 1, 2024

here's a diff of the relevant function after slp-vectorizer with and without this patch reverted, and with inst names all set to %0 for minimal diffing purposes. I'll keep digging, but perhaps you can see something from this diff?

177a178,181
>   %0 = insertelement <2 x i32> poison, i32 %lshr166, i32 0
>   %0 = insertelement <2 x i32> %0, i32 %trunc176, i32 1
>   %0 = and <2 x i32> %0, <i32 255, i32 255>
>   %0 = icmp eq <2 x i32> %0, zeroinitializer
188,194c192
<   %0 = insertelement <2 x i32> poison, i32 %lshr166, i32 0
<   %0 = insertelement <2 x i32> %0, i32 %trunc176, i32 1
<   %0 = and <2 x i32> %0, <i32 255, i32 255>
<   %0 = icmp eq <2 x i32> %0, zeroinitializer
<   %0 = zext <2 x i8> %0 to <2 x i32>
<   %0 = shufflevector <2 x i32> %0, <2 x i32> %0, <8 x i32> <i32 poison, i32 poison, i32 0, i32 1, i32 poison, i32 poison, i32 2, i32 3>
<   %0 = insertelement <8 x i32> %0, i32 %trunc111, i32 0
---
>   %0 = insertelement <8 x i32> poison, i32 %trunc111, i32 0
197a196,199
>   %0 = shufflevector <2 x i32> %0, <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
>   %0 = shufflevector <8 x i32> %0, <8 x i32> %0, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 4, i32 5, i32 8, i32 9>
>   %0 = sext <2 x i8> %0 to <2 x i32>
>   %0 = call <8 x i32> @llvm.vector.insert.v8i32.v2i32(<8 x i32> %0, <2 x i32> %0, i64 2)
204a207,208
>   %0 = shufflevector <2 x i1> %0, <2 x i1> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
>   %0 = shufflevector <8 x i1> %0, <8 x i1> %0, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 4, i32 5, i32 8, i32 9>
206d209
<   %0 = call <8 x i1> @llvm.vector.insert.v8i1.v2i1(<8 x i1> %0, <2 x i1> %0, i64 6)
238c241,243
<   %0 = call <8 x i64> @llvm.vector.insert.v8i64.v4i64(<8 x i64> poison, <4 x i64> %0, i64 0)
---
>   %0 = shufflevector <2 x i64> %0, <2 x i64> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
>   %0 = shufflevector <8 x i64> %0, <8 x i64> poison, <8 x i32> <i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 0, i32 1>
>   %0 = call <8 x i64> @llvm.vector.insert.v8i64.v4i64(<8 x i64> %0, <4 x i64> %0, i64 0)
240d244
<   %0 = call <8 x i64> @llvm.vector.insert.v8i64.v2i64(<8 x i64> %0, <2 x i64> %0, i64 6)

@alexey-bataev
Member Author

alexey-bataev commented Oct 1, 2024

Hmm, hard to tell, but the only significant difference I see is the zext/sext. Maybe something related to the minbitwidth analysis.

@alexfh
Contributor

alexfh commented Oct 2, 2024

Here is the test that fails after this patch: https://github.com/google/highway/blob/master/hwy/tests/div_test.cc

The failure looks like this:

[ RUN      ] HwyDivTestGroup/HwyDivTest.TestAllIntegerDiv/EMU128


u8x8 expect [2+ ->]:
  0x00,0x01,0x01,0x00,0x00,0x00,
u8x8 actual [2+ ->]:
  0x00,0x01,0x00,0x00,0x00,0x00,
Abort at div_test.cc:65: EMU128, u8x8 lane 4 mismatch: expected '0x01', got '0x00'.

@steelannelida

I managed a godbolt reproduction: https://gcc.godbolt.org/z/oK9vhjW48

I'm not sure which compiler flags are relevant

@hiraditya
Collaborator

We need more test cases for this patch.

@alexey-bataev
Member Author

Ok, I almost figured out the cause of the miscompile. As I said, the problem is in the minbitwidth analysis and transformation; we need to emit zext instead of sext. A fix is expected soon.

@aeubanks
Contributor

aeubanks commented Oct 2, 2024

I tried minimizing some more by outlining code, and am seeing SLPVectorizer trunc i64 to i8 and then zext to i32, when the original code only truncs from i64 to i32. Is that something that should happen?

@alexey-bataev
Member Author

Yes, that's the minbitwidth analysis reducing the size of the data/register usage.
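
For illustration (a minimal hand-written sketch with a made-up value, not the reduced test case), the narrowed value only round-trips correctly when the re-extension kind matches how the value is used:

  %t = trunc i64 %x to i8     ; minbitwidth-narrowed value
  %s = sext i8 %t to i32      ; for %x == 200 this yields -56 (0xffffffc8)
  %z = zext i8 %t to i32      ; for %x == 200 this yields 200
  %w = trunc i64 %x to i32    ; original, unnarrowed result: 200

The sext and zext results disagree whenever the top bit of the narrowed value is set, which is consistent with the fix emitting zext instead of sext.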

The problem should be fixed in c1b911c.

@aeubanks
Contributor

aeubanks commented Oct 2, 2024

thanks, that fixes the miscompiled test we were seeing
