[SLP]Improved reduction cost/codegen #118293

alexey-bataev · 2024-12-02T13:37:11Z

The SLP vectorizer can build several reductions from a list of
(potentially) reduced values with different opcodes or value kinds.
Currently, these reductions are emitted independently of each other.
Instead, the compiler can combine them into wide vector operations and
then perform only a single reduction.
E.g., if the SLP vectorizer currently emits something like:

%r1 = reduce.add(<4 x i32> %v1)
%r2 = reduce.add(<4 x i32> %v2)
%r = add i32 %r1, %r2

it can be emitted as:

%v = add <4 x i32> %v1, %v2
%r = reduce.add(<4 x i32> %v)

This can improve performance in some cases.
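
The rewrite relies on integer add being associative and commutative, so summing the lanes of each vector first or adding the vectors lane-wise first gives the same scalar. A minimal C++ sketch of that equivalence, using `std::array` as a stand-in for `<4 x i32>`; the helper names `reduceAdd`, `reduceSeparately`, and `reduceCombined` are illustrative assumptions and are not part of the patch:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Stand-in for <4 x i32>; unsigned lanes so that wraparound is well defined,
// mirroring LLVM's i32 add.
using Vec4 = std::array<std::uint32_t, 4>;

// Stand-in for reduce.add(<4 x i32>): sums all lanes of one vector.
std::uint32_t reduceAdd(const Vec4 &V) {
  return std::accumulate(V.begin(), V.end(), std::uint32_t{0});
}

// Before: two independent reductions followed by a scalar add.
std::uint32_t reduceSeparately(const Vec4 &V1, const Vec4 &V2) {
  return reduceAdd(V1) + reduceAdd(V2);
}

// After: one lane-wise vector add followed by a single reduction.
std::uint32_t reduceCombined(const Vec4 &V1, const Vec4 &V2) {
  Vec4 V{};
  for (std::size_t I = 0; I < V.size(); ++I)
    V[I] = V1[I] + V2[I];
  return reduceAdd(V);
}

// Hypothetical self-check: both forms agree, including on wraparound.
void checkCombinedReduction() {
  Vec4 A{1, 2, 3, 4};
  Vec4 B{10, 20, 30, 0xFFFFFFFFu};
  assert(reduceSeparately(A, B) == reduceCombined(A, B));
}
```

Calling `checkCombinedReduction()` from a `main` exercises both forms. The combined form trades one horizontal reduction for one vector add, which is where the potential saving comes from.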

AVX512, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%

Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc
to i32 get truncated to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - transformed same reduction
JM/lencod - extra 4 x vectorization

RISC-V, SiFive-p670, -O3+LTO

Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%

execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same
transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - two 4 x i1 reductions merged into 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - two 4 x i1 reductions merged into 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions
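
One plausible reading of the "bitcast + ctpop" note for consumer-lame and sjeng above: an add reduction over i1 lanes counts the true lanes, so two 4-lane masks can be concatenated into one 8-lane mask, viewed as an 8-bit integer, and counted with a single population count. The sketch below illustrates that idea in plain C++ under this assumption; `Mask4`, `countSeparately`, and `countMerged` are hypothetical names, and `std::bitset` stands in for the bitcast:

```cpp
#include <array>
#include <bitset>
#include <cstddef>

// Stand-in for <4 x i1>.
using Mask4 = std::array<bool, 4>;

// Separate form: each 4-lane mask is reduced on its own, then the
// two partial counts are added.
unsigned countSeparately(const Mask4 &A, const Mask4 &B) {
  unsigned SumA = 0, SumB = 0;
  for (bool Bit : A)
    SumA += Bit;
  for (bool Bit : B)
    SumB += Bit;
  return SumA + SumB;
}

// Merged form: concatenate into 8 lanes, treat the lanes as the bits of
// one 8-bit value ("bitcast"), and count the set bits ("ctpop").
unsigned countMerged(const Mask4 &A, const Mask4 &B) {
  std::bitset<8> Bits;
  for (std::size_t I = 0; I < 4; ++I) {
    Bits[I] = A[I];
    Bits[I + 4] = B[I];
  }
  return static_cast<unsigned>(Bits.count());
}
```

For example, `countMerged(Mask4{true, false, true, true}, Mask4{false, false, true, false})` and the separate form both count the same four set lanes.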

Created using spr 1.3.5

llvmbot · 2024-12-02T13:37:47Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-analysis

Author: Alexey Bataev (alexey-bataev)

Changes

Patch is 22.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118293.diff

5 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+8)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+1)
(modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+16)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+290-32)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 985ca1532e0149..f2f0e56a3f2014 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1584,6 +1584,10 @@ class TargetTransformInfo {
   /// split during legalization. Zero is returned when the answer is unknown.
   unsigned getNumberOfParts(Type *Tp) const;
 
+  /// \return true if \p Tp represent a type, fully occupying whole register,
+  /// false otherwise.
+  bool isFullSingleRegisterType(Type *Tp) const;
+
   /// \returns The cost of the address computation. For most targets this can be
   /// merged into the instruction indexing mode. Some targets might want to
   /// distinguish between address computation for memory operations on vector
@@ -2196,6 +2200,7 @@ class TargetTransformInfo::Concept {
                                            ArrayRef<Type *> Tys,
                                            TTI::TargetCostKind CostKind) = 0;
   virtual unsigned getNumberOfParts(Type *Tp) = 0;
+  virtual bool isFullSingleRegisterType(Type *Tp) const = 0;
   virtual InstructionCost
   getAddressComputationCost(Type *Ty, ScalarEvolution *SE, const SCEV *Ptr) = 0;
   virtual InstructionCost
@@ -2930,6 +2935,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   unsigned getNumberOfParts(Type *Tp) override {
     return Impl.getNumberOfParts(Tp);
   }
+  bool isFullSingleRegisterType(Type *Tp) const override {
+    return Impl.isFullSingleRegisterType(Tp);
+  }
   InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                             const SCEV *Ptr) override {
     return Impl.getAddressComputationCost(Ty, SE, Ptr);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 38aba183f6a173..ce6a96ea317ba7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -833,6 +833,7 @@ class TargetTransformInfoImplBase {
 
   // Assume that we have a register of the right size for the type.
   unsigned getNumberOfParts(Type *Tp) const { return 1; }
+  bool isFullSingleRegisterType(Type *Tp) const { return false; }
 
   InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,
                                             const SCEV *) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 98cbb4886642bf..9e7ce48f901dc5 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -2612,6 +2612,22 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return *LT.first.getValue();
   }
 
+  bool isFullSingleRegisterType(Type *Tp) const {
+    std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Tp);
+    if (!LT.first.isValid() || LT.first > 1)
+      return false;
+
+    if (auto *FTp = dyn_cast<FixedVectorType>(Tp);
+        Tp && LT.second.isFixedLengthVector()) {
+      // Check if the n x i1 fits fully into largest integer.
+      if (unsigned VF = LT.second.getVectorNumElements();
+          LT.second.getVectorElementType() == MVT::i1)
+        return DL.isLegalInteger(VF) && !DL.isLegalInteger(VF * 2);
+      return FTp == EVT(LT.second).getTypeForEVT(Tp->getContext());
+    }
+    return false;
+  }
+
   InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *,
                                             const SCEV *) {
     return 0;
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 1fb2b9836de0cc..f7ad9ed905e3a1 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1171,6 +1171,10 @@ unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
   return TTIImpl->getNumberOfParts(Tp);
 }
 
+bool TargetTransformInfo::isFullSingleRegisterType(Type *Tp) const {
+  return TTIImpl->isFullSingleRegisterType(Tp);
+}
+
 InstructionCost
 TargetTransformInfo::getAddressComputationCost(Type *Tp, ScalarEvolution *SE,
                                                const SCEV *Ptr) const {
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 7723442bc0fb6e..5df21b77643746 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -12080,7 +12080,11 @@ bool BoUpSLP::isTreeNotExtendable() const {
     TreeEntry &E = *VectorizableTree[Idx];
     if (!E.isGather())
       continue;
-    if (E.getOpcode() && E.getOpcode() != Instruction::Load)
+    if ((E.getOpcode() && E.getOpcode() != Instruction::Load) ||
+        (!E.getOpcode() &&
+         all_of(E.Scalars, IsaPred<ExtractElementInst, LoadInst>)) ||
+        (isa<ExtractElementInst>(E.Scalars.front()) &&
+         getSameOpcode(ArrayRef(E.Scalars).drop_front(), *TLI).getOpcode()))
       return false;
     if (isSplat(E.Scalars) || allConstant(E.Scalars))
       continue;
@@ -19174,6 +19178,9 @@ class HorizontalReduction {
   /// Checks if the optimization of original scalar identity operations on
   /// matched horizontal reductions is enabled and allowed.
   bool IsSupportedHorRdxIdentityOp = false;
+  /// Contains vector values for reduction including their scale factor and
+  /// signedness.
+  SmallVector<std::tuple<Value *, unsigned, bool>> VectorValuesAndScales;
 
   static bool isCmpSelMinMax(Instruction *I) {
     return match(I, m_Select(m_Cmp(), m_Value(), m_Value())) &&
@@ -19225,17 +19232,22 @@ class HorizontalReduction {
   static Value *createOp(IRBuilderBase &Builder, RecurKind Kind, Value *LHS,
                          Value *RHS, const Twine &Name, bool UseSelect) {
     unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(Kind);
+    Type *OpTy = LHS->getType();
+    assert(OpTy == RHS->getType() && "Expected LHS and RHS of same type");
     switch (Kind) {
     case RecurKind::Or:
-      if (UseSelect &&
-          LHS->getType() == CmpInst::makeCmpResultType(LHS->getType()))
-        return Builder.CreateSelect(LHS, Builder.getTrue(), RHS, Name);
+      if (UseSelect && OpTy == CmpInst::makeCmpResultType(OpTy))
+        return Builder.CreateSelect(
+            LHS,
+            ConstantInt::getAllOnesValue(CmpInst::makeCmpResultType(OpTy)),
+            RHS, Name);
       return Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, LHS, RHS,
                                  Name);
     case RecurKind::And:
-      if (UseSelect &&
-          LHS->getType() == CmpInst::makeCmpResultType(LHS->getType()))
-        return Builder.CreateSelect(LHS, RHS, Builder.getFalse(), Name);
+      if (UseSelect && OpTy == CmpInst::makeCmpResultType(OpTy))
+        return Builder.CreateSelect(
+            LHS, RHS,
+            ConstantInt::getNullValue(CmpInst::makeCmpResultType(OpTy)), Name);
       return Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, LHS, RHS,
                                  Name);
     case RecurKind::Add:
@@ -20108,12 +20120,11 @@ class HorizontalReduction {
                                          SameValuesCounter, TrackedToOrig);
         }
 
-        Value *ReducedSubTree;
         Type *ScalarTy = VL.front()->getType();
         if (isa<FixedVectorType>(ScalarTy)) {
           assert(SLPReVec && "FixedVectorType is not expected.");
           unsigned ScalarTyNumElements = getNumElements(ScalarTy);
-          ReducedSubTree = PoisonValue::get(FixedVectorType::get(
+          Value *ReducedSubTree = PoisonValue::get(getWidenedType(
               VectorizedRoot->getType()->getScalarType(), ScalarTyNumElements));
           for (unsigned I : seq<unsigned>(ScalarTyNumElements)) {
             // Do reduction for each lane.
@@ -20131,30 +20142,32 @@ class HorizontalReduction {
             SmallVector<int, 16> Mask =
                 createStrideMask(I, ScalarTyNumElements, VL.size());
             Value *Lane = Builder.CreateShuffleVector(VectorizedRoot, Mask);
-            ReducedSubTree = Builder.CreateInsertElement(
-                ReducedSubTree,
-                emitReduction(Lane, Builder, TTI, RdxRootInst->getType()), I);
+            Value *Val =
+                createSingleOp(Builder, *TTI, Lane,
+                               OptReusedScalars && SameScaleFactor
+                                   ? SameValuesCounter.front().second
+                                   : 1,
+                               Lane->getType()->getScalarType() !=
+                                       VL.front()->getType()->getScalarType()
+                                   ? V.isSignedMinBitwidthRootNode()
+                                   : true, RdxRootInst->getType());
+            ReducedSubTree =
+                Builder.CreateInsertElement(ReducedSubTree, Val, I);
           }
+          VectorizedTree = GetNewVectorizedTree(VectorizedTree, ReducedSubTree);
         } else {
-          ReducedSubTree = emitReduction(VectorizedRoot, Builder, TTI,
-                                         RdxRootInst->getType());
+          Type *VecTy = VectorizedRoot->getType();
+          Type *RedScalarTy = VecTy->getScalarType();
+          VectorValuesAndScales.emplace_back(
+              VectorizedRoot,
+              OptReusedScalars && SameScaleFactor
+                  ? SameValuesCounter.front().second
+                  : 1,
+              RedScalarTy != ScalarTy->getScalarType()
+                  ? V.isSignedMinBitwidthRootNode()
+                  : true);
         }
-        if (ReducedSubTree->getType() != VL.front()->getType()) {
-          assert(ReducedSubTree->getType() != VL.front()->getType() &&
-                 "Expected different reduction type.");
-          ReducedSubTree =
-              Builder.CreateIntCast(ReducedSubTree, VL.front()->getType(),
-                                    V.isSignedMinBitwidthRootNode());
-        }
-
-        // Improved analysis for add/fadd/xor reductions with same scale factor
-        // for all operands of reductions. We can emit scalar ops for them
-        // instead.
-        if (OptReusedScalars && SameScaleFactor)
-          ReducedSubTree = emitScaleForReusedOps(
-              ReducedSubTree, Builder, SameValuesCounter.front().second);
 
-        VectorizedTree = GetNewVectorizedTree(VectorizedTree, ReducedSubTree);
         // Count vectorized reduced values to exclude them from final reduction.
         for (Value *RdxVal : VL) {
           Value *OrigV = TrackedToOrig.at(RdxVal);
@@ -20183,6 +20196,10 @@ class HorizontalReduction {
         continue;
       }
     }
+    if (!VectorValuesAndScales.empty())
+      VectorizedTree = GetNewVectorizedTree(
+          VectorizedTree,
+          emitReduction(Builder, *TTI, ReductionRoot->getType()));
     if (VectorizedTree) {
       // Reorder operands of bool logical op in the natural order to avoid
       // possible problem with poison propagation. If not possible to reorder
@@ -20317,6 +20334,28 @@ class HorizontalReduction {
   }
 
 private:
+  /// Checks if the given type \p Ty is a vector type, which does not occupy the
+  /// whole vector register or is expensive for extraction.
+  static bool isNotFullVectorType(const TargetTransformInfo &TTI, Type *Ty) {
+    return TTI.getNumberOfParts(Ty) == 1 && !TTI.isFullSingleRegisterType(Ty);
+  }
+
+  /// Creates the reduction from the given \p Vec vector value with the given
+  /// scale \p Scale and signedness \p IsSigned.
+  Value *createSingleOp(IRBuilderBase &Builder, const TargetTransformInfo &TTI,
+                        Value *Vec, unsigned Scale, bool IsSigned,
+                        Type *DestTy) {
+    Value *Rdx = emitReduction(Vec, Builder, &TTI, DestTy);
+    if (Rdx->getType() != DestTy->getScalarType())
+      Rdx = Builder.CreateIntCast(Rdx, DestTy, IsSigned);
+    // Improved analysis for add/fadd/xor reductions with same scale
+    // factor for all operands of reductions. We can emit scalar ops for
+    // them instead.
+    if (Scale > 1)
+      Rdx = emitScaleForReusedOps(Rdx, Builder, Scale);
+    return Rdx;
+  }
+
   /// Calculate the cost of a reduction.
   InstructionCost getReductionCost(TargetTransformInfo *TTI,
                                    ArrayRef<Value *> ReducedVals,
@@ -20359,6 +20398,22 @@ class HorizontalReduction {
       }
       return Cost;
     };
+    // Require reduction cost if:
+    // 1. This type is not a full register type and no other vectors with the
+    // same type in the storage (first vector with small type).
+    // 2. The storage does not have any vector with full vector use (first
+    // vector with full register use).
+    bool DoesRequireReductionOp =
+        !AllConsts &&
+        (VectorValuesAndScales.empty() ||
+         (isNotFullVectorType(*TTI, VectorTy) &&
+          none_of(VectorValuesAndScales,
+                  [&](const auto &P) {
+                    return std::get<0>(P)->getType() == VectorTy;
+                  })) ||
+         all_of(VectorValuesAndScales, [&](const auto &P) {
+           return isNotFullVectorType(*TTI, std::get<0>(P)->getType());
+         }));
     switch (RdxKind) {
     case RecurKind::Add:
     case RecurKind::Mul:
@@ -20382,7 +20437,7 @@ class HorizontalReduction {
           VectorCost += TTI->getScalarizationOverhead(
               VecTy, APInt::getAllOnes(ScalarTyNumElements), /*Insert*/ true,
               /*Extract*/ false, TTI::TCK_RecipThroughput);
-        } else {
+        } else if (DoesRequireReductionOp) {
           Type *RedTy = VectorTy->getElementType();
           auto [RType, IsSigned] = R.getRootNodeTypeWithNoCast().value_or(
               std::make_pair(RedTy, true));
@@ -20394,6 +20449,14 @@ class HorizontalReduction {
                 RdxOpcode, !IsSigned, RedTy, getWidenedType(RType, ReduxWidth),
                 FMF, CostKind);
           }
+        } else {
+          unsigned NumParts = TTI->getNumberOfParts(VectorTy);
+          unsigned RegVF = getPartNumElems(getNumElements(VectorTy), NumParts);
+          VectorCost +=
+              NumParts * TTI->getArithmeticInstrCost(
+                             RdxOpcode,
+                             getWidenedType(VectorTy->getScalarType(), RegVF),
+                             CostKind);
         }
       }
       ScalarCost = EvaluateScalarCost([&]() {
@@ -20410,8 +20473,19 @@ class HorizontalReduction {
     case RecurKind::UMax:
     case RecurKind::UMin: {
       Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
-      if (!AllConsts)
-        VectorCost = TTI->getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+      if (!AllConsts) {
+        if (DoesRequireReductionOp) {
+          VectorCost = TTI->getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+        } else {
+          // Check if the previous reduction already exists and account it as
+          // series of operations + single reduction.
+          unsigned NumParts = TTI->getNumberOfParts(VectorTy);
+          unsigned RegVF = getPartNumElems(getNumElements(VectorTy), NumParts);
+          auto *RegVecTy = getWidenedType(VectorTy->getScalarType(), RegVF);
+          IntrinsicCostAttributes ICA(Id, RegVecTy, {RegVecTy, RegVecTy}, FMF);
+          VectorCost += NumParts * TTI->getIntrinsicInstrCost(ICA, CostKind);
+        }
+      }
       ScalarCost = EvaluateScalarCost([&]() {
         IntrinsicCostAttributes ICA(Id, ScalarTy, {ScalarTy, ScalarTy}, FMF);
         return TTI->getIntrinsicInstrCost(ICA, CostKind);
@@ -20428,6 +20502,190 @@ class HorizontalReduction {
     return VectorCost - ScalarCost;
   }
 
+  /// Splits the values, stored in VectorValuesAndScales, into registers/free
+  /// sub-registers, combines them with the given reduction operation as a
+  /// vector operation and then performs single (small enough) reduction.
+  Value *emitReduction(IRBuilderBase &Builder, const TargetTransformInfo &TTI,
+                       Type *DestTy) {
+    Value *ReducedSubTree = nullptr;
+    // Creates reduction and combines with the previous reduction.
+    auto CreateSingleOp = [&](Value *Vec, unsigned Scale, bool IsSigned) {
+      Value *Rdx = createSingleOp(Builder, TTI, Vec, Scale, IsSigned, DestTy);
+      if (ReducedSubTree)
+        ReducedSubTree = createOp(Builder, RdxKind, ReducedSubTree, Rdx,
+                                  "op.rdx", ReductionOps);
+      else
+        ReducedSubTree = Rdx;
+    };
+    if (VectorValuesAndScales.size() == 1) {
+      const auto &[Vec, Scale, IsSigned] = VectorValuesAndScales.front();
+      CreateSingleOp(Vec, Scale, IsSigned);
+      return ReducedSubTree;
+    }
+    // Splits multivector value into per-register values.
+    auto SplitVector = [&](Value *Vec) {
+      auto *ScalarTy = cast<VectorType>(Vec->getType())->getElementType();
+      unsigned Sz = getNumElements(Vec->getType());
+      unsigned NumParts = TTI.getNumberOfParts(Vec->getType());
+      if (NumParts <= 1 || NumParts >= Sz ||
+          isNotFullVectorType(TTI, Vec->getType()))
+        return SmallVector<Value *>(1, Vec);
+      unsigned RegSize = getPartNumElems(Sz, NumParts);
+      auto *DstTy = getWidenedType(ScalarTy, RegSize);
+      SmallVector<Value *> Regs(NumParts);
+      for (unsigned Part : seq<unsigned>(NumParts))
+        Regs[Part] = Builder.CreateExtractVector(
+            DstTy, Vec, Builder.getInt64(Part * RegSize));
+      return Regs;
+    };
+    SmallMapVector<Type *, Value *, 4> VecOps;
+    // Scales Vec using given Cnt scale factor and then performs vector combine
+    // with previous value of VecOp.
+    auto CreateVecOp = [&](Value *Vec, unsigned Cnt) {
+      Type *ScalarTy = cast<VectorType>(Vec->getType())->getElementType();
+      // Scale Vec using given Cnt scale factor.
+      if (Cnt > 1) {
+        ElementCount EC = cast<VectorType>(Vec->getType())->getElementCount();
+        switch (RdxKind) {
+        case RecurKind::Add: {
+          if (ScalarTy == Builder.getInt1Ty() && ScalarTy != DestTy) {
+            unsigned VF = getNumElements(Vec->getType());
+            LLVM_DEBUG(dbgs() << "SLP: ctpop " << Cnt << "of " << Vec
+                              << ". (HorRdx)\n");
+            SmallVector<int> Mask(Cnt * VF, PoisonMaskElem);
+            for (unsigned I : seq<unsigned>(Cnt))
+              std::iota(std::next(Mask.begin(), VF * I),
+                        std::next(Mask.begin(), VF * (I + 1)), 0);
+            ++NumVectorInstructions;
+            Vec = Builder.CreateShuffleVector(Vec, Mask);
+            break;
+          }
+          // res = mul vv, n
+          Value *Scale =
+              ConstantVector::getSplat(EC, ConstantInt::get(ScalarTy, Cnt));
+          LLVM_DEBUG(dbgs() << "SLP: Add (to-mul) " << Cnt << "of " << Vec
+                            << ". (HorRdx)\n");
+          ++NumVectorInstructions;
+          Vec = Builder.CreateMul(Vec, Scale);
+          break;
+        }
+        case RecurKind::Xor: {
+          // res = n % 2 ? 0 : vv
+          LLVM_DEBUG(dbgs()
+                     << "SLP: Xor " << Cnt << "of " << Vec << ". (HorRdx)\n");
+          if (Cnt % 2 == 0)
+            Vec = Constant::getNullValue(Vec->getType());
+          break;
+        }
+        case RecurKind::FAdd: {
+          // res = fmul v, n
+          Value *Scale =
+              ConstantVector::getSplat(EC, ConstantFP::get(ScalarTy, Cnt));
+          LLVM_DEBUG(dbgs() << "SLP: FAdd (to-fmul) " << Cnt << "of " << Vec
+                            << ". (HorRdx)\n");
+          ++NumVectorInstructions;
+          Vec = Builder.CreateFMul(Vec, Scale);
+          break;
+        }
+        case RecurKind::And:
+        case RecurKind::Or:
+        case RecurKind::SMax:
+        case RecurKind::SMin:
+        case RecurKind::UMax:
+        case RecurKind::UMin:
+        case RecurKind::FMax:
+        case RecurKind::FMin:
+        case RecurKind::FMaximum:
+        case RecurKind::FMinimum:
+          // res = vv
+          break;
+        case RecurKind::Mul:
+        case RecurKind::FMul:
+        case RecurKind::FMulAdd:
+        case RecurKind::IAnyOf:
+        case RecurKind::FAnyOf:
+        case RecurKind::None:
+          llvm_unreachable("Unexpected reduction kind for repeated scalar.");
+        }
+      }
+      // Combine Vec w...
[truncated]
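
The scaling switch in `CreateVecOp` above handles a value that appears `Cnt` times in the reduction: for add it multiplies by `Cnt`, for xor it zeroes the value when `Cnt` is even, and for and/or/min/max it leaves the value unchanged. A scalar C++ sketch of those rules, checked against a plain loop; the enum `RdxKind` and both function names are illustrative assumptions and do not mirror LLVM's `RecurKind` API, and only unsigned integer kinds are modelled:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative subset of reduction kinds (not LLVM's RecurKind).
enum class RdxKind { Add, Xor, And, Or, UMax, UMin };

// Closed form: the result of reducing Cnt copies of X (Cnt >= 1).
std::uint32_t scaleRepeated(RdxKind Kind, std::uint32_t X, unsigned Cnt) {
  switch (Kind) {
  case RdxKind::Add:
    return X * Cnt; // x + x + ... + x == x * n (modulo 2^32)
  case RdxKind::Xor:
    return (Cnt % 2 == 0) ? 0 : X; // pairs of x cancel out
  case RdxKind::And:
  case RdxKind::Or:
  case RdxKind::UMax:
  case RdxKind::UMin:
    return X; // idempotent operations: repeats change nothing
  }
  return X;
}

// Reference: apply the operation Cnt - 1 times in a loop.
std::uint32_t reduceRepeated(RdxKind Kind, std::uint32_t X, unsigned Cnt) {
  assert(Cnt >= 1 && "expected at least one occurrence");
  std::uint32_t Acc = X;
  for (unsigned I = 1; I < Cnt; ++I) {
    switch (Kind) {
    case RdxKind::Add:  Acc += X; break;
    case RdxKind::Xor:  Acc ^= X; break;
    case RdxKind::And:  Acc &= X; break;
    case RdxKind::Or:   Acc |= X; break;
    case RdxKind::UMax: Acc = std::max(Acc, X); break;
    case RdxKind::UMin: Acc = std::min(Acc, X); break;
    }
  }
  return Acc;
}
```

In the patch the same rules are applied to whole vectors (a splat multiply for add and fadd, a null vector for an even xor count), so the scaling happens once before the combined reduction rather than once per scalar result.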

llvmbot · 2024-12-02T13:37:47Z

@llvm/pr-subscribers-vectorizers

Author: Alexey Bataev (alexey-bataev)

Changes

AVX512, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%

Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 get truncated to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - same transformed reduction
JM/lencod - extra 4 x vectorization

RISC-V, SiFive-p670, -O3+LTO

Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%

execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same
transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions
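
The consumer-lame and sjeng notes mention two 4 x i1 reductions merged into one 8 x i1 reduction that lowers to a bitcast and ctpop. The identity behind it can be sketched in plain C++ as below. This is an illustration only: `Mask4`, `addReduce`, `concat` and `mergedAddReduce` are hypothetical names, and `std::bitset` stands in for a vector of i1 lanes.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Four boolean lanes, standing in for a <4 x i1> value.
using Mask4 = std::bitset<4>;

// Reference: zero-extend every lane and add them up (an add reduction).
inline unsigned addReduce(const Mask4 &m) {
  unsigned sum = 0;
  for (std::size_t i = 0; i < m.size(); ++i)
    sum += m[i] ? 1u : 0u;
  return sum;
}

// Concatenate two 4-lane masks into one 8-lane mask:
// `lo` fills lanes 0..3, `hi` fills lanes 4..7.
inline std::bitset<8> concat(const Mask4 &lo, const Mask4 &hi) {
  return std::bitset<8>(lo.to_ulong() | (hi.to_ulong() << 4));
}

// Merged form: treat the 8 lanes as one integer and count its set bits.
inline unsigned mergedAddReduce(const Mask4 &a, const Mask4 &b) {
  std::size_t bits = concat(a, b).count();
  assert(bits <= 8 && "an 8-lane mask has at most 8 set bits");
  return static_cast<unsigned>(bits);
}
```

Here `addReduce(a) + addReduce(b)` and `mergedAddReduce(a, b)` give the same count, so two small reductions plus a scalar add collapse into a single population count.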

Patch is 22.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118293.diff

5 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+8)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+1)
(modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+16)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+290-32)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 985ca1532e0149..f2f0e56a3f2014 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1584,6 +1584,10 @@ class TargetTransformInfo {
   /// split during legalization. Zero is returned when the answer is unknown.
   unsigned getNumberOfParts(Type *Tp) const;
 
+  /// \return true if \p Tp represent a type, fully occupying whole register,
+  /// false otherwise.
+  bool isFullSingleRegisterType(Type *Tp) const;
+
   /// \returns The cost of the address computation. For most targets this can be
   /// merged into the instruction indexing mode. Some targets might want to
   /// distinguish between address computation for memory operations on vector
@@ -2196,6 +2200,7 @@ class TargetTransformInfo::Concept {
                                            ArrayRef<Type *> Tys,
                                            TTI::TargetCostKind CostKind) = 0;
   virtual unsigned getNumberOfParts(Type *Tp) = 0;
+  virtual bool isFullSingleRegisterType(Type *Tp) const = 0;
   virtual InstructionCost
   getAddressComputationCost(Type *Ty, ScalarEvolution *SE, const SCEV *Ptr) = 0;
   virtual InstructionCost
@@ -2930,6 +2935,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   unsigned getNumberOfParts(Type *Tp) override {
     return Impl.getNumberOfParts(Tp);
   }
+  bool isFullSingleRegisterType(Type *Tp) const override {
+    return Impl.isFullSingleRegisterType(Tp);
+  }
   InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                             const SCEV *Ptr) override {
     return Impl.getAddressComputationCost(Ty, SE, Ptr);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 38aba183f6a173..ce6a96ea317ba7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -833,6 +833,7 @@ class TargetTransformInfoImplBase {
 
   // Assume that we have a register of the right size for the type.
   unsigned getNumberOfParts(Type *Tp) const { return 1; }
+  bool isFullSingleRegisterType(Type *Tp) const { return false; }
 
   InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,
                                             const SCEV *) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 98cbb4886642bf..9e7ce48f901dc5 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -2612,6 +2612,22 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return *LT.first.getValue();
   }
 
+  bool isFullSingleRegisterType(Type *Tp) const {
+    std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Tp);
+    if (!LT.first.isValid() || LT.first > 1)
+      return false;
+
+    if (auto *FTp = dyn_cast<FixedVectorType>(Tp);
+        Tp && LT.second.isFixedLengthVector()) {
+      // Check if the n x i1 fits fully into largest integer.
+      if (unsigned VF = LT.second.getVectorNumElements();
+          LT.second.getVectorElementType() == MVT::i1)
+        return DL.isLegalInteger(VF) && !DL.isLegalInteger(VF * 2);
+      return FTp == EVT(LT.second).getTypeForEVT(Tp->getContext());
+    }
+    return false;
+  }
+
   InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *,
                                             const SCEV *) {
     return 0;
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 1fb2b9836de0cc..f7ad9ed905e3a1 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1171,6 +1171,10 @@ unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
   return TTIImpl->getNumberOfParts(Tp);
 }
 
+bool TargetTransformInfo::isFullSingleRegisterType(Type *Tp) const {
+  return TTIImpl->isFullSingleRegisterType(Tp);
+}
+
 InstructionCost
 TargetTransformInfo::getAddressComputationCost(Type *Tp, ScalarEvolution *SE,
                                                const SCEV *Ptr) const {
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 7723442bc0fb6e..5df21b77643746 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -12080,7 +12080,11 @@ bool BoUpSLP::isTreeNotExtendable() const {
     TreeEntry &E = *VectorizableTree[Idx];
     if (!E.isGather())
       continue;
-    if (E.getOpcode() && E.getOpcode() != Instruction::Load)
+    if ((E.getOpcode() && E.getOpcode() != Instruction::Load) ||
+        (!E.getOpcode() &&
+         all_of(E.Scalars, IsaPred<ExtractElementInst, LoadInst>)) ||
+        (isa<ExtractElementInst>(E.Scalars.front()) &&
+         getSameOpcode(ArrayRef(E.Scalars).drop_front(), *TLI).getOpcode()))
       return false;
     if (isSplat(E.Scalars) || allConstant(E.Scalars))
       continue;
@@ -19174,6 +19178,9 @@ class HorizontalReduction {
   /// Checks if the optimization of original scalar identity operations on
   /// matched horizontal reductions is enabled and allowed.
   bool IsSupportedHorRdxIdentityOp = false;
+  /// Contains vector values for reduction including their scale factor and
+  /// signedness.
+  SmallVector<std::tuple<Value *, unsigned, bool>> VectorValuesAndScales;
 
   static bool isCmpSelMinMax(Instruction *I) {
     return match(I, m_Select(m_Cmp(), m_Value(), m_Value())) &&
@@ -19225,17 +19232,22 @@ class HorizontalReduction {
   static Value *createOp(IRBuilderBase &Builder, RecurKind Kind, Value *LHS,
                          Value *RHS, const Twine &Name, bool UseSelect) {
     unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(Kind);
+    Type *OpTy = LHS->getType();
+    assert(OpTy == RHS->getType() && "Expected LHS and RHS of same type");
     switch (Kind) {
     case RecurKind::Or:
-      if (UseSelect &&
-          LHS->getType() == CmpInst::makeCmpResultType(LHS->getType()))
-        return Builder.CreateSelect(LHS, Builder.getTrue(), RHS, Name);
+      if (UseSelect && OpTy == CmpInst::makeCmpResultType(OpTy))
+        return Builder.CreateSelect(
+            LHS,
+            ConstantInt::getAllOnesValue(CmpInst::makeCmpResultType(OpTy)),
+            RHS, Name);
       return Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, LHS, RHS,
                                  Name);
     case RecurKind::And:
-      if (UseSelect &&
-          LHS->getType() == CmpInst::makeCmpResultType(LHS->getType()))
-        return Builder.CreateSelect(LHS, RHS, Builder.getFalse(), Name);
+      if (UseSelect && OpTy == CmpInst::makeCmpResultType(OpTy))
+        return Builder.CreateSelect(
+            LHS, RHS,
+            ConstantInt::getNullValue(CmpInst::makeCmpResultType(OpTy)), Name);
       return Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, LHS, RHS,
                                  Name);
     case RecurKind::Add:
@@ -20108,12 +20120,11 @@ class HorizontalReduction {
                                          SameValuesCounter, TrackedToOrig);
         }
 
-        Value *ReducedSubTree;
         Type *ScalarTy = VL.front()->getType();
         if (isa<FixedVectorType>(ScalarTy)) {
           assert(SLPReVec && "FixedVectorType is not expected.");
           unsigned ScalarTyNumElements = getNumElements(ScalarTy);
-          ReducedSubTree = PoisonValue::get(FixedVectorType::get(
+          Value *ReducedSubTree = PoisonValue::get(getWidenedType(
               VectorizedRoot->getType()->getScalarType(), ScalarTyNumElements));
           for (unsigned I : seq<unsigned>(ScalarTyNumElements)) {
             // Do reduction for each lane.
@@ -20131,30 +20142,32 @@ class HorizontalReduction {
             SmallVector<int, 16> Mask =
                 createStrideMask(I, ScalarTyNumElements, VL.size());
             Value *Lane = Builder.CreateShuffleVector(VectorizedRoot, Mask);
-            ReducedSubTree = Builder.CreateInsertElement(
-                ReducedSubTree,
-                emitReduction(Lane, Builder, TTI, RdxRootInst->getType()), I);
+            Value *Val =
+                createSingleOp(Builder, *TTI, Lane,
+                               OptReusedScalars && SameScaleFactor
+                                   ? SameValuesCounter.front().second
+                                   : 1,
+                               Lane->getType()->getScalarType() !=
+                                       VL.front()->getType()->getScalarType()
+                                   ? V.isSignedMinBitwidthRootNode()
+                                   : true, RdxRootInst->getType());
+            ReducedSubTree =
+                Builder.CreateInsertElement(ReducedSubTree, Val, I);
           }
+          VectorizedTree = GetNewVectorizedTree(VectorizedTree, ReducedSubTree);
         } else {
-          ReducedSubTree = emitReduction(VectorizedRoot, Builder, TTI,
-                                         RdxRootInst->getType());
+          Type *VecTy = VectorizedRoot->getType();
+          Type *RedScalarTy = VecTy->getScalarType();
+          VectorValuesAndScales.emplace_back(
+              VectorizedRoot,
+              OptReusedScalars && SameScaleFactor
+                  ? SameValuesCounter.front().second
+                  : 1,
+              RedScalarTy != ScalarTy->getScalarType()
+                  ? V.isSignedMinBitwidthRootNode()
+                  : true);
         }
-        if (ReducedSubTree->getType() != VL.front()->getType()) {
-          assert(ReducedSubTree->getType() != VL.front()->getType() &&
-                 "Expected different reduction type.");
-          ReducedSubTree =
-              Builder.CreateIntCast(ReducedSubTree, VL.front()->getType(),
-                                    V.isSignedMinBitwidthRootNode());
-        }
-
-        // Improved analysis for add/fadd/xor reductions with same scale factor
-        // for all operands of reductions. We can emit scalar ops for them
-        // instead.
-        if (OptReusedScalars && SameScaleFactor)
-          ReducedSubTree = emitScaleForReusedOps(
-              ReducedSubTree, Builder, SameValuesCounter.front().second);
 
-        VectorizedTree = GetNewVectorizedTree(VectorizedTree, ReducedSubTree);
         // Count vectorized reduced values to exclude them from final reduction.
         for (Value *RdxVal : VL) {
           Value *OrigV = TrackedToOrig.at(RdxVal);
@@ -20183,6 +20196,10 @@ class HorizontalReduction {
         continue;
       }
     }
+    if (!VectorValuesAndScales.empty())
+      VectorizedTree = GetNewVectorizedTree(
+          VectorizedTree,
+          emitReduction(Builder, *TTI, ReductionRoot->getType()));
     if (VectorizedTree) {
       // Reorder operands of bool logical op in the natural order to avoid
       // possible problem with poison propagation. If not possible to reorder
@@ -20317,6 +20334,28 @@ class HorizontalReduction {
   }
 
 private:
+  /// Checks if the given type \p Ty is a vector type, which does not occupy the
+  /// whole vector register or is expensive for extraction.
+  static bool isNotFullVectorType(const TargetTransformInfo &TTI, Type *Ty) {
+    return TTI.getNumberOfParts(Ty) == 1 && !TTI.isFullSingleRegisterType(Ty);
+  }
+
+  /// Creates the reduction from the given \p Vec vector value with the given
+  /// scale \p Scale and signedness \p IsSigned.
+  Value *createSingleOp(IRBuilderBase &Builder, const TargetTransformInfo &TTI,
+                        Value *Vec, unsigned Scale, bool IsSigned,
+                        Type *DestTy) {
+    Value *Rdx = emitReduction(Vec, Builder, &TTI, DestTy);
+    if (Rdx->getType() != DestTy->getScalarType())
+      Rdx = Builder.CreateIntCast(Rdx, DestTy, IsSigned);
+    // Improved analysis for add/fadd/xor reductions with same scale
+    // factor for all operands of reductions. We can emit scalar ops for
+    // them instead.
+    if (Scale > 1)
+      Rdx = emitScaleForReusedOps(Rdx, Builder, Scale);
+    return Rdx;
+  }
+
   /// Calculate the cost of a reduction.
   InstructionCost getReductionCost(TargetTransformInfo *TTI,
                                    ArrayRef<Value *> ReducedVals,
@@ -20359,6 +20398,22 @@ class HorizontalReduction {
       }
       return Cost;
     };
+    // Require reduction cost if:
+    // 1. This type is not a full register type and no other vectors with the
+    // same type in the storage (first vector with small type).
+    // 2. The storage does not have any vector with full vector use (first
+    // vector with full register use).
+    bool DoesRequireReductionOp =
+        !AllConsts &&
+        (VectorValuesAndScales.empty() ||
+         (isNotFullVectorType(*TTI, VectorTy) &&
+          none_of(VectorValuesAndScales,
+                  [&](const auto &P) {
+                    return std::get<0>(P)->getType() == VectorTy;
+                  })) ||
+         all_of(VectorValuesAndScales, [&](const auto &P) {
+           return isNotFullVectorType(*TTI, std::get<0>(P)->getType());
+         }));
     switch (RdxKind) {
     case RecurKind::Add:
     case RecurKind::Mul:
@@ -20382,7 +20437,7 @@ class HorizontalReduction {
           VectorCost += TTI->getScalarizationOverhead(
               VecTy, APInt::getAllOnes(ScalarTyNumElements), /*Insert*/ true,
               /*Extract*/ false, TTI::TCK_RecipThroughput);
-        } else {
+        } else if (DoesRequireReductionOp) {
           Type *RedTy = VectorTy->getElementType();
           auto [RType, IsSigned] = R.getRootNodeTypeWithNoCast().value_or(
               std::make_pair(RedTy, true));
@@ -20394,6 +20449,14 @@ class HorizontalReduction {
                 RdxOpcode, !IsSigned, RedTy, getWidenedType(RType, ReduxWidth),
                 FMF, CostKind);
           }
+        } else {
+          unsigned NumParts = TTI->getNumberOfParts(VectorTy);
+          unsigned RegVF = getPartNumElems(getNumElements(VectorTy), NumParts);
+          VectorCost +=
+              NumParts * TTI->getArithmeticInstrCost(
+                             RdxOpcode,
+                             getWidenedType(VectorTy->getScalarType(), RegVF),
+                             CostKind);
         }
       }
       ScalarCost = EvaluateScalarCost([&]() {
@@ -20410,8 +20473,19 @@ class HorizontalReduction {
     case RecurKind::UMax:
     case RecurKind::UMin: {
       Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
-      if (!AllConsts)
-        VectorCost = TTI->getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+      if (!AllConsts) {
+        if (DoesRequireReductionOp) {
+          VectorCost = TTI->getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+        } else {
+          // Check if the previous reduction already exists and account it as
+          // series of operations + single reduction.
+          unsigned NumParts = TTI->getNumberOfParts(VectorTy);
+          unsigned RegVF = getPartNumElems(getNumElements(VectorTy), NumParts);
+          auto *RegVecTy = getWidenedType(VectorTy->getScalarType(), RegVF);
+          IntrinsicCostAttributes ICA(Id, RegVecTy, {RegVecTy, RegVecTy}, FMF);
+          VectorCost += NumParts * TTI->getIntrinsicInstrCost(ICA, CostKind);
+        }
+      }
       ScalarCost = EvaluateScalarCost([&]() {
         IntrinsicCostAttributes ICA(Id, ScalarTy, {ScalarTy, ScalarTy}, FMF);
         return TTI->getIntrinsicInstrCost(ICA, CostKind);
@@ -20428,6 +20502,190 @@ class HorizontalReduction {
     return VectorCost - ScalarCost;
   }
 
+  /// Splits the values, stored in VectorValuesAndScales, into registers/free
+  /// sub-registers, combines them with the given reduction operation as a
+  /// vector operation and then performs single (small enough) reduction.
+  Value *emitReduction(IRBuilderBase &Builder, const TargetTransformInfo &TTI,
+                       Type *DestTy) {
+    Value *ReducedSubTree = nullptr;
+    // Creates reduction and combines with the previous reduction.
+    auto CreateSingleOp = [&](Value *Vec, unsigned Scale, bool IsSigned) {
+      Value *Rdx = createSingleOp(Builder, TTI, Vec, Scale, IsSigned, DestTy);
+      if (ReducedSubTree)
+        ReducedSubTree = createOp(Builder, RdxKind, ReducedSubTree, Rdx,
+                                  "op.rdx", ReductionOps);
+      else
+        ReducedSubTree = Rdx;
+    };
+    if (VectorValuesAndScales.size() == 1) {
+      const auto &[Vec, Scale, IsSigned] = VectorValuesAndScales.front();
+      CreateSingleOp(Vec, Scale, IsSigned);
+      return ReducedSubTree;
+    }
+    // Splits multivector value into per-register values.
+    auto SplitVector = [&](Value *Vec) {
+      auto *ScalarTy = cast<VectorType>(Vec->getType())->getElementType();
+      unsigned Sz = getNumElements(Vec->getType());
+      unsigned NumParts = TTI.getNumberOfParts(Vec->getType());
+      if (NumParts <= 1 || NumParts >= Sz ||
+          isNotFullVectorType(TTI, Vec->getType()))
+        return SmallVector<Value *>(1, Vec);
+      unsigned RegSize = getPartNumElems(Sz, NumParts);
+      auto *DstTy = getWidenedType(ScalarTy, RegSize);
+      SmallVector<Value *> Regs(NumParts);
+      for (unsigned Part : seq<unsigned>(NumParts))
+        Regs[Part] = Builder.CreateExtractVector(
+            DstTy, Vec, Builder.getInt64(Part * RegSize));
+      return Regs;
+    };
+    SmallMapVector<Type *, Value *, 4> VecOps;
+    // Scales Vec using given Cnt scale factor and then performs vector combine
+    // with previous value of VecOp.
+    auto CreateVecOp = [&](Value *Vec, unsigned Cnt) {
+      Type *ScalarTy = cast<VectorType>(Vec->getType())->getElementType();
+      // Scale Vec using given Cnt scale factor.
+      if (Cnt > 1) {
+        ElementCount EC = cast<VectorType>(Vec->getType())->getElementCount();
+        switch (RdxKind) {
+        case RecurKind::Add: {
+          if (ScalarTy == Builder.getInt1Ty() && ScalarTy != DestTy) {
+            unsigned VF = getNumElements(Vec->getType());
+            LLVM_DEBUG(dbgs() << "SLP: ctpop " << Cnt << "of " << Vec
+                              << ". (HorRdx)\n");
+            SmallVector<int> Mask(Cnt * VF, PoisonMaskElem);
+            for (unsigned I : seq<unsigned>(Cnt))
+              std::iota(std::next(Mask.begin(), VF * I),
+                        std::next(Mask.begin(), VF * (I + 1)), 0);
+            ++NumVectorInstructions;
+            Vec = Builder.CreateShuffleVector(Vec, Mask);
+            break;
+          }
+          // res = mul vv, n
+          Value *Scale =
+              ConstantVector::getSplat(EC, ConstantInt::get(ScalarTy, Cnt));
+          LLVM_DEBUG(dbgs() << "SLP: Add (to-mul) " << Cnt << "of " << Vec
+                            << ". (HorRdx)\n");
+          ++NumVectorInstructions;
+          Vec = Builder.CreateMul(Vec, Scale);
+          break;
+        }
+        case RecurKind::Xor: {
+          // res = n % 2 ? 0 : vv
+          LLVM_DEBUG(dbgs()
+                     << "SLP: Xor " << Cnt << "of " << Vec << ". (HorRdx)\n");
+          if (Cnt % 2 == 0)
+            Vec = Constant::getNullValue(Vec->getType());
+          break;
+        }
+        case RecurKind::FAdd: {
+          // res = fmul v, n
+          Value *Scale =
+              ConstantVector::getSplat(EC, ConstantFP::get(ScalarTy, Cnt));
+          LLVM_DEBUG(dbgs() << "SLP: FAdd (to-fmul) " << Cnt << "of " << Vec
+                            << ". (HorRdx)\n");
+          ++NumVectorInstructions;
+          Vec = Builder.CreateFMul(Vec, Scale);
+          break;
+        }
+        case RecurKind::And:
+        case RecurKind::Or:
+        case RecurKind::SMax:
+        case RecurKind::SMin:
+        case RecurKind::UMax:
+        case RecurKind::UMin:
+        case RecurKind::FMax:
+        case RecurKind::FMin:
+        case RecurKind::FMaximum:
+        case RecurKind::FMinimum:
+          // res = vv
+          break;
+        case RecurKind::Mul:
+        case RecurKind::FMul:
+        case RecurKind::FMulAdd:
+        case RecurKind::IAnyOf:
+        case RecurKind::FAnyOf:
+        case RecurKind::None:
+          llvm_unreachable("Unexpected reduction kind for repeated scalar.");
+        }
+      }
+      // Combine Vec w...
[truncated]
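
The doc comment on the new `emitReduction` overload describes splitting the stored vectors into registers, combining them with the reduction operation as a vector operation, and then performing a single reduction. A simplified sketch of that idea for an integer add is shown below. It assumes every input is a whole number of registers wide and keeps one accumulator, whereas the patch also tracks types, scales and signedness; `Vec`, `reduceEach` and `reduceCombined` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<uint32_t>;

// Baseline: reduce every vector on its own, then add the scalar results.
inline uint32_t reduceEach(const std::vector<Vec> &vs) {
  uint32_t result = 0;
  for (const Vec &v : vs)
    for (uint32_t x : v)
      result += x;
  return result;
}

// Combined: split each vector into register-sized chunks, add the chunks
// lane by lane into one accumulator, then reduce the accumulator once.
inline uint32_t reduceCombined(const std::vector<Vec> &vs,
                               std::size_t regWidth) {
  assert(regWidth > 0 && "register width must be non-zero");
  Vec acc(regWidth, 0);
  for (const Vec &v : vs) {
    assert(v.size() % regWidth == 0 && "sketch assumes whole registers only");
    for (std::size_t base = 0; base < v.size(); base += regWidth)
      for (std::size_t lane = 0; lane < regWidth; ++lane)
        acc[lane] += v[base + lane]; // one lane-wise vector add per chunk
  }
  uint32_t result = 0;
  for (uint32_t x : acc) // the single final reduction
    result += x;
  return result;
}
```

Because unsigned addition is associative and commutative, both functions return the same value; the combined form trades all but one horizontal reduction for lane-wise adds, which is the saving the cost model changes in this patch try to account for.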

llvmbot · 2024-12-02T13:37:47Z

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes

AVX512, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%

Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 get truncated to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - same transformed reduction
JM/lencod - extra 4 x vectorization

RISC-V, SiFive-p670, -O3+LTO

Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%

execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same
transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions

Patch is 22.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118293.diff

5 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+8)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+1)
(modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+16)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+290-32)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 985ca1532e0149..f2f0e56a3f2014 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1584,6 +1584,10 @@ class TargetTransformInfo {
   /// split during legalization. Zero is returned when the answer is unknown.
   unsigned getNumberOfParts(Type *Tp) const;
 
+  /// \return true if \p Tp represent a type, fully occupying whole register,
+  /// false otherwise.
+  bool isFullSingleRegisterType(Type *Tp) const;
+
   /// \returns The cost of the address computation. For most targets this can be
   /// merged into the instruction indexing mode. Some targets might want to
   /// distinguish between address computation for memory operations on vector
@@ -2196,6 +2200,7 @@ class TargetTransformInfo::Concept {
                                            ArrayRef<Type *> Tys,
                                            TTI::TargetCostKind CostKind) = 0;
   virtual unsigned getNumberOfParts(Type *Tp) = 0;
+  virtual bool isFullSingleRegisterType(Type *Tp) const = 0;
   virtual InstructionCost
   getAddressComputationCost(Type *Ty, ScalarEvolution *SE, const SCEV *Ptr) = 0;
   virtual InstructionCost
@@ -2930,6 +2935,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   unsigned getNumberOfParts(Type *Tp) override {
     return Impl.getNumberOfParts(Tp);
   }
+  bool isFullSingleRegisterType(Type *Tp) const override {
+    return Impl.isFullSingleRegisterType(Tp);
+  }
   InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                             const SCEV *Ptr) override {
     return Impl.getAddressComputationCost(Ty, SE, Ptr);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 38aba183f6a173..ce6a96ea317ba7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -833,6 +833,7 @@ class TargetTransformInfoImplBase {
 
   // Assume that we have a register of the right size for the type.
   unsigned getNumberOfParts(Type *Tp) const { return 1; }
+  bool isFullSingleRegisterType(Type *Tp) const { return false; }
 
   InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,
                                             const SCEV *) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 98cbb4886642bf..9e7ce48f901dc5 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -2612,6 +2612,22 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return *LT.first.getValue();
   }
 
+  bool isFullSingleRegisterType(Type *Tp) const {
+    std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Tp);
+    if (!LT.first.isValid() || LT.first > 1)
+      return false;
+
+    if (auto *FTp = dyn_cast<FixedVectorType>(Tp);
+        Tp && LT.second.isFixedLengthVector()) {
+      // Check if the n x i1 fits fully into largest integer.
+      if (unsigned VF = LT.second.getVectorNumElements();
+          LT.second.getVectorElementType() == MVT::i1)
+        return DL.isLegalInteger(VF) && !DL.isLegalInteger(VF * 2);
+      return FTp == EVT(LT.second).getTypeForEVT(Tp->getContext());
+    }
+    return false;
+  }
+
   InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *,
                                             const SCEV *) {
     return 0;
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 1fb2b9836de0cc..f7ad9ed905e3a1 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1171,6 +1171,10 @@ unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
   return TTIImpl->getNumberOfParts(Tp);
 }
 
+bool TargetTransformInfo::isFullSingleRegisterType(Type *Tp) const {
+  return TTIImpl->isFullSingleRegisterType(Tp);
+}
+
 InstructionCost
 TargetTransformInfo::getAddressComputationCost(Type *Tp, ScalarEvolution *SE,
                                                const SCEV *Ptr) const {
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 7723442bc0fb6e..5df21b77643746 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -12080,7 +12080,11 @@ bool BoUpSLP::isTreeNotExtendable() const {
     TreeEntry &E = *VectorizableTree[Idx];
     if (!E.isGather())
       continue;
-    if (E.getOpcode() && E.getOpcode() != Instruction::Load)
+    if ((E.getOpcode() && E.getOpcode() != Instruction::Load) ||
+        (!E.getOpcode() &&
+         all_of(E.Scalars, IsaPred<ExtractElementInst, LoadInst>)) ||
+        (isa<ExtractElementInst>(E.Scalars.front()) &&
+         getSameOpcode(ArrayRef(E.Scalars).drop_front(), *TLI).getOpcode()))
       return false;
     if (isSplat(E.Scalars) || allConstant(E.Scalars))
       continue;
@@ -19174,6 +19178,9 @@ class HorizontalReduction {
   /// Checks if the optimization of original scalar identity operations on
   /// matched horizontal reductions is enabled and allowed.
   bool IsSupportedHorRdxIdentityOp = false;
+  /// Contains vector values for reduction including their scale factor and
+  /// signedness.
+  SmallVector<std::tuple<Value *, unsigned, bool>> VectorValuesAndScales;
 
   static bool isCmpSelMinMax(Instruction *I) {
     return match(I, m_Select(m_Cmp(), m_Value(), m_Value())) &&
@@ -19225,17 +19232,22 @@ class HorizontalReduction {
   static Value *createOp(IRBuilderBase &Builder, RecurKind Kind, Value *LHS,
                          Value *RHS, const Twine &Name, bool UseSelect) {
     unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(Kind);
+    Type *OpTy = LHS->getType();
+    assert(OpTy == RHS->getType() && "Expected LHS and RHS of same type");
     switch (Kind) {
     case RecurKind::Or:
-      if (UseSelect &&
-          LHS->getType() == CmpInst::makeCmpResultType(LHS->getType()))
-        return Builder.CreateSelect(LHS, Builder.getTrue(), RHS, Name);
+      if (UseSelect && OpTy == CmpInst::makeCmpResultType(OpTy))
+        return Builder.CreateSelect(
+            LHS,
+            ConstantInt::getAllOnesValue(CmpInst::makeCmpResultType(OpTy)),
+            RHS, Name);
       return Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, LHS, RHS,
                                  Name);
     case RecurKind::And:
-      if (UseSelect &&
-          LHS->getType() == CmpInst::makeCmpResultType(LHS->getType()))
-        return Builder.CreateSelect(LHS, RHS, Builder.getFalse(), Name);
+      if (UseSelect && OpTy == CmpInst::makeCmpResultType(OpTy))
+        return Builder.CreateSelect(
+            LHS, RHS,
+            ConstantInt::getNullValue(CmpInst::makeCmpResultType(OpTy)), Name);
       return Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode, LHS, RHS,
                                  Name);
     case RecurKind::Add:
@@ -20108,12 +20120,11 @@ class HorizontalReduction {
                                          SameValuesCounter, TrackedToOrig);
         }
 
-        Value *ReducedSubTree;
         Type *ScalarTy = VL.front()->getType();
         if (isa<FixedVectorType>(ScalarTy)) {
           assert(SLPReVec && "FixedVectorType is not expected.");
           unsigned ScalarTyNumElements = getNumElements(ScalarTy);
-          ReducedSubTree = PoisonValue::get(FixedVectorType::get(
+          Value *ReducedSubTree = PoisonValue::get(getWidenedType(
               VectorizedRoot->getType()->getScalarType(), ScalarTyNumElements));
           for (unsigned I : seq<unsigned>(ScalarTyNumElements)) {
             // Do reduction for each lane.
@@ -20131,30 +20142,32 @@ class HorizontalReduction {
             SmallVector<int, 16> Mask =
                 createStrideMask(I, ScalarTyNumElements, VL.size());
             Value *Lane = Builder.CreateShuffleVector(VectorizedRoot, Mask);
-            ReducedSubTree = Builder.CreateInsertElement(
-                ReducedSubTree,
-                emitReduction(Lane, Builder, TTI, RdxRootInst->getType()), I);
+            Value *Val =
+                createSingleOp(Builder, *TTI, Lane,
+                               OptReusedScalars && SameScaleFactor
+                                   ? SameValuesCounter.front().second
+                                   : 1,
+                               Lane->getType()->getScalarType() !=
+                                       VL.front()->getType()->getScalarType()
+                                   ? V.isSignedMinBitwidthRootNode()
+                                   : true, RdxRootInst->getType());
+            ReducedSubTree =
+                Builder.CreateInsertElement(ReducedSubTree, Val, I);
           }
+          VectorizedTree = GetNewVectorizedTree(VectorizedTree, ReducedSubTree);
         } else {
-          ReducedSubTree = emitReduction(VectorizedRoot, Builder, TTI,
-                                         RdxRootInst->getType());
+          Type *VecTy = VectorizedRoot->getType();
+          Type *RedScalarTy = VecTy->getScalarType();
+          VectorValuesAndScales.emplace_back(
+              VectorizedRoot,
+              OptReusedScalars && SameScaleFactor
+                  ? SameValuesCounter.front().second
+                  : 1,
+              RedScalarTy != ScalarTy->getScalarType()
+                  ? V.isSignedMinBitwidthRootNode()
+                  : true);
         }
-        if (ReducedSubTree->getType() != VL.front()->getType()) {
-          assert(ReducedSubTree->getType() != VL.front()->getType() &&
-                 "Expected different reduction type.");
-          ReducedSubTree =
-              Builder.CreateIntCast(ReducedSubTree, VL.front()->getType(),
-                                    V.isSignedMinBitwidthRootNode());
-        }
-
-        // Improved analysis for add/fadd/xor reductions with same scale factor
-        // for all operands of reductions. We can emit scalar ops for them
-        // instead.
-        if (OptReusedScalars && SameScaleFactor)
-          ReducedSubTree = emitScaleForReusedOps(
-              ReducedSubTree, Builder, SameValuesCounter.front().second);
 
-        VectorizedTree = GetNewVectorizedTree(VectorizedTree, ReducedSubTree);
         // Count vectorized reduced values to exclude them from final reduction.
         for (Value *RdxVal : VL) {
           Value *OrigV = TrackedToOrig.at(RdxVal);
@@ -20183,6 +20196,10 @@ class HorizontalReduction {
         continue;
       }
     }
+    if (!VectorValuesAndScales.empty())
+      VectorizedTree = GetNewVectorizedTree(
+          VectorizedTree,
+          emitReduction(Builder, *TTI, ReductionRoot->getType()));
     if (VectorizedTree) {
       // Reorder operands of bool logical op in the natural order to avoid
       // possible problem with poison propagation. If not possible to reorder
@@ -20317,6 +20334,28 @@ class HorizontalReduction {
   }
 
 private:
+  /// Checks if the given type \p Ty is a vector type, which does not occupy the
+  /// whole vector register or is expensive for extraction.
+  static bool isNotFullVectorType(const TargetTransformInfo &TTI, Type *Ty) {
+    return TTI.getNumberOfParts(Ty) == 1 && !TTI.isFullSingleRegisterType(Ty);
+  }
+
+  /// Creates the reduction from the given \p Vec vector value with the given
+  /// scale \p Scale and signedness \p IsSigned.
+  Value *createSingleOp(IRBuilderBase &Builder, const TargetTransformInfo &TTI,
+                        Value *Vec, unsigned Scale, bool IsSigned,
+                        Type *DestTy) {
+    Value *Rdx = emitReduction(Vec, Builder, &TTI, DestTy);
+    if (Rdx->getType() != DestTy->getScalarType())
+      Rdx = Builder.CreateIntCast(Rdx, DestTy, IsSigned);
+    // Improved analysis for add/fadd/xor reductions with same scale
+    // factor for all operands of reductions. We can emit scalar ops for
+    // them instead.
+    if (Scale > 1)
+      Rdx = emitScaleForReusedOps(Rdx, Builder, Scale);
+    return Rdx;
+  }
+
   /// Calculate the cost of a reduction.
   InstructionCost getReductionCost(TargetTransformInfo *TTI,
                                    ArrayRef<Value *> ReducedVals,
@@ -20359,6 +20398,22 @@ class HorizontalReduction {
       }
       return Cost;
     };
+    // Require reduction cost if:
+    // 1. This type is not a full register type and no other vectors with the
+    // same type in the storage (first vector with small type).
+    // 2. The storage does not have any vector with full vector use (first
+    // vector with full register use).
+    bool DoesRequireReductionOp =
+        !AllConsts &&
+        (VectorValuesAndScales.empty() ||
+         (isNotFullVectorType(*TTI, VectorTy) &&
+          none_of(VectorValuesAndScales,
+                  [&](const auto &P) {
+                    return std::get<0>(P)->getType() == VectorTy;
+                  })) ||
+         all_of(VectorValuesAndScales, [&](const auto &P) {
+           return isNotFullVectorType(*TTI, std::get<0>(P)->getType());
+         }));
     switch (RdxKind) {
     case RecurKind::Add:
     case RecurKind::Mul:
@@ -20382,7 +20437,7 @@ class HorizontalReduction {
           VectorCost += TTI->getScalarizationOverhead(
               VecTy, APInt::getAllOnes(ScalarTyNumElements), /*Insert*/ true,
               /*Extract*/ false, TTI::TCK_RecipThroughput);
-        } else {
+        } else if (DoesRequireReductionOp) {
           Type *RedTy = VectorTy->getElementType();
           auto [RType, IsSigned] = R.getRootNodeTypeWithNoCast().value_or(
               std::make_pair(RedTy, true));
@@ -20394,6 +20449,14 @@ class HorizontalReduction {
                 RdxOpcode, !IsSigned, RedTy, getWidenedType(RType, ReduxWidth),
                 FMF, CostKind);
           }
+        } else {
+          unsigned NumParts = TTI->getNumberOfParts(VectorTy);
+          unsigned RegVF = getPartNumElems(getNumElements(VectorTy), NumParts);
+          VectorCost +=
+              NumParts * TTI->getArithmeticInstrCost(
+                             RdxOpcode,
+                             getWidenedType(VectorTy->getScalarType(), RegVF),
+                             CostKind);
         }
       }
       ScalarCost = EvaluateScalarCost([&]() {
@@ -20410,8 +20473,19 @@ class HorizontalReduction {
     case RecurKind::UMax:
     case RecurKind::UMin: {
       Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
-      if (!AllConsts)
-        VectorCost = TTI->getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+      if (!AllConsts) {
+        if (DoesRequireReductionOp) {
+          VectorCost = TTI->getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+        } else {
+          // Check if the previous reduction already exists and account it as
+          // series of operations + single reduction.
+          unsigned NumParts = TTI->getNumberOfParts(VectorTy);
+          unsigned RegVF = getPartNumElems(getNumElements(VectorTy), NumParts);
+          auto *RegVecTy = getWidenedType(VectorTy->getScalarType(), RegVF);
+          IntrinsicCostAttributes ICA(Id, RegVecTy, {RegVecTy, RegVecTy}, FMF);
+          VectorCost += NumParts * TTI->getIntrinsicInstrCost(ICA, CostKind);
+        }
+      }
       ScalarCost = EvaluateScalarCost([&]() {
         IntrinsicCostAttributes ICA(Id, ScalarTy, {ScalarTy, ScalarTy}, FMF);
         return TTI->getIntrinsicInstrCost(ICA, CostKind);
@@ -20428,6 +20502,190 @@ class HorizontalReduction {
     return VectorCost - ScalarCost;
   }
 
+  /// Splits the values, stored in VectorValuesAndScales, into registers/free
+  /// sub-registers, combines them with the given reduction operation as a
+  /// vector operation and then performs single (small enough) reduction.
+  Value *emitReduction(IRBuilderBase &Builder, const TargetTransformInfo &TTI,
+                       Type *DestTy) {
+    Value *ReducedSubTree = nullptr;
+    // Creates reduction and combines with the previous reduction.
+    auto CreateSingleOp = [&](Value *Vec, unsigned Scale, bool IsSigned) {
+      Value *Rdx = createSingleOp(Builder, TTI, Vec, Scale, IsSigned, DestTy);
+      if (ReducedSubTree)
+        ReducedSubTree = createOp(Builder, RdxKind, ReducedSubTree, Rdx,
+                                  "op.rdx", ReductionOps);
+      else
+        ReducedSubTree = Rdx;
+    };
+    if (VectorValuesAndScales.size() == 1) {
+      const auto &[Vec, Scale, IsSigned] = VectorValuesAndScales.front();
+      CreateSingleOp(Vec, Scale, IsSigned);
+      return ReducedSubTree;
+    }
+    // Splits multivector value into per-register values.
+    auto SplitVector = [&](Value *Vec) {
+      auto *ScalarTy = cast<VectorType>(Vec->getType())->getElementType();
+      unsigned Sz = getNumElements(Vec->getType());
+      unsigned NumParts = TTI.getNumberOfParts(Vec->getType());
+      if (NumParts <= 1 || NumParts >= Sz ||
+          isNotFullVectorType(TTI, Vec->getType()))
+        return SmallVector<Value *>(1, Vec);
+      unsigned RegSize = getPartNumElems(Sz, NumParts);
+      auto *DstTy = getWidenedType(ScalarTy, RegSize);
+      SmallVector<Value *> Regs(NumParts);
+      for (unsigned Part : seq<unsigned>(NumParts))
+        Regs[Part] = Builder.CreateExtractVector(
+            DstTy, Vec, Builder.getInt64(Part * RegSize));
+      return Regs;
+    };
+    SmallMapVector<Type *, Value *, 4> VecOps;
+    // Scales Vec using given Cnt scale factor and then performs vector combine
+    // with previous value of VecOp.
+    auto CreateVecOp = [&](Value *Vec, unsigned Cnt) {
+      Type *ScalarTy = cast<VectorType>(Vec->getType())->getElementType();
+      // Scale Vec using given Cnt scale factor.
+      if (Cnt > 1) {
+        ElementCount EC = cast<VectorType>(Vec->getType())->getElementCount();
+        switch (RdxKind) {
+        case RecurKind::Add: {
+          if (ScalarTy == Builder.getInt1Ty() && ScalarTy != DestTy) {
+            unsigned VF = getNumElements(Vec->getType());
+            LLVM_DEBUG(dbgs() << "SLP: ctpop " << Cnt << "of " << Vec
+                              << ". (HorRdx)\n");
+            SmallVector<int> Mask(Cnt * VF, PoisonMaskElem);
+            for (unsigned I : seq<unsigned>(Cnt))
+              std::iota(std::next(Mask.begin(), VF * I),
+                        std::next(Mask.begin(), VF * (I + 1)), 0);
+            ++NumVectorInstructions;
+            Vec = Builder.CreateShuffleVector(Vec, Mask);
+            break;
+          }
+          // res = mul vv, n
+          Value *Scale =
+              ConstantVector::getSplat(EC, ConstantInt::get(ScalarTy, Cnt));
+          LLVM_DEBUG(dbgs() << "SLP: Add (to-mul) " << Cnt << "of " << Vec
+                            << ". (HorRdx)\n");
+          ++NumVectorInstructions;
+          Vec = Builder.CreateMul(Vec, Scale);
+          break;
+        }
+        case RecurKind::Xor: {
+          // res = n % 2 ? 0 : vv
+          LLVM_DEBUG(dbgs()
+                     << "SLP: Xor " << Cnt << "of " << Vec << ". (HorRdx)\n");
+          if (Cnt % 2 == 0)
+            Vec = Constant::getNullValue(Vec->getType());
+          break;
+        }
+        case RecurKind::FAdd: {
+          // res = fmul v, n
+          Value *Scale =
+              ConstantVector::getSplat(EC, ConstantFP::get(ScalarTy, Cnt));
+          LLVM_DEBUG(dbgs() << "SLP: FAdd (to-fmul) " << Cnt << "of " << Vec
+                            << ". (HorRdx)\n");
+          ++NumVectorInstructions;
+          Vec = Builder.CreateFMul(Vec, Scale);
+          break;
+        }
+        case RecurKind::And:
+        case RecurKind::Or:
+        case RecurKind::SMax:
+        case RecurKind::SMin:
+        case RecurKind::UMax:
+        case RecurKind::UMin:
+        case RecurKind::FMax:
+        case RecurKind::FMin:
+        case RecurKind::FMaximum:
+        case RecurKind::FMinimum:
+          // res = vv
+          break;
+        case RecurKind::Mul:
+        case RecurKind::FMul:
+        case RecurKind::FMulAdd:
+        case RecurKind::IAnyOf:
+        case RecurKind::FAnyOf:
+        case RecurKind::None:
+          llvm_unreachable("Unexpected reduction kind for repeated scalar.");
+        }
+      }
+      // Combine Vec w...
[truncated]
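
The `CreateVecOp` lambda in the hunk above folds a repeat count `Cnt` into a vector before it is combined with the others. A minimal per-lane sketch of that folding follows, assuming unsigned 32-bit lanes. `RdxKind` here is a local enum, and `scaleForRepeats` and `naiveRepeat` are hypothetical helper names, not LLVM APIs. The `i1` shuffle path and the floating-point kinds are left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>

// Local stand-in for the reduction kinds in the patch (not llvm::RecurKind).
enum class RdxKind { Add, Xor, And, Or, UMax, UMin, Mul };

// Folds Cnt repeats of the lane value V into a single value, mirroring the
// per-kind cases of the patch: add -> multiply, xor -> parity check,
// idempotent kinds -> unchanged. Mul is rejected, as in the patch.
std::uint32_t scaleForRepeats(RdxKind Kind, std::uint32_t V, unsigned Cnt) {
  if (Cnt <= 1)
    return V;
  switch (Kind) {
  case RdxKind::Add:
    return V * Cnt; // wraps modulo 2^32, like an IR `mul`
  case RdxKind::Xor:
    return Cnt % 2 == 0 ? 0u : V; // an even number of copies cancels out
  case RdxKind::And:
  case RdxKind::Or:
  case RdxKind::UMax:
  case RdxKind::UMin:
    return V; // idempotent: repeating the operand changes nothing
  case RdxKind::Mul:
    break;
  }
  throw std::invalid_argument("unexpected reduction kind for repeated scalar");
}

// Reference implementation: applies the reduction operation Cnt - 1 times.
std::uint32_t naiveRepeat(RdxKind Kind, std::uint32_t V, unsigned Cnt) {
  std::uint32_t Acc = V;
  for (unsigned I = 1; I < Cnt; ++I) {
    switch (Kind) {
    case RdxKind::Add:  Acc += V; break;
    case RdxKind::Xor:  Acc ^= V; break;
    case RdxKind::And:  Acc &= V; break;
    case RdxKind::Or:   Acc |= V; break;
    case RdxKind::UMax: Acc = std::max(Acc, V); break;
    case RdxKind::UMin: Acc = std::min(Acc, V); break;
    case RdxKind::Mul:  Acc *= V; break;
    }
  }
  return Acc;
}
```

In this sketch an even `Cnt` zeroes the xor case, which is what the `Cnt % 2 == 0` check in the hunk does.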

github-actions · 2024-12-02T13:40:25Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Created using spr 1.3.5

preames

There are no tests changed by this review.

alexey-bataev · 2024-12-02T18:51:42Z

Forgot to update, will fix it

github-actions · 2024-12-17T18:45:57Z

✅ With the latest revision this PR passed the undef deprecator.

alexey-bataev · 2025-01-06T14:38:29Z

Ping!

RKSimon

Please can you add a patch description and not just the perf changes?

RKSimon · 2025-01-06T16:11:24Z

llvm/include/llvm/Analysis/TargetTransformInfo.h

@@ -1611,6 +1611,10 @@ class TargetTransformInfo {
  /// split during legalization. Zero is returned when the answer is unknown.
  unsigned getNumberOfParts(Type *Tp) const;

+  /// \return true if \p Tp represent a type, fully occupying whole register,
+  /// false otherwise.


Improve the description as it doesn't seem to match the implementation in BasicTTIImpl.h
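
For context, the `i1` branch of `isFullSingleRegisterType` in the BasicTTIImpl.h hunk treats an `n x i1` vector as full only when `n` is a legal integer width and `2 * n` is not. A minimal sketch of that predicate follows. It assumes the legal widths are passed in as a plain list; `isLegalInteger` here is a hypothetical free function standing in for `DataLayout::isLegalInteger`, and `isFullMaskRegister` is an illustrative name.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for DataLayout::isLegalInteger: a width is legal
// when it appears in the target's list of native integer widths.
bool isLegalInteger(const std::vector<unsigned> &LegalWidths, unsigned Width) {
  return std::find(LegalWidths.begin(), LegalWidths.end(), Width) !=
         LegalWidths.end();
}

// Mirrors the i1 branch of the patch: VF x i1 counts as a full register only
// if VF bits form a legal integer and twice that width does not.
bool isFullMaskRegister(const std::vector<unsigned> &LegalWidths, unsigned VF) {
  return isLegalInteger(LegalWidths, VF) &&
         !isLegalInteger(LegalWidths, VF * 2);
}
```

With legal widths of 8, 16, 32 and 64, only `VF == 64` passes in this sketch. A `32 x i1` vector is rejected because a 64-bit integer is also legal.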

RKSimon · 2025-01-06T16:22:26Z

llvm/test/Transforms/SLPVectorizer/X86/external-used-across-reductions.ll

+; CHECK-NEXT:    [[RDX_OP19:%.*]] = add <2 x i64> [[RDX_OP18]], [[TMP11]]
+; CHECK-NEXT:    [[RDX_OP20:%.*]] = add <2 x i64> [[RDX_OP19]], [[TMP12]]
+; CHECK-NEXT:    [[RDX_OP21:%.*]] = add <2 x i64> [[RDX_OP20]], [[TMP13]]
+; CHECK-NEXT:    [[OP_RDX16:%.*]] = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> [[RDX_OP21]])


Why is all this better? It just feels like the reduction has been split into legal types.

Instead of 2 reductions, it now emits only 1.

But isn't all the additional extract/add precisely the same as the expansion of the original v8i64 reduction?

Not exactly. Instead of final

%red1 = reduce_add1 %v1
%red2 = reduce_add2 %v2
%res = add i64 %red1, %red2

it will emit

%v = add <2 x i32> %v1, %v2
%res = reduce_add %v

which is slightly better for X86 and significantly better for other targets
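
The two shapes compute the same value for wrapping integer adds, because addition is associative and commutative. A minimal sketch follows, assuming four unsigned 32-bit lanes so that overflow wraps as an IR `add` does. `Vec4`, `reduceAdd`, `separateReductions` and `combinedReduction` are illustrative names, not part of the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>

using Vec4 = std::array<std::uint32_t, 4>;

// Horizontal add of one vector, standing in for a vector add reduction.
std::uint32_t reduceAdd(const Vec4 &V) {
  return std::accumulate(V.begin(), V.end(), std::uint32_t{0});
}

// Original shape: reduce each vector separately, then add the two scalars.
std::uint32_t separateReductions(const Vec4 &V1, const Vec4 &V2) {
  return reduceAdd(V1) + reduceAdd(V2);
}

// Transformed shape: one lane-wise vector add, then a single reduction.
std::uint32_t combinedReduction(const Vec4 &V1, const Vec4 &V2) {
  Vec4 Sum{};
  for (std::size_t I = 0; I < Sum.size(); ++I)
    Sum[I] = V1[I] + V2[I];
  return reduceAdd(Sum);
}
```

The same regrouping is not automatically valid for floating-point adds, since reassociation can change the rounded result.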

I'm still of the opinion that this would be better:

; CHECK-NEXT: [[ADD:%.*]] = add <8 x i64> [[TMP5]], [[TMP7]]
; CHECK-NEXT: [[OP_RDX16:%.*]] = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> [[ADD]])

I compared the codegen and the throughput; they are almost the same. The modified version produces 3 fewer instructions and has a slightly lower rthroughput: 14.3 for the modified version versus 15.0 for the original version, for the generic CPU.
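
As a rough illustration of where such savings can come from, the sketch below counts operations under a toy model: every reduction is expanded as a halving tree of shuffle plus vector-op steps, and every vector fits in one register. This model is an assumption made for illustration only. It is not an LLVM cost model and it does not reproduce the measured numbers above. All names are hypothetical.

```cpp
// Toy operation counts; not a target cost model.
struct OpCount {
  unsigned VectorOps = 0;
  unsigned Shuffles = 0;
  unsigned ScalarOps = 0;
  unsigned total() const { return VectorOps + Shuffles + ScalarOps; }
};

// Number of halving steps in a tree reduction over a power-of-two Width.
unsigned treeSteps(unsigned Width) {
  unsigned Steps = 0;
  for (; Width > 1; Width /= 2)
    ++Steps;
  return Steps;
}

// One tree reduction per vector, then scalar ops to merge the results.
OpCount separateReductionOps(unsigned NumVectors, unsigned Width) {
  OpCount C;
  if (NumVectors == 0)
    return C;
  unsigned Steps = treeSteps(Width);
  C.VectorOps = NumVectors * Steps;
  C.Shuffles = NumVectors * Steps;
  C.ScalarOps = NumVectors - 1;
  return C;
}

// Lane-wise vector ops to merge the vectors, then one tree reduction.
OpCount combinedReductionOps(unsigned NumVectors, unsigned Width) {
  OpCount C;
  if (NumVectors == 0)
    return C;
  unsigned Steps = treeSteps(Width);
  C.VectorOps = (NumVectors - 1) + Steps;
  C.Shuffles = Steps;
  return C;
}
```

Under this model, two 4-lane vectors need 9 operations when reduced separately and 5 when combined first. Real targets with native reduction instructions will differ.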

hiraditya · 2025-01-17T07:22:40Z

need to rebase

mikaelholmen · 2025-02-14T11:16:24Z

I'm seeing a crash with this patch when compiling for my out-of-tree target:

opt: ../lib/IR/Instructions.cpp:1748: llvm::ShuffleVectorInst::ShuffleVectorInst(Value *, Value *, ArrayRef<int>, const Twine &, InsertPosition): Assertion `isValidOperands(V1, V2, Mask) && "Invalid shuffle vector instruction operands!"' failed.
[...]
 #9 0x0000555848bf2503 (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::DataLayout const&, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo const&, llvm::AssumptionCache*) SLPVectorizer.cpp:0:0
#10 0x0000555848bc1089 llvm::SLPVectorizerPass::vectorizeHorReduction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::SmallVectorImpl<llvm::WeakTrackingVH>&) (build-all/bin/opt+0x5faa089)
#11 0x0000555848bc14f2 llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (build-all/bin/opt+0x5faa4f2)
#12 0x0000555848bb601c llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (build-all/bin/opt+0x5f9f01c)
#13 0x0000555848bb2fd4 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (build-all/bin/opt+0x5f9bfd4)
#14 0x0000555848bb2557 llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (build-all/bin/opt+0x5f9b557)

I'll see if I can manage to reproduce for some in-tree target too.

alexey-bataev · 2025-02-14T11:53:50Z

I'm going to revert the patch, please try to prepare the reproducer

This reverts commit 2ad8166 to fix bug/miscompiles, reported in #118293 (comment) and #118293 (comment).

alexey-bataev · 2025-02-14T13:47:25Z

This change miscompiles one file in ffmpeg, for x86 and x86_64.

To reproduce, clone https://github.com/ffmpeg/ffmpeg, compile and run tests like this:
$ git clone https://github.com/ffmpeg/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../ffmpeg-samples
$ make fate-rsync
$ make -j$(nproc) fate-msnsiren
The miscompiled object file is libavcodec/siren.o.

Fixed

Reviewers: hiraditya, topperc, preames

Pull Request: #118293

luporl · 2025-02-14T14:16:11Z

This patch broke 2 bots:

clang crashes compiling test-suite/MultiSource/Benchmarks/Prolangs-C/gnugo/endgame.c:

clang-21: /home/leandro.lupori/llvm/llvm/lib/IR/Instructions.cpp:2112: static bool llvm::ShuffleVectorInst::isInsertSubvectorMask(ArrayRef<int>, int, int &, int &): Assertion `!Src0Elts.isZero() && !Src1Elts.isZero() && "2-source shuffle not found"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /home/leandro.lupori/stage1/bin/clang-21 -cc1 -triple aarch64-unknown-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -main-file-name endgame.c -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -enable-tlsdesc -target-cpu neoverse-512tvb -target-feature +v8.4a -target-feature +aes -target-feature +bf16 -target-feature +ccdp -target-feature +ccidx -target-feature +ccpp -target-feature +complxnum -target-feature +crc -target-feature +dotprod -target-feature +fp-armv8 -target-feature +fp16fml -target-feature +fullfp16 -target-feature +i8mm -target-feature +jsconv -target-feature +lse -target-feature +neon -target-feature +pauth -target-feature +perfmon -target-feature +rand -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +sm4 -target-feature +spe -target-feature +ssbs -target-feature +sve -target-abi aapcs -mvscale-max=2 -mvscale-min=2 -debugger-tuning=gdb -fdebug-compilation-dir=/home/tcwg-buildbot/worker/clang-aarch64-sve-vls/test/sandbox/build/MultiSource/Benchmarks/Prolangs-C/gnugo -fcoverage-compilation-dir=/home/tcwg-buildbot/worker/clang-aarch64-sve-vls/test/sandbox/build/MultiSource/Benchmarks/Prolangs-C/gnugo -sys-header-deps -D NDEBUG -O3 -ferror-limit 19 -fno-signed-char -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -vectorize-loops -vectorize-slp -mllvm -treat-scalable-fixed-error-as-warning=false -target-feature +outline-atomics -target-feature -fmv -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -x c endgame-c4c5c8.c
1.	<eof> parser at end of file
2.	Optimizer
3.	Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;no-prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-lookup;no-keep-loops;hoist-common-insts;no-hoist-loads-stores-with-cond-faulting;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;no-switch-to-lookup;keep-loops;no-hoist-common-insts;hoist-loads-stores-with-cond-faulting;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "endgame-c4c5c8.c"
4.	Running pass "vector-combine" on function "endgame"
 #0 0x0000b6bdf3972c40 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/leandro.lupori/stage1/bin/clang-21+0x8152c40)
 #1 0x0000b6bdf3970b8c llvm::sys::RunSignalHandlers() (/home/leandro.lupori/stage1/bin/clang-21+0x8150b8c)
 #2 0x0000b6bdf39732cc SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x0000e69b40ae48f8 (linux-vdso.so.1+0x8f8)
 #4 0x0000e69b404cf1f0 __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #5 0x0000e69b4048a67c gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x0000e69b40477130 abort ./stdlib/./stdlib/abort.c:81:7
 #7 0x0000e69b40483fd4 __assert_fail_base ./assert/./assert/assert.c:91:7
 #8 0x0000e69b4048404c (/lib/aarch64-linux-gnu/libc.so.6+0x3404c)
 #9 0x0000b6bdf3333278 llvm::ShuffleVectorInst::isInsertSubvectorMask(llvm::ArrayRef<int>, int, int&, int&) (/home/leandro.lupori/stage1/bin/clang-21+0x7b13278)
#10 0x0000b6bdf12c5b0c llvm::TargetTransformInfoImplCRTPBase<llvm::AArch64TTIImpl>::getInstructionCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) AArch64TargetMachine.cpp:0:0
#11 0x0000b6bdf2bff820 llvm::TargetTransformInfo::getInstructionCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) const (/home/leandro.lupori/stage1/bin/clang-21+0x73df820)
#12 0x0000b6bdf29c67d0 llvm::TargetTransformInfo::getInstructionCost(llvm::User const*, llvm::TargetTransformInfo::TargetCostKind) const CodeMetrics.cpp:0:0
#13 0x0000b6bdf527f8d4 (anonymous namespace)::VectorCombine::run()::$_0::operator()(llvm::Instruction&) const VectorCombine.cpp:0:0
#14 0x0000b6bdf5277e14 llvm::VectorCombinePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/leandro.lupori/stage1/bin/clang-21+0x9a57e14)
...

The full stacktrace is available in https://lab.llvm.org/buildbot/#/builders/143/builds/5429/steps/13/logs/stdio (search for ::PrintStackTrace)

The reproducer is attached.

repro.zip
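
The assertion above names `Src0Elts` and `Src1Elts` and requires both to be non-zero. In a `shufflevector` mask, an index below the source element count selects a lane from the first operand, an index at or above it selects from the second, and a negative index marks an undefined/poison lane. The sketch below is a simplified model of that classification, not LLVM's actual code; the names `MaskSources`, `classifyMaskSources` and `isTwoSourceMask` are hypothetical and exist only for illustration.

```cpp
#include <vector>

// Hypothetical summary of which shuffle operands a mask reads from.
struct MaskSources {
  bool usesSrc0 = false; // some lane selects from the first operand
  bool usesSrc1 = false; // some lane selects from the second operand
};

// Hypothetical helper: attribute every defined mask element to a source.
// numSrcElts is the element count of each shuffle operand.
MaskSources classifyMaskSources(const std::vector<int> &mask, int numSrcElts) {
  MaskSources sources;
  for (int m : mask) {
    if (m < 0)
      continue; // undefined/poison lane, belongs to neither source
    if (m < numSrcElts)
      sources.usesSrc0 = true;
    else
      sources.usesSrc1 = true;
  }
  return sources;
}

// A mask counts as "2-source" in this model only if it reads both operands.
bool isTwoSourceMask(const std::vector<int> &mask, int numSrcElts) {
  MaskSources sources = classifyMaskSources(mask, numSrcElts);
  return sources.usesSrc0 && sources.usesSrc1;
}
```

In this model, a mask such as `{0, 1, 4, 5}` with four source elements reads both operands, while `{0, 1, 2, 3}` or an all-negative mask does not, which is the kind of input the quoted assertion rejects.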

alexey-bataev · 2025-02-14T14:33:13Z

This patch broke 2 bots:

clang crashes compiling test-suite/MultiSource/Benchmarks/Prolangs-C/gnugo/endgame.c:

clang-21: /home/leandro.lupori/llvm/llvm/lib/IR/Instructions.cpp:2112: static bool llvm::ShuffleVectorInst::isInsertSubvectorMask(ArrayRef<int>, int, int &, int &): Assertion `!Src0Elts.isZero() && !Src1Elts.isZero() && "2-source shuffle not found"' failed.
...

The full stacktrace is available in https://lab.llvm.org/buildbot/#/builders/143/builds/5429/steps/13/logs/stdio (search for ::PrintStackTrace)

The reproducer is attached.

repro.zip

Should be fixed already with the new version of the patch

SLP vectorizer is able to combine several reductions from the list of
(potentially) reduced values with the different opcodes/values kind.
Currently, these reductions are handled independently of each other. But
instead the compiler can combine them into wide vector operations and
then perform only single reduction.
E.g, if the SLP vectorizer emits currently something like:

```
%r1 = reduce.add(<4 x i32> %v1)
%r2 = reduce.add(<4 x i32> %v2)
%r = add i32 %r1, %r2
```

it can be emitted as:

```
%v = add <4 x i32> %v1, %v2
%r = reduce.add(<4 x i32> %v)
```

It allows to improve the performance in some cases.

AVX512, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%

Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 gets trunced to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - transformed same reduction
JM/lencod - extra 4 x vectorization

RISC-V, SiFive-p670, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%

execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions

Reviewers: hiraditya, topperc, preames

Pull Request: llvm/llvm-project#118293
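
The rewrite described in the commit message relies on integer addition being associative and commutative, so reducing two vectors separately and adding the scalars gives the same result as adding the vectors lane-wise and reducing once. A minimal scalar model of that equivalence follows, assuming wrapping 32-bit integer lanes. The names `Vec4`, `reduceAdd`, `laneAdd`, `separateReductions` and `combinedReduction` are illustrative only and are not part of the SLP vectorizer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Models <4 x i32>; unsigned lanes so overflow wraps like LLVM's plain `add`.
using Vec4 = std::array<std::uint32_t, 4>;

// Horizontal sum of all lanes, modelling reduce.add(<4 x i32>).
std::uint32_t reduceAdd(const Vec4 &v) {
  return std::accumulate(v.begin(), v.end(), std::uint32_t{0});
}

// Lane-wise addition, modelling `add <4 x i32> %v1, %v2`.
Vec4 laneAdd(const Vec4 &a, const Vec4 &b) {
  Vec4 result{};
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = a[i] + b[i];
  return result;
}

// Before: two independent reductions joined by a scalar add.
std::uint32_t separateReductions(const Vec4 &v1, const Vec4 &v2) {
  return reduceAdd(v1) + reduceAdd(v2);
}

// After: one wide vector add followed by a single reduction.
std::uint32_t combinedReduction(const Vec4 &v1, const Vec4 &v2) {
  return reduceAdd(laneAdd(v1, v2));
}
```

Both forms return the same value for any inputs in this model, including ones that wrap around. Whether the combined form is actually cheaper depends on the target's cost model, which this sketch does not attempt to represent.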

luporl · 2025-02-14T16:09:04Z

Should be fixed already with the new version of the patch

Thanks for the fix.


This reverts commit 2ad8166 to fix the bugs/miscompiles reported in llvm#118293 (comment) and llvm#118293 (comment).

SLP vectorizer is able to combine several reductions from the list of (potentially) reduced values with the different opcodes/values kind. Currently, these reductions are handled independently of each other. But instead the compiler can combine them into wide vector operations and then perform only single reduction.
E.g, if the SLP vectorizer emits currently something like:

```
%r1 = reduce.add(<4 x i32> %v1)
%r2 = reduce.add(<4 x i32> %v2)
%r = add i32 %r1, %r2
```

it can be emitted as:

```
%v = add <4 x i32> %v1, %v2
%r = reduce.add(<4 x i32> %v)
```

It allows to improve the performance in some cases.

AVX512, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%

Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 gets trunced to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - transformed same reduction
JM/lencod - extra 4 x vectorization

RISC-V, SiFive-p670, -O3+LTO
Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%

execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions

Reviewers: hiraditya, topperc, preames

Pull Request: llvm#118293
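
For readers who want to convince themselves that the two IR forms in the commit message compute the same value, here is a minimal scalar model in C++. It is only a sketch, not code from the patch: `Vec4`, `reduceAdd`, `vectorAdd`, `separateReductions` and `combinedReduction` are hypothetical names, and `uint32_t` lanes stand in for `i32` with wrapping arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of <4 x i32>; unsigned lanes give wrapping adds.
using Vec4 = std::array<uint32_t, 4>;

// Models reduce.add(<4 x i32>): wrapping sum of all lanes.
uint32_t reduceAdd(const Vec4 &v) {
  uint32_t sum = 0;
  for (uint32_t lane : v)
    sum += lane;
  return sum;
}

// Models `add <4 x i32> %a, %b`: lane-wise wrapping add.
Vec4 vectorAdd(const Vec4 &a, const Vec4 &b) {
  Vec4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = a[i] + b[i];
  return out;
}

// Current form: two reductions, then one scalar add.
uint32_t separateReductions(const Vec4 &v1, const Vec4 &v2) {
  return reduceAdd(v1) + reduceAdd(v2);
}

// Proposed form: one vector add, then a single reduction.
uint32_t combinedReduction(const Vec4 &v1, const Vec4 &v2) {
  return reduceAdd(vectorAdd(v1, v2));
}
```

The two functions agree for every input because wrapping integer addition is associative and commutative, so the lanes can be summed in any order. The same regrouping is not automatically valid for floating-point adds, where changing the summation order can change the result.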

mikaelholmen · 2025-02-15T12:37:06Z

I'm seeing a crash with this patch when compiling for my out-of-tree target:

opt: ../lib/IR/Instructions.cpp:1748: llvm::ShuffleVectorInst::ShuffleVectorInst(Value *, Value *, ArrayRef<int>, const Twine &, InsertPosition): Assertion `isValidOperands(V1, V2, Mask) && "Invalid shuffle vector instruction operands!"' failed.
[...]
 #9 0x0000555848bf2503 (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::DataLayout const&, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo const&, llvm::AssumptionCache*) SLPVectorizer.cpp:0:0
#10 0x0000555848bc1089 llvm::SLPVectorizerPass::vectorizeHorReduction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::SmallVectorImpl<llvm::WeakTrackingVH>&) (build-all/bin/opt+0x5faa089)
#11 0x0000555848bc14f2 llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (build-all/bin/opt+0x5faa4f2)
#12 0x0000555848bb601c llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (build-all/bin/opt+0x5f9f01c)
#13 0x0000555848bb2fd4 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (build-all/bin/opt+0x5f9bfd4)
#14 0x0000555848bb2557 llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (build-all/bin/opt+0x5f9b557)

I'll see if I can manage to reproduce for some in-tree target too.

The problem I saw disappeared with 3b18d47.
Probably the same bug as reported in #127220.
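
As background for the assertion in the trace above, the following is a simplified standalone model of the kind of operand check a `shufflevector` has to pass for fixed-width vectors: both inputs must have the same vector type, and every mask index must either be the poison marker or select a lane from the concatenation of the two inputs. This is a sketch under those assumptions, not LLVM's implementation; `VecType`, `kPoisonLane` and `shuffleOperandsLookValid` are hypothetical names, and the thread does not say which condition was violated in the reported crash.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a fixed-width vector type such as <4 x i32>.
struct VecType {
  std::string elemType;
  int numElts;
  bool operator==(const VecType &o) const {
    return elemType == o.elemType && numElts == o.numElts;
  }
};

// Assumption for this sketch: -1 marks a poison (don't-care) mask lane.
constexpr int kPoisonLane = -1;

// Returns true when the operands would be acceptable in this simplified model.
bool shuffleOperandsLookValid(const VecType &v1, const VecType &v2,
                              const std::vector<int> &mask) {
  // Both inputs must be the same vector type.
  if (!(v1 == v2))
    return false;
  // Indices address the concatenation of v1 and v2.
  const int limit = 2 * v1.numElts;
  for (int idx : mask) {
    if (idx == kPoisonLane)
      continue;
    if (idx < 0 || idx >= limit)
      return false;
  }
  return true;
}
```

In this model, shuffling a `<4 x i32>` with an `<8 x i32>` is rejected, as is a mask index of 8 on two `<4 x i32>` inputs. A transform that widens only one side of a combined reduction could plausibly produce the first kind of mismatch.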

[𝘀𝗽𝗿] initial version

2f17bfb

Created using spr 1.3.5

llvmbot added vectorizers llvm:analysis llvm:transforms labels Dec 2, 2024

alexey-bataev requested review from preames, RKSimon and topperc December 2, 2024 13:37

Fix formatting

708daae

Created using spr 1.3.5

preames requested changes Dec 2, 2024

Tests update

ff0e058

Created using spr 1.3.5

llvmbot added the backend:AMDGPU label Dec 2, 2024

alexey-bataev mentioned this pull request Dec 10, 2024

x264 performance regression since 19.1.5 with rva22u64_v #119386

Open

Rebase

d526472

Created using spr 1.3.5

alexey-bataev added 2 commits December 17, 2024 20:44

Rebase

153a1ae

Created using spr 1.3.5

Rebase

ce71769

Created using spr 1.3.5

RKSimon reviewed Jan 6, 2025

Address comments

5d4d7c1

Created using spr 1.3.5

alexey-bataev requested a review from hiraditya January 15, 2025 17:22

alexey-bataev added 2 commits January 17, 2025 21:03

Rebase

f1a1be0

Created using spr 1.3.5

Rebase

1826483

Created using spr 1.3.5

alexey-bataev added a commit that referenced this pull request Feb 14, 2025

Revert "[SLP]Improved reduction cost/codegen"

afa3c10

This reverts commit 2ad8166 to fix bug/miscompiles, reported in #118293 (comment) and #118293 (comment).

vzakhari mentioned this pull request Feb 14, 2025

[SLPVectorizer] Crash: Invalid shuffle vector instruction operands! #127220

Closed

joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025

Revert "[SLP]Improved reduction cost/codegen"

f3c23de

This reverts commit 2ad8166 to fix bug/miscompiles, reported in llvm#118293 (comment) and llvm#118293 (comment).

sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025

Revert "[SLP]Improved reduction cost/codegen"

3f9081c

This reverts commit 2ad8166 to fix bug/miscompiles, reported in llvm#118293 (comment) and llvm#118293 (comment).
