
[RISCV][TTI] Reduce cost of a build_vector pattern #108419


Merged
merged 7 commits into llvm:main from pr-riscv-tti-buildvec-cost
Sep 20, 2024

Conversation

preames
Collaborator

@preames preames commented Sep 12, 2024

This change is actually two related changes, but they're very hard to meaningfully separate as the second balances the first, and yet doesn't do much good on its own.

First, we can reduce the cost of a build_vector pattern. Our current costing for this defers to generic insertelement costing, which isn't unreasonable, but also isn't correct. While inserting N elements requires N-1 slides and N vmv.s.x, doing the full build_vector only requires N vslide1down. (Note there are other cases that our build vector lowering can handle more cheaply; this is simply the easiest upper bound, which appears to be "good enough" for SLP costing purposes.)
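
For concreteness, here is a minimal sketch of the build_vector shape in question (illustrative IR, not taken from this patch's tests): a chain of insertelements assembling four scalars into a <4 x float>. The generic model charges each insertelement separately, while the RVV lowering can materialize the whole vector with roughly one vslide1down per element.

; Illustrative build_vector: four scalar floats assembled into a <4 x float>.
define <4 x float> @build_fp_vector(float %a, float %b, float %c, float %d) {
  %v0 = insertelement <4 x float> poison, float %a, i32 0
  %v1 = insertelement <4 x float> %v0, float %b, i32 1
  %v2 = insertelement <4 x float> %v1, float %c, i32 2
  %v3 = insertelement <4 x float> %v2, float %d, i32 3
  ret <4 x float> %v3
}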

Second, we need to tell SLP that calls don't preserve vector registers. Without this, SLP will vectorize scalar code which performs e.g. 4 x float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the costing works out that this is in fact the optimal choice - except that we don't actually have a <2 x float> @exp, and we unroll it during DAG lowering. This would be fine (or at least cost neutral) except that the libcall for the scalar @exp clobbers all vector registers. So the net effect is that we added a bunch of spills that SLP had no idea about. Thankfully, AArch64 has a similar problem, and has taught SLP how to reason about spill cost once the right TTI hook is implemented.
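
As an illustration, here is a hedged sketch (names and shapes are illustrative, not the original report) of the scalar pattern SLP would like to pack into two <2 x float> @llvm.exp calls. Since there is no vector exp libcall, each vector call is unrolled back into scalar libcalls during SelectionDAG, and because those libcalls clobber all vector registers, any packed values have to be spilled around them.

; Four scalar exp calls whose operands/results SLP may try to pack into
; <2 x float> vectors; the libcall behind each exp clobbers all vector
; registers, so packed values must be spilled across the calls.
define void @four_scalar_exps(ptr %src, ptr %dst) {
  %p1 = getelementptr inbounds float, ptr %src, i64 1
  %p2 = getelementptr inbounds float, ptr %src, i64 2
  %p3 = getelementptr inbounds float, ptr %src, i64 3
  %a = load float, ptr %src, align 4
  %b = load float, ptr %p1, align 4
  %c = load float, ptr %p2, align 4
  %d = load float, ptr %p3, align 4
  %ea = call float @llvm.exp.f32(float %a)
  %eb = call float @llvm.exp.f32(float %b)
  %ec = call float @llvm.exp.f32(float %c)
  %ed = call float @llvm.exp.f32(float %d)
  %q1 = getelementptr inbounds float, ptr %dst, i64 1
  %q2 = getelementptr inbounds float, ptr %dst, i64 2
  %q3 = getelementptr inbounds float, ptr %dst, i64 3
  store float %ea, ptr %dst, align 4
  store float %eb, ptr %q1, align 4
  store float %ec, ptr %q2, align 4
  store float %ed, ptr %q3, align 4
  ret void
}

declare float @llvm.exp.f32(float)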

Now, for some implications...

The SLP solution for spill costing has some inaccuracies. In particular, it basically just guesses whether an intrinsic will be lowered to a call or not, and can be wrong in both directions. It also has no mechanism to differentiate on calling convention.

This has the effect of making partial vectorization (i.e. starting in scalar) more profitable. In practice, the major effect of this is to make it more likely that SLP will vectorize part of a tree in an intersecting forest, and then vectorize the remaining tree once those uses have been removed.

This has the effect of biasing us slightly away from strided or indexed loads during vectorization - because the scalar cost is more accurately modeled, these instructions look relatively less profitable.

@llvmbot
Member

llvmbot commented Sep 12, 2024

@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Philip Reames (preames)

Changes



Patch is 136.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108419.diff

13 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+54)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+7)
  • (modified) llvm/test/Analysis/CostModel/RISCV/arith-fp.ll (+6-6)
  • (modified) llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll (+24-24)
  • (modified) llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll (+3-3)
  • (modified) llvm/test/Analysis/CostModel/RISCV/fp-trig-log-exp.ll (+21-21)
  • (modified) llvm/test/Analysis/CostModel/RISCV/gep.ll (+2-2)
  • (modified) llvm/test/Analysis/CostModel/RISCV/int-sat-math.ll (+20-20)
  • (modified) llvm/test/Analysis/CostModel/RISCV/rvv-intrinsics.ll (+16-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll (+372-304)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/mixed-extracts-types.ll (+4-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/remarks-insert-into-small-vector.ll (+1-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll (+15-7)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index e809e15eacf696..91430bf6999330 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -616,6 +616,43 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
   return BaseT::getShuffleCost(Kind, Tp, Mask, CostKind, Index, SubTp);
 }
 
+static unsigned isM1OrSmaller(MVT VT) {
+  RISCVII::VLMUL LMUL = RISCVTargetLowering::getLMUL(VT);
+  return (LMUL == RISCVII::VLMUL::LMUL_F8 ||
+          LMUL == RISCVII::VLMUL::LMUL_F4 ||
+          LMUL == RISCVII::VLMUL::LMUL_F2 ||
+          LMUL == RISCVII::VLMUL::LMUL_1);
+}
+
+InstructionCost RISCVTTIImpl::getScalarizationOverhead(
+    VectorType *Ty, const APInt &DemandedElts, bool Insert, bool Extract,
+    TTI::TargetCostKind CostKind) {
+  if (isa<ScalableVectorType>(Ty))
+    return InstructionCost::getInvalid();
+
+  // A build_vector (which is m1 sized or smaller) can be done in no
+  // worse than one vslide1down.vx per element in the type.  We could
+  // in theory do an explode_vector in the inverse manner, but our
+  // lowering today does not have a first class node for this pattern.
+  InstructionCost Cost =
+      BaseT::getScalarizationOverhead(Ty, DemandedElts, Insert, Extract,
+                                      CostKind);
+  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
+  if (Insert && !Extract && LT.first.isValid() && LT.second.isVector() &&
+      Ty->getScalarSizeInBits() != 1) {
+    MVT ContainerVT = LT.second;
+    if (ContainerVT.isFixedLengthVector())
+      ContainerVT = TLI->getContainerForFixedLengthVector(ContainerVT);
+    if (isM1OrSmaller(ContainerVT)) {
+      InstructionCost BV = cast<FixedVectorType>(Ty)->getNumElements();
+      if (BV < Cost)
+        Cost = BV;
+    }
+  }
+  return Cost;
+}
+
+
 InstructionCost
 RISCVTTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src, Align Alignment,
                                     unsigned AddressSpace,
@@ -767,6 +804,23 @@ InstructionCost RISCVTTIImpl::getStridedMemoryOpCost(
   return NumLoads * MemOpCost;
 }
 
+InstructionCost
+RISCVTTIImpl::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) {
+  // FIXME: This is a property of the default vector convention, not
+  // all possible calling conventions.  Fixing that will require
+  // some TTI API and SLP rework.
+  InstructionCost Cost = 0;
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  for (auto *Ty : Tys) {
+    if (!Ty->isVectorTy())
+      continue;
+    Align A = DL.getPrefTypeAlign(Ty);
+    Cost += getMemoryOpCost(Instruction::Store, Ty, A, 0, CostKind) +
+            getMemoryOpCost(Instruction::Load, Ty, A, 0, CostKind);
+  }
+  return Cost;
+}
+
 // Currently, these represent both throughput and codesize costs
 // for the respective intrinsics.  The costs in this table are simply
 // instruction counts with the following adjustments made:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 763b89bfec0a66..092601fdb61f64 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -149,6 +149,11 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                  ArrayRef<const Value *> Args = std::nullopt,
                                  const Instruction *CxtI = nullptr);
 
+  InstructionCost getScalarizationOverhead(VectorType *Ty,
+                                           const APInt &DemandedElts,
+                                           bool Insert, bool Extract,
+                                           TTI::TargetCostKind CostKind);
+
   InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                         TTI::TargetCostKind CostKind);
 
@@ -169,6 +174,8 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                          TTI::TargetCostKind CostKind,
                                          const Instruction *I);
 
+  InstructionCost getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys);
+
   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                    TTI::CastContextHint CCH,
                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll b/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
index 5236f5a3bae954..208dff19a3156c 100644
--- a/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
@@ -329,9 +329,9 @@ define i32 @frem() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %F32 = frem float undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %F64 = frem double undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F16 = frem <1 x half> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F16 = frem <2 x half> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F16 = frem <4 x half> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F16 = frem <8 x half> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F16 = frem <2 x half> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4F16 = frem <4 x half> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V8F16 = frem <8 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 63 for instruction: %V16F16 = frem <16 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 127 for instruction: %V32F16 = frem <32 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F16 = frem <vscale x 1 x half> undef, undef
@@ -341,8 +341,8 @@ define i32 @frem() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV16F16 = frem <vscale x 16 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV32F16 = frem <vscale x 32 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F32 = frem <1 x float> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F32 = frem <2 x float> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F32 = frem <4 x float> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F32 = frem <2 x float> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4F32 = frem <4 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F32 = frem <8 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 63 for instruction: %V16F32 = frem <16 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F32 = frem <vscale x 1 x float> undef, undef
@@ -351,7 +351,7 @@ define i32 @frem() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV8F32 = frem <vscale x 8 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV16F32 = frem <vscale x 16 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F64 = frem <1 x double> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F64 = frem <2 x double> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F64 = frem <2 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F64 = frem <4 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F64 = frem <8 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F64 = frem <vscale x 1 x double> undef, undef
diff --git a/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll b/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
index f37cd99e803ec9..8e0dfd3344729a 100644
--- a/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
@@ -42,35 +42,35 @@ define i32 @masked_gather() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4I8 = call <4 x i8> @llvm.masked.gather.v4i8.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2I8 = call <2 x i8> @llvm.masked.gather.v2i8.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V1I8 = call <1 x i8> @llvm.masked.gather.v1i8.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V8F64.u = call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x double> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V4F64.u = call <4 x double> @llvm.masked.gather.v4f64.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x double> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2F64.u = call <2 x double> @llvm.masked.gather.v2f64.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8F64.u = call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4F64.u = call <4 x double> @llvm.masked.gather.v4f64.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F64.u = call <2 x double> @llvm.masked.gather.v2f64.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F64.u = call <1 x double> @llvm.masked.gather.v1f64.v1p0(<1 x ptr> undef, i32 2, <1 x i1> undef, <1 x double> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 68 for instruction: %V16F32.u = call <16 x float> @llvm.masked.gather.v16f32.v16p0(<16 x ptr> undef, i32 2, <16 x i1> undef, <16 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8F32.u = call <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4F32.u = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2F32.u = call <2 x float> @llvm.masked.gather.v2f32.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16F32.u = call <16 x float> @llvm.masked.gather.v16f32.v16p0(<16 x ptr> undef, i32 2, <16 x i1> undef, <16 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8F32.u = call <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4F32.u = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F32.u = call <2 x float> @llvm.masked.gather.v2f32.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F32.u = call <1 x float> @llvm.masked.gather.v1f32.v1p0(<1 x ptr> undef, i32 2, <1 x i1> undef, <1 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 140 for instruction: %V32F16.u = call <32 x half> @llvm.masked.gather.v32f16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 70 for instruction: %V16F16.u = call <16 x half> @llvm.masked.gather.v16f16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %V8F16.u = call <8 x half> @llvm.masked.gather.v8f16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4F16.u = call <4 x half> @llvm.masked.gather.v4f16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2F16.u = call <2 x half> @llvm.masked.gather.v2f16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V32F16.u = call <32 x half> @llvm.masked.gather.v32f16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16F16.u = call <16 x half> @llvm.masked.gather.v16f16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8F16.u = call <8 x half> @llvm.masked.gather.v8f16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4F16.u = call <4 x half> @llvm.masked.gather.v4f16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F16.u = call <2 x half> @llvm.masked.gather.v2f16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x half> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F16.u = call <1 x half> @llvm.masked.gather.v1f16.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V8I64.u = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> undef, i32 4, <8 x i1> undef, <8 x i64> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V4I64.u = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> undef, i32 4, <4 x i1> undef, <4 x i64> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2I64.u = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> undef, i32 4, <2 x i1> undef, <2 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8I64.u = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> undef, i32 4, <8 x i1> undef, <8 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4I64.u = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> undef, i32 4, <4 x i1> undef, <4 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2I64.u = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> undef, i32 4, <2 x i1> undef, <2 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1I64.u = call <1 x i64> @llvm.masked.gather.v1i64.v1p0(<1 x ptr> undef, i32 4, <1 x i1> undef, <1 x i64> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 68 for instruction: %V16I32.u = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8I32.u = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4I32.u = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2I32.u = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16I32.u = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8I32.u = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4I32.u = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2I32.u = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1I32.u = call <1 x i32> @llvm.masked.gather.v1i32.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 140 for instruction: %V32I16.u = call <32 x i16> @llvm.masked.gather.v32i16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 70 for instruction: %V16I16.u = call <16 x i16> @llvm.masked.gather.v16i16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %V8I16.u = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4I16.u = call <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2I16.u = call <2 x i16> @llvm.masked.gather.v2i16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V32I16.u = call <32 x i16> @llvm.masked.gather.v32i16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16I16.u = call <16 x i16> @llvm.masked.gather.v16i16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8I16.u = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4I16.u = call <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2I16.u = call <2 x i16> @llvm.masked.gather.v2i16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1I16.u = call <1 x i16> @llvm.masked.gather.v1i16.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 0
 ;
diff --git a/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll b/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
index 71e98e5acde89e..a0b5386c1544c4 100644
--- a/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
@@ -96,8 +96,8 @@ declare <vscale x 8 x double> @llvm.sqrt.nvx8f64(<vscale x 8 x double>)
 define void @pow() {
 ; CHECK-LABEL: 'pow'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.pow.f32(float undef, float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.pow.v2f32(<2 x float> undef, <2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.pow.v4f32(<4 x float> undef, <4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 ...
[truncated]


github-actions bot commented Sep 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

preames added a commit to preames/llvm-project that referenced this pull request Sep 13, 2024
If we know an exact VLEN, then the index is effectively modulo the
number of elements in a single vector register.  Our lowering performs
this subvector optimization.

A bit of context.  This change may look a bit strange on its own
given we are currently *not* scaling insert/extract cost by LMUL.
This costing decision needs to change, but is very intertwined with
SLP profitability, and is thus a bit hard to adjust.  I'm hoping
that llvm#108419 will let me
start to untangle this.  This change is basically a case of finding
a subset I can tackle before other dependencies are in place which
does no real harm in the meantime.
  if (Insert && !Extract && LT.first.isValid() && LT.second.isVector() &&
      Ty->getScalarSizeInBits() != 1) {
    MVT ContainerVT = LT.second;
    if (ContainerVT.isFixedLengthVector())
Contributor

Will ContainerVT always be a fixed length vector here?

Collaborator Author

Not if this API were called from the LoopVectorizer.

Contributor

We're bailing on scalable vectors at the start of the function, and we cast Ty to a fixed-length vector below. Fixed-length vectors are legal, so surely getTypeLegalizationCost won't legalize a fixed-length type to a scalable one?

preames added a commit that referenced this pull request Sep 17, 2024
#108595)

If we know an exact VLEN, then the index is effectively modulo the
number of elements in a single vector register. Our lowering performs
this subvector optimization.

A bit of context. This change may look a bit strange on its own given
we are currently *not* scaling insert/extract cost by LMUL. This costing
decision needs to change, but is very intertwined with SLP
profitability, and is thus a bit hard to adjust. I'm hoping that
#108419 will let me start to
untangle this. This change is basically a case of finding a subset I can
tackle before other dependencies are in place which does no real harm in
the meantime.
Contributor

@lukel97 lukel97 left a comment

I'm seeing a 0.88% regression on 511.povray_r on the BPI F3 after applying this, and it's fairly reproducible (< 0.1% stddev). Looking through the codegen changes, it looks like we're now avoiding partial vectorization in some places where we e.g. exploded a vector to do a bunch of exp intrinsic calls, spilling the vector registers exactly as you described.

I would have thought that avoiding these vector spills would have been more performant; I'm not sure why it's turning out to be slower in the scalar form. Do we need to discount build_vectors a bit more to get it to partially vectorize these parts again?

To clarify, I think the changes in this PR are the right thing to do, I just want to point out the interaction with SLP. I'm running the other benchmarks now to see if they're also affected.

@preames
Collaborator Author

preames commented Sep 18, 2024

I'm seeing a 0.88% regression on 511.povray_r on the BPI F3 after applying this, and it's fairly reproducible (< 0.1% stddev). Looking through the codegen changes, it looks like we're now avoiding partial vectorization in some places where we e.g. exploded a vector to do a bunch of exp intrinsic calls, spilling the vector registers exactly as you described.

Just to confirm, you're looking at cycle count right? What routine are you seeing this in? I'm looking at an LTO build of povray, and not seeing any heavy use of the @exp routine - except indirectly through a function pointer table. Is your build -Ofast -flto=auto? Or something else?

I would have thought that avoiding these vector spills would have been more performant; I'm not sure why it's turning out to be slower in the scalar form. Do we need to discount build_vectors a bit more to get it to partially vectorize these parts again?

Either that, or we're exposing some other issue. If you can show me the relevant areas, I'll try to take a look.

To clarify, I think the changes in this PR are the right thing to do, I just want to point out the interaction with SLP. I'm running the other benchmarks now to see if they're also affected.

Ack.

@lukel97
Contributor

lukel97 commented Sep 18, 2024

Just to confirm, you're looking at cycle count right? What routine are you seeing this in? I'm looking at an LTO build of povray, and not seeing any heavy use of the @exp routine - except indirectly through a function pointer table. Is your build -Ofast -flto=auto? Or something else?

This is measuring cycle count, with -O3 -mcpu=spacemit-x60 on the train dataset. I'll queue up another run with -Ofast and LTO.

One example I found was in pov::compute_backtrace_texture(float*, pov::Texture_Struct*, double*, double*, pov::Ray_Struct*, double, pov::istk_entry*) (_ZN3povL25compute_backtrace_textureEPfPNS_14Texture_StructEPdS3_PNS_10Ray_StructEdPNS_10istk_entryE)

-       flw     fa5, 36(s1)
-       fld     fs3, %pcrel_lo(.Lpcrel_hi250)(a1)
-       fld     fs4, 0(s10)
-       fcvt.d.s        fa5, fa5
-       fsub.d  fa5, fs3, fa5
-       fneg.d  fa5, fa5
-       fmul.d  fa5, fs4, fa5
-       fdiv.d  fa0, fa5, fs2
-       call    exp
-       flw     fa5, 40(s1)
-       fmv.d   fs1, fa0
-       fcvt.d.s        fa5, fa5
-       fsub.d  fa5, fs3, fa5
-       fneg.d  fa5, fa5
-       fmul.d  fa5, fs4, fa5
-       fdiv.d  fa0, fa5, fs2
-       call    exp
-       vsetivli        zero, 2, e64, m1, ta, ma
-       vfmv.v.f        v8, fs1
-       flw     fa5, 44(s1)
-       vfslide1down.vf v8, v8, fa0
-       vfmul.vf        v8, v8, fs0
-       vsetvli zero, zero, e32, mf2, ta, ma
-       vfncvt.f.f.w    v9, v8
-       csrr    a0, vlenb
-       add     a0, a0, sp
-       addi    a0, a0, 2047
-       addi    a0, a0, 65
-       vs1r.v  v9, (a0)                        # Unknown-size Folded Spill
-       fcvt.d.s        fa5, fa5
-       fsub.d  fa5, fs3, fa5
-       fneg.d  fa5, fa5
-       fmul.d  fa5, fs4, fa5
-       fdiv.d  fa0, fa5, fs2
-       call    exp
-       csrr    a0, vlenb
-       add     a0, a0, sp
-       addi    a0, a0, 2047
-       addi    a0, a0, 65
-       vl1r.v  v9, (a0)                        # Unknown-size Folded Reload
-       fmul.d  fa5, fa0, fs0
-       fcvt.s.d        fs0, fa5
-       vsetivli        zero, 2, e32, mf2, ta, ma
+       flw     fa5, 36(s1)
+       fld     fs4, %pcrel_lo(.Lpcrel_hi250)(s0)
+       fld     fs5, 0(s10)
+       fcvt.d.s        fa5, fa5
+       fsub.d  fa5, fs4, fa5
+       fneg.d  fa5, fa5
+       fmul.d  fa5, fs5, fa5
+       fdiv.d  fa0, fa5, fs2
+       call    exp
+       flw     fa5, 40(s1)
+       fmul.d  fa4, fa0, fs1
+       fcvt.s.d        fs0, fa4
+       fcvt.d.s        fa5, fa5
+       fsub.d  fa5, fs4, fa5
+       fneg.d  fa5, fa5
+       fmul.d  fa5, fs5, fa5
+       fdiv.d  fa0, fa5, fs2
+       call    exp
+       flw     fa5, 44(s1)
+       fmul.d  fa4, fa0, fs1
+       fcvt.s.d        fs3, fa4
+       fcvt.d.s        fa5, fa5
+       fsub.d  fa5, fs4, fa5
+       fneg.d  fa5, fa5
+       fmul.d  fa5, fs5, fa5
+       fdiv.d  fa0, fa5, fs2
+       call    exp
+       fmul.d  fa5, fa0, fs1

But at the same time, in pov::do_light(pov::Light_Source_Struct*, double*, pov::Ray_Struct*, pov::Ray_Struct*, double*, float*) (_ZN3povL8do_lightEPNS_19Light_Source_StructEPdPNS_10Ray_StructES4_S2_Pf) we actually go in the other direction

-       fneg.d  fa5, fa5
-       fsd     fa5, 24(s0)
-       fneg.d  fa5, fa4
-       fsd     fa5, 32(s0)
+       vsetivli        zero, 2, e64, m1, ta, ma
+       vfmv.v.f        v8, fa5
+       vfslide1down.vf v8, v8, fa4
+       vfneg.v v8, v8
+       vse64.v v8, (s5)

Unfortunately none of the hot *_Intersection methods seem to be affected, instead it's a large number of cold functions that are slightly perturbed.

I'm really not sure how to interpret these changes. If the rest of the SPEC benchmarks are OK, I would be fine just chalking this up to SLP "noise".

@lukel97
Contributor

lukel97 commented Sep 19, 2024

Just a quick update on the benchmarking: with LTO the regression disappears. 511.povray_r is actually significantly faster with LTO, by about 5.5%. I'm rerunning the benchmarks again with LTO, but not with -Ofast; -ffast-math seems to cause the tests to fail.

@preames
Collaborator Author

preames commented Sep 19, 2024

One example I found was in pov::compute_backtrace_texture(float*, pov::Texture_Struct*, double*, double*, pov::Ray_Struct*, double, pov::istk_entry*) (_ZN3povL25compute_backtrace_textureEPfPNS_14Texture_StructEPdS3_PNS_10Ray_StructEdPNS_10istk_entryE)

I spent some time looking at this routine yesterday, and am still digging today. I've been focusing less on what the current codegen looks like than on what optimal code for this sequence might look like. (Mostly because I was looking at a differently compiled binary.) I see a couple of possible improvements which would help both before and after significantly, but none of them are easy. I'm going to switch and look more closely at the regression today.

tmsri pushed a commit to tmsri/llvm-project that referenced this pull request Sep 19, 2024
llvm#108595)

If we know an exact VLEN, then the index is effectively modulo the
number of elements in a single vector register. Our lowering performs
this subvector optimization.

A bit of context. This change may look a bit strange on its own given
we are currently *not* scaling insert/extract cost by LMUL. This costing
decision needs to change, but is very intertwined with SLP
profitability, and is thus a bit hard to adjust. I'm hoping that
llvm#108419 will let me start to
untangle this. This change is basically a case of finding a subset I can
tackle before other dependencies are in place which does no real harm in
the meantime.
@lukel97
Contributor

lukel97 commented Sep 20, 2024

Here's the benchmark diffs with LTO: https://lnt.lukelau.me/db_default/v4/nts/8?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=7

The povray regression is gone; the only noticeable change left is a 2.08% regression in 541.leela_r. In the hottest function FastState::play_random_move() there are a few more places where we now vectorize to a vredor.vs:

	vsetivli	zero, 0x1, e64, m1, ta, ma
	vmv.s.x	v8, a2
      	vsetivli	zero, 0x4, e16, mf4, ta, ma
	vmseq.vi	v0, v8, 0x1
      	slliw	a2, t0, 0x8
      	vsetvli	zero, zero, e32, mf2, ta, mu
      	vmv.v.i	v8, 0x0
      	ld	a4, 0x38(sp)
      	vle32.v	v8, (a4), v0.t
      	slliw	a4, a6, 0x6
	slliw	a5, t1, 0x4
      	slliw	a3, a3, 0x2
	vredor.vs	v8, v8, v8
	vmv.x.s	s1, v8

We're doing a packed e16 build vector which is then converted to a mask vector.
I think we can avoid the vmv.v.i if we move the masking from the vle32.v to the vredor.vs. But that's orthogonal to this PR and I don't think the vectorized code is bad per se, so I think this is fine.

Contributor

@lukel97 lukel97 left a comment

Modulo the ContainerVT nit LGTM

@preames
Collaborator Author

preames commented Sep 20, 2024

We're doing a packed e16 build vector which is then converted to a mask vector.

Not quite. I found the surrounding code, and I'm pretty sure what we originally had was a build vector of variable i1, which was lowered to an i16 build vector and a comparison against zero. Which is quite odd, as this patch shouldn't be impacting the costing of a build_vector.

I think we can avoid the vmv.v.i if we move the masking from the vle32.v to the vredor.vs. But that's orthogonal to this PR and I don't think the vectorized code is bad per se, so I think this is fine.

I agree on both points. I would assume the original code had a vselect with the neutral value after the load and before the reduction.
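
For reference, here is a hedged sketch of the kind of IR shape this exchange seems to describe, reconstructed from the assembly above rather than taken from leela itself: a build vector of variable i1 conditions feeding a masked load whose result is OR-reduced, with a zero passthrough standing in for the vselect against the neutral value.

; Hypothetical reconstruction: a <4 x i1> mask built from scalar
; conditions, used by a masked load, then OR-reduced.
define i32 @mask_buildvec_reduce(ptr %p, i1 %c0, i1 %c1, i1 %c2, i1 %c3) {
  %m0 = insertelement <4 x i1> poison, i1 %c0, i32 0
  %m1 = insertelement <4 x i1> %m0, i1 %c1, i32 1
  %m2 = insertelement <4 x i1> %m1, i1 %c2, i32 2
  %m3 = insertelement <4 x i1> %m2, i1 %c3, i32 3
  %v = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %p, i32 4, <4 x i1> %m3, <4 x i32> zeroinitializer)
  %r = call i32 @llvm.vector.reduce.or.v4i32(<4 x i32> %v)
  ret i32 %r
}

declare <4 x i32> @llvm.masked.load.v4i32.p0(ptr, i32 immarg, <4 x i1>, <4 x i32>)
declare i32 @llvm.vector.reduce.or.v4i32(<4 x i32>)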

@preames preames merged commit 7f6bbb3 into llvm:main Sep 20, 2024
5 of 6 checks passed
@preames preames deleted the pr-riscv-tti-buildvec-cost branch September 20, 2024 15:34
@preames
Collaborator Author

preames commented Sep 20, 2024

We're doing a packed e16 build vector which is then converted to a mask vector.

Not quite. I found the surrounding code, and I'm pretty sure what we originally had was a build vector of variable i1, which was lowered to an i16 build vector and a comparison against zero. Which is quite odd, as this patch shouldn't be impacting the costing of a build_vector.

Just noting that I also see a large number of instances of this pattern in leela without this patch. Looking at them, I'm less sure if this is lowering, or a straightforward vectorization of the original code. It does look like something we could improve, both from the perspective of the i1 vector creation and the fact that most of the masks are indexed loads (which aren't being vectorized at all).

@preames
Collaborator Author

preames commented Sep 20, 2024

I figured out the root cause of the Leela thing, and filed a ticket so we don't forget. This is much sillier than we'd first realized. Vectorizing this is quite a terrible result, but it seems to have little to do with this patch. #109466
