[IR][LangRef] Add partial reduction add intrinsic #94499

Merged: 12 commits, Jul 4, 2024
Changes from 3 commits
35 changes: 33 additions & 2 deletions llvm/docs/LangRef.rst
@@ -14250,7 +14250,7 @@ Arguments:
""""""""""
The first 4 arguments are similar to ``llvm.instrprof.increment``. The indexing
is specific to callsites, meaning callsites are indexed from 0, independent from
the indexes used by the other intrinsics (such as
the indexes used by the other intrinsics (such as
``llvm.instrprof.increment[.step]``).

The last argument is the called value of the callsite this intrinsic precedes.
@@ -14264,7 +14264,7 @@ a buffer LLVM can use to perform counter increments (i.e. the lowering of
``llvm.instrprof.increment[.step]``. The address range following the counter
buffer, ``<num-counters>`` x ``sizeof(ptr)`` - sized, is expected to contain
pointers to contexts of functions called from this function ("subcontexts").
LLVM does not dereference into that memory region, just calculates GEPs.
LLVM does not dereference into that memory region, just calculates GEPs.
Collaborator:

nit: unrelated whitespace change.


The lowering of ``llvm.instrprof.callsite`` consists of:

@@ -19209,6 +19209,37 @@ will be on any later loop iteration.
This intrinsic will only return 0 if the input count is also 0. A non-zero input
count will produce a non-zero result.

'``llvm.experimental.vector.partial.reduce.add.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""
This is an overloaded intrinsic.

::

declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32> %accum, <8 x i32> %in)
declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v16i32(<4 x i32> %accum, <16 x i32> %in)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv8i32(<vscale x 4 x i32> %accum, <vscale x 8 x i32> %in)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv16i32(<vscale x 4 x i32> %accum, <vscale x 16 x i32> %in)

Overview:
"""""""""

The '``llvm.experimental.vector.partial.reduce.add.*``' intrinsics perform an integer
``ADD`` reduction of subvectors within a vector, before adding the resulting vector
Collaborator:

This should be loosened so that the way the operands are combined is left unrestricted. In its broadest sense the operands are concatenated into a single vector that's then reduced down to the number of elements dictated by the result type (and hence the first operand's type), but there's no specification for how the reduction is distributed throughout those elements.

to the provided accumulator vector. The return type is a vector type that matches
the type of the accumulator vector.
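
A hedged aside restating the loose semantics argued for in the review comment above, not the patch text itself: assuming the accumulator/result has n lanes, the input has k*n lanes, and both use w-bit elements, only the wrapping total across lanes is pinned down, while the mapping of input lanes onto result lanes is target-defined:

    \sum_{j=0}^{n-1} \mathrm{result}_j \;\equiv\; \sum_{j=0}^{n-1} \mathrm{accum}_j \;+\; \sum_{i=0}^{kn-1} \mathrm{in}_i \pmod{2^{w}}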

Arguments:
""""""""""

The first argument is the accumulator vector, or a `zeroinitializer`. The type of
Collaborator:

I don't think zeroinitializer adds anything to the description, as in, there's no change of behaviour based on this specific value.

this argument must match the return type. The second argument is the vector to reduce
into the accumulator; the width of this vector must be a positive integer multiple of
the width of the accumulator vector/return type.
Collaborator:

Somewhat contentious, so feel free to ignore, but when talking about the number of elements I see vectors as having length, not width.

For now it's worth adding an extra restriction that the two vector types have matching styles (i.e. both fixed or both scalable), whilst also making it clear both vectors must have the same element type. The "style" restriction is something I think we'll want to relax in the future (AArch64's SVE2p1 feature is a possible enabling use case), but there's no point worrying about that yet.



'``llvm.experimental.vector.histogram.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

6 changes: 6 additions & 0 deletions llvm/include/llvm/IR/Intrinsics.td
@@ -2635,6 +2635,12 @@ def int_vector_deinterleave2 : DefaultAttrsIntrinsic<[LLVMHalfElementsVectorType
[llvm_anyvector_ty],
[IntrNoMem]>;

//===-------------- Intrinsics to perform partial reduction ---------------===//

def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
[llvm_anyvector_ty, llvm_anyvector_ty],
Collaborator:

I think adding a new matcher class to constrain the second parameter to the restrictions you defined in the langref would be helpful (same element type, width an integer multiple).

Collaborator:

Given this is an experimental intrinsic is it worth implementing that plumbing?

Also, the matcher classes typically exist to allow for fewer explicit types when creating a call, which in this instance is not possible because both vector lengths are unknown (or to put another way, there's no 1-1 link between them).

Personally I think the verifier route is better, plus it allows for a more user-friendly error message.

[IntrNoMem]>;

//===----------------- Pointer Authentication Intrinsics ------------------===//
//

22 changes: 22 additions & 0 deletions llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7914,6 +7914,28 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
setValue(&I, Trunc);
return;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
Collaborator:

I think we can pass this through as an INTRINSIC_WO_CHAIN node, at least for targets that support it.

Collaborator:

We need to be careful because I don't think common code exists to type legalise arbitrary INTRINSIC_WO_CHAIN calls (given their nature). Presumably we'll just follow the precedent set for get.active.lane.mask and cttz.elts when we add AArch64 specific lowering.

I can't help but think at some point we'll just want to relax the "same element type" restriction of VECREDUCE_ADD by having explicit signed and unsigned versions, like we have for ABDS/ABDU, but I guess we can see how things work out (again, much as we are for the intrinsics mentioned before).

auto DL = getCurSDLoc();
@huntergr-arm (Collaborator), Jul 2, 2024:

nit: It would be good to remove the 'auto' declarations and use the appropriate named types (SDValue, EVT, int, etc). I think you should already have a variable in scope for getCurSDLoc() as well (sdl, from the start of the function).

auto ReducedTy = EVT::getEVT(I.getType());
auto OpNode = getValue(I.getOperand(1));
auto FullTy = OpNode.getValueType();

auto Accumulator = getValue(I.getOperand(0));
unsigned ScaleFactor = FullTy.getVectorMinNumElements() / ReducedTy.getVectorMinNumElements();

for(unsigned i = 0; i < ScaleFactor; i++) {
@huntergr-arm (Collaborator), Jun 12, 2024:

I'm now a bit concerned about the semantics of the intrinsic. In one of the test cases below (partial_reduce_add), you have the same size vector for both inputs. Applying this lowering results in the second vector being reduced and the result added to the first lane of the accumulator, with the other lanes being untouched.

I think the idea was to reduce the second input vector until it matched the size of the first, then perform a vector add of the two. If both are the same size to begin with, you just need to perform a single vector add. @paulwalker-arm can you please clarify?

The langref text will need to make the exact semantics clear.

Collaborator:

I think the previous design may have been better, since it was clearly just performing the reduction of a single vector value into another (and possibly to a scalar, as @arsenm suggests). Making it a binop as well seems to make it less flexible vs. just having a separate binop afterwards. Maybe I'm missing something though...

Collaborator:

The problem with the "having a separate binop" approach is that it constrains optimisation/code generation because that binop requires a very specific ordering for how elements are combined, which is the very problem the partial reduction is solving.

I think folk are stuck in a "how can we use dot instructions" mindset, whilst I'm trying to push for "what is the loosest way reductions can be represented in IR". To this point, the current suggested langref text for the intrinsic is still too strict because it gives the impression there's a defined order for how the second operand's elements are combined with the first, where there shouldn't be.

Collaborator:

@huntergr-arm - Yes, the intent for "same size operands" is to emit a stock binop. This will effectively match what LoopVectorize does today and thus allow the intrinsic to be used regardless of the target rather than having to implement target specific/controlled paths within the vectorizer.

Collaborator:

Ok, I was thrown off by the langref description. I guess then I'd like to see the default lowering changed to just extract the subvectors from the second operand and perform a vector add onto the first operand, instead of reducing the subvectors and adding the result to individual lanes. It technically meets the defined semantics (target-defined order of reduction operations), but the current codegen is pretty awful compared to a series of vector adds (see the sketch after this file's diff).

auto SourceIndex = DAG.getVectorIdxConstant(i * ScaleFactor, DL);
auto TargetIndex = DAG.getVectorIdxConstant(i, DL);
auto ExistingValue = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ReducedTy.getScalarType(), {Accumulator, TargetIndex});
auto N = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ReducedTy, {OpNode, SourceIndex});
Collaborator:

This seems to assume that each subvector will be the same size as the smaller vector type? It works for the case we're interested in (e.g. <vscale x 16 x i32> to <vscale x 4 x i32>), but would fail if the larger type were <vscale x 8 x i32> -- you'd want to extract <vscale x 2 x i32> and reduce that. (We might never create such a partial reduction, but I think it should work correctly if we did).

N = DAG.getNode(ISD::VECREDUCE_ADD, DL, ReducedTy.getScalarType(), N);
N = DAG.getNode(ISD::ADD, DL, ReducedTy.getScalarType(), ExistingValue, N);
Accumulator = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, ReducedTy, {Accumulator, N, TargetIndex});
}

setValue(&I, Accumulator);
return;
}
case Intrinsic::experimental_cttz_elts: {
auto DL = getCurSDLoc();
SDValue Op = getValue(I.getOperand(0));
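The following is a minimal sketch, not the committed code, of the alternative default lowering suggested in the review above: split the wide operand into accumulator-sized subvectors and accumulate them with ordinary vector adds instead of scalarising each group through VECREDUCE_ADD. It assumes the in-scope names of visitIntrinsicCall (I, DAG, getValue, setValue) and the lane-count relationship enforced by the verifier.

    case Intrinsic::experimental_vector_partial_reduce_add: {
      SDLoc DL = getCurSDLoc();
      SDValue Acc = getValue(I.getOperand(0));   // narrow accumulator
      SDValue Input = getValue(I.getOperand(1)); // wide input vector
      EVT AccVT = Acc.getValueType();
      EVT InputVT = Input.getValueType();

      // The verifier guarantees the input's (minimum) lane count is a
      // multiple of the accumulator's; this sketch also assumes both share
      // an element type, the restriction the review asks to spell out.
      unsigned Stride = AccVT.getVectorMinNumElements();
      unsigned ScaleFactor = InputVT.getVectorMinNumElements() / Stride;

      // Peel off accumulator-sized subvectors and add them lane-wise.
      // Any association of input lanes with result lanes satisfies the
      // loose semantics, so plain vector adds are sufficient.
      for (unsigned Idx = 0; Idx < ScaleFactor; ++Idx) {
        SDValue SubVec =
            DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, AccVT, Input,
                        DAG.getVectorIdxConstant(Idx * Stride, DL));
        Acc = DAG.getNode(ISD::ADD, DL, AccVT, Acc, SubVec);
      }

      setValue(&I, Acc);
      return;
    }

For same-sized operands this collapses to a single vector add, which matches the "stock binop" intent stated in the review discussion.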
13 changes: 13 additions & 0 deletions llvm/lib/IR/Verifier.cpp
@@ -6131,6 +6131,19 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
}
break;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
Collaborator:

I guess my matcher class suggestion would remove the need for this code.

Collaborator:

See above for my 2c.

VectorType *AccTy = cast<VectorType>(Call.getArgOperand(0)->getType());
VectorType *VecTy = cast<VectorType>(Call.getArgOperand(1)->getType());

auto VecWidth = VecTy->getElementCount().getKnownMinValue();
Collaborator:

nit: more autos.

auto AccWidth = AccTy->getElementCount().getKnownMinValue();

Check((VecWidth % AccWidth) == 0, "Invalid vector widths for partial "
"reduction. The width of the input vector "
"must be a postive integer multiple of "
"the width of the accumulator vector.");
break;
}
case Intrinsic::experimental_noalias_scope_decl: {
NoAliasScopeDecls.push_back(cast<IntrinsicInst>(&Call));
break;
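As an illustration of the extra constraints discussed in the review (matching element types and matching fixed/scalable "styles"), the checks below could be appended to the verifier case above; they are a sketch, not part of this commit.

    Check(AccTy->getElementType() == VecTy->getElementType(),
          "Invalid vector types for partial reduction. The element type of "
          "the input vector must match the element type of the accumulator "
          "vector.");
    Check(isa<ScalableVectorType>(AccTy) == isa<ScalableVectorType>(VecTy),
          "Invalid vector types for partial reduction. The input and "
          "accumulator vectors must both be fixed-width or both be scalable.");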
162 changes: 162 additions & 0 deletions llvm/test/CodeGen/AArch64/partial-reduction-add.ll
@@ -0,0 +1,162 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
; RUN: llc -force-vector-interleave=1 %s | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"

define <4 x i32> @partial_reduce_add_fixed(<4 x i32> %accumulator, <4 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_fixed:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: addv s1, v1.4s
; CHECK-NEXT: fmov w9, s0
; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: add w8, w9, w8
; CHECK-NEXT: mov v0.s[0], w8
; CHECK-NEXT: ret
entry:
%partial.reduce = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v4i32(<4 x i32> %accumulator, <4 x i32> %0)
ret <4 x i32> %partial.reduce
}

define <4 x i32> @partial_reduce_add_fixed_half(<4 x i32> %accumulator, <8 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_fixed_half:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: addv s1, v1.4s
; CHECK-NEXT: fmov w9, s0
; CHECK-NEXT: mov w10, v0.s[1]
; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: add w9, w9, w8
; CHECK-NEXT: add w8, w10, w8
; CHECK-NEXT: mov v0.s[0], w9
; CHECK-NEXT: mov v0.s[1], w8
; CHECK-NEXT: ret
entry:
%partial.reduce = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32> %accumulator, <8 x i32> %0)
ret <4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add(<vscale x 4 x i32> %accumulator, <vscale x 4 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: uaddv d1, p0, z1.s
; CHECK-NEXT: ptrue p0.s, vl1
; CHECK-NEXT: fmov x9, d1
; CHECK-NEXT: add w8, w8, w9
; CHECK-NEXT: mov z0.s, p0/m, w8
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv4i32(<vscale x 4 x i32> %accumulator, <vscale x 4 x i32> %0)
ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add_half(<vscale x 4 x i32> %accumulator, <vscale x 8 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_half:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: mov w8, #1 // =0x1
; CHECK-NEXT: index z2.s, #0, #1
; CHECK-NEXT: mov z3.s, w8
; CHECK-NEXT: fmov w10, s0
; CHECK-NEXT: mov w9, v0.s[1]
; CHECK-NEXT: uaddv d1, p0, z1.s
; CHECK-NEXT: ptrue p1.s, vl1
; CHECK-NEXT: cmpeq p0.s, p0/z, z2.s, z3.s
; CHECK-NEXT: fmov x8, d1
; CHECK-NEXT: add w10, w10, w8
; CHECK-NEXT: add w8, w9, w8
; CHECK-NEXT: mov z0.s, p1/m, w10
; CHECK-NEXT: mov z0.s, p0/m, w8
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv8i32(<vscale x 4 x i32> %accumulator, <vscale x 8 x i32> %0)
ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add_quart(<vscale x 4 x i32> %accumulator, <vscale x 16 x i32> %0) #0 {
Collaborator:

This is reducing into the first 4 elements of the accumulator; it doesn't work correctly with vscale.

; CHECK-LABEL: partial_reduce_add_quart:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: mov w8, #1 // =0x1
; CHECK-NEXT: fmov w10, s0
; CHECK-NEXT: mov z6.s, w8
; CHECK-NEXT: index z5.s, #0, #1
; CHECK-NEXT: ptrue p2.s, vl1
; CHECK-NEXT: uaddv d1, p0, z1.s
; CHECK-NEXT: mov w9, v0.s[1]
; CHECK-NEXT: uaddv d2, p0, z2.s
; CHECK-NEXT: uaddv d3, p0, z3.s
; CHECK-NEXT: cmpeq p1.s, p0/z, z5.s, z6.s
; CHECK-NEXT: uaddv d4, p0, z4.s
; CHECK-NEXT: fmov x8, d1
; CHECK-NEXT: mov z1.d, z0.d
; CHECK-NEXT: add w8, w10, w8
; CHECK-NEXT: mov w10, #2 // =0x2
; CHECK-NEXT: mov z1.s, p2/m, w8
; CHECK-NEXT: fmov x8, d2
; CHECK-NEXT: mov z6.s, w10
; CHECK-NEXT: mov w10, v0.s[2]
; CHECK-NEXT: add w8, w9, w8
; CHECK-NEXT: mov w9, #3 // =0x3
; CHECK-NEXT: cmpeq p2.s, p0/z, z5.s, z6.s
; CHECK-NEXT: mov z2.s, w9
; CHECK-NEXT: fmov x9, d3
; CHECK-NEXT: mov z1.s, p1/m, w8
; CHECK-NEXT: mov w8, v0.s[3]
; CHECK-NEXT: add w9, w10, w9
; CHECK-NEXT: cmpeq p0.s, p0/z, z5.s, z2.s
; CHECK-NEXT: mov z1.s, p2/m, w9
; CHECK-NEXT: fmov x9, d4
; CHECK-NEXT: add w8, w8, w9
; CHECK-NEXT: mov z1.s, p0/m, w8
; CHECK-NEXT: mov z0.d, z1.d
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv16i32(<vscale x 4 x i32> %accumulator, <vscale x 16 x i32> %0)
ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 8 x i32> @partial_reduce_add_half_8(<vscale x 8 x i32> %accumulator, <vscale x 16 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_half_8:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add z2.s, z2.s, z3.s
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: mov w8, #1 // =0x1
; CHECK-NEXT: index z3.s, #0, #1
; CHECK-NEXT: mov z4.s, w8
; CHECK-NEXT: fmov w10, s0
; CHECK-NEXT: mov w9, v0.s[1]
; CHECK-NEXT: ptrue p1.s, vl1
; CHECK-NEXT: uaddv d2, p0, z2.s
; CHECK-NEXT: cmpeq p0.s, p0/z, z3.s, z4.s
; CHECK-NEXT: fmov x8, d2
; CHECK-NEXT: add w10, w10, w8
; CHECK-NEXT: add w8, w9, w8
; CHECK-NEXT: mov z0.s, p1/m, w10
; CHECK-NEXT: mov z0.s, p0/m, w8
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 8 x i32> @llvm.experimental.vector.partial.reduce.add.nxv8i32.nxv8i32.nxv16i32(<vscale x 8 x i32> %accumulator, <vscale x 16 x i32> %0)
ret <vscale x 8 x i32> %partial.reduce
}

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>) #1

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv8i32(<vscale x 4 x i32>, <vscale x 8 x i32>) #1

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv16i32(<vscale x 4 x i32>, <vscale x 16 x i32>) #1

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <vscale x 8 x i32> @llvm.experimental.vector.partial.reduce.add.nxv8i32.nxv8i32.nxv16i32(<vscale x 8 x i32>, <vscale x 16 x i32>) #1

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>) #2
declare i32 @llvm.vector.reduce.add.nxv8i32(<vscale x 8 x i32>) #2

attributes #0 = { "target-features"="+fp-armv8,+fullfp16,+neon,+sve,+sve2,+v8a" }
attributes #1 = { nocallback nofree nosync nounwind willreturn memory(none) }
Collaborator:

I think attributes 1 and 2 can be removed entirely, and 0 only really needs +sve2.

attributes #2 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }