[IR][LangRef] Add partial reduction add intrinsic #94499
llvm/docs/LangRef.rst

@@ -14250,7 +14250,7 @@ Arguments:
""""""""""
The first 4 arguments are similar to ``llvm.instrprof.increment``. The indexing
is specific to callsites, meaning callsites are indexed from 0, independent from
the indexes used by the other intrinsics (such as
``llvm.instrprof.increment[.step]``).

The last argument is the called value of the callsite this intrinsic precedes.

@@ -14264,7 +14264,7 @@ a buffer LLVM can use to perform counter increments (i.e. the lowering of
``llvm.instrprof.increment[.step]``. The address range following the counter
buffer, ``<num-counters>`` x ``sizeof(ptr)`` - sized, is expected to contain
pointers to contexts of functions called from this function ("subcontexts").
LLVM does not dereference into that memory region, just calculates GEPs.

Review comment: nit: unrelated whitespace change.

The lowering of ``llvm.instrprof.callsite`` consists of:
@@ -19209,6 +19209,35 @@ will be on any later loop iteration.
This intrinsic will only return 0 if the input count is also 0. A non-zero input
count will produce a non-zero result.

'``llvm.experimental.vector.partial.reduce.add.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""
This is an overloaded intrinsic.

::

      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v8i32(<8 x i32> %in)
      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<16 x i32> %in)

Review comment: Is it best for these intrinsics to take a single operand? I kind of see
them more like binary operators whose input and output operands are less restrictive. The
intent being we're doing a partial reduction where currently LoopVectorize achieves this
via an …

Reply (same reviewer): To be specific, I'm proposing a form whereby the result and first
operand types match, but the second operand differs (perhaps with a restriction that it
must have the same or more elements).
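For illustration only, a declaration matching that proposal might look like the sketch
below (hypothetical; the exact signature is not shown in this thread):

    declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v8i32(<4 x i32> %accumulator, <8 x i32> %in)

Here the result and the first operand (the accumulator) share a type, while the second
operand only needs the same or a larger number of elements.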
      declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv8i32(<vscale x 8 x i32> %in)
      declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %in)

Overview:
"""""""""

The '``llvm.experimental.vector.partial.reduce.add.*``' intrinsics perform an integer
``ADD`` reduction of subvectors within a vector, returning each scalar result as a lane
of the result vector. The return type is a vector whose element type matches that of the
input vector and whose element count is a factor of the input's element count (typically
a half or a quarter).
Review comment: I haven't been involved in defining these intrinsics internally, but have
thought about how they might work before. I'm not sure if it is better to have a generic
partial reduction like this or something more specific to dotprod that includes the
zext/sext and mul. They both have advantages and disadvantages: the more instructions
there are, the harder they are to cost-model well, but more can be done with them. It
would seem that we should be defining how these are expected to reduce the inputs into
the output lanes; otherwise the definition is a bit wishy-washy in a way that can make
them more difficult to use than is necessary. I would expect them to perform pair-wise
reductions, and it might be simpler if they are limited to power-of-2 sizes so that they
can deinterleave in steps. The codegen that currently exists doesn't seem to do that
though.

Reply: The intent here is to keep the reduction intrinsics as loose as possible so we
don't lock the code generator into a specific ordering. If there's an option to simply
extend the original intrinsics that would be super, but I figured it would be easier to
move current uses to a newer intrinsic (assuming it leaves the experimental space) than
the other way round.
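To make the "loose ordering" point concrete, here is a sketch (my reading of the
discussion, not wording from the patch): only the combined sum carried by the result
lanes is meaningful, so different expansions of the same call are equally valid.

    ; Sketch: two equally valid materialisations of one partial reduction.
    %r = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<16 x i32> %in)
    ; (a) Lane-block form: lane i of %r holds in[4*i] + in[4*i+1] + in[4*i+2] + in[4*i+3]
    ;     (this is what the SelectionDAG lowering in this patch produces).
    ; (b) Vector-add form: split %in into four <4 x i32> pieces and add them together, so
    ;     lane i of %r holds in[i] + in[i+4] + in[i+8] + in[i+12].
    ; Either way, the same total comes out of a later full reduction:
    %total = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %r)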
Arguments:
""""""""""

The argument to this intrinsic must be a vector of integer values.


'``llvm.experimental.vector.histogram.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7914,6 +7914,27 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
    setValue(&I, Trunc);
    return;
  }
  case Intrinsic::experimental_vector_partial_reduce_add: {

Review comment: I think we can pass this through as an INTRINSIC_WO_CHAIN node, at least
for targets that support it.

Reply: We need to be careful, because I don't think common code exists to type legalise
arbitrary INTRINSIC_WO_CHAIN calls (given their nature). Presumably we'll just follow the
precedent set for … I can't help but think at some point we'll just want to … the
"same element type" restriction of …
    auto DL = getCurSDLoc();

Review comment: nit: It would be good to remove the 'auto' declarations and use the
appropriate named types (SDValue, EVT, int, etc). I think you should already have a
variable in scope for getCurSDLoc() as well (sdl, from the start of the function).
    auto ReducedTy = EVT::getEVT(I.getType());
    auto OpNode = getValue(I.getOperand(0));
    auto Index = DAG.getVectorIdxConstant(0, DL);
    auto FullTy = OpNode.getValueType();

    auto ResultVector = DAG.getSplat(ReducedTy, DL, DAG.getConstant(0, DL, ReducedTy.getScalarType()));
    unsigned ScaleFactor = FullTy.getVectorMinNumElements() / ReducedTy.getVectorMinNumElements();

    for(unsigned i = 0; i < ScaleFactor; i++) {

Review comment: I'm now a bit concerned about the semantics of the intrinsic. In one of
the test cases below (partial_reduce_add), you have the same size vector for both inputs.
Applying this lowering results in the second vector being reduced and the result added to
the first lane of the accumulator, with the other lanes being untouched. I think the idea
was to reduce the second input vector until it matched the size of the first, then
perform a vector add of the two. If both are the same size to begin with, you just need
to perform a single vector add. @paulwalker-arm can you please clarify? The langref text
will need to make the exact semantics clear.

Reply: I think the previous design may have been better, since it was clearly just
performing the reduction of a single vector value into another (and possibly to a scalar,
as @arsenm suggests). Making it a binop as well seems to make it less flexible vs. just
having a separate binop afterwards. Maybe I'm missing something though...

Reply: The problem with the "having a separate binop" approach is that it constrains
optimisation/code generation, because that binop requires a very specific ordering for
how elements are combined, which is the very problem the partial reduction is solving. I
think folk are stuck in a "how can we use dot instructions" mindset, whilst I'm trying to
push for "what is the loosest way reductions can be represented in IR". To this point,
the current suggested langref text for the intrinsic is still too strict, because it
gives the impression there's a defined order for how the second operand's elements are
combined with the first, where there shouldn't be.

Reply: @huntergr-arm - Yes, the intent for "same size operands" is to emit a stock binop.
This will effectively match what LoopVectorize does today and thus allow the intrinsic to
be used regardless of the target, rather than having to implement
target-specific/controlled paths within the vectorizer.

Reply: Ok, I was thrown off by the langref description. I guess then I'd like to see the
default lowering changed to just extract the subvectors from the second operand and
perform a vector add onto the first operand, instead of reducing the subvectors and
adding the result to individual lanes. It technically meets the defined semantics
(target-defined order of reduction operations), but the current codegen is pretty awful
compared to a series of vector adds.
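A sketch of the "same size operands" intent described above (my illustration; it uses the
two-operand form being discussed in this thread, not the single-operand declaration in
the current patch):

    ; Hypothetical two-operand overload with equally sized accumulator and input.
    %r = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32(<vscale x 4 x i32> %acc, <vscale x 4 x i32> %in)
    ; With matching element counts no reduction is needed and the expected lowering is a
    ; stock binop, i.e. the equivalent of:
    %r.equiv = add <vscale x 4 x i32> %acc, %in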
      auto SourceIndex = DAG.getVectorIdxConstant(i * ScaleFactor, DL);
      auto TargetIndex = DAG.getVectorIdxConstant(i, DL);
      auto N = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ReducedTy, {OpNode, SourceIndex});

Review comment: This seems to assume that each subvector will be the same size as the
smaller vector type? It works for the case we're interested in (e.g. <vscale x 16 x i32>
to <vscale x 4 x i32>), but would fail if the larger type were <vscale x 8 x i32> -- you'd
want to extract <vscale x 2 x i32> and reduce that. (We might never create such a partial
reduction, but I think it should work correctly if we did.)
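To illustrate the case the reviewer describes (my example, not part of the patch): for an
8-element input reduced into 4 lanes, each lane only covers two input elements, so this
lowering would need to extract <vscale x 2 x i32> slices rather than ReducedTy-sized
ones.

    ; 8 input elements reduced into 4 result lanes.
    %r = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv8i32(<vscale x 8 x i32> %in)
    ; ScaleFactor = 8 / 4 = 2, so in this lowering scheme each result lane would sum a
    ; <vscale x 2 x i32> slice of %in; extracting <vscale x 4 x i32> subvectors is too wide.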
      N = DAG.getNode(ISD::VECREDUCE_ADD, DL, ReducedTy.getScalarType(), N);
      ResultVector = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, ReducedTy, {ResultVector, N, TargetIndex});
    }

    setValue(&I, ResultVector);
    return;
  }
  case Intrinsic::experimental_cttz_elts: {
    auto DL = getCurSDLoc();
    SDValue Op = getValue(I.getOperand(0));
New test file (AArch64 SVE lowering of the partial reduction intrinsic)

@@ -0,0 +1,76 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
; RUN: llc -force-vector-interleave=1 %s | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"

define void @partial_reduce_add(<vscale x 16 x i8> %wide.load.pre, <vscale x 16 x i32> %0, <vscale x 16 x i32> %1, i64 %index) #0 {
; CHECK-LABEL: partial_reduce_add:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ptrue p0.s
; CHECK-NEXT: mov w8, #1 // =0x1
; CHECK-NEXT: index z2.s, #0, #1
; CHECK-NEXT: mov z4.s, w8
; CHECK-NEXT: mov w8, #2 // =0x2
; CHECK-NEXT: ptrue p2.s, vl1
; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, #1, mul vl]
; CHECK-NEXT: ld1w { z5.s }, p0/z, [x0, #2, mul vl]
; CHECK-NEXT: mov z6.s, w8
; CHECK-NEXT: cmpeq p1.s, p0/z, z2.s, z4.s
; CHECK-NEXT: uaddv d3, p0, z0.s
; CHECK-NEXT: mov z0.s, #0 // =0x0
; CHECK-NEXT: uaddv d7, p0, z1.s
; CHECK-NEXT: uaddv d4, p0, z5.s
; CHECK-NEXT: mov z1.d, z0.d
; CHECK-NEXT: fmov x8, d3
; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0, #3, mul vl]
; CHECK-NEXT: mov z1.s, p2/m, w8
; CHECK-NEXT: mov w8, #3 // =0x3
; CHECK-NEXT: cmpeq p2.s, p0/z, z2.s, z6.s
; CHECK-NEXT: mov z5.s, w8
; CHECK-NEXT: fmov x8, d7
; CHECK-NEXT: uaddv d3, p0, z3.s
; CHECK-NEXT: mov z1.s, p1/m, w8
; CHECK-NEXT: fmov x8, d4
; CHECK-NEXT: cmpeq p0.s, p0/z, z2.s, z5.s
; CHECK-NEXT: mov z1.s, p2/m, w8
; CHECK-NEXT: fmov x8, d3
; CHECK-NEXT: mov z1.s, p0/m, w8
; CHECK-NEXT: addvl x8, x1, #1
; CHECK-NEXT: .LBB0_1: // %vector.body
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: orr z0.d, z1.d, z0.d
; CHECK-NEXT: cbnz x8, .LBB0_1
; CHECK-NEXT: // %bb.2: // %middle.block
; CHECK-NEXT: ret
entry:
  %2 = call i64 @llvm.vscale.i64()
  %3 = mul i64 %2, 16
  br label %vector.body

vector.body: ; preds = %vector.body, %entry

Review comment: It doesn't need a loop for the test.
  %vec.phi = phi <vscale x 4 x i32> [ zeroinitializer, %entry ], [ %4, %vector.body ]
  %partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %1)
  %4 = or <vscale x 4 x i32> %partial.reduce, %vec.phi
  %index.next = add i64 %index, %3
  %5 = icmp eq i64 %index.next, 0
  br i1 %5, label %middle.block, label %vector.body

middle.block: ; preds = %vector.body
  %6 = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %4)
  ret void
}
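Following the reviewer's note above that the test does not need a loop, a minimal
loop-free function exercising the same intrinsic might look like this sketch (the
function name and shape are mine, not part of the patch):

    define <vscale x 4 x i32> @partial_reduce_add_noloop(<vscale x 16 x i32> %wide) #0 {
    entry:
      %partial = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %wide)
      ret <vscale x 4 x i32> %partial
    }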
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare i64 @llvm.vscale.i64() #1

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32>) #1

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>) #2

attributes #0 = { "target-features"="+fp-armv8,+fullfp16,+neon,+sve,+sve2,+v8a" }
attributes #1 = { nocallback nofree nosync nounwind willreturn memory(none) }
attributes #2 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }