Commit 0e51b54

[DirectX] Implement the resource.store.rawbuffer intrinsic (llvm#121282)
This introduces `@llvm.dx.resource.store.rawbuffer` and generalizes the buffer store docs under DirectX/DXILResources. Fixes llvm#106188
1 parent 08028d6 commit 0e51b54

7 files changed: 469 additions & 41 deletions

llvm/docs/DirectX/DXILResources.rst

Lines changed: 99 additions & 15 deletions
@@ -491,26 +491,28 @@ Examples:
        i32 %byte_offset,
        i32 0)
 
-Texture and Typed Buffer Stores
--------------------------------
+Stores
+------
 
-*relevant types: Textures and TypedBuffer*
+*relevant types: Textures and Buffer*
 
-The `TextureStore`_ and `BufferStore`_ DXIL operations always write all four
-32-bit components to a texture or a typed buffer. While both operations include
-a mask parameter, it is specified that the mask must cover all components when
-used with these types.
+The `TextureStore`_, `BufferStore`_, and `RawBufferStore`_ DXIL operations
+write four components to a texture or a buffer. These include a mask argument
+that is used when fewer than 4 components are written, but notably this only
+takes on the contiguous x, xy, xyz, and xyzw values.
 
-The store operations that we define as intrinsics behave similarly, and will
-only accept writes to the whole of the contained type. This differs from the
-loads above, but this makes sense to do from a semantics preserving point of
-view. Thus, texture and buffer stores may only operate on 4-element vectors of
-types that are 32-bits or fewer, such as ``<4 x i32>``, ``<4 x float>``, and
-``<4 x half>``, and 2 element vectors of 64-bit types like ``<2 x double>`` and
-``<2 x i64>``.
+We define the LLVM store intrinsics to accept vectors when storing multiple
+components rather than using `undef` and a mask, but otherwise match the DXIL
+ops fairly closely.
 
-.. _BufferStore: https://github.com/microsoft/DirectXShaderCompiler/blob/main/docs/DXIL.rst#bufferstore
 .. _TextureStore: https://github.com/microsoft/DirectXShaderCompiler/blob/main/docs/DXIL.rst#texturestore
+.. _BufferStore: https://github.com/microsoft/DirectXShaderCompiler/blob/main/docs/DXIL.rst#bufferstore
+.. _RawBufferStore: https://github.com/microsoft/DirectXShaderCompiler/blob/main/docs/DXIL.rst#rawbufferstore
+
+For TypedBuffer, we only need one coordinate, and we must always write a vector
+since partial writes aren't possible. Similarly to the load operations
+described above, we handle 64-bit types specially and only handle 2-element
+vectors rather than 4.
 
 Examples:
 
@@ -548,3 +550,85 @@ Examples:
        target("dx.TypedBuffer", f16, 1, 0) %buf, i32 %index, <4 x f16> %data)
    call void @llvm.dx.resource.store.typedbuffer.tdx.Buffer_v2f64_1_0_0t(
        target("dx.TypedBuffer", f64, 1, 0) %buf, i32 %index, <2 x f64> %data)
+
+For RawBuffer, we need two indices and we accept scalars and vectors of 4 or
+fewer elements. Note that we do allow vectors of 4 64-bit elements here.
+
+Examples:
+
+.. list-table:: ``@llvm.dx.resource.store.rawbuffer``
+   :header-rows: 1
+
+   * - Argument
+     -
+     - Type
+     - Description
+   * - Return value
+     -
+     - ``void``
+     -
+   * - ``%buffer``
+     - 0
+     - ``target(dx.RawBuffer, ...)``
+     - The buffer to store into
+   * - ``%index``
+     - 1
+     - ``i32``
+     - Index into the buffer
+   * - ``%offset``
+     - 2
+     - ``i32``
+     - Byte offset into structured buffer elements
+   * - ``%data``
+     - 3
+     - Scalar or vector
+     - The data to store
+
+Examples:
+
+.. code-block:: llvm
+
+   ; float
+   call void @llvm.dx.resource.store.rawbuffer.tdx.RawBuffer_f32_1_0_0t.f32(
+       target("dx.RawBuffer", float, 1, 0, 0) %buffer,
+       i32 %index, i32 0, float %data)
+   call void @llvm.dx.resource.store.rawbuffer.tdx.RawBuffer_i8_1_0_0t.f32(
+       target("dx.RawBuffer", i8, 1, 0, 0) %buffer,
+       i32 %index, i32 0, float %data)
+
+   ; float4
+   call void @llvm.dx.resource.store.rawbuffer.tdx.RawBuffer_v4f32_1_0_0t.v4f32(
+       target("dx.RawBuffer", <4 x float>, 1, 0, 0) %buffer,
+       i32 %index, i32 0, <4 x float> %data)
+   call void @llvm.dx.resource.store.rawbuffer.tdx.RawBuffer_i8_1_0_0t.v4f32(
+       target("dx.RawBuffer", i8, 1, 0, 0) %buffer,
+       i32 %index, i32 0, <4 x float> %data)
+
+   ; struct S0 { float4 f; int4 i; }
+   call void @llvm.dx.resource.store.rawbuffer.v4f32(
+       target("dx.RawBuffer", { <4 x float>, <4 x i32> }, 1, 0, 0) %buffer,
+       i32 %index, i32 0, <4 x float> %data0)
+   call void @llvm.dx.resource.store.rawbuffer.v4i32(
+       target("dx.RawBuffer", { <4 x float>, <4 x i32> }, 1, 0, 0) %buffer,
+       i32 %index, i32 16, <4 x i32> %data1)
+
+   ; struct Q { float4 f; int3 i; }
+   ; struct R { int z; S x; }
+   call void @llvm.dx.resource.store.rawbuffer.i32(
+       target("dx.RawBuffer", {i32, {<4 x float>, <3 x half>}}, 1, 0, 0)
+           %buffer,
+       i32 %index, i32 0, i32 %data0)
+   call void @llvm.dx.resource.store.rawbuffer.v4f32(
+       target("dx.RawBuffer", {i32, {<4 x float>, <3 x half>}}, 1, 0, 0)
+           %buffer,
+       i32 %index, i32 4, <4 x float> %data1)
+   call void @llvm.dx.resource.store.rawbuffer.v3f16(
+       target("dx.RawBuffer", {i32, {<4 x float>, <3 x half>}}, 1, 0, 0)
+           %buffer,
+       i32 %index, i32 20, <3 x half> %data2)
+
+   ; byteaddressbuf.Store<int64_t4>
+   call void @llvm.dx.resource.store.rawbuffer.tdx.RawBuffer_i8_1_0_0t.v4f64(
+       target("dx.RawBuffer", i8, 1, 0, 0) %buffer,
+       i32 %index, i32 0, <4 x double> %data)
+
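Note on the struct examples in the documentation change: each field is stored with its own call, and the third argument carries the field's byte offset within the element (0 and 16 for `S0`; 0, 4 and 20 for the nested struct). The following is a minimal sketch of that arithmetic, assuming fields are laid out back to back with no padding, which is what the offsets in the examples suggest. `fieldOffsets` is a hypothetical helper written for illustration and is not part of LLVM.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: given the byte size of each field in declaration
// order, return the byte offset of each field. Assumes the fields are
// packed back to back with no padding; this assumption is read off the
// offsets used in the documentation examples, not a layout specification.
std::vector<uint32_t> fieldOffsets(const std::vector<uint32_t> &FieldSizes) {
  std::vector<uint32_t> Offsets;
  Offsets.reserve(FieldSizes.size());
  uint32_t Offset = 0;
  for (uint32_t Size : FieldSizes) {
    Offsets.push_back(Offset); // The field starts where the previous ended.
    Offset += Size;
  }
  return Offsets;
}
```

Under that assumption, `fieldOffsets({16, 16})` yields the offsets 0 and 16 used for `{ <4 x float>, <4 x i32> }`, and `fieldOffsets({4, 16, 6})` yields 0, 4 and 20 for `{i32, {<4 x float>, <3 x half>}}`.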

llvm/include/llvm/IR/IntrinsicsDirectX.td

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,10 @@ def int_dx_resource_load_rawbuffer
     : DefaultAttrsIntrinsic<[llvm_any_ty, llvm_i1_ty],
                             [llvm_any_ty, llvm_i32_ty, llvm_i32_ty],
                             [IntrReadMem]>;
+def int_dx_resource_store_rawbuffer
+    : DefaultAttrsIntrinsic<
+          [], [llvm_any_ty, llvm_i32_ty, llvm_i32_ty, llvm_any_ty],
+          [IntrWriteMem]>;
 
 def int_dx_resource_updatecounter
     : DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_any_ty, llvm_i8_ty],

llvm/lib/Target/DirectX/DXIL.td

Lines changed: 20 additions & 0 deletions
@@ -909,6 +909,26 @@ def RawBufferLoad : DXILOp<139, rawBufferLoad> {
   let stages = [Stages<DXIL1_2, [all_stages]>];
 }
 
+def RawBufferStore : DXILOp<140, rawBufferStore> {
+  let Doc = "writes to a RWByteAddressBuffer or RWStructuredBuffer";
+  // Handle, Coord0, Coord1, Val0, Val1, Val2, Val3, Mask, Alignment
+  let arguments = [
+    HandleTy, Int32Ty, Int32Ty, OverloadTy, OverloadTy, OverloadTy, OverloadTy,
+    Int8Ty, Int32Ty
+  ];
+  let result = VoidTy;
+  let overloads = [
+    Overloads<DXIL1_2,
+              [ResRetHalfTy, ResRetFloatTy, ResRetInt16Ty, ResRetInt32Ty]>,
+    Overloads<DXIL1_3,
+              [
+                ResRetHalfTy, ResRetFloatTy, ResRetDoubleTy, ResRetInt16Ty,
+                ResRetInt32Ty, ResRetInt64Ty
+              ]>
+  ];
+  let stages = [Stages<DXIL1_2, [all_stages]>];
+}
+
 def Dot4AddI8Packed : DXILOp<163, dot4AddPacked> {
   let Doc = "signed dot product of 4 x i8 vectors packed into i32, with "
             "accumulate to i32";

llvm/lib/Target/DirectX/DXILOpLowering.cpp

Lines changed: 56 additions & 26 deletions
@@ -616,7 +616,10 @@ class OpLowerer {
     return false;
   }
 
-  [[nodiscard]] bool lowerTypedBufferStore(Function &F) {
+  [[nodiscard]] bool lowerBufferStore(Function &F, bool IsRaw) {
+    Triple TT(Triple(M.getTargetTriple()));
+    VersionTuple DXILVersion = TT.getDXILVersion();
+    const DataLayout &DL = F.getDataLayout();
     IRBuilder<> &IRB = OpBuilder.getIRB();
     Type *Int8Ty = IRB.getInt8Ty();
     Type *Int32Ty = IRB.getInt32Ty();
@@ -627,51 +630,75 @@ class OpLowerer {
       Value *Handle =
           createTmpHandleCast(CI->getArgOperand(0), OpBuilder.getHandleType());
       Value *Index0 = CI->getArgOperand(1);
-      Value *Index1 = UndefValue::get(Int32Ty);
-      // For typed stores, the mask must always cover all four elements.
-      Constant *Mask = ConstantInt::get(Int8Ty, 0xF);
+      Value *Index1 = IsRaw ? CI->getArgOperand(2) : UndefValue::get(Int32Ty);
+
+      Value *Data = CI->getArgOperand(IsRaw ? 3 : 2);
+      Type *DataTy = Data->getType();
+      Type *ScalarTy = DataTy->getScalarType();
 
-      Value *Data = CI->getArgOperand(2);
-      auto *DataTy = dyn_cast<FixedVectorType>(Data->getType());
-      if (!DataTy || DataTy->getNumElements() != 4)
+      uint64_t NumElements =
+          DL.getTypeSizeInBits(DataTy) / DL.getTypeSizeInBits(ScalarTy);
+      Value *Mask = ConstantInt::get(Int8Ty, ~(~0U << NumElements));
+
+      // TODO: check that we only have vector or scalar...
+      if (!IsRaw && NumElements != 4)
         return make_error<StringError>(
             "typedBufferStore data must be a vector of 4 elements",
             inconvertibleErrorCode());
+      else if (NumElements > 4)
+        return make_error<StringError>(
+            "rawBufferStore data must have at most 4 elements",
+            inconvertibleErrorCode());
 
-      // Since we're post-scalarizer, we likely have a vector that's constructed
-      // solely for the argument of the store. If so, just use the scalar values
-      // from before they're inserted into the temporary.
       std::array<Value *, 4> DataElements{nullptr, nullptr, nullptr, nullptr};
-      auto *IEI = dyn_cast<InsertElementInst>(Data);
-      while (IEI) {
-        auto *IndexOp = dyn_cast<ConstantInt>(IEI->getOperand(2));
-        if (!IndexOp)
-          break;
-        size_t IndexVal = IndexOp->getZExtValue();
-        assert(IndexVal < 4 && "Too many elements for buffer store");
-        DataElements[IndexVal] = IEI->getOperand(1);
-        IEI = dyn_cast<InsertElementInst>(IEI->getOperand(0));
+      if (DataTy == ScalarTy)
+        DataElements[0] = Data;
+      else {
+        // Since we're post-scalarizer, if we see a vector here it's likely
+        // constructed solely for the argument of the store. Just use the scalar
+        // values from before they're inserted into the temporary.
+        auto *IEI = dyn_cast<InsertElementInst>(Data);
+        while (IEI) {
+          auto *IndexOp = dyn_cast<ConstantInt>(IEI->getOperand(2));
+          if (!IndexOp)
+            break;
+          size_t IndexVal = IndexOp->getZExtValue();
+          assert(IndexVal < 4 && "Too many elements for buffer store");
+          DataElements[IndexVal] = IEI->getOperand(1);
+          IEI = dyn_cast<InsertElementInst>(IEI->getOperand(0));
+        }
       }
 
       // If for some reason we weren't able to forward the arguments from the
-      // scalarizer artifact, then we need to actually extract elements from the
-      // vector.
-      for (int I = 0, E = 4; I != E; ++I)
+      // scalarizer artifact, then we may need to actually extract elements from
+      // the vector.
+      for (int I = 0, E = NumElements; I < E; ++I)
         if (DataElements[I] == nullptr)
           DataElements[I] =
               IRB.CreateExtractElement(Data, ConstantInt::get(Int32Ty, I));
+      // For any elements beyond the length of the vector, fill up with undef.
+      for (int I = NumElements, E = 4; I < E; ++I)
+        if (DataElements[I] == nullptr)
+          DataElements[I] = UndefValue::get(ScalarTy);
 
-      std::array<Value *, 8> Args{
+      dxil::OpCode Op = OpCode::BufferStore;
+      SmallVector<Value *, 9> Args{
           Handle,          Index0,          Index1,          DataElements[0],
           DataElements[1], DataElements[2], DataElements[3], Mask};
+      if (IsRaw && DXILVersion >= VersionTuple(1, 2)) {
+        Op = OpCode::RawBufferStore;
+        // RawBufferStore requires the alignment
+        Args.push_back(
+            ConstantInt::get(Int32Ty, DL.getPrefTypeAlign(ScalarTy).value()));
+      }
       Expected<CallInst *> OpCall =
-          OpBuilder.tryCreateOp(OpCode::BufferStore, Args, CI->getName());
+          OpBuilder.tryCreateOp(Op, Args, CI->getName());
       if (Error E = OpCall.takeError())
         return E;
 
       CI->eraseFromParent();
       // Clean up any leftover `insertelement`s
-      IEI = dyn_cast<InsertElementInst>(Data);
+      auto *IEI = dyn_cast<InsertElementInst>(Data);
       while (IEI && IEI->use_empty()) {
         InsertElementInst *Tmp = IEI;
         IEI = dyn_cast<InsertElementInst>(IEI->getOperand(0));
@@ -776,11 +803,14 @@ class OpLowerer {
         HasErrors |= lowerTypedBufferLoad(F, /*HasCheckBit=*/true);
         break;
       case Intrinsic::dx_resource_store_typedbuffer:
-        HasErrors |= lowerTypedBufferStore(F);
+        HasErrors |= lowerBufferStore(F, /*IsRaw=*/false);
         break;
       case Intrinsic::dx_resource_load_rawbuffer:
         HasErrors |= lowerRawBufferLoad(F);
         break;
+      case Intrinsic::dx_resource_store_rawbuffer:
+        HasErrors |= lowerBufferStore(F, /*IsRaw=*/true);
+        break;
       case Intrinsic::dx_resource_updatecounter:
         HasErrors |= lowerUpdateCounter(F);
         break;
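Note on the lowering in `lowerBufferStore`: the store always passes four value operands plus a mask, so shorter data is padded with undef and the mask marks the leading components that are really written. The sketch below models three pieces of that logic in plain C++ with no LLVM types: the mask expression `~(~0U << NumElements)`, the padding to four operands, and the choice between `BufferStore` and `RawBufferStore`. The names `storeMask`, `padToFour`, `selectStoreOp` and `StoreOp` are hypothetical, and `std::nullopt` stands in for an undef operand; this is an illustration under those assumptions, not the pass itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Write mask with the low NumElements bits set, mirroring the expression
// ~(~0U << NumElements) from the diff, truncated to 8 bits.
uint8_t storeMask(unsigned NumElements) {
  assert(NumElements >= 1 && NumElements <= 4 && "expected 1 to 4 components");
  return static_cast<uint8_t>(~(~0U << NumElements));
}

// Model of the four value operands: present elements keep their value,
// trailing slots are std::nullopt (standing in for undef).
std::array<std::optional<int>, 4> padToFour(const std::vector<int> &Data) {
  assert(Data.size() <= 4 && "at most 4 elements");
  std::array<std::optional<int>, 4> Out{};
  for (std::size_t I = 0; I < Data.size(); ++I)
    Out[I] = Data[I];
  return Out;
}

enum class StoreOp { BufferStore, RawBufferStore };

// Raw stores use RawBufferStore from DXIL 1.2 onwards; every other case
// falls back to BufferStore, as in the diff.
StoreOp selectStoreOp(bool IsRaw, unsigned Major, unsigned Minor) {
  bool AtLeast1_2 = Major > 1 || (Major == 1 && Minor >= 2);
  return (IsRaw && AtLeast1_2) ? StoreOp::RawBufferStore
                               : StoreOp::BufferStore;
}
```

With these definitions, one to four components map to the masks 0x1, 0x3, 0x7 and 0xF, which are the contiguous x, xy, xyz and xyzw values the documentation change describes.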
