
[MLIR] Pass hostShared flag in gpu.alloc op to runtime wrappers #66401

Merged 5 commits on Sep 26, 2023
24 changes: 17 additions & 7 deletions mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
@@ -167,7 +167,8 @@ class ConvertOpToGpuRuntimeCallPattern : public ConvertOpToLLVMPattern<OpTy> {
       "mgpuMemAlloc",
       llvmPointerType /* void * */,
       {llvmIntPtrType /* intptr_t sizeBytes */,
-       llvmPointerType /* void *stream */}};
+       llvmPointerType /* void *stream */,
+       llvmInt8Type /* bool isHostShared */}};
   FunctionCallBuilder deallocCallBuilder = {
       "mgpuMemFree",
       llvmVoidType,
@@ -786,19 +787,23 @@ LogicalResult ConvertHostUnregisterOpToGpuRuntimeCallPattern::matchAndRewrite(
 LogicalResult ConvertAllocOpToGpuRuntimeCallPattern::matchAndRewrite(
     gpu::AllocOp allocOp, OpAdaptor adaptor,
     ConversionPatternRewriter &rewriter) const {
-  if (adaptor.getHostShared())
-    return rewriter.notifyMatchFailure(
-        allocOp, "host_shared allocation is not supported");
 
   MemRefType memRefType = allocOp.getType();
 
   if (failed(areAllLLVMTypes(allocOp, adaptor.getOperands(), rewriter)) ||
-      !isConvertibleAndHasIdentityMaps(memRefType) ||
-      failed(isAsyncWithOneDependency(rewriter, allocOp)))
+      !isConvertibleAndHasIdentityMaps(memRefType))
     return failure();
 
   auto loc = allocOp.getLoc();
 
+  bool isShared = allocOp.getHostShared();
+
+  if (isShared && allocOp.getAsyncToken())
+    return rewriter.notifyMatchFailure(
+        allocOp, "Host Shared allocation cannot be done async");
+  else if (!isShared && failed(isAsyncWithOneDependency(rewriter, allocOp)))
+    return failure();
+
   // Get shape of the memref as values: static sizes are constant
   // values and dynamic sizes are passed to 'alloc' as operands.
   SmallVector<Value, 4> shape;
@@ -811,8 +816,13 @@ LogicalResult ConvertAllocOpToGpuRuntimeCallPattern::matchAndRewrite(
   // descriptor.
   Type elementPtrType = this->getElementPtrType(memRefType);
   auto stream = adaptor.getAsyncDependencies().front();
+
+  auto isHostShared = rewriter.create<mlir::LLVM::ConstantOp>(
+      loc, llvmInt8Type, rewriter.getI8IntegerAttr(isShared));
+
   Value allocatedPtr =
-      allocCallBuilder.create(loc, rewriter, {sizeBytes, stream}).getResult();
+      allocCallBuilder.create(loc, rewriter, {sizeBytes, stream, isHostShared})
+          .getResult();
   if (!getTypeConverter()->useOpaquePointers())
     allocatedPtr =
         rewriter.create<LLVM::BitcastOp>(loc, elementPtrType, allocatedPtr);

Review thread on the mgpuMemAlloc call:

Member: I thought we had a consensus about avoiding the use of host_shared and async because of USM's limitation in handling asynchronous data allocation. Are we still on the same page with this? #65539 (comment)

Contributor Author: We need to relax the checks in the GPUToLLVMConversion pass in that case, to allow lowering of non-async gpu.alloc. We might also need to change the gpu-async-region pass to handle this.

Member: Sure, I think it's worth doing! What's your take on this? Personally, I think having a complete PR is the way to go. Otherwise, we will have an improperly implemented op.

Contributor Author: I relaxed the checks in the GPUToLLVMConversion pass. Regarding touching other passes, can we do that in a follow-up PR once all the PRs relating to #65539 are merged?

Member: Awesome, thanks! Sure, that sounds good to me. Let me know if there is anything else to do for the CUDA runtime lowering.

Contributor Author (@nbpatel, Sep 25, 2023): Thanks. Any other feedback on the PR, @grypp? Otherwise, can we merge it if it looks good to you?
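For context, a sketch of the gpu.alloc forms the relaxed check now distinguishes; this is illustrative GPU-dialect syntax, not taken from this PR's tests:

  // Accepted: device-only allocation, async with one dependency.
  %t0 = gpu.wait async
  %m0, %t1 = gpu.alloc async [%t0] (%size) : memref<?xf32>

  // Accepted: host-shared allocation in its synchronous form.
  %m1 = gpu.alloc host_shared (%size) : memref<?xf32>

  // Rejected by the pattern: host_shared combined with an async token.
  // %m2, %t2 = gpu.alloc async [%t0] host_shared (%size) : memref<?xf32>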
3 changes: 2 additions & 1 deletion mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
@@ -231,7 +231,8 @@ extern MLIR_CUDA_WRAPPERS_EXPORT "C" void mgpuEventRecord(CUevent event,
   CUDA_REPORT_IF_ERROR(cuEventRecord(event, stream));
 }
 
-extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, CUstream /*stream*/) {
+extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, CUstream /*stream*/,
+                              bool /*isHostShared*/) {
   ScopedContext scopedContext;
   CUdeviceptr ptr = 0;
   if (sizeBytes != 0)
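The CUDA wrapper now accepts the flag but leaves it unused; honoring it is left for follow-up work. A minimal sketch of what that could look like, assuming the file's ScopedContext and CUDA_REPORT_IF_ERROR helpers, with mgpuMemAllocSketch as a hypothetical name:

  extern "C" void *mgpuMemAllocSketch(uint64_t sizeBytes, CUstream /*stream*/,
                                      bool isHostShared) {
    ScopedContext scopedContext;
    CUdeviceptr ptr = 0;
    if (sizeBytes != 0) {
      if (isHostShared)
        // Managed (unified) memory is accessible from both host and device.
        CUDA_REPORT_IF_ERROR(
            cuMemAllocManaged(&ptr, sizeBytes, CU_MEM_ATTACH_GLOBAL));
      else
        CUDA_REPORT_IF_ERROR(cuMemAlloc(&ptr, sizeBytes));
    }
    return reinterpret_cast<void *>(ptr);
  }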
3 changes: 2 additions & 1 deletion mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp
@@ -104,7 +104,8 @@ extern "C" void mgpuEventRecord(hipEvent_t event, hipStream_t stream) {
   HIP_REPORT_IF_ERROR(hipEventRecord(event, stream));
 }
 
-extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, hipStream_t /*stream*/) {
+extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, hipStream_t /*stream*/,
+                              bool /*isHostShared*/) {
   void *ptr;
   HIP_REPORT_IF_ERROR(hipMalloc(&ptr, sizeBytes));
   return ptr;
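The ROCm wrapper likewise receives the flag without acting on it. A sketch of a variant that would honor it, assuming the file's HIP_REPORT_IF_ERROR helper; mgpuMemAllocSketch is again a hypothetical name:

  extern "C" void *mgpuMemAllocSketch(uint64_t sizeBytes,
                                      hipStream_t /*stream*/,
                                      bool isHostShared) {
    void *ptr = nullptr;
    if (isHostShared)
      // Managed memory migrates between host and device on demand.
      HIP_REPORT_IF_ERROR(
          hipMallocManaged(&ptr, sizeBytes, hipMemAttachGlobal));
    else
      HIP_REPORT_IF_ERROR(hipMalloc(&ptr, sizeBytes));
    return ptr;
  }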
(test diff; file name not shown)
@@ -8,7 +8,8 @@ module attributes {gpu.container_module} {
   %0 = gpu.wait async
   // CHECK: %[[gep:.*]] = llvm.getelementptr {{.*}}[%[[size]]]
   // CHECK: %[[size_bytes:.*]] = llvm.ptrtoint %[[gep]]
-  // CHECK: llvm.call @mgpuMemAlloc(%[[size_bytes]], %[[stream]])
+  // CHECK: %[[isHostShared:.*]] = llvm.mlir.constant
+  // CHECK: llvm.call @mgpuMemAlloc(%[[size_bytes]], %[[stream]], %[[isHostShared]])
   %1, %2 = gpu.alloc async [%0] (%size) : memref<?xf32>
   // CHECK: %[[float_ptr:.*]] = llvm.extractvalue {{.*}}[0]
   // CHECK: llvm.call @mgpuMemFree(%[[float_ptr]], %[[stream]])
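For the async path above, the new CHECK lines match IR along these lines (an illustrative sketch with invented SSA names, assuming opaque pointers):

  %isHostShared = llvm.mlir.constant(0 : i8) : i8
  %ptr = llvm.call @mgpuMemAlloc(%size_bytes, %stream, %isHostShared)
      : (i64, !llvm.ptr, i8) -> !llvm.ptr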
3 changes: 2 additions & 1 deletion mlir/test/Conversion/GPUCommon/typed-pointers.mlir
@@ -8,7 +8,8 @@ module attributes {gpu.container_module} {
   %0 = gpu.wait async
   // CHECK: %[[gep:.*]] = llvm.getelementptr {{.*}}[%[[size]]]
   // CHECK: %[[size_bytes:.*]] = llvm.ptrtoint %[[gep]]
-  // CHECK: llvm.call @mgpuMemAlloc(%[[size_bytes]], %[[stream]])
+  // CHECK: %[[isHostShared:.*]] = llvm.mlir.constant
+  // CHECK: llvm.call @mgpuMemAlloc(%[[size_bytes]], %[[stream]], %[[isHostShared]])
   %1, %2 = gpu.alloc async [%0] (%size) : memref<?xf32>
   // CHECK: %[[float_ptr:.*]] = llvm.extractvalue {{.*}}[0]
   // CHECK: %[[void_ptr:.*]] = llvm.bitcast %[[float_ptr]]