Skip to content

[MLIR] Modify lowering of gpu.alloc op to llvm #69969

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 25, 2023

Conversation

nbpatel
Copy link
Contributor

@nbpatel nbpatel commented Oct 23, 2023

If gpu.alloc has no asyn deependency ( in case if gpu.alloc has hostShared allocation), create a new stream & synchronize. This PR is follow up to #66401

@nbpatel
Copy link
Contributor Author

nbpatel commented Oct 23, 2023

@grypp tagging for review

@llvmbot
Copy link
Member

llvmbot commented Oct 23, 2023

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Nishant Patel (nbpatel)

Changes

If gpu.alloc has no asyn deependency ( this is true if gpu.alloc has hostShared allocation), create a new stream. This PR is follow up to #66401


Full diff: https://github.com/llvm/llvm-project/pull/69969.diff

1 Files Affected:

  • (modified) mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp (+16-2)
diff --git a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
index 097caf23edfa5dd..da1c468ed1dfd71 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
+++ b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
@@ -836,7 +836,11 @@ LogicalResult ConvertAllocOpToGpuRuntimeCallPattern::matchAndRewrite(
   // Allocate the underlying buffer and store a pointer to it in the MemRef
   // descriptor.
   Type elementPtrType = this->getElementPtrType(memRefType);
-  auto stream = adaptor.getAsyncDependencies().front();
+
+  Value stream =
+      adaptor.getAsyncDependencies().empty()
+          ? streamCreateCallBuilder.create(loc, rewriter, {}).getResult()
+          : adaptor.getAsyncDependencies().front();
 
   auto isHostShared = rewriter.create<mlir::LLVM::ConstantOp>(
       loc, llvmInt8Type, rewriter.getI8IntegerAttr(isShared));
@@ -855,7 +859,17 @@ LogicalResult ConvertAllocOpToGpuRuntimeCallPattern::matchAndRewrite(
   auto memRefDescriptor = this->createMemRefDescriptor(
       loc, memRefType, allocatedPtr, alignedPtr, shape, strides, rewriter);
 
-  rewriter.replaceOp(allocOp, {memRefDescriptor, stream});
+  if (allocOp.getAsyncToken()) {
+    // Async alloc: make dependent ops use the same stream.
+    rewriter.replaceOp(allocOp, {memRefDescriptor, stream});
+  } else {
+    // Synchronize with host and destroy stream. This must be the stream created
+    // above (with no other uses) because we check that the synchronous version
+    // does not have any async dependencies.
+    streamSynchronizeCallBuilder.create(loc, rewriter, {stream});
+    streamDestroyCallBuilder.create(loc, rewriter, {stream});
+    rewriter.replaceOp(allocOp, {memRefDescriptor});
+  }
 
   return success();
 }

Copy link
Member

@grypp grypp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about this PR. It creates and syncs a stream, but it's never used in our runtimes, including the new sycl runtime.

We should avoid creating and syncing a stream when it's not necessary, as it has overhead.

auto stream = adaptor.getAsyncDependencies().front();

Value stream =
adaptor.getAsyncDependencies().empty()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit unclear of creating stream. If the stream isn't available, my understanding is that we would prefer synchronous execution. In such a scenario, we wouldn't need the stream. At least in CUDA model, you don't have to do that, the cuda driver/runtime can use existing default stream implicitly.

Can we pass nullptr for stream when it's not available? Alternatively, we might create another API in the runtimes.

Afaik, creating and synchronizing stream is not free. We better avoid if it is not necessary.

For example, sycl runtime (below) that doesn't utilize the 'stream' (aka sycl::queue *queue).

static void *allocDeviceMemory(sycl::queue *queue, size_t size, bool isShared) {
  void *memPtr = nullptr;
  if (isShared) {
    memPtr = sycl::aligned_alloc_shared(64, size, getDefaultDevice(), getDefaultContext());
  } else {
    memPtr = sycl::aligned_alloc_device(64, size, getDefaultDevice(), getDefaultContext());
  }
  if (memPtr == nullptr) {
    throw std::runtime_error("mem allocation failed!");
  }
  return memPtr;
}

mgpuMemAlloc(uint64_t size, sycl::queue *queue, bool isShared) {
  return catchAll([&]() {
    return allocDeviceMemory(queue, static_cast<size_t>(size), true);
  });
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I agree. I made the change. I added the stream in the first place is because all the runtime signatures for mgpuMemAlloc expects a stream (albeit it doesn't use it now) ...but in future if they do require the stream passed for allocation ...but I agree for now we can pass nullptr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I agree. I made the change.

Thanks for doing that! It looks better now. We just need a test for this PR as it’s regular llvm code submission policy.

I added the stream in the first place is because all the runtime signatures for mgpuMemAlloc expects a stream (albeit it doesn't use it now) ...but in future if they do require the stream passed for allocation ...but I agree for now we can pass nullptr

Would you like to separate runtime APIs into mgpuMemAlloc and mgpuMemAllocAsync? The first one doesn't need a stream in the function declaration, while the second one works asynchronously and requires a stream. This is an additional work if you are interested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the test. Yes I am interested to do it but in a separate PR once all the PR's for #65539 are merged

@nbpatel nbpatel requested a review from grypp October 24, 2023 18:17
Copy link
Member

@grypp grypp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks looks good

@nbpatel
Copy link
Contributor Author

nbpatel commented Oct 25, 2023

@grypp can you help me merge this?

@grypp grypp merged commit ced9f4f into llvm:main Oct 25, 2023
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Oct 26, 2023
If gpu.alloc has no asyn deependency ( in case if gpu.alloc has
hostShared allocation), create a new stream & synchronize. This PR is
follow up to llvm#66401
@nbpatel nbpatel deleted the nishant_changeLowering branch April 16, 2025 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants