[Offload][AMDGPU] Only allow memory pool access to valid agents #93969

Merged: 1 commit merged into llvm:main on May 31, 2024.

jhuber6 (Contributor) commented May 31, 2024

Summary:
Since the next-gen plugins were added, every kernel agent was granted
access to any memory we allocated from a memory pool. This is necessary
for things like fine-grained memory and to facilitate
device-to-device (D2D) copies.
However, there are cases where an agent cannot legally access a memory
pool. We have a debug check for this, but it would always be triggered
in these situations because both callers of enableAccess simply passed
every kernel agent. This patch changes the behavior so that memory pool
access is only enabled for agents that can actually access the pool.
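The access rule this patch applies can be sketched in standalone C++. This is a minimal sketch, not the plugin's code: it assumes a hypothetical AccessLevel enum that mirrors the three values of HSA's hsa_amd_memory_pool_access_t, and a hypothetical AccessQuery callback standing in for hsa_amd_agent_memory_pool_get_info. As in the canAccess helper added by this patch, a failed query is treated as "cannot access", and only "never allowed" excludes an agent.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical stand-in for hsa_amd_memory_pool_access_t. The names mirror
// the HSA enumerators, but this is not the real HSA header.
enum class AccessLevel {
  NeverAllowed,       // like HSA_AMD_MEMORY_POOL_ACCESS_NEVER_ALLOWED
  AllowedByDefault,   // like HSA_AMD_MEMORY_POOL_ACCESS_ALLOWED_BY_DEFAULT
  DisallowedByDefault // like HSA_AMD_MEMORY_POOL_ACCESS_DISALLOWED_BY_DEFAULT
};

using AgentId = std::uint64_t; // hypothetical agent handle

// Hypothetical query: returns std::nullopt when the runtime call fails.
using AccessQuery = std::function<std::optional<AccessLevel>(AgentId)>;

// An agent is eligible unless the query fails or the pool reports that
// access is never allowed. DisallowedByDefault still counts as eligible,
// since access for such agents has to be enabled explicitly.
bool canAccess(const AccessQuery &Query, AgentId Agent) {
  std::optional<AccessLevel> Access = Query(Agent);
  if (!Access)
    return false; // treat a failed query as "no access"
  return *Access != AccessLevel::NeverAllowed;
}
```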

llvmbot (Member) commented May 31, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-offload

Author: Joseph Huber (jhuber6)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/93969.diff

1 file affected:

  • (modified) offload/plugins-nextgen/amdgpu/src/rtl.cpp (+27-10)
diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index 2a9503333c199..c6dd954746e4a 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -307,6 +307,15 @@ struct AMDGPUMemoryPoolTy {
     return Plugin::check(Status, "Error in hsa_amd_memory_pool_free: %s");
   }
 
+  /// Returns if the \p Agent can access the memory pool.
+  bool canAccess(hsa_agent_t Agent) {
+    hsa_amd_memory_pool_access_t Access;
+    if (hsa_amd_agent_memory_pool_get_info(
+            Agent, MemoryPool, HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS, &Access))
+      return false;
+    return Access != HSA_AMD_MEMORY_POOL_ACCESS_NEVER_ALLOWED;
+  }
+
   /// Allow the device to access a specific allocation.
   Error enableAccess(void *Ptr, int64_t Size,
                      const llvm::SmallVector<hsa_agent_t> &Agents) const {
@@ -3407,10 +3416,14 @@ void *AMDGPUMemoryManagerTy::allocate(size_t Size, void *HstPtr,
   }
   assert(Ptr && "Invalid pointer");
 
-  auto &KernelAgents = Plugin.getKernelAgents();
+  // Get a list of agents that can access this memory pool.
+  llvm::SmallVector<hsa_agent_t> Agents;
+  llvm::copy_if(
+      Plugin.getKernelAgents(), std::back_inserter(Agents),
+      [&](hsa_agent_t Agent) { return MemoryPool->canAccess(Agent); });
 
-  // Allow all kernel agents to access the allocation.
-  if (auto Err = MemoryPool->enableAccess(Ptr, Size, KernelAgents)) {
+  // Allow all valid kernel agents to access the allocation.
+  if (auto Err = MemoryPool->enableAccess(Ptr, Size, Agents)) {
     REPORT("%s\n", toString(std::move(Err)).data());
     return nullptr;
   }
@@ -3450,13 +3463,17 @@ void *AMDGPUDeviceTy::allocate(size_t Size, void *, TargetAllocTy Kind) {
   }
 
   if (Alloc) {
-    auto &KernelAgents =
-        static_cast<AMDGPUPluginTy &>(Plugin).getKernelAgents();
-    // Inherently necessary for host or shared allocations
-    // Also enabled for device memory to allow device to device memcpy
-
-    // Enable all kernel agents to access the buffer.
-    if (auto Err = MemoryPool->enableAccess(Alloc, Size, KernelAgents)) {
+    // Get a list of agents that can access this memory pool. Inherently
+    // necessary for host or shared allocations Also enabled for device memory
+    // to allow device to device memcpy
+    llvm::SmallVector<hsa_agent_t> Agents;
+    llvm::copy_if(static_cast<AMDGPUPluginTy &>(Plugin).getKernelAgents(),
+                  std::back_inserter(Agents), [&](hsa_agent_t Agent) {
+                    return MemoryPool->canAccess(Agent);
+                  });
+
+    // Enable all valid kernel agents to access the buffer.
+    if (auto Err = MemoryPool->enableAccess(Alloc, Size, Agents)) {
       REPORT("%s\n", toString(std::move(Err)).data());
       return nullptr;
     }
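Both call sites build the filtered list with llvm::copy_if, which is LLVM's range-based wrapper around std::copy_if. The following is a minimal sketch of the same filtering step using only the standard library. It assumes a hypothetical Agent struct (with a made-up CanAccessPool field) in place of hsa_agent_t plus a runtime query, and std::vector in place of llvm::SmallVector.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical agent record. The real plugin holds hsa_agent_t handles and
// asks the runtime whether each one can access the pool.
struct Agent {
  std::uint64_t Handle;
  bool CanAccessPool; // stand-in for the result of a canAccess() query
};

// Iterator-based equivalent of
//   llvm::copy_if(KernelAgents, std::back_inserter(Agents), Pred);
// keeping only the agents that pass the access check.
std::vector<Agent> filterAccessibleAgents(const std::vector<Agent> &KernelAgents) {
  std::vector<Agent> Agents;
  std::copy_if(KernelAgents.begin(), KernelAgents.end(),
               std::back_inserter(Agents),
               [](const Agent &A) { return A.CanAccessPool; });
  return Agents;
}
```

The filtered list, rather than the full kernel-agent list, is what the patch passes to enableAccess, so agents that can never access the pool no longer trip the debug check.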

jdoerfert (Member) left a comment:

LG, but I think this should be reworked to minimize what we do per allocation.
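One possible direction for this review comment, sketched here as a hypothetical and not what the merged patch does: compute the list of accessible agents once per memory pool and reuse it for every allocation, so the per-agent access query is not repeated. The PoolAgentCache class, AgentId alias, and the callback are illustrative names only.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using AgentId = std::uint64_t; // hypothetical agent handle

// Hypothetical per-pool cache: the access check runs once per kernel agent,
// on first use, instead of on every allocation.
class PoolAgentCache {
public:
  PoolAgentCache(std::vector<AgentId> Agents,
                 std::function<bool(AgentId)> Check)
      : KernelAgents(std::move(Agents)), CanAccess(std::move(Check)) {}

  // Returns the agents allowed to access this pool, building the list lazily.
  const std::vector<AgentId> &accessibleAgents() {
    if (!Cached) {
      std::vector<AgentId> Allowed;
      for (AgentId A : KernelAgents)
        if (CanAccess(A))
          Allowed.push_back(A);
      Cached = std::move(Allowed);
    }
    return *Cached;
  }

private:
  std::vector<AgentId> KernelAgents;
  std::function<bool(AgentId)> CanAccess;
  std::optional<std::vector<AgentId>> Cached;
};
```

This sketch is not thread-safe; if allocations can happen concurrently, the list would need a lock or could be computed eagerly when the pool is initialized.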

jhuber6 merged commit e19565c into llvm:main on May 31, 2024 (8 checks passed).
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 2, 2024
…#93969)

Change-Id: I4761963f82a2c8ddcf152ba254f6d662c495dd4a
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 19, 2024
…#93969)

Change-Id: I4761963f82a2c8ddcf152ba254f6d662c495dd4a