Skip to content

AMDGPU: Move enqueued block handling into clang #128519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

arsenm
Copy link
Contributor

@arsenm arsenm commented Feb 24, 2025

The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.

This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.

I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.

We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.

https://reviews.llvm.org/D141700

Copy link
Contributor Author

arsenm commented Feb 24, 2025

@llvmbot
Copy link
Member

llvmbot commented Feb 24, 2025

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.

This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.

I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.

We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.

https://reviews.llvm.org/D141700


Patch is 94.44 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128519.diff

17 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+79-5)
  • (added) clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel-linking.cl (+80)
  • (modified) clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel.cl (+85-70)
  • (modified) llvm/docs/AMDGPUUsage.rst (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+3-3)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUExportKernelRuntimeHandles.cpp (+110)
  • (renamed) llvm/lib/Target/AMDGPU/AMDGPUExportKernelRuntimeHandles.h (+7-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp (+36-8)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h (+7-3)
  • (removed) llvm/lib/Target/AMDGPU/AMDGPUOpenCLEnqueuedBlockLowering.cpp (-136)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+5-5)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-export-kernel-runtime-handles.ll (+62)
  • (removed) llvm/test/CodeGen/AMDGPU/enqueue-kernel.ll (-215)
  • (modified) llvm/test/CodeGen/AMDGPU/hsa-metadata-from-llvm-ir-full.ll (+29-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5-5)
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index dc45def4f3249..428de4690e6ef 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -614,6 +614,41 @@ void AMDGPUTargetCodeGenInfo::setCUDAKernelCallingConvention(
       FT, FT->getExtInfo().withCallingConv(CC_OpenCLKernel));
 }
 
+/// Return IR struct type corresponding to kernel_descriptor_t (See
+/// AMDHSAKernelDescriptor.h)
+static llvm::StructType *getAMDGPUKernelDescriptorType(CodeGenFunction &CGF) {
+  return llvm::StructType::create(
+      CGF.getLLVMContext(),
+      {
+          CGF.Int32Ty,                          // group_segment_fixed_size
+          CGF.Int32Ty,                          // private_segment_fixed_size
+          CGF.Int32Ty,                          // kernarg_size
+          llvm::ArrayType::get(CGF.Int8Ty, 4),  // reserved0
+          CGF.Int64Ty,                          // kernel_code_entry_byte_offset
+          llvm::ArrayType::get(CGF.Int8Ty, 20), // reserved1
+          CGF.Int32Ty,                          // compute_pgm_rsrc3
+          CGF.Int32Ty,                          // compute_pgm_rsrc1
+          CGF.Int32Ty,                          // compute_pgm_rsrc2
+          CGF.Int16Ty,                          // kernel_code_properties
+          llvm::ArrayType::get(CGF.Int8Ty, 6)   // reserved2
+      },
+      "kernel_descriptor_t");
+}
+
+/// Return IR struct type for rtinfo struct in rocm-device-libs used for device
+/// enqueue.
+///
+/// ptr addrspace(1) kernel_object, i32 private_segment_size,
+/// i32 group_segment_size
+
+static llvm::StructType *
+getAMDGPURuntimeHandleType(llvm::LLVMContext &C,
+                           llvm::Type *KernelDescriptorPtrTy) {
+  llvm::Type *Int32 = llvm::Type::getInt32Ty(C);
+  return llvm::StructType::create(C, {KernelDescriptorPtrTy, Int32, Int32},
+                                  "block.runtime.handle.t");
+}
+
 /// Create an OpenCL kernel for an enqueued block.
 ///
 /// The type of the first argument (the block literal) is the struct type
@@ -653,23 +688,29 @@ llvm::Value *AMDGPUTargetCodeGenInfo::createEnqueuedBlockKernel(
     ArgNames.push_back(
         llvm::MDString::get(C, (Twine("local_arg") + Twine(I)).str()));
   }
-  std::string Name = Invoke->getName().str() + "_kernel";
+
+  llvm::Module &Mod = CGF.CGM.getModule();
+  const llvm::DataLayout &DL = Mod.getDataLayout();
+
+  llvm::Twine Name = Invoke->getName() + "_kernel";
   auto *FT = llvm::FunctionType::get(llvm::Type::getVoidTy(C), ArgTys, false);
+
+  // The kernel itself can be internal, the runtime does not directly access the
+  // kernel address (only the kernel descriptor).
   auto *F = llvm::Function::Create(FT, llvm::GlobalValue::InternalLinkage, Name,
-                                   &CGF.CGM.getModule());
+                                   &Mod);
   F->setCallingConv(llvm::CallingConv::AMDGPU_KERNEL);
 
   llvm::AttrBuilder KernelAttrs(C);
   // FIXME: The invoke isn't applying the right attributes either
   // FIXME: This is missing setTargetAttributes
   CGF.CGM.addDefaultFunctionDefinitionAttributes(KernelAttrs);
-  KernelAttrs.addAttribute("enqueued-block");
   F->addFnAttrs(KernelAttrs);
 
   auto IP = CGF.Builder.saveIP();
   auto *BB = llvm::BasicBlock::Create(C, "entry", F);
   Builder.SetInsertPoint(BB);
-  const auto BlockAlign = CGF.CGM.getDataLayout().getPrefTypeAlign(BlockTy);
+  const auto BlockAlign = DL.getPrefTypeAlign(BlockTy);
   auto *BlockPtr = Builder.CreateAlloca(BlockTy, nullptr);
   BlockPtr->setAlignment(BlockAlign);
   Builder.CreateAlignedStore(F->arg_begin(), BlockPtr, BlockAlign);
@@ -692,7 +733,40 @@ llvm::Value *AMDGPUTargetCodeGenInfo::createEnqueuedBlockKernel(
   if (CGF.CGM.getCodeGenOpts().EmitOpenCLArgMetadata)
     F->setMetadata("kernel_arg_name", llvm::MDNode::get(C, ArgNames));
 
-  return F;
+  llvm::Type *KernelDescriptorTy = getAMDGPUKernelDescriptorType(CGF);
+  llvm::StructType *HandleTy = getAMDGPURuntimeHandleType(
+      C, KernelDescriptorTy->getPointerTo(DL.getDefaultGlobalsAddressSpace()));
+  llvm::Constant *RuntimeHandleInitializer =
+      llvm::ConstantAggregateZero::get(HandleTy);
+
+  llvm::Twine RuntimeHandleName = F->getName() + ".runtime.handle";
+
+  // The runtime needs access to the runtime handle as an external symbol. The
+  // runtime handle will need to be made external later, in
+  // AMDGPUExportOpenCLEnqueuedBlocks. The kernel itself has a hidden reference
+  // inside the runtime handle, and is not directly referenced.
+
+  // TODO: We would initialize the first field by declaring F->getName() + ".kd"
+  // to reference the kernel descriptor. The runtime wouldn't need to bother
+  // setting it. We would need to have a final symbol name though.
+  // TODO: Can we directly use an external symbol with getGlobalIdentifier?
+  auto *RuntimeHandle = new llvm::GlobalVariable(
+      Mod, HandleTy,
+      /*isConstant=*/true, llvm::GlobalValue::InternalLinkage,
+      /*Initializer=*/RuntimeHandleInitializer, RuntimeHandleName,
+      /*InsertBefore=*/nullptr, llvm::GlobalValue::NotThreadLocal,
+      DL.getDefaultGlobalsAddressSpace(),
+      /*isExternallyInitialized=*/true);
+
+  llvm::MDNode *HandleAsMD =
+      llvm::MDNode::get(C, llvm::ValueAsMetadata::get(RuntimeHandle));
+  F->setMetadata(llvm::LLVMContext::MD_associated, HandleAsMD);
+
+  RuntimeHandle->setSection(".amdgpu.kernel.runtime.handle");
+
+  CGF.CGM.addUsedGlobal(F);
+  CGF.CGM.addUsedGlobal(RuntimeHandle);
+  return RuntimeHandle;
 }
 
 void CodeGenModule::handleAMDGPUFlatWorkGroupSizeAttr(
diff --git a/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel-linking.cl b/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel-linking.cl
new file mode 100644
index 0000000000000..eeb3c87a0b5e2
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel-linking.cl
@@ -0,0 +1,80 @@
+// Make sure that invoking blocks in static functions with the same name in
+// different modules are linked together.
+
+// RUN: %clang_cc1 -O0 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -fno-ident -DKERNEL_NAME=test_kernel_first -DTYPE=float -DCONST=256.0f -emit-llvm-bc -o %t.0.bc %s
+// RUN: %clang_cc1 -O0 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -fno-ident -DKERNEL_NAME=test_kernel_second -DTYPE=int -DCONST=128.0f -emit-llvm-bc -o %t.1.bc %s
+
+// Make sure nothing strange happens with the linkage choices.
+// RUN: opt -passes=globalopt -o %t.opt.0.bc %t.0.bc
+// RUN: opt -passes=globalopt -o %t.opt.1.bc %t.1.bc
+
+// Check the result of linking
+// RUN: llvm-link -S %t.opt.0.bc %t.opt.1.bc -o - | FileCheck %s
+
+// Make sure that a block invoke used with the same name works in multiple
+// translation units
+
+// CHECK: @llvm.used = appending addrspace(1) global [4 x ptr] [ptr @__static_invoker_block_invoke_kernel, ptr addrspacecast (ptr addrspace(1) @__static_invoker_block_invoke_kernel.runtime.handle to ptr), ptr @__static_invoker_block_invoke_kernel.2, ptr addrspacecast (ptr addrspace(1) @__static_invoker_block_invoke_kernel.runtime.handle.3 to ptr)], section "llvm.metadata"
+
+
+// CHECK: @__static_invoker_block_invoke_kernel.runtime.handle = internal addrspace(1) externally_initialized constant %block.runtime.handle.t zeroinitializer, section ".amdgpu.kernel.runtime.handle"
+// CHECK: @__static_invoker_block_invoke_kernel.runtime.handle.3 = internal addrspace(1) externally_initialized constant %block.runtime.handle.t zeroinitializer, section ".amdgpu.kernel.runtime.handle"
+
+// CHECK: define internal amdgpu_kernel void @__static_invoker_block_invoke_kernel(<{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1) }> %0) #{{[0-9]+}} !associated ![[ASSOC_FIRST_MD:[0-9]+]]
+
+
+// CHECK-LABEL: define internal void @__static_invoker_block_invoke(ptr noundef %.block_descriptor)
+// CHECK: call float @llvm.fmuladd.f32
+
+
+// CHECK-LABEL: define dso_local amdgpu_kernel void @test_kernel_first(
+
+
+// CHECK-LABEL: define internal fastcc void @static_invoker(ptr addrspace(1) noundef %outptr, ptr addrspace(1) noundef %argptr)
+// CHECK:  call i32 @__enqueue_kernel_basic(ptr addrspace(1) %{{[0-9]+}}, i32 %{{[0-9]+}}, ptr %tmp.ascast, ptr addrspacecast (ptr addrspace(1) @__static_invoker_block_invoke_kernel.runtime.handle to ptr), ptr %block.ascast)
+
+// CHECK: declare i32 @__enqueue_kernel_basic(ptr addrspace(1), i32, ptr, ptr, ptr) local_unnamed_addr
+
+
+// CHECK: define internal amdgpu_kernel void @__static_invoker_block_invoke_kernel.2(<{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1) }> %0) #{{[0-9]+}} !associated ![[ASSOC_SECOND_MD:[0-9]+]]
+// CHECK: call void @__static_invoker_block_invoke.4(ptr %
+
+
+// CHECK-LABEL: define internal void @__static_invoker_block_invoke.4(ptr noundef %.block_descriptor)
+// CHECK: mul nsw i32
+// CHECK: sitofp
+// CHECK: fadd
+// CHECK: fptosi
+
+// CHECK-LABEL: define dso_local amdgpu_kernel void @test_kernel_second(ptr addrspace(1) noundef align 4 %outptr, ptr addrspace(1) noundef align 4 %argptr, ptr addrspace(1) noundef align 4 %difference)
+
+// CHECK-LABEL: define internal fastcc void @static_invoker.5(ptr addrspace(1) noundef %outptr, ptr addrspace(1) noundef %argptr) unnamed_addr #{{[0-9]+}} {
+// CHECK: call i32 @__enqueue_kernel_basic(ptr addrspace(1) %{{[0-9]+}}, i32 %{{[0-9]+}}, ptr %tmp.ascast, ptr addrspacecast (ptr addrspace(1) @__static_invoker_block_invoke_kernel.runtime.handle.3 to ptr), ptr %block.ascast)
+
+
+typedef struct {int a;} ndrange_t;
+
+static void static_invoker(global TYPE* outptr, global TYPE* argptr) {
+  queue_t default_queue;
+  unsigned flags = 0;
+  ndrange_t ndrange;
+
+  enqueue_kernel(default_queue, flags, ndrange,
+  ^(void) {
+  global TYPE* f = argptr;
+  outptr[0] = f[1] * f[2] + CONST;
+  });
+}
+
+kernel void KERNEL_NAME(global TYPE *outptr, global TYPE *argptr, global TYPE *difference) {
+  queue_t default_queue;
+  unsigned flags = 0;
+  ndrange_t ndrange;
+
+  static_invoker(outptr, argptr);
+
+  *difference = CONST;
+}
+
+// CHECK: ![[ASSOC_FIRST_MD]] = !{ptr addrspace(1) @__static_invoker_block_invoke_kernel.runtime.handle}
+// CHECK: ![[ASSOC_SECOND_MD]] = !{ptr addrspace(1) @__static_invoker_block_invoke_kernel.runtime.handle.3}
diff --git a/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel.cl b/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel.cl
index ace34dd0ca6dc..b9563f4fde942 100644
--- a/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel.cl
+++ b/clang/test/CodeGenOpenCL/amdgpu-enqueue-kernel.cl
@@ -61,7 +61,13 @@ kernel void test_target_features_kernel(global int *i) {
 }
 
 //.
+// CHECK: @__test_block_invoke_kernel.runtime.handle = internal addrspace(1) externally_initialized constant %block.runtime.handle.t zeroinitializer, section ".amdgpu.kernel.runtime.handle"
+// CHECK: @__test_block_invoke_2_kernel.runtime.handle = internal addrspace(1) externally_initialized constant %block.runtime.handle.t.1 zeroinitializer, section ".amdgpu.kernel.runtime.handle"
+// CHECK: @__test_block_invoke_3_kernel.runtime.handle = internal addrspace(1) externally_initialized constant %block.runtime.handle.t.3 zeroinitializer, section ".amdgpu.kernel.runtime.handle"
+// CHECK: @__test_block_invoke_4_kernel.runtime.handle = internal addrspace(1) externally_initialized constant %block.runtime.handle.t.5 zeroinitializer, section ".amdgpu.kernel.runtime.handle"
 // CHECK: @__block_literal_global = internal addrspace(1) constant { i32, i32, ptr } { i32 16, i32 8, ptr @__test_target_features_kernel_block_invoke }, align 8 #0
+// CHECK: @__test_target_features_kernel_block_invoke_kernel.runtime.handle = internal addrspace(1) externally_initialized constant %block.runtime.handle.t.7 zeroinitializer, section ".amdgpu.kernel.runtime.handle"
+// CHECK: @llvm.used = appending addrspace(1) global [10 x ptr] [ptr @__test_block_invoke_kernel, ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_kernel.runtime.handle to ptr), ptr @__test_block_invoke_2_kernel, ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_2_kernel.runtime.handle to ptr), ptr @__test_block_invoke_3_kernel, ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_3_kernel.runtime.handle to ptr), ptr @__test_block_invoke_4_kernel, ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_4_kernel.runtime.handle to ptr), ptr @__test_target_features_kernel_block_invoke_kernel, ptr addrspacecast (ptr addrspace(1) @__test_target_features_kernel_block_invoke_kernel.runtime.handle to ptr)], section "llvm.metadata"
 // CHECK: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500
 //.
 // NOCPU: Function Attrs: convergent noinline norecurse nounwind optnone
@@ -140,7 +146,7 @@ kernel void test_target_features_kernel(global int *i) {
 // NOCPU-NEXT:    [[BLOCK_CAPTURED1:%.*]] = getelementptr inbounds nuw <{ i32, i32, ptr, ptr addrspace(1), i8 }>, ptr [[BLOCK_ASCAST]], i32 0, i32 4
 // NOCPU-NEXT:    [[TMP3:%.*]] = load i8, ptr [[B_ADDR_ASCAST]], align 1
 // NOCPU-NEXT:    store i8 [[TMP3]], ptr [[BLOCK_CAPTURED1]], align 8
-// NOCPU-NEXT:    [[TMP4:%.*]] = call i32 @__enqueue_kernel_basic(ptr addrspace(1) [[TMP0]], i32 [[TMP1]], ptr [[TMP_ASCAST]], ptr @__test_block_invoke_kernel, ptr [[BLOCK_ASCAST]])
+// NOCPU-NEXT:    [[TMP4:%.*]] = call i32 @__enqueue_kernel_basic(ptr addrspace(1) [[TMP0]], i32 [[TMP1]], ptr [[TMP_ASCAST]], ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_kernel.runtime.handle to ptr), ptr [[BLOCK_ASCAST]])
 // NOCPU-NEXT:    [[TMP5:%.*]] = load ptr addrspace(1), ptr [[DEFAULT_QUEUE_ASCAST]], align 8
 // NOCPU-NEXT:    [[TMP6:%.*]] = load i32, ptr [[FLAGS_ASCAST]], align 4
 // NOCPU-NEXT:    call void @llvm.memcpy.p0.p0.i64(ptr align 4 [[TMP2_ASCAST]], ptr align 4 [[NDRANGE_ASCAST]], i64 4, i1 false)
@@ -162,7 +168,7 @@ kernel void test_target_features_kernel(global int *i) {
 // NOCPU-NEXT:    [[BLOCK_CAPTURED10:%.*]] = getelementptr inbounds nuw <{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }>, ptr [[BLOCK3_ASCAST]], i32 0, i32 5
 // NOCPU-NEXT:    [[TMP10:%.*]] = load i64, ptr [[D_ADDR_ASCAST]], align 8
 // NOCPU-NEXT:    store i64 [[TMP10]], ptr [[BLOCK_CAPTURED10]], align 8
-// NOCPU-NEXT:    [[TMP11:%.*]] = call i32 @__enqueue_kernel_basic(ptr addrspace(1) [[TMP5]], i32 [[TMP6]], ptr [[TMP2_ASCAST]], ptr @__test_block_invoke_2_kernel, ptr [[BLOCK3_ASCAST]])
+// NOCPU-NEXT:    [[TMP11:%.*]] = call i32 @__enqueue_kernel_basic(ptr addrspace(1) [[TMP5]], i32 [[TMP6]], ptr [[TMP2_ASCAST]], ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_2_kernel.runtime.handle to ptr), ptr [[BLOCK3_ASCAST]])
 // NOCPU-NEXT:    [[TMP12:%.*]] = load ptr addrspace(1), ptr [[DEFAULT_QUEUE_ASCAST]], align 8
 // NOCPU-NEXT:    [[TMP13:%.*]] = load i32, ptr [[FLAGS_ASCAST]], align 4
 // NOCPU-NEXT:    call void @llvm.memcpy.p0.p0.i64(ptr align 4 [[TMP11_ASCAST]], ptr align 4 [[NDRANGE_ASCAST]], i64 4, i1 false)
@@ -186,7 +192,7 @@ kernel void test_target_features_kernel(global int *i) {
 // NOCPU-NEXT:    store i64 [[TMP17]], ptr [[BLOCK_CAPTURED19]], align 8
 // NOCPU-NEXT:    [[TMP18:%.*]] = getelementptr [1 x i64], ptr [[BLOCK_SIZES_ASCAST]], i32 0, i32 0
 // NOCPU-NEXT:    store i64 100, ptr [[TMP18]], align 8
-// NOCPU-NEXT:    [[TMP19:%.*]] = call i32 @__enqueue_kernel_varargs(ptr addrspace(1) [[TMP12]], i32 [[TMP13]], ptr [[TMP11_ASCAST]], ptr @__test_block_invoke_3_kernel, ptr [[BLOCK12_ASCAST]], i32 1, ptr [[TMP18]])
+// NOCPU-NEXT:    [[TMP19:%.*]] = call i32 @__enqueue_kernel_varargs(ptr addrspace(1) [[TMP12]], i32 [[TMP13]], ptr [[TMP11_ASCAST]], ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_3_kernel.runtime.handle to ptr), ptr [[BLOCK12_ASCAST]], i32 1, ptr [[TMP18]])
 // NOCPU-NEXT:    [[BLOCK_SIZE22:%.*]] = getelementptr inbounds nuw <{ i32, i32, ptr, i64, ptr addrspace(1) }>, ptr [[BLOCK21_ASCAST]], i32 0, i32 0
 // NOCPU-NEXT:    store i32 32, ptr [[BLOCK_SIZE22]], align 8
 // NOCPU-NEXT:    [[BLOCK_ALIGN23:%.*]] = getelementptr inbounds nuw <{ i32, i32, ptr, i64, ptr addrspace(1) }>, ptr [[BLOCK21_ASCAST]], i32 0, i32 1
@@ -204,7 +210,7 @@ kernel void test_target_features_kernel(global int *i) {
 // NOCPU-NEXT:    [[TMP23:%.*]] = load i32, ptr [[FLAGS_ASCAST]], align 4
 // NOCPU-NEXT:    call void @llvm.memcpy.p0.p0.i64(ptr align 4 [[TMP27_ASCAST]], ptr align 4 [[NDRANGE_ASCAST]], i64 4, i1 false)
 // NOCPU-NEXT:    [[TMP24:%.*]] = load ptr, ptr [[BLOCK20_ASCAST]], align 8
-// NOCPU-NEXT:    [[TMP25:%.*]] = call i32 @__enqueue_kernel_basic(ptr addrspace(1) [[TMP22]], i32 [[TMP23]], ptr [[TMP27_ASCAST]], ptr @__test_block_invoke_4_kernel, ptr [[BLOCK21_ASCAST]])
+// NOCPU-NEXT:    [[TMP25:%.*]] = call i32 @__enqueue_kernel_basic(ptr addrspace(1) [[TMP22]], i32 [[TMP23]], ptr [[TMP27_ASCAST]], ptr addrspacecast (ptr addrspace(1) @__test_block_invoke_4_kernel.runtime.handle to ptr), ptr [[BLOCK21_ASCAST]])
 // NOCPU-NEXT:    ret void
 //
 //
@@ -229,7 +235,7 @@ kernel void test_target_features_kernel(global int *i) {
 //
 // NOCPU: Function Attrs: convergent nounwind
 // NOCPU-LABEL: define {{[^@]+}}@__test_block_invoke_kernel
-// NOCPU-SAME: (<{ i32, i32, ptr, ptr addrspace(1), i8 }> [[TMP0:%.*]]) #[[ATTR5:[0-9]+]] !kernel_arg_addr_space [[META7:![0-9]+]] !kernel_arg_access_qual [[META8:![0-9]+]] !kernel_arg_type [[META9:![0-9]+]] !kernel_arg_base_type [[META9]] !kernel_arg_type_qual [[META10:![0-9]+]] {
+// NOCPU-SAME: (<{ i32, i32, ptr, ptr addrspace(1), i8 }> [[TMP0:%.*]]) #[[ATTR5:[0-9]+]] !associated [[META7:![0-9]+]] !kernel_arg_addr_space [[META8:![0-9]+]] !kernel_arg_access_qual [[META9:![0-9]+]] !kernel_arg_type [[META10:![0-9]+]] !kernel_arg_base_type [[META10]] !kernel_arg_type_qual [[META11:![0-9]+]] {
 // NOCPU-NEXT:  entry:
 // NOCPU-NEXT:    [[TMP1:%.*]] = alloca <{ i32, i32, ptr, ptr addrspace(1), i8 }>, align 8, addrspace(5)
 // NOCPU-NEXT:    store <{ i32, i32, ptr, ptr addrspace(1), i8 }> [[TMP0]], ptr addrspace(5) [[TMP1]], align 8
@@ -265,7 +271,7 @@ kernel void test_target_features_kernel(global int *i) {
 //
 // NOCPU: Function Attrs: convergent nounwind
 // NOCPU-LABEL: define {{[^@]+}}@__test_block_invoke_2_kernel
-// NOCPU-SAME: (<{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }> [[TMP0:%.*]]) #[[ATTR5]] !kernel_arg_addr_space [[META7]] !kernel_arg_access_qual [[META8]] !kernel_arg_type [[META9]] !kernel_arg_base_type [[META9]] !kernel_arg_type_qual [[META10]] {
+// NOCPU-SAME: (<{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }> [[TMP0:%.*]]) #[[ATTR5]] !associated [[META12:![0-9]+]] !kernel_arg_addr_space [[META8]] !kernel_arg_access_qual [[META9]] !kernel_arg_type [[META10]] !kernel_arg_base_type [[META10]] !kernel_arg_type_qual [[META11]] {
 // NOCPU-NEXT:  entry:
 // NOCPU-NEXT:    [[TMP1:%.*]] = alloca <{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }>, align 8, addrspace(5)
 // NOCPU-NEXT:    store <{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }> [[TMP0]], ptr addrspace(5) [[TMP1]], align 8
@@ -307,7 +313,7 @@ kernel void test_target_features_kernel(global int *i) {
 //
 // NOCPU: Function Attrs: convergent nounwind
 // NOCPU-LABEL: define {{[^@]+}}@__test_block_invoke_3_kernel
-// NOCPU-SAME: (<{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }> [[TMP0:%.*]], ptr addrspace(3) [[TMP1:%.*]]) #[[ATTR5]] !kernel_arg_addr_space [[META11:![0-9]+]] !kernel_arg_access_qual [[META12:![0-9]+]] !kernel_arg_type [[META13:![0-9]+]] !kernel_arg_base_type [[META13]] !kernel_arg_type_qual [[META14:![0-9]+]] {
+// NOCPU-SAME: (<{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }> [[TMP0:%.*]], ptr addrspace(3) [[TMP1:%.*]]) #[[ATTR5]] !associated [[META13:![0-9]+]] !kernel_arg_addr_space [[META14:![0-9]+]] !kernel_arg_access_qual [[META15:![0-9]+]] !kernel_arg_type [[META16:![0-9]+]] !kernel_arg_base_type [[META16]] !kernel_arg_type_qual [[META17:![0-9]+]] {
 // NOCPU-NEXT:  entry:
 // NOCPU-NEXT:    [[TMP2:%.*]] = alloca <{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }>, align 8, addrspace(5)
 // NOCPU-NEXT:    store <{ i32, i32, ptr, ptr addrspace(1), ptr addrspace(1), i64, i8 }> [[TMP0]], ptr addrspace(5) [[TMP2]], align 8
@@ -336,7 +342,7 @@ kernel void test_target_features_kernel(global int *i) {
 //
 // NOCPU: Function Attrs: convergent nounwind
 // NOCPU-LABEL: define {{[^@]+}}@__test_block_invoke_4_kernel
-// NOCPU-SAME: (<{ i32, i32, ptr, i64, ptr addrspace(1) }> [[TMP0:%.*]]) #[[ATTR5]] !kernel_arg_addr_space [[META7]] !kernel_arg_access_qual [[...
[truncated]

@arsenm arsenm marked this pull request as ready for review February 24, 2025 14:26
@llvmbot llvmbot added the clang Clang issues not falling into any other category label Feb 24, 2025
@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-opencl-enqueued-block-handling-into-clang branch from 7c230fb to 6b51198 Compare February 24, 2025 15:09
@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-opencl-enqueued-block-handling-into-clang branch 2 times, most recently from 6e193cc to e910f85 Compare March 6, 2025 16:22
@llvmbot llvmbot added the clang:openmp OpenMP related changes to Clang label Mar 6, 2025
/// Return IR struct type for rtinfo struct in rocm-device-libs used for device
/// enqueue.
///
/// ptr addrspace(1) kernel_object, i32 private_segment_size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this from the implicit arguments? Could that be considered constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this isn't in the implicit arguments. It could be constant, but I think I had difficulty getting externally_initialized + constant to work as expected.

I also think in principle the special pass isn't necessary anymore. I think something about visibility was how it ended up this way, and might require a runtime change to fully delete it

@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-opencl-enqueued-block-handling-into-clang branch 2 times, most recently from b4c7b92 to 1b5f2f2 Compare March 7, 2025 12:22
Copy link
Collaborator

@ssahasra ssahasra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have recent exposure to the OpenCL implementation, but generally eyeballed the code to make sure that the changes do what is described. Is there some confidence that the new scheme actually works? For example, maybe an existing CodeGen test where the final HSA metadata remains equivalent, and the runtime can successfully pick it up.

return false;

// FIXME: We shouldn't really need to export the kernel address. We can
// initialize the runtime handle with the kernel descriptorG
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo at the end of the sentence?

@@ -1,23 +1,23 @@
//===- AMDGPUOpenCLEnqueuedBlockLowering.h -----------------------*- C++-*-===//
//===- AMDGPUExportKernelRuntimeHandles.h -----------------------*- C++-*-===//
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the filename and the emacs marking on the first line anymore? I would be very happy to see it dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My emacs certainly needs it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requirement was dropped from the LLVM Coding Standards, and the "standard header" mentioned there does not have it. Also, Emacs has not needed this since at least version 26.

; CHECK-NEXT: .vgpr_count:
; CHECK-NEXT: .vgpr_spill_count: 0
; CHECK-NEXT: .wavefront_size: 64
; CHECK-NOT: device_enqueue_symbol
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all the other CHECK-NEXT lines? Ideally, it would be good to see a single CHECK-NOT between two CHECK or CHECK-LABEL statements.

Copy link
Contributor Author

@arsenm arsenm Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK-NOT is terrible. Ideally we would have comprehensive -NEXT checks

@arsenm
Copy link
Contributor Author

arsenm commented Mar 10, 2025

I don't have recent exposure to the OpenCL implementation, but generally eyeballed the code to make sure that the changes do what is described. Is there some confidence that the new scheme actually works?

Conformance test passes, which is all that really matters

arsenm added 2 commits March 10, 2025 17:49
The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.

This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.

I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.

We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.

https://reviews.llvm.org/D141700
@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-opencl-enqueued-block-handling-into-clang branch from 1b5f2f2 to 2de1a05 Compare March 10, 2025 10:54
@@ -1,23 +1,23 @@
//===- AMDGPUOpenCLEnqueuedBlockLowering.h -----------------------*- C++-*-===//
//===- AMDGPUExportKernelRuntimeHandles.h -----------------------*- C++-*-===//
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requirement was dropped from the LLVM Coding Standards, and the "standard header" mentioned there does not have it. Also, Emacs has not needed this since at least version 26.

Copy link
Contributor Author

arsenm commented Mar 10, 2025

Merge activity

  • Mar 10, 8:52 AM EDT: A user started a stack merge that includes this pull request via Graphite.
  • Mar 10, 8:54 AM EDT: A user merged this pull request with Graphite.

@arsenm arsenm merged commit 0d2c55c into main Mar 10, 2025
12 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/move-opencl-enqueued-block-handling-into-clang branch March 10, 2025 12:54
@llvm-ci
Copy link
Collaborator

llvm-ci commented Mar 10, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building clang,llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/2459

Here is the relevant piece of the build log for the reference
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/38/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-22964-38-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=38 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. clang:openmp OpenMP related changes to Clang clang Clang issues not falling into any other category OpenCL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants