[MLIR] Enabling Intel GPU Integration. #65539
Conversation
Add CMake option MLIR_ENABLE_SYCL_RUNNER
@rengolin @joker-eph @Hardcode84 FYI: this PR enables the current GPU dialect on Intel GPUs as is, without introducing a stream/queue into the current GPU dialect.
I suggest extracting …
Agree. And the changes are originally from different authors, so they should be split.
LGTM overall. This should likely get reviewed by @antiagainst / @kuhar; and it'd be great if you could split the independent changes and send them ahead of the e2e integration.
  return spvModName->consume_front("__spv__") && spvModName == name;
};
auto spvMods = mod.getOps<spirv::ModuleOp>();
auto it = llvm::find_if(spvMods, isSameMod);
This is really costly, can you build a symbol table once at the beginning of the function and use it for the queries instead?
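A minimal sketch of what this suggests, assuming the SPIR-V module is named "__spv__" + the gpu.module name, as the predicate above implies (a sketch against MLIR's SymbolTable API, not the patch's code):

```cpp
#include "mlir/IR/SymbolTable.h"

// Build the symbol table once per top-level module (O(n)); each lookup is
// then a hash-map query instead of a linear scan over all spirv::ModuleOp
// ops for every gpu.module.
mlir::SymbolTable symbolTable(mod);
// For each gpu.module with name `name`:
auto spvMod = symbolTable.lookup<mlir::spirv::ModuleOp>(
    (llvm::Twine("__spv__") + name).str());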
      spvMod->erase();
    }
  }
};
@fabianmcg : how would that fit in the new serialization flow?
GpuSerializeToSPIRVPass can definitely be integrated as a SPIR-V target attribute. Basically, GpuSerializeToSPIRVPass::run would be the body of SPIRVTargetAttr::serializeToObject. However, we might need to modify the signature of serializeToObject to include a symbol table to make the SPIR-V serialization efficient; then serializeToObject could query the SPIR-V module name directly. With that, gpu-module-to-binary would be able to take care of the serialization, as it does for NVIDIA & AMD.
Then we also need to modify SelectObjectAttr. I think the only modification needed is adding the extra parameters in mgpuModuleLoad, etc. Another option is creating a dedicated SPIRVObjectAttr, but so far I don't think it's needed.
Overall, I think those are the only two changes we would need to integrate this patch into the new workflow.
Awesome! @silee2 : can you adjust in this direction? I'm sure Fabian can support you if you have more questions!
@silee2 here are the steps:
- Implement a target attribute, see for example: NVVMTargetAttr. The idea of this attribute is to hold properties intrinsic to the target, like triple, chip, flags, etc.
- Add a pass to attach the target to a module, see: GpuNVVMAttachTarget and Dialect/GPU/Transforms/NVVMAttachTarget.cpp. The idea of this pass is to attach the SPIRV target to GPU modules, so it must know how to create them.
- We're currently implementing TargetAttrs as external models to keep libraries separated (see NVVM/Target.cpp), so GpuSerializeToSPIRVPass::run would live there; a rough sketch follows this list.
- Modify getModuleLoadFn & createKernelLaunch appropriately in SelectObjectAttr.cpp#L125-L15 instead of adding the changes in GPUToLLVMCommon.
- Then the compilation workflow should look something similar to this: GPU: Compilation Overview.

I'll take care of adding a pointer to the top module symbol table so it can be used by the SPIRVTarget.

If you have any questions just ping me on Discord or Discourse, @fabianmc.
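The external-model registration mentioned in the third step would look roughly like this, mirroring NVVM/Target.cpp. Every SPIR-V-specific name below (SPIRVTargetAttr, SPIRVTargetAttrImpl, the registration function) is an assumption about code that does not exist yet:

```cpp
namespace {
// Fallback model so the GPU library doesn't link against the SPIR-V target.
struct SPIRVTargetAttrImpl
    : public mlir::gpu::TargetAttrInterface::FallbackModel<SPIRVTargetAttrImpl> {
  // The body would essentially be today's GpuSerializeToSPIRVPass::run.
  std::optional<llvm::SmallVector<char, 0>>
  serializeToObject(mlir::Attribute attr, mlir::Operation *module,
                    const mlir::gpu::TargetOptions &options) const;
};
} // namespace

// Attach the interface lazily, only when the SPIR-V dialect is loaded.
void registerSPIRVTargetInterfaceExternalModels(mlir::DialectRegistry &registry) {
  registry.addExtension(+[](mlir::MLIRContext *ctx,
                            mlir::spirv::SPIRVDialect *dialect) {
    SPIRVTargetAttr::attachInterface<SPIRVTargetAttrImpl>(*ctx);
  });
}
```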
@@ -71,7 +71,8 @@ void GPUToSPIRVPass::runOnOperation() {
   std::unique_ptr<ConversionTarget> target =
       spirv::getMemorySpaceToStorageClassTarget(*context);
   spirv::MemorySpaceToStorageClassMap memorySpaceMap =
-      spirv::mapMemorySpaceToVulkanStorageClass;
+      this->useOpenCL ? spirv::mapMemorySpaceToOpenCLStorageClass :
+                        spirv::mapMemorySpaceToVulkanStorageClass;
Can you point out which unit-test is covering this? Can you send this in a separate PR?
Created #66445, which includes the code change and a unit test covering it.
@@ -1158,7 +1178,7 @@ LogicalResult ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite(
       {function.getResult(), adaptor.getGridSizeX(), adaptor.getGridSizeY(),
        adaptor.getGridSizeZ(), adaptor.getBlockSizeX(), adaptor.getBlockSizeY(),
        adaptor.getBlockSizeZ(), dynamicSharedMemorySize, stream, kernelParams,
-       /*extra=*/nullpointer});
+       /*extra=*/nullpointer, paramsCount});
I see a unit test for this, but can you send everything related to paramsCount in a separate PR?
   Value allocatedPtr =
-      allocCallBuilder.create(loc, rewriter, {sizeBytes, stream}).getResult();
+      allocCallBuilder.create(loc, rewriter, {sizeBytes, stream, isHostShared})
+          .getResult();
I see a unit test for this, but can you send everything related to isHostShared in a separate PR?
Regarding host_shared, I noticed this code in the examples:

%memref, %asyncToken = gpu.alloc async [%0] host_shared (): memref<3x3xi64>

Can SYCL's runtime allocate host_shared data asynchronously? It might be a good idea to prevent the use of host_shared and async together. FWIW, CUDA and HIP cannot do that. As far as I can see from the PR, the async queue is not used in the SYCL runtime when allocating host_shared, so I am guessing there is no async allocation here either.

Nonetheless, having async on gpu.alloc is perfectly acceptable. CUDA does support asynchronous device memory allocation.
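For reference, the asynchronous CUDA allocation referred to here is the stream-ordered allocator; a minimal host-side sketch, where `sizeBytes` and `stream` are assumed to exist in the surrounding code:

```cpp
// CUDA's stream-ordered allocator (CUDA 11.2+): the allocation and free are
// ordered on the stream, which is what an async gpu.alloc can map to.
void *devPtr = nullptr;
cudaMallocAsync(&devPtr, sizeBytes, stream);
// ... enqueue kernels that use devPtr on the same stream ...
cudaFreeAsync(devPtr, stream);
```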
Technically, SYCL provides more abstract memory management with sycl::buffer and sycl::accessor, which define an implicit asynchronous task graph. The allocation details are left to the implementation: whether allocation is asynchronous or synchronous is up to the implementers.

Here the lower-level synchronous USM memory management API of SYCL is used instead, similar to CUDA/HIP memory management.
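For readers unfamiliar with the buffer/accessor model described above, a small self-contained illustration (the queue, element type, and sizes are illustrative, not from the patch):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  sycl::buffer<int, 1> buf{sycl::range<1>{9}};
  // The runtime derives the task graph from accessor dependencies; when and
  // how the backing memory is allocated is left to the implementation.
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{buf, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>{9},
                     [=](sycl::id<1> i) { acc[i] = static_cast<int>(i[0]); });
  });
  // Reading through a host_accessor waits for the kernel automatically.
  sycl::host_accessor host{buf, sycl::read_only};
  return host[0];
}
```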
So, should the async allocation in the example be synchronous instead?
Technically, SYCL provides more abstract memory management with sycl::buffer and sycl::accessor, which define an implicit asynchronous task graph. The allocation details are left to the implementation: whether allocation is asynchronous or synchronous is up to the implementers.

I haven't touched SYCL much; thanks for the explanation. Creating a task graph implicitly sounds interesting. In this case, SYCL users are aware of asynchrony while writing their program; in CUDA (or HIP), users choose sync or async execution.

Here the lower-level synchronous USM memory management API of SYCL is used instead, similar to CUDA/HIP memory management.

Yes, that's correct. I don't think there is a USM API that can allocate asynchronously.
So, should the async allocation in the example be synchronous instead?
Yes, I think this is the correct behaviour. We can disallow host_shared and async together on the op; see the verifier sketch after the examples below.
Here are the possible IRs:
// Valid
%memref = gpu.alloc host_shared (): memref<3x3xi64>
// Valid
%memref = gpu.alloc (): memref<3x3xi64>
// Invalid, USM managers don't allocate async
%memref, %asyncToken = gpu.alloc async [%0] host_shared (): memref<3x3xi64>
// Valid, only for CUDA. Afaik, SYCL or HIP cannot do that
%memref, %asyncToken = gpu.alloc async [%0] (): memref<3x3xi64>
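A sketch of the verifier this would imply on gpu.alloc; the accessor names follow MLIR's generated-accessor conventions and are assumptions until the check actually lands:

```cpp
LogicalResult gpu::AllocOp::verify() {
  // Reject the invalid combination from the examples above: host_shared
  // USM allocations are synchronous, so they cannot produce async tokens.
  if (getHostShared() && getAsyncToken())
    return emitOpError("host_shared allocation cannot be async");
  return success();
}
```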
The upstream GPUToLLVMConversion lowering does not support lowering gpu.alloc ops that are not async:
https://github.com/llvm/llvm-project/blob/main/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp#L797
I guess that if the runtime actually uses synchronous allocation behind the scenes and produces an always-ready async token, it works, even if it is not optimal.
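Concretely, the wrapper can allocate synchronously via USM, and the "always-ready token" falls out of the lowering because nothing is enqueued on the queue. A sketch following the mgpu* naming convention of the CUDA/ROCm wrappers; the SYCL body here is an assumption, not the patch's exact code:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, sycl::queue *queue,
                              bool isHostShared) {
  // Both USM calls below are synchronous; the async token produced by
  // gpu.alloc simply has no pending work associated with it.
  if (isHostShared)
    return sycl::malloc_shared(sizeBytes, *queue);
  return sycl::malloc_device(sizeBytes, *queue);
}
```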
The upstream GPUToLLVMConversion lowering does not support lowering gpu.alloc ops that are not async.

Would that work if we omitted that check when host_shared is present?
Quite interesting!
At some point it would be nice to have a design document or documentation somewhere explaining how all these MLIR runners work, including this one.
Globally, this PR adds a SYCL runner, but it is very specific to Intel Level Zero.
It would be nice to have some generalization in the future, like SYCL using the OpenCL interoperability interface to run the SPIR-V kernels, or even native kernels.
@@ -116,6 +116,7 @@ add_definitions(-DMLIR_ROCM_CONVERSIONS_ENABLED=${MLIR_ENABLE_ROCM_CONVERSIONS})

 set(MLIR_ENABLE_CUDA_RUNNER 0 CACHE BOOL "Enable building the mlir CUDA runner")
 set(MLIR_ENABLE_ROCM_RUNNER 0 CACHE BOOL "Enable building the mlir ROCm runner")
+set(MLIR_ENABLE_SYCL_RUNNER 0 CACHE BOOL "Enable building the mlir Sycl runner")
Please spell SYCL correctly.
Suggested change:
-set(MLIR_ENABLE_SYCL_RUNNER 0 CACHE BOOL "Enable building the mlir Sycl runner")
+set(MLIR_ENABLE_SYCL_RUNNER 0 CACHE BOOL "Enable building the mlir SYCL runner")

One could argue that mlir should be spelled MLIR, but that train seems to have left a long time ago. :-)
// Create global device and context
sycl::device syclDevice = getDefaultDevice();
sycl::context syclContext = sycl::context(syclDevice);
Suggested change:
-sycl::context syclContext = sycl::context(syclDevice);
+sycl::context syclContext { syclDevice };
auto kernel = sycl::make_kernel<sycl::backend::ext_oneapi_level_zero>(
    {kernelBundle, zeKernel}, syclContext);
syclKernel = new sycl::kernel(kernel);
Why this dynamic memory allocation? Most SYCL objects already have reference semantics: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics
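In other words, a sketch of what the reviewer is pointing at, reusing the variables from the quoted patch code: sycl::kernel is a cheap reference-counted handle, so no heap allocation is needed:

```cpp
// sycl::kernel follows SYCL 2020 common reference semantics, so it can be
// stored and returned by value; copying only bumps a reference count.
sycl::kernel kernel = sycl::make_kernel<sycl::backend::ext_oneapi_level_zero>(
    {kernelBundle, zeKernel}, syclContext);
return kernel; // instead of `new sycl::kernel(kernel)`
```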
sycl::nd_range<3> syclNdRange(
    sycl::nd_range<3>(syclGlobalRange, syclLocalRange));
Suggested change:
-sycl::nd_range<3> syclNdRange(
-    sycl::nd_range<3>(syclGlobalRange, syclLocalRange));
+sycl::nd_range<3> syclNdRange(syclGlobalRange, syclLocalRange);
  return syclKernel;
}

static void launchKernel(QUEUE *queue, sycl::kernel *kernel, size_t gridX,
Suggested change:
-static void launchKernel(QUEUE *queue, sycl::kernel *kernel, size_t gridX,
+static void launchKernel(QUEUE queue, sycl::kernel kernel, size_t gridX,

or even use & if you are afraid of relying on the reference semantics of SYCL behind the scenes.
                         size_t blockY, size_t blockZ, size_t sharedMemBytes,
                         void **params, size_t paramsCount) {
  auto syclGlobalRange =
      ::sycl::range<3>(blockZ * gridZ, blockY * gridY, blockX * gridX);
Why are you sometimes using ::sycl and sometimes sycl?
#pragma clang diagnostic pop

struct QUEUE {
Why this spelling? Coding standard?
Why do you need this object? In the end, this looks like a std::optional<sycl::queue>.
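i.e., something along these lines — a sketch of the reviewer's reading, not the patch's code, with syclContext and syclDevice taken from the surrounding file:

```cpp
#include <optional>
#include <sycl/sycl.hpp>

// QUEUE is, in effect, std::optional<sycl::queue>: a sycl::queue already
// holds references to its device and context.
std::optional<sycl::queue> queue;
queue.emplace(syclContext, syclDevice); // construct on first use
```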
@@ -187,4 +187,8 @@ def GpuROCDLAttachTarget: Pass<"rocdl-attach-target", ""> {
   ];
 }

+def GpuSerializeToSPIRVPass : Pass<"gpu-serialize-to-spirv", "ModuleOp"> {
+  let summary = "Serialize spirv dialect to spirv binary";
Suggested change:
-let summary = "Serialize spirv dialect to spirv binary";
+let summary = "Serialize spirv dialect to SPIR-Vbinary";
Missing a space?
#
# Example usage:
#
#   find_package(SyclRuntime)
Shouldn't it be
-# find_package(SyclRuntime)
+# find_package(SYCLRuntime)
everywhere?
auto spvMod = *it;

spvBinary.clear();
// serialize the spv module to spv binary
Suggested change:
-// serialize the spv module to spv binary
+// serialize the SPIR-V module to SPIR-V binary
everywhere.
The idea is to eventually consolidate all runners into one. This PR is just another piece of the puzzle. Once we're all happy with how the runners work, we should common them up using command line options to select the "type" and CMake options to enable particular runner types (depending on the runtimes and hardware available).
Agreed! The SYCL runtime here is just being used to abstract the LevelZero calls, but this work will be helpful when adding a full SYCL runner (actual language extensions and libraries) to other CPUs/GPUs later.
CI failure looks like a Buildkite issue?
Agree. The key point of this PR is not to create yet another runner, but to enable the existing mlir-cpu-runner to run LLVM host code plus device kernels on a SYCL platform. The current implementation uses L0 but could be extended. I wouldn't call this adding a "SYCL runner"; it adds a SYCL backend to mlir-cpu-runner. The term "xxx runner" in MLIR is a bit overused: mlir-cpu-runner doesn't run only on CPU, it runs both the host code and the device kernel, so the name is a bit misleading. We really just need an "mlir-runner" that takes the target platform as a command parameter.
}

#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wglobal-constructors"
I don't think we have this anywhere in the codebase right now, can you remove it and use lazy initialization instead?
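A sketch of the lazy initialization being asked for: function-local statics construct on first use, so no constructor runs at shared-library load time and -Wglobal-constructors stays quiet. getDefaultDevice is the helper from the patch; the wrapper function names are illustrative:

```cpp
static sycl::device &getDefaultSyclDevice() {
  // Constructed on the first call, not at library load time.
  static sycl::device device = getDefaultDevice();
  return device;
}

static sycl::context &getDefaultSyclContext() {
  static sycl::context context{getDefaultSyclDevice()};
  return context;
}
```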
This naming stems from the fact that Vulkan and SPIR-V still have dedicated runners on the model of the original cuda-runner, but I suspect this is just legacy?
What kind of "target-platform" command parameters do you have in mind? (other than what we do now)
The current way mlir-cpu-runner uses the shared library name to indicate the target platform looks good to me: CUDA, ROCm, and SYCL with this PR. Vulkan could be added the same way. mlir-spirv-cpu-runner could be refactored into mlir-opt passes that generate the SPIR-V binary and feed it to mlir-cpu-runner. If we reach that state, the name "mlir-cpu-runner" could be promoted to "mlir-runner". That would clear up a lot of misunderstanding: "mlir-runner" really runs MLIR programs on a target platform, and has little to do with enabling specific language or runtime features of that platform. For example, when we say "SYCL runner", people think it is about running SYCL programs on top of MLIR.
…oduleToBinary (#65797) This patch adds the option of building an optional symbol table for the top operation in the `gpu-module-to-binary` pass. The table is not created by default as most targets don't need it; instead, it is lazily built. The table is passed through a callback in `TargetOptions`. This patch is required to integrate #65539 .
…#66154) This PR is a breakdown of the big PR #65539, which enables Intel GPU integration. In this PR we pass the count of parameters and the size of the GPU binary to the runtime wrappers, since the SyclRuntimeWrappers (which will come in a subsequent PR) require the SPIR-V size for compilation and the number of parameters to iterate over the params.
…#66401) This PR is a breakdown of the big PR #65539, which enables Intel GPU integration. In this PR we pass the hostShared flag to the runtime wrappers (required by SyclRuntimeWrappers, which will come in a subsequent PR) to indicate whether the allocation is done on host-shared GPU memory or device-only memory.
Thanks for breaking down this pull request into various smaller pieces to make it easier to review. I looked at various pieces; LGTM. Looking forward to seeing this supported! :) from @joker-eph:
from @Jianhui-Li:
Yup; it's legacy. +1 to the idea of unifying! I've created #73457 to track this issue and make it more visible. I might not have the bandwidth to work on this; if somebody else is interested, that'd be nice! So I marked it as a "good first issue" (not sure whether I'm pushing the limits of "good first issue", but hoping to get traction there).
Closing as all sub-components have been merged.
MLIR currently doesn't have a GPU runtime for Intel GPUs.
This PR adds the components required to lower from the GPU dialect and execute on Intel GPUs, including a new SYCL runtime wrapper.
List of changes:
This PR is monolithic for now and will be broken down into smaller pieces as needed.
Integration workflow:
The GPU dialect lowering and execution workflow for Intel GPUs added by this PR works as follows. It is similar to the CUDA and ROCm workflows, but uses the spirv dialect for the device-code IR instead of the llvm dialect.
Device code is lowered from gpu.func to the spirv dialect using the existing convert-gpu-to-spirv pass.
Then the gpu-serialize-to-spirv pass serializes the spirv module and attaches the resulting binary blob as an attribute to the gpu.module.
Host code is lowered by the gpu-to-llvm pass, and GPU dialect ops are converted into llvm calls to the various entries in the SYCL runtime wrapper.
The resulting code, host and device, is then entirely in the llvm dialect and is piped into mlir-cpu-runner for JIT execution, with the SYCL runtime wrapper loaded as a shared library.
Limitations:
The SYCL runtime wrapper in this PR is an initial version and does not implement all the entries supported by CUDA and ROCm.
Intel GPUs' shared local memory is not supported.
Bare pointers are used for passing parameters to device kernels, so dynamically shaped memrefs are not supported.