[MLIR][OpenMP] Improve Generic-SPMD kernel detection #137307
@llvm/pr-subscribers-flang-openmp @llvm/pr-subscribers-mlir-openmp

Author: Sergio Afonso (skatrak)

Changes

The previous implementation assumed that, for a target region to be tagged as Generic-SPMD, it would need to contain a single `teams distribute` loop with a single `parallel wsloop` nested in it. However, this was an overly restrictive set of conditions which resulted in a range of kernels behaving incorrectly.

This patch updates the kernel execution flags identification logic to accept any number of `parallel` regions inside of a single `teams distribute` loop (possibly as part of conditional or loop control flow) as Generic-SPMD.

Full diff: https://github.com/llvm/llvm-project/pull/137307.diff

2 Files Affected:

- (modified) mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
- (modified) mlir/test/Target/LLVMIR/openmp-target-generic-spmd.mlir
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index dd701da507fc6..3afb374381bdf 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -1954,7 +1954,7 @@ LogicalResult TargetOp::verifyRegions() {
}
static Operation *
-findCapturedOmpOp(Operation *rootOp, bool checkSingleMandatoryExec,
+findCapturedOmpOp(Operation *rootOp,
llvm::function_ref<bool(Operation *)> siblingAllowedFn) {
assert(rootOp && "expected valid operation");
@@ -1982,19 +1982,17 @@ findCapturedOmpOp(Operation *rootOp, bool checkSingleMandatoryExec,
// (i.e. its block's successors can reach it) or if it's not guaranteed to
// be executed before all exits of the region (i.e. it doesn't dominate all
// blocks with no successors reachable from the entry block).
- if (checkSingleMandatoryExec) {
- Region *parentRegion = op->getParentRegion();
- Block *parentBlock = op->getBlock();
-
- for (Block *successor : parentBlock->getSuccessors())
- if (successor->isReachable(parentBlock))
- return WalkResult::interrupt();
-
- for (Block &block : *parentRegion)
- if (domInfo.isReachableFromEntry(&block) && block.hasNoSuccessors() &&
- !domInfo.dominates(parentBlock, &block))
- return WalkResult::interrupt();
- }
+ Region *parentRegion = op->getParentRegion();
+ Block *parentBlock = op->getBlock();
+
+ for (Block *successor : parentBlock->getSuccessors())
+ if (successor->isReachable(parentBlock))
+ return WalkResult::interrupt();
+
+ for (Block &block : *parentRegion)
+ if (domInfo.isReachableFromEntry(&block) && block.hasNoSuccessors() &&
+ !domInfo.dominates(parentBlock, &block))
+ return WalkResult::interrupt();
// Don't capture this op if it has a not-allowed sibling, and stop recursing
// into nested operations.
@@ -2017,27 +2015,25 @@ Operation *TargetOp::getInnermostCapturedOmpOp() {
// Only allow OpenMP terminators and non-OpenMP ops that have known memory
// effects, but don't include a memory write effect.
- return findCapturedOmpOp(
- *this, /*checkSingleMandatoryExec=*/true, [&](Operation *sibling) {
- if (!sibling)
- return false;
-
- if (ompDialect == sibling->getDialect())
- return sibling->hasTrait<OpTrait::IsTerminator>();
-
- if (auto memOp = dyn_cast<MemoryEffectOpInterface>(sibling)) {
- SmallVector<SideEffects::EffectInstance<MemoryEffects::Effect>, 4>
- effects;
- memOp.getEffects(effects);
- return !llvm::any_of(
- effects, [&](MemoryEffects::EffectInstance &effect) {
- return isa<MemoryEffects::Write>(effect.getEffect()) &&
- isa<SideEffects::AutomaticAllocationScopeResource>(
- effect.getResource());
- });
- }
- return true;
+ return findCapturedOmpOp(*this, [&](Operation *sibling) {
+ if (!sibling)
+ return false;
+
+ if (ompDialect == sibling->getDialect())
+ return sibling->hasTrait<OpTrait::IsTerminator>();
+
+ if (auto memOp = dyn_cast<MemoryEffectOpInterface>(sibling)) {
+ SmallVector<SideEffects::EffectInstance<MemoryEffects::Effect>, 4>
+ effects;
+ memOp.getEffects(effects);
+ return !llvm::any_of(effects, [&](MemoryEffects::EffectInstance &effect) {
+ return isa<MemoryEffects::Write>(effect.getEffect()) &&
+ isa<SideEffects::AutomaticAllocationScopeResource>(
+ effect.getResource());
});
+ }
+ return true;
+ });
}
TargetRegionFlags TargetOp::getKernelExecFlags(Operation *capturedOp) {
@@ -2098,33 +2094,23 @@ TargetRegionFlags TargetOp::getKernelExecFlags(Operation *capturedOp) {
if (isa<LoopOp>(innermostWrapper))
return TargetRegionFlags::spmd | TargetRegionFlags::trip_count;
- // Find single immediately nested captured omp.parallel and add spmd flag
- // (generic-spmd case).
+ // Add spmd flag if there's a nested omp.parallel (generic-spmd case).
//
// TODO: This shouldn't have to be done here, as it is too easy to break.
// The openmp-opt pass should be updated to be able to promote kernels like
// this from "Generic" to "Generic-SPMD". However, the use of the
// `kmpc_distribute_static_loop` family of functions produced by the
// OMPIRBuilder for these kernels prevents that from working.
- Dialect *ompDialect = targetOp->getDialect();
- Operation *nestedCapture = findCapturedOmpOp(
- capturedOp, /*checkSingleMandatoryExec=*/false,
- [&](Operation *sibling) {
- return sibling && (ompDialect != sibling->getDialect() ||
- sibling->hasTrait<OpTrait::IsTerminator>());
- });
+ bool hasParallel = capturedOp
+ ->walk<WalkOrder::PreOrder>([](ParallelOp) {
+ return WalkResult::interrupt();
+ })
+ .wasInterrupted();
TargetRegionFlags result =
TargetRegionFlags::generic | TargetRegionFlags::trip_count;
- if (!nestedCapture)
- return result;
-
- while (nestedCapture->getParentOp() != capturedOp)
- nestedCapture = nestedCapture->getParentOp();
-
- return isa<ParallelOp>(nestedCapture) ? result | TargetRegionFlags::spmd
- : result;
+ return hasParallel ? result | TargetRegionFlags::spmd : result;
}
// Detect target-parallel-wsloop[-simd].
else if (isa<WsloopOp>(innermostWrapper)) {
diff --git a/mlir/test/Target/LLVMIR/openmp-target-generic-spmd.mlir b/mlir/test/Target/LLVMIR/openmp-target-generic-spmd.mlir
index 8101660e571e4..3273de0c26d27 100644
--- a/mlir/test/Target/LLVMIR/openmp-target-generic-spmd.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-target-generic-spmd.mlir
@@ -1,11 +1,7 @@
-// RUN: split-file %s %t
-// RUN: mlir-translate -mlir-to-llvmir %t/host.mlir | FileCheck %s --check-prefix=HOST
-// RUN: mlir-translate -mlir-to-llvmir %t/device.mlir | FileCheck %s --check-prefix=DEVICE
-
-//--- host.mlir
+// RUN: mlir-translate -mlir-to-llvmir -split-input-file %s | FileCheck %s
module attributes {omp.is_target_device = false, omp.target_triples = ["amdgcn-amd-amdhsa"]} {
- llvm.func @main(%arg0 : !llvm.ptr) {
+ llvm.func @host(%arg0 : !llvm.ptr) {
%x = llvm.load %arg0 : !llvm.ptr -> i32
%0 = omp.map.info var_ptr(%arg0 : !llvm.ptr, i32) map_clauses(to) capture(ByCopy) -> !llvm.ptr
omp.target host_eval(%x -> %lb, %x -> %ub, %x -> %step : i32, i32, i32) map_entries(%0 -> %ptr : !llvm.ptr) {
@@ -32,36 +28,36 @@ module attributes {omp.is_target_device = false, omp.target_triples = ["amdgcn-a
}
}
-// HOST-LABEL: define void @main
-// HOST: %omp_loop.tripcount = {{.*}}
-// HOST-NEXT: br label %[[ENTRY:.*]]
-// HOST: [[ENTRY]]:
-// HOST: %[[TRIPCOUNT:.*]] = zext i32 %omp_loop.tripcount to i64
-// HOST: %[[TRIPCOUNT_KARG:.*]] = getelementptr inbounds nuw %struct.__tgt_kernel_arguments, ptr %[[KARGS:.*]], i32 0, i32 8
-// HOST-NEXT: store i64 %[[TRIPCOUNT]], ptr %[[TRIPCOUNT_KARG]]
-// HOST: %[[RESULT:.*]] = call i32 @__tgt_target_kernel({{.*}}, ptr %[[KARGS]])
-// HOST-NEXT: %[[CMP:.*]] = icmp ne i32 %[[RESULT]], 0
-// HOST-NEXT: br i1 %[[CMP]], label %[[OFFLOAD_FAILED:.*]], label %{{.*}}
-// HOST: [[OFFLOAD_FAILED]]:
-// HOST: call void @[[TARGET_OUTLINE:.*]]({{.*}})
+// CHECK-LABEL: define void @host
+// CHECK: %omp_loop.tripcount = {{.*}}
+// CHECK-NEXT: br label %[[ENTRY:.*]]
+// CHECK: [[ENTRY]]:
+// CHECK: %[[TRIPCOUNT:.*]] = zext i32 %omp_loop.tripcount to i64
+// CHECK: %[[TRIPCOUNT_KARG:.*]] = getelementptr inbounds nuw %struct.__tgt_kernel_arguments, ptr %[[KARGS:.*]], i32 0, i32 8
+// CHECK-NEXT: store i64 %[[TRIPCOUNT]], ptr %[[TRIPCOUNT_KARG]]
+// CHECK: %[[RESULT:.*]] = call i32 @__tgt_target_kernel({{.*}}, ptr %[[KARGS]])
+// CHECK-NEXT: %[[CMP:.*]] = icmp ne i32 %[[RESULT]], 0
+// CHECK-NEXT: br i1 %[[CMP]], label %[[OFFLOAD_FAILED:.*]], label %{{.*}}
+// CHECK: [[OFFLOAD_FAILED]]:
+// CHECK: call void @[[TARGET_OUTLINE:.*]]({{.*}})
-// HOST: define internal void @[[TARGET_OUTLINE]]
-// HOST: call void{{.*}}@__kmpc_fork_teams({{.*}}, ptr @[[TEAMS_OUTLINE:.*]], {{.*}})
+// CHECK: define internal void @[[TARGET_OUTLINE]]
+// CHECK: call void{{.*}}@__kmpc_fork_teams({{.*}}, ptr @[[TEAMS_OUTLINE:.*]], {{.*}})
-// HOST: define internal void @[[TEAMS_OUTLINE]]
-// HOST: call void @[[DISTRIBUTE_OUTLINE:.*]]({{.*}})
+// CHECK: define internal void @[[TEAMS_OUTLINE]]
+// CHECK: call void @[[DISTRIBUTE_OUTLINE:.*]]({{.*}})
-// HOST: define internal void @[[DISTRIBUTE_OUTLINE]]
-// HOST: call void @__kmpc_for_static_init{{.*}}(ptr {{.*}}, i32 {{.*}}, i32 92, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, i32 {{.*}}, i32 {{.*}})
-// HOST: call void (ptr, i32, ptr, ...) @__kmpc_fork_call({{.*}}, ptr @[[PARALLEL_OUTLINE:.*]], {{.*}})
+// CHECK: define internal void @[[DISTRIBUTE_OUTLINE]]
+// CHECK: call void @__kmpc_for_static_init{{.*}}(ptr {{.*}}, i32 {{.*}}, i32 92, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, i32 {{.*}}, i32 {{.*}})
+// CHECK: call void (ptr, i32, ptr, ...) @__kmpc_fork_call({{.*}}, ptr @[[PARALLEL_OUTLINE:.*]], {{.*}})
-// HOST: define internal void @[[PARALLEL_OUTLINE]]
-// HOST: call void @__kmpc_for_static_init{{.*}}(ptr {{.*}}, i32 {{.*}}, i32 34, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, i32 {{.*}}, i32 {{.*}})
+// CHECK: define internal void @[[PARALLEL_OUTLINE]]
+// CHECK: call void @__kmpc_for_static_init{{.*}}(ptr {{.*}}, i32 {{.*}}, i32 34, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, ptr {{.*}}, i32 {{.*}}, i32 {{.*}})
-//--- device.mlir
+// -----
module attributes {llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_target_device = true, omp.is_gpu = true} {
- llvm.func @main(%arg0 : !llvm.ptr) {
+ llvm.func @device(%arg0 : !llvm.ptr) {
%0 = omp.map.info var_ptr(%arg0 : !llvm.ptr, i32) map_clauses(to) capture(ByCopy) -> !llvm.ptr
omp.target map_entries(%0 -> %ptr : !llvm.ptr) {
%x = llvm.load %ptr : !llvm.ptr -> i32
@@ -87,25 +83,80 @@ module attributes {llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_target_devic
}
}
-// DEVICE: @[[KERNEL_NAME:.*]]_exec_mode = weak protected constant i8 [[EXEC_MODE:3]]
-// DEVICE: @llvm.compiler.used = appending global [1 x ptr] [ptr @[[KERNEL_NAME]]_exec_mode], section "llvm.metadata"
-// DEVICE: @[[KERNEL_NAME]]_kernel_environment = weak_odr protected constant %struct.KernelEnvironmentTy {
-// DEVICE-SAME: %struct.ConfigurationEnvironmentTy { i8 1, i8 1, i8 [[EXEC_MODE]], {{.*}}},
-// DEVICE-SAME: ptr @{{.*}}, ptr @{{.*}} }
+// CHECK: @[[KERNEL_NAME:.*]]_exec_mode = weak protected constant i8 [[EXEC_MODE:3]]
+// CHECK: @llvm.compiler.used = appending global [1 x ptr] [ptr @[[KERNEL_NAME]]_exec_mode], section "llvm.metadata"
+// CHECK: @[[KERNEL_NAME]]_kernel_environment = weak_odr protected constant %struct.KernelEnvironmentTy {
+// CHECK-SAME: %struct.ConfigurationEnvironmentTy { i8 1, i8 1, i8 [[EXEC_MODE]], {{.*}}},
+// CHECK-SAME: ptr @{{.*}}, ptr @{{.*}} }
+
+// CHECK: define weak_odr protected amdgpu_kernel void @[[KERNEL_NAME]]({{.*}})
+// CHECK: %{{.*}} = call i32 @__kmpc_target_init(ptr @[[KERNEL_NAME]]_kernel_environment, {{.*}})
+// CHECK: call void @[[TARGET_OUTLINE:.*]]({{.*}})
+// CHECK: call void @__kmpc_target_deinit()
+
+// CHECK: define internal void @[[TARGET_OUTLINE]]({{.*}})
+// CHECK: call void @[[TEAMS_OUTLINE:.*]]({{.*}})
+
+// CHECK: define internal void @[[TEAMS_OUTLINE]]({{.*}})
+// CHECK: call void @__kmpc_distribute_static_loop{{.*}}({{.*}}, ptr @[[DISTRIBUTE_OUTLINE:[^,]*]], {{.*}})
+
+// CHECK: define internal void @[[DISTRIBUTE_OUTLINE]]({{.*}})
+// CHECK: call void @__kmpc_parallel_51(ptr {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, ptr @[[PARALLEL_OUTLINE:.*]], ptr {{.*}}, ptr {{.*}}, i64 {{.*}})
+
+// CHECK: define internal void @[[PARALLEL_OUTLINE]]({{.*}})
+// CHECK: call void @__kmpc_for_static_loop{{.*}}({{.*}})
+
+// -----
+
+module attributes {llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_target_device = true, omp.is_gpu = true} {
+ llvm.func @device2(%arg0 : !llvm.ptr) {
+ %0 = omp.map.info var_ptr(%arg0 : !llvm.ptr, i32) map_clauses(to) capture(ByCopy) -> !llvm.ptr
+ omp.target map_entries(%0 -> %ptr : !llvm.ptr) {
+ %x = llvm.load %ptr : !llvm.ptr -> i32
+ omp.teams {
+ omp.distribute {
+ omp.loop_nest (%iv1) : i32 = (%x) to (%x) step (%x) {
+ omp.parallel {
+ omp.terminator
+ }
+ llvm.br ^bb2
+ ^bb1:
+ omp.parallel {
+ omp.terminator
+ }
+ omp.yield
+ ^bb2:
+ llvm.br ^bb1
+ }
+ }
+ omp.terminator
+ }
+ omp.terminator
+ }
+ llvm.return
+ }
+}
+
+// CHECK: @[[KERNEL_NAME:.*]]_exec_mode = weak protected constant i8 [[EXEC_MODE:3]]
+// CHECK: @llvm.compiler.used = appending global [1 x ptr] [ptr @[[KERNEL_NAME]]_exec_mode], section "llvm.metadata"
+// CHECK: @[[KERNEL_NAME]]_kernel_environment = weak_odr protected constant %struct.KernelEnvironmentTy {
+// CHECK-SAME: %struct.ConfigurationEnvironmentTy { i8 1, i8 1, i8 [[EXEC_MODE]], {{.*}}},
+// CHECK-SAME: ptr @{{.*}}, ptr @{{.*}} }
-// DEVICE: define weak_odr protected amdgpu_kernel void @[[KERNEL_NAME]]({{.*}})
-// DEVICE: %{{.*}} = call i32 @__kmpc_target_init(ptr @[[KERNEL_NAME]]_kernel_environment, {{.*}})
-// DEVICE: call void @[[TARGET_OUTLINE:.*]]({{.*}})
-// DEVICE: call void @__kmpc_target_deinit()
+// CHECK: define weak_odr protected amdgpu_kernel void @[[KERNEL_NAME]]({{.*}})
+// CHECK: %{{.*}} = call i32 @__kmpc_target_init(ptr @[[KERNEL_NAME]]_kernel_environment, {{.*}})
+// CHECK: call void @[[TARGET_OUTLINE:.*]]({{.*}})
+// CHECK: call void @__kmpc_target_deinit()
-// DEVICE: define internal void @[[TARGET_OUTLINE]]({{.*}})
-// DEVICE: call void @[[TEAMS_OUTLINE:.*]]({{.*}})
+// CHECK: define internal void @[[TARGET_OUTLINE]]({{.*}})
+// CHECK: call void @[[TEAMS_OUTLINE:.*]]({{.*}})
-// DEVICE: define internal void @[[TEAMS_OUTLINE]]({{.*}})
-// DEVICE: call void @__kmpc_distribute_static_loop{{.*}}({{.*}}, ptr @[[DISTRIBUTE_OUTLINE:[^,]*]], {{.*}})
+// CHECK: define internal void @[[TEAMS_OUTLINE]]({{.*}})
+// CHECK: call void @__kmpc_distribute_static_loop{{.*}}({{.*}}, ptr @[[DISTRIBUTE_OUTLINE:[^,]*]], {{.*}})
-// DEVICE: define internal void @[[DISTRIBUTE_OUTLINE]]({{.*}})
-// DEVICE: call void @__kmpc_parallel_51(ptr {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, ptr @[[PARALLEL_OUTLINE:.*]], ptr {{.*}}, ptr {{.*}}, i64 {{.*}})
+// CHECK: define internal void @[[DISTRIBUTE_OUTLINE]]({{.*}})
+// CHECK: call void @__kmpc_parallel_51(ptr {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, ptr @[[PARALLEL_OUTLINE0:.*]], ptr {{.*}}, ptr {{.*}}, i64 {{.*}})
+// CHECK: call void @__kmpc_parallel_51(ptr {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}}, ptr @[[PARALLEL_OUTLINE1:.*]], ptr {{.*}}, ptr {{.*}}, i64 {{.*}})
-// DEVICE: define internal void @[[PARALLEL_OUTLINE]]({{.*}})
-// DEVICE: call void @__kmpc_for_static_loop{{.*}}({{.*}})
+// CHECK: define internal void @[[PARALLEL_OUTLINE1]]({{.*}})
+// CHECK: define internal void @[[PARALLEL_OUTLINE0]]({{.*}})
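For reference, the new check boils down to a pre-order IR walk that stops at the first `omp.parallel` found below the captured op, regardless of the control flow around it. A minimal standalone sketch of that logic (the helper name is hypothetical; in the patch the walk is inlined in `getKernelExecFlags`):

#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/IR/Visitors.h"

// Returns true if any omp.parallel op is nested anywhere below `capturedOp`,
// even behind conditional or loop control flow. Interrupting the walk at the
// first match makes this an early-exit search.
static bool hasNestedParallel(mlir::Operation *capturedOp) {
  return capturedOp
      ->walk<mlir::WalkOrder::PreOrder>([](mlir::omp::ParallelOp) {
        return mlir::WalkResult::interrupt();
      })
      .wasInterrupted();
}

When this returns true for a generic `teams distribute` kernel, the spmd flag is OR'ed into the generic | trip_count result, which is what yields the Generic-SPMD tagging.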
What is "a range of kernels behaving incorrectly"?
Why do you think this is legal? If you have control flow in the teams distribute construct (including a sequence of `omp parallel` constructs), the teams must synchronize with the initial thread. The absence of such synchronization is what allows SPMD.
I think I probably didn't explain very well the cases this patch covers. I absolutely agree with you that this seems counterintuitive. But I think the main point here is that we're not tagging these kernels as "SPMD", but rather "Generic-SPMD". This is in contrast to just "Generic", which is what we're currently doing. The reason is that, in practice, some of these kernels only produce the expected results when tagged as Generic-SPMD, as the examples below show.

! This works (it's already tagged Generic-SPMD):
! condition=true: 1 1 1 1 1 1
! condition=false: 2 2 2 2 2 2
subroutine if_cond_single(condition)
implicit none
logical, intent(in) :: condition
integer, parameter :: M = 2, N = 3
integer :: i, j
integer :: v(M,N)
v(:,:) = 0
!$omp target teams distribute
do i=1, M
if (condition) then
!$omp parallel do
do j=1, N
v(i, j) = v(i, j) + 1
end do
else
do j=1, N
v(i, j) = v(i, j) + 2
end do
end if
end do
print *, v(:,:)
end subroutine
! This doesn't work without this patch:
! condition=true: 0 0 0 0 0 0
! condition=false: 0 0 0 0 0 0
subroutine if_cond_multiple(condition)
implicit none
logical, intent(in) :: condition
integer, parameter :: M = 2, N = 3
integer :: i, j
integer :: v(M,N)
v(:,:) = 0
!$omp target teams distribute
do i=1, M
if (condition) then
!$omp parallel do
do j=1, N
v(i, j) = v(i, j) + 1
end do
else
!$omp parallel do
do j=1, N
v(i, j) = v(i, j) + 2
end do
end if
end do
print *, v(:,:)
end subroutine
! This works (it's already tagged Generic-SPMD):
! 3 3 2 2 2 2
subroutine single_parallel()
implicit none
integer, parameter :: M = 2, N = 3
integer :: i, j
integer :: v(M,N)
v(:,:) = 0
!$omp target teams distribute
do i=1, M
!$omp parallel do
do j=1, N
v(i, j) = v(i, j) + 1
end do
v(i, 1) = v(i, 1) + 1
do j=1, N
v(i, j) = v(i, j) + 1
end do
end do
print *, v(:,:)
end subroutine
! This doesn't work without this patch:
! 1 1 0 0 0 0
subroutine multi_parallel()
implicit none
integer, parameter :: M = 2, N = 3
integer :: i, j
integer :: v(M,N)
v(:,:) = 0
!$omp target teams distribute
do i=1, M
!$omp parallel do
do j=1, N
v(i, j) = v(i, j) + 1
end do
v(i, 1) = v(i, 1) + 1
!$omp parallel do
do j=1, N
v(i, j) = v(i, j) + 1
end do
end do
print *, v(:,:)
end subroutine

I'm no expert on the exact uses of Generic-SPMD, but making it mean roughly "a …"
IIUC, Generic-SPMD is when the openmp-opt pass converts a Generic kernel to an SPMD kernel. Because openmp-opt only sees the GPU kernel, it cannot modify the host-side kernel invocation, so you end up with a mix of both. One consequence is that at kernel invocation, it does not pass the number of iterations because it is not known. With this background, I don't see why a frontend would ever use Generic-SPMD mode, since it has control over kernel code and host-side invocation. Clang does not know about it. Independent of that, Generic -- as the name implies -- is supposed to always work. If it does not, it is a bug.
Yes, that's how it works for clang. This does not work for flang because we use different DeviceRTL functions for these constructs (the `kmpc_distribute_static_loop` family mentioned in the TODO above).
That's the thing I also struggle to understand. There must be a bug in Generic mode if it doesn't always produce correct results, performance considerations aside. But it appears that these tests only work if tagged as Generic-SPMD, not Generic or SPMD. Considering the OpenMPOpt pass can't currently make the promotion from Generic on its own, we are temporarily handling it in codegen. There's a TODO comment documenting this.
I tried to replicate the issue in C, but that doesn't seem to work. Maybe we should compare the IR. What I did is: …

All resulting in: …

Note that O0 is executed in Generic mode, while the rest is optimized to Generic-SPMD.
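(For anyone reproducing this: one way to observe the chosen mode at runtime is libomptarget's kernel-launch diagnostics, e.g. by setting the `LIBOMPTARGET_INFO` environment variable; the statically assigned mode is also visible in the kernel's `_exec_mode` global, which the tests above check.)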
Thank you for checking this. It looks like that test has been fixed sometime after I created this PR, since I was able to reproduce failures with clang until I updated to the latest main branch. This other test that @Meinersbur made, however, does show another case where running in Generic-SPMD mode is currently required in order to get the expected results:

#include <stdio.h>
#include <omp.h>
int main() {
int i, j, a = 0, b = 0, c = 0, g = 21;
#pragma omp target teams distribute thread_limit(10) private(i,j) reduction(+:a,b,c,g)
for (i = 1; i <= 10; ++i) {
j = i;
if (j == 5) {
g += 10 * omp_get_team_num() + omp_get_thread_num();
++c;
j = 11;
}
if (j == 11) {
#pragma omp parallel num_threads(10) reduction(+:a)
{
++a;
}
} else {
#pragma omp parallel num_threads(10) reduction(+:b)
{
++b;
}
}
}
printf("a: %d\nb: %d\nc: %d\ng: %d", a, b, c, g);
return 0;
}

On this, we get the following (same output for -O1, -O2 and -O3, since they all use Generic-SPMD): …
Long story short, the result of O0 is correct and expected. What's going on: first, note that …

So, for those two reasons, we see one-thread parallel regions in the example. Again, that's perfectly valid OpenMP. If you add …

The fun part is, I run into a hang if you enable WS2; see #140786 for more information.