[OpenMP] Introduce the KernelLaunchEnvironment as implicit argument #70401
Conversation
@llvm/pr-subscribers-mlir-llvm @llvm/pr-subscribers-flang-openmp
Author: Johannes Doerfert (jdoerfert)
Changes
Patch is 1.55 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/70401.diff
34 Files Affected:
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index 6262b3416a1730a..c1be7c2d0321589 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -6002,6 +6002,42 @@ void CGOpenMPRuntime::emitUsesAllocatorsFini(CodeGenFunction &CGF,
{ThreadId, AllocatorVal});
}
+void CGOpenMPRuntime::computeMinAndMaxThreadsAndTeams(
+ const OMPExecutableDirective &D, CodeGenFunction &CGF,
+ int32_t &MinThreadsVal, int32_t &MaxThreadsVal, int32_t &MinTeamsVal,
+ int32_t &MaxTeamsVal) {
+
+ getNumTeamsExprForTargetDirective(CGF, D, MinTeamsVal, MaxTeamsVal);
+ getNumThreadsExprForTargetDirective(CGF, D, MaxThreadsVal,
+ /*UpperBoundOnly=*/true);
+
+ for (auto *C : D.getClausesOfKind<OMPXAttributeClause>()) {
+ for (auto *A : C->getAttrs()) {
+ int32_t AttrMinThreadsVal = 1, AttrMaxThreadsVal = -1;
+ int32_t AttrMinBlocksVal = 1, AttrMaxBlocksVal = -1;
+ if (auto *Attr = dyn_cast<CUDALaunchBoundsAttr>(A))
+ CGM.handleCUDALaunchBoundsAttr(nullptr, Attr, &AttrMaxThreadsVal,
+ &AttrMinBlocksVal, &AttrMaxBlocksVal);
+ else if (auto *Attr = dyn_cast<AMDGPUFlatWorkGroupSizeAttr>(A))
+ CGM.handleAMDGPUFlatWorkGroupSizeAttr(
+ nullptr, Attr, /*ReqdWGS=*/nullptr, &AttrMinThreadsVal,
+ &AttrMaxThreadsVal);
+ else
+ continue;
+
+ MinThreadsVal = std::max(MinThreadsVal, AttrMinThreadsVal);
+ if (AttrMaxThreadsVal > 0)
+ MaxThreadsVal = MaxThreadsVal > 0
+ ? std::min(MaxThreadsVal, AttrMaxThreadsVal)
+ : AttrMaxThreadsVal;
+ MinTeamsVal = std::max(MinTeamsVal, AttrMinBlocksVal);
+ if (AttrMaxBlocksVal > 0)
+ MaxTeamsVal = MaxTeamsVal > 0 ? std::min(MaxTeamsVal, AttrMaxBlocksVal)
+ : AttrMaxBlocksVal;
+ }
+ }
+}
+
void CGOpenMPRuntime::emitTargetOutlinedFunctionHelper(
const OMPExecutableDirective &D, StringRef ParentName,
llvm::Function *&OutlinedFn, llvm::Constant *&OutlinedFnID,
@@ -6020,47 +6056,8 @@ void CGOpenMPRuntime::emitTargetOutlinedFunctionHelper(
return CGF.GenerateOpenMPCapturedStmtFunction(CS, D.getBeginLoc());
};
- // Get NumTeams and ThreadLimit attributes
- int32_t DefaultValMinTeams = 1;
- int32_t DefaultValMaxTeams = -1;
- uint32_t DefaultValMinThreads = 1;
- uint32_t DefaultValMaxThreads = UINT32_MAX;
-
- getNumTeamsExprForTargetDirective(CGF, D, DefaultValMinTeams,
- DefaultValMaxTeams);
- getNumThreadsExprForTargetDirective(CGF, D, DefaultValMaxThreads,
- /*UpperBoundOnly=*/true);
-
- for (auto *C : D.getClausesOfKind<OMPXAttributeClause>()) {
- for (auto *A : C->getAttrs()) {
- int32_t MinThreadsVal = 1, MaxThreadsVal = 0;
- int32_t MinBlocksVal = 1, MaxBlocksVal = -1;
- if (auto *Attr = dyn_cast<CUDALaunchBoundsAttr>(A))
- CGM.handleCUDALaunchBoundsAttr(nullptr, Attr, &MaxThreadsVal,
- &MinBlocksVal, &MaxBlocksVal);
- else if (auto *Attr = dyn_cast<AMDGPUFlatWorkGroupSizeAttr>(A))
- CGM.handleAMDGPUFlatWorkGroupSizeAttr(
- nullptr, Attr, /*ReqdWGS=*/nullptr, &MinThreadsVal, &MaxThreadsVal);
- else
- continue;
-
- DefaultValMinThreads =
- std::max(DefaultValMinThreads, uint32_t(MinThreadsVal));
- DefaultValMaxThreads =
- DefaultValMaxThreads
- ? std::min(DefaultValMaxThreads, uint32_t(MaxThreadsVal))
- : MaxThreadsVal;
- DefaultValMinTeams = DefaultValMinTeams
- ? std::max(DefaultValMinTeams, MinBlocksVal)
- : MinBlocksVal;
- DefaultValMaxTeams = std::min(DefaultValMaxTeams, MaxBlocksVal);
- }
- }
-
- OMPBuilder.emitTargetRegionFunction(
- EntryInfo, GenerateOutlinedFunction, DefaultValMinTeams,
- DefaultValMaxTeams, DefaultValMinThreads, DefaultValMaxThreads,
- IsOffloadEntry, OutlinedFn, OutlinedFnID);
+ OMPBuilder.emitTargetRegionFunction(EntryInfo, GenerateOutlinedFunction,
+ IsOffloadEntry, OutlinedFn, OutlinedFnID);
if (!OutlinedFn)
return;
@@ -6306,7 +6303,7 @@ llvm::Value *CGOpenMPRuntime::emitNumTeamsForTargetDirective(
/// store the condition in \p CondVal. If \p E, and \p CondVal respectively, are
/// nullptr, no expression evaluation is perfomed.
static void getNumThreads(CodeGenFunction &CGF, const CapturedStmt *CS,
- const Expr **E, uint32_t &UpperBound,
+ const Expr **E, int32_t &UpperBound,
bool UpperBoundOnly, llvm::Value **CondVal) {
const Stmt *Child = CGOpenMPRuntime::getSingleCompoundChild(
CGF.getContext(), CS->getCapturedStmt());
@@ -6368,10 +6365,10 @@ static void getNumThreads(CodeGenFunction &CGF, const CapturedStmt *CS,
UpperBound
? Constant->getZExtValue()
: std::min(UpperBound,
- static_cast<uint32_t>(Constant->getZExtValue()));
+ static_cast<int32_t>(Constant->getZExtValue()));
// If we haven't found a upper bound, remember we saw a thread limiting
// clause.
- if (UpperBound == UINT32_MAX)
+ if (UpperBound == -1)
UpperBound = 0;
if (!E)
return;
@@ -6397,7 +6394,7 @@ static void getNumThreads(CodeGenFunction &CGF, const CapturedStmt *CS,
}
const Expr *CGOpenMPRuntime::getNumThreadsExprForTargetDirective(
- CodeGenFunction &CGF, const OMPExecutableDirective &D, uint32_t &UpperBound,
+ CodeGenFunction &CGF, const OMPExecutableDirective &D, int32_t &UpperBound,
bool UpperBoundOnly, llvm::Value **CondVal, const Expr **ThreadLimitExpr) {
assert((!CGF.getLangOpts().OpenMPIsTargetDevice || UpperBoundOnly) &&
"Clauses associated with the teams directive expected to be emitted "
@@ -6414,11 +6411,11 @@ const Expr *CGOpenMPRuntime::getNumThreadsExprForTargetDirective(
if (auto Constant = E->getIntegerConstantExpr(CGF.getContext()))
UpperBound = UpperBound ? Constant->getZExtValue()
: std::min(UpperBound,
- uint32_t(Constant->getZExtValue()));
+ int32_t(Constant->getZExtValue()));
}
// If we haven't found a upper bound, remember we saw a thread limiting
// clause.
- if (UpperBound == UINT32_MAX)
+ if (UpperBound == -1)
UpperBound = 0;
if (EPtr)
*EPtr = E;
@@ -6562,7 +6559,7 @@ llvm::Value *CGOpenMPRuntime::emitNumThreadsForTargetDirective(
llvm::Value *CondVal = nullptr;
llvm::Value *ThreadLimitVal = nullptr;
const Expr *ThreadLimitExpr = nullptr;
- uint32_t UpperBound = -1;
+ int32_t UpperBound = -1;
const Expr *NT = getNumThreadsExprForTargetDirective(
CGF, D, UpperBound, /* UpperBoundOnly */ false, &CondVal,
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.h b/clang/lib/CodeGen/CGOpenMPRuntime.h
index d2f922da3320924..0c4ad46e881b9c5 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.h
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.h
@@ -311,6 +311,14 @@ class CGOpenMPRuntime {
/// An OpenMP-IR-Builder instance.
llvm::OpenMPIRBuilder OMPBuilder;
+ /// Helper to determine the min/max number of threads/teams for \p D.
+ void computeMinAndMaxThreadsAndTeams(const OMPExecutableDirective &D,
+ CodeGenFunction &CGF,
+ int32_t &MinThreadsVal,
+ int32_t &MaxThreadsVal,
+ int32_t &MinTeamsVal,
+ int32_t &MaxTeamsVal);
+
/// Helper to emit outlined function for 'target' directive.
/// \param D Directive to emit.
/// \param ParentName Name of the function that encloses the target region.
@@ -649,7 +657,7 @@ class CGOpenMPRuntime {
/// UpperBoundOnly is true, no expression evaluation is perfomed.
const Expr *getNumThreadsExprForTargetDirective(
CodeGenFunction &CGF, const OMPExecutableDirective &D,
- uint32_t &UpperBound, bool UpperBoundOnly,
+ int32_t &UpperBound, bool UpperBoundOnly,
llvm::Value **CondExpr = nullptr, const Expr **ThreadLimitExpr = nullptr);
/// Emit an expression that denotes the number of threads a target region
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index 152a7511f4dd1b0..9d00ebae702802a 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -757,13 +757,15 @@ void CGOpenMPRuntimeGPU::emitNonSPMDKernel(const OMPExecutableDirective &D,
// Emit target region as a standalone region.
class NVPTXPrePostActionTy : public PrePostActionTy {
CGOpenMPRuntimeGPU::EntryFunctionState &EST;
+ const OMPExecutableDirective &D;
public:
- NVPTXPrePostActionTy(CGOpenMPRuntimeGPU::EntryFunctionState &EST)
- : EST(EST) {}
+ NVPTXPrePostActionTy(CGOpenMPRuntimeGPU::EntryFunctionState &EST,
+ const OMPExecutableDirective &D)
+ : EST(EST), D(D) {}
void Enter(CodeGenFunction &CGF) override {
auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
- RT.emitKernelInit(CGF, EST, /* IsSPMD */ false);
+ RT.emitKernelInit(D, CGF, EST, /* IsSPMD */ false);
// Skip target region initialization.
RT.setLocThreadIdInsertPt(CGF, /*AtCurrentPoint=*/true);
}
@@ -772,7 +774,7 @@ void CGOpenMPRuntimeGPU::emitNonSPMDKernel(const OMPExecutableDirective &D,
RT.clearLocThreadIdInsertPt(CGF);
RT.emitKernelDeinit(CGF, EST, /* IsSPMD */ false);
}
- } Action(EST);
+ } Action(EST, D);
CodeGen.setAction(Action);
IsInTTDRegion = true;
emitTargetOutlinedFunctionHelper(D, ParentName, OutlinedFn, OutlinedFnID,
@@ -780,10 +782,17 @@ void CGOpenMPRuntimeGPU::emitNonSPMDKernel(const OMPExecutableDirective &D,
IsInTTDRegion = false;
}
-void CGOpenMPRuntimeGPU::emitKernelInit(CodeGenFunction &CGF,
+void CGOpenMPRuntimeGPU::emitKernelInit(const OMPExecutableDirective &D,
+ CodeGenFunction &CGF,
EntryFunctionState &EST, bool IsSPMD) {
+ int32_t MinThreadsVal = 1, MaxThreadsVal = -1, MinTeamsVal = 1,
+ MaxTeamsVal = -1;
+ computeMinAndMaxThreadsAndTeams(D, CGF, MinThreadsVal, MaxThreadsVal,
+ MinTeamsVal, MaxTeamsVal);
+
CGBuilderTy &Bld = CGF.Builder;
- Bld.restoreIP(OMPBuilder.createTargetInit(Bld, IsSPMD));
+ Bld.restoreIP(OMPBuilder.createTargetInit(
+ Bld, IsSPMD, MinThreadsVal, MaxThreadsVal, MinTeamsVal, MaxTeamsVal));
if (!IsSPMD)
emitGenericVarsProlog(CGF, EST.Loc);
}
@@ -815,19 +824,20 @@ void CGOpenMPRuntimeGPU::emitSPMDKernel(const OMPExecutableDirective &D,
CGOpenMPRuntimeGPU::EntryFunctionState &EST;
bool IsBareKernel;
DataSharingMode Mode;
+ const OMPExecutableDirective &D;
public:
NVPTXPrePostActionTy(CGOpenMPRuntimeGPU &RT,
CGOpenMPRuntimeGPU::EntryFunctionState &EST,
- bool IsBareKernel)
+ bool IsBareKernel, const OMPExecutableDirective &D)
: RT(RT), EST(EST), IsBareKernel(IsBareKernel),
- Mode(RT.CurrentDataSharingMode) {}
+ Mode(RT.CurrentDataSharingMode), D(D) {}
void Enter(CodeGenFunction &CGF) override {
if (IsBareKernel) {
RT.CurrentDataSharingMode = DataSharingMode::DS_CUDA;
return;
}
- RT.emitKernelInit(CGF, EST, /* IsSPMD */ true);
+ RT.emitKernelInit(D, CGF, EST, /* IsSPMD */ true);
// Skip target region initialization.
RT.setLocThreadIdInsertPt(CGF, /*AtCurrentPoint=*/true);
}
@@ -839,7 +849,7 @@ void CGOpenMPRuntimeGPU::emitSPMDKernel(const OMPExecutableDirective &D,
RT.clearLocThreadIdInsertPt(CGF);
RT.emitKernelDeinit(CGF, EST, /* IsSPMD */ true);
}
- } Action(*this, EST, IsBareKernel);
+ } Action(*this, EST, IsBareKernel, D);
CodeGen.setAction(Action);
IsInTTDRegion = true;
emitTargetOutlinedFunctionHelper(D, ParentName, OutlinedFn, OutlinedFnID,
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
index c4501a1a2a496b0..46e1361f2f895ba 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
@@ -60,8 +60,8 @@ class CGOpenMPRuntimeGPU : public CGOpenMPRuntime {
void syncCTAThreads(CodeGenFunction &CGF);
/// Helper for target directive initialization.
- void emitKernelInit(CodeGenFunction &CGF, EntryFunctionState &EST,
- bool IsSPMD);
+ void emitKernelInit(const OMPExecutableDirective &D, CodeGenFunction &CGF,
+ EntryFunctionState &EST, bool IsSPMD);
/// Helper for target directive finalization.
void emitKernelDeinit(CodeGenFunction &CGF, EntryFunctionState &EST,
diff --git a/clang/lib/Sema/SemaOpenMP.cpp b/clang/lib/Sema/SemaOpenMP.cpp
index f28e0f2693080c1..257afa4cfd63917 100644
--- a/clang/lib/Sema/SemaOpenMP.cpp
+++ b/clang/lib/Sema/SemaOpenMP.cpp
@@ -4249,12 +4249,15 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
getCurCapturedRegion()->TheCapturedDecl->addAttr(
AlwaysInlineAttr::CreateImplicit(
Context, {}, AlwaysInlineAttr::Keyword_forceinline));
- Sema::CapturedParamNameType ParamsTarget[] = {
- std::make_pair(StringRef(), QualType()) // __context with shared vars
- };
+ SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+ if (getLangOpts().OpenMPIsTargetDevice)
+ ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+ ParamsTarget.push_back(
+ std::make_pair(StringRef(), QualType())); // __context with shared vars;
// Start a captured region for 'target' with no implicit parameters.
ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
- ParamsTarget, /*OpenMPCaptureLevel=*/1);
+ ParamsTarget,
+ /*OpenMPCaptureLevel=*/1);
Sema::CapturedParamNameType ParamsTeamsOrParallel[] = {
std::make_pair(".global_tid.", KmpInt32PtrTy),
std::make_pair(".bound_tid.", KmpInt32PtrTy),
@@ -4293,8 +4296,13 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
getCurCapturedRegion()->TheCapturedDecl->addAttr(
AlwaysInlineAttr::CreateImplicit(
Context, {}, AlwaysInlineAttr::Keyword_forceinline));
+ SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+ if (getLangOpts().OpenMPIsTargetDevice)
+ ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+ ParamsTarget.push_back(
+ std::make_pair(StringRef(), QualType())); // __context with shared vars;
ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
- std::make_pair(StringRef(), QualType()),
+ ParamsTarget,
/*OpenMPCaptureLevel=*/1);
break;
}
@@ -4499,9 +4507,11 @@ void Sema::ActOnOpenMPRegionStart(OpenMPDirectiveKind DKind, Scope *CurScope) {
getCurCapturedRegion()->TheCapturedDecl->addAttr(
AlwaysInlineAttr::CreateImplicit(
Context, {}, AlwaysInlineAttr::Keyword_forceinline));
- Sema::CapturedParamNameType ParamsTarget[] = {
- std::make_pair(StringRef(), QualType()) // __context with shared vars
- };
+ SmallVector<Sema::CapturedParamNameType, 2> ParamsTarget;
+ if (getLangOpts().OpenMPIsTargetDevice)
+ ParamsTarget.push_back(std::make_pair(StringRef("dyn_ptr"), VoidPtrTy));
+ ParamsTarget.push_back(
+ std::make_pair(StringRef(), QualType())); // __context with shared vars;
// Start a captured region for 'target' with no implicit parameters.
ActOnCapturedRegionStart(DSAStack->getConstructLoc(), CurScope, CR_OpenMP,
ParamsTarget, /*OpenMPCaptureLevel=*/1);
diff --git a/clang/test/OpenMP/distribute_simd_codegen.cpp b/clang/test/OpenMP/distribute_simd_codegen.cpp
index 297f508575d99d9..f74abbe32e454f6 100644
--- a/clang/test/OpenMP/distribute_simd_codegen.cpp
+++ b/clang/test/OpenMP/distribute_simd_codegen.cpp
@@ -220,7 +220,7 @@ int fint(void) { return ftemplate<int>(); }
// CHECK1-NEXT: [[TMP32:%.*]] = icmp ne i32 [[TMP31]], 0
// CHECK1-NEXT: br i1 [[TMP32]], label [[OMP_OFFLOAD_FAILED:%.*]], label [[OMP_OFFLOAD_CONT:%.*]]
// CHECK1: omp_offload.failed:
-// CHECK1-NEXT: call void @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l70(ptr [[TMP0]], ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]]) #[[ATTR4:[0-9]+]]
+// CHECK1-NEXT: call void @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l70(ptr [[TMP0]], ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]]) #[[ATTR3:[0-9]+]]
// CHECK1-NEXT: br label [[OMP_OFFLOAD_CONT]]
// CHECK1: omp_offload.cont:
// CHECK1-NEXT: ret void
@@ -242,7 +242,7 @@ int fint(void) { return ftemplate<int>(); }
//
//
// CHECK1-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l70.omp_outlined
-// CHECK1-SAME: (ptr noalias noundef [[DOTGLOBAL_TID_:%.*]], ptr noalias noundef [[DOTBOUND_TID_:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[A:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[B:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[C:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[D:%.*]]) #[[ATTR2:[0-9]+]] {
+// CHECK1-SAME: (ptr noalias noundef [[DOTGLOBAL_TID_:%.*]], ptr noalias noundef [[DOTBOUND_TID_:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[A:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[B:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[C:%.*]], ptr noundef nonnull align 8 dereferenceable(8) [[D:%.*]]) #[[ATTR1]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[DOTGLOBAL_TID__ADDR:%.*]] = alloca ptr, align 8
// CHECK1-NEXT: [[DOTBOUND_TID__ADDR:%.*]] = alloca ptr, align 8
@@ -291,45 +291,45 @@ int fint(void) { return ftemplate<int>(); }
// CHECK1-NEXT: store i32 [[TMP9]], ptr [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK1: omp.inner.for.cond:
-// CHECK1-NEXT: [[TMP10:%.*]] = load i32, ptr [[DOTOMP_IV]], align 4, !llvm.access.group [[ACC_GRP13:![0-9]+]]
-// CHECK1-NEXT: [[TMP11:%.*]] = load i32, ptr [[DOTOMP_UB]], align 4, !llvm.access.group [[ACC_GRP13]]
+// CHECK1-NEXT: [[TMP10:%.*]] = load i32, ptr [[DOTOMP_IV]], align 4, !llvm.access.group [[ACC_GRP8:![0-9]+]]
+// CHECK1-NEXT: [[TMP11:%.*]] = load i32, ptr [[DOTOMP_UB]], align 4, !llvm.access.group [[ACC_GRP8]]
// CHECK1-NEXT: [[CMP1:%.*]] = icmp sle i32 [[TMP10]], [[TMP11]]
// CHECK1-NEXT: br i1 [[CMP1]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK1: omp.inner.for.body:
-// CHECK1-NEXT: [[TMP12:%.*]] = load i32, ptr [[DOTOMP_IV]], align 4, !llvm.access.group [[ACC_GRP13]]
+// CHECK1-NEXT: [[TMP12:%.*]] = load i32, ptr [[DOTOMP_IV]], align 4, !llvm.access.group [[ACC_GRP8]]
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP12]], 7
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 33, [[MUL]]
-// CHECK1-NEXT: store i32 [[ADD]], ptr [[I]], align 4, !llvm.access.group [[ACC_GRP13]]
-// CHECK1-NEXT: [[TMP13:%.*]] = load ptr, ptr [[TMP1]], align 8, !llvm.access.group [[ACC_GRP13]]
-// CHECK1-NEXT: [[TMP14:%.*]] = load i32, ptr [[I]], align 4, !llvm.access.group [[ACC_GRP13]]
+// CHECK1-NEXT: store i32 [[ADD]], ptr [[I]], align 4, !llvm.access.group [[ACC_GRP8]]
+// CHECK1-NEXT: [[TMP13:%.*]] = load ptr, ptr [[TMP1]], align 8, !llvm.access.group [[ACC_GRP8]]
+// CHECK1-NEXT: [[TMP14:%.*]] = load i32, ptr [[I]], align 4, !llvm.access.group [[ACC_GRP8]]
// CHECK1-NEXT: [[IDXPROM:%.*]] = sext i32 [[TMP14]] to i64
// CHECK1-NEXT: [[ARRAYIDX:%.*]] = getelementptr inboun...
[truncated]
Force-pushed from 4ce2822 to e5b4d3e
Force-pushed from e5b4d3e to ec4d394
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from ec4d394 to 534deed
Tests in …
Small change that affects a lot of tests. LG once these are fixed: …
Force-pushed from 534deed to 5c5d473
LG
Force-pushed from 1762931 to a503e7c
The KernelEnvironment is for compile-time information about a kernel. It allows the compiler to feed information to the runtime. The KernelLaunchEnvironment is for dynamic information *per* kernel launch. It allows the runtime to feed information to the kernel that is not shared with other invocations of the kernel. The first use case will be to replace the globals that synchronize teams reductions with per-launch versions; later we can have per-launch memory pools. To provide the information to the kernel, we introduce a default argument at position 0 which is allocated, passed, and later deallocated by the runtime.
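To make the split concrete, here is a hedged C++ sketch of the two environments and of how the launch environment reaches the kernel as argument 0. The field names and the kernel signature are illustrative assumptions, not the verbatim upstream definitions:

#include <cstdint>

// Compile-time information the compiler bakes into the image next to the
// kernel (illustrative fields only).
struct KernelEnvironmentTy {
  uint8_t UseGenericStateMachine;
  uint8_t MayUseNestedParallelism;
  int32_t MinThreads, MaxThreads;
  int32_t MinTeams, MaxTeams;
};

// Per-launch information the runtime fills in before each invocation; the
// reduction fields are hypothetical placeholders for the teams-reduction
// use case mentioned above.
struct KernelLaunchEnvironmentTy {
  uint32_t ReductionCnt = 0;
  void *ReductionBuffer = nullptr; // replaces the former shared global buffer
};

// Conceptual device entry point: the launch environment is the implicit
// argument at position 0, ahead of the user-visible kernel arguments.
extern "C" void omp_offloading_example_kernel(
    KernelLaunchEnvironmentTy *LaunchEnv, void *DynPtr, int *UserArg);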
Force-pushed from a503e7c to 65de2e7
…lvm#70401) The KernelEnvironment is for compile-time information about a kernel. It allows the compiler to feed information to the runtime. The KernelLaunchEnvironment is for dynamic information *per* kernel launch. It allows the runtime to feed information to the kernel that is not shared with other invocations of the kernel. The first use case is to replace the globals that synchronize teams reductions with per-launch versions. This allows concurrent teams reductions. More use cases will follow, e.g., per-launch memory pools. Fixes: llvm#70249 Change-Id: I06ce8c63cf5020be778e1a9e06053a1950dfb18e
Performance degradation observed for this patch on trunk build: hpc2021 8 mpi210's reference run.
With reference to the performance degradation: this patch introduces an additional allocation/data-submit/deallocation for every kernel launch (GenericKernelTy::getKernelLaunchEnvironment(), PluginInterface.cpp). Analysis shows that this overhead appears to be the primary reason for the perf degradation. Is it possible to incur this additional overhead only when we need it? For example, can it be avoided for non-reduction kernels?
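For readers following along, a minimal sketch of the extra per-launch steps being described, with hypothetical helpers (allocDeviceMem, submitToDevice, freeDeviceMem) standing in for the real plugin calls around GenericKernelTy::getKernelLaunchEnvironment():

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct KernelLaunchEnvironmentTy { uint32_t ReductionCnt = 0; };

// Stand-ins for device allocation and host-to-device copies (assumptions,
// not the real plugin API).
static void *allocDeviceMem(std::size_t Size) { return std::malloc(Size); }
static void submitToDevice(void *Dst, const void *Src, std::size_t Size) {
  std::memcpy(Dst, Src, Size);
}
static void freeDeviceMem(void *Ptr) { std::free(Ptr); }

// Every launch pays for (1) an allocation and (2) a data submit before the
// kernel runs, and (3) a deallocation afterwards. Skipping all three when the
// kernel does not need the environment (e.g., no reductions) is the kind of
// limiting asked for here and later done in #73864.
void launchOnce(bool NeedsLaunchEnv) {
  KernelLaunchEnvironmentTy HostEnv;
  void *DeviceEnv = nullptr;
  if (NeedsLaunchEnv) {
    DeviceEnv = allocDeviceMem(sizeof(HostEnv));          // (1) allocation
    submitToDevice(DeviceEnv, &HostEnv, sizeof(HostEnv)); // (2) data submit
  }
  // ... enqueue the kernel, passing DeviceEnv as implicit argument 0 ...
  if (DeviceEnv)
    freeDeviceMem(DeviceEnv);                             // (3) deallocation
}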
@jdoerfert, could you please respond to our comments about the performance degradation?
Both of the benchmarks that exhibited performance regressions contain reductions.
This was brought up and discussed in the weekly meeting.
After #73864, only reduction kernels may incur the additional data submit for the kernel launch environment. If you enable INFO messages like in #74030, you will see this additional data submit reported there.
For example, using https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/test/smoke/reduction_teams_distribute_parallel/reduction_teams_distribute_parallel.c
@jdoerfert The above can be used as a repro.
Original perf regressions seem recovered enough, thx.
The KernelEnvironment is for compile-time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information per kernel launch.
It allows the runtime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More use cases will
follow, e.g., per-launch memory pools.
Fixes: #70249
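As an illustration of the teams-reduction use case (this example is not taken from the PR or its tests and assumes an offloading-capable compiler), per-launch reduction buffers in the KernelLaunchEnvironment let two reduction kernels run concurrently instead of contending on one shared global buffer:

#include <cstdio>

int main() {
  const int N = 1 << 20;
  int SumA = 0, SumB = 0;
  // Two independent target teams reductions issued with nowait; with
  // per-launch reduction buffers they no longer serialize on shared globals.
#pragma omp target teams distribute parallel for reduction(+ : SumA) nowait
  for (int I = 0; I < N; ++I)
    SumA += 1;
#pragma omp target teams distribute parallel for reduction(+ : SumB) nowait
  for (int I = 0; I < N; ++I)
    SumB += 1;
#pragma omp taskwait // wait for both deferred target tasks
  std::printf("SumA=%d SumB=%d\n", SumA, SumB);
  return 0;
}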