
[AMDGPU][Attributor] Make AAAMDFlatWorkGroupSize honor existing attribute #114357


Merged: 1 commit merged into main from users/shiltian/honor-attributes on Dec 11, 2024

Conversation

shiltian
Contributor

@shiltian shiltian commented Oct 31, 2024

If a function has amdgpu-flat-work-group-size, honor it in initialize by
taking its value directly; otherwise, the default range is used as a starting
point. We no longer manipulate the known range, which could cause issues
because the known range acts as a "throttle" on the assumed range: the
assumed range cannot be widened properly in updateImpl if the known range is
not set correctly for whatever reason. Another benefit of not touching the
known range is that, if we indicate a pessimistic state, the AA is also
invalidated, so manifest will not be called. Since we honor the attribute, we
do not want, and will not add, any half-baked attribute on a function.
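For context: the attribute encodes a closed range as a "Min,Max" string, while the AA tracks it as a half-open ConstantRange [Min, Max + 1), which is why the patch constructs ConstantRange(APInt(32, Min), APInt(32, Max + 1)). Below is a minimal, self-contained sketch of that mapping; it is illustrative only, and the helper name parseFlatWorkGroupSize and the sample value "64,256" are assumptions rather than code from this PR.

#include <cstdio>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: parse an "amdgpu-flat-work-group-size"="Min,Max" value.
// Returns the closed range [Min, Max], or std::nullopt if the string is malformed.
static std::optional<std::pair<unsigned, unsigned>>
parseFlatWorkGroupSize(const std::string &Value) {
  size_t Comma = Value.find(',');
  if (Comma == std::string::npos)
    return std::nullopt;
  try {
    unsigned Min = std::stoul(Value.substr(0, Comma));
    unsigned Max = std::stoul(Value.substr(Comma + 1));
    if (Min == 0 || Min > Max)
      return std::nullopt;
    return std::make_pair(Min, Max);
  } catch (...) {
    return std::nullopt;
  }
}

int main() {
  auto Range = parseFlatWorkGroupSize("64,256"); // sample value, not a real default
  if (!Range)
    return 1;
  auto [Min, Max] = *Range;
  // The AA stores this as the half-open range [Min, Max + 1); when the range is
  // emitted back as an attribute, the upper bound is printed as Upper - 1.
  std::printf("attribute [%u, %u] -> ConstantRange [%u, %u)\n", Min, Max, Min, Max + 1);
  return 0;
}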


@shiltian shiltian requested review from arsenm and choikwa October 31, 2024 05:05
@llvmbot
Member

llvmbot commented Oct 31, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

If a function has amdgpu-flat-work-group-size, honor it in initialize by
taking its value directly, setting it as the known range, and indicating a
pessimistic fixpoint so that the known range is propagated to the assumed
range; otherwise, initialize simply does nothing. We no longer clamp the known
range (a real clamp, as opposed to the union performed by IntegerRangeState),
which could cause issues because the known range acts as a "throttle" on the
assumed range, so the assumed range cannot be widened properly in updateImpl.
Another benefit of not touching the known range in initialize is that, if we
indicate a pessimistic state in updateImpl, the AA is also invalidated, so
manifest will not be called. Since we honor the attribute, we do not want any
half-baked attribute added to a function.


Full diff: https://github.com/llvm/llvm-project/pull/114357.diff

1 file affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+54-11)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index 6a69b9d2bfc716..53e44bb5e094fd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -168,7 +168,15 @@ class AMDGPUInformationCache : public InformationCache {
     return ST.supportsGetDoorbellID();
   }
 
-  std::pair<unsigned, unsigned> getFlatWorkGroupSizes(const Function &F) {
+  std::optional<std::pair<unsigned, unsigned>>
+  getFlatWorkGroupSizeAttr(const Function &F) const {
+    Attribute A = F.getFnAttribute("amdgpu-flat-work-group-size");
+    if (!A.isStringAttribute())
+      return std::nullopt;
+    return getFlatWorkGroupSizes(F);
+  }
+
+  std::pair<unsigned, unsigned> getFlatWorkGroupSizes(const Function &F) const {
     const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
     return ST.getFlatWorkGroupSizes(F);
   }
@@ -707,8 +715,7 @@ struct AAAMDSizeRangeAttribute
   /// See AbstractAttribute::trackStatistics()
   void trackStatistics() const override {}
 
-  template <class AttributeImpl>
-  ChangeStatus updateImplImpl(Attributor &A) {
+  template <class AttributeImpl> ChangeStatus updateImplImpl(Attributor &A) {
     ChangeStatus Change = ChangeStatus::UNCHANGED;
 
     auto CheckCallSite = [&](AbstractCallSite CS) {
@@ -728,12 +735,43 @@ struct AAAMDSizeRangeAttribute
     };
 
     bool AllCallSitesKnown = true;
-    if (!A.checkForAllCallSites(CheckCallSite, *this, true, AllCallSitesKnown))
+    if (!A.checkForAllCallSites(CheckCallSite, *this,
+                                /*RequireAllCallSites=*/true,
+                                AllCallSitesKnown))
       return indicatePessimisticFixpoint();
 
     return Change;
   }
 
+  /// Clamp the assumed range to the default value ([Min, Max]) and emit the
+  /// attribute if it is not same as default.
+  ChangeStatus
+  emitAttributeIfNotDefaultAfterClamp(Attributor &A,
+                                      std::pair<unsigned, unsigned> Default) {
+    auto [Min, Max] = Default;
+    unsigned Lower = getAssumed().getLower().getZExtValue();
+    unsigned Upper = getAssumed().getUpper().getZExtValue();
+
+    // Clamp the range to the default value.
+    if (Lower < Min)
+      Lower = Min;
+    if (Upper > Max + 1)
+      Upper = Max + 1;
+
+    // No manifest if the value is same as default after clamp.
+    if (Lower == Min && Upper == Max + 1)
+      return ChangeStatus::UNCHANGED;
+
+    Function *F = getAssociatedFunction();
+    LLVMContext &Ctx = F->getContext();
+    SmallString<10> Buffer;
+    raw_svector_ostream OS(Buffer);
+    OS << Lower << ',' << Upper - 1;
+    return A.manifestAttrs(getIRPosition(),
+                           {Attribute::get(Ctx, AttrName, OS.str())},
+                           /* ForceReplace=*/true);
+  }
+
   ChangeStatus emitAttributeIfNotDefault(Attributor &A, unsigned Min,
                                          unsigned Max) {
     // Don't add the attribute if it's the implied default.
@@ -767,11 +805,17 @@ struct AAAMDFlatWorkGroupSize : public AAAMDSizeRangeAttribute {
 
   void initialize(Attributor &A) override {
     Function *F = getAssociatedFunction();
+
     auto &InfoCache = static_cast<AMDGPUInformationCache &>(A.getInfoCache());
-    unsigned MinGroupSize, MaxGroupSize;
-    std::tie(MinGroupSize, MaxGroupSize) = InfoCache.getFlatWorkGroupSizes(*F);
-    intersectKnown(
-        ConstantRange(APInt(32, MinGroupSize), APInt(32, MaxGroupSize + 1)));
+    std::optional<std::pair<unsigned, unsigned>> Attr =
+        InfoCache.getFlatWorkGroupSizeAttr(*F);
+
+    if (Attr.has_value()) {
+      auto [Min, Max] = *Attr;
+      intersectKnown(ConstantRange(APInt(32, Min), APInt(32, Max + 1)));
+      indicatePessimisticFixpoint();
+      return;
+    }
 
     if (AMDGPU::isEntryFunctionCC(F->getCallingConv()))
       indicatePessimisticFixpoint();
@@ -788,9 +832,8 @@ struct AAAMDFlatWorkGroupSize : public AAAMDSizeRangeAttribute {
   ChangeStatus manifest(Attributor &A) override {
     Function *F = getAssociatedFunction();
     auto &InfoCache = static_cast<AMDGPUInformationCache &>(A.getInfoCache());
-    unsigned Min, Max;
-    std::tie(Min, Max) = InfoCache.getMaximumFlatWorkGroupRange(*F);
-    return emitAttributeIfNotDefault(A, Min, Max);
+    auto [Min, Max] = InfoCache.getMaximumFlatWorkGroupRange(*F);
+    return emitAttributeIfNotDefaultAfterClamp(A, {Min, Max});
   }
 
   /// See AbstractAttribute::getName()
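
For illustration only: the clamp in emitAttributeIfNotDefaultAfterClamp operates on the half-open assumed range [Lower, Upper) against the closed default [Min, Max], then prints closed bounds back. The following self-contained sketch reproduces that arithmetic with assumed numbers; the default range [1, 1024] and the sample inputs are placeholders, not values taken from the patch or from a real subtarget.

#include <cstdio>
#include <utility>

// Mirrors the clamp above: the assumed range is half-open [Lower, Upper), the
// default is the closed range [Min, Max].
static void clampAndPrint(std::pair<unsigned, unsigned> Default, unsigned Lower,
                          unsigned Upper) {
  auto [Min, Max] = Default;
  if (Lower < Min)
    Lower = Min;
  if (Upper > Max + 1)
    Upper = Max + 1;
  if (Lower == Min && Upper == Max + 1) {
    std::printf("no attribute emitted (same as default [%u, %u])\n", Min, Max);
    return;
  }
  // The emitted attribute string uses closed bounds, hence Upper - 1.
  std::printf("emit \"amdgpu-flat-work-group-size\"=\"%u,%u\"\n", Lower, Upper - 1);
}

int main() {
  clampAndPrint({1, 1024}, 64, 257);  // -> emit "64,256"
  clampAndPrint({1, 1024}, 1, 1025);  // -> no attribute emitted (equals default)
  clampAndPrint({1, 1024}, 0, 2048);  // -> clamped to the default, nothing emitted
  return 0;
}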

Contributor

@arsenm arsenm left a comment

Test changes?

@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch 3 times, most recently from 19e0cce to f2314c9 Compare October 31, 2024 15:46
@shiltian shiltian marked this pull request as ready for review October 31, 2024 16:00
@shiltian shiltian changed the title [WIP][AMDGPU][Attributor] Make AAAMDFlatWorkGroupSize honor existing attribute [AMDGPU][Attributor] Make AAAMDFlatWorkGroupSize honor existing attribute Oct 31, 2024
@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch 4 times, most recently from 642636d to f1aea50 Compare October 31, 2024 17:54
@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch 4 times, most recently from c35f519 to 2803f73 Compare November 2, 2024 03:33
; NOINFS: attributes #0 = { nounwind "amdgpu-waves-per-eu"="4,10" "no-infs-fp-math"="true" "uniform-work-group-size"="false" }
; NOINFS: attributes #1 = { nounwind "less-precise-fpmad"="false" "no-infs-fp-math"="true" "no-nans-fp-math"="false" "uniform-work-group-size"="false" "unsafe-fp-math"="false" }
; NOINFS: attributes #0 = { nounwind "no-infs-fp-math"="true" "uniform-work-group-size"="false" "unsafe-fp-math"="true" }
; NOINFS: attributes #1 = { nounwind "less-precise-fpmad"="false" "no-infs-fp-math"="true" "no-nans-fp-math"="false" "uniform-work-group-size"="false" "unsafe-fp-math"="true" }
Contributor

Should this change "unsafe-fp-math"="true"?

Contributor Author

No, it should not and I don't see why this is changed...

@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch 3 times, most recently from fe201bc to b6b851f Compare November 4, 2024 00:35
@shiltian
Contributor Author

shiltian commented Nov 5, 2024

bump

@choikwa
Contributor

choikwa commented Nov 12, 2024

LGTM

@shiltian
Contributor Author

shiltian commented Dec 9, 2024

@choikwa you might have forgotten to accept it.

@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch 2 times, most recently from 43322a4 to 2d1641d Compare December 10, 2024 17:47
@shiltian
Contributor Author

ping, I prefer not to update tests again and again...

@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch from 2d1641d to d949823 Compare December 11, 2024 16:59
Contributor Author

shiltian commented Dec 11, 2024

Merge activity

  • Dec 11, 4:38 PM EST: A user started a stack merge that includes this pull request via Graphite.
  • Dec 11, 4:40 PM EST: Graphite rebased this pull request as part of a merge.
  • Dec 11, 4:42 PM EST: Graphite rebased this pull request as part of a merge.
  • Dec 11, 4:45 PM EST: Graphite rebased this pull request as part of a merge.
  • Dec 11, 4:47 PM EST: A user merged this pull request with Graphite.

@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch 2 times, most recently from 181386d to 768a316 Compare December 11, 2024 21:41
[AMDGPU][Attributor] Make AAAMDFlatWorkGroupSize honor existing attribute

If a function has `amdgpu-flat-work-group-size`, honor it in `initialize` by
taking its value directly; otherwise, the default range is used as a starting
point. We no longer manipulate the known range, which could cause issues
because the known range acts as a "throttle" on the assumed range: the
assumed range cannot be widened properly in `updateImpl` if the known range is
not set correctly for whatever reason. Another benefit of not touching the
known range is that, if we indicate a pessimistic state, the AA is also
invalidated, so `manifest` will not be called. Since we honor the attribute,
we do not want, and will not add, any half-baked attribute on a function.
@shiltian shiltian force-pushed the users/shiltian/honor-attributes branch from 768a316 to a062b29 Compare December 11, 2024 21:45
@shiltian shiltian merged commit 7dbd6cd into main Dec 11, 2024
5 of 7 checks passed
@shiltian shiltian deleted the users/shiltian/honor-attributes branch December 11, 2024 21:47
auto R = AMDGPU::getIntegerPairAttribute(F, "amdgpu-flat-work-group-size");
if (!R)
  return std::nullopt;
return std::make_pair(R->first, *(R->second));
Contributor

Extra parens?

/// \returns \p std::nullopt and emits error if one of the requested values
/// cannot be converted to integer, or \p OnlyFirstRequired is false and
/// "second" value is not present.
std::optional<std::pair<unsigned, std::optional<unsigned>>>
Contributor

Not sure why the second entry is optional. This is spreading a parsing detail to users of the value. Users should be able to just treat it as-if it always has 2 components, the one entry forms are just a compatibility thing with old IR

Contributor Author

Our front end can still emit "X,". I think we could potentially fix this in the front end and then remove the 2nd optional entry.

Contributor

That should never happen. But as a parsing function, that should be abstracted from the attributor's usage
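
To illustrate that separation (a sketch only: the wrapper name normalizePairAttr and the choice to fall back to the first value when the second is missing are assumptions for illustration, not existing API), a consumer-facing helper could normalize the parsed result so callers always see two components:

#include <optional>
#include <utility>

// Shape of the parser's result as declared above: the outer optional means the
// attribute was present and parseable, the inner optional means the second
// value was present.
using RawPairAttr = std::optional<std::pair<unsigned, std::optional<unsigned>>>;

// Hypothetical wrapper hiding the parsing detail from users such as the
// attributor. Falling back to the first value when the second is missing is an
// assumed policy, purely for illustration.
std::optional<std::pair<unsigned, unsigned>>
normalizePairAttr(const RawPairAttr &Raw) {
  if (!Raw)
    return std::nullopt;
  unsigned First = Raw->first;
  unsigned Second = Raw->second.value_or(First);
  return std::make_pair(First, Second);
}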

@jtramm

jtramm commented Jan 14, 2025

Looks like this PR resulted in about a 2x slowdown on MI250 for the most expensive kernel in OpenMC, resulting in an overall 20% performance degradation. Any ideas what might have caused it?

@shiltian
Contributor Author

You might have device-libs that still come with half-baked values. To verify, you could use llvm-dis to convert a library to a .ll file, remove all amdgpu-flat-work-group-size attributes, convert it back to a .bc file, and put it back in place. After that, recompile all of OpenMC.

@jtramm

jtramm commented Jan 15, 2025

Ah, interesting! Which file(s) should I be converting to .ll? And how do I remove attributes in a .ll file -- just delete the entire line or change to something else? Any specific commands to give to llvm-dis for converting between file types?

@arsenm
Contributor

arsenm commented Jan 15, 2025

Just rebuild the device libraries. They should be version locked to the llvm commit you are using

@jtramm

jtramm commented Jan 15, 2025

Here's what my current cmake line looks like for LLVM:

PROJECTS="clang;lld"
RUNTIMES="openmp;offload"
TARGETS="all"
USE_CCACHE=OFF

cmake \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_ENABLE_PROJECTS=${PROJECTS}  \
    -DLLVM_ENABLE_RUNTIMES=${RUNTIMES}   \
    -DLLVM_TARGETS_TO_BUILD=${TARGETS}   \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLIBOMPTARGET_ENABLE_DEBUG=ON \
    -DLLVM_OPTIMIZED_TABLEGEN=ON \
    -DBUILD_SHARED_LIBS=ON \
    -DLLVM_CCACHE_BUILD=${USE_CCACHE} \
    -DLLVM_APPEND_VC_REV=OFF \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DCMAKE_CXX_FLAGS="-Wno-address" \
    -DCMAKE_INSTALL_PREFIX=../llvm-install \
    ../llvm-project/llvm

make -j128 install

What would I need to change to specifically rebuild the device libraries, or am I already doing that?

@shiltian
Contributor Author

@jtramm You are building vanilla LLVM but the device-libs are in AMD's fork. You can simply use the LLVM you build to build device-libs. Here is the command I usually use to build device-libs:

cmake -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DLLVM_ROOT=... \
    -DClang_ROOT=... \
    -B $BUILD_ROOT/device-libs/release \
    -S llvm-project/amd/device-libs

@jtramm

jtramm commented Jan 15, 2025

I don't understand -- are you saying I cannot use vanilla LLVM to compile for AMD anymore, that I need AMD's fork now instead?

@shiltian
Contributor Author

No, vanilla LLVM doesn't have device-libs; those are only available in AMD's fork. For this particular case, you will have to build device-libs because the device-libs on your system have half-baked values that can cause a performance regression. That's why I suggested simply removing the attributes from those /opt/rocm/amdgcn/bitcode/*.bc files.

You can still build vanilla LLVM and use that to build device-libs. The only thing you need from AMD's fork is the device-libs, which sit in amd/device-libs.

To make it simple, can you first verify whether amdgpu-flat-work-group-size or amdgpu-waves-per-eu appears in any /opt/rocm/amdgcn/bitcode/*.bc files? If not, then you don't need to build device-libs, and this would be a separate issue.

@jtramm

jtramm commented Jan 15, 2025

The rocm install on this cluster seems to be in a different location, but the .bc files don't appear to have either of those attributes in them:

jtramm@amdgpu04:/soft/compilers/rocm/rocm-6.3.0/amdgcn/bitcode$ ls
asanrtl.bc                          oclc_isa_version_1010.bc          oclc_isa_version_10-3-generic.bc  oclc_isa_version_600.bc  oclc_isa_version_805.bc  oclc_isa_version_942.bc
hip.bc                              oclc_isa_version_1011.bc          oclc_isa_version_1100.bc          oclc_isa_version_601.bc  oclc_isa_version_810.bc  oclc_isa_version_9-generic.bc
ockl.bc                             oclc_isa_version_1012.bc          oclc_isa_version_1101.bc          oclc_isa_version_602.bc  oclc_isa_version_900.bc  oclc_unsafe_math_off.bc
oclc_abi_version_400.bc             oclc_isa_version_1013.bc          oclc_isa_version_1102.bc          oclc_isa_version_700.bc  oclc_isa_version_902.bc  oclc_unsafe_math_on.bc
oclc_abi_version_500.bc             oclc_isa_version_10-1-generic.bc  oclc_isa_version_1103.bc          oclc_isa_version_701.bc  oclc_isa_version_904.bc  oclc_wavefrontsize64_off.bc
oclc_abi_version_600.bc             oclc_isa_version_1030.bc          oclc_isa_version_1150.bc          oclc_isa_version_702.bc  oclc_isa_version_906.bc  oclc_wavefrontsize64_on.bc
oclc_correctly_rounded_sqrt_off.bc  oclc_isa_version_1031.bc          oclc_isa_version_1151.bc          oclc_isa_version_703.bc  oclc_isa_version_908.bc  ocml.bc
oclc_correctly_rounded_sqrt_on.bc   oclc_isa_version_1032.bc          oclc_isa_version_1152.bc          oclc_isa_version_704.bc  oclc_isa_version_909.bc  opencl.bc
oclc_daz_opt_off.bc                 oclc_isa_version_1033.bc          oclc_isa_version_11-generic.bc    oclc_isa_version_705.bc  oclc_isa_version_90a.bc
oclc_daz_opt_on.bc                  oclc_isa_version_1034.bc          oclc_isa_version_1200.bc          oclc_isa_version_801.bc  oclc_isa_version_90c.bc
oclc_finite_only_off.bc             oclc_isa_version_1035.bc          oclc_isa_version_1201.bc          oclc_isa_version_802.bc  oclc_isa_version_940.bc
oclc_finite_only_on.bc              oclc_isa_version_1036.bc          oclc_isa_version_12-generic.bc    oclc_isa_version_803.bc  oclc_isa_version_941.bc
jtramm@amdgpu04:/soft/compilers/rocm/rocm-6.3.0/amdgcn/bitcode$ grep amdgpu-flat-work-group-size *.bc
jtramm@amdgpu04:/soft/compilers/rocm/rocm-6.3.0/amdgcn/bitcode$ grep amdgpu-waves-per-eu *.bc
jtramm@amdgpu04:/soft/compilers/rocm/rocm-6.3.0/amdgcn/bitcode$ 

@jtramm

jtramm commented Jan 15, 2025

Ah, it looks like the .bc files are binary files. Is there a specific program I need to use to check them for attributes?

@shiltian
Contributor Author

Can you disassemble the .bc to .ll and then check again? Directly checking .bc will not work because they are compressed/encoded. You can use /soft/compilers/rocm/rocm-6.3.0/llvm/bin/llvm-dis xxx.bc -o xxx.ll.

jdoerfert added a commit to jdoerfert/llvm-project that referenced this pull request Jan 17, 2025
This should replace llvm#114357.

The problem this resolves is that AAAMDWavesPerEU used "assumed"
information from AAAMDFlatWorkGroupSize to derive "known" information.
This is generally not valid. The canonical way is to use
"assumed"/"known" information to improve "assumed"/"known" information,
respectively. What happened before was that the new "known" information
(derived from "assumed") was manifested even though it should not have
been. That is how we ended up with external functions that had
"amdgpu-waves-per-eu" set. This won't happen if we only set the assumed
information, as the invalidation caused by unknown callers will fall
back to the "unset" known information.

In addition to the above, we will identify invalid waves-per-eu values,
emit a warning, and replace the invalid values with valid ones. This
also happens for kernels. See
`llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll`

This should hopefully fix llvm#123092.