
[llvm-profgen] Improve sample profile density #92144


Merged
merged 8 commits on May 24, 2024

Conversation

wlei-llvm
Contributor

@wlei-llvm wlei-llvm commented May 14, 2024

The profile density feature (the amount of samples in the profile relative to the program size) is used to identify insufficient-sample issues and provide hints for users to increase the sample count. A low-density profile can be inaccurate due to statistical noise, which can hurt FDO performance.

This change introduces two improvements to the current density work.

  1. The density calculation/definition is changed. Previously, the density of a profile was calculated as the minimum density over all warm functions (a function was considered warm if its total samples were within the top N percent of the profile). However, a profile with high total samples can still have a very low density, which makes the density value unstable.
  • Instead, we want to find a density number such that if a function's density is below this value, it is considered a low-density function. We consider the whole profile bad if a group of low-density functions has an accumulated sum of samples that exceeds the N percent cutoff of the total samples.

  • In the implementation, we sort the function profiles by density and iterate them in descending order, accumulating the body samples until the sum exceeds the (100% - N) percentage of the total samples; the profile density is then the last (minimum) function density among the processed functions. We introduce a flag (--profile-density-threshold) for this percentage threshold.

  2. The density is now calculated based on the final (compiler-used) profiles instead of merged context-less profiles.
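The accumulation scheme described above can be sketched as a small standalone example (hypothetical illustration code, not the actual llvm-profgen implementation; the function name and the cutoff encoding, scaled by 10000 like the patch's flag, are assumptions):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the new profile-density calculation.
// Funcs holds {function density, function body samples}; Cutoff is the
// percentage threshold scaled by 10000 (e.g. 990000 == 99%).
double calculateProfileDensity(std::vector<std::pair<double, uint64_t>> Funcs,
                               int Cutoff) {
  uint64_t TotalSamples = 0;
  for (const auto &F : Funcs)
    TotalSamples += F.second;

  // Sort by density in descending order.
  std::stable_sort(
      Funcs.begin(), Funcs.end(),
      [](const auto &A, const auto &B) { return A.first > B.first; });

  // Accumulate samples from the densest functions down; the density of the
  // last function needed to reach the cutoff is the profile density.
  double ProfileDensity = 0.0;
  uint64_t Accumulated = 0;
  for (const auto &F : Funcs) {
    Accumulated += F.second;
    ProfileDensity = F.first;
    if (Accumulated >= TotalSamples * static_cast<double>(Cutoff) / 1000000)
      break;
  }
  return ProfileDensity;
}
```

With a 99% cutoff, the low-density tail that is ignored holds at most 1% of the samples, so the reported density reflects the functions that matter for performance.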

@wlei-llvm wlei-llvm requested review from WenleiHe and MatzeB May 15, 2024 03:28
@wlei-llvm wlei-llvm marked this pull request as ready for review May 15, 2024 03:31
@llvmbot llvmbot added the PGO Profile Guided Optimizations label May 15, 2024
@llvmbot
Member

llvmbot commented May 15, 2024

@llvm/pr-subscribers-pgo

Author: Lei Wang (wlei-llvm)

Changes

The profile density feature (the amount of samples in the profile relative to the program size) is used to identify insufficient-sample issues and provide hints for users to increase the sample count. A low-density profile can be inaccurate due to statistical noise, which can hurt FDO performance.

This change introduces two improvements to the current density work.

  1. The density calculation/definition is changed. Previously, the density of a profile was calculated as the minimum density over all functions considered warm (a function was considered warm if its total samples were within the top N percent of the profile). However, a profile with high total samples can still have a very low density, which makes the density number not as representative.
  • Instead, we want to find a density number such that if a function's density is below this value, it is considered a low-density function. We consider the whole profile bad if a group of low-density functions has an accumulated sum of total samples that exceeds N percent of the total samples.

  • In the implementation, we sort the function profiles by density and iterate them in descending order, accumulating the total samples until the sum exceeds the (100% - N) percentage of the total samples of the whole profile; the profile density is then the last (minimum) function density among the processed functions. Equivalently, the profile density is bad if the accumulated samples of all the low-density profiles exceed the N percent. We introduce a flag (--profile-density-hot-func-cutoff) for this percentage threshold.

  2. The density is now calculated based on context-sensitive profiles instead of merged profiles. Profiles merged from different contexts can hide low-sample issues. For the implementation, it leverages the sample context tracker, which can fetch the CS profile before the pre-inliner.

Full diff: https://github.com/llvm/llvm-project/pull/92144.diff

3 Files Affected:

  • (modified) llvm/test/tools/llvm-profgen/profile-density.test (+1-1)
  • (modified) llvm/tools/llvm-profgen/ProfileGenerator.cpp (+84-6)
  • (modified) llvm/tools/llvm-profgen/ProfileGenerator.h (+4-1)
diff --git a/llvm/test/tools/llvm-profgen/profile-density.test b/llvm/test/tools/llvm-profgen/profile-density.test
index 0eb83838d16e7..f22c6f04914aa 100644
--- a/llvm/test/tools/llvm-profgen/profile-density.test
+++ b/llvm/test/tools/llvm-profgen/profile-density.test
@@ -7,7 +7,7 @@
 ;CHECK-DENSITY: Sample PGO is estimated to optimize better with 3.1x more samples. Please consider increasing sampling rate or profiling for longer duration to get more samples.
 ;CHECK-DENSITY: Minimum profile density for hot functions with top 99.00% total samples: 3.2
 
-;CHECK-DENSITY-CS: Minimum profile density for hot functions with top 99.00% total samples: 128.3
+;CHECK-DENSITY-CS: Minimum profile density for hot functions with top 99.00% total samples: 619.0
 
 ; original code:
 ; clang -O3 -g -fno-optimize-sibling-calls -fdebug-info-for-profiling qsort.c -o a.out
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.cpp b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
index 5aa44108f9660..ecbc6763e56f1 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.cpp
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
@@ -83,6 +83,10 @@ static cl::opt<double> HotFunctionDensityThreshold(
 static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
                                  llvm::cl::desc("show profile density details"),
                                  llvm::cl::Optional);
+static cl::opt<int> ProfileDensityHotFuncCutOff(
+    "profile-density-hot-func-cutoff", llvm::cl::init(990000),
+    llvm::cl::desc("Total sample cutoff for hot functions used to calculate "
+                   "the profile density."));
 
 static cl::opt<bool> UpdateTotalSamples(
     "update-total-samples", llvm::cl::init(false),
@@ -177,7 +181,8 @@ void ProfileGeneratorBase::write() {
   write(std::move(WriterOrErr.get()), ProfileMap);
 }
 
-void ProfileGeneratorBase::showDensitySuggestion(double Density) {
+void ProfileGeneratorBase::showDensitySuggestion(double Density,
+                                                 int DensityCutoffHot) {
   if (Density == 0.0)
     WithColor::warning() << "The --profile-summary-cutoff-hot option may be "
                             "set too low. Please check your command.\n";
@@ -190,9 +195,7 @@ void ProfileGeneratorBase::showDensitySuggestion(double Density) {
 
   if (ShowDensity)
     outs() << "Minimum profile density for hot functions with top "
-           << format("%.2f",
-                     static_cast<double>(ProfileSummaryCutoffHot.getValue()) /
-                         10000)
+           << format("%.2f", static_cast<double>(DensityCutoffHot) / 10000)
            << "% total samples: " << format("%.1f", Density) << "\n";
 }
 
@@ -771,7 +774,7 @@ void ProfileGenerator::populateBoundarySamplesForAllFunctions(
 void ProfileGeneratorBase::calculateAndShowDensity(
     const SampleProfileMap &Profiles) {
   double Density = calculateDensity(Profiles, HotCountThreshold);
-  showDensitySuggestion(Density);
+  showDensitySuggestion(Density, ProfileSummaryCutoffHot);
 }
 
 FunctionSamples *
@@ -1032,6 +1035,78 @@ void CSProfileGenerator::convertToProfileMap() {
   IsProfileValidOnTrie = false;
 }
 
+void CSProfileGenerator::calculateAndShowDensity(
+    SampleContextTracker &CTracker) {
+  double Density = calculateDensity(CTracker);
+  showDensitySuggestion(Density, ProfileDensityHotFuncCutOff);
+}
+
+// Calculate profile density:
+// Sort the functions by density in descending order and iterate them; once
+// their accumulated total samples exceed the percentage threshold of the
+// total profile samples, the profile density is the last (minimum) function
+// density among the processed functions. This means that if the profile
+// density is good, all the functions significant to performance have good
+// density; or, in other words, if the profile density is bad, the
+// accumulated samples of all the low-density profiles exceed
+// (100% - percentage_threshold).
+// The percentage threshold (--profile-density-hot-func-cutoff) is
+// configurable depending on how much regression the system wants to
+// tolerate.
+double CSProfileGenerator::calculateDensity(SampleContextTracker &CTracker) {
+  double ProfileDensity = 0.0;
+
+  uint64_t TotalProfileSamples = 0;
+  // A list of the function profile density and total samples.
+  std::vector<std::pair<double, uint64_t>> DensityList;
+  for (const auto *Node : CTracker) {
+    const auto *FSamples = Node->getFunctionSamples();
+    if (!FSamples)
+      continue;
+
+    uint64_t TotalBodySamples = 0;
+    uint64_t FuncBodySize = 0;
+    for (const auto &I : FSamples->getBodySamples()) {
+      TotalBodySamples += I.second.getSamples();
+      FuncBodySize++;
+    }
+    // The whole function could be inlined and optimized out, use the callsite
+    // head samples instead to estimate the body count.
+    if (FuncBodySize == 0) {
+      for (const auto &CallsiteSamples : FSamples->getCallsiteSamples()) {
+        FuncBodySize++;
+        for (const auto &Callee : CallsiteSamples.second)
+          TotalBodySamples += Callee.second.getHeadSamplesEstimate();
+      }
+    }
+
+    if (FuncBodySize == 0)
+      continue;
+
+    double FuncDensity = static_cast<double>(TotalBodySamples) / FuncBodySize;
+    TotalProfileSamples += TotalBodySamples;
+    DensityList.emplace_back(FuncDensity, TotalBodySamples);
+  }
+
+  // Sorted by the density in descending order.
+  llvm::stable_sort(DensityList, [&](const std::pair<double, uint64_t> &A,
+                                     const std::pair<double, uint64_t> &B) {
+    if (A.first != B.first)
+      return A.first > B.first;
+    return A.second < B.second;
+  });
+
+  uint64_t AccumulatedSamples = 0;
+  for (const auto &P : DensityList) {
+    AccumulatedSamples += P.second;
+    ProfileDensity = P.first;
+    if (AccumulatedSamples >=
+        TotalProfileSamples * static_cast<float>(ProfileDensityHotFuncCutOff) /
+            1000000)
+      break;
+  }
+
+  return ProfileDensity;
+}
+
 void CSProfileGenerator::postProcessProfiles() {
   // Compute hot/cold threshold based on profile. This will be used for cold
   // context profile merging/trimming.
@@ -1041,6 +1116,7 @@ void CSProfileGenerator::postProcessProfiles() {
   // inline decisions.
   if (EnableCSPreInliner) {
     ContextTracker.populateFuncToCtxtMap();
+    calculateAndShowDensity(ContextTracker);
     CSPreInliner(ContextTracker, *Binary, Summary.get()).run();
     // Turn off the profile merger by default unless it is explicitly enabled.
     if (!CSProfMergeColdContext.getNumOccurrences())
@@ -1061,7 +1137,9 @@ void CSProfileGenerator::postProcessProfiles() {
   sampleprof::SampleProfileMap ContextLessProfiles;
   ProfileConverter::flattenProfile(ProfileMap, ContextLessProfiles, true);
 
-  calculateAndShowDensity(ContextLessProfiles);
+  if (!EnableCSPreInliner)
+    ProfileGeneratorBase::calculateAndShowDensity(ContextLessProfiles);
+
   if (GenCSNestedProfile) {
     ProfileConverter CSConverter(ProfileMap);
     CSConverter.convertCSProfiles();
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.h b/llvm/tools/llvm-profgen/ProfileGenerator.h
index d258fb78bfb11..cf451f9d1a1a4 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.h
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.h
@@ -121,7 +121,7 @@ class ProfileGeneratorBase {
   double calculateDensity(const SampleProfileMap &Profiles,
                           uint64_t HotCntThreshold);
 
-  void showDensitySuggestion(double Density);
+  void showDensitySuggestion(double Density, int DensityCutoffHot);
 
   void collectProfiledFunctions();
 
@@ -363,6 +363,9 @@ class CSProfileGenerator : public ProfileGeneratorBase {
 
   void computeSummaryAndThreshold();
 
+  void calculateAndShowDensity(SampleContextTracker &CTracker);
+  double calculateDensity(SampleContextTracker &CTracker);
+
   bool collectFunctionsFromLLVMProfile(
       std::unordered_set<const BinaryFunction *> &ProfiledFunctions) override;
 

@WenleiHe
Member

The density is now calculated based on context-sensitive profiles instead of merged profiles. Profiles merged from different contexts could hide low sample issues. For implementation, it leverages the sample context tracker which can fetch the CS profile before the pre-inliner.

As we discussed, since optimization is based on the pre-inlined profile, what matters should be the profile density after pre-inline? If the pre-inlined profile has good density, that means optimizations should be able to make reasonably good and stable decisions based on the merged profile, even though the original full-context profile could be of low density. From that POV, it makes sense to use the pre-inlined profile to compute density?

@WenleiHe
Member

However, there is a problem that a high total sample profile can have a very low density, which makes the density number not as representative.

How big is the problem? Like do you see this old heuristic preventing us from getting a consistent density threshold as proxy for performance, because density is less stable?

@@ -83,6 +83,10 @@ static cl::opt<double> HotFunctionDensityThreshold(
static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
llvm::cl::desc("show profile density details"),
llvm::cl::Optional);
static cl::opt<int> ProfileDensityHotFuncCutOff(
"profile-density-hot-func-cutoff", llvm::cl::init(990000),
Member

Do we need a new switch? Should we use ProfileSummaryCutoffHot instead given they are the same value?

Contributor Author

This is just in case people only want to tune the profile density threshold (e.g. to reduce noise) but not ProfileSummaryCutoffHot, which affects profile generation (the pre-inliner).

@@ -190,9 +195,7 @@ void ProfileGeneratorBase::showDensitySuggestion(double Density) {

if (ShowDensity)
outs() << "Minimum profile density for hot functions with top "
Member

With the new heuristic, this message is no longer accurate? It's no longer hot functions with top x% total samples.

Contributor Author

Updated the message.

@@ -121,7 +121,7 @@ class ProfileGeneratorBase {
double calculateDensity(const SampleProfileMap &Profiles,
Member

If we change how density is calculated, we should change it everywhere to be consistent. Why leaving the old heuristic there for non-CS profile?

Contributor Author

Yeah, I had planned that for another diff; now it's updated in the new version.

uint64_t FuncBodySize = 0;
for (const auto &I : FSamples->getBodySamples()) {
TotalBodySamples += I.second.getSamples();
FuncBodySize++;
Member

If you look at ProfileGeneratorBase::calculateDensity, the size used is the actual binary function size, I think that is more accurate / meaningful for density.

IIRC for BodySamples, we omit 0 entries, so counting number of body samples to compute size seems wrong?

Contributor Author

yeah, I noticed that. Assuming the ideal density is
density = total instruction samples / binary size (binary size being the number of instructions).
The problem is that right now we use probes, and we don't have the instruction count for each probe, so total_samples is actually not the sum of all instruction samples; using the binary function size is therefore probably not accurate now.
Instead, using the probe seems more reasonable, because the samples come from the LBR, where the minimum unit is a range, not an address.
For example, say a function has only one BB/probe, but this BB has 10000 instructions and no jmp at all; if we have one LBR hit covering all the instructions, the old density vs. the new one is 1 vs. 1/10000.

IIRC for BodySamples, we omit 0 entries, so counting number of body samples to compute size seems wrong?

We do generate the 0-count probes; we don't generate probes that are fully optimized out, but since those don't show up in the binary, it should be OK to ignore them.

Member

Ok, good point. I didn't think about the implication of total samples with probes. I was hoping we can make it really simple by just querying total samples.

It would have been better if we don't actually rely on the behavior of emitting 0 counts. Do we also emit 0 counts for AutoFDO?

Member

It's worth a comment explaining the nuances.

Contributor Author

Do we also emit 0 counts for AutoFDO?

Yes, we also emit 0 counts for never hit ranges for autoFDO.

uint64_t TotalBodySamples = 0;
uint64_t FuncBodySize = 0;
for (const auto &I : FSamples->getBodySamples()) {
TotalBodySamples += I.second.getSamples();
Member

@WenleiHe WenleiHe May 17, 2024

Why accumulating each body samples instead of just using totalSamples?

Contributor Author

Right now, as we change to use the final nested profile, totalSamples also includes the nested children profiles' totalSamples, so we want to exclude the children's samples.

Member

I guess the question is why do we want to exclude inlinees? we can count total samples including inlinees, and also size including inliness. Otherwise we don't cover such inlinees in density metrics -- what if inlinee has a low density problem?

}
// The whole function could be inlined and optimized out, use the callsite
// head samples instead to estimate the body count.
if (FuncBodySize == 0) {
Member

Why do we only include inlinees when the function body itself is fully optimized away? This seems inconsistent.

Contributor Author

Similar to the comment above, I want to use the block count first. One BB should only have one probe but could have many calls, so to be consistent we use
density = (sum of BB counts) / (num of BBs)

otherwise, it would be
density = (sum of BB counts + sum of call counts) / (num of BBs + num of calls)

That said, it's probably the same thing: the calls and the probe in the same BB would have the same count, so the division result should be close. No strong opinion; since we don't have the instruction count, we just need a new definition.

Member

Ok, they are more or less equivalent; let's just keep it simple and not exclude callsites, regardless?

@wlei-llvm
Contributor Author

The density is now calculated based on context-sensitive profiles instead of merged profiles. Profiles merged from different contexts could hide low sample issues. For implementation, it leverages the sample context tracker which can fetch the CS profile before the pre-inliner.

As we discussed, since optimization is based on pre-inlined profile, what matters should be the profile density after preinline? If pre-inlined profile has good density, that means optimizations should be able to make reasonably good and stable decisions based on merged profile, even though original full context profile could be of low density. From POV, it makes sense to use pre-inlined profile to compute density?

Good point. After thinking it over more, I agree with you: if the profile is merged after inlining, the compiler optimizations use the merged profile and don't use the context-sensitive profile. I will change it to be based on the pre-inlined profile.

Though I still want to emphasize one thing: the existing code uses the completely context-less profile for this, which is also not what we want. IIUC, what we want is just the final profile.

@WenleiHe
Member

what we want is just the final profile.

Is that difficult to change?

@wlei-llvm
Contributor Author

what we want is just the final profile.

Is that difficult to change?

Ah, that should be even easier, just remove the

  // Merge function samples of CS profile to calculate profile density.
  sampleprof::SampleProfileMap ContextLessProfiles;
  for (const auto &I : ProfileMap) {
    ContextLessProfiles[I.second.getName()].merge(I.second);
  }
  calculateAndShowDensity(ContextLessProfiles);

Then we don't need the post-pre-inliner one, only the last one.

@wlei-llvm
Contributor Author

However, there is a problem that a high total sample profile can have a very low density, which makes the density number not as representative.

How big is the problem? Like do you see this old heuristic preventing us from getting a consistent density threshold as proxy for performance, because density is less stable?

I didn't dive very deep into the old heuristic; I'm just guessing from the results. I saw many cases showing a super big number; for example, this came from a normal profile:

Sample PGO is estimated to optimize better with 3633.9x more samples.

That said, as I mentioned in the comments above, there seem to be a few bugs in the previous work (another "bug" is that it uses total_samples to decide the hot functions, while the cutoff number is a block-level count), so maybe it's caused by another issue. Anyway, it seems it needs a big overhaul.
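For reference, the multiplier in that suggestion message appears to be the density threshold divided by the measured density (an assumption inferred from the test expectations in the diff, where a threshold of 10 and a density of 3.2 yield the "3.1x more samples" message; this is a sketch, not the verbatim llvm-profgen code):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch: how many times more samples would be needed for the
// profile to reach the density threshold, assuming density scales linearly
// with the sample count.
double suggestedSampleMultiplier(double DensityThreshold, double Density) {
  return DensityThreshold / Density;
}
```

Under this reading, a 3633.9x suggestion simply means the measured density was thousands of times below the threshold, which is why unstable density values produce such extreme messages.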

@WenleiHe
Member

However, there is a problem that a high total sample profile can have a very low density, which makes the density number not as representative.

How big is the problem? Like do you see this old heuristic preventing us from getting a consistent density threshold as proxy for performance, because density is less stable?

I didn't dive very deep into the old heuristic; I'm just guessing from the results. I saw many cases showing a super big number; for example, this came from a normal profile:

Sample PGO is estimated to optimize better with 3633.9x more samples.

That said, as I mentioned in the comments above, there seem to be a few bugs in the previous work (another "bug" is that it uses total_samples to decide the hot functions, while the cutoff number is a block-level count), so maybe it's caused by another issue. Anyway, it seems it needs a big overhaul.

If it's sorted by density now, as opposed to function hotness, the density cutoff value can be more stable, which can avoid suggestions with huge numbers. But I don't see a bug in the old density; it's just a heuristic with a different philosophy.

"hot-function-density-threshold", llvm::cl::init(1000),
llvm::cl::desc(
"specify density threshold for hot functions (default: 1000)"),
"hot-function-density-threshold", llvm::cl::init(20),
Member

The switch name is also no longer accurate, since we are no longer selecting the top N functions. Perhaps profile-density-threshold is good enough.

Member

Can you share some data to show that 20 is a good threshold?

Comment on lines 809 to 812
if (AccumulatedSamples >= TotalProfileSamples *
static_cast<float>(ProfileDensityCutOffHot) /
1000000)
break;
Member

nit: fold the exit condition into a while loop condition?

for (const auto &CallsiteSamples : FSamples.getCallsiteSamples()) {
FuncBodySize++;
for (const auto &Callee : CallsiteSamples.second) {
calculateDensity(Callee.second, DensityList, TotalProfileSamples);
Member

I was thinking about density as a binary level metric as opposed to source level, in which case it maps better with inlinee included. Just like when we dump hot function list, it's binary level function, instead of source level function.

Any particular reason to do this on source level with inlining excluded?

Contributor Author

Let me try to clarify this first; I will address the other comments later.

I don't have a particular reason to exclude the inlinees. The story is that before, we computed density on a full CS profile (or a merged context-less profile), so there were actually no inlinees in it (see the original code: https://github.com/llvm/llvm-project/blob/main/llvm/tools/llvm-profgen/ProfileGenerator.cpp#L1060-L1064; this is before the nested profile conversion). So I just thought we needed to base it on an un-merged profile and didn't consider the density scope.

I guess the question is why do we want to exclude inlinees? we can count total samples including inlinees, and also size including inliness. Otherwise we don't cover such inlinees in density metrics -- what if inlinee has a low density problem?

Another thing to clarify: this version does not drop the inlinee profiles; it computes each as an independent function instead of as part of the caller function. It should be able to catch a low inlinee profile, e.g.

main:101:100
  1:100 
  2:foo:1
   1:1
foo:100:1
 1:100   

In this version, it will compute three function densities: <main:100>, <foo:1>, <foo:100>; we still have the low-density inlinee <foo:1>.
IIUC, your suggestion is to compute something like <main: 50 (101/2)>, <foo: 100>.

Yeah, if that makes sense. My current data is based on the current version, which seems to work OK.

I think I can implement a binary-level density and compare the two; we can take a look together and discuss it.
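The three densities in the toy profile above can be checked with a tiny sketch (hypothetical helper, using the working definition from this thread: density = sum of body samples / number of body-sample entries):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-function density as discussed in this thread: the sum of
// the body-sample counts divided by the number of body-sample entries.
double funcDensity(const std::vector<uint64_t> &BodySamples) {
  if (BodySamples.empty())
    return 0.0;
  uint64_t Total = 0;
  for (uint64_t S : BodySamples)
    Total += S;
  return static_cast<double>(Total) / BodySamples.size();
}
```

For the toy profile this gives 100 for main (one entry with 100 samples), 1 for the inlined foo context, and 100 for the standalone foo, so the low-density inlinee remains visible when contexts are computed independently.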

Member

Ok, the recursion here covers inlinee samples. Good to know it's not dropped.

Though if we decided to compute density after pre-inline, it might make more sense to include inlinees directly? But yes, data will tell us which one is better. If they are similar, then we can aim for implementation simplicity.

@wlei-llvm
Contributor Author

Though if we decided to compute density after pre-inline, it might make more sense to include inlinees directly? But yes, data will tell us which one is better. If they are similar, then we can aim for implementation simplicity.

Yeah, makes sense, updated to use the binary-level density.

Can you share some date to show that 20 is a good threshold?

Shared more detailed data with you internally, just summarized here for reference.

Service1:

| total_samples | Perf  | Density |
| ------------- | ----- | ------- |
| 2.5M          | +0.35 | 30.29   |
| 1.2M          | +0.18 | 18.79   |
| 280K          | -1.22 | 7.97    |
| 200K          | -1.16 | 4.50    |

Service2:

| total_samples | Perf  | Density |
| ------------- | ----- | ------- |
| 4.2M          | +0.22 | 23.80   |
| 1.2M          | +0.20 | 19.37   |
| 500K          | -0.15 | 8.72    |
| 200K          | -0.97 | 2.71    |

Service3:

| total_samples | Perf | Density |
| ------------- | ---- | ------- |
| 700K          | +0.4 | 48.02   |
| 500K          | -0.4 | 37.95   |
| 100K          | -3   | 6.00    |
| 50K           | -10  | 7.02    |

I'm also working on other services; so far it seems the regression starts when the density drops below 10, so that's why I set the threshold to 20 (we can set it to 50 if we want to be more conservative).

@WenleiHe
Member

2.5M | +0.35 | 30.29
700K | +0.4 | 48.02

Thanks for the data. It looks like 50 is safer?

@@ -768,9 +748,89 @@ void ProfileGenerator::populateBoundarySamplesForAllFunctions(
}
}

// Note taht ideally the size should be the number of function's instruction.
Member

typo: taht -> that

@@ -1,13 +1,16 @@
; RUN: llvm-profgen --format=text --unsymbolized-profile=%S/Inputs/profile-density.raw.prof --binary=%S/Inputs/inline-noprobe2.perfbin --output=%t1 --use-offset=0 --show-density -hot-function-density-threshold=10 --trim-cold-profile=0 &> %t2
; RUN: llvm-profgen --format=text --unsymbolized-profile=%S/Inputs/profile-density.raw.prof --binary=%S/Inputs/inline-noprobe2.perfbin --output=%t1 --use-offset=0 --show-density -profile-density-threshold=10 --trim-cold-profile=0 &> %t2
Member

was there no test for suggestion? should we add one?

Contributor Author

Added the suggestion test for the CS profile (there was already one for the non-CS profile).

llvm::cl::desc(
"specify density threshold for hot functions (default: 1000)"),
"Set the profile density threshold(default: 20), which is used to "
Member

nit: remove the "Set" verb, which is implied for any flag; then describe what "profile-density-threshold" is. Maybe remove "(default: xx)" as well.

uint64_t &FuncBodySize) {
for (const auto &I : FSamples.getBodySamples()) {
TotalBodySamples += I.second.getSamples();
FuncBodySize++;
Member

We don't need to count here, instead FuncBodySize += FSamples.getBodySamples().size() + FSamples.getCallsiteSamples().size()?

// one block. Hence, we use the number of probe as a proxy for the function's
// size.
void ProfileGeneratorBase::calculateBodySamplesAndSize(
const FunctionSamples &FSamples, uint64_t &TotalBodySamples,
Member

How different is this TotalBodySamples comparing to FSample.getTotalSamples() which also includes all inlinees?

Contributor Author

@wlei-llvm wlei-llvm May 22, 2024

There should be no difference for a probe-based profile, but it is very different for a line-number based (AutoFDO) profile: there, each body-sample entry is the max of the counts on the same line, while total_samples is the sum of all the counts on the same line, so getTotalSamples is much bigger than the sum of the entries. That said, if we adjust to use the binary instruction size, it should also be fine to use getTotalSamples; the density for AutoFDO would then be getTotalSamples / binary_size.

If so, we need to diverge the function-size calculation:
For pseudo-probe, use the approach in this version.
For line-number, use the binary size (DWARF end_address - start_address).

Contributor Author

It may still not be accurate for AutoFDO to use end_address - start_address: the address range is not equal to the number of instructions, since the instruction size is not always 1 byte.

Member

> total_samples is a sum of all the counts in the same line

For AutoFDO, based on ProfileGeneratorBase::updateTotalSamples, isn't total sample the same as the sum of all body samples + callsite (inlinee) samples?

Contributor Author

> total_samples is a sum of all the counts in the same line
>
> For AutoFDO, based on ProfileGeneratorBase::updateTotalSamples, isn't total sample the same as the sum of all body samples + callsite (inlinee) samples?

Yes, using updateTotalSamples is equal to the sum of all body samples + callsite (inlinee) samples. But --update-total-samples is off by default (I remember there was a regression with it on).

Member

Even though --update-total-samples is off, how big of a difference do we expect between the total samples and the way you compute it? I actually don't remember what the source of the discrepancy is, and still don't understand why we had a regression. If they are close enough, it shouldn't matter for computing density?

Contributor Author

> Even though --update-total-samples is off, how big of a difference do we expect between the total samples and the way you compute it? If they are close enough, it shouldn't matter for computing density?

It could be 2x~10x bigger; the multiplier is the number of physical instructions one source instruction (debug line) emits. One source code line usually generates many physical instructions, so the difference can be big.

There is an example in our test case: https://github.com/llvm/llvm-project/blob/main/llvm/test/tools/llvm-profgen/inline-noprobe.test#L17-L45

See the CHECK: main:2609:0 vs CHECK-UPDATE-TOTAL-SAMPLE: main:292:0; here it is ~10x.

It is indeed hard to say why it causes a regression; it is just the observation from the data.
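To make the max-vs-sum discrepancy concrete, suppose three machine instructions map to one debug line with counts 10, 9, and 8: the line-based body sample keeps the max (10), while summing all instructions gives 27, a 2.7x difference. A sketch with illustrative numbers and made-up function names (not the llvm-profgen code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Line-based body sample: the max count among the machine instructions
// that map to the same debug line.
uint64_t lineBodySample(const std::vector<uint64_t> &InsnCounts) {
  return *std::max_element(InsnCounts.begin(), InsnCounts.end());
}

// total_samples contribution of the line: the sum over all machine
// instructions that map to the line.
uint64_t lineTotalSamples(const std::vector<uint64_t> &InsnCounts) {
  uint64_t Sum = 0;
  for (uint64_t C : InsnCounts)
    Sum += C;
  return Sum;
}
```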

Member

Ok, can you add a comment explaining why we need to accumulate the samples manually instead of using getTotalSamples? including the fact that --update-total-samples is off and the reasoning, difference between AutoFDO and CSSPGO. A comment would be very useful because after a while someone may look at this code be very tempted to refactor / simplify it with getTotalSamples.

Contributor Author

Comments added

-static cl::opt<double> HotFunctionDensityThreshold(
-    "hot-function-density-threshold", llvm::cl::init(1000),
+static cl::opt<double> ProfileDensityThreshold(
+    "profile-density-threshold", llvm::cl::init(20),
Member

As discussed, perhaps make it 50 to be safer.

Contributor Author

Sounds good!

// inlinees' samples and size should be included in the calculation.
calculateBodySamplesAndSize(Callee.second, TotalBodySamples,
FuncBodySize);
TotalBodySamples += Callee.second.getHeadSamplesEstimate();
Member

Is this needed? It seems to duplicate the inlinee's body samples. Can we remove?

Contributor Author

Oh, good catch! With this, we also don't need to add FSamples.getCallsiteSamples().size() to FuncBodySize. I had based it on the old source-level density; now that the inlinees' body samples are included, we don't need the estimated callsite samples.
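The fixed accumulation can be sketched as follows, using a hypothetical simplified profile node rather than the real FunctionSamples type; the point is that recursing into inlinees already covers their body samples and sizes, so no head-sample estimate or extra callsite count is added on top.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical simplified profile node; the real FunctionSamples in
// llvm/ProfileData nests inlinee profiles under callsite locations.
struct FuncProfile {
  std::map<uint32_t, uint64_t> BodySamples; // location -> sample count
  std::vector<FuncProfile> Inlinees;        // inlinee (callsite) profiles
};

// Recurse into inlinees; their body samples and sizes are counted by the
// recursive calls, so nothing is double-counted at the callsite.
void accumulate(const FuncProfile &FP, uint64_t &TotalBodySamples,
                uint64_t &FuncBodySize) {
  for (const auto &I : FP.BodySamples)
    TotalBodySamples += I.second;
  FuncBodySize += FP.BodySamples.size();
  for (const FuncProfile &Callee : FP.Inlinees)
    accumulate(Callee, TotalBodySamples, FuncBodySize);
}
```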

Member

Ok, hope this doesn't change density values so we don't need to reevaluate the threshold.

Member (@WenleiHe left a comment)

LGTM, thanks for the improvements and for working through the feedback.

Comment on lines 769 to 770
// -update-total-samples). Hence, it's safer to re-calculate here to avoid
// such discrepancy.
Member

Worth pointing out explicitly that it makes no difference for probe-based profiles?

@wlei-llvm wlei-llvm merged commit b9d40a7 into llvm:main May 24, 2024
7 checks passed
aaupov added a commit that referenced this pull request Oct 25, 2024
Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
  pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
  to the static function size in bytes,
- profile density:
  - functions are sorted by density in decreasing order, accumulating
    their respective sample counts,
  - profile density is the smallest density covering 99% of total sample
    count.

In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in tail 1% sample
count) which is sufficient to optimize the binary well.

The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.

perf2bolt would print the warning if the density is below the threshold
and suggest to increase the sampling duration and/or frequency to reach
a given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```

Test Plan: updated pre-aggregated-perf.test

Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe

Reviewed By: WenleiHe, wlei-llvm

Pull Request: #101094
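The density definition in the commit message above can be sketched end-to-end; this is an illustrative reimplementation under the stated definition (function and type names are made up), not the actual perf2bolt or llvm-profgen code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct FuncInfo {
  uint64_t Samples; // dynamically executed samples for the function
  uint64_t Size;    // static function size in bytes
  double density() const { return double(Samples) / double(Size); }
};

// Profile density: sort functions by per-function density in decreasing
// order, accumulate their sample counts, and return the smallest density
// needed to cover CutoffPercent of the total sample count.
double profileDensity(std::vector<FuncInfo> Funcs,
                      double CutoffPercent = 99.0) {
  std::sort(Funcs.begin(), Funcs.end(),
            [](const FuncInfo &A, const FuncInfo &B) {
              return A.density() > B.density();
            });
  uint64_t Total = 0;
  for (const FuncInfo &F : Funcs)
    Total += F.Samples;
  uint64_t Acc = 0;
  double Density = 0.0;
  for (const FuncInfo &F : Funcs) {
    Density = F.density(); // last (minimum) density among processed funcs
    Acc += F.Samples;
    if (double(Acc) >= double(Total) * CutoffPercent / 100.0)
      break; // the tail (100 - CutoffPercent)% of samples is excluded
  }
  return Density;
}
```

A tool would then compare the returned density against a threshold (for example, BOLT's 60) and warn the user to collect more samples when it falls below.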
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
Labels
PGO Profile Guided Optimizations