
[llvm-profgen] Improve sample profile density #92144


Merged
merged 8 commits on May 24, 2024

Conversation

wlei-llvm
Contributor

@wlei-llvm wlei-llvm commented May 14, 2024

The profile density feature (the amount of samples in the profile relative to the program size) is used to identify insufficient-sample issues and provide hints for users to increase the sample count. A low-density profile can be inaccurate due to statistical noise, which can hurt FDO performance.

This change introduces two improvements to the current density work.

  1. The density calculation/definition is changed. Previously, the density of a profile was calculated as the minimum density over all warm functions (a function was considered warm if its total samples were within the top N percent of the profile). However, a profile with high total samples can still have a very low density, which makes the density value unstable.
  • Instead, we want to find a density number such that if a function's density is below this value, it is considered a low-density function. We consider the whole profile bad if a group of low-density functions has an accumulated sum of samples that exceeds the N percent cutoff of the total samples.

  • In the implementation, we sort the function profiles by density and iterate them in descending order, accumulating the body samples until the sum exceeds the (100% - N) percentage of the total samples; the profile density is then the last (minimum) function density among the processed functions. We introduce a flag (--profile-density-threshold) for this percentage threshold.

  2. The density is now calculated based on the final (compiler-used) profiles instead of merged context-less profiles.
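The accumulation scheme described above can be sketched as a small standalone example (hypothetical illustration code, not the actual llvm-profgen implementation; the function name and the cutoff encoding, scaled by 10000 like the patch's flag, are assumptions):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the new profile-density calculation.
// Funcs holds {function density, function body samples}; Cutoff is the
// percentage threshold scaled by 10000 (e.g. 990000 == 99%).
double calculateProfileDensity(std::vector<std::pair<double, uint64_t>> Funcs,
                               int Cutoff) {
  uint64_t TotalSamples = 0;
  for (const auto &F : Funcs)
    TotalSamples += F.second;

  // Sort by density in descending order.
  std::stable_sort(
      Funcs.begin(), Funcs.end(),
      [](const auto &A, const auto &B) { return A.first > B.first; });

  // Accumulate samples from the densest functions down; the density of the
  // last function needed to reach the cutoff is the profile density.
  double ProfileDensity = 0.0;
  uint64_t Accumulated = 0;
  for (const auto &F : Funcs) {
    Accumulated += F.second;
    ProfileDensity = F.first;
    if (Accumulated >= TotalSamples * static_cast<double>(Cutoff) / 1000000)
      break;
  }
  return ProfileDensity;
}
```

With a 99% cutoff, the low-density tail that is ignored holds at most 1% of the samples, so the reported density reflects the functions that matter for performance.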

@wlei-llvm wlei-llvm requested review from WenleiHe and MatzeB May 15, 2024 03:28
@wlei-llvm wlei-llvm marked this pull request as ready for review May 15, 2024 03:31
@llvmbot llvmbot added the PGO Profile Guided Optimizations label May 15, 2024
@llvmbot
Member

llvmbot commented May 15, 2024

@llvm/pr-subscribers-pgo

Author: Lei Wang (wlei-llvm)

Changes

The profile density feature (the amount of samples in the profile relative to the program size) is used to identify insufficient-sample issues and provide hints for users to increase the sample count. A low-density profile can be inaccurate due to statistical noise, which can hurt FDO performance.

This change introduces two improvements to the current density work.

  1. The density calculation/definition is changed. Previously, the density of a profile was calculated as the minimum density over all functions considered warm (a function was considered warm if its total samples were within the top N percent of the profile). However, a profile with high total samples can still have a very low density, which makes the density number not as representative.
  • Instead, we want to find a density number such that if a function's density is below this value, it is considered a low-density function. We consider the whole profile bad if a group of low-density functions has an accumulated sum of total samples that exceeds N percent of the total samples.

  • In the implementation, we sort the function profiles by density and iterate them in descending order, accumulating the total samples until the sum exceeds the (100% - N) percentage of the total samples of the whole profile; the profile density is then the last (minimum) function density among the processed functions. Equivalently, the profile density is bad if the accumulated samples of all the low-density profiles exceed the N percent. We introduce a flag (--profile-density-hot-func-cutoff) for this percentage threshold.

  2. The density is now calculated based on context-sensitive profiles instead of merged profiles. Profiles merged from different contexts can hide low-sample issues. For the implementation, it leverages the sample context tracker, which can fetch the CS profile before the pre-inliner.

Full diff: https://github.com/llvm/llvm-project/pull/92144.diff

3 Files Affected:

  • (modified) llvm/test/tools/llvm-profgen/profile-density.test (+1-1)
  • (modified) llvm/tools/llvm-profgen/ProfileGenerator.cpp (+84-6)
  • (modified) llvm/tools/llvm-profgen/ProfileGenerator.h (+4-1)
diff --git a/llvm/test/tools/llvm-profgen/profile-density.test b/llvm/test/tools/llvm-profgen/profile-density.test
index 0eb83838d16e7..f22c6f04914aa 100644
--- a/llvm/test/tools/llvm-profgen/profile-density.test
+++ b/llvm/test/tools/llvm-profgen/profile-density.test
@@ -7,7 +7,7 @@
 ;CHECK-DENSITY: Sample PGO is estimated to optimize better with 3.1x more samples. Please consider increasing sampling rate or profiling for longer duration to get more samples.
 ;CHECK-DENSITY: Minimum profile density for hot functions with top 99.00% total samples: 3.2
 
-;CHECK-DENSITY-CS: Minimum profile density for hot functions with top 99.00% total samples: 128.3
+;CHECK-DENSITY-CS: Minimum profile density for hot functions with top 99.00% total samples: 619.0
 
 ; original code:
 ; clang -O3 -g -fno-optimize-sibling-calls -fdebug-info-for-profiling qsort.c -o a.out
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.cpp b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
index 5aa44108f9660..ecbc6763e56f1 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.cpp
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
@@ -83,6 +83,10 @@ static cl::opt<double> HotFunctionDensityThreshold(
 static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
                                  llvm::cl::desc("show profile density details"),
                                  llvm::cl::Optional);
+static cl::opt<int> ProfileDensityHotFuncCutOff(
+    "profile-density-hot-func-cutoff", llvm::cl::init(990000),
+    llvm::cl::desc("Total sample cutoff for hot functions used to calculate "
+                   "the profile density."));
 
 static cl::opt<bool> UpdateTotalSamples(
     "update-total-samples", llvm::cl::init(false),
@@ -177,7 +181,8 @@ void ProfileGeneratorBase::write() {
   write(std::move(WriterOrErr.get()), ProfileMap);
 }
 
-void ProfileGeneratorBase::showDensitySuggestion(double Density) {
+void ProfileGeneratorBase::showDensitySuggestion(double Density,
+                                                 int DensityCutoffHot) {
   if (Density == 0.0)
     WithColor::warning() << "The --profile-summary-cutoff-hot option may be "
                             "set too low. Please check your command.\n";
@@ -190,9 +195,7 @@ void ProfileGeneratorBase::showDensitySuggestion(double Density) {
 
   if (ShowDensity)
     outs() << "Minimum profile density for hot functions with top "
-           << format("%.2f",
-                     static_cast<double>(ProfileSummaryCutoffHot.getValue()) /
-                         10000)
+           << format("%.2f", static_cast<double>(DensityCutoffHot) / 10000)
            << "% total samples: " << format("%.1f", Density) << "\n";
 }
 
@@ -771,7 +774,7 @@ void ProfileGenerator::populateBoundarySamplesForAllFunctions(
 void ProfileGeneratorBase::calculateAndShowDensity(
     const SampleProfileMap &Profiles) {
   double Density = calculateDensity(Profiles, HotCountThreshold);
-  showDensitySuggestion(Density);
+  showDensitySuggestion(Density, ProfileSummaryCutoffHot);
 }
 
 FunctionSamples *
@@ -1032,6 +1035,78 @@ void CSProfileGenerator::convertToProfileMap() {
   IsProfileValidOnTrie = false;
 }
 
+void CSProfileGenerator::calculateAndShowDensity(
+    SampleContextTracker &CTracker) {
+  double Density = calculateDensity(CTracker);
+  showDensitySuggestion(Density, ProfileDensityHotFuncCutOff);
+}
+
+// Calculate profile density:
+// Sort the functions by density in descending order and iterate them; once
+// their accumulated total samples exceed the percentage threshold of the
+// total profile samples, the profile density is the last (minimum) function
+// density among the processed functions. This means that if the profile
+// density is good, all the functions significant to performance have good
+// density; or, in other words, if the profile density is bad, the
+// accumulated samples of all the low-density profiles exceed
+// (100% - percentage_threshold).
+// The percentage threshold (--profile-density-hot-func-cutoff) is
+// configurable depending on how much regression the system wants to
+// tolerate.
+double CSProfileGenerator::calculateDensity(SampleContextTracker &CTracker) {
+  double ProfileDensity = 0.0;
+
+  uint64_t TotalProfileSamples = 0;
+  // A list of the function profile density and total samples.
+  std::vector<std::pair<double, uint64_t>> DensityList;
+  for (const auto *Node : CTracker) {
+    const auto *FSamples = Node->getFunctionSamples();
+    if (!FSamples)
+      continue;
+
+    uint64_t TotalBodySamples = 0;
+    uint64_t FuncBodySize = 0;
+    for (const auto &I : FSamples->getBodySamples()) {
+      TotalBodySamples += I.second.getSamples();
+      FuncBodySize++;
+    }
+    // The whole function could be inlined and optimized out, use the callsite
+    // head samples instead to estimate the body count.
+    if (FuncBodySize == 0) {
+      for (const auto &CallsiteSamples : FSamples->getCallsiteSamples()) {
+        FuncBodySize++;
+        for (const auto &Callee : CallsiteSamples.second)
+          TotalBodySamples += Callee.second.getHeadSamplesEstimate();
+      }
+    }
+
+    if (FuncBodySize == 0)
+      continue;
+
+    double FuncDensity = static_cast<double>(TotalBodySamples) / FuncBodySize;
+    TotalProfileSamples += TotalBodySamples;
+    DensityList.emplace_back(FuncDensity, TotalBodySamples);
+  }
+
+  // Sorted by the density in descending order.
+  llvm::stable_sort(DensityList, [&](const std::pair<double, uint64_t> &A,
+                                     const std::pair<double, uint64_t> &B) {
+    if (A.first != B.first)
+      return A.first > B.first;
+    return A.second < B.second;
+  });
+
+  uint64_t AccumulatedSamples = 0;
+  for (const auto &P : DensityList) {
+    AccumulatedSamples += P.second;
+    ProfileDensity = P.first;
+    if (AccumulatedSamples >=
+        TotalProfileSamples * static_cast<float>(ProfileDensityHotFuncCutOff) /
+            1000000)
+      break;
+  }
+
+  return ProfileDensity;
+}
+
 void CSProfileGenerator::postProcessProfiles() {
   // Compute hot/cold threshold based on profile. This will be used for cold
   // context profile merging/trimming.
@@ -1041,6 +1116,7 @@ void CSProfileGenerator::postProcessProfiles() {
   // inline decisions.
   if (EnableCSPreInliner) {
     ContextTracker.populateFuncToCtxtMap();
+    calculateAndShowDensity(ContextTracker);
     CSPreInliner(ContextTracker, *Binary, Summary.get()).run();
     // Turn off the profile merger by default unless it is explicitly enabled.
     if (!CSProfMergeColdContext.getNumOccurrences())
@@ -1061,7 +1137,9 @@ void CSProfileGenerator::postProcessProfiles() {
   sampleprof::SampleProfileMap ContextLessProfiles;
   ProfileConverter::flattenProfile(ProfileMap, ContextLessProfiles, true);
 
-  calculateAndShowDensity(ContextLessProfiles);
+  if (!EnableCSPreInliner)
+    ProfileGeneratorBase::calculateAndShowDensity(ContextLessProfiles);
+
   if (GenCSNestedProfile) {
     ProfileConverter CSConverter(ProfileMap);
     CSConverter.convertCSProfiles();
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.h b/llvm/tools/llvm-profgen/ProfileGenerator.h
index d258fb78bfb11..cf451f9d1a1a4 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.h
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.h
@@ -121,7 +121,7 @@ class ProfileGeneratorBase {
   double calculateDensity(const SampleProfileMap &Profiles,
                           uint64_t HotCntThreshold);
 
-  void showDensitySuggestion(double Density);
+  void showDensitySuggestion(double Density, int DensityCutoffHot);
 
   void collectProfiledFunctions();
 
@@ -363,6 +363,9 @@ class CSProfileGenerator : public ProfileGeneratorBase {
 
   void computeSummaryAndThreshold();
 
+  void calculateAndShowDensity(SampleContextTracker &CTracker);
+  double calculateDensity(SampleContextTracker &CTracker);
+
   bool collectFunctionsFromLLVMProfile(
       std::unordered_set<const BinaryFunction *> &ProfiledFunctions) override;
 

@WenleiHe
Member

The density is now calculated based on context-sensitive profiles instead of merged profiles. Profiles merged from different contexts could hide low sample issues. For implementation, it leverages the sample context tracker which can fetch the CS profile before the pre-inliner.

As we discussed, since optimization is based on the pre-inlined profile, what matters should be the profile density after pre-inline? If the pre-inlined profile has good density, that means optimizations should be able to make reasonably good and stable decisions based on the merged profile, even though the original full-context profile could be of low density. From that POV, it makes sense to use the pre-inlined profile to compute density?

@WenleiHe
Member

However, there is a problem that a high total sample profile can have a very low density, which makes the density number not as representative.

How big is the problem? Like do you see this old heuristic preventing us from getting a consistent density threshold as proxy for performance, because density is less stable?

@@ -83,6 +83,10 @@ static cl::opt<double> HotFunctionDensityThreshold(
static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
llvm::cl::desc("show profile density details"),
llvm::cl::Optional);
static cl::opt<int> ProfileDensityHotFuncCutOff(
"profile-density-hot-func-cutoff", llvm::cl::init(990000),
Member

Do we need a new switch? Should we use ProfileSummaryCutoffHot instead given they are the same value?

Contributor Author

This is just in case people only want to tune the profile density threshold (e.g. to reduce noise) but not ProfileSummaryCutoffHot, which affects profile generation (the pre-inliner).

@@ -190,9 +195,7 @@ void ProfileGeneratorBase::showDensitySuggestion(double Density) {

if (ShowDensity)
outs() << "Minimum profile density for hot functions with top "
Member

With the new heuristic, this message is no longer accurate? It's no longer hot functions with top x% total samples.

Contributor Author

Updated the message.

@@ -121,7 +121,7 @@ class ProfileGeneratorBase {
double calculateDensity(const SampleProfileMap &Profiles,
Member

If we change how density is calculated, we should change it everywhere to be consistent. Why leaving the old heuristic there for non-CS profile?

Contributor Author

Yeah, I had planned that for another diff; now it's updated in the new version.

uint64_t FuncBodySize = 0;
for (const auto &I : FSamples->getBodySamples()) {
TotalBodySamples += I.second.getSamples();
FuncBodySize++;
Member

If you look at ProfileGeneratorBase::calculateDensity, the size used is the actual binary function size, I think that is more accurate / meaningful for density.

IIRC for BodySamples, we omit 0 entries, so counting number of body samples to compute size seems wrong?

Contributor Author

yeah, I noticed that. Assuming the ideal density is
density = total instruction samples / binary size (binary size being the number of instructions).
The problem is that right now we use probes, and we don't have the instruction count for each probe, so total_samples is actually not the sum of all instruction samples; using the binary function size is therefore probably not accurate now.
Instead, using the probe seems more reasonable, because the samples come from the LBR, where the minimum unit is a range, not an address.
For example, say a function has only one BB/probe, but this BB has 10000 instructions and no jmp at all; if we have one LBR hit covering all the instructions, the old density vs. the new one is 1 vs. 1/10000.

IIRC for BodySamples, we omit 0 entries, so counting number of body samples to compute size seems wrong?

We do generate the 0-count probes; we don't generate probes that are fully optimized out, but since those don't show up in the binary, it should be OK to ignore them.

Member

Ok, good point. I didn't think about the implication of total samples with probes. I was hoping we can make it really simple by just querying total samples.

It would have been better if we don't actually rely on the behavior of emitting 0 counts. Do we also emit 0 counts for AutoFDO?

Member

It's worth a comment explaining the nuances.

Contributor Author

Do we also emit 0 counts for AutoFDO?

Yes, we also emit 0 counts for never hit ranges for autoFDO.

uint64_t TotalBodySamples = 0;
uint64_t FuncBodySize = 0;
for (const auto &I : FSamples->getBodySamples()) {
TotalBodySamples += I.second.getSamples();
Member

@WenleiHe WenleiHe May 17, 2024

Why accumulating each body samples instead of just using totalSamples?

Contributor Author

Right now, as we change to use the final nested profile, totalSamples also includes the nested children profiles' totalSamples, so we want to exclude the children's samples.

Member

I guess the question is why do we want to exclude inlinees? we can count total samples including inlinees, and also size including inliness. Otherwise we don't cover such inlinees in density metrics -- what if inlinee has a low density problem?

}
// The whole function could be inlined and optimized out, use the callsite
// head samples instead to estimate the body count.
if (FuncBodySize == 0) {
Member

Why do we only include inlinees when the function body itself is fully optimized away? This seems inconsistent.

Contributor Author

Similar to the comment above, I want to use the block count first. One BB should only have one probe but could have many calls, so to be consistent we use
density = (sum of BB counts) / (num of BBs)

otherwise, it would be
density = (sum of BB counts + sum of call counts) / (num of BBs + num of calls)

That said, it's probably the same thing: the calls and the probe in the same BB would have the same count, so the division result should be close. No strong opinion; since we don't have the instruction count, we just need a new definition.

Member

Ok, they are more or less equivalent; let's just keep it simple and not exclude callsites, regardless?

@wlei-llvm
Contributor Author

The density is now calculated based on context-sensitive profiles instead of merged profiles. Profiles merged from different contexts could hide low sample issues. For implementation, it leverages the sample context tracker which can fetch the CS profile before the pre-inliner.

As we discussed, since optimization is based on pre-inlined profile, what matters should be the profile density after preinline? If pre-inlined profile has good density, that means optimizations should be able to make reasonably good and stable decisions based on merged profile, even though original full context profile could be of low density. From POV, it makes sense to use pre-inlined profile to compute density?

Good point. After thinking it over more, I agree with you: if the profile is merged after inlining, the compiler optimizations use the merged profile and don't use the context-sensitive profile. I will change it to be based on the pre-inlined profile.

Though I still want to emphasize one thing: the existing code uses the completely context-less profile for this, which is also not what we want. IIUC, what we want is just the final profile.

@WenleiHe
Member

what we want is just the final profile.

Is that difficult to change?

@wlei-llvm
Contributor Author

what we want is just the final profile.

Is that difficult to change?

Ah, that should be even easier, just remove the

  // Merge function samples of CS profile to calculate profile density.
  sampleprof::SampleProfileMap ContextLessProfiles;
  for (const auto &I : ProfileMap) {
    ContextLessProfiles[I.second.getName()].merge(I.second);
  }
  calculateAndShowDensity(ContextLessProfiles);

Then we don't need the post-pre-inliner one, only the last one.

@wlei-llvm
Contributor Author

However, there is a problem that a high total sample profile can have a very low density, which makes the density number not as representative.

How big is the problem? Like do you see this old heuristic preventing us from getting a consistent density threshold as proxy for performance, because density is less stable?

I didn't dive very deep into the old heuristic; I'm just guessing from the results. I saw many cases showing a super big number; for example, this came from a normal profile:

Sample PGO is estimated to optimize better with 3633.9x more samples.

That said, as I mentioned in the comments above, there seem to be a few bugs in the previous work (another "bug" is that it uses total_samples to decide the hot functions, while the cutoff number is a block-level count), so maybe it's caused by another issue. Anyway, it seems it needs a big overhaul.
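For reference, the multiplier in that suggestion message appears to be the density threshold divided by the measured density (an assumption inferred from the test expectations in the diff, where a threshold of 10 and a density of 3.2 yield the "3.1x more samples" message; this is a sketch, not the verbatim llvm-profgen code):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch: how many times more samples would be needed for the
// profile to reach the density threshold, assuming density scales linearly
// with the sample count.
double suggestedSampleMultiplier(double DensityThreshold, double Density) {
  return DensityThreshold / Density;
}
```

Under this reading, a 3633.9x suggestion simply means the measured density was thousands of times below the threshold, which is why unstable density values produce such extreme messages.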

@WenleiHe
Member

However, there is a problem that a high total sample profile can have a very low density, which makes the density number not as representative.

How big is the problem? Like do you see this old heuristic preventing us from getting a consistent density threshold as proxy for performance, because density is less stable?

I didn't dive very deep into the old heuristic; I'm just guessing from the results. I saw many cases showing a super big number; for example, this came from a normal profile:

Sample PGO is estimated to optimize better with 3633.9x more samples.

That said, as I mentioned in the comments above, there seem to be a few bugs in the previous work (another "bug" is that it uses total_samples to decide the hot functions, while the cutoff number is a block-level count), so maybe it's caused by another issue. Anyway, it seems it needs a big overhaul.

If it's sorted by density now, as opposed to function hotness, the density cutoff value can be more stable, which can avoid suggestions with huge numbers. But I don't see a bug in the old density; it's just a heuristic with a different philosophy.

"hot-function-density-threshold", llvm::cl::init(1000),
llvm::cl::desc(
"specify density threshold for hot functions (default: 1000)"),
"hot-function-density-threshold", llvm::cl::init(20),
Member

The switch name is also no longer accurate, since we are no longer selecting the top N functions. Perhaps profile-density-threshold is good enough.

Member

Can you share some data to show that 20 is a good threshold?

Comment on lines 809 to 812
if (AccumulatedSamples >= TotalProfileSamples *
static_cast<float>(ProfileDensityCutOffHot) /
1000000)
break;
Member

nit: fold the exit condition into a while loop condition?

for (const auto &CallsiteSamples : FSamples.getCallsiteSamples()) {
FuncBodySize++;
for (const auto &Callee : CallsiteSamples.second) {
calculateDensity(Callee.second, DensityList, TotalProfileSamples);
Member

I was thinking about density as a binary level metric as opposed to source level, in which case it maps better with inlinee included. Just like when we dump hot function list, it's binary level function, instead of source level function.

Any particular reason to do this on source level with inlining excluded?

Contributor Author

Let me try to clarify this first; I will address the other comments later.

I don't have a particular reason to exclude the inlinees. The story is that before, we computed density on a full CS profile (or a merged context-less profile), so there were actually no inlinees in it (see the original code: https://github.com/llvm/llvm-project/blob/main/llvm/tools/llvm-profgen/ProfileGenerator.cpp#L1060-L1064; this is before the nested profile conversion). So I just thought we needed to base it on an un-merged profile and didn't consider the density scope.

I guess the question is why do we want to exclude inlinees? we can count total samples including inlinees, and also size including inliness. Otherwise we don't cover such inlinees in density metrics -- what if inlinee has a low density problem?

Another thing to clarify: this version does not drop the inlinee profiles; it computes each as an independent function instead of as part of the caller function. It should be able to catch a low inlinee profile, e.g.

main:101:100
  1:100 
  2:foo:1
   1:1
foo:100:1
 1:100   

In this version, it will compute three function densities: <main:100>, <foo:1>, <foo:100>; we still have the low-density inlinee <foo:1>.
IIUC, your suggestion is to compute something like <main: 50 (101/2)>, <foo: 100>.

Yeah, if that makes sense. My current data is based on the current version, which seems to work OK.

I think I can implement a binary-level density and compare the two; we can take a look together and discuss it.
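The three densities in the toy profile above can be checked with a tiny sketch (hypothetical helper, using the working definition from this thread: density = sum of body samples / number of body-sample entries):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-function density as discussed in this thread: the sum of
// the body-sample counts divided by the number of body-sample entries.
double funcDensity(const std::vector<uint64_t> &BodySamples) {
  if (BodySamples.empty())
    return 0.0;
  uint64_t Total = 0;
  for (uint64_t S : BodySamples)
    Total += S;
  return static_cast<double>(Total) / BodySamples.size();
}
```

For the toy profile this gives 100 for main (one entry with 100 samples), 1 for the inlined foo context, and 100 for the standalone foo, so the low-density inlinee remains visible when contexts are computed independently.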

Member

Ok, the recursion here covers inlinee samples. Good to know it's not dropped.

Though if we decided to compute density after pre-inline, it might make more sense to include inlinees directly? But yes, data will tell us which one is better. If they are similar, then we can aim for implementation simplicity.

@wlei-llvm
Contributor Author

Though if we decided to compute density after pre-inline, it might make more sense to include inlinees directly? But yes, data will tell us which one is better. If they are similar, then we can aim for implementation simplicity.

Yeah, makes sense, updated to use the binary-level density.

Can you share some date to show that 20 is a good threshold?

Shared more detailed data with you internally, just summarized here for reference.

Service1:

| total_samples | Perf  | Density |
| ------------- | ----- | ------- |
| 2.5M          | +0.35 | 30.29   |
| 1.2M          | +0.18 | 18.79   |
| 280K          | -1.22 | 7.97    |
| 200K          | -1.16 | 4.50    |

Service2:

| total_samples | Perf  | Density |
| ------------- | ----- | ------- |
| 4.2M          | +0.22 | 23.80   |
| 1.2M          | +0.20 | 19.37   |
| 500K          | -0.15 | 8.72    |
| 200K          | -0.97 | 2.71    |

Service3:

| total_samples | Perf | Density |
| ------------- | ---- | ------- |
| 700K          | +0.4 | 48.02   |
| 500K          | -0.4 | 37.95   |
| 100K          | -3   | 6.00    |
| 50K           | -10  | 7.02    |

I'm also working on other services; so far it seems the regression starts when the density drops below 10, so that's why I set the threshold to 20 (we can set it to 50 if we want to be more conservative).

@WenleiHe
Member

2.5M | +0.35 | 30.29
700K | +0.4 | 48.02

Thanks for the data. It looks like 50 is safer?

@@ -768,9 +748,89 @@ void ProfileGenerator::populateBoundarySamplesForAllFunctions(
}
}

// Note taht ideally the size should be the number of function's instruction.
Member

typo: taht -> that

@@ -1,13 +1,16 @@
; RUN: llvm-profgen --format=text --unsymbolized-profile=%S/Inputs/profile-density.raw.prof --binary=%S/Inputs/inline-noprobe2.perfbin --output=%t1 --use-offset=0 --show-density -hot-function-density-threshold=10 --trim-cold-profile=0 &> %t2
; RUN: llvm-profgen --format=text --unsymbolized-profile=%S/Inputs/profile-density.raw.prof --binary=%S/Inputs/inline-noprobe2.perfbin --output=%t1 --use-offset=0 --show-density -profile-density-threshold=10 --trim-cold-profile=0 &> %t2
Member

was there no test for suggestion? should we add one?

Contributor Author

Added the suggestion test for the CS profile (there was already one for the non-CS profile).

llvm::cl::desc(
"specify density threshold for hot functions (default: 1000)"),
"Set the profile density threshold(default: 20), which is used to "
Member

nit: remove the "Set" verb, which is implied for any flag; then describe what "profile-density-threshold" is. Maybe remove "(default: xx)" as well.

uint64_t &FuncBodySize) {
for (const auto &I : FSamples.getBodySamples()) {
TotalBodySamples += I.second.getSamples();
FuncBodySize++;
Member

We don't need to count here, instead FuncBodySize += FSamples.getBodySamples().size() + FSamples.getCallsiteSamples().size()?

// one block. Hence, we use the number of probe as a proxy for the function's
// size.
void ProfileGeneratorBase::calculateBodySamplesAndSize(
const FunctionSamples &FSamples, uint64_t &TotalBodySamples,
Member

How different is this TotalBodySamples comparing to FSample.getTotalSamples() which also includes all inlinees?

Contributor Author

@wlei-llvm wlei-llvm May 22, 2024

There should be no difference for a probe-based profile, but it is very different for a line-number based (AutoFDO) profile: there, each body-sample entry is the max of the counts on the same line, while total_samples is the sum of all the counts on the same line, so getTotalSamples is much bigger than the sum of the entries. That said, if we adjust to use the binary instruction size, it should also be fine to use getTotalSamples; the density for AutoFDO would then be getTotalSamples / binary_size.

If so, we need to diverge the function-size calculation:
For pseudo-probe, use the approach in this version.
For line-number, use the binary size (DWARF end_address - start_address).

Contributor Author

It may still not be accurate for AutoFDO to use end_address - start_address: the address range is not equal to the number of instructions, since the instruction size is not always 1 byte.

Member

> total_samples is a sum of all the counts in the same line

For AutoFDO, based on ProfileGeneratorBase::updateTotalSamples, isn't total sample the same as the sum of all body samples + callsite (inlinee) samples?

Contributor Author

> total_samples is a sum of all the counts in the same line
>
> For AutoFDO, based on ProfileGeneratorBase::updateTotalSamples, isn't total sample the same as the sum of all body samples + callsite (inlinee) samples?

Yes, using updateTotalSamples is equal to the sum of all body samples + callsite (inlinee) samples. But --update-total-samples is off by default (I remember there was a regression with it on).

Member

Even though --update-total-samples is off, how big of a difference do we expect between the total samples and the way you compute it? I actually don't remember what the source of the discrepancy is, and still don't understand why we had a regression. If they are close enough, it shouldn't matter for computing density?

Contributor Author

> Even though --update-total-samples is off, how big of a difference do we expect between the total samples and the way you compute it? If they are close enough, it shouldn't matter for computing density?

It could be 2x~10x bigger; the multiplier is the number of physical instructions one source instruction (debug line) emits. One source code line usually generates many physical instructions, so the difference can be big.

There is an example in our test case: https://github.com/llvm/llvm-project/blob/main/llvm/test/tools/llvm-profgen/inline-noprobe.test#L17-L45

See the CHECK: main:2609:0 vs CHECK-UPDATE-TOTAL-SAMPLE: main:292:0; here it is ~10x.

It is indeed hard to say why it causes a regression; it is just the observation from the data.
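To make the max-vs-sum discrepancy concrete, suppose three machine instructions map to one debug line with counts 10, 9, and 8: the line-based body sample keeps the max (10), while summing all instructions gives 27, a 2.7x difference. A sketch with illustrative numbers and made-up function names (not the llvm-profgen code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Line-based body sample: the max count among the machine instructions
// that map to the same debug line.
uint64_t lineBodySample(const std::vector<uint64_t> &InsnCounts) {
  return *std::max_element(InsnCounts.begin(), InsnCounts.end());
}

// total_samples contribution of the line: the sum over all machine
// instructions that map to the line.
uint64_t lineTotalSamples(const std::vector<uint64_t> &InsnCounts) {
  uint64_t Sum = 0;
  for (uint64_t C : InsnCounts)
    Sum += C;
  return Sum;
}
```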

Member

Ok, can you add a comment explaining why we need to accumulate the samples manually instead of using getTotalSamples? including the fact that --update-total-samples is off and the reasoning, difference between AutoFDO and CSSPGO. A comment would be very useful because after a while someone may look at this code be very tempted to refactor / simplify it with getTotalSamples.

Contributor Author

Comments added

-static cl::opt<double> HotFunctionDensityThreshold(
-    "hot-function-density-threshold", llvm::cl::init(1000),
+static cl::opt<double> ProfileDensityThreshold(
+    "profile-density-threshold", llvm::cl::init(20),
Member

As discussed, perhaps make it 50 to be safer.

Contributor Author

Sounds good!

// inlinees' samples and size should be included in the calculation.
calculateBodySamplesAndSize(Callee.second, TotalBodySamples,
FuncBodySize);
TotalBodySamples += Callee.second.getHeadSamplesEstimate();
Member

Is this needed? It seems to duplicate the inlinee's body samples. Can we remove?

Contributor Author

Oh, good catch! With this, we also don't need to add FSamples.getCallsiteSamples().size() to FuncBodySize. I had based it on the old source-level density; now that the inlinees' body samples are included, we don't need the estimated callsite samples.
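The fixed accumulation can be sketched as follows, using a hypothetical simplified profile node rather than the real FunctionSamples type; the point is that recursing into inlinees already covers their body samples and sizes, so no head-sample estimate or extra callsite count is added on top.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical simplified profile node; the real FunctionSamples in
// llvm/ProfileData nests inlinee profiles under callsite locations.
struct FuncProfile {
  std::map<uint32_t, uint64_t> BodySamples; // location -> sample count
  std::vector<FuncProfile> Inlinees;        // inlinee (callsite) profiles
};

// Recurse into inlinees; their body samples and sizes are counted by the
// recursive calls, so nothing is double-counted at the callsite.
void accumulate(const FuncProfile &FP, uint64_t &TotalBodySamples,
                uint64_t &FuncBodySize) {
  for (const auto &I : FP.BodySamples)
    TotalBodySamples += I.second;
  FuncBodySize += FP.BodySamples.size();
  for (const FuncProfile &Callee : FP.Inlinees)
    accumulate(Callee, TotalBodySamples, FuncBodySize);
}
```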

Member

Ok, hope this doesn't change density values so we don't need to reevaluate the threshold.

Member (@WenleiHe left a comment)

LGTM, thanks for the improvements and for working through the feedback.

Comment on lines 769 to 770
// -update-total-samples). Hence, it's safer to re-calculate here to avoid
// such discrepancy.
Member

Worth pointing out explicitly that it makes no difference for probe-based profiles?

@wlei-llvm wlei-llvm merged commit b9d40a7 into llvm:main May 24, 2024
7 checks passed
aaupov added a commit that referenced this pull request Oct 25, 2024
Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
  pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
  to the static function size in bytes,
- profile density:
  - functions are sorted by density in decreasing order, accumulating
    their respective sample counts,
  - profile density is the smallest density covering 99% of total sample
    count.

In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in tail 1% sample
count) which is sufficient to optimize the binary well.

The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.

perf2bolt would print the warning if the density is below the threshold
and suggest to increase the sampling duration and/or frequency to reach
a given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```

Test Plan: updated pre-aggregated-perf.test

Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe

Reviewed By: WenleiHe, wlei-llvm

Pull Request: #101094
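The density definition in the commit message above can be sketched end-to-end; this is an illustrative reimplementation under the stated definition (function and type names are made up), not the actual perf2bolt or llvm-profgen code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct FuncInfo {
  uint64_t Samples; // dynamically executed samples for the function
  uint64_t Size;    // static function size in bytes
  double density() const { return double(Samples) / double(Size); }
};

// Profile density: sort functions by per-function density in decreasing
// order, accumulate their sample counts, and return the smallest density
// needed to cover CutoffPercent of the total sample count.
double profileDensity(std::vector<FuncInfo> Funcs,
                      double CutoffPercent = 99.0) {
  std::sort(Funcs.begin(), Funcs.end(),
            [](const FuncInfo &A, const FuncInfo &B) {
              return A.density() > B.density();
            });
  uint64_t Total = 0;
  for (const FuncInfo &F : Funcs)
    Total += F.Samples;
  uint64_t Acc = 0;
  double Density = 0.0;
  for (const FuncInfo &F : Funcs) {
    Density = F.density(); // last (minimum) density among processed funcs
    Acc += F.Samples;
    if (double(Acc) >= double(Total) * CutoffPercent / 100.0)
      break; // the tail (100 - CutoffPercent)% of samples is excluded
  }
  return Density;
}
```

A tool would then compare the returned density against a threshold (for example, BOLT's 60) and warn the user to collect more samples when it falls below.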
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
Labels
PGO Profile Guided Optimizations