[llvm-profgen] Improve sample profile density #92144
@llvm/pr-subscribers-pgo

Author: Lei Wang (wlei-llvm)

Changes

The profile density feature (the amount of samples in the profile relative to the program size) is used to identify insufficient-sample issues and provide hints for users to increase the sample count. A low-density profile can be inaccurate due to statistical noise, which can hurt FDO performance. This change introduces two improvements to the current density work.
Full diff: https://github.com/llvm/llvm-project/pull/92144.diff

3 Files Affected:
diff --git a/llvm/test/tools/llvm-profgen/profile-density.test b/llvm/test/tools/llvm-profgen/profile-density.test
index 0eb83838d16e7..f22c6f04914aa 100644
--- a/llvm/test/tools/llvm-profgen/profile-density.test
+++ b/llvm/test/tools/llvm-profgen/profile-density.test
@@ -7,7 +7,7 @@
;CHECK-DENSITY: Sample PGO is estimated to optimize better with 3.1x more samples. Please consider increasing sampling rate or profiling for longer duration to get more samples.
;CHECK-DENSITY: Minimum profile density for hot functions with top 99.00% total samples: 3.2
-;CHECK-DENSITY-CS: Minimum profile density for hot functions with top 99.00% total samples: 128.3
+;CHECK-DENSITY-CS: Minimum profile density for hot functions with top 99.00% total samples: 619.0
; original code:
; clang -O3 -g -fno-optimize-sibling-calls -fdebug-info-for-profiling qsort.c -o a.out
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.cpp b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
index 5aa44108f9660..ecbc6763e56f1 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.cpp
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.cpp
@@ -83,6 +83,10 @@ static cl::opt<double> HotFunctionDensityThreshold(
static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
llvm::cl::desc("show profile density details"),
llvm::cl::Optional);
+static cl::opt<int> ProfileDensityHotFuncCutOff(
+ "profile-density-hot-func-cutoff", llvm::cl::init(990000),
+ llvm::cl::desc("Total sample cutoff for hot functions used to calculate "
+ "the profile density."));
static cl::opt<bool> UpdateTotalSamples(
"update-total-samples", llvm::cl::init(false),
@@ -177,7 +181,8 @@ void ProfileGeneratorBase::write() {
write(std::move(WriterOrErr.get()), ProfileMap);
}
-void ProfileGeneratorBase::showDensitySuggestion(double Density) {
+void ProfileGeneratorBase::showDensitySuggestion(double Density,
+ int DensityCutoffHot) {
if (Density == 0.0)
WithColor::warning() << "The --profile-summary-cutoff-hot option may be "
"set too low. Please check your command.\n";
@@ -190,9 +195,7 @@ void ProfileGeneratorBase::showDensitySuggestion(double Density) {
if (ShowDensity)
outs() << "Minimum profile density for hot functions with top "
- << format("%.2f",
- static_cast<double>(ProfileSummaryCutoffHot.getValue()) /
- 10000)
+ << format("%.2f", static_cast<double>(DensityCutoffHot) / 10000)
<< "% total samples: " << format("%.1f", Density) << "\n";
}
@@ -771,7 +774,7 @@ void ProfileGenerator::populateBoundarySamplesForAllFunctions(
void ProfileGeneratorBase::calculateAndShowDensity(
const SampleProfileMap &Profiles) {
double Density = calculateDensity(Profiles, HotCountThreshold);
- showDensitySuggestion(Density);
+ showDensitySuggestion(Density, ProfileSummaryCutoffHot);
}
FunctionSamples *
@@ -1032,6 +1035,78 @@ void CSProfileGenerator::convertToProfileMap() {
IsProfileValidOnTrie = false;
}
+void CSProfileGenerator::calculateAndShowDensity(
+ SampleContextTracker &CTracker) {
+ double Density = calculateDensity(CTracker);
+ showDensitySuggestion(Density, ProfileDensityHotFuncCutOff);
+}
+
+// Calculate profile density:
+// Sort the function densities in descending order and iterate over them,
+// accumulating each function's total samples. Once the accumulated samples
+// exceed the percentage threshold of the total profile samples, the profile
+// density is the last (minimum) function density processed. This means that
+// if the profile density is good, all functions significant to performance
+// have good density; conversely, if the profile density is bad, the
+// accumulated samples of all low-density functions exceed
+// (100% - percentage_threshold).
+// The percentage threshold (--profile-density-hot-func-cutoff) is
+// configurable depending on how much regression the user wants to tolerate.
+double CSProfileGenerator::calculateDensity(SampleContextTracker &CTracker) {
+ double ProfileDensity = 0.0;
+
+ uint64_t TotalProfileSamples = 0;
+ // A list of the function profile density and total samples.
+ std::vector<std::pair<double, uint64_t>> DensityList;
+ for (const auto *Node : CTracker) {
+ const auto *FSamples = Node->getFunctionSamples();
+ if (!FSamples)
+ continue;
+
+ uint64_t TotalBodySamples = 0;
+ uint64_t FuncBodySize = 0;
+ for (const auto &I : FSamples->getBodySamples()) {
+ TotalBodySamples += I.second.getSamples();
+ FuncBodySize++;
+ }
+ // The whole function could be inlined and optimized out, use the callsite
+ // head samples instead to estimate the body count.
+ if (FuncBodySize == 0) {
+ for (const auto &CallsiteSamples : FSamples->getCallsiteSamples()) {
+ FuncBodySize++;
+ for (const auto &Callee : CallsiteSamples.second)
+ TotalBodySamples += Callee.second.getHeadSamplesEstimate();
+ }
+ }
+
+ if (FuncBodySize == 0)
+ continue;
+
+ double FuncDensity = static_cast<double>(TotalBodySamples) / FuncBodySize;
+ TotalProfileSamples += TotalBodySamples;
+ DensityList.emplace_back(FuncDensity, TotalBodySamples);
+ }
+
+ // Sorted by the density in descending order.
+ llvm::stable_sort(DensityList, [&](const std::pair<double, uint64_t> &A,
+ const std::pair<double, uint64_t> &B) {
+ if (A.first != B.first)
+ return A.first > B.first;
+ return A.second < B.second;
+ });
+
+ uint64_t AccumulatedSamples = 0;
+ for (const auto &P : DensityList) {
+ AccumulatedSamples += P.second;
+ ProfileDensity = P.first;
+ if (AccumulatedSamples >=
+ TotalProfileSamples * static_cast<float>(ProfileDensityHotFuncCutOff) /
+ 1000000)
+ break;
+ }
+
+ return ProfileDensity;
+}
+
void CSProfileGenerator::postProcessProfiles() {
// Compute hot/cold threshold based on profile. This will be used for cold
// context profile merging/trimming.
@@ -1041,6 +1116,7 @@ void CSProfileGenerator::postProcessProfiles() {
// inline decisions.
if (EnableCSPreInliner) {
ContextTracker.populateFuncToCtxtMap();
+ calculateAndShowDensity(ContextTracker);
CSPreInliner(ContextTracker, *Binary, Summary.get()).run();
// Turn off the profile merger by default unless it is explicitly enabled.
if (!CSProfMergeColdContext.getNumOccurrences())
@@ -1061,7 +1137,9 @@ void CSProfileGenerator::postProcessProfiles() {
sampleprof::SampleProfileMap ContextLessProfiles;
ProfileConverter::flattenProfile(ProfileMap, ContextLessProfiles, true);
- calculateAndShowDensity(ContextLessProfiles);
+ if (!EnableCSPreInliner)
+ ProfileGeneratorBase::calculateAndShowDensity(ContextLessProfiles);
+
if (GenCSNestedProfile) {
ProfileConverter CSConverter(ProfileMap);
CSConverter.convertCSProfiles();
diff --git a/llvm/tools/llvm-profgen/ProfileGenerator.h b/llvm/tools/llvm-profgen/ProfileGenerator.h
index d258fb78bfb11..cf451f9d1a1a4 100644
--- a/llvm/tools/llvm-profgen/ProfileGenerator.h
+++ b/llvm/tools/llvm-profgen/ProfileGenerator.h
@@ -121,7 +121,7 @@ class ProfileGeneratorBase {
double calculateDensity(const SampleProfileMap &Profiles,
uint64_t HotCntThreshold);
- void showDensitySuggestion(double Density);
+ void showDensitySuggestion(double Density, int DensityCutoffHot);
void collectProfiledFunctions();
@@ -363,6 +363,9 @@ class CSProfileGenerator : public ProfileGeneratorBase {
void computeSummaryAndThreshold();
+ void calculateAndShowDensity(SampleContextTracker &CTracker);
+ double calculateDensity(SampleContextTracker &CTracker);
+
bool collectFunctionsFromLLVMProfile(
std::unordered_set<const BinaryFunction *> &ProfiledFunctions) override;
As we discussed, since optimization is based on the pre-inlined profile, what matters should be the profile density after pre-inline. If the pre-inlined profile has good density, optimizations should be able to make reasonably good and stable decisions based on the merged profile, even though the original full-context profile could be of low density. From that point of view, it makes sense to use the pre-inlined profile to compute density?

How big is the problem? Do you see this old heuristic preventing us from getting a consistent density threshold as a proxy for performance, because density is less stable?
@@ -83,6 +83,10 @@ static cl::opt<double> HotFunctionDensityThreshold(
static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
                                 llvm::cl::desc("show profile density details"),
                                 llvm::cl::Optional);
static cl::opt<int> ProfileDensityHotFuncCutOff(
    "profile-density-hot-func-cutoff", llvm::cl::init(990000),
Do we need a new switch? Should we use ProfileSummaryCutoffHot instead, given they are the same value?

This is just in case people only want to tune the profile density threshold (e.g., to reduce noise) but not ProfileSummaryCutoffHot, which would affect profile generation (the pre-inliner).
@@ -190,9 +195,7 @@ void ProfileGeneratorBase::showDensitySuggestion(double Density) {
  if (ShowDensity)
    outs() << "Minimum profile density for hot functions with top "
With the new heuristic, this message is no longer accurate? It's no longer hot functions with top x% total samples.

Updated the message.
@@ -121,7 +121,7 @@ class ProfileGeneratorBase {
double calculateDensity(const SampleProfileMap &Profiles,
If we change how density is calculated, we should change it everywhere to be consistent. Why leave the old heuristic there for non-CS profiles?

Yeah, I had planned another diff for that; now it's updated in the new version.
uint64_t FuncBodySize = 0;
for (const auto &I : FSamples->getBodySamples()) {
  TotalBodySamples += I.second.getSamples();
  FuncBodySize++;
If you look at ProfileGeneratorBase::calculateDensity, the size used is the actual binary function size; I think that is more accurate and meaningful for density. IIRC for BodySamples, we omit 0 entries, so counting the number of body samples to compute size seems wrong?

Yeah, I noticed that. Assume the ideal density is density = total instruction samples / binary size (binary size being the number of instructions). The issue is that right now we use probes, and we don't have the instruction count for each probe, so total_samples is not actually the sum of all instruction samples; using the binary function size is therefore probably not accurate now. Using the probe count instead seems more reasonable, because the samples come from the LBR, whose minimum unit is a range, not an address. For example, say a function has only one BB/probe, but that BB has 10000 instructions and no jmp, and we have only one LBR hit covering all the instructions; the old one vs. the new one is 1 vs. 1/10000.

IIRC for BodySamples, we omit 0 entries, so counting number of body samples to compute size seems wrong?

We do generate 0-count probes; we don't generate probes that are fully optimized out, but since those don't show up in the binary, it should be OK to ignore them.
Ok, good point. I didn't think about the implication of total samples with probes. I was hoping we could make it really simple by just querying total samples. It would have been better if we didn't rely on the behavior of emitting 0 counts. Do we also emit 0 counts for AutoFDO?

It's worth a comment explaining the nuances.

Do we also emit 0 counts for AutoFDO?

Yes, we also emit 0 counts for never-hit ranges for AutoFDO.
uint64_t TotalBodySamples = 0;
uint64_t FuncBodySize = 0;
for (const auto &I : FSamples->getBodySamples()) {
  TotalBodySamples += I.second.getSamples();
Why accumulate each body sample instead of just using totalSamples?

Right now, as we changed to use the final nested profile, totalSamples also includes the nested children profiles' totalSamples, so we want to exclude the children's samples.

I guess the question is why we want to exclude inlinees? We can count total samples including inlinees, and also size including inlinees. Otherwise we don't cover such inlinees in the density metric; what if an inlinee has a low-density problem?
}
// The whole function could be inlined and optimized out, use the callsite
// head samples instead to estimate the body count.
if (FuncBodySize == 0) {
Why do we only include inlinees when the function body itself is fully optimized away? This seems inconsistent.

Similar to the comment above, I want to use the block count first. One BB should have only one probe but could have many calls, so to be consistent we use density = (sum of BB count) / (num of BB); otherwise it would be density = (sum of BB count + sum of call count) / (num of BB + num of call). That said, it's probably the same thing: a call and a probe in the same BB would have the same count, so the two ratios should be close. No strong opinion; since we don't have the instruction count, we just need a new definition.
Ok, they are more or less equivalent; let's just keep it simple and not exclude callsites regardless?

Good point. After thinking it over more, I agree with you: since the profile is merged after inlining, compiler optimizations use the merged profile and don't use the context-sensitive profile, so I will change this to be based on the pre-inlined profile. Though I still want to emphasize that the existing code uses the completely context-less profile for this, which is also not what we want; IIUC, what we want is just the final profile.
Is that difficult to change?
Ah, that should be even easier, just remove the
Then we don't need the post-pre-inliner one, only the last one.
I didn't dive very deep into the old heuristic; I just guessed from the results. I saw many cases showing super big numbers, for example, and this came from a normal profile.

That said, as I mentioned in the comments above, there seem to be a few bugs in the previous work (another "bug" is that it uses total_samples to decide hot functions, while the cutoff number is a block-level count); maybe it's caused by another issue. Anyway, it seems to need a big overhaul.

If it's sorted by density now, as opposed to function hotness, the density cutoff value can be more stable, which avoids suggestions with huge numbers. But I don't see a bug in the old density; it's just a heuristic with a different philosophy.
"hot-function-density-threshold", llvm::cl::init(1000),
llvm::cl::desc(
    "specify density threshold for hot functions (default: 1000)"),
"hot-function-density-threshold", llvm::cl::init(20),
The switch name is also no longer accurate, since we are no longer selecting the top N functions. Perhaps profile-density-threshold is good enough.

Can you share some data to show that 20 is a good threshold?
if (AccumulatedSamples >= TotalProfileSamples *
                              static_cast<float>(ProfileDensityCutOffHot) /
                              1000000)
  break;
nit: fold the exit condition into a while loop condition?
for (const auto &CallsiteSamples : FSamples.getCallsiteSamples()) {
  FuncBodySize++;
  for (const auto &Callee : CallsiteSamples.second) {
    calculateDensity(Callee.second, DensityList, TotalProfileSamples);
I was thinking about density as a binary-level metric as opposed to a source-level one, in which case it maps better with inlinees included, just like when we dump the hot function list: those are binary-level functions, not source-level functions. Any particular reason to do this at the source level with inlining excluded?

Let me clarify this first; I'll address the other comments later. I don't have a particular reason to exclude the inlinees. The story is that before, we computed density on a full CS profile (or a merged context-less profile), so there were actually no inlinees in it (see the original code: https://github.com/llvm/llvm-project/blob/main/llvm/tools/llvm-profgen/ProfileGenerator.cpp#L1060-L1064; this is before the nested profile conversion), so I just thought we needed to base it on an un-merged profile and didn't consider the density scope.
I guess the question is why do we want to exclude inlinees? we can count total samples including inlinees, and also size including inliness. Otherwise we don't cover such inlinees in density metrics -- what if inlinee has a low density problem?
Another thing to clarify: this version does not drop the inlinees' profiles, but computes each as an independent function instead of as part of the caller function. It should be able to catch a low-density inlinee profile, e.g.:

main:101:100
 1:100
 2: foo:1
  1:1
foo:100:1
 1:100

In this version, it computes three function densities: <main:100>, <foo:1>, and <foo:100>, so we still surface the low-density inlinee <foo:1>. IIUC, your suggestion is to compute something like <main: 50 (101/2)> and <foo: 100>. Yeah, if that makes sense; my current data is based on the current version, which seems to work OK. I think I can implement a binary-level density and compare the two; we can take a look together and discuss.
Ok, the recursion here covers inlinee samples; good to know they're not dropped. Though if we decided to compute density after pre-inline, it might make more sense to include inlinees directly. But yes, data will tell us which one is better; if they are similar, we can aim for implementation simplicity.

Yeah, makes sense; updated to use the binary-level density.
Shared more detailed data with you internally, just summarized here for reference. Service1:
Service2:
Service3:
Also working on other services; so far it seems the regression starts when the density drops below 10, which is why I set the threshold to 20 (we can set it to 50 if we want to be more conservative).

Thanks for the data. It looks like 50 is safer?
@@ -768,9 +748,89 @@ void ProfileGenerator::populateBoundarySamplesForAllFunctions(
  }
}

// Note taht ideally the size should be the number of function's instruction.
typo: taht -> that
@@ -1,13 +1,16 @@
; RUN: llvm-profgen --format=text --unsymbolized-profile=%S/Inputs/profile-density.raw.prof --binary=%S/Inputs/inline-noprobe2.perfbin --output=%t1 --use-offset=0 --show-density -hot-function-density-threshold=10 --trim-cold-profile=0 &> %t2
; RUN: llvm-profgen --format=text --unsymbolized-profile=%S/Inputs/profile-density.raw.prof --binary=%S/Inputs/inline-noprobe2.perfbin --output=%t1 --use-offset=0 --show-density -profile-density-threshold=10 --trim-cold-profile=0 &> %t2
Was there no test for the suggestion? Should we add one?

Added the suggestion test for the CS profile (there was one for the non-CS profile).
llvm::cl::desc(
    "specify density threshold for hot functions (default: 1000)"),
"Set the profile density threshold(default: 20), which is used to "
nit: remove the "Set" verb, which is implied for any flag; we need to describe what "profile-density-threshold" is. Maybe remove "(default: xx)" as well.
uint64_t &FuncBodySize) {
  for (const auto &I : FSamples.getBodySamples()) {
    TotalBodySamples += I.second.getSamples();
    FuncBodySize++;
We don't need to count here; instead, FuncBodySize += FSamples.getBodySamples().size() + FSamples.getCallsiteSamples().size()?
// one block. Hence, we use the number of probe as a proxy for the function's
// size.
void ProfileGeneratorBase::calculateBodySamplesAndSize(
    const FunctionSamples &FSamples, uint64_t &TotalBodySamples,
How different is this TotalBodySamples compared to FSample.getTotalSamples(), which also includes all inlinees?
It should be no different for a probe-based profile, but very different for a line-number-based (AutoFDO) profile, in which the entry is the max of the counts on the same line while total_samples is the sum of all the counts on the same line, so getTotalSamples is much bigger than the sum of the entries. That said, if we adjust to use the binary instruction size, it should also be fine to use getTotalSamples; the density for AutoFDO would then be getTotalSamples / binary_size.
If so, we need to diverge the function-size calculation:
For pseudo-probes, we use the approach in this version.
For line numbers, we use the binary size (DWARF end_address - start_address).
It may still be inaccurate for AutoFDO to use end_address - start_address, since that is not equal to the number of instructions: instruction size is not always 1.
total_samples is a sum of all the counts in the same line

For AutoFDO, based on ProfileGeneratorBase::updateTotalSamples, isn't the total sample count the same as the sum of all body samples + callsite (inlinee) samples?
Yes, using updateTotalSamples makes it equal to the sum of all body samples + callsite (inlinee) samples. But --update-total-samples is off by default (I remember there was a regression with it on).
Even though --update-total-samples is off, how big of a difference do we expect between total samples and the way you compute it? I actually don't remember the source of the discrepancy, and I still don't understand why we had a regression. If they are close enough, it shouldn't matter for computing density?
It could be 2x~10x bigger; the multiplier equals the number of physical instructions one source instruction (debug line) emits, and one source line can generate many physical instructions. There is an example in our test case: https://github.com/llvm/llvm-project/blob/main/llvm/test/tools/llvm-profgen/inline-noprobe.test#L17-L45; see CHECK: main:2609:0 vs CHECK-UPDATE-TOTAL-SAMPLE: main:292:0, which is ~10x.
It's indeed hard to understand why it can cause a regression; it's just the observation from the data.
Ok, can you add a comment explaining why we need to accumulate the samples manually instead of using getTotalSamples? Include the fact that --update-total-samples is off, the reasoning, and the difference between AutoFDO and CSSPGO. A comment would be very useful, because after a while someone may look at this code and be tempted to refactor/simplify it with getTotalSamples.
Comments added.
static cl::opt<double> HotFunctionDensityThreshold(
    "hot-function-density-threshold", llvm::cl::init(1000),
static cl::opt<double> ProfileDensityThreshold(
    "profile-density-threshold", llvm::cl::init(20),
As discussed, perhaps make it 50 to be safer.

Sounds good!
// inlinees' samples and size should be included in the calculation.
calculateBodySamplesAndSize(Callee.second, TotalBodySamples,
                            FuncBodySize);
TotalBodySamples += Callee.second.getHeadSamplesEstimate();
Is this needed? It seems to duplicate the inlinee's body samples. Can we remove it?
Oh, good catch! With this, we also don't need to add FSamples.getCallsiteSamples().size() to FuncBodySize. I was basing this on the old source-level density; now that the inlinees' body samples are included, we don't need the estimated callsite samples.
Ok, hope this doesn't change the density values so we don't need to reevaluate the threshold.

LGTM, thanks for the improvements and for working through the feedback.
// -update-total-samples). Hence, it's safer to re-calculate here to avoid
// such discrepancy.
Worth pointing out explicitly that it makes no difference for probe-based profiles?
Reuse the definition of profile density from llvm-profgen (#92144): - the density is computed in perf2bolt using raw samples (perf.data or pre-aggregated data), - function density is the ratio of dynamically executed function bytes to the static function size in bytes, - profile density: - functions are sorted by density in decreasing order, accumulating their respective sample counts, - profile density is the smallest density covering 99% of total sample count. In other words, BOLT binary profile density is the minimum amount of profile information per function (excluding functions in tail 1% sample count) which is sufficient to optimize the binary well. The density threshold of 60 was determined through experiments with large binaries by reducing the sample count and checking resulting profile density and performance. The threshold is conservative. perf2bolt would print the warning if the density is below the threshold and suggest to increase the sampling duration and/or frequency to reach a given density, e.g.: ``` BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples. ``` Test Plan: updated pre-aggregated-perf.test Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe Reviewed By: WenleiHe, wlei-llvm Pull Request: #101094
The profile density feature (the amount of samples in the profile relative to the program size) is used to identify insufficient-sample issues and provide hints for users to increase the sample count. A low-density profile can be inaccurate due to statistical noise, which can hurt FDO performance.
This change introduces two improvements to the current density work.
Instead, we want to find a density number such that if a function's density is below this value, it is considered a low-density function. We consider the whole profile bad if a group of low-density functions has a sum of samples exceeding the N-percent cutoff of the total samples.
In the implementation, we sort the function profiles by density, iterate over them in descending order, and keep accumulating body samples until the sum exceeds (100% - N) percent of total_samples; the profile density is then the last (minimum) function density among the processed functions. We introduce a flag (--profile-density-threshold) for this percentage threshold.