-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[VPlan] Speed up VPSlotTracker by using ModuleSlotTracker #139881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Example - https://godbolt.org/z/z9T5qaafb |
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms Author: Igor Kirillov (igogo-x86) ChangesCurrently, when VPSlotTracker is initialized with a VPlan, its assignName method calls printAsOperand on each underlying instruction. Each such call recomputes slot numbers for the entire function, leading to O(N × M) complexity, where M is the number of instructions in the loop and N is the number of instructions in the function. This results in slow debug output for large loops. For example, printing costs of all instructions becomes O(M² × N), which is especially painful when enabling verbose dumps. This patch improves debugging performance by caching slot numbers using ModuleSlotTracker. It avoids redundant recomputation and makes debug output significantly faster. Full diff: https://github.com/llvm/llvm-project/pull/139881.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 167aff737d3fd..a1567fab4eff2 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -1441,7 +1441,10 @@ void VPSlotTracker::assignName(const VPValue *V) {
std::string Name;
if (UV) {
raw_string_ostream S(Name);
- UV->printAsOperand(S, false);
+ if (MST)
+ UV->printAsOperand(S, false, *MST);
+ else
+ UV->printAsOperand(S, false);
} else
Name = VPI->getName();
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index 1d42c8f5f3737..0aa0133f4fe1c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -23,6 +23,7 @@
#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DebugLoc.h"
+#include "llvm/IR/ModuleSlotTracker.h"
#include "llvm/Support/InstructionCost.h"
namespace llvm {
@@ -382,14 +383,25 @@ class VPSlotTracker {
/// Number to assign to the next VPValue without underlying value.
unsigned NextSlot = 0;
+ /// Cache slot indexes to avoid recomputing them on each printAsOperand call.
+ std::unique_ptr<ModuleSlotTracker> MST;
+
void assignName(const VPValue *V);
void assignNames(const VPlan &Plan);
void assignNames(const VPBasicBlock *VPBB);
public:
VPSlotTracker(const VPlan *Plan = nullptr) {
- if (Plan)
+ if (Plan) {
+ // This check is required to support unit tests with incomplete IR.
+ if (Function *F =
+ Plan->getScalarHeader()->getIRBasicBlock()->getParent()) {
+ Module *M = F->getParent();
+ MST = std::make_unique<ModuleSlotTracker>(M);
+ MST->incorporateFunction(*F);
+ }
assignNames(*Plan);
+ }
}
/// Returns the name assigned to \p V, if there is one, otherwise try to
|
if (Function *F = | ||
Plan->getScalarHeader()->getIRBasicBlock()->getParent()) { | ||
Module *M = F->getParent(); | ||
MST = std::make_unique<ModuleSlotTracker>(M); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're invoking this from these functions:
raw_ostream &llvm::operator<<(raw_ostream &OS, const VPRecipeBase &R) {
const VPBasicBlock *Parent = R.getParent();
VPSlotTracker SlotTracker(Parent ? Parent->getPlan() : nullptr);
R.print(OS, "", SlotTracker);
return OS;
}
void VPValue::dump() const {
const VPRecipeBase *Instr = dyn_cast_or_null<VPRecipeBase>(this->Def);
VPSlotTracker SlotTracker(
(Instr && Instr->getParent()) ? Instr->getParent()->getPlan() : nullptr);
print(dbgs(), SlotTracker);
dbgs() << "\n";
}
void VPDef::dump() const {
const VPRecipeBase *Instr = dyn_cast_or_null<VPRecipeBase>(this);
VPSlotTracker SlotTracker(
(Instr && Instr->getParent()) ? Instr->getParent()->getPlan() : nullptr);
print(dbgs(), "", SlotTracker);
dbgs() << "\n";
}
void VPBlockBase::print(raw_ostream &O) const {
VPSlotTracker SlotTracker(getPlan());
print(O, "", SlotTracker);
}
Isn't this actually making the general case a lot slower? It looks like incorporateFunction
could be quite expensive. Wouldn't it be faster to only create this once, perhaps by adding the ModuleSlotTracker to VPlan and guarded by #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incorporateFunction
doesn't do any job beyond creating SlotTracker
and assigning F
. Any index calculation is delayed until printAsOperand
is called in assignName
.
Also, I don't think VPSlotTracker
exists outside of debug/assert build - all print-related functions are already guarde by this define:
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version of incorporateFunction
I'm looking at is this:
void ModuleSlotTracker::incorporateFunction(const Function &F) {
// Using getMachine() may lazily create the slot tracker.
if (!getMachine())
return;
// Nothing to do if this is the right function already.
if (this->F == &F)
return;
if (this->F)
Machine->purgeFunction();
Machine->incorporateFunction(&F);
this->F = &F;
}
which looks like it's doing more than setting a couple of values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this may slow down the case where numbering the function isn't neccessary, e.g. because all values are named or we only have VPValues w/o underlying IR. Could we delay incorporateFunction
until it is definitely needed, i.e. printing an unnamed IR value?
Then we should get the best of both worlds, because otherwise printAsOperand would call incorporateFunction
each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, made it genuinely lazy
Currently, when VPSlotTracker is initialized with a VPlan, its assignName method calls printAsOperand on each underlying instruction. Each such call recomputes slot numbers for the entire function, leading to O(N × M) complexity, where M is the number of instructions in the loop and N is the number of instructions in the function. This results in slow debug output for large loops. For example, printing costs of all instructions becomes O(M² × N), which is especially painful when enabling verbose dumps. This patch improves debugging performance by caching slot numbers using ModuleSlotTracker. It avoids redundant recomputation and makes debug output significantly faster.
d9e8c34
to
d8fe7f8
Compare
auto *IUV = cast<Instruction>(UV); | ||
// This check is required to support unit tests with incomplete IR. | ||
if (IUV->getParent()) { | ||
MST = std::make_unique<ModuleSlotTracker>(IUV->getModule()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful this level of caching really is, because we're still throwing away the cached object for every dump of VPValue. Although in fairness I think that also seems to be true for VPValue2Name
, since that member variable is not static.
Can you not just make MST
a static member that's initialised to nullptr and lazily updated to std::make_unique<ModuleSlotTracker>(nullptr)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think I've misunderstood how the VPSlotTracker is most commonly used. I guess the performance we're most worried is on the path where we print the vplan, right?
LLVM_DUMP_METHOD
void VPlan::print(raw_ostream &O) const {
VPSlotTracker SlotTracker(this);
O << "VPlan '" << getName() << "' {";
in which case the slot tracker is only instantiated once and passed down the hierarchy. Please ignore my comment and sorry for the noise! I just happened to notice that some dump() function also create a slot tracker, but perhaps these are rarely called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main slowdown comes from VPRecipeBase::cost
, which prints debug output like:
dbgs() << "Cost of " << RecipeCost << " for VF " << VF << ": ";
dump();
or:
Cost of 1 for VF 2: WIDEN ir<%1547> = add nsw ir<%1546>, ir<%10>
This is where we create and destroy a new VPSlotTracker
for each line of output. VPSlotTracker
assigns names to all VPValues in the VPlan every time it’s instantiated. That does not look okay, but I guess we can live with that.
The real issue is that for each underlying Value, VPSlotTracker
internally creates a SlotTracker
, which in turn traverses all instructions in the function. That makes dumping a value quadratic in complexity. The time spent initializing SlotTracker
itself is minor compared to the much larger cost of running VPSlotTracker::assignNames
.
// This check is required to support unit tests with incomplete IR. | ||
if (IUV->getParent()) { | ||
MST = std::make_unique<ModuleSlotTracker>(IUV->getModule()); | ||
MST->incorporateFunction(*IUV->getFunction()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is an instruction that has a parent (i.e. a basic block) also guaranteed to have a function? I assume so, but just curious.
@@ -1441,7 +1441,23 @@ void VPSlotTracker::assignName(const VPValue *V) { | |||
std::string Name; | |||
if (UV) { | |||
raw_string_ostream S(Name); | |||
UV->printAsOperand(S, false); | |||
if (MST) { | |||
UV->printAsOperand(S, false, *MST); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it might be easier to have just two case if you rewrite like this:
if (!isa<Instruction>(UV) || UV->hasName())
UV->printAsOperand(S, false);
else
UV->printAsOperand(S, false, getModuleTracker(cast<Instruction>(UV)));
where a new helper function getModuleTracker
can do the logic to lazily construct the module tracker. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth pulling out the code that gets the name for a VPValue, can simplify have an early exit for the cases where no ModuleSlotTracker is needed and then create it if needed and use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@david-arm I’m not sure that actually helps much—MST is only used here and isn’t reused, so keeping the logic in one place feels simpler to me.
@fhahn, could you clarify what exactly you mean by “pulling out the code that gets the name for a VPValue”? I’m not sure which part you’re referring to or what the intended simplification would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking if having something like
getNameForVPValue() {
if (UV->hasName()) {
UV->printAsOperand(S, false, *MST);
return Name;
}
if (!MST) {
code to init slot tracker
}
UV->printAsOperand(S, false, *MST);
return Name;
}
Currently, when VPSlotTracker is initialized with a VPlan, its assignName method calls printAsOperand on each underlying instruction. Each such call recomputes slot numbers for the entire function, leading to O(N × M) complexity, where M is the number of instructions in the loop and N is the number of instructions in the function.
This results in slow debug output for large loops. For example, printing costs of all instructions becomes O(M² × N), which is especially painful when enabling verbose dumps.
This patch improves debugging performance by caching slot numbers using ModuleSlotTracker. It avoids redundant recomputation and makes debug output significantly faster.