Skip to content

[VPlan] Speed up VPSlotTracker by using ModuleSlotTracker #139881

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

igogo-x86
Copy link
Contributor

Currently, when VPSlotTracker is initialized with a VPlan, its assignName method calls printAsOperand on each underlying instruction. Each such call recomputes slot numbers for the entire function, leading to O(N × M) complexity, where M is the number of instructions in the loop and N is the number of instructions in the function.

This results in slow debug output for large loops. For example, printing costs of all instructions becomes O(M² × N), which is especially painful when enabling verbose dumps.

This patch improves debugging performance by caching slot numbers using ModuleSlotTracker. It avoids redundant recomputation and makes debug output significantly faster.

@igogo-x86 igogo-x86 requested review from fhahn and david-arm May 14, 2025 11:37
@igogo-x86
Copy link
Contributor Author

Example - https://godbolt.org/z/z9T5qaafb

@llvmbot
Copy link
Member

llvmbot commented May 14, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Igor Kirillov (igogo-x86)

Changes

Currently, when VPSlotTracker is initialized with a VPlan, its assignName method calls printAsOperand on each underlying instruction. Each such call recomputes slot numbers for the entire function, leading to O(N × M) complexity, where M is the number of instructions in the loop and N is the number of instructions in the function.

This results in slow debug output for large loops. For example, printing costs of all instructions becomes O(M² × N), which is especially painful when enabling verbose dumps.

This patch improves debugging performance by caching slot numbers using ModuleSlotTracker. It avoids redundant recomputation and makes debug output significantly faster.


Full diff: https://github.com/llvm/llvm-project/pull/139881.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+4-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanHelpers.h (+13-1)
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 167aff737d3fd..a1567fab4eff2 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -1441,7 +1441,10 @@ void VPSlotTracker::assignName(const VPValue *V) {
   std::string Name;
   if (UV) {
     raw_string_ostream S(Name);
-    UV->printAsOperand(S, false);
+    if (MST)
+      UV->printAsOperand(S, false, *MST);
+    else
+      UV->printAsOperand(S, false);
   } else
     Name = VPI->getName();
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index 1d42c8f5f3737..0aa0133f4fe1c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -23,6 +23,7 @@
 #include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/IR/DebugLoc.h"
+#include "llvm/IR/ModuleSlotTracker.h"
 #include "llvm/Support/InstructionCost.h"
 
 namespace llvm {
@@ -382,14 +383,25 @@ class VPSlotTracker {
   /// Number to assign to the next VPValue without underlying value.
   unsigned NextSlot = 0;
 
+  /// Cache slot indexes to avoid recomputing them on each printAsOperand call.
+  std::unique_ptr<ModuleSlotTracker> MST;
+
   void assignName(const VPValue *V);
   void assignNames(const VPlan &Plan);
   void assignNames(const VPBasicBlock *VPBB);
 
 public:
   VPSlotTracker(const VPlan *Plan = nullptr) {
-    if (Plan)
+    if (Plan) {
+      // This check is required to support unit tests with incomplete IR.
+      if (Function *F =
+              Plan->getScalarHeader()->getIRBasicBlock()->getParent()) {
+        Module *M = F->getParent();
+        MST = std::make_unique<ModuleSlotTracker>(M);
+        MST->incorporateFunction(*F);
+      }
       assignNames(*Plan);
+    }
   }
 
   /// Returns the name assigned to \p V, if there is one, otherwise try to

if (Function *F =
Plan->getScalarHeader()->getIRBasicBlock()->getParent()) {
Module *M = F->getParent();
MST = std::make_unique<ModuleSlotTracker>(M);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're invoking this from these functions:

raw_ostream &llvm::operator<<(raw_ostream &OS, const VPRecipeBase &R) {
  const VPBasicBlock *Parent = R.getParent();
  VPSlotTracker SlotTracker(Parent ? Parent->getPlan() : nullptr);
  R.print(OS, "", SlotTracker);
  return OS;
}

void VPValue::dump() const {
  const VPRecipeBase *Instr = dyn_cast_or_null<VPRecipeBase>(this->Def);
  VPSlotTracker SlotTracker(
      (Instr && Instr->getParent()) ? Instr->getParent()->getPlan() : nullptr);
  print(dbgs(), SlotTracker);
  dbgs() << "\n";
}

void VPDef::dump() const {
  const VPRecipeBase *Instr = dyn_cast_or_null<VPRecipeBase>(this);
  VPSlotTracker SlotTracker(
      (Instr && Instr->getParent()) ? Instr->getParent()->getPlan() : nullptr);
  print(dbgs(), "", SlotTracker);
  dbgs() << "\n";
}

void VPBlockBase::print(raw_ostream &O) const {
  VPSlotTracker SlotTracker(getPlan());
  print(O, "", SlotTracker);
}

Isn't this actually making the general case a lot slower? It looks like incorporateFunction could be quite expensive. Wouldn't it be faster to only create this once, perhaps by adding the ModuleSlotTracker to VPlan and guarded by #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorporateFunction doesn't do any job beyond creating SlotTracker and assigning F. Any index calculation is delayed until printAsOperand is called in assignName.

Also, I don't think VPSlotTracker exists outside of debug/assert build - all print-related functions are already guarde by this define:
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version of incorporateFunction I'm looking at is this:

void ModuleSlotTracker::incorporateFunction(const Function &F) {
  // Using getMachine() may lazily create the slot tracker.
  if (!getMachine())
    return;

  // Nothing to do if this is the right function already.
  if (this->F == &F)
    return;
  if (this->F)
    Machine->purgeFunction();
  Machine->incorporateFunction(&F);
  this->F = &F;
}

which looks like it's doing more than setting a couple of values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this may slow down the case where numbering the function isn't neccessary, e.g. because all values are named or we only have VPValues w/o underlying IR. Could we delay incorporateFunction until it is definitely needed, i.e. printing an unnamed IR value?

Then we should get the best of both worlds, because otherwise printAsOperand would call incorporateFunction each time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, made it genuinely lazy

igogo-x86 added 2 commits May 16, 2025 11:01
Currently, when VPSlotTracker is initialized with a VPlan, its
assignName method calls printAsOperand on each underlying instruction.
Each such call recomputes slot numbers for the entire function, leading
to O(N × M) complexity, where M is the number of instructions in the
loop and N is the number of instructions in the function.

This results in slow debug output for large loops. For example, printing
costs of all instructions becomes O(M² × N), which is especially painful
when enabling verbose dumps.

This patch improves debugging performance by caching slot numbers using
ModuleSlotTracker. It avoids redundant recomputation and makes debug
output significantly faster.
@igogo-x86 igogo-x86 force-pushed the vpslot-tracker-speedup branch from d9e8c34 to d8fe7f8 Compare May 16, 2025 12:02
auto *IUV = cast<Instruction>(UV);
// This check is required to support unit tests with incomplete IR.
if (IUV->getParent()) {
MST = std::make_unique<ModuleSlotTracker>(IUV->getModule());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how useful this level of caching really is, because we're still throwing away the cached object for every dump of VPValue. Although in fairness I think that also seems to be true for VPValue2Name, since that member variable is not static.

Can you not just make MST a static member that's initialised to nullptr and lazily updated to std::make_unique<ModuleSlotTracker>(nullptr)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think I've misunderstood how the VPSlotTracker is most commonly used. I guess the performance we're most worried is on the path where we print the vplan, right?

LLVM_DUMP_METHOD
void VPlan::print(raw_ostream &O) const {
  VPSlotTracker SlotTracker(this);

  O << "VPlan '" << getName() << "' {";

in which case the slot tracker is only instantiated once and passed down the hierarchy. Please ignore my comment and sorry for the noise! I just happened to notice that some dump() function also create a slot tracker, but perhaps these are rarely called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main slowdown comes from VPRecipeBase::cost, which prints debug output like:

dbgs() << "Cost of " << RecipeCost << " for VF " << VF << ": ";
dump();

or:

Cost of 1 for VF 2: WIDEN ir<%1547> = add nsw ir<%1546>, ir<%10>

This is where we create and destroy a new VPSlotTracker for each line of output. VPSlotTracker assigns names to all VPValues in the VPlan every time it’s instantiated. That does not look okay, but I guess we can live with that.

The real issue is that for each underlying Value, VPSlotTracker internally creates a SlotTracker, which in turn traverses all instructions in the function. That makes dumping a value quadratic in complexity. The time spent initializing SlotTracker itself is minor compared to the much larger cost of running VPSlotTracker::assignNames.

// This check is required to support unit tests with incomplete IR.
if (IUV->getParent()) {
MST = std::make_unique<ModuleSlotTracker>(IUV->getModule());
MST->incorporateFunction(*IUV->getFunction());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is an instruction that has a parent (i.e. a basic block) also guaranteed to have a function? I assume so, but just curious.

@@ -1441,7 +1441,23 @@ void VPSlotTracker::assignName(const VPValue *V) {
std::string Name;
if (UV) {
raw_string_ostream S(Name);
UV->printAsOperand(S, false);
if (MST) {
UV->printAsOperand(S, false, *MST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it might be easier to have just two case if you rewrite like this:

  if (!isa<Instruction>(UV) || UV->hasName())
    UV->printAsOperand(S, false);
  else
    UV->printAsOperand(S, false, getModuleTracker(cast<Instruction>(UV)));

where a new helper function getModuleTracker can do the logic to lazily construct the module tracker. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth pulling out the code that gets the name for a VPValue, can simplify have an early exit for the cases where no ModuleSlotTracker is needed and then create it if needed and use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-arm I’m not sure that actually helps much—MST is only used here and isn’t reused, so keeping the logic in one place feels simpler to me.

@fhahn, could you clarify what exactly you mean by “pulling out the code that gets the name for a VPValue”? I’m not sure which part you’re referring to or what the intended simplification would be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking if having something like

getNameForVPValue() {
  if (UV->hasName()) {
     UV->printAsOperand(S, false, *MST);
    return Name;
  }
  if (!MST) {
    code to init slot tracker
  }
  UV->printAsOperand(S, false, *MST);
  return Name;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants