[WebAssembly] Enable a limited amount of stackification for debug code #136510


Open · wants to merge 1 commit into base: main
Conversation

SingleAccretion (Contributor) commented Apr 20, 2025

This change is a step towards fixing a long-standing problem with LLVM's debug WASM codegen: excessive use of locals. Roughly speaking, one local is allocated for every temporary value in the IR.

This causes several problems:

  1. It makes it easy to hit the engines' limit of 50K locals with certain code patterns and large functions.
  2. It produces larger binaries that are slower to load and slower to compile to native code.
  3. It makes certain compilation strategies for debug code (for example, spilling all WASM locals to the stack) excessively expensive, so debug WASM code either runs very slowly or is less debuggable.
  4. It slows down LLVM itself.

This change partially addresses these problems by running a limited version of the stackification pass on unoptimized code, one that eliminates only the most obviously unnecessary locals.

Care must be taken not to impair LLVM's ability to produce high-quality debug variable locations with this pass. To that end:

  1. We only allow stackification when it does not require moving any instructions.
  2. We disable stackification of any locals that are used in DEBUG_VALUEs or as a frame base.

I have verified on a moderately large example that the baseline and the diff produce the same kinds (local/global/stack) of locations; the only differences are due to the shifting of instruction offsets, with many local.get/local.sets no longer present.

Even with this quite conservative approach, the results are pretty good:

  1. A 30% reduction in raw code size, and up to a 10x reduction in the number of locals for select large methods (~1000 => ~100).
  2. A ~10% reduction in instructions retired for an "llc -O0" run on a moderately sized input.


github-actions bot commented Apr 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

llvmbot (Member) commented Apr 21, 2025

@llvm/pr-subscribers-backend-webassembly

Author: None (SingleAccretion)


Patch is 31.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/136510.diff

9 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssembly.h (+1-1)
  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyRegStackify.cpp (+88-49)
  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp (+8-6)
  • (modified) llvm/test/CodeGen/WebAssembly/PR40172.ll (+3-3)
  • (modified) llvm/test/CodeGen/WebAssembly/PR41841.ll (+8-8)
  • (added) llvm/test/CodeGen/WebAssembly/debug-code-stackification.ll (+89)
  • (modified) llvm/test/CodeGen/WebAssembly/pr51651.ll (+7-7)
  • (modified) llvm/test/CodeGen/WebAssembly/signext-zeroext-callsite.ll (+10-50)
  • (modified) llvm/test/CodeGen/WebAssembly/suboptimal-compare.ll (+6-19)
diff --git a/llvm/lib/Target/WebAssembly/WebAssembly.h b/llvm/lib/Target/WebAssembly/WebAssembly.h
index 8f142fa0928ca..7c455c2168cde 100644
--- a/llvm/lib/Target/WebAssembly/WebAssembly.h
+++ b/llvm/lib/Target/WebAssembly/WebAssembly.h
@@ -44,7 +44,7 @@ FunctionPass *createWebAssemblyReplacePhysRegs();
 FunctionPass *createWebAssemblyNullifyDebugValueLists();
 FunctionPass *createWebAssemblyOptimizeLiveIntervals();
 FunctionPass *createWebAssemblyMemIntrinsicResults();
-FunctionPass *createWebAssemblyRegStackify();
+FunctionPass *createWebAssemblyRegStackify(CodeGenOptLevel OptLevel);
 FunctionPass *createWebAssemblyRegColoring();
 FunctionPass *createWebAssemblyFixBrTableDefaults();
 FunctionPass *createWebAssemblyFixIrreducibleControlFlow();
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyRegStackify.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyRegStackify.cpp
index 428d573668fb4..05c89bd0e09fa 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyRegStackify.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyRegStackify.cpp
@@ -41,14 +41,18 @@ using namespace llvm;
 
 namespace {
 class WebAssemblyRegStackify final : public MachineFunctionPass {
+  bool Optimize;
+
   StringRef getPassName() const override {
     return "WebAssembly Register Stackify";
   }
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.setPreservesCFG();
-    AU.addRequired<MachineDominatorTreeWrapperPass>();
-    AU.addRequired<LiveIntervalsWrapperPass>();
+    if (Optimize) {
+      AU.addRequired<LiveIntervalsWrapperPass>();
+      AU.addRequired<MachineDominatorTreeWrapperPass>();
+    }
     AU.addPreserved<MachineBlockFrequencyInfoWrapperPass>();
     AU.addPreserved<SlotIndexesWrapperPass>();
     AU.addPreserved<LiveIntervalsWrapperPass>();
@@ -61,7 +65,9 @@ class WebAssemblyRegStackify final : public MachineFunctionPass {
 
 public:
   static char ID; // Pass identification, replacement for typeid
-  WebAssemblyRegStackify() : MachineFunctionPass(ID) {}
+  WebAssemblyRegStackify(CodeGenOptLevel OptLevel)
+      : MachineFunctionPass(ID), Optimize(OptLevel != CodeGenOptLevel::None) {}
+  WebAssemblyRegStackify() : WebAssemblyRegStackify(CodeGenOptLevel::Default) {}
 };
 } // end anonymous namespace
 
@@ -70,8 +76,8 @@ INITIALIZE_PASS(WebAssemblyRegStackify, DEBUG_TYPE,
                 "Reorder instructions to use the WebAssembly value stack",
                 false, false)
 
-FunctionPass *llvm::createWebAssemblyRegStackify() {
-  return new WebAssemblyRegStackify();
+FunctionPass *llvm::createWebAssemblyRegStackify(CodeGenOptLevel OptLevel) {
+  return new WebAssemblyRegStackify(OptLevel);
 }
 
 // Decorate the given instruction with implicit operands that enforce the
@@ -96,8 +102,7 @@ static void imposeStackOrdering(MachineInstr *MI) {
 static void convertImplicitDefToConstZero(MachineInstr *MI,
                                           MachineRegisterInfo &MRI,
                                           const TargetInstrInfo *TII,
-                                          MachineFunction &MF,
-                                          LiveIntervals &LIS) {
+                                          MachineFunction &MF) {
   assert(MI->getOpcode() == TargetOpcode::IMPLICIT_DEF);
 
   const auto *RegClass = MRI.getRegClass(MI->getOperand(0).getReg());
@@ -262,36 +267,52 @@ static bool shouldRematerialize(const MachineInstr &Def,
 // LiveIntervals to handle complex cases.
 static MachineInstr *getVRegDef(unsigned Reg, const MachineInstr *Insert,
                                 const MachineRegisterInfo &MRI,
-                                const LiveIntervals &LIS) {
+                                const LiveIntervals *LIS) {
   // Most registers are in SSA form here so we try a quick MRI query first.
   if (MachineInstr *Def = MRI.getUniqueVRegDef(Reg))
     return Def;
 
   // MRI doesn't know what the Def is. Try asking LIS.
-  if (const VNInfo *ValNo = LIS.getInterval(Reg).getVNInfoBefore(
-          LIS.getInstructionIndex(*Insert)))
-    return LIS.getInstructionFromIndex(ValNo->def);
+  if (LIS != nullptr) {
+    SlotIndex InstIndex = LIS->getInstructionIndex(*Insert);
+    if (const VNInfo *ValNo = LIS->getInterval(Reg).getVNInfoBefore(InstIndex))
+      return LIS->getInstructionFromIndex(ValNo->def);
+  }
 
   return nullptr;
 }
 
 // Test whether Reg, as defined at Def, has exactly one use. This is a
 // generalization of MachineRegisterInfo::hasOneNonDBGUse that uses
-// LiveIntervals to handle complex cases.
-static bool hasOneNonDBGUse(unsigned Reg, MachineInstr *Def,
-                            MachineRegisterInfo &MRI, MachineDominatorTree &MDT,
-                            LiveIntervals &LIS) {
+// LiveIntervals to handle complex cases in optimized code.
+static bool hasSingleUse(unsigned Reg, MachineRegisterInfo &MRI,
+                         WebAssemblyFunctionInfo &MFI, bool Optimize,
+                         MachineInstr *Def, LiveIntervals *LIS) {
+  if (!Optimize) {
+    // We don't want to stackify DBG_VALUE operands since WASM stack locations
+    // are less useful and less widely supported than WASM local locations.
+    if (!MRI.hasOneUse(Reg))
+      return false;
+    // The frame base always has an implicit DBG use as DW_AT_frame_base.
+    if (MFI.isFrameBaseVirtual() && MFI.getFrameBaseVreg() == Reg)
+      return false;
+    return true;
+  }
+
   // Most registers are in SSA form here so we try a quick MRI query first.
   if (MRI.hasOneNonDBGUse(Reg))
     return true;
 
+  if (LIS == nullptr)
+    return false;
+
   bool HasOne = false;
-  const LiveInterval &LI = LIS.getInterval(Reg);
+  const LiveInterval &LI = LIS->getInterval(Reg);
   const VNInfo *DefVNI =
-      LI.getVNInfoAt(LIS.getInstructionIndex(*Def).getRegSlot());
+      LI.getVNInfoAt(LIS->getInstructionIndex(*Def).getRegSlot());
   assert(DefVNI);
   for (auto &I : MRI.use_nodbg_operands(Reg)) {
-    const auto &Result = LI.Query(LIS.getInstructionIndex(*I.getParent()));
+    const auto &Result = LI.Query(LIS->getInstructionIndex(*I.getParent()));
     if (Result.valueIn() == DefVNI) {
       if (!Result.isKill())
         return false;
@@ -311,7 +332,7 @@ static bool hasOneNonDBGUse(unsigned Reg, MachineInstr *Def,
 static bool isSafeToMove(const MachineOperand *Def, const MachineOperand *Use,
                          const MachineInstr *Insert,
                          const WebAssemblyFunctionInfo &MFI,
-                         const MachineRegisterInfo &MRI) {
+                         const MachineRegisterInfo &MRI, bool Optimize) {
   const MachineInstr *DefI = Def->getParent();
   const MachineInstr *UseI = Use->getParent();
   assert(DefI->getParent() == Insert->getParent());
@@ -357,6 +378,12 @@ static bool isSafeToMove(const MachineOperand *Def, const MachineOperand *Use,
   if (NextI == Insert)
     return true;
 
+  // When not optimizing, we only handle the trivial case above
+  // to guarantee no impact to debugging and to avoid spending
+  // compile time.
+  if (!Optimize)
+    return false;
+
   // 'catch' and 'catch_all' should be the first instruction of a BB and cannot
   // move.
   if (WebAssembly::isCatch(DefI->getOpcode()))
@@ -520,14 +547,15 @@ static void shrinkToUses(LiveInterval &LI, LiveIntervals &LIS) {
 /// dependencies; move the def down and nest it with the current instruction.
 static MachineInstr *moveForSingleUse(unsigned Reg, MachineOperand &Op,
                                       MachineInstr *Def, MachineBasicBlock &MBB,
-                                      MachineInstr *Insert, LiveIntervals &LIS,
+                                      MachineInstr *Insert, LiveIntervals *LIS,
                                       WebAssemblyFunctionInfo &MFI,
                                       MachineRegisterInfo &MRI) {
   LLVM_DEBUG(dbgs() << "Move for single use: "; Def->dump());
 
   WebAssemblyDebugValueManager DefDIs(Def);
   DefDIs.sink(Insert);
-  LIS.handleMove(*Def);
+  if (LIS != nullptr)
+    LIS->handleMove(*Def);
 
   if (MRI.hasOneDef(Reg) && MRI.hasOneNonDBGUse(Reg)) {
     // No one else is using this register for anything so we can just stackify
@@ -540,17 +568,18 @@ static MachineInstr *moveForSingleUse(unsigned Reg, MachineOperand &Op,
     Op.setReg(NewReg);
     DefDIs.updateReg(NewReg);
 
-    // Tell LiveIntervals about the new register.
-    LIS.createAndComputeVirtRegInterval(NewReg);
+    if (LIS != nullptr) {
+      // Tell LiveIntervals about the new register.
+      LIS->createAndComputeVirtRegInterval(NewReg);
 
-    // Tell LiveIntervals about the changes to the old register.
-    LiveInterval &LI = LIS.getInterval(Reg);
-    LI.removeSegment(LIS.getInstructionIndex(*Def).getRegSlot(),
-                     LIS.getInstructionIndex(*Op.getParent()).getRegSlot(),
-                     /*RemoveDeadValNo=*/true);
+      // Tell LiveIntervals about the changes to the old register.
+      LiveInterval &LI = LIS->getInterval(Reg);
+      LI.removeSegment(LIS->getInstructionIndex(*Def).getRegSlot(),
+                       LIS->getInstructionIndex(*Op.getParent()).getRegSlot(),
+                       /*RemoveDeadValNo=*/true);
+    }
 
     MFI.stackifyVReg(MRI, NewReg);
-
     LLVM_DEBUG(dbgs() << " - Replaced register: "; Def->dump());
   }
 
@@ -567,11 +596,12 @@ static MachineInstr *getPrevNonDebugInst(MachineInstr *MI) {
 
 /// A trivially cloneable instruction; clone it and nest the new copy with the
 /// current instruction.
-static MachineInstr *rematerializeCheapDef(
-    unsigned Reg, MachineOperand &Op, MachineInstr &Def, MachineBasicBlock &MBB,
-    MachineBasicBlock::instr_iterator Insert, LiveIntervals &LIS,
-    WebAssemblyFunctionInfo &MFI, MachineRegisterInfo &MRI,
-    const WebAssemblyInstrInfo *TII, const WebAssemblyRegisterInfo *TRI) {
+static MachineInstr *
+rematerializeCheapDef(unsigned Reg, MachineOperand &Op, MachineInstr &Def,
+                      MachineBasicBlock::instr_iterator Insert,
+                      LiveIntervals &LIS, WebAssemblyFunctionInfo &MFI,
+                      MachineRegisterInfo &MRI,
+                      const WebAssemblyInstrInfo *TII) {
   LLVM_DEBUG(dbgs() << "Rematerializing cheap def: "; Def.dump());
   LLVM_DEBUG(dbgs() << " - for use in "; Op.getParent()->dump());
 
@@ -811,9 +841,12 @@ bool WebAssemblyRegStackify::runOnMachineFunction(MachineFunction &MF) {
   MachineRegisterInfo &MRI = MF.getRegInfo();
   WebAssemblyFunctionInfo &MFI = *MF.getInfo<WebAssemblyFunctionInfo>();
   const auto *TII = MF.getSubtarget<WebAssemblySubtarget>().getInstrInfo();
-  const auto *TRI = MF.getSubtarget<WebAssemblySubtarget>().getRegisterInfo();
-  auto &MDT = getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
-  auto &LIS = getAnalysis<LiveIntervalsWrapperPass>().getLIS();
+  MachineDominatorTree *MDT = nullptr;
+  LiveIntervals *LIS = nullptr;
+  if (Optimize) {
+    MDT = &getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
+    LIS = &getAnalysis<LiveIntervalsWrapperPass>().getLIS();
+  }
 
   // Walk the instructions from the bottom up. Currently we don't look past
   // block boundaries, and the blocks aren't ordered so the block visitation
@@ -876,23 +909,28 @@ bool WebAssemblyRegStackify::runOnMachineFunction(MachineFunction &MF) {
         // supports intra-block moves) and it's MachineSink's job to catch all
         // the sinking opportunities anyway.
         bool SameBlock = DefI->getParent() == &MBB;
-        bool CanMove = SameBlock && isSafeToMove(Def, &Use, Insert, MFI, MRI) &&
+        bool CanMove = SameBlock &&
+                       isSafeToMove(Def, &Use, Insert, MFI, MRI, Optimize) &&
                        !TreeWalker.isOnStack(Reg);
-        if (CanMove && hasOneNonDBGUse(Reg, DefI, MRI, MDT, LIS)) {
+        if (CanMove && hasSingleUse(Reg, MRI, MFI, Optimize, DefI, LIS)) {
           Insert = moveForSingleUse(Reg, Use, DefI, MBB, Insert, LIS, MFI, MRI);
 
           // If we are removing the frame base reg completely, remove the debug
           // info as well.
           // TODO: Encode this properly as a stackified value.
-          if (MFI.isFrameBaseVirtual() && MFI.getFrameBaseVreg() == Reg)
+          if (MFI.isFrameBaseVirtual() && MFI.getFrameBaseVreg() == Reg) {
+            assert(
+                Optimize &&
+                "Stackifying away frame base in unoptimized code not expected");
             MFI.clearFrameBaseVreg();
-        } else if (shouldRematerialize(*DefI, TII)) {
-          Insert =
-              rematerializeCheapDef(Reg, Use, *DefI, MBB, Insert->getIterator(),
-                                    LIS, MFI, MRI, TII, TRI);
-        } else if (CanMove && oneUseDominatesOtherUses(Reg, Use, MBB, MRI, MDT,
-                                                       LIS, MFI)) {
-          Insert = moveAndTeeForMultiUse(Reg, Use, DefI, MBB, Insert, LIS, MFI,
+          }
+        } else if (Optimize && shouldRematerialize(*DefI, TII)) {
+          Insert = rematerializeCheapDef(Reg, Use, *DefI, Insert->getIterator(),
+                                         *LIS, MFI, MRI, TII);
+        } else if (Optimize && CanMove &&
+                   oneUseDominatesOtherUses(Reg, Use, MBB, MRI, *MDT, *LIS,
+                                            MFI)) {
+          Insert = moveAndTeeForMultiUse(Reg, Use, DefI, MBB, Insert, *LIS, MFI,
                                          MRI, TII);
         } else {
           // We failed to stackify the operand. If the problem was ordering
@@ -915,7 +953,8 @@ bool WebAssemblyRegStackify::runOnMachineFunction(MachineFunction &MF) {
           Register DefReg = SubsequentDef->getReg();
           Register UseReg = SubsequentUse->getReg();
           // TODO: This single-use restriction could be relaxed by using tees
-          if (DefReg != UseReg || !MRI.hasOneNonDBGUse(DefReg))
+          if (DefReg != UseReg ||
+              !hasSingleUse(DefReg, MRI, MFI, Optimize, nullptr, nullptr))
             break;
           MFI.stackifyVReg(MRI, DefReg);
           ++SubsequentDef;
@@ -926,7 +965,7 @@ bool WebAssemblyRegStackify::runOnMachineFunction(MachineFunction &MF) {
         // to a constant 0 so that the def is explicit, and the push/pop
         // correspondence is maintained.
         if (Insert->getOpcode() == TargetOpcode::IMPLICIT_DEF)
-          convertImplicitDefToConstZero(Insert, MRI, TII, MF, LIS);
+          convertImplicitDefToConstZero(Insert, MRI, TII, MF);
 
         // We stackified an operand. Add the defining instruction's operands to
         // the worklist stack now to continue to build an ever deeper tree.
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
index 8ec72d5c47833..6c637f11be0ec 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetMachine.cpp
@@ -603,14 +603,16 @@ void WebAssemblyPassConfig::addPreEmitPass() {
 
     // Prepare memory intrinsic calls for register stackifying.
     addPass(createWebAssemblyMemIntrinsicResults());
+  }
 
-    // Mark registers as representing wasm's value stack. This is a key
-    // code-compression technique in WebAssembly. We run this pass (and
-    // MemIntrinsicResults above) very late, so that it sees as much code as
-    // possible, including code emitted by PEI and expanded by late tail
-    // duplication.
-    addPass(createWebAssemblyRegStackify());
+  // Mark registers as representing wasm's value stack. This is a key
+  // code-compression technique in WebAssembly. We run this pass (and
+  // MemIntrinsicResults above) very late, so that it sees as much code as
+  // possible, including code emitted by PEI and expanded by late tail
+  // duplication.
+  addPass(createWebAssemblyRegStackify(getOptLevel()));
 
+  if (getOptLevel() != CodeGenOptLevel::None) {
     // Run the register coloring pass to reduce the total number of registers.
     // This runs after stackification so that it doesn't consider registers
     // that become stackified.
diff --git a/llvm/test/CodeGen/WebAssembly/PR40172.ll b/llvm/test/CodeGen/WebAssembly/PR40172.ll
index ed1630c4926d4..e70b115e67478 100644
--- a/llvm/test/CodeGen/WebAssembly/PR40172.ll
+++ b/llvm/test/CodeGen/WebAssembly/PR40172.ll
@@ -10,9 +10,9 @@ target triple = "wasm32-unknown-unknown"
 
 ; CHECK:  i32.sub $[[BASE:[0-9]+]]=,
 ; CHECK:  local.copy $[[ARG:[0-9]+]]=, $0{{$}}
-; CHECK:  i32.const $[[A0:[0-9]+]]=, 1{{$}}
-; CHECK:  i32.and $[[A1:[0-9]+]]=, $[[ARG]], $[[A0]]{{$}}
-; CHECK:  i32.store8 8($[[BASE]]), $[[A1]]{{$}}
+; CHECK:  i32.const $push[[A0:[0-9]+]]=, 1{{$}}
+; CHECK:  i32.and $push[[A1:[0-9]+]]=, $[[ARG]], $pop[[A0]]{{$}}
+; CHECK:  i32.store8 8($[[BASE]]), $pop[[A1]]{{$}}
 
 define void @test(i8 %byte) {
   %t = alloca { i8, i8 }, align 8
diff --git a/llvm/test/CodeGen/WebAssembly/PR41841.ll b/llvm/test/CodeGen/WebAssembly/PR41841.ll
index 4e3166adf7bb1..76ec83ecf8163 100644
--- a/llvm/test/CodeGen/WebAssembly/PR41841.ll
+++ b/llvm/test/CodeGen/WebAssembly/PR41841.ll
@@ -6,9 +6,9 @@ declare void @foo(i128)
 
 ; CHECK-LABEL: test_zext:
 ; CHECK-NEXT: .functype test_zext (i32) -> (){{$}}
-; CHECK-NEXT: i64.extend_i32_u $[[TMP3:[0-9]+]]=, $0{{$}}
-; CHECK-NEXT: i64.const $[[TMP4:[0-9]+]]=, 1{{$}}
-; CHECK-NEXT: i64.and $[[TMP1:[0-9]+]]=, $[[TMP3]], $[[TMP4]]{{$}}
+; CHECK-NEXT: i64.extend_i32_u $push[[TMP3:[0-9]+]]=, $0{{$}}
+; CHECK-NEXT: i64.const $push[[TMP4:[0-9]+]]=, 1{{$}}
+; CHECK-NEXT: i64.and $[[TMP1:[0-9]+]]=, $pop[[TMP3]], $pop[[TMP4]]{{$}}
 ; CHECK-NEXT: i64.const $[[TMP2:[0-9]+]]=, 0{{$}}
 ; CHECK-NEXT: call foo, $[[TMP1]], $[[TMP2]]{{$}}
 ; CHECK-NEXT: return{{$}}
@@ -23,11 +23,11 @@ next:                                             ; preds = %start
 
 ; CHECK-LABEL: test_sext:
 ; CHECK-NEXT:.functype test_sext (i32) -> (){{$}}
-; CHECK-NEXT: i64.extend_i32_u $[[TMP3:[0-9]+]]=, $0{{$}}
-; CHECK-NEXT: i64.const $[[TMP4:[0-9]+]]=, 1{{$}}
-; CHECK-NEXT: i64.and $[[TMP5:[0-9]+]]=, $[[TMP3]], $[[TMP4]]{{$}}
-; CHECK-NEXT: i64.const $[[TMP6:[0-9]+]]=, 0{{$}}
-; CHECK-NEXT: i64.sub $[[TMP1:[0-9]+]]=, $[[TMP6]], $[[TMP5]]{{$}}
+; CHECK-NEXT: i64.extend_i32_u $push[[TMP3:[0-9]+]]=, $0{{$}}
+; CHECK-NEXT: i64.const $push[[TMP4:[0-9]+]]=, 1{{$}}
+; CHECK-NEXT: i64.and $[[TMP5:[0-9]+]]=, $pop[[TMP3]], $pop[[TMP4]]{{$}}
+; CHECK-NEXT: i64.const $push[[TMP6:[0-9]+]]=, 0{{$}}
+; CHECK-NEXT: i64.sub $[[TMP1:[0-9]+]]=, $pop[[TMP6]], $[[TMP5]]{{$}}
 ; CHECK-NEXT: local.copy $[[TMP2:[0-9]+]]=, $[[TMP1]]{{$}}
 ; CHECK-NEXT: call foo, $[[TMP1]], $[[TMP2]]{{$}}
 ; CHECK-NEXT: return{{$}}
diff --git a/llvm/test/CodeGen/WebAssembly/debug-code-stackification.ll b/llvm/test/CodeGen/WebAssembly/debug-code-stackification.ll
new file mode 100644
index 0000000000000..a2a0ae7267290
--- /dev/null
+++ b/llvm/test/CodeGen/WebAssembly/debug-code-stackification.ll
@@ -0,0 +1,89 @@
+; RUN: llc < %s -O0 --filetype=obj -o - | llvm-dwarfdump - | FileCheck %s --check-prefixes DBG_USE
+
+target triple = "wasm32-unknown-unknown"
+
+declare i32 @extern_func(i32, i32)
+
+; We want to produce local DW_OP_WASM_locations in debug code instead
+; of operand stack locations since local locations are more widely
+; supported and can cover the entirety of the method.
+; DBG_USE: DW_TAG_subprogram
+; DBG_USE:   DW_AT_name ("single_non_dbg_use")
+; DBG_USE:   DW_TAG_variable
+; DBG_USE:     DW_AT_location
+; DBG_USE:       DW_OP_WASM_location 0x0
+; DBG_USE:     DW_AT_name    ("call_value")
+; DBG_USE:   DW_TAG_variable
+; DBG_USE:     DW_AT_location
+; DBG_USE:       DW_OP_WASM_location 0x0
+; DBG_USE:     DW_AT_name    ("sub_value")
+define i32 @single_non_dbg_use(i32 %0, i32 %1) !dbg !6 {
+  %call_value = call i32 @extern_func(i32 1, i32 2), !dbg !20
+  call void @llvm.dbg.value(metadata i32 %call_value, metadata !11, metadata !DIExpression()), !dbg !20
+  %div = udiv i32 %0, %1, !dbg !21
+  %sub = sub i32 %call_value, %div, !dbg !22
+  call void @llvm.dbg.value(metadata i32 %sub, metadata !12, metadata !DIExpression()), !dbg !22
+  ret i32 %sub, !dbg !23
+}
+
+!6 = distinct !DISubprogram(name: "single_non_dbg_use", scope: !1, file: !1, type: !7, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0)
+!7 = !DISubroutineType(types: !8)
+!8 = !{!9, !9, !9}
+!9 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
+!11 = !DILocalVariable(name: "call_value", scope: !6, type: !9)
+!12 = !DILocalVariabl...
[truncated]

dschuff (Member) commented Apr 24, 2025

I think this is an interesting idea. Thanks for keeping in mind those constraints for debug info, I think those are basically the right goals to have. Does the code you're testing come from C#/CLR or somewhere else?
With clang's O0 compilation, basically all of the C locals do live on the stack, and are written back on every sequence point, so I would actually not expect this to have much of an effect on debugging for such code, but it might improve code size. We also have a decent number of users who debug in -O1/-Og mode (which IIRC uses CodeGenOptLevel::Less), because of the large code size at O0. It looks like full stackification is currently turned on for that case. So it might be worthwhile to try switching to limited stackifcation in that case if it improves debug info at a reasonable cost in in code size. Perhaps enabling this might even allow more debugging to be done with O0 compilation rather than O1, but my wild guess is that the mid-level optimizations that happen at O1 make a bigger difference in code size than this would.

I guess another variation on this could be to do something like enabling register coloring instead, which would reduce the local count (and maybe code size a little, since it could reduce the LEB-encoded local numbers). It's probably easy to try, but my guess is that it wouldn't be as effective as doing this.

/cc @aheejin @sunfishcode

SingleAccretion (Contributor, Author):

> Does the code you're testing come from C#/CLR or somewhere else?

It is indeed C#/IL, mostly from the standard library, compiled with our LLVM-based compiler. It differs from typical C/C++ WASM in that it has a lot of if (x != null) CallThrowMethod() code sequences, and it passes around an additional first argument (our own shadow stack), so it also has a lot of x + 12-like expressions.

I expect these wins to translate very well to C/C++ code too (almost any code, really). Compiling this little GC implementation, I see a 45% reduction in code size (600 KB -> 335 KB).

> With clang's O0 compilation, basically all of the C locals do live on the stack, and are written back on every sequence point, so I would actually not expect this to have much of an effect on debugging for such code, but it might improve code size.

Indeed. In -O0 -g, the only actual local that matters is the frame base, everything else is computed based on it. So all these thousands of locals are not very useful.

(The same is true for our compiler, so to test the location coverage, I forced a mode where it emits dbg.values instead of dbg.declares with optimized LLVM IR, which is then run through llc -O0.)

SingleAccretion (Contributor, Author):

<Ping>

sunfishcode (Member) left a comment:

Overall this looks good to me. Could you make some rough measurements of compile time, to make sure this isn't regressing -O0 compile times too much?

@@ -915,7 +953,8 @@ bool WebAssemblyRegStackify::runOnMachineFunction(MachineFunction &MF) {
Register DefReg = SubsequentDef->getReg();
Register UseReg = SubsequentUse->getReg();
// TODO: This single-use restriction could be relaxed by using tees
-          if (DefReg != UseReg || !MRI.hasOneNonDBGUse(DefReg))
+          if (DefReg != UseReg ||
+              !hasSingleUse(DefReg, MRI, MFI, Optimize, nullptr, nullptr))
sunfishcode (Member):

This passes nullptr as the Def argument, which appears to be dereferenced unconditionally in hasSingleUse.

SingleAccretion (Contributor, Author) replied May 27, 2025:

That's OK, because Def is only used in the LIS != nullptr case, and we're passing nullptr for LIS here as well. The LIS/Def pair of arguments is sort of an optional precision enhancement.

In fact, this part of the change is mostly defensive coding, since currently we don't collect debug values for the non-first defs.

SingleAccretion (Contributor, Author) commented May 27, 2025

> Could you make some rough measurements of compile time, to make sure this isn't regressing -O0 compile times too much?

Already measured - this is actually a big throughput win.

> ~10% reduction in instructions retired for an "llc -O0" run on a moderately sized input.

This is because we no longer need to insert so many explicit LOCAL_GET/LOCAL_SETs (according to a per-function profile I took at the time).
