[RISCV] Use vsetvli instead of vlenb in Prologue/Epilogue #113756
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Kito Cheng (kito-cheng)

Changes

Currently, we use `csrr` with `vlenb` to obtain the `VLEN`, but this is not the only option. We can also use `vsetvli` with `e8`/`m1` to get `VLENMAX`, which is equal to the VLEN. This method is preferable on some microarchitectures and makes it easier to obtain values like `VLEN * 2`, `VLEN * 4`, or `VLEN * 8`, reducing the number of instructions needed to calculate VLEN multiples.

However, this approach is NOT always interchangeable, as it changes the state of `VTYPE` and `VL`, which can alter the behavior of vector instructions, potentially causing incorrect code generation if applied after a vsetvli insertion. Therefore, we limit its use to the prologue/epilogue for now, as there are no vector operations within the prologue/epilogue sequence.

With further analysis, we may extend this approach beyond the prologue/epilogue in the future, but starting here should be a good first step.

Patch is 560.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113756.diff

159 Files Affected:
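To make the trade-off concrete, here is the shape of the change as it shows up in the test updates below (a sketch only; the register choice is illustrative). With rd != x0 and rs1 = x0, `vsetvli` writes VLMAX = LMUL × VLEN / SEW to rd, so `e8` with `m1`/`m2`/`m4`/`m8` yields 1×/2×/4×/8× vlenb without a separate shift:

```asm
# Before: read vlenb (VLEN/8 bytes), then scale by 8 for an 8-register area.
csrr    a0, vlenb
slli    a0, a0, 3
sub     sp, sp, a0

# After: one vsetvli computes 8 * vlenb directly (e8, m8 => VLMAX = 8 * VLEN/8),
# at the cost of clobbering VL and VTYPE.
vsetvli a0, zero, e8, m8, ta, ma
sub     sp, sp, a0
```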
diff --git a/llvm/lib/Target/RISCV/RISCVExpandPseudoInsts.cpp b/llvm/lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
index 5dcec078856ead..299537e5047d2b 100644
--- a/llvm/lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
+++ b/llvm/lib/Target/RISCV/RISCVExpandPseudoInsts.cpp
@@ -56,6 +56,8 @@ class RISCVExpandPseudo : public MachineFunctionPass {
MachineBasicBlock::iterator MBBI);
bool expandRV32ZdinxLoad(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MBBI);
+ bool expandPseudoReadMulVLENB(MachineBasicBlock &MBB,
+ MachineBasicBlock::iterator MBBI);
#ifndef NDEBUG
unsigned getInstSizeInBytes(const MachineFunction &MF) const {
unsigned Size = 0;
@@ -164,6 +166,8 @@ bool RISCVExpandPseudo::expandMI(MachineBasicBlock &MBB,
case RISCV::PseudoVMSET_M_B64:
// vmset.m vd => vmxnor.mm vd, vd, vd
return expandVMSET_VMCLR(MBB, MBBI, RISCV::VMXNOR_MM);
+ case RISCV::PseudoReadMulVLENB:
+ return expandPseudoReadMulVLENB(MBB, MBBI);
}
return false;
@@ -410,6 +414,39 @@ bool RISCVExpandPseudo::expandRV32ZdinxLoad(MachineBasicBlock &MBB,
return true;
}
+bool RISCVExpandPseudo::expandPseudoReadMulVLENB(
+ MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI) {
+ DebugLoc DL = MBBI->getDebugLoc();
+ Register Dst = MBBI->getOperand(0).getReg();
+ unsigned Mul = MBBI->getOperand(1).getImm();
+ RISCVII::VLMUL VLMUL = RISCVII::VLMUL::LMUL_1;
+ switch (Mul) {
+ case 1:
+ VLMUL = RISCVII::VLMUL::LMUL_1;
+ break;
+ case 2:
+ VLMUL = RISCVII::VLMUL::LMUL_2;
+ break;
+ case 4:
+ VLMUL = RISCVII::VLMUL::LMUL_4;
+ break;
+ case 8:
+ VLMUL = RISCVII::VLMUL::LMUL_8;
+ break;
+ default:
+ llvm_unreachable("Unexpected VLENB value");
+ }
+ unsigned VTypeImm = RISCVVType::encodeVTYPE(
+ VLMUL, /*SEW*/ 8, /*TailAgnostic*/ true, /*MaskAgnostic*/ true);
+
+ BuildMI(MBB, MBBI, DL, TII->get(RISCV::VSETVLI), Dst)
+ .addReg(RISCV::X0)
+ .addImm(VTypeImm);
+
+ MBBI->eraseFromParent();
+ return true;
+}
+
class RISCVPreRAExpandPseudo : public MachineFunctionPass {
public:
const RISCVSubtarget *STI;
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
index b49cbab1876d79..b76b8e1df9996e 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
@@ -436,8 +436,8 @@ void RISCVFrameLowering::adjustStackForRVV(MachineFunction &MF,
const RISCVRegisterInfo &RI = *STI.getRegisterInfo();
// We must keep the stack pointer aligned through any intermediate
// updates.
- RI.adjustReg(MBB, MBBI, DL, SPReg, SPReg, Offset,
- Flag, getStackAlign());
+ RI.adjustReg(MBB, MBBI, DL, SPReg, SPReg, Offset, Flag, getStackAlign(),
+ /*IsPrologueOrEpilogue*/ true);
}
static void appendScalableVectorExpression(const TargetRegisterInfo &TRI,
@@ -621,7 +621,7 @@ void RISCVFrameLowering::emitPrologue(MachineFunction &MF,
// Allocate space on the stack if necessary.
RI->adjustReg(MBB, MBBI, DL, SPReg, SPReg,
StackOffset::getFixed(-StackSize), MachineInstr::FrameSetup,
- getStackAlign());
+ getStackAlign(), /*IsPrologueOrEpilogue*/ true);
}
// Emit ".cfi_def_cfa_offset RealStackSize"
@@ -666,9 +666,11 @@ void RISCVFrameLowering::emitPrologue(MachineFunction &MF,
// The frame pointer does need to be reserved from register allocation.
assert(MF.getRegInfo().isReserved(FPReg) && "FP not reserved");
- RI->adjustReg(MBB, MBBI, DL, FPReg, SPReg,
- StackOffset::getFixed(RealStackSize - RVFI->getVarArgsSaveSize()),
- MachineInstr::FrameSetup, getStackAlign());
+ RI->adjustReg(
+ MBB, MBBI, DL, FPReg, SPReg,
+ StackOffset::getFixed(RealStackSize - RVFI->getVarArgsSaveSize()),
+ MachineInstr::FrameSetup, getStackAlign(),
+ /*IsPrologueOrEpilogue*/ true);
// Emit ".cfi_def_cfa $fp, RVFI->getVarArgsSaveSize()"
unsigned CFIIndex = MF.addFrameInst(MCCFIInstruction::cfiDefCfa(
@@ -686,7 +688,8 @@ void RISCVFrameLowering::emitPrologue(MachineFunction &MF,
"SecondSPAdjustAmount should be greater than zero");
RI->adjustReg(MBB, MBBI, DL, SPReg, SPReg,
StackOffset::getFixed(-SecondSPAdjustAmount),
- MachineInstr::FrameSetup, getStackAlign());
+ MachineInstr::FrameSetup, getStackAlign(),
+ /*IsPrologueOrEpilogue*/ true);
// If we are using a frame-pointer, and thus emitted ".cfi_def_cfa fp, 0",
// don't emit an sp-based .cfi_def_cfa_offset
@@ -765,7 +768,8 @@ void RISCVFrameLowering::deallocateStack(MachineFunction &MF,
Register SPReg = getSPReg(STI);
RI->adjustReg(MBB, MBBI, DL, SPReg, SPReg, StackOffset::getFixed(StackSize),
- MachineInstr::FrameDestroy, getStackAlign());
+ MachineInstr::FrameDestroy, getStackAlign(),
+ /*IsPrologueOrEpilogue*/ true);
}
void RISCVFrameLowering::emitEpilogue(MachineFunction &MF,
@@ -839,7 +843,8 @@ void RISCVFrameLowering::emitEpilogue(MachineFunction &MF,
if (!RestoreFP)
RI->adjustReg(MBB, LastFrameDestroy, DL, SPReg, SPReg,
StackOffset::getFixed(SecondSPAdjustAmount),
- MachineInstr::FrameDestroy, getStackAlign());
+ MachineInstr::FrameDestroy, getStackAlign(),
+ /*IsPrologueOrEpilogue*/ true);
}
// Restore the stack pointer using the value of the frame pointer. Only
@@ -857,7 +862,7 @@ void RISCVFrameLowering::emitEpilogue(MachineFunction &MF,
RI->adjustReg(MBB, LastFrameDestroy, DL, SPReg, FPReg,
StackOffset::getFixed(-FPOffset), MachineInstr::FrameDestroy,
- getStackAlign());
+ getStackAlign(), /*IsPrologueOrEpilogue*/ true);
}
bool ApplyPop = RVFI->isPushable(MF) && MBBI != MBB.end() &&
@@ -1348,7 +1353,8 @@ MachineBasicBlock::iterator RISCVFrameLowering::eliminateCallFramePseudoInstr(
const RISCVRegisterInfo &RI = *STI.getRegisterInfo();
RI.adjustReg(MBB, MI, DL, SPReg, SPReg, StackOffset::getFixed(Amount),
- MachineInstr::NoFlags, getStackAlign());
+ MachineInstr::NoFlags, getStackAlign(),
+ /*IsPrologueOrEpilogue*/ true);
}
}
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td b/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
index 6b308bc8c9aa0f..fa8cde09be696b 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
@@ -6096,6 +6096,11 @@ let hasSideEffects = 0, mayLoad = 0, mayStore = 0, isCodeGenOnly = 1 in {
[(set GPR:$rd, (riscv_read_vlenb))]>,
PseudoInstExpansion<(CSRRS GPR:$rd, SysRegVLENB.Encoding, X0)>,
Sched<[WriteRdVLENB]>;
+ let Defs = [VL, VTYPE] in {
+ def PseudoReadMulVLENB : Pseudo<(outs GPR:$rd), (ins uimm5:$shamt),
+ []>,
+ Sched<[WriteVSETVLI, ReadVSETVLI]>;
+ }
}
let hasSideEffects = 0, mayLoad = 0, mayStore = 0, isCodeGenOnly = 1,
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
index 26195ef721db39..b37899b148c283 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
@@ -175,7 +175,8 @@ void RISCVRegisterInfo::adjustReg(MachineBasicBlock &MBB,
const DebugLoc &DL, Register DestReg,
Register SrcReg, StackOffset Offset,
MachineInstr::MIFlag Flag,
- MaybeAlign RequiredAlign) const {
+ MaybeAlign RequiredAlign,
+ bool IsPrologueOrEpilogue) const {
if (DestReg == SrcReg && !Offset.getFixed() && !Offset.getScalable())
return;
@@ -205,21 +206,43 @@ void RISCVRegisterInfo::adjustReg(MachineBasicBlock &MBB,
assert(isInt<32>(ScalableValue / (RISCV::RVVBitsPerBlock / 8)) &&
"Expect the number of vector registers within 32-bits.");
uint32_t NumOfVReg = ScalableValue / (RISCV::RVVBitsPerBlock / 8);
- BuildMI(MBB, II, DL, TII->get(RISCV::PseudoReadVLENB), ScratchReg)
- .setMIFlag(Flag);
-
- if (ScalableAdjOpc == RISCV::ADD && ST.hasStdExtZba() &&
- (NumOfVReg == 2 || NumOfVReg == 4 || NumOfVReg == 8)) {
- unsigned Opc = NumOfVReg == 2 ? RISCV::SH1ADD :
- (NumOfVReg == 4 ? RISCV::SH2ADD : RISCV::SH3ADD);
- BuildMI(MBB, II, DL, TII->get(Opc), DestReg)
- .addReg(ScratchReg, RegState::Kill).addReg(SrcReg)
+ // Only use vsetvli rather than vlenb if adjusting in the prologue or
+ // epilogue, otherwise it may disturb the VTYPE and VL status.
+ bool UseVsetvliRatherThanVlenb = IsPrologueOrEpilogue;
+ if (UseVsetvliRatherThanVlenb && (NumOfVReg == 1 || NumOfVReg == 2 ||
+ NumOfVReg == 4 || NumOfVReg == 8)) {
+ BuildMI(MBB, II, DL, TII->get(RISCV::PseudoReadMulVLENB), ScratchReg)
+ .addImm(NumOfVReg)
.setMIFlag(Flag);
- } else {
- TII->mulImm(MF, MBB, II, DL, ScratchReg, NumOfVReg, Flag);
BuildMI(MBB, II, DL, TII->get(ScalableAdjOpc), DestReg)
- .addReg(SrcReg).addReg(ScratchReg, RegState::Kill)
+ .addReg(SrcReg)
+ .addReg(ScratchReg, RegState::Kill)
.setMIFlag(Flag);
+ } else {
+ if (UseVsetvliRatherThanVlenb)
+ BuildMI(MBB, II, DL, TII->get(RISCV::PseudoReadMulVLENB), ScratchReg)
+ .addImm(1)
+ .setMIFlag(Flag);
+ else
+ BuildMI(MBB, II, DL, TII->get(RISCV::PseudoReadVLENB), ScratchReg)
+ .setMIFlag(Flag);
+
+ if (ScalableAdjOpc == RISCV::ADD && ST.hasStdExtZba() &&
+ (NumOfVReg == 2 || NumOfVReg == 4 || NumOfVReg == 8)) {
+ unsigned Opc = NumOfVReg == 2
+ ? RISCV::SH1ADD
+ : (NumOfVReg == 4 ? RISCV::SH2ADD : RISCV::SH3ADD);
+ BuildMI(MBB, II, DL, TII->get(Opc), DestReg)
+ .addReg(ScratchReg, RegState::Kill)
+ .addReg(SrcReg)
+ .setMIFlag(Flag);
+ } else {
+ TII->mulImm(MF, MBB, II, DL, ScratchReg, NumOfVReg, Flag);
+ BuildMI(MBB, II, DL, TII->get(ScalableAdjOpc), DestReg)
+ .addReg(SrcReg)
+ .addReg(ScratchReg, RegState::Kill)
+ .setMIFlag(Flag);
+ }
}
SrcReg = DestReg;
KillSrcReg = true;
@@ -526,7 +549,8 @@ bool RISCVRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
else
DestReg = MRI.createVirtualRegister(&RISCV::GPRRegClass);
adjustReg(*II->getParent(), II, DL, DestReg, FrameReg, Offset,
- MachineInstr::NoFlags, std::nullopt);
+ MachineInstr::NoFlags, std::nullopt,
+ /*IsPrologueOrEpilogue*/ false);
MI.getOperand(FIOperandNum).ChangeToRegister(DestReg, /*IsDef*/false,
/*IsImp*/false,
/*IsKill*/true);
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
index 6ddb1eb9c14d5e..b7aa120935747a 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
@@ -72,10 +72,12 @@ struct RISCVRegisterInfo : public RISCVGenRegisterInfo {
// used during frame layout, and we may need to ensure that if we
// split the offset internally that the DestReg is always aligned,
// assuming that source reg was.
+ // If IsPrologueOrEpilogue is true, the function is called during prologue
+ // or epilogue generation.
void adjustReg(MachineBasicBlock &MBB, MachineBasicBlock::iterator II,
const DebugLoc &DL, Register DestReg, Register SrcReg,
StackOffset Offset, MachineInstr::MIFlag Flag,
- MaybeAlign RequiredAlign) const;
+ MaybeAlign RequiredAlign, bool IsPrologueOrEpilogue) const;
bool eliminateFrameIndex(MachineBasicBlock::iterator MI, int SPAdj,
unsigned FIOperandNum,
diff --git a/llvm/test/CodeGen/RISCV/calling-conv-vector-on-stack.ll b/llvm/test/CodeGen/RISCV/calling-conv-vector-on-stack.ll
index 70cdb6cec2449f..28a4d9166b1ef9 100644
--- a/llvm/test/CodeGen/RISCV/calling-conv-vector-on-stack.ll
+++ b/llvm/test/CodeGen/RISCV/calling-conv-vector-on-stack.ll
@@ -11,8 +11,7 @@ define void @bar() nounwind {
; CHECK-NEXT: sd s0, 80(sp) # 8-byte Folded Spill
; CHECK-NEXT: sd s1, 72(sp) # 8-byte Folded Spill
; CHECK-NEXT: addi s0, sp, 96
-; CHECK-NEXT: csrr a0, vlenb
-; CHECK-NEXT: slli a0, a0, 3
+; CHECK-NEXT: vsetvli a0, zero, e8, m8, ta, ma
; CHECK-NEXT: sub sp, sp, a0
; CHECK-NEXT: andi sp, sp, -64
; CHECK-NEXT: mv s1, sp
diff --git a/llvm/test/CodeGen/RISCV/early-clobber-tied-def-subreg-liveness.ll b/llvm/test/CodeGen/RISCV/early-clobber-tied-def-subreg-liveness.ll
index 0c2b809c0be20c..3704d0a5e20edb 100644
--- a/llvm/test/CodeGen/RISCV/early-clobber-tied-def-subreg-liveness.ll
+++ b/llvm/test/CodeGen/RISCV/early-clobber-tied-def-subreg-liveness.ll
@@ -16,7 +16,7 @@ define void @_Z3foov() {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: addi sp, sp, -16
; CHECK-NEXT: .cfi_def_cfa_offset 16
-; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; CHECK-NEXT: slli a1, a0, 3
; CHECK-NEXT: add a0, a1, a0
; CHECK-NEXT: sub sp, sp, a0
@@ -82,7 +82,7 @@ define void @_Z3foov() {
; CHECK-NEXT: lui a0, %hi(var_47)
; CHECK-NEXT: addi a0, a0, %lo(var_47)
; CHECK-NEXT: vsseg4e16.v v8, (a0)
-; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; CHECK-NEXT: slli a1, a0, 3
; CHECK-NEXT: add a0, a1, a0
; CHECK-NEXT: add sp, sp, a0
diff --git a/llvm/test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll b/llvm/test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll
index 8116d138d288e2..cc426ce3cad1a1 100644
--- a/llvm/test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll
+++ b/llvm/test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll
@@ -59,8 +59,7 @@ define i64 @ctz_nxv8i1_no_range(<vscale x 8 x i16> %a) {
; RV32-NEXT: .cfi_def_cfa_offset 48
; RV32-NEXT: sw ra, 44(sp) # 4-byte Folded Spill
; RV32-NEXT: .cfi_offset ra, -4
-; RV32-NEXT: csrr a0, vlenb
-; RV32-NEXT: slli a0, a0, 1
+; RV32-NEXT: vsetvli a0, zero, e8, m2, ta, ma
; RV32-NEXT: sub sp, sp, a0
; RV32-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 2 * vlenb
; RV32-NEXT: addi a0, sp, 32
@@ -97,8 +96,7 @@ define i64 @ctz_nxv8i1_no_range(<vscale x 8 x i16> %a) {
; RV32-NEXT: sub a1, a1, a4
; RV32-NEXT: sub a1, a1, a3
; RV32-NEXT: sub a0, a0, a2
-; RV32-NEXT: csrr a2, vlenb
-; RV32-NEXT: slli a2, a2, 1
+; RV32-NEXT: vsetvli a2, zero, e8, m2, ta, ma
; RV32-NEXT: add sp, sp, a2
; RV32-NEXT: lw ra, 44(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 48
diff --git a/llvm/test/CodeGen/RISCV/pr69586.ll b/llvm/test/CodeGen/RISCV/pr69586.ll
index 7084c04805be72..3a62d8c2980802 100644
--- a/llvm/test/CodeGen/RISCV/pr69586.ll
+++ b/llvm/test/CodeGen/RISCV/pr69586.ll
@@ -35,7 +35,7 @@ define void @test(ptr %0, ptr %1, i64 %2) {
; NOREMAT-NEXT: .cfi_offset s9, -88
; NOREMAT-NEXT: .cfi_offset s10, -96
; NOREMAT-NEXT: .cfi_offset s11, -104
-; NOREMAT-NEXT: csrr a2, vlenb
+; NOREMAT-NEXT: vsetvli a2, zero, e8, m1, ta, ma
; NOREMAT-NEXT: li a3, 6
; NOREMAT-NEXT: mul a2, a2, a3
; NOREMAT-NEXT: sub sp, sp, a2
@@ -759,7 +759,7 @@ define void @test(ptr %0, ptr %1, i64 %2) {
; NOREMAT-NEXT: vse32.v v8, (a0)
; NOREMAT-NEXT: sf.vc.v.i 2, 0, v8, 0
; NOREMAT-NEXT: sf.vc.v.i 2, 0, v8, 0
-; NOREMAT-NEXT: csrr a0, vlenb
+; NOREMAT-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; NOREMAT-NEXT: li a1, 6
; NOREMAT-NEXT: mul a0, a0, a1
; NOREMAT-NEXT: add sp, sp, a0
diff --git a/llvm/test/CodeGen/RISCV/regalloc-last-chance-recoloring-failure.ll b/llvm/test/CodeGen/RISCV/regalloc-last-chance-recoloring-failure.ll
index 6a0dbbe356a165..8708f766130c6a 100644
--- a/llvm/test/CodeGen/RISCV/regalloc-last-chance-recoloring-failure.ll
+++ b/llvm/test/CodeGen/RISCV/regalloc-last-chance-recoloring-failure.ll
@@ -19,7 +19,7 @@ define void @last_chance_recoloring_failure() {
; CHECK-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
; CHECK-NEXT: .cfi_offset ra, -8
; CHECK-NEXT: .cfi_offset s0, -16
-; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; CHECK-NEXT: slli a0, a0, 4
; CHECK-NEXT: sub sp, sp, a0
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x20, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 32 + 16 * vlenb
@@ -59,7 +59,7 @@ define void @last_chance_recoloring_failure() {
; CHECK-NEXT: vsetvli zero, zero, e32, m8, tu, mu
; CHECK-NEXT: vfdiv.vv v8, v24, v8, v0.t
; CHECK-NEXT: vse32.v v8, (a0)
-; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; CHECK-NEXT: slli a0, a0, 4
; CHECK-NEXT: add sp, sp, a0
; CHECK-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
@@ -75,7 +75,7 @@ define void @last_chance_recoloring_failure() {
; SUBREGLIVENESS-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
; SUBREGLIVENESS-NEXT: .cfi_offset ra, -8
; SUBREGLIVENESS-NEXT: .cfi_offset s0, -16
-; SUBREGLIVENESS-NEXT: csrr a0, vlenb
+; SUBREGLIVENESS-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; SUBREGLIVENESS-NEXT: slli a0, a0, 4
; SUBREGLIVENESS-NEXT: sub sp, sp, a0
; SUBREGLIVENESS-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x20, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 32 + 16 * vlenb
@@ -115,7 +115,7 @@ define void @last_chance_recoloring_failure() {
; SUBREGLIVENESS-NEXT: vsetvli zero, zero, e32, m8, tu, mu
; SUBREGLIVENESS-NEXT: vfdiv.vv v8, v24, v8, v0.t
; SUBREGLIVENESS-NEXT: vse32.v v8, (a0)
-; SUBREGLIVENESS-NEXT: csrr a0, vlenb
+; SUBREGLIVENESS-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; SUBREGLIVENESS-NEXT: slli a0, a0, 4
; SUBREGLIVENESS-NEXT: add sp, sp, a0
; SUBREGLIVENESS-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
diff --git a/llvm/test/CodeGen/RISCV/rvv-cfi-info.ll b/llvm/test/CodeGen/RISCV/rvv-cfi-info.ll
index 225680e846bac7..af88e39f18e195 100644
--- a/llvm/test/CodeGen/RISCV/rvv-cfi-info.ll
+++ b/llvm/test/CodeGen/RISCV/rvv-cfi-info.ll
@@ -9,7 +9,7 @@ define riscv_vector_cc <vscale x 1 x i32> @test_vector_callee_cfi(<vscale x 1 x
; OMIT-FP: # %bb.0: # %entry
; OMIT-FP-NEXT: addi sp, sp, -16
; OMIT-FP-NEXT: .cfi_def_cfa_offset 16
-; OMIT-FP-NEXT: csrr a0, vlenb
+; OMIT-FP-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; OMIT-FP-NEXT: slli a1, a0, 3
; OMIT-FP-NEXT: sub a0, a1, a0
; OMIT-FP-NEXT: sub sp, sp, a0
@@ -49,7 +49,7 @@ define riscv_vector_cc <vscale x 1 x i32> @test_vector_callee_cfi(<vscale x 1 x
; OMIT-FP-NEXT: vl2r.v v2, (a0) # Unknown-size Folded Reload
; OMIT-FP-NEXT: addi a0, sp, 16
; OMIT-FP-NEXT: vl4r.v v4, (a0) # Unknown-size Folded Reload
-; OMIT-FP-NEXT: csrr a0, vlenb
+; OMIT-FP-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; OMIT-FP-NEXT: slli a1, a0, 3
; OMIT-FP-NEXT: sub a0, a1, a0
; OMIT-FP-NEXT: add sp, sp, a0
@@ -66,7 +66,7 @@ define riscv_vector_cc <vscale x 1 x i32> @test_vector_callee_cfi(<vscale x 1 x
; NO-OMIT-FP-NEXT: .cfi_offset s0, -16
; NO-OMIT-FP-NEXT: addi s0, sp, 32
; NO-OMIT-FP-NEXT: .cfi_def_cfa s0, 0
-; NO-OMIT-FP-NEXT: csrr a0, vlenb
+; NO-OMIT-FP-NEXT: vsetvli a0, zero, e8, m1, ta, ma
; NO-OMIT-FP-NEXT: slli a1, a0, 3
; NO-OMIT-FP-NEXT: sub a0, a1, a0
; NO-OMIT-FP-NEXT: sub sp, sp, a0
diff --git a/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll b/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll
index cd2208e31eb6d3..3a8100c57b26f7 100644
--- a/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll
@@ -563,7 +563,7 @@ define <vscale x 16 x i64> @vp_abs_nxv16i64(<vscale x 16 x i64> %va, <vscale x 1
; CHECK: # %b...
[truncated]
Is this still profitable even for cases when there's no shift needed, e.g. would a vsetvli be better than a single csrr?
Does this work with shrink wrapping where the prologue might not be at the beginning?
CSR reads in general are serializing. Vlenb needs to be special-cased in the microarchitecture. Some versions of SiFive cores missed this optimization. Have we checked BananaPi?
I did some experimenting and it looks like the F3 is also missing the optimisation.

csrr.S:

```asm
.global start
start:
    li a0, 0
    li a1, 10485760
loop:
    csrr t0, COUNTER
    addi a0, a0, 1
    blt a0, a1, loop
exit:
    li a7, 93
    ecall
```

vsetvli.s:

```asm
.global _start
_start:
    li a0, 0
    li a1, 10485760
loop:
    vsetvli t0, zero, e8, m1, ta, ma
    addi a0, a0, 1
    blt a0, a1, loop
exit:
    li a7, 93
    ecall
```
I guess this could also introduce VL/VTYPE toggles, but I don't have any data as to whether or not that would be an issue in practice given that this is restricted to prologues and epilogues. I also don't know how expensive a VL/VTYPE toggle is in comparison to a CSR read. Maybe it's always worthwhile to avoid the csrr.
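For what it's worth, a toggle microbenchmark in the same style as the loops above could provide that data (purely a hypothetical sketch, nothing measured here): alternate two different vtype settings so each iteration forces a VL/VTYPE change, then compare against the single-vsetvli loop.

```asm
.global _start
_start:
    li a0, 0
    li a1, 10485760
loop:
    vsetvli t0, zero, e8, m1, ta, ma     # vtype A
    vsetvli t0, zero, e32, m2, ta, ma    # vtype B: forces a VL/VTYPE toggle
    addi a0, a0, 1
    blt a0, a1, loop
exit:
    li a7, 93
    ecall
```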
Put a few more checks in.
clang and gcc using csrr vlenb for prolog/epilog code was an amazing unintended feature (Hyrum's Law). Searching the generated assembly for the string "vlenb" is currently the easiest way to identify register spills when compiling intrinsics. You could do that in big codebases and have minimal false positives. I understand that we should try to get the best codegen, and there will be implementations where csrr is slower than vsetvli.
Do you know of ooo implementations implementing a predictor?
Yes, I think so. Stream Computing open-sourced an ooo implementation with CSR speculation on top of BOOM at RISC-V Summit China: https://github.com/riscv-stc/riscv-boom/tree/matrix Their default configuration seems to have 8 entries for vconfig speculation. I was not able to build it with Verilator and contacted the author, who said that they only support VCS, which I don't have access to. Edit: actually, this might just do speculation, but that also requires keeping track of multiple vtypes. I would hope there are a lot of proprietary cores with vtype speculation currently in development as well.
I thought you meant predicting the VL/VTYPE without waiting for the scalar instructions to compute the AVL. This looks like it is just allowing it to speculatively execute across branches that might mispredict.
I think we should make it a feature. For XiangShan, IIUC,
I've considered adding an option or target feature before, but I decided not to include it in the patch in the end. I know that So, why do we still try to replace XiangShan optimized But I do agree that we should have a target feature IF we try to replace
Reverse ping, does GCC currently do this? I think it would be nice to have it as a tuning option for spacemit-x60 + sifive |
Currently, we use `csrr` with `vlenb` to obtain the `VLEN`, but this is not the only option. We can also use `vsetvli` with `e8`/`m1` to get `VLENMAX`, which is equal to the VLEN. This method is preferable on some microarchitectures and makes it easier to obtain values like `VLEN * 2`, `VLEN * 4`, or `VLEN * 8`, reducing the number of instructions needed to calculate VLEN multiples. However, this approach is *NOT* always interchangeable, as it changes the state of `VTYPE` and `VL`, which can alter the behavior of vector instructions, potentially causing incorrect code generation if applied after a vsetvli insertion. Therefore, we limit its use to the prologue/epilogue for now, as there are no vector operations within the prologue/epilogue sequence. With further analysis, we may extend this approach beyond the prologue/epilogue in the future, but starting here should be a good first step. This feature is guarded by the `+prefer-vsetvli-over-read-vlenb` feature, which is disabled by default for now.
Force-pushed from 84036d4 to 7149518.
```
: SubtargetFeature<"prefer-vsetvli-over-read-vlenb",
                   "PreferVsetvliOverReadVLENB",
                   "true",
                   "Prefer vsetvli over read vlenb CSR when calculate VLEN">;
```
"when calculate" -> "to calculate"
```
// Make sure VTYPE and VL are not live-in since we will use vsetvli in the
// prologue to get the VLEN, and that will clobber these registers.
//
// We may do also check the stack has contain for the object with the
```
"has contain for the object with the " -> "contains objects with"
LGTM
Will wait one more day before merging to see if there are any further comments.
LGTM.
LGTM w/minor suggestion.
As a follow-up, I think we should consider a) enabling this by default, given that a large fraction of existing hardware benefits and no one has raised concerns about regressions on the rest, and b) extending this to non-power-of-two factors by using mulImm on the resulting value.
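For (b), a sketch of what the non-power-of-two case could look like for a factor of 6 (the first sequence matches the pr69586.ll output above; the folded-LMUL variant is an assumption about the proposed follow-up, not part of this patch, and needs Zba for sh1add):

```asm
# Current output (cf. pr69586.ll above): vsetvli as a vlenb substitute, then mul.
vsetvli a2, zero, e8, m1, ta, ma     # a2 = vlenb
li      a3, 6
mul     a2, a2, a3                   # a2 = 6 * vlenb

# Possible follow-up: fold the power-of-two factor into LMUL, then mulImm the rest.
vsetvli a2, zero, e8, m2, ta, ma     # a2 = 2 * vlenb
sh1add  a2, a2, a2                   # a2 = (a2 << 1) + a2 = 6 * vlenb (Zba)
```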