[llvm][CodeGen] Add a new software pipeliner 'Window Scheduler' #84443

Merged
merged 5 commits into from
Jun 13, 2024
9 changes: 7 additions & 2 deletions llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -44,10 +44,12 @@
#include "llvm/CodeGen/DFAPacketizer.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/CodeGen/RegisterClassInfo.h"
#include "llvm/CodeGen/ScheduleDAGInstrs.h"
#include "llvm/CodeGen/ScheduleDAGMutation.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/WindowScheduler.h"
#include "llvm/InitializePasses.h"

#include <deque>
@@ -107,6 +109,9 @@ class MachinePipeliner : public MachineFunctionPass {
bool scheduleLoop(MachineLoop &L);
bool swingModuloScheduler(MachineLoop &L);
void setPragmaPipelineOptions(MachineLoop &L);
bool runWindowScheduler(MachineLoop &L);
bool useSwingModuloScheduler();
bool useWindowScheduler(bool Changed);
};

/// This class builds the dependence graph for the instructions in a loop,
@@ -449,7 +454,7 @@ class ResourceManager {
const MCSchedModel &SM;
const TargetSubtargetInfo *ST;
const TargetInstrInfo *TII;
-  SwingSchedulerDAG *DAG;
+  ScheduleDAGInstrs *DAG;
const bool UseDFA;
/// DFA resources for each slot
llvm::SmallVector<std::unique_ptr<DFAPacketizer>> DFAResources;
@@ -493,7 +498,7 @@ class ResourceManager {
#endif

public:
-  ResourceManager(const TargetSubtargetInfo *ST, SwingSchedulerDAG *DAG)
+  ResourceManager(const TargetSubtargetInfo *ST, ScheduleDAGInstrs *DAG)
: STI(ST), SM(ST->getSchedModel()), ST(ST), TII(ST->getInstrInfo()),
DAG(DAG), UseDFA(ST->useDFAforSMS()),
ProcResourceMasks(SM.getNumProcResourceKinds(), 0),
3 changes: 3 additions & 0 deletions llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -199,6 +199,9 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
/// True if the subtarget should run MachinePipeliner
virtual bool enableMachinePipeliner() const { return true; };

/// True if the subtarget should run WindowScheduler.
virtual bool enableWindowScheduler() const { return true; }
Member

On our own DSA, this window algorithm typically can achieve a performance
improvement of over 10%.

Could you please share some performance data (e.g., SPEC benchmarks) on other non-VLIW architectures? IIRC AArch64 and PowerPC also support MachinePipeliner.

Member

Reverse ping :)

Contributor Author

Over the past few days we tested the software pipeliner specifically on AArch64. The test environment was: Apple M1 Pro with 32GB of memory, Docker 25.0.3, Ubuntu 22.04.4 LTS, GCC 11.4.0, and SPEC2006 1.2. We ran 11 integer benchmarks in ref mode, 5 times each. The base score is 51.2 and the score with software pipelining enabled is 51.0, which is essentially the same.
Although these numbers depend heavily on the local test conditions, we believe the relative result is credible: very few loops in SPEC meet the criteria for software pipelining, and even fewer of them have long computation times (which is consistent with the original design intention of SPEC; see "Computer Architecture: A Quantitative Approach", section 1.11). We therefore still believe that the software pipelining algorithm plays its major role on DSP or DSA targets.
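
For scale, the gap between the two scores quoted above can be expressed as a relative change. The helper below is only an illustrative sketch: the name `relativeChangePercent` is made up for this note and is not part of the patch or of any SPEC tooling. Applied to 51.2 and 51.0 it gives roughly -0.39%.

```cpp
#include <cassert>

// Hypothetical helper: relative change of New with respect to Base, in percent.
double relativeChangePercent(double Base, double New) {
  assert(Base != 0.0 && "base score must be non-zero");
  return (New - Base) / Base * 100.0;
}
```

A change of this size is well within what repeated runs on the same machine can produce, which matches the comment's reading that the two configurations perform about the same.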


/// True if the subtarget should enable joining global copies.
///
/// By default this is enabled if the machine scheduler is enabled, but
171 changes: 171 additions & 0 deletions llvm/include/llvm/CodeGen/WindowScheduler.h
@@ -0,0 +1,171 @@
//======------------ WindowScheduler.h - window scheduler --------------======//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// An implementation of the Window Scheduling software pipelining algorithm.
//
// The concept of the window algorithm was first unveiled in Steven Muchnick's
// book, "Advanced Compiler Design And Implementation", and later elaborated
// upon in Venkatraman Govindaraju's report, "Implementation of Software
// Pipelining Using Window Scheduling".
//
// The window algorithm can be perceived as a modulo scheduling algorithm with a
// stage count of 2. It boasts a higher scheduling success rate in targets with
// severe resource conflicts when compared to the classic Swing Modulo
// Scheduling (SMS) algorithm. To align with the LLVM scheduling framework, we
// have enhanced the original window algorithm. The primary steps are as
// follows:
//
// 1. Instead of duplicating the original MBB twice as mentioned in the
// literature, we copy it three times, generating TripleMBB and the
// corresponding TripleDAG.
//
// 2. We establish a scheduling window on TripleMBB and execute list scheduling
// within it.
//
// 3. After multiple rounds of list scheduling, we select the best outcome and
// expand it into the final scheduling result.
//
// To cater to the needs of various targets, we have developed the window
// scheduler in a form that is easily derivable. We recommend employing this
// algorithm in targets with severe resource conflicts, and it can be utilized
// either before or after the Register Allocator (RA).
//
// The default implementation provided here is before RA. If it is to be used
// after RA, certain critical algorithm functions will need to be derived.
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_CODEGEN_WINDOWSCHEDULER_H
#define LLVM_CODEGEN_WINDOWSCHEDULER_H

#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/CodeGen/ScheduleDAGInstrs.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"

namespace llvm {

enum WindowSchedulingFlag {
WS_Off,  ///< Turn off window algorithm.
WS_On,   ///< Use window algorithm after SMS algorithm fails.
WS_Force ///< Use window algorithm instead of SMS algorithm.
};

/// The main class in the implementation of the target independent window
/// scheduler.
class WindowScheduler {
protected:
MachineSchedContext *Context = nullptr;
MachineFunction *MF = nullptr;
MachineBasicBlock *MBB = nullptr;
MachineLoop &Loop;
const TargetSubtargetInfo *Subtarget = nullptr;
const TargetInstrInfo *TII = nullptr;
const TargetRegisterInfo *TRI = nullptr;
MachineRegisterInfo *MRI = nullptr;

/// To identify the dependencies between MIs across two trips, we construct a
/// DAG for a new MBB, which is created by copying the original MBB three
/// times. We refer to this new MBB as 'TripleMBB' and the corresponding DAG
/// as 'TripleDAG'.
/// If a dependency spans more than two trips, we avoid applying the window
/// algorithm; such cases are detected by identifying successive phis in the
/// old MBB.
std::unique_ptr<ScheduleDAGInstrs> TripleDAG;
/// OriMIs keeps the MIs removed from the original MBB.
SmallVector<MachineInstr *> OriMIs;
/// TriMIs keeps the MIs of TripleMBB, which is used to restore TripleMBB.
SmallVector<MachineInstr *> TriMIs;
/// TriToOri keeps the mappings between the MI clones in TripleMBB and their
/// original MI.
DenseMap<MachineInstr *, MachineInstr *> TriToOri;
/// OriToCycle keeps the mappings between the original MI and its issue cycle.
DenseMap<MachineInstr *, int> OriToCycle;
/// SchedResult keeps the result of each list scheduling, and the format of
/// the tuple is <MI pointer, Cycle, Stage, Order ID>.
SmallVector<std::tuple<MachineInstr *, int, int, int>, 256> SchedResult;
/// SchedPhiNum records the number of phi in the original MBB, and the
/// scheduling starts with MI after phis.
unsigned SchedPhiNum = 0;
/// SchedInstrNum records the MIs involved in scheduling in the original MBB,
/// excluding debug instructions.
unsigned SchedInstrNum = 0;
/// BestII and BestOffset record the characteristics of the best scheduling
/// result and are used together with SchedResult as the final window
/// scheduling result.
unsigned BestII = UINT_MAX;
unsigned BestOffset = 0;
/// BaseII is the II obtained when the window offset is SchedPhiNum. This
/// offset is the initial position of the sliding window.
unsigned BaseII = 0;

public:
WindowScheduler(MachineSchedContext *C, MachineLoop &ML);
virtual ~WindowScheduler() {}

bool run();

protected:
/// Two types of ScheduleDAGs are needed, one for creating dependency graphs
/// only, and the other for list scheduling as determined by the target.
virtual ScheduleDAGInstrs *
createMachineScheduler(bool OnlyBuildGraph = false);
/// Initializes the algorithm and determines if it can be executed.
virtual bool initialize();
/// Add some related processing before running window scheduling.
virtual void preProcess();
/// Add some related processing after running window scheduling.
virtual void postProcess();
/// Back up the MIs in the original MBB and remove them from MBB.
void backupMBB();
/// Erase the MIs in current MBB and restore the original MIs.
void restoreMBB();
/// Make three copies of the original MBB to generate a new TripleMBB.
virtual void generateTripleMBB();
/// Restore the order of MIs in TripleMBB after each list scheduling.
virtual void restoreTripleMBB();
/// Give the folding position in the window algorithm, where different
/// heuristics can be used. It determines the performance and compilation time
/// of the algorithm.
virtual SmallVector<unsigned> getSearchIndexes(unsigned SearchNum,
unsigned SearchRatio);
/// Calculate MIs execution cycle after list scheduling.
virtual int calculateMaxCycle(ScheduleDAGInstrs &DAG, unsigned Offset);
/// Calculate the stall cycle between two trips after list scheduling.
virtual int calculateStallCycle(unsigned Offset, int MaxCycle);
/// Analyzes the II value after each list scheduling.
virtual unsigned analyseII(ScheduleDAGInstrs &DAG, unsigned Offset);
/// Phis are scheduled separately after each list scheduling.
virtual void schedulePhi(int Offset, unsigned &II);
/// Get the final issue order of all scheduled MIs including phis.
DenseMap<MachineInstr *, int> getIssueOrder(unsigned Offset, unsigned II);
/// Update the scheduling result after each list scheduling.
virtual void updateScheduleResult(unsigned Offset, unsigned II);
/// Check whether the final result of window scheduling is valid.
virtual bool isScheduleValid() { return BestOffset != SchedPhiNum; }
/// Using the scheduling infrastructure to expand the results of window
/// scheduling. It is usually necessary to add prologue and epilogue MBBs.
virtual void expand();
/// Update the live intervals for all registers used within MBB.
virtual void updateLiveIntervals();
/// Estimate a II value at which all MIs will be scheduled successfully.
int getEstimatedII(ScheduleDAGInstrs &DAG);
/// Gets the iterator range of MIs in the scheduling window.
iterator_range<MachineBasicBlock::iterator> getScheduleRange(unsigned Offset,
unsigned Num);
/// Get the issue cycle of the new MI based on the cycle of the original MI.
int getOriCycle(MachineInstr *NewMI);
/// Get the original MI from which the new MI is cloned.
MachineInstr *getOriMI(MachineInstr *NewMI);
/// Get the scheduling stage, where the stage of the new MI is identical to
/// the original MI.
unsigned getOriStage(MachineInstr *OriMI, unsigned Offset);
/// Gets the register in phi which is generated from the current MBB.
Register getAntiRegister(MachineInstr *Phi);
};
} // namespace llvm
#endif
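
The three steps in the file header above (copy the body three times, slide a window over the copies, keep the best result) can be illustrated with a small standalone sketch. It is not the patch's implementation: `Slot`, `makeTriple`, `windowAt`, `Candidate` and `pickBest` are hypothetical names, instructions are plain strings, no dependence graph or list scheduling is performed, and the tie-breaking rule is an assumption. It only shows how a window of one body's length, started at some offset into the tripled body, mixes instructions from two consecutive trips.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One position in the tripled body: the instruction and the copy it belongs to.
struct Slot {
  std::string Name;
  unsigned Copy; // 0, 1 or 2
};

// Step 1 (simplified): concatenate three copies of the loop body.
std::vector<Slot> makeTriple(const std::vector<std::string> &Body) {
  std::vector<Slot> Triple;
  for (unsigned Copy = 0; Copy < 3; ++Copy)
    for (const std::string &Name : Body)
      Triple.push_back({Name, Copy});
  return Triple;
}

// Step 2 (simplified): take a window of Num slots starting at Offset.
std::vector<Slot> windowAt(const std::vector<Slot> &Triple, std::size_t Offset,
                           std::size_t Num) {
  assert(Offset + Num <= Triple.size() && "window must fit in the tripled body");
  auto First = Triple.begin() + static_cast<std::ptrdiff_t>(Offset);
  return std::vector<Slot>(First, First + static_cast<std::ptrdiff_t>(Num));
}

// Step 3 (simplified): one candidate per tried offset, with the II found for it.
struct Candidate {
  unsigned Offset;
  unsigned II;
};

// Keep the smallest II; on a tie the earlier candidate wins (an assumption).
Candidate pickBest(const std::vector<Candidate> &Tried) {
  assert(!Tried.empty() && "need at least one candidate");
  Candidate Best = Tried.front();
  for (const Candidate &C : Tried)
    if (C.II < Best.II)
      Best = C;
  return Best;
}
```

With the body `{load, mul, add, store}` and an offset of 2, the window holds `add` and `store` from copy 0 followed by `load` and `mul` from copy 1. In the header above, `isScheduleValid()` rejects a final result whose best offset equals the initial offset `SchedPhiNum`, i.e. when no other window position beat the starting one.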
1 change: 1 addition & 0 deletions llvm/lib/CodeGen/CMakeLists.txt
@@ -244,6 +244,7 @@ add_llvm_component_library(LLVMCodeGen
VLIWMachineScheduler.cpp
VirtRegMap.cpp
WasmEHPrepare.cpp
WindowScheduler.cpp
WinEHPrepare.cpp
XRayInstrumentation.cpp
${GeneratedMLSources}
43 changes: 42 additions & 1 deletion llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -68,6 +68,7 @@
#include "llvm/CodeGen/ScheduleDAGMutation.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetOpcodes.h"
#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/Config/llvm-config.h"
@@ -206,6 +207,17 @@ cl::opt<int> SwpForceIssueWidth(
cl::desc("Force pipeliner to use specified issue width."), cl::Hidden,
cl::init(-1));

/// A command line argument to set the window scheduling option.
cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
"window-sched", cl::Hidden, cl::init(WindowSchedulingFlag::WS_On),
cl::desc("Set how to use window scheduling algorithm."),
cl::values(clEnumValN(WindowSchedulingFlag::WS_Off, "off",
"Turn off window algorithm."),
clEnumValN(WindowSchedulingFlag::WS_On, "on",
"Use window algorithm after SMS algorithm fails."),
clEnumValN(WindowSchedulingFlag::WS_Force, "force",
"Use window algorithm instead of SMS algorithm.")));

} // end namespace llvm

unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
@@ -292,8 +304,11 @@ bool MachinePipeliner::scheduleLoop(MachineLoop &L) {
}

++NumTrytoPipeline;
+  if (useSwingModuloScheduler())
+    Changed = swingModuloScheduler(L);

-  Changed = swingModuloScheduler(L);
+  if (useWindowScheduler(Changed))
+    Changed = runWindowScheduler(L);
Collaborator

Should we run one scheduler after another? This came up in a discussion today at the vectorizer meeting. cc: @ayalz

Contributor Author

@huaatian Jul 17, 2024

Thank you very much for the review comment. Let me explain the current design considerations:

  1. Both SMS (Swing Modulo Scheduling) and WS (Window Scheduling) are software pipelining algorithms, and the conditions that determine whether they can be applied are the same. To avoid redundant checks, we placed both in the MachinePipeliner.
  2. Both SMS and WS work by folding the loop multiple times to obtain the kernel. The advantage of SMS is that it can fold more times, i.e., the stage count can be greater than 2. The advantage of WS is that it is less affected by resource conflicts and can always produce a scheduling result. Running WS after SMS fails can therefore be seen as an enhancement for targets with many resource conflicts.

I hope this addresses your concerns. Thank you!

Contributor Author

By the way, as an example: the resources of our VLIW target are modeled accurately, which exposes more hardware conflicts and results in a higher failure rate for SMS. We have therefore opted to use the WS algorithm directly.
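
The interaction of the three modes discussed in this thread can be summarised with a standalone sketch. It mirrors the conditions in `useSwingModuloScheduler()` and `useWindowScheduler()` from this patch, but `WindowMode`, `Plan` and `planSchedulers` are hypothetical names used only for illustration, and `SMSSucceeds` stands in for the real outcome of running SMS on a loop.

```cpp
#include <cassert>

// Hypothetical stand-in for the WindowSchedulingFlag enum in the patch.
enum class WindowMode { Off, On, Force };

// Which schedulers end up running for one loop.
struct Plan {
  bool RunSMS;
  bool RunWS;
};

// SMSSucceeds models whether SMS would pipeline the loop if it were run.
Plan planSchedulers(WindowMode Mode, bool SMSSucceeds) {
  // SMS is skipped only when the window scheduler is forced.
  bool RunSMS = Mode != WindowMode::Force;
  // "Changed" is true only if SMS ran and pipelined the loop.
  bool Changed = RunSMS && SMSSucceeds;
  // WS runs when forced, or as a fallback when enabled and SMS did nothing.
  bool RunWS = Mode == WindowMode::Force ||
               (Mode == WindowMode::On && !Changed);
  return {RunSMS, RunWS};
}
```

In the patch the mode is selected with the hidden `window-sched` option, whose values are `off`, `on` (the default) and `force`. The sketch ignores one detail of the real `scheduleLoop`: there, `Changed` may already be set by inner loops before either scheduler is tried.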


LI.LoopPipelinerInfo.reset();
return Changed;
@@ -484,9 +499,35 @@ void MachinePipeliner::getAnalysisUsage(AnalysisUsage &AU) const {
AU.addRequired<MachineDominatorTree>();
AU.addRequired<LiveIntervals>();
AU.addRequired<MachineOptimizationRemarkEmitterPass>();
AU.addRequired<TargetPassConfig>();
MachineFunctionPass::getAnalysisUsage(AU);
}

bool MachinePipeliner::runWindowScheduler(MachineLoop &L) {
MachineSchedContext Context;
Context.MF = MF;
Context.MLI = MLI;
Context.MDT = MDT;
Context.PassConfig = &getAnalysis<TargetPassConfig>();
Contributor

Where is this used?

Contributor Author

In the WindowScheduler, createMachineScheduler() uses 'PassConfig' to call the target's custom MachineScheduler.
[image not preserved]

Contributor Author

Additionally, two ScheduleDAGs are used here: one for analyzing the dependencies between all instructions after copying, and the other for scheduling instructions in the window. The difference is that the former does not need to consider register pressure.

Context.AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
Context.LIS = &getAnalysis<LiveIntervals>();
Context.RegClassInfo->runOnMachineFunction(*MF);
WindowScheduler WS(&Context, L);
return WS.run();
}

bool MachinePipeliner::useSwingModuloScheduler() {
// The swing modulo scheduler is skipped when the window scheduler is forced.
return WindowSchedulingOption != WindowSchedulingFlag::WS_Force;
}

bool MachinePipeliner::useWindowScheduler(bool Changed) {
// The window scheduler runs when it is forced, or when it is enabled and the
// swing modulo scheduler did not change the loop.
return WindowSchedulingOption == WindowSchedulingFlag::WS_Force ||
(WindowSchedulingOption == WindowSchedulingFlag::WS_On && !Changed);
}

void SwingSchedulerDAG::setMII(unsigned ResMII, unsigned RecMII) {
if (SwpForceII > 0)
MII = SwpForceII;