Commit cc2fbc6

[CodeLayout] Faster basic block reordering, ext-tsp (#68617)
Aggressive inlining might produce huge functions with more than 10K basic blocks. Since BFI treats _all_ blocks and jumps as "hot", each having a non-negative (but perhaps small) weight, the current implementation can be slow, taking minutes to produce a layout. This change introduces a few modifications that significantly (up to 50x on some instances) speed up the computation.

Some notable changes:
- reduced the maximum chain size to 512 (from the prior 4096);
- introduced the MaxMergeDensityRatio param to avoid merging chains with very different densities;
- dropped a couple of params that seem unnecessary.

Looking at some "offline" metrics (e.g., the number of created fall-throughs), there shouldn't be problems; in fact, I do see some metrics go up. But it might be hard or impossible to measure a perf difference for such small changes. I did test the performance of a clang-14 binary and did not record any perf or i-cache-related differences.

My benchmarks, with ext-tsp runtime (the lower the better) and "tsp-score" (the higher the better):

**Before**:
- benchmark 1: num functions: 13,047; reordering running time is 2.4 seconds; score: 125503458 (128.3102%)
- benchmark 2: num functions: 16,438; reordering running time is 3.4 seconds; score: 12613997277 (129.7495%)
- benchmark 3: num functions: 12,359; reordering running time is 1.9 seconds; score: 1315881613 (105.8991%)
- benchmark 4: num functions: 96,588; reordering running time is 7.3 seconds; score: 89513906284 (100.3413%)
- benchmark 5: num functions: 1; reordering running time is 372 seconds; score: 21292505965077 (99.9979%)
- benchmark 6: num functions: 71,155; reordering running time is 314 seconds; score: 29795381626270671437824 (102.7519%)

**After**:
- benchmark 1: reordering running time is 2.2 seconds; score: 125510418 (128.3130%)
- benchmark 2: reordering running time is 2.6 seconds; score: 12614502162 (129.7525%)
- benchmark 3: reordering running time is 1.6 seconds; score: 1315938168 (105.9024%)
- benchmark 4: reordering running time is 4.9 seconds; score: 89518095837 (100.3454%)
- benchmark 5: reordering running time is 4.8 seconds; score: 21292295939119 (99.9971%)
- benchmark 6: reordering running time is 104 seconds; score: 29796710925310302879744 (102.7565%)

1 parent 28a8f1b commit cc2fbc6

3 files changed: 69 additions, 65 deletions


llvm/lib/Transforms/Utils/CodeLayout.cpp

Lines changed: 66 additions & 53 deletions
@@ -101,20 +101,19 @@ static cl::opt<unsigned> BackwardDistance(
 // The maximum size of a chain created by the algorithm. The size is bounded
 // so that the algorithm can efficiently process extremely large instances.
 static cl::opt<unsigned>
-    MaxChainSize("ext-tsp-max-chain-size", cl::ReallyHidden, cl::init(4096),
-                 cl::desc("The maximum size of a chain to create."));
+    MaxChainSize("ext-tsp-max-chain-size", cl::ReallyHidden, cl::init(512),
+                 cl::desc("The maximum size of a chain to create"));

 // The maximum size of a chain for splitting. Larger values of the threshold
 // may yield better quality at the cost of worsen run-time.
 static cl::opt<unsigned> ChainSplitThreshold(
     "ext-tsp-chain-split-threshold", cl::ReallyHidden, cl::init(128),
     cl::desc("The maximum size of a chain to apply splitting"));

-// The option enables splitting (large) chains along in-coming and out-going
-// jumps. This typically results in a better quality.
-static cl::opt<bool> EnableChainSplitAlongJumps(
-    "ext-tsp-enable-chain-split-along-jumps", cl::ReallyHidden, cl::init(true),
-    cl::desc("The maximum size of a chain to apply splitting"));
+// The maximum ratio between densities of two chains for merging.
+static cl::opt<double> MaxMergeDensityRatio(
+    "ext-tsp-max-merge-density-ratio", cl::ReallyHidden, cl::init(100),
+    cl::desc("The maximum ratio between densities of two chains for merging"));

 // Algorithm-specific options for CDS.
 static cl::opt<unsigned> CacheEntries("cds-cache-entries", cl::ReallyHidden,
@@ -226,6 +225,9 @@ struct NodeT {
   bool isEntry() const { return Index == 0; }

+  // Check if Other is a successor of the node.
+  bool isSuccessor(const NodeT *Other) const;
+
   // The total execution count of outgoing jumps.
   uint64_t outCount() const;
@@ -289,7 +291,7 @@ struct ChainT {
   size_t numBlocks() const { return Nodes.size(); }

-  double density() const { return static_cast<double>(ExecutionCount) / Size; }
+  double density() const { return ExecutionCount / Size; }

   bool isEntry() const { return Nodes[0]->Index == 0; }
@@ -350,8 +352,9 @@ struct ChainT {
   uint64_t Id;
   // Cached ext-tsp score for the chain.
   double Score{0};
-  // The total execution count of the chain.
-  uint64_t ExecutionCount{0};
+  // The total execution count of the chain. Since the execution count of
+  // a basic block is uint64_t, using doubles here to avoid overflow.
+  double ExecutionCount{0};
   // The total size of the chain.
   uint64_t Size{0};
   // Nodes of the chain.
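
The switch of `ExecutionCount` from `uint64_t` to `double` can be illustrated in isolation. The following is a minimal sketch, not code from the commit: `ChainSketch`, its `add` method, and `wrappingSum` are hypothetical names introduced only to contrast a floating-point accumulator with an integer one that wraps modulo 2^64.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a chain; only the fields needed for density.
struct ChainSketch {
  double ExecutionCount = 0; // accumulated as double, so it does not wrap
  uint64_t Size = 0;

  // Add one block's execution count and size to the chain totals.
  void add(uint64_t BlockCount, uint64_t BlockSize) {
    ExecutionCount += static_cast<double>(BlockCount);
    Size += BlockSize;
  }

  // double / uint64_t promotes Size to double: floating-point division.
  double density() const {
    assert(Size > 0 && "a chain of zero size");
    return ExecutionCount / Size;
  }
};

// The same sum kept in uint64_t silently wraps around modulo 2^64.
uint64_t wrappingSum(const std::vector<uint64_t> &Counts) {
  uint64_t Sum = 0;
  for (uint64_t C : Counts)
    Sum += C;
  return Sum;
}
```

With two blocks whose counts are both `UINT64_MAX`, the integer sum wraps to `UINT64_MAX - 1`, while the `double` accumulator keeps the magnitude (at the cost of rounding in the low bits), which is all that a density comparison needs.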
@@ -446,6 +449,13 @@ struct ChainEdge {
   bool CacheValidBackward{false};
 };

+bool NodeT::isSuccessor(const NodeT *Other) const {
+  for (JumpT *Jump : OutJumps)
+    if (Jump->Target == Other)
+      return true;
+  return false;
+}
+
 uint64_t NodeT::outCount() const {
   uint64_t Count = 0;
   for (JumpT *Jump : OutJumps)
@@ -514,8 +524,6 @@ struct MergedNodesT {
   const NodeT *getFirstNode() const { return *Begin1; }

-  bool empty() const { return Begin1 == End1; }
-
 private:
   NodeIter Begin1;
   NodeIter End1;
@@ -639,7 +647,8 @@ class ExtTSPImpl {
       }
     }
     for (JumpT &Jump : AllJumps) {
-      assert(OutDegree[Jump.Source->Index] > 0);
+      assert(OutDegree[Jump.Source->Index] > 0 &&
+             "incorrectly computed out-degree of the block");
       Jump.IsConditional = OutDegree[Jump.Source->Index] > 1;
     }
@@ -741,12 +750,23 @@ class ExtTSPImpl {
         // Get candidates for merging with the current chain.
         for (const auto &[ChainSucc, Edge] : ChainPred->Edges) {
           // Ignore loop edges.
-          if (ChainPred == ChainSucc)
+          if (Edge->isSelfEdge())
             continue;
-
-          // Stop early if the combined chain violates the maximum allowed size.
+          // Skip the merge if the combined chain violates the maximum specified
+          // size.
           if (ChainPred->numBlocks() + ChainSucc->numBlocks() >= MaxChainSize)
             continue;
+          // Don't merge the chains if they have vastly different densities.
+          // Skip the merge if the ratio between the densities exceeds
+          // MaxMergeDensityRatio. Smaller values of the option result in fewer
+          // merges, and hence, more chains.
+          auto [minDensity, maxDensity] =
+              std::minmax(ChainPred->density(), ChainSucc->density());
+          assert(minDensity > 0.0 && maxDensity > 0.0 &&
+                 "incorrectly computed chain densities");
+          const double Ratio = maxDensity / minDensity;
+          if (Ratio > MaxMergeDensityRatio)
+            continue;

           // Compute the gain of merging the two chains.
           MergeGainT CurGain = getBestMergeGain(ChainPred, ChainSucc, Edge);
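
The density-ratio filter in this hunk can be read as a standalone predicate. The sketch below is illustrative only: `densitiesAllowMerge` is a hypothetical helper, and its default of `100.0` simply mirrors the `cl::init(100)` value of `ext-tsp-max-merge-density-ratio`. Note that `std::minmax` returns references to its arguments, so binding its result when the arguments are temporaries (as with the two `density()` calls in the hunk) can leave dangling references; the sketch sidesteps that by operating on named parameters.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: returns true when two chains are close enough in
// density to remain merge candidates, false when the merge should be skipped.
bool densitiesAllowMerge(double PredDensity, double SuccDensity,
                         double MaxRatio = 100.0) {
  assert(PredDensity > 0.0 && SuccDensity > 0.0 &&
         "incorrectly computed chain densities");
  // The arguments are named lvalues, so the returned references stay valid.
  auto [MinDensity, MaxDensity] = std::minmax(PredDensity, SuccDensity);
  return MaxDensity / MinDensity <= MaxRatio;
}
```

The check is symmetric in its two densities: `densitiesAllowMerge(1.0, 50.0)` and `densitiesAllowMerge(50.0, 1.0)` both hold, while a pair such as `(1.0, 101.0)` is rejected under the default ratio. Lowering the ratio rejects more pairs, which matches the comment in the hunk about fewer merges and more chains.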
@@ -858,36 +878,42 @@ class ExtTSPImpl {
     Gain.updateIfLessThan(
         computeMergeGain(ChainPred, ChainSucc, Jumps, 0, MergeTypeT::X_Y));

-    if (EnableChainSplitAlongJumps) {
-      // Attach (a part of) ChainPred before the first node of ChainSucc.
-      for (JumpT *Jump : ChainSucc->Nodes.front()->InJumps) {
-        const NodeT *SrcBlock = Jump->Source;
-        if (SrcBlock->CurChain != ChainPred)
-          continue;
-        size_t Offset = SrcBlock->CurIndex + 1;
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::X2_X1_Y});
-      }
+    // Attach (a part of) ChainPred before the first node of ChainSucc.
+    for (JumpT *Jump : ChainSucc->Nodes.front()->InJumps) {
+      const NodeT *SrcBlock = Jump->Source;
+      if (SrcBlock->CurChain != ChainPred)
+        continue;
+      size_t Offset = SrcBlock->CurIndex + 1;
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::X2_X1_Y});
+    }

-      // Attach (a part of) ChainPred after the last node of ChainSucc.
-      for (JumpT *Jump : ChainSucc->Nodes.back()->OutJumps) {
-        const NodeT *DstBlock = Jump->Target;
-        if (DstBlock->CurChain != ChainPred)
-          continue;
-        size_t Offset = DstBlock->CurIndex;
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1});
-      }
+    // Attach (a part of) ChainPred after the last node of ChainSucc.
+    for (JumpT *Jump : ChainSucc->Nodes.back()->OutJumps) {
+      const NodeT *DstBlock = Jump->Target;
+      if (DstBlock->CurChain != ChainPred)
+        continue;
+      size_t Offset = DstBlock->CurIndex;
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1});
     }

     // Try to break ChainPred in various ways and concatenate with ChainSucc.
     if (ChainPred->Nodes.size() <= ChainSplitThreshold) {
       for (size_t Offset = 1; Offset < ChainPred->Nodes.size(); Offset++) {
-        // Try to split the chain in different ways. In practice, applying
-        // X2_Y_X1 merging is almost never provides benefits; thus, we exclude
-        // it from consideration to reduce the search space.
+        // Do not split the chain along a fall-through jump. One of the two
+        // loops above may still "break" such a jump whenever it results in a
+        // new fall-through.
+        const NodeT *BB = ChainPred->Nodes[Offset - 1];
+        const NodeT *BB2 = ChainPred->Nodes[Offset];
+        if (BB->isSuccessor(BB2))
+          continue;
+
+        // In practice, applying X2_Y_X1 merging almost never provides benefits;
+        // thus, we exclude it from consideration to reduce the search space.
         tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1,
                                  MergeTypeT::X2_X1_Y});
       }
     }
+
     Edge->setCachedMergeGain(ChainPred, ChainSucc, Gain);
     return Gain;
   }
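
The new pruning rule in the splitting loop (skip any offset that would cut an existing fall-through) can be sketched on its own. This is a simplified model under stated assumptions: `BlockSketch` and `candidateSplitOffsets` are hypothetical names, and successors are stored as a plain pointer list rather than through the `JumpT` objects used in `CodeLayout.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified block: successors stored directly as pointers.
struct BlockSketch {
  std::vector<const BlockSketch *> Succs;

  // Linear scan over the successors, in the spirit of NodeT::isSuccessor.
  bool isSuccessor(const BlockSketch *Other) const {
    assert(Other && "expected a non-null block");
    for (const BlockSketch *S : Succs)
      if (S == Other)
        return true;
    return false;
  }
};

// Collect the split offsets that remain after skipping positions where the
// block at Offset - 1 already jumps (falls through) to the block at Offset.
std::vector<size_t>
candidateSplitOffsets(const std::vector<const BlockSketch *> &Chain) {
  std::vector<size_t> Offsets;
  for (size_t Offset = 1; Offset < Chain.size(); Offset++) {
    if (Chain[Offset - 1]->isSuccessor(Chain[Offset]))
      continue; // splitting here would break a fall-through
    Offsets.push_back(Offset);
  }
  return Offsets;
}
```

For a chain `A, B, C, D` with jumps `A->B`, `B->D`, and `C->D`, only offset 2 (between `B` and `C`) survives, so the number of `tryChainMerging` attempts in a loop like the one above would drop from three offsets to one.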
@@ -946,22 +972,11 @@ class ExtTSPImpl {
   /// Concatenate all chains into the final order.
   std::vector<uint64_t> concatChains() {
-    // Collect chains and calculate density stats for their sorting.
+    // Collect non-empty chains.
     std::vector<const ChainT *> SortedChains;
-    DenseMap<const ChainT *, double> ChainDensity;
     for (ChainT &Chain : AllChains) {
-      if (!Chain.Nodes.empty()) {
+      if (!Chain.Nodes.empty())
         SortedChains.push_back(&Chain);
-        // Using doubles to avoid overflow of ExecutionCounts.
-        double Size = 0;
-        double ExecutionCount = 0;
-        for (NodeT *Node : Chain.Nodes) {
-          Size += static_cast<double>(Node->Size);
-          ExecutionCount += static_cast<double>(Node->ExecutionCount);
-        }
-        assert(Size > 0 && "a chain of zero size");
-        ChainDensity[&Chain] = ExecutionCount / Size;
-      }
     }

     // Sorting chains by density in the decreasing order.
@@ -971,11 +986,9 @@ class ExtTSPImpl {
                 if (L->isEntry() != R->isEntry())
                   return L->isEntry();

-                const double DL = ChainDensity[L];
-                const double DR = ChainDensity[R];
                 // Compare by density and break ties by chain identifiers.
-                return std::make_tuple(-DL, L->Id) <
-                       std::make_tuple(-DR, R->Id);
+                return std::make_tuple(-L->density(), L->Id) <
+                       std::make_tuple(-R->density(), R->Id);
               });

     // Collect the nodes in the order specified by their chains.
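
The final ordering of chains (entry chain first, then decreasing density, ties broken by identifier) relies on lexicographic tuple comparison with a negated density. The following sketch shows that comparator on a hypothetical flat record, `ChainInfo`, where the density is stored directly instead of being computed by `ChainT::density()`; `sortChains` is likewise an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical flat record standing in for a chain.
struct ChainInfo {
  uint64_t Id;
  bool IsEntry;
  double Density;
};

// Sort so that the entry chain comes first, then denser chains, then lower Id.
void sortChains(std::vector<ChainInfo> &Chains) {
  // Assumption of this sketch: at most one chain holds the entry block.
  assert(std::count_if(Chains.begin(), Chains.end(),
                       [](const ChainInfo &C) { return C.IsEntry; }) <= 1 &&
         "at most one entry chain");
  std::sort(Chains.begin(), Chains.end(),
            [](const ChainInfo &L, const ChainInfo &R) {
              // Place the entry chain at the beginning of the order.
              if (L.IsEntry != R.IsEntry)
                return L.IsEntry;
              // Negating the density turns "ascending" into "descending".
              return std::make_tuple(-L.Density, L.Id) <
                     std::make_tuple(-R.Density, R.Id);
            });
}
```

Because the identifier participates in the comparison, two chains with equal density always end up in the same relative order, which keeps the resulting layout deterministic.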

llvm/test/CodeGen/X86/code_placement_ext_tsp.ll

Lines changed: 2 additions & 11 deletions
@@ -1,6 +1,5 @@
 ;; See also llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
 ; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
-; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=0 -ext-tsp-enable-chain-split-along-jumps=0 < %s | FileCheck %s -check-prefix=CHECK2

 define void @func1a() {
 ; Test that the algorithm positions the most likely successor first
@@ -329,8 +328,8 @@ end:
 }

 define void @func4() !prof !11 {
-; Test verifying that, if enabled, chains can be split in order to improve the
-; objective (by creating more fallthroughs)
+; Test verifying that chains can be split in order to improve the objective
+; by creating more fallthroughs
 ;
 ; +-------+
 ; | entry |--------+
@@ -354,19 +353,11 @@ define void @func4() !prof !11 {
 ; | b2 | <+ ----+
 ; +-------+
 ;
-; With chain splitting enabled:
 ; CHECK-LABEL: func4:
 ; CHECK: entry
 ; CHECK: b1
 ; CHECK: b3
 ; CHECK: b2
-;
-; With chain splitting disabled:
-; CHECK2-LABEL: func4:
-; CHECK2: entry
-; CHECK2: b1
-; CHECK2: b2
-; CHECK2: b3

 entry:
   call void @b()

llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 @yydebug = dso_local global i32 0, align 4

 define void @func_large() !prof !0 {
-; A largee CFG instance where chain splitting helps to
+; A large CFG instance where chain splitting helps to
 ; compute a better basic block ordering. The test verifies that with chain
 ; splitting, the resulting layout is improved (e.g., the score is increased).
 ;
