[CodeGenPrepare] Unfold slow ctpop when used in power-of-two test #102731

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged (5 commits) on Apr 23, 2025
Changes from 4 commits
59 changes: 59 additions & 0 deletions llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -474,6 +474,7 @@ class CodeGenPrepare {
bool optimizeURem(Instruction *Rem);
bool combineToUSubWithOverflow(CmpInst *Cmp, ModifyDT &ModifiedDT);
bool combineToUAddWithOverflow(CmpInst *Cmp, ModifyDT &ModifiedDT);
bool unfoldPow2Test(CmpInst *Cmp);
void verifyBFIUpdates(Function &F);
bool _run(Function &F);
};
@@ -1762,6 +1763,61 @@ bool CodeGenPrepare::combineToUSubWithOverflow(CmpInst *Cmp,
return true;
}

// Decanonicalizes icmp+ctpop power-of-two test if ctpop is slow.
bool CodeGenPrepare::unfoldPow2Test(CmpInst *Cmp) {
CmpPredicate Pred;
Value *X;
const APInt *C;

// (icmp (ctpop x), c)
if (!match(Cmp, m_ICmp(Pred, m_Intrinsic<Intrinsic::ctpop>(m_Value(X)),
m_APIntAllowPoison(C))))
return false;

// This transformation increases the number of instructions, don't do it if
// ctpop is fast.
Type *OpTy = X->getType();
if (TLI->isCtpopFast(TLI->getValueType(*DL, OpTy)))
return false;
Contributor: Could sink this after checking the IR is the right shape.

Contributor Author: Sank it down.

// ctpop(x) u< 2 -> (x & (x - 1)) == 0
// ctpop(x) u> 1 -> (x & (x - 1)) != 0
// Also handles ctpop(x) == 1 and ctpop(x) != 1 if ctpop(x) is known non-zero.
if ((Pred == CmpInst::ICMP_ULT && *C == 2) ||
(Pred == CmpInst::ICMP_UGT && *C == 1) ||
(ICmpInst::isEquality(Pred) && *C == 1 &&
isKnownNonZero(Cmp->getOperand(0), *DL))) {
IRBuilder<> Builder(Cmp);
Value *Sub = Builder.CreateAdd(X, Constant::getAllOnesValue(OpTy));
Value *And = Builder.CreateAnd(X, Sub);
CmpInst::Predicate NewPred =
(Pred == CmpInst::ICMP_ULT || Pred == CmpInst::ICMP_EQ)
? CmpInst::ICMP_EQ
: CmpInst::ICMP_NE;
Value *NewCmp =
Builder.CreateICmp(NewPred, And, ConstantInt::getNullValue(OpTy));
Cmp->replaceAllUsesWith(NewCmp);
RecursivelyDeleteTriviallyDeadInstructions(Cmp);
return true;
}

// ctpop(x) == 1 -> (x ^ (x - 1)) u> (x - 1)
// ctpop(x) != 1 -> (x ^ (x - 1)) u<= (x - 1)
if (ICmpInst::isEquality(Pred) && *C == 1) {
Contributor: Does this really need to test both forms? I would hope these canonicalize one way or the other.

Contributor Author: I think so, this code compiles to two different icmps (one to eq, the other to ne):

void bar();
int test_eq(int x, int y) { if (__builtin_popcount(x) == 1 && y) bar(); }
int test_ne(int x, int y) { if (__builtin_popcount(x) != 1 && y) bar(); }
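
For reference, a rough sketch (illustrative only, not actual clang output; function names are made up) of the IR shape that reaches CodeGenPrepare for the two cases, showing the eq and ne comparisons against 1:

define i1 @is_pow2_eq(i32 %x) {
  %pop = call i32 @llvm.ctpop.i32(i32 %x)   ; population count of x
  %cmp = icmp eq i32 %pop, 1                ; the == 1 form
  ret i1 %cmp
}

define i1 @is_pow2_ne(i32 %x) {
  %pop = call i32 @llvm.ctpop.i32(i32 %x)
  %cmp = icmp ne i32 %pop, 1                ; the != 1 form
  ret i1 %cmp
}

declare i32 @llvm.ctpop.i32(i32)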

Contributor: Yes, I get handling the inverted and non-inverted forms. But I mean these two blocks of conditions.

Contributor Author (@s-barannikov, Apr 21, 2025): The middle end doesn't do this until CGP. adjustIsPower2Test changes ==/!= 1 --> u< 2 / u> 1 if ctpop(x) is known non-zero.
CGP doesn't revisit an instruction if it was optimized once, so this unfoldPow2Test has to be called before adjustIsPower2Test. I guess these two functions can be joined to one optimizePow2Test, should I do that?
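
To make the known-non-zero case concrete, a minimal sketch (illustrative only, and it assumes isKnownNonZero can prove the ctpop result non-zero through the or below), which would be eligible for the first (and/eq) expansion rather than the xor form:

define i1 @is_pow2_known_nonzero(i32 %a) {
  %x = or i32 %a, 1                         ; %x is known non-zero
  %pop = call i32 @llvm.ctpop.i32(i32 %x)
  %cmp = icmp eq i32 %pop, 1                ; handled like ctpop(x) u< 2
  ret i1 %cmp
}

declare i32 @llvm.ctpop.i32(i32)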

Contributor Author: Merged with adjustIsPower2Test, does this look better?

IRBuilder<> Builder(Cmp);
Value *Sub = Builder.CreateAdd(X, Constant::getAllOnesValue(OpTy));
Value *Xor = Builder.CreateXor(X, Sub);
CmpInst::Predicate NewPred =
Pred == CmpInst::ICMP_EQ ? CmpInst::ICMP_UGT : CmpInst::ICMP_ULE;
Value *NewCmp = Builder.CreateICmp(NewPred, Xor, Sub);
Cmp->replaceAllUsesWith(NewCmp);
RecursivelyDeleteTriviallyDeadInstructions(Cmp);
return true;
Contributor: Can you please add an alive2 proof for the expansion? This probably needs freeze due to use duplication.

Contributor Author: It does not look like it needs a freeze (unless I'm missing something):
https://alive2.llvm.org/ce/z/ARzh99

I've updated the description with the link.
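
Not the proof at the link above, but a sketch of what the src/tgt pair might look like for the == 1 expansion at i32 (the != 1 form is the same with ne and ule):

define i1 @src(i32 %x) {
  %pop = call i32 @llvm.ctpop.i32(i32 %x)
  %cmp = icmp eq i32 %pop, 1
  ret i1 %cmp
}

define i1 @tgt(i32 %x) {
  %sub = add i32 %x, -1                     ; x - 1
  %xor = xor i32 %x, %sub                   ; x ^ (x - 1)
  %cmp = icmp ugt i32 %xor, %sub            ; strictly greater only when exactly one bit is set
  ret i1 %cmp
}

declare i32 @llvm.ctpop.i32(i32)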

}

return false;
}

/// Sink the given CmpInst into user blocks to reduce the number of virtual
/// registers that must be created and coalesced. This is a clear win except on
/// targets with multiple condition code registers (PowerPC), where it might
@@ -2183,6 +2239,9 @@ bool CodeGenPrepare::optimizeCmp(CmpInst *Cmp, ModifyDT &ModifiedDT) {
if (combineToUSubWithOverflow(Cmp, ModifiedDT))
return true;

if (unfoldPow2Test(Cmp))
return true;

if (foldICmpWithDominatingICmp(Cmp, *TLI))
return true;
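
Taken together, a minimal before/after IR sketch of the unfold for the u< 2 form on a target where ctpop is slow (illustrative only; the real coverage is in the .ll tests below):

define i1 @before(i32 %x) {
  %pop = call i32 @llvm.ctpop.i32(i32 %x)
  %cmp = icmp ult i32 %pop, 2               ; at most one bit set
  ret i1 %cmp
}

define i1 @after(i32 %x) {
  %sub = add i32 %x, -1                     ; x - 1
  %and = and i32 %x, %sub                   ; clears the lowest set bit
  %cmp = icmp eq i32 %and, 0                ; zero iff x had at most one bit set
  ret i1 %cmp
}

declare i32 @llvm.ctpop.i32(i32)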

16 changes: 8 additions & 8 deletions llvm/test/CodeGen/PowerPC/vector-popcnt-128-ult-ugt.ll
@@ -11945,23 +11945,23 @@ define <2 x i64> @ugt_1_v2i64(<2 x i64> %0) {
; PWR5-LABEL: ugt_1_v2i64:
; PWR5: # %bb.0:
; PWR5-NEXT: addi 5, 3, -1
; PWR5-NEXT: addi 6, 4, -1
; PWR5-NEXT: and 3, 3, 5
; PWR5-NEXT: addi 5, 4, -1
; PWR5-NEXT: and 4, 4, 6
; PWR5-NEXT: subfic 3, 3, 0
; PWR5-NEXT: subfe 3, 3, 3
; PWR5-NEXT: and 4, 4, 5
; PWR5-NEXT: subfic 4, 4, 0
; PWR5-NEXT: subfe 4, 4, 4
; PWR5-NEXT: blr
;
; PWR6-LABEL: ugt_1_v2i64:
; PWR6: # %bb.0:
; PWR6-NEXT: addi 5, 3, -1
; PWR6-NEXT: addi 6, 4, -1
; PWR6-NEXT: and 3, 3, 5
; PWR6-NEXT: addi 5, 4, -1
; PWR6-NEXT: and 4, 4, 6
; PWR6-NEXT: subfic 3, 3, 0
; PWR6-NEXT: subfe 3, 3, 3
; PWR6-NEXT: and 4, 4, 5
; PWR6-NEXT: subfic 4, 4, 0
; PWR6-NEXT: subfe 4, 4, 4
; PWR6-NEXT: blr
@@ -12016,23 +12016,23 @@ define <2 x i64> @ult_2_v2i64(<2 x i64> %0) {
; PWR5-LABEL: ult_2_v2i64:
; PWR5: # %bb.0:
; PWR5-NEXT: addi 5, 3, -1
; PWR5-NEXT: addi 6, 4, -1
; PWR5-NEXT: and 3, 3, 5
; PWR5-NEXT: addi 5, 4, -1
; PWR5-NEXT: and 4, 4, 6
; PWR5-NEXT: addic 3, 3, -1
; PWR5-NEXT: subfe 3, 3, 3
; PWR5-NEXT: and 4, 4, 5
; PWR5-NEXT: addic 4, 4, -1
; PWR5-NEXT: subfe 4, 4, 4
; PWR5-NEXT: blr
;
; PWR6-LABEL: ult_2_v2i64:
; PWR6: # %bb.0:
; PWR6-NEXT: addi 5, 3, -1
; PWR6-NEXT: addi 6, 4, -1
; PWR6-NEXT: and 3, 3, 5
; PWR6-NEXT: addi 5, 4, -1
; PWR6-NEXT: and 4, 4, 6
; PWR6-NEXT: addic 3, 3, -1
; PWR6-NEXT: subfe 3, 3, 3
; PWR6-NEXT: and 4, 4, 5
; PWR6-NEXT: addic 4, 4, -1
; PWR6-NEXT: subfe 4, 4, 4
; PWR6-NEXT: blr
205 changes: 41 additions & 164 deletions llvm/test/CodeGen/RISCV/GlobalISel/rv32zbb.ll
@@ -357,49 +357,14 @@ define i64 @ctpop_i64(i64 %a) nounwind {
define i1 @ctpop_i64_ugt_two(i64 %a) nounwind {
; RV32I-LABEL: ctpop_i64_ugt_two:
; RV32I: # %bb.0:
; RV32I-NEXT: j .LBB6_2
; RV32I-NEXT: # %bb.1:
; RV32I-NEXT: sltiu a0, zero, 0
; RV32I-NEXT: ret
; RV32I-NEXT: .LBB6_2:
; RV32I-NEXT: srli a2, a0, 1
; RV32I-NEXT: lui a3, 349525
; RV32I-NEXT: lui a4, 209715
; RV32I-NEXT: srli a5, a1, 1
; RV32I-NEXT: addi a3, a3, 1365
; RV32I-NEXT: and a2, a2, a3
; RV32I-NEXT: and a3, a5, a3
; RV32I-NEXT: lui a5, 61681
; RV32I-NEXT: addi a4, a4, 819
; RV32I-NEXT: addi a5, a5, -241
; RV32I-NEXT: sub a0, a0, a2
; RV32I-NEXT: sub a1, a1, a3
; RV32I-NEXT: srli a2, a0, 2
; RV32I-NEXT: and a0, a0, a4
; RV32I-NEXT: srli a3, a1, 2
; RV32I-NEXT: and a1, a1, a4
; RV32I-NEXT: and a2, a2, a4
; RV32I-NEXT: and a3, a3, a4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: srli a2, a0, 4
; RV32I-NEXT: srli a3, a1, 4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: and a0, a0, a5
; RV32I-NEXT: and a1, a1, a5
; RV32I-NEXT: slli a2, a0, 8
; RV32I-NEXT: slli a3, a1, 8
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: slli a2, a0, 16
; RV32I-NEXT: slli a3, a1, 16
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: srli a0, a0, 24
; RV32I-NEXT: srli a1, a1, 24
; RV32I-NEXT: add a0, a1, a0
; RV32I-NEXT: sltiu a0, a0, 2
; RV32I-NEXT: addi a2, a0, -1
; RV32I-NEXT: addi a3, a1, -1
; RV32I-NEXT: sltiu a4, a2, -1
; RV32I-NEXT: add a3, a3, a4
; RV32I-NEXT: and a0, a0, a2
; RV32I-NEXT: and a1, a1, a3
; RV32I-NEXT: or a0, a0, a1
; RV32I-NEXT: seqz a0, a0
; RV32I-NEXT: ret
;
; RV32ZBB-LABEL: ctpop_i64_ugt_two:
@@ -422,50 +387,14 @@ define i1 @ctpop_i64_ugt_one(i64 %a) nounwind {
define i1 @ctpop_i64_ugt_one(i64 %a) nounwind {
; RV32I-LABEL: ctpop_i64_ugt_one:
; RV32I: # %bb.0:
; RV32I-NEXT: j .LBB7_2
; RV32I-NEXT: # %bb.1:
; RV32I-NEXT: snez a0, zero
; RV32I-NEXT: ret
; RV32I-NEXT: .LBB7_2:
; RV32I-NEXT: srli a2, a0, 1
; RV32I-NEXT: lui a3, 349525
; RV32I-NEXT: lui a4, 209715
; RV32I-NEXT: srli a5, a1, 1
; RV32I-NEXT: addi a3, a3, 1365
; RV32I-NEXT: and a2, a2, a3
; RV32I-NEXT: and a3, a5, a3
; RV32I-NEXT: lui a5, 61681
; RV32I-NEXT: addi a4, a4, 819
; RV32I-NEXT: addi a5, a5, -241
; RV32I-NEXT: sub a0, a0, a2
; RV32I-NEXT: sub a1, a1, a3
; RV32I-NEXT: srli a2, a0, 2
; RV32I-NEXT: and a0, a0, a4
; RV32I-NEXT: srli a3, a1, 2
; RV32I-NEXT: and a1, a1, a4
; RV32I-NEXT: and a2, a2, a4
; RV32I-NEXT: and a3, a3, a4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: srli a2, a0, 4
; RV32I-NEXT: srli a3, a1, 4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: and a0, a0, a5
; RV32I-NEXT: and a1, a1, a5
; RV32I-NEXT: slli a2, a0, 8
; RV32I-NEXT: slli a3, a1, 8
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: slli a2, a0, 16
; RV32I-NEXT: slli a3, a1, 16
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: srli a0, a0, 24
; RV32I-NEXT: srli a1, a1, 24
; RV32I-NEXT: add a0, a1, a0
; RV32I-NEXT: sltiu a0, a0, 2
; RV32I-NEXT: xori a0, a0, 1
; RV32I-NEXT: addi a2, a0, -1
; RV32I-NEXT: addi a3, a1, -1
; RV32I-NEXT: sltiu a4, a2, -1
; RV32I-NEXT: add a3, a3, a4
; RV32I-NEXT: and a0, a0, a2
; RV32I-NEXT: and a1, a1, a3
; RV32I-NEXT: or a0, a0, a1
; RV32I-NEXT: snez a0, a0
; RV32I-NEXT: ret
;
; RV32ZBB-LABEL: ctpop_i64_ugt_one:
@@ -489,45 +418,18 @@ define i1 @ctpop_i64_eq_one(i64 %a) nounwind {
define i1 @ctpop_i64_eq_one(i64 %a) nounwind {
; RV32I-LABEL: ctpop_i64_eq_one:
; RV32I: # %bb.0:
; RV32I-NEXT: srli a2, a0, 1
; RV32I-NEXT: lui a3, 349525
; RV32I-NEXT: lui a4, 209715
; RV32I-NEXT: srli a5, a1, 1
; RV32I-NEXT: addi a3, a3, 1365
; RV32I-NEXT: and a2, a2, a3
; RV32I-NEXT: and a3, a5, a3
; RV32I-NEXT: lui a5, 61681
; RV32I-NEXT: addi a4, a4, 819
; RV32I-NEXT: addi a5, a5, -241
; RV32I-NEXT: sub a0, a0, a2
; RV32I-NEXT: sub a1, a1, a3
; RV32I-NEXT: srli a2, a0, 2
; RV32I-NEXT: and a0, a0, a4
; RV32I-NEXT: srli a3, a1, 2
; RV32I-NEXT: and a1, a1, a4
; RV32I-NEXT: and a2, a2, a4
; RV32I-NEXT: and a3, a3, a4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: srli a2, a0, 4
; RV32I-NEXT: srli a3, a1, 4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: and a0, a0, a5
; RV32I-NEXT: and a1, a1, a5
; RV32I-NEXT: slli a2, a0, 8
; RV32I-NEXT: slli a3, a1, 8
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: slli a2, a0, 16
; RV32I-NEXT: slli a3, a1, 16
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: srli a0, a0, 24
; RV32I-NEXT: srli a1, a1, 24
; RV32I-NEXT: add a0, a1, a0
; RV32I-NEXT: xori a0, a0, 1
; RV32I-NEXT: seqz a0, a0
; RV32I-NEXT: addi a2, a0, -1
; RV32I-NEXT: sltiu a3, a2, -1
; RV32I-NEXT: addi a4, a1, -1
; RV32I-NEXT: add a3, a4, a3
; RV32I-NEXT: xor a1, a1, a3
; RV32I-NEXT: beq a1, a3, .LBB8_2
; RV32I-NEXT: # %bb.1:
; RV32I-NEXT: sltu a0, a3, a1
; RV32I-NEXT: ret
; RV32I-NEXT: .LBB8_2:
; RV32I-NEXT: xor a0, a0, a2
; RV32I-NEXT: sltu a0, a2, a0
; RV32I-NEXT: ret
;
; RV32ZBB-LABEL: ctpop_i64_eq_one:
@@ -546,45 +448,20 @@ define i1 @ctpop_i64_ne_one(i64 %a) nounwind {
define i1 @ctpop_i64_ne_one(i64 %a) nounwind {
; RV32I-LABEL: ctpop_i64_ne_one:
; RV32I: # %bb.0:
; RV32I-NEXT: srli a2, a0, 1
; RV32I-NEXT: lui a3, 349525
; RV32I-NEXT: lui a4, 209715
; RV32I-NEXT: srli a5, a1, 1
; RV32I-NEXT: addi a3, a3, 1365
; RV32I-NEXT: and a2, a2, a3
; RV32I-NEXT: and a3, a5, a3
; RV32I-NEXT: lui a5, 61681
; RV32I-NEXT: addi a4, a4, 819
; RV32I-NEXT: addi a5, a5, -241
; RV32I-NEXT: sub a0, a0, a2
; RV32I-NEXT: sub a1, a1, a3
; RV32I-NEXT: srli a2, a0, 2
; RV32I-NEXT: and a0, a0, a4
; RV32I-NEXT: srli a3, a1, 2
; RV32I-NEXT: and a1, a1, a4
; RV32I-NEXT: and a2, a2, a4
; RV32I-NEXT: and a3, a3, a4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: srli a2, a0, 4
; RV32I-NEXT: srli a3, a1, 4
; RV32I-NEXT: add a0, a2, a0
; RV32I-NEXT: add a1, a3, a1
; RV32I-NEXT: and a0, a0, a5
; RV32I-NEXT: and a1, a1, a5
; RV32I-NEXT: slli a2, a0, 8
; RV32I-NEXT: slli a3, a1, 8
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: slli a2, a0, 16
; RV32I-NEXT: slli a3, a1, 16
; RV32I-NEXT: add a0, a0, a2
; RV32I-NEXT: add a1, a1, a3
; RV32I-NEXT: srli a0, a0, 24
; RV32I-NEXT: srli a1, a1, 24
; RV32I-NEXT: add a0, a1, a0
; RV32I-NEXT: addi a2, a0, -1
; RV32I-NEXT: sltiu a3, a2, -1
; RV32I-NEXT: addi a4, a1, -1
; RV32I-NEXT: add a3, a4, a3
; RV32I-NEXT: xor a1, a1, a3
; RV32I-NEXT: beq a1, a3, .LBB9_2
; RV32I-NEXT: # %bb.1:
; RV32I-NEXT: sltu a0, a3, a1
; RV32I-NEXT: xori a0, a0, 1
; RV32I-NEXT: ret
; RV32I-NEXT: .LBB9_2:
; RV32I-NEXT: xor a0, a0, a2
; RV32I-NEXT: sltu a0, a2, a0
; RV32I-NEXT: xori a0, a0, 1
; RV32I-NEXT: snez a0, a0
; RV32I-NEXT: ret
;
; RV32ZBB-LABEL: ctpop_i64_ne_one: