Closed
Description
The current implementation of Ord::cmp
for integral types results in less than optimal code. Currently the implementation looks something like this:
fn cmp(&self, other: &$t) -> Ordering {
if *self == *other { Equal }
else if *self < *other { Less }
else { Greater }
}
This results in the following IR:
%2 = icmp eq i64 %0, %1
%3 = icmp ult i64 %0, %1
%..i = select i1 %3, i8 -1, i8 1
%_0.0.i = select i1 %2, i8 0, i8 %..i
ret i8 %_0.0.i
which, on x86, becomes
cmpq %rsi, %rdi
setae %al
cmpq %rsi, %rdi
je .LBB0_1
addb %al, %al
addb $-1, %al
; exit (retq)
.LBB0_1:
xorl %eax, %eax
; exit (retq)
where the critical path looks like this (courtesy of llvm-mca -mcpu=broadwell
):
Instruction Dependency Information
+----< 0. cmpq %rsi, %rdi
+----> 1. setae %al ## REGISTER dependency: %flags
| 2. cmpq %rsi, %rdi
| 3. je .LBB0_1
+----> 4. addb %al, %al ## REGISTER dependency: %al
+----> 5. addb $-1, %al ## REGISTER dependency: %al
| 6. ; exit
| 7. xorl %eax, %eax
| 8. ; exit
|
| < loop carried >
|
+----> 2. cmpq %rsi, %rdi ## RESOURCE interference: BWPort1 [ probability: 25% ]
if the implementation instead was:
if a < b {
std::cmp::Ordering::Less
} else if a > b {
std::cmp::Ordering::Greater
} else {
std::cmp::Ordering::Equal
}
the IR would become
%0 = icmp ult i64 %a, %b
%1 = icmp ugt i64 %a, %b
%. = zext i1 %1 to i8
%_0.0 = select i1 %0, i8 -1, i8 %.
ret i8 %_0.0
which in turn would compile down to
movb $-1, %al
cmpq %rsi, %rdi
seta %cl
jb .LBB1_2
movl %ecx, %eax
.LBB1_2:
; exit (retq)
for which the critical path (as expected) only becomes visible on a 2nd iteration:
Instruction Dependency Information
+----< 2. seta %cl
|
| < loop carried >
|
| 0. movb $-1, %al
+----> 1. cmpq %rsi, %rdi ## RESOURCE interference: BWPort6 [ probability: 97% ]
+----> 2. seta %cl ## REGISTER dependency: %flags
| 3. jb .LBB1_2
+----> 4. movl %ecx, %eax ## REGISTER dependency: %cl
llvm-mca reports that the reciprocal throughput for the improved version is 1.5 or lower (getting as low as 1.3 on znver1; lower is better for reciprocal throughput) for various x86 architectures whereas the old code always exceeds 2.0, reaching 3.0 on core2
.
Metadata
Metadata
Assignees
Labels
Category: An issue proposing an enhancement or a PR with one.Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.Call for participation: Help is requested to fix this issue.Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.Issue: Problems and improvements with respect to performance of generated code.Relevant to the library API team, which will review and decide on the PR/issue.