Description
The current (2021-10-08 nightly) and recent implementations of `slice::rotate_right` produce 15x more instructions than a manual special-case implementation for a small rotate by a compile-time-constant offset. This results in a 5–50% performance difference, depending on the length of the slice in question.
Godbolt demonstration: https://godbolt.org/z/a4YdqP1rn
```rust
pub fn shift_after_fast(vec: &mut [usize], i: usize) {
    if i > vec.len() || (vec.len() - i) < 2 {
        return;
    }
    unsafe {
        let tmp = core::ptr::read(vec.as_ptr().add(vec.len() - 1));
        core::ptr::copy(
            vec.as_ptr().add(i),
            vec.as_mut_ptr().add(i + 1),
            vec.len() - i - 1,
        );
        core::ptr::write(vec.as_mut_ptr().add(i), tmp);
    }
}

pub fn shift_after(vec: &mut [usize], i: usize) {
    vec[i..].rotate_right(1);
}
```
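For reference, the two entry points are meant to be behaviorally equivalent. A quick sanity check (illustrative only; the function bodies are reproduced from above) confirms they produce the same result:

```rust
pub fn shift_after_fast(vec: &mut [usize], i: usize) {
    if i > vec.len() || (vec.len() - i) < 2 {
        return;
    }
    unsafe {
        let tmp = core::ptr::read(vec.as_ptr().add(vec.len() - 1));
        core::ptr::copy(
            vec.as_ptr().add(i),
            vec.as_mut_ptr().add(i + 1),
            vec.len() - i - 1,
        );
        core::ptr::write(vec.as_mut_ptr().add(i), tmp);
    }
}

pub fn shift_after(vec: &mut [usize], i: usize) {
    vec[i..].rotate_right(1);
}

fn main() {
    let mut a: Vec<usize> = (0..8).collect();
    let mut b = a.clone();
    shift_after_fast(&mut a, 3);
    shift_after(&mut b, 3);
    // Both rotate the tail [3, 4, 5, 6, 7] to [7, 3, 4, 5, 6].
    assert_eq!(a, b);
    assert_eq!(a, vec![0, 1, 2, 7, 3, 4, 5, 6]);
}
```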
I think the current implementation doesn't optimize well because it is a `loop { ... }` surrounding three rotate algorithms, each guarded by its own predicate. The first predicate is almost never known at compile time:

```rust
if (left + right < 24) || (mem::size_of::<T>() > mem::size_of::<[usize; 4]>()) {
```

and it is the second algorithm that we want to run in this case:

```rust
} else if cmp::min(left, right) <= mem::size_of::<BufType>() / mem::size_of::<T>() {
```
Simply pasting a copy of the second algorithm above the loop removes the performance difference and most, but not all, of the code-size difference.
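For context, the second branch is the buffer-based strategy: copy the smaller of the two sides into a stack buffer, shift the larger side over with a single memmove, and copy the buffer back. A minimal sketch of that strategy, specialized to a small right side (this is not the actual `ptr_rotate` code; the buffer size, names, and `right`-side specialization are mine):

```rust
// Hedged sketch of the buffer-based rotate: stash the trailing k
// elements in a stack buffer, shift the rest right by k in one
// memmove, then restore the buffered elements at the front.
fn rotate_right_small<T>(v: &mut [T], k: usize) {
    const BUF_SLOTS: usize = 32; // illustrative capacity, not the library's
    let n = v.len();
    assert!(k <= n && k <= BUF_SLOTS);
    if k == 0 || k == n {
        return;
    }
    unsafe {
        // An uninitialized stack buffer of MaybeUninit slots is sound here;
        // we only read back the k slots we wrote.
        let mut buf: [core::mem::MaybeUninit<T>; BUF_SLOTS] =
            core::mem::MaybeUninit::uninit().assume_init();
        // Copy the trailing k elements into the buffer.
        core::ptr::copy_nonoverlapping(
            v.as_ptr().add(n - k),
            buf.as_mut_ptr() as *mut T,
            k,
        );
        // Shift the first n - k elements right by k (one memmove).
        core::ptr::copy(v.as_ptr(), v.as_mut_ptr().add(k), n - k);
        // Move the buffered elements into the vacated front slots.
        core::ptr::copy_nonoverlapping(buf.as_ptr() as *const T, v.as_mut_ptr(), k);
    }
}

fn main() {
    let mut v: Vec<usize> = (0..8).collect();
    rotate_right_small(&mut v, 2);
    assert_eq!(v, vec![6, 7, 0, 1, 2, 3, 4, 5]);
}
```

With `k == 1` this degenerates to exactly the read/copy/write sequence in `shift_after_fast` above, which is why hoisting this branch recovers the hand-written code.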
cc @scottmcm (I've finally gotten around to reporting this after months)
```
rustc 1.57.0-nightly (54bb4fec6 2021-10-08)
binary: rustc
commit-hash: 54bb4fec68cb592e23077896baea072919721573
commit-date: 2021-10-08
host: x86_64-unknown-linux-gnu
release: 1.57.0-nightly
LLVM version: 13.0.0
```