Description
The compiler struggles to optimize the current default implementation of `position` on `Iterator`. Simplifying the implementation has increased efficiency in various scenarios on my machine:
Old `position()`:

```rust
#[inline]
fn check<T>(
    mut predicate: impl FnMut(T) -> bool,
) -> impl FnMut(usize, T) -> ControlFlow<usize, usize> {
    #[rustc_inherit_overflow_checks]
    move |i, x| {
        if predicate(x) { ControlFlow::Break(i) } else { ControlFlow::Continue(i + 1) }
    }
}
self.try_fold(0, check(predicate)).break_value()
```
Simplified version (several alternative implementations produce similar results):

```rust
let mut position = 0;
while let Some(x) = self.next() {
    if predicate(x) {
        return Some(position);
    }
    position += 1;
}
None
```
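As a quick sanity check (my own example, not part of this change), the simplified loop can be wrapped in a hypothetical free function `simple_position` and compared against the std `position`; both return the same index and, since both short-circuit, leave the iterator in the same state:

```rust
// Standalone sketch of the simplified loop (illustrative helper only; inside
// core the increment would additionally inherit overflow checks).
fn simple_position<I, P>(iter: &mut I, mut predicate: P) -> Option<usize>
where
    I: Iterator,
    P: FnMut(I::Item) -> bool,
{
    let mut position = 0;
    while let Some(x) = iter.next() {
        if predicate(x) {
            return Some(position);
        }
        position += 1;
    }
    None
}

fn main() {
    let data = [10, 20, 30, 40];
    let mut a = data.iter();
    let mut b = data.iter();
    // Same result as the default `position`...
    assert_eq!(simple_position(&mut a, |&x| x == 30), b.position(|&x| x == 30));
    // ...and both stop right after the match, so the iterators agree afterwards.
    assert_eq!(a.next(), b.next());
}
```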
Bench results:

```
position              fastest  │ slowest  │ median   │ mean     │ samples │ iters
├─ arcstr_new         103.4 µs │ 134.8 µs │ 105.4 µs │ 106.4 µs │ 118     │ 944
├─ arcstr_old         103.8 µs │ 143.7 µs │ 105.6 µs │ 107.1 µs │ 117     │ 936
├─ big_range_new      33.33 ns │ 25.79 µs │ 34.95 ns │ 36.06 ns │ 642049  │ 642049
├─ big_range_old      805 µs   │ 893.4 µs │ 828.3 µs │ 830 µs   │ 121     │ 121
├─ small_range_new    547.3 ns │ 60.17 µs │ 622 ns   │ 659 ns   │ 128056  │ 128056
├─ small_range_old    3.174 µs │ 54.47 µs │ 3.195 µs │ 3.298 µs │ 28676   │ 28676
├─ u32_new            109.7 µs │ 125.7 µs │ 111.6 µs │ 112.2 µs │ 112     │ 896
├─ u32_old            245 µs   │ 261.4 µs │ 247.6 µs │ 248.5 µs │ 100     │ 800
├─ u64_new            110.5 µs │ 147.9 µs │ 112.8 µs │ 114.6 µs │ 109     │ 872
├─ u64_old            255 µs   │ 378.5 µs │ 258.4 µs │ 264.1 µs │ 100     │ 800
├─ u8_new             10.86 µs │ 34.03 µs │ 11.49 µs │ 11.71 µs │ 1064    │ 8512
╰─ u8_old             17.71 µs │ 28.99 µs │ 18.13 µs │ 18.18 µs │ 686     │ 5488
```
bench.rs:

```rust
use std::sync::Arc;

use once_cell::sync::Lazy;

fn main() {
    divan::main();
}

static VECSTRING: Lazy<Vec<Arc<str>>> =
    Lazy::new(|| Vec::from_iter((0..1_000).map(|x| x.to_string().into())));
static VEC64: Lazy<Vec<u64>> = Lazy::new(|| Vec::from_iter(0..10_000));
static VEC32: Lazy<Vec<u32>> = Lazy::new(|| Vec::from_iter(0..10_000));
static VEC8: Lazy<Vec<u8>> = Lazy::new(|| Vec::from_iter(0..=255));

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u32_old() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC32)
                .iter()
                .copied()
                .position(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u32_new() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC32).iter().copied().locate(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u64_old() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC64)
                .iter()
                .copied()
                .position(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u64_new() {
    for x in (0..10_000).step_by(100) {
        assert_eq!(
            divan::black_box(&VEC64).iter().copied().locate(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u8_old() {
    for x in 0..=255 {
        assert_eq!(
            divan::black_box(&VEC8).iter().copied().position(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn u8_new() {
    for x in 0..=255 {
        assert_eq!(
            divan::black_box(&VEC8).iter().copied().locate(|s| s == x),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn arcstr_old() {
    for x in (0..1_000).step_by(10) {
        let xstr = x.to_string();
        assert_eq!(
            divan::black_box(&VECSTRING)
                .iter()
                .clone()
                .position(|s| **s == *xstr.as_str()),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 8)]
fn arcstr_new() {
    for x in (0..1_000).step_by(10) {
        let xstr = x.to_string();
        assert_eq!(
            divan::black_box(&VECSTRING)
                .iter()
                .clone()
                .locate(|s| **s == *xstr.as_str()),
            Some(x as usize)
        );
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn big_range_old() {
    for x in (0..1_000_000).step_by(100_000) {
        assert_eq!(
            divan::black_box(0..1_000_000).position(|s| s == x),
            Some(x as usize)
        )
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn big_range_new() {
    for x in (0..1_000_000).step_by(100_000) {
        assert_eq!(
            divan::black_box(0..1_000_000).locate(|s| s == x),
            Some(x as usize)
        )
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn small_range_old() {
    for x in 0..100 {
        assert_eq!(
            divan::black_box(0..100).position(|s| s == x),
            Some(x as usize)
        )
    }
}

#[divan::bench(min_time = 0.1, sample_size = 1)]
fn small_range_new() {
    for x in 0..100 {
        assert_eq!(
            divan::black_box(0..100).locate(|s| s == x),
            Some(x as usize)
        )
    }
}

trait Locate {
    type Item;

    fn locate<P>(&mut self, predicate: P) -> Option<usize>
    where
        P: FnMut(Self::Item) -> bool;
}

impl<T, U> Locate for U
where
    U: Iterator<Item = T>,
{
    type Item = T;

    #[inline]
    fn locate<P>(&mut self, mut predicate: P) -> Option<usize>
    where
        P: FnMut(Self::Item) -> bool,
    {
        let mut position = 0;
        while let Some(x) = self.next() {
            if predicate(x) {
                return Some(position);
            }
            position += 1;
        }
        None
    }
}
```
Benchmarks are named after the iterator type and were run using Divan. Ranges benefit massively (big ones by multiple orders of magnitude), while for other types the speedups are usually between 0% and 100%. A specialized implementation for ranges that calculates the position directly would of course be even faster, but I don't think `position()` is called on ranges often enough to justify it.
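Such a range specialization could look something like the following sketch (my own illustration with a hypothetical `range_position` helper, not something this change proposes): since element `i` of a range is `start + i`, the index can be recovered from the matched value itself, with no separate counter.

```rust
use std::ops::Range;

// Hypothetical sketch: for a `Range<u64>`, the position of the first match can
// be computed from the matched value, because element i equals start + i.
fn range_position<P>(range: Range<u64>, mut predicate: P) -> Option<usize>
where
    P: FnMut(u64) -> bool,
{
    let start = range.start;
    for x in range {
        if predicate(x) {
            return Some((x - start) as usize);
        }
    }
    None
}

fn main() {
    // Agrees with the std `position`: 42 is the 38th element of 5..100.
    assert_eq!(range_position(5..100, |x| x == 42), Some(37));
    assert_eq!((5u64..100).position(|x| x == 42), Some(37));
}
```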
The implementation is similar to the one for `slice::iter::Iter`, which was introduced in #72166 to reduce compile times, so perhaps this change would help in that regard as well. However, the compiler occasionally appears to forget how to optimize the function after unrelated code is edited. This also happens with the current implementation, so I don't know how reproducible the speedups are (FWIW, the range benchmarks have never failed to improve). I did not attempt to benchmark compilation speed, but according to Godbolt, memory usage is down from 11.8 MB to 8.3 MB, which looks promising.

I was motivated by a thread on Reddit, where some context and speculation can be found.