Description
While examining a weird performance regression on Zulip, we noticed a peculiar thing caused by a simple for loop.
Consider this code:
#[inline(never)]
fn add_index(bools: &[bool]) -> usize {
let mut count: usize = 0;
for i in 0..bools.len() {
if bools[i] {
count += 1;
}
}
count
}
#[inline(never)]
fn add_iter(bools: &[bool]) -> usize {
let mut count: usize = 0;
for b in bools {
if *b {
count += 1;
}
}
count
}
These two functions do the same thing, but the first one uses slice indexing, while the other one uses a more natural for loop. One would expect that the second version will optimize better (or same).
However, when these functions are called multiple times, only the first function is eligible for CSE (common subexpression elimination):
pub fn foo1(bools: &[bool]) {
// CSE applied
println!("a: {}", add_index(bools));
println!("b: {}", add_index(bools));
}
pub fn foo2(bools: &[bool]) {
// CSE not applied
println!("a: {}", add_iter(bools));
println!("b: {}", add_iter(bools));
}
Link to Godbolt.
When examining these two functions, I noticed that the indexed version uses nocapture
for its argument, while the for loop version doesn't:
define noundef i64 @add_index(ptr noalias nocapture noundef nonnull readonly align 1 %bools.0, i64 noundef %bools.1) unnamed_addr #0 !dbg !7 {
define noundef i64 @add_iter(ptr noalias noundef nonnull readonly align 1 %bools.0, i64 noundef %bools.1) unnamed_addr #1 !dbg !57 {
which might be the reason of why CSE isn't applied.
It seems weird that the canonical way of iterating over a slice produces worse code (or at least function attributes) than manual indexing.