Description
Adding the following method as part of a benchmark to library/alloc/benches/vec.rs
#[repr(transparent)]
pub struct Foo(usize);
#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().map(|e| unsafe { std::mem::transmute(e) }).collect()
}
which exercises this specialization in Vec, results in the following assembly (extracted with objdump):
0000000000086130 <collectionsbenches::vec::vec_cast>:
86130: 48 8b 0e mov (%rsi),%rcx
86133: 48 89 f8 mov %rdi,%rax
86136: 48 8b 56 08 mov 0x8(%rsi),%rdx
8613a: 48 8b 7e 10 mov 0x10(%rsi),%rdi
8613e: 48 89 ce mov %rcx,%rsi
86141: 48 85 ff test %rdi,%rdi
86144: 74 10 je 86156 <collectionsbenches::vec::vec_cast+0x26>
86146: 48 8d 34 f9 lea (%rcx,%rdi,8),%rsi
8614a: 48 c1 e7 03 shl $0x3,%rdi
8614e: 66 90 xchg %ax,%ax
86150: 48 83 c7 f8 add $0xfffffffffffffff8,%rdi ; <= RDI unused from here onwards
86154: 75 fa jne 86150 <collectionsbenches::vec::vec_cast+0x20>
86156: 48 29 ce sub %rcx,%rsi
86159: 48 89 08 mov %rcx,(%rax)
8615c: 48 89 50 08 mov %rdx,0x8(%rax)
86160: 48 c1 fe 03 sar $0x3,%rsi
86164: 48 89 70 10 mov %rsi,0x10(%rax)
86168: c3 retq
The ghidra decompile for the same function (comments are mine):
void collectionsbenches::vec::vec_cast(long *param_1,long *param_2)
{
    long lVar1;
    long lVar2;
    long lVar3;
    long lVar4;

    lVar1 = *param_2;     // pointer
    lVar2 = param_2[1];   // capacity
    lVar4 = param_2[2];   // len
    lVar3 = lVar1;
    if (lVar4 != 0) {
        lVar3 = lVar1 + lVar4 * 8;   // end pointer of vec::IntoIter
        lVar4 = lVar4 << 3;          // len in bytes
        do {
            lVar4 = lVar4 + -8;
        } while (lVar4 != 0);        // <== lVar4 unused from here onwards
    }
    *param_1 = lVar1;                   // pointer
    param_1[1] = lVar2;                 // capacity
    param_1[2] = lVar3 - lVar1 >> 3;    // len from pointer difference
    return;
}
Note the useless loop.
The number of loop iterations (or rather the pointer increments) is needed to calculate the new length of the output Vec. LLVM already manages to hoist lVar3 = lVar1 + lVar4 * 8; out of the loop, but then it fails to eliminate the now-useless loop.
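To illustrate the pattern, here is a simplified Rust model of a loop that reads each element, writes it back to the same buffer, and derives the length from the pointer difference afterwards. This is only a sketch: the function name `in_place_len` is made up for this example and the code is not the actual library implementation. With an identity mapping, the only thing the loop really does is advance two pointers, so the result could be computed without iterating at all.

```rust
/// Hypothetical model of an in-place collect loop (not the real library code).
/// Every element is read and written back to the same buffer; the resulting
/// length is computed from how far the destination pointer has advanced.
pub fn in_place_len(buf: &mut [usize]) -> usize {
    let start: *mut usize = buf.as_mut_ptr();
    // One-past-the-end pointer, analogous to the end pointer of vec::IntoIter.
    let end: *mut usize = unsafe { start.add(buf.len()) };

    let mut src = start;
    let mut dst = start;
    while src != end {
        unsafe {
            // Identity "map": read the element and write it back unchanged.
            let value = src.read();
            dst.write(value);
            src = src.add(1);
            dst = dst.add(1);
        }
    }

    // Length from pointer difference, as in the decompiled output.
    (dst as usize - start as usize) / std::mem::size_of::<usize>()
}
```

Calling `in_place_len` on a three-element buffer returns 3 and leaves the buffer unchanged, which is the same as `buf.len()`; ideally the optimizer would recognise that and drop the loop.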
The issue does not occur if one uses input.into_iter().flat_map(|e| None).collect() instead, which always results in length == 0.
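For comparison, the two variants can be placed side by side. This is a sketch only: the name `vec_drop_all` is invented for this example, and the explicit type annotations on `transmute` and `None` were added so that the snippet compiles on its own.

```rust
#[repr(transparent)]
pub struct Foo(usize);

/// The variant from the report: the output length depends on the number of
/// elements consumed, so the length has to be derived from the iteration.
#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
    input
        .into_iter()
        // Foo is repr(transparent) over usize, so both types have the same layout.
        .map(|e| unsafe { std::mem::transmute::<Foo, usize>(e) })
        .collect()
}

/// The flat_map variant: every element maps to None, so the output is
/// always empty regardless of the input length.
#[inline(never)]
pub fn vec_drop_all(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().flat_map(|_e| None::<usize>).collect()
}
```

Both functions can be inspected with the same objdump approach to compare the generated loops.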
I tried several variations of the loop (e.g. replacing try_fold with a simple while let Some() ...) but it generally results in the same or worse assembly.
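The kind of variation meant here can be sketched as follows. Both helpers are hypothetical stand-ins written for this example (the names `write_with_try_fold` and `write_with_while_let` are not from the library); they write items into a slice and return how many were written, once via try_fold and once via a plain while let loop.

```rust
/// Hypothetical try_fold-based loop: writes items into `out` and returns
/// the number of items written, stopping early when `out` is full.
pub fn write_with_try_fold<I>(mut iter: I, out: &mut [usize]) -> usize
where
    I: Iterator<Item = usize>,
{
    let result: Result<usize, usize> = iter.try_fold(0usize, |written, item| {
        if written == out.len() {
            // Buffer is full: stop and report the count so far.
            return Err(written);
        }
        out[written] = item;
        Ok(written + 1)
    });
    match result {
        Ok(n) | Err(n) => n,
    }
}

/// Hypothetical equivalent using a plain `while let Some(..)` loop.
pub fn write_with_while_let<I>(mut iter: I, out: &mut [usize]) -> usize
where
    I: Iterator<Item = usize>,
{
    let mut written = 0;
    while written < out.len() {
        if let Some(item) = iter.next() {
            out[written] = item;
            written += 1;
        } else {
            break;
        }
    }
    written
}
```

Both functions compute the same count; the difference is only in the loop shape handed to the optimizer.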
Note: The assembly looks somewhat different if I run this on godbolt, but the decrementing loop without side effects is still there. I assume the differences are due to LTO or some other compiler settings.
Tested on commit a1a13b2 2020-11-21 22:46
@rustbot modify labels: +I-slow