Description
Today for no reason in particular I was curious about the behaviour of arrays when zipped together and then mapped into a single array, roughly:
```rust
fn zip_with<T, U, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}
```
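(As a point of comparison — not code from my experiments — for `Copy` element types the same operation can be written on stable Rust with `std::array::from_fn`, which sidesteps the by-value-move problem entirely; the helper name `zip_with_copy` is just for illustration:)

```rust
// Illustrative stable-Rust alternative: only works for `Copy` element
// types, because it reads `a[i]` and `b[i]` by value out of borrowed
// arrays. The general (non-Copy) case is what forces the unsafe code
// later in this issue.
fn zip_with_copy<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    std::array::from_fn(|i| f(a[i], b[i]))
}

fn main() {
    let sums = zip_with_copy([1, 2, 3], [10, 20, 30], |x, y| x + y);
    assert_eq!(sums, [11, 22, 33]);
}
```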
I wrote some tests for this and disassembled them in Godbolt to see what was going on behind the scenes, and wow, there was a whole lot going on that seemed like it didn't need to be. So I wrote something kinda like what's in the stdlib for mapping and zipping arrays, just without the iterators, to see if I could do any better. Turns out I absolutely can:
```rust
use std::mem::{ManuallyDrop, MaybeUninit};
use std::ptr;

/// Drops the not-yet-consumed tails of both input arrays if `f` panics.
struct ZipWithGuard<T, U, const N: usize> {
    a: *mut [T; N],
    b: *mut [U; N],
    moved: usize,
}

impl<T, U, const N: usize> Drop for ZipWithGuard<T, U, N> {
    fn drop(&mut self) {
        // The elements at index `moved` have already been read out and
        // handed to `f` (its unwind drops them), so only the tail at
        // `moved + 1..N` still needs dropping here.
        for i in self.moved + 1..N {
            unsafe {
                drop(self.a.cast::<T>().add(i).read());
                drop(self.b.cast::<U>().add(i).read());
            }
        }
    }
}

pub fn zip_with_guarded<T, U, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    // Inhibit the inputs' destructors; from here the guard is
    // responsible for dropping whatever `f` hasn't consumed yet.
    let mut a = ManuallyDrop::new(a);
    let mut b = ManuallyDrop::new(b);
    let mut guard = ZipWithGuard {
        a: &mut *a as *mut [T; N],
        b: &mut *b as *mut [U; N],
        moved: 0,
    };
    let mut out: MaybeUninit<[V; N]> = MaybeUninit::uninit();
    let out_ptr = out.as_mut_ptr().cast::<V>();
    while guard.moved < N {
        let i = guard.moved;
        unsafe {
            out_ptr.add(i).write(f(
                ptr::read(guard.a.cast::<T>().add(i)),
                ptr::read(guard.b.cast::<U>().add(i)),
            ));
        }
        guard.moved += 1;
    }
    // `guard.moved == N`, so the guard's Drop is a no-op from here.
    unsafe { out.assume_init() }
}
```
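(To convince myself the guard logic is right, here's a self-contained miniature of the same pattern — a single-array `map_guarded` of my own, not code from the benchmarks above — with a drop-counting test checking that a panicking closure still gets the unconsumed tail dropped:)

```rust
use std::mem::{ManuallyDrop, MaybeUninit};
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::ptr;
use std::sync::atomic::{AtomicUsize, Ordering};

struct MapGuard<T, const N: usize> {
    src: *mut [T; N],
    moved: usize,
}

impl<T, const N: usize> Drop for MapGuard<T, N> {
    fn drop(&mut self) {
        // Element `moved` was already read out; drop only the tail.
        for i in self.moved + 1..N {
            unsafe { ptr::drop_in_place(self.src.cast::<T>().add(i)) }
        }
    }
}

fn map_guarded<T, V, const N: usize>(src: [T; N], mut f: impl FnMut(T) -> V) -> [V; N] {
    let mut src = ManuallyDrop::new(src);
    let mut guard = MapGuard { src: &mut *src as *mut [T; N], moved: 0 };
    let mut out: MaybeUninit<[V; N]> = MaybeUninit::uninit();
    let out_ptr = out.as_mut_ptr().cast::<V>();
    while guard.moved < N {
        let i = guard.moved;
        unsafe { out_ptr.add(i).write(f(ptr::read(guard.src.cast::<T>().add(i)))) };
        guard.moved += 1;
    }
    unsafe { out.assume_init() }
}

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Counted(u32);
impl Drop for Counted {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let arr = [Counted(0), Counted(1), Counted(2), Counted(3)];
    let res = catch_unwind(AssertUnwindSafe(|| {
        map_guarded(arr, |c| if c.0 == 1 { panic!("boom") } else { c.0 })
    }));
    assert!(res.is_err());
    // All four elements were dropped: index 0 by the closure, index 1
    // by the closure's unwind, indices 2 and 3 by the guard. (The u32
    // output produced before the panic has no destructor to run.)
    assert_eq!(DROPS.load(Ordering::SeqCst), 4);
}
```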
Maybe the design isn't as optimal (or as safe) as it could be, but it performs perfectly well in benchmarks (read: as fast as an unguarded version, and anywhere between 3-25* times faster than the `zip_with` definition shown at the top of this issue). Here's a rough table of execution times averaged over the primitive ops (`+`, `-`, `*`, `&`, `|`, `^`):
Not really liking how things are looking for `.zip().map()`. So, the question then is: is there something in Rust that needs to change to let these optimisations happen, or do I need to look at putting this in a library somewhere (or maybe the stdlib, wink wink nudge nudge)?
*The 25x comes from the `[u8; 8]` benchmarks. `zip_with` benches at 11.230ns and `zip_with_guarded` at 452.593ps, which seems suspicious, but I can't seem to find anything wrong with the result.