Auto merge of #41764 - scottmcm:faster-reverse, r=brson
Make [u8]::reverse() 5x faster
Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to reverse the actual bytes. This is cfg!-restricted to x86 and x86_64, as I assume the unaligned accesses wouldn't help on things like ARMv5.
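Roughly, the per-chunk swap looks like the sketch below. This is not the exact code in the PR; the function name and the fixed u64 chunk size are just for illustration. The idea is to read one word from each end with read_unaligned, byte-swap both, and write them back to the mirrored positions.

```rust
// Illustrative sketch only: reverse a byte slice by swapping u64-sized chunks
// taken from both ends. `swap_bytes()` lowers to a single `bswap` on x86_64.
fn reverse_u8_by_chunks(slice: &mut [u8]) {
    const CHUNK: usize = std::mem::size_of::<u64>();
    let len = slice.len();
    let mut i = 0;
    // While two non-overlapping chunks remain, swap the front and back chunks.
    while i + CHUNK <= len / 2 {
        unsafe {
            let ptr = slice.as_mut_ptr();
            let front = ptr.add(i) as *mut u64;
            let back = ptr.add(len - i - CHUNK) as *mut u64;
            let f = std::ptr::read_unaligned(front);
            let b = std::ptr::read_unaligned(back);
            // Reversing the whole slice means each chunk's bytes must also be
            // reversed, hence the byte swap.
            std::ptr::write_unaligned(front, b.swap_bytes());
            std::ptr::write_unaligned(back, f.swap_bytes());
        }
        i += CHUNK;
    }
    // A plain element-by-element reverse handles the leftover middle.
    slice[i..len - i].reverse();
}
```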
This also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the two u16s with a ROT16.
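The u16 case has the same shape, except each u32 load covers two elements and a 16-bit rotate swaps them without reversing the bytes inside each u16. Again, this is only a sketch and the name is made up:

```rust
// Illustrative sketch only: swap pairs of u16s from both ends of the slice.
// `rotate_left(16)` exchanges the two 16-bit halves of a u32, i.e. the ROT16
// mentioned above.
fn reverse_u16_by_pairs(slice: &mut [u16]) {
    const PAIR: usize = 2; // two u16 elements per u32 load
    let len = slice.len();
    let mut i = 0;
    while i + PAIR <= len / 2 {
        unsafe {
            let ptr = slice.as_mut_ptr();
            let front = ptr.add(i) as *mut u32;
            let back = ptr.add(len - i - PAIR) as *mut u32;
            let f = std::ptr::read_unaligned(front);
            let b = std::ptr::read_unaligned(back);
            std::ptr::write_unaligned(front, b.rotate_left(16));
            std::ptr::write_unaligned(back, f.rotate_left(16));
        }
        i += PAIR;
    }
    // Any remaining middle elements fall back to the plain reverse.
    slice[i..len - i].reverse();
}
```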
Thank you ptr::*_unaligned for making this easy :)
Benchmark results (from my i5-2500K):
```text
# Before
test slice::reverse_u8 ... bench: 273,836 ns/iter (+/- 15,592) = 3829 MB/s
test slice::reverse_u16 ... bench: 139,793 ns/iter (+/- 17,748) = 7500 MB/s
test slice::reverse_u32 ... bench: 74,997 ns/iter (+/- 5,130) = 13981 MB/s
test slice::reverse_u64 ... bench: 47,452 ns/iter (+/- 2,213) = 22097 MB/s
# After
test slice::reverse_u8 ... bench: 52,170 ns/iter (+/- 3,962) = 20099 MB/s
test slice::reverse_u16 ... bench: 93,330 ns/iter (+/- 4,412) = 11235 MB/s
test slice::reverse_u32 ... bench: 74,731 ns/iter (+/- 1,425) = 14031 MB/s
test slice::reverse_u64 ... bench: 47,556 ns/iter (+/- 3,025) = 22049 MB/s
```
If you're curious about the assembly, instead of doing this
```asm
movzx eax, byte ptr [rdi]
movzx ecx, byte ptr [rsi]
mov byte ptr [rdi], cl
mov byte ptr [rsi], al
```
it does this
```asm
mov rax, qword ptr [rdx]
mov rbx, qword ptr [r11 + rcx - 8]
bswap rbx
mov qword ptr [rdx], rbx
bswap rax
mov qword ptr [r11 + rcx - 8], rax
```