Open
Description
While working on Itertools::collect_array
for itertools
I wanted to compare the efficiency of
pub fn try_into(sl: &[u32]) -> Option<[u32; N]> {
Some(*<&[u32; N]>::try_from(sl.get(0..N)?).unwrap())
}
pub fn collect_array(sl: &[u32]) -> Option<[u32; N]> {
sl.iter().copied().collect_array()
}
You can see my comparison here: https://rust.godbolt.org/z/qPW3K8aTx.
I was rather shocked, both look good for N = 4
, but for N = 16
we see the following nice implementation for try_into
:
try_into:
mov rax, rdi
xor ecx, ecx
cmp rdx, 15
jbe .LBB0_2
vmovups ymm0, ymmword ptr [rsi + 32]
vmovups ymmword ptr [rax + 36], ymm0
vmovups ymm0, ymmword ptr [rsi]
vmovups ymmword ptr [rax + 4], ymm0
mov ecx, 1
.LBB0_2:
mov dword ptr [rax], ecx
vzeroupper
ret
But the following abomination for collect_array
:
collect_array:
mov rax, rdi
xor ecx, ecx
test rdx, rdx
je .LBB1_14
cmp rdx, 1
je .LBB1_14
cmp rdx, 2
je .LBB1_14
cmp rdx, 3
je .LBB1_14
cmp rdx, 4
je .LBB1_14
cmp rdx, 5
je .LBB1_14
cmp rdx, 6
je .LBB1_14
cmp rdx, 7
je .LBB1_14
cmp rdx, 8
je .LBB1_14
cmp rdx, 9
je .LBB1_14
cmp rdx, 10
je .LBB1_14
cmp rdx, 11
je .LBB1_14
and rdx, -4
cmp rdx, 12
je .LBB1_14
push rbp
push r15
push r14
push r12
push rbx
mov ecx, dword ptr [rsi]
mov edx, dword ptr [rsi + 4]
mov edi, dword ptr [rsi + 8]
mov r8d, dword ptr [rsi + 12]
mov r9d, dword ptr [rsi + 16]
mov r10d, dword ptr [rsi + 20]
mov r11d, dword ptr [rsi + 24]
mov ebx, dword ptr [rsi + 28]
mov ebp, dword ptr [rsi + 32]
mov r14d, dword ptr [rsi + 36]
mov r15d, dword ptr [rsi + 40]
mov r12d, dword ptr [rsi + 44]
mov dword ptr [rax + 4], ecx
mov dword ptr [rax + 8], edx
mov dword ptr [rax + 12], edi
mov dword ptr [rax + 16], r8d
mov dword ptr [rax + 20], r9d
mov dword ptr [rax + 24], r10d
mov dword ptr [rax + 28], r11d
mov dword ptr [rax + 32], ebx
mov dword ptr [rax + 36], ebp
mov dword ptr [rax + 40], r14d
mov dword ptr [rax + 44], r15d
mov dword ptr [rax + 48], r12d
vmovups xmm0, xmmword ptr [rsi + 48]
vmovups xmmword ptr [rax + 52], xmm0
mov ecx, 1
pop rbx
pop r12
pop r14
pop r15
pop rbp
.LBB1_14:
mov dword ptr [rax], ecx
ret
While it not folding the consecutive address mov
s into efficient SIMD mov
s is disappointing, I would argue that there's probably a bug somewhere since it compares rdx
THIRTEEN TIMES IN A ROW to ultimately just check if it is < 16
.
This doesn't appear to be a recent regression, the same happens in 1.60 through nightly, and it happens on both x86 as well as ARM.
Metadata
Metadata
Assignees
Labels
Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.Area: Code generationCategory: This is a bug.Issue: Problems and improvements with respect to binary size of generated code.Issue: Problems and improvements with respect to performance of generated code.