Skip to content

Inefficient codegen collecting slice iterator into array #126000

Open
@orlp

Description

@orlp

While working on Itertools::collect_array for itertools I wanted to compare the efficiency of

pub fn try_into(sl: &[u32]) -> Option<[u32; N]> {
    Some(*<&[u32; N]>::try_from(sl.get(0..N)?).unwrap())
}

pub fn collect_array(sl: &[u32]) -> Option<[u32; N]> {
    sl.iter().copied().collect_array()
}

You can see my comparison here: https://rust.godbolt.org/z/qPW3K8aTx.

I was rather shocked, both look good for N = 4, but for N = 16 we see the following nice implementation for try_into:

try_into:
        mov     rax, rdi
        xor     ecx, ecx
        cmp     rdx, 15
        jbe     .LBB0_2
        vmovups ymm0, ymmword ptr [rsi + 32]
        vmovups ymmword ptr [rax + 36], ymm0
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rax + 4], ymm0
        mov     ecx, 1
.LBB0_2:
        mov     dword ptr [rax], ecx
        vzeroupper
        ret

But the following abomination for collect_array:

collect_array:
        mov     rax, rdi
        xor     ecx, ecx
        test    rdx, rdx
        je      .LBB1_14
        cmp     rdx, 1
        je      .LBB1_14
        cmp     rdx, 2
        je      .LBB1_14
        cmp     rdx, 3
        je      .LBB1_14
        cmp     rdx, 4
        je      .LBB1_14
        cmp     rdx, 5
        je      .LBB1_14
        cmp     rdx, 6
        je      .LBB1_14
        cmp     rdx, 7
        je      .LBB1_14
        cmp     rdx, 8
        je      .LBB1_14
        cmp     rdx, 9
        je      .LBB1_14
        cmp     rdx, 10
        je      .LBB1_14
        cmp     rdx, 11
        je      .LBB1_14
        and     rdx, -4
        cmp     rdx, 12
        je      .LBB1_14
        push    rbp
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     ecx, dword ptr [rsi]
        mov     edx, dword ptr [rsi + 4]
        mov     edi, dword ptr [rsi + 8]
        mov     r8d, dword ptr [rsi + 12]
        mov     r9d, dword ptr [rsi + 16]
        mov     r10d, dword ptr [rsi + 20]
        mov     r11d, dword ptr [rsi + 24]
        mov     ebx, dword ptr [rsi + 28]
        mov     ebp, dword ptr [rsi + 32]
        mov     r14d, dword ptr [rsi + 36]
        mov     r15d, dword ptr [rsi + 40]
        mov     r12d, dword ptr [rsi + 44]
        mov     dword ptr [rax + 4], ecx
        mov     dword ptr [rax + 8], edx
        mov     dword ptr [rax + 12], edi
        mov     dword ptr [rax + 16], r8d
        mov     dword ptr [rax + 20], r9d
        mov     dword ptr [rax + 24], r10d
        mov     dword ptr [rax + 28], r11d
        mov     dword ptr [rax + 32], ebx
        mov     dword ptr [rax + 36], ebp
        mov     dword ptr [rax + 40], r14d
        mov     dword ptr [rax + 44], r15d
        mov     dword ptr [rax + 48], r12d
        vmovups xmm0, xmmword ptr [rsi + 48]
        vmovups xmmword ptr [rax + 52], xmm0
        mov     ecx, 1
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        pop     rbp
.LBB1_14:
        mov     dword ptr [rax], ecx
        ret

While it not folding the consecutive address movs into efficient SIMD movs is disappointing, I would argue that there's probably a bug somewhere since it compares rdx THIRTEEN TIMES IN A ROW to ultimately just check if it is < 16.

This doesn't appear to be a recent regression, the same happens in 1.60 through nightly, and it happens on both x86 as well as ARM.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.A-codegenArea: Code generationC-bugCategory: This is a bug.I-heavyIssue: Problems and improvements with respect to binary size of generated code.I-slowIssue: Problems and improvements with respect to performance of generated code.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions