Skip to content

Loop without side-effect is not eliminated. Leads to O(n) instead of O(1) runtime #79308

Closed
@the8472

Description

@the8472

Adding the follow method as part of a benchmark to library/alloc/benches/vec.rs

#[repr(transparent)]
pub struct Foo(usize);


#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().map(|e| unsafe { std::mem::transmute(e) }).collect()
}

which exercises this specialization in Vec

results in the following assembly (extracted with objdump):

0000000000086130 <collectionsbenches::vec::vec_cast>:
   86130:	48 8b 0e             	mov    (%rsi),%rcx
   86133:	48 89 f8             	mov    %rdi,%rax
   86136:	48 8b 56 08          	mov    0x8(%rsi),%rdx
   8613a:	48 8b 7e 10          	mov    0x10(%rsi),%rdi
   8613e:	48 89 ce             	mov    %rcx,%rsi
   86141:	48 85 ff             	test   %rdi,%rdi
   86144:	74 10                	je     86156 <collectionsbenches::vec::vec_cast+0x26>
   86146:	48 8d 34 f9          	lea    (%rcx,%rdi,8),%rsi
   8614a:	48 c1 e7 03          	shl    $0x3,%rdi
   8614e:	66 90                	xchg   %ax,%ax
   86150:	48 83 c7 f8          	add    $0xfffffffffffffff8,%rdi ; <= RDI unused from here onwards
   86154:	75 fa                	jne    86150 <collectionsbenches::vec::vec_cast+0x20>
   86156:	48 29 ce             	sub    %rcx,%rsi
   86159:	48 89 08             	mov    %rcx,(%rax)
   8615c:	48 89 50 08          	mov    %rdx,0x8(%rax)
   86160:	48 c1 fe 03          	sar    $0x3,%rsi
   86164:	48 89 70 10          	mov    %rsi,0x10(%rax)
   86168:	c3                   	retq   

The ghidra decompile for the same function (comments are mine):

void collectionsbenches::vec::vec_cast(long *param_1,long *param_2)

{
  long lVar1;
  long lVar2;
  long lVar3;
  long lVar4;
  
  lVar1 = *param_2; // pointer
  lVar2 = param_2[1]; // capacity
  lVar4 = param_2[2]; // len
  lVar3 = lVar1;
  if (lVar4 != 0) {
    lVar3 = lVar1 + lVar4 * 8; // end pointer of vec::IntoIter
    lVar4 = lVar4 << 3; // len in bytes
    do {
      lVar4 = lVar4 + -8;
    } while (lVar4 != 0); // <== lVar4 unused from here onwards
  }
  *param_1 = lVar1; // pointer
  param_1[1] = lVar2; // capacity
  param_1[2] = lVar3 - lVar1 >> 3; // len from pointer difference
  return;
}

Note the useless loop.

The number of loop iterations (or rather the pointer increments) is needed to calculate the new length of the output Vec. LLVM already manages to hoist lVar3 = lVar1 + lVar4 * 8; but then it fails to eliminate the now-useless loop.

The issue does not occur if one uses input.into_iter().flat_map(|e| None).collect() instead, which always results in length == 0.

I tried several variations of the loop (e.g. replacing try_fold with a simple while let Some() ...) but it generally results in the same or worse assembly.

Note: The assembly looks somewhat different if I run this on godbolt but the decrementing loop without side-effect is still there. I assume the differences are due to LTO or some other compiler settings.

Tested on commit a1a13b2 2020-11-21 22:46

@rustbot modify labels: +I-slow

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.C-enhancementCategory: An issue proposing an enhancement or a PR with one.I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions