Description
Consider this simple function:
const SIZE: usize = 4096;

fn array_of_twos() -> [u64; SIZE] {
    [2; SIZE]
}
Because 2u64 does not have the same byte repeated throughout, the compiler can't lower this initialization to a memset call and instead generates a vectorized store loop.
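For contrast, here is a sketch (not part of the original report) of the case the compiler can already turn into memset: a fill value whose bytes are all identical, reusing SIZE from above.

// Every byte of 0u64 is 0x00, so this typically lowers to a single
// memset(ptr, 0, SIZE * 8) call rather than a store loop.
fn array_of_zeros() -> [u64; SIZE] {
    [0; SIZE]
}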
However, from my testing, using the rep stosq instruction is over twice as fast for large arrays (more than a few hundred elements). Here is a faster version of the same function:
use std::arch::asm;
use std::mem::MaybeUninit;

fn array_of_twos_faster() -> [u64; SIZE] {
    let mut arr = MaybeUninit::uninit();
    unsafe {
        asm!(
            "mov rax, 2",      // value to store
            "mov rcx, {}",     // element count
            "mov rdi, {}",     // destination pointer
            "rep stosq",       // store rax, rcx times, advancing rdi
            const SIZE,
            in(reg) arr.as_mut_ptr(),
            // Plain `out` (not `lateout`) keeps the allocator from handing
            // rax or rcx to the pointer operand, which the template clobbers
            // before reading it.
            out("rax") _, out("rdi") _, out("rcx") _,
            options(nostack, preserves_flags)
        );
        arr.assume_init()
    }
}
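A hypothetical sanity check (not from the issue) confirming the two versions agree:

fn main() {
    // Both functions must produce the same 4096-element array of 2s.
    assert_eq!(array_of_twos(), array_of_twos_faster());
    println!("outputs match");
}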
Benchmarking both with Criterion:
normal      time:   [1.5435 µs 1.5465 µs 1.5501 µs]
            change: [-3.1683% -2.2863% -1.4243%] (p = 0.00 < 0.05)
            Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

rep stosq   time:   [633.94 ns 636.36 ns 639.77 ns]
            change: [-2.2975% -1.8986% -1.4693%] (p = 0.00 < 0.05)
            Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
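For reference, a minimal Criterion harness along these lines (bench names and setup are my assumption, not copied from the report) would look like:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_fills(c: &mut Criterion) {
    // black_box keeps the optimizer from discarding the returned arrays.
    c.bench_function("normal", |b| b.iter(|| black_box(array_of_twos())));
    c.bench_function("rep stosq", |b| b.iter(|| black_box(array_of_twos_faster())));
}

criterion_group!(benches, bench_fills);
criterion_main!(benches);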
Compare both of them on Godbolt.
Labels
Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
Category: An issue highlighting optimization opportunities or PRs implementing such.
Issue: Problems and improvements with respect to binary size of generated code.
Relevant to the compiler team, which will review and decide on the PR/issue.