Description
The AVX2 intrinsic _mm256_loadu_si256
fully loads all 256 bits from memory into the register when compiled without any optimization, but only loads 128 bits when compiled with the default cargo build --release
option. This small program exhibits the issue:
use std::arch::x86_64;
fn main() {
if is_x86_feature_detected!("avx2") {
let load_bytes: [u8; 32] = [0x0f; 32];
let lb_ptr = load_bytes.as_ptr();
let reg_load = unsafe {
x86_64::_mm256_loadu_si256(
lb_ptr as *const x86_64::__m256i
)
};
println!("{:?}", reg_load);
let mut store_bytes: [u8; 32] = [0; 32];
let sb_ptr = store_bytes.as_mut_ptr();
unsafe {
x86_64::_mm256_storeu_si256(sb_ptr as *mut x86_64::__m256i, reg_load);
}
assert_eq!(load_bytes, store_bytes);
} else {
println!("AVX2 is not supported on this machine/build.");
}
}
When I run cargo run
, this is the output:
Compiling...
Finished dev [unoptimized + debuginfo] target(s) in 0.33s
Running `target/debug/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 1085102592571150095, 1085102592571150095)
However, with cargo run --release
, this is the output:
Compiling...
Finished release [optimized] target(s) in 0.26s
Running `target/release/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 0, 0)
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]`,
right: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`', src/main.rs:18:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
I am on macOS 10.13.6 with a Core i7 I7-4960HQ, and the output of rustc --version --verbose
is
rustc 1.27.2
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-apple-darwin
release: 1.27.2
LLVM version: 6.0
Curiously, when inspecting the assembly of main
, the call to _mm256_loadu_si256
is not inlined, but instead generates this function:
avx2_bug_hunt`core::coresimd::x86::avx::_mm256_loadu_si256::hd7fc98ebefdce593:
avx2_bug_hunt[0x1000018a0] <+0>: pushq %rbp
avx2_bug_hunt[0x1000018a1] <+1>: movq %rsp, %rbp
avx2_bug_hunt[0x1000018a4] <+4>: vmovaps %ymm0, (%rdi)
avx2_bug_hunt[0x1000018a8] <+8>: popq %rbp
avx2_bug_hunt[0x1000018a9] <+9>: vzeroupper
avx2_bug_hunt[0x1000018ac] <+12>: retq
avx2_bug_hunt[0x1000018ad] <+13>: nopl (%rax)
Note the vzeroupper
instruction, which clears out the non-XMM registers. This is incorrect behavior, _m256i
requires the full YMM register to be loaded unmodified. A similar spurious vzeroupper
is also present in the assembly generated for _mm256_storeu_si256
, but after the register is stored into memory.