Skip to content

_mm256_loadu_si256 only loads 128 bits when compiled with default cargo build --release #52636

Closed
@djsweet

Description

@djsweet

The AVX2 intrinsic _mm256_loadu_si256 fully loads all 256 bits from memory into the register when compiled without any optimization, but only loads 128 bits when compiled with the default cargo build --release option. This small program exhibits the issue:

use std::arch::x86_64;

fn main() {
    if is_x86_feature_detected!("avx2") {
        let load_bytes: [u8; 32] = [0x0f; 32];
        let lb_ptr = load_bytes.as_ptr();
        let reg_load = unsafe {
            x86_64::_mm256_loadu_si256(
                lb_ptr as *const x86_64::__m256i
            )
        };
        println!("{:?}", reg_load);
        let mut store_bytes: [u8; 32] = [0; 32];
        let sb_ptr = store_bytes.as_mut_ptr();
        unsafe {
            x86_64::_mm256_storeu_si256(sb_ptr as *mut x86_64::__m256i, reg_load);
        }
        assert_eq!(load_bytes, store_bytes);
    } else {
        println!("AVX2 is not supported on this machine/build.");
    }
}

When I run cargo run, this is the output:

   Compiling...
    Finished dev [unoptimized + debuginfo] target(s) in 0.33s
     Running `target/debug/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 1085102592571150095, 1085102592571150095)

However, with cargo run --release, this is the output:

   Compiling...
    Finished release [optimized] target(s) in 0.26s
     Running `target/release/avx2_bug_hunt`
__m256i(1085102592571150095, 1085102592571150095, 0, 0)
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]`,
 right: `[15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`', src/main.rs:18:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I am on macOS 10.13.6 with a Core i7 I7-4960HQ, and the output of rustc --version --verbose is

rustc 1.27.2
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-apple-darwin
release: 1.27.2
LLVM version: 6.0

Curiously, when inspecting the assembly of main, the call to _mm256_loadu_si256 is not inlined, but instead generates this function:

avx2_bug_hunt`core::coresimd::x86::avx::_mm256_loadu_si256::hd7fc98ebefdce593:
avx2_bug_hunt[0x1000018a0] <+0>:  pushq  %rbp
avx2_bug_hunt[0x1000018a1] <+1>:  movq   %rsp, %rbp
avx2_bug_hunt[0x1000018a4] <+4>:  vmovaps %ymm0, (%rdi)
avx2_bug_hunt[0x1000018a8] <+8>:  popq   %rbp
avx2_bug_hunt[0x1000018a9] <+9>:  vzeroupper 
avx2_bug_hunt[0x1000018ac] <+12>: retq   
avx2_bug_hunt[0x1000018ad] <+13>: nopl   (%rax)

Note the vzeroupper instruction, which clears out the non-XMM registers. This is incorrect behavior, _m256i requires the full YMM register to be loaded unmodified. A similar spurious vzeroupper is also present in the assembly generated for _mm256_storeu_si256, but after the register is stored into memory.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions