Skip to content

pclmulqdq intrinsics don't inline well across target_feature changes anymore #139029

Open
@iximeow

Description

@iximeow

i'd noticed the pclmulqdq intrinsics in crc32fast were notable in a perf report of a benchmark last night. somewhat shockingly, there were functions whose body was pclmulqdq xmm0, xmm1, 17; ret and pclmulqdq xmm0, xmm1, 0; ret, complete with constraining callers' choice of xmm registers! after a bit of digging it seems to be a regression in nightly.

the specific regression i'd started at can be reproduced with cargo bench in https://github.com/srijs/rust-crc32fast .
cargo +1.85.1 bench produces

running 4 tests
test bench_kilobyte_baseline    ... bench:         130 ns/iter (+/- 2) = 7876 MB/s
test bench_kilobyte_specialized ... bench:          48 ns/iter (+/- 0) = 21333 MB/s
test bench_megabyte_baseline    ... bench:     137,017 ns/iter (+/- 247) = 7652 MB/s
test bench_megabyte_specialized ... bench:      48,153 ns/iter (+/- 51) = 21775 MB/s

whereas
cargo +nightly bench produces

running 4 tests
test bench_kilobyte_baseline    ... bench:         130 ns/iter (+/- 1) = 7876 MB/s
test bench_kilobyte_specialized ... bench:         156 ns/iter (+/- 0) = 6564 MB/s
test bench_megabyte_baseline    ... bench:     137,229 ns/iter (+/- 393) = 7641 MB/s
test bench_megabyte_specialized ... bench:     145,632 ns/iter (+/- 377) = 7200 MB/s

after looking at perf a bit i believe this is representative: https://rust.godbolt.org/z/8dxcE4vo1 . i'm including everything there in this issue as well.

Code

I tried this code:

use std::arch::x86_64 as arch;

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
#[no_mangle]
pub unsafe fn reduce128_caller(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    reduce128(a, b, keys)
}

unsafe fn reduce128(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    let t1 = arch::_mm_clmulepi64_si128(a, keys, 0x00);
    let t2 = arch::_mm_clmulepi64_si128(a, keys, 0x11);
    arch::_mm_xor_si128(arch::_mm_xor_si128(b, t1), t2)
}

I expected to see this happen (with -C opt-level=3):

reduce128_caller:
        mov     rax, rdi
        movdqa  xmm0, xmmword ptr [rsi]
        movdqa  xmm1, xmmword ptr [rcx]
        movdqa  xmm2, xmm0
        pclmulqdq       xmm2, xmm1, 0
        pxor    xmm2, xmmword ptr [rdx]
        pclmulqdq       xmm0, xmm1, 17
        pxor    xmm2, xmm0
        movdqa  xmmword ptr [rdi], xmm2
        ret

Instead, this happened (also -C opt-level=3):

core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::h9ea1fa421d47acc5:
        pclmulqdq       xmm0, xmm1, 17
        ret

core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::heb2402630e2a6f04:
        pclmulqdq       xmm0, xmm1, 0
        ret

reduce128_caller:
        jmp     example::reduce128::h8b5076bc8edc1d53

example::reduce128::h8b5076bc8edc1d53:
        sub     rsp, 72
        movaps  xmmword ptr [rsp + 16], xmm2
        movaps  xmmword ptr [rsp + 48], xmm1
        movaps  xmmword ptr [rsp], xmm0
        movaps  xmm1, xmm2
        call    core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::heb2402630e2a6f04
        movaps  xmmword ptr [rsp + 32], xmm0
        movaps  xmm0, xmmword ptr [rsp]
        movaps  xmm1, xmmword ptr [rsp + 16]
        call    core::core_arch::x86::pclmulqdq::_mm_clmulepi64_si128::h9ea1fa421d47acc5
        movaps  xmm1, xmmword ptr [rsp + 32]
        xorps   xmm1, xmmword ptr [rsp + 48]
        xorps   xmm0, xmm1
        add     rsp, 72
        ret

Version it worked on

1.85.1, 1.31.0, and a half dozen in between.

additionally, beta (rust version 1.86.0-beta.7 (7824ede 2025-03-22) seems good.

nightly with -C opt-level=3 -C target-feature=+pclmul still does great.

Version with regression

in the above godbolt link, i see --version in the rustc nightly tab provides rustc 1.87.0-nightly (a2e63569f 2025-03-26). this is consistent with how i first saw this locally:

rustc +nightly --version --verbose:

rustc +nightly --version --verbose
rustc 1.87.0-nightly (a2e63569f 2025-03-26)
binary: rustc
commit-hash: a2e63569fd6702ac5dd027a80a9fdaadce73adae
commit-date: 2025-03-26
host: x86_64-unknown-linux-gnu
release: 1.87.0-nightly
LLVM version: 20.1.1

Related improvement along the way

adding the same target_feature block on the inner function sees nightly produce somewhat better-than-baseline code: https://rust.godbolt.org/z/sGrYedeaP

use std::arch::x86_64 as arch;

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
#[no_mangle]
pub unsafe fn reduce128_caller(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    reduce128(a, b, keys)
}

#[target_feature(enable = "pclmulqdq", enable = "sse2", enable = "sse4.1")]
unsafe fn reduce128(a: arch::__m128i, b: arch::__m128i, keys: arch::__m128i) -> arch::__m128i {
    let t1 = arch::_mm_clmulepi64_si128(a, keys, 0x00);
    let t2 = arch::_mm_clmulepi64_si128(a, keys, 0x11);
    arch::_mm_xor_si128(arch::_mm_xor_si128(b, t1), t2)
}

with rustc +nightly -C opt-level 3 yields:

reduce128_caller:
        movdqa  xmm3, xmm0
        pclmulqdq       xmm3, xmm2, 0
        pclmulqdq       xmm0, xmm2, 17
        pxor    xmm3, xmm1
        pxor    xmm0, xmm3
        ret

whereas before the codegen was identical regardless of the target_feature attribute on the inner function. so at least in some cases there is a modest improvement?

@rustbot modify labels: +regression-from-stable-to-nightly -regression-untriaged

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-target-featureArea: Enabling/disabling target features like AVX, Neon, etc.C-bugCategory: This is a bug.I-prioritizeIssue: Indicates that prioritization has been requested for this issue.I-slowIssue: Problems and improvements with respect to performance of generated code.O-x86_64Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64)S-has-bisectionStatus: a bisection has been found for this issueT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.regression-from-stable-to-nightlyPerformance or correctness regression from stable to nightly.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions