Code generation for `std::intrinsics::simd::simd_reduce_add_unordered` emits an extra floating-point add of +0.0 to the reduction result: https://godbolt.org/z/Y496nxv3E
```rust
#![feature(portable_simd, core_intrinsics)]
use std::simd::*;

pub unsafe fn reduce_add_unordered(v: f32x4) -> f32 {
    std::intrinsics::simd::simd_reduce_add_unordered(v)
}
```
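For context, a small caller (illustrative only, assuming it is appended to the snippet above; the bug shows up in the generated code, not the computed value, since 1+2+3+4 is exactly 10.0 in f32 under any association order):

```rust
fn main() {
    let v = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
    // SAFETY: the intrinsic has no preconditions for float vectors.
    let sum = unsafe { reduce_add_unordered(v) };
    assert_eq!(sum, 10.0);
}
```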
The problem appears to be that the compiler uses +0.0 rather than -0.0 as the start value of `@llvm.vector.reduce.fadd.*`. Under IEEE 754, -0.0 is the additive identity (`x + -0.0 == x` for every `x`, including `x == -0.0`), while +0.0 is not (`-0.0 + 0.0 == +0.0`), so without the `nsz` fast-math flag LLVM can only fold the trailing scalar add away when the start value is -0.0. Comparing LLVM code generation for the two start values, we get the more efficient version with -0.0: https://godbolt.org/z/fhaz7ced6
```llvm
define float @reduce_fadd_positive_zero(ptr %p) {
  %v = load <4 x float>, ptr %p, align 16
  %result = tail call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> %v)
  ret float %result
}

define float @reduce_fadd_negative_zero(ptr %p) {
  %v = load <4 x float>, ptr %p, align 16
  %result = tail call reassoc float @llvm.vector.reduce.fadd.v4f32(float -0.000000e+00, <4 x float> %v)
  ret float %result
}

declare float @llvm.vector.reduce.fadd.v4f32(float, <4 x float>)
```
This generates the following assembly for AArch64:
```asm
reduce_fadd_positive_zero:          // @reduce_fadd_positive_zero
        ldr     q1, [x0]
        movi    d0, #0000000000000000
        faddp   v1.4s, v1.4s, v1.4s
        faddp   s1, v1.2s
        fadd    s0, s1, s0
        ret

reduce_fadd_negative_zero:          // @reduce_fadd_negative_zero
        ldr     q0, [x0]
        faddp   v0.4s, v0.4s, v0.4s
        faddp   s0, v0.2s
        ret
```
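The extra `movi`/`fadd s0, s1, s0` pair in the first version is LLVM preserving the sign of zero, per the identity argument above; a quick check in plain Rust:

```rust
fn main() {
    // -0.0 is the additive identity: adding it never changes a value,
    // not even for x == -0.0, so a -0.0 accumulator can be deleted.
    assert_eq!((-0.0f32 + -0.0f32).to_bits(), (-0.0f32).to_bits());
    // +0.0 is not: -0.0 + 0.0 flips the sign of zero, so the +0.0
    // accumulator must survive as the trailing `fadd` above.
    assert_eq!((-0.0f32 + 0.0f32).to_bits(), (0.0f32).to_bits());
}
```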
To me, this behaviour looks like it is caused by the compiler using +0.0 instead of -0.0 as the accumulator here:

`compiler/rustc_codegen_llvm/src/intrinsic.rs`, lines 2095 to 2101 at a3af208
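If so, the fix should be a one-constant change: seed the `@llvm.vector.reduce.fadd` call with -0.0 rather than +0.0. A self-contained model of the lowering (the `reduce_fadd` helper below is hypothetical and stands in for the generated code, not for any rustc API) shows why only the -0.0 seed is foldable:

```rust
// Model of the lowered reduction: fold the vector lanes into a seed.
fn reduce_fadd(seed: f32, v: [f32; 4]) -> f32 {
    v.iter().fold(seed, |acc, &x| acc + x)
}

fn main() {
    // An all-(-0.0) vector is the case where the seed is observable.
    let v = [-0.0f32; 4];
    // A -0.0 seed reproduces the lane-only result, so the backend may
    // delete the seed, and with it the extra scalar `fadd`.
    assert_eq!(reduce_fadd(-0.0, v).to_bits(), (-0.0f32).to_bits());
    // A +0.0 seed flips the sign of the zero result, so LLVM has to
    // keep the `movi`/`fadd` pair seen in the assembly above.
    assert_eq!(reduce_fadd(0.0, v).to_bits(), (0.0f32).to_bits());
}
```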