Description
While playing with `saturating_mul` to see if it can be made branchless and thus a `const fn` (sorry @RalfJung ;) ), I found that the performance of signed `saturating_mul` can actually be improved.
I tested two implementations:
- The current library implementation:

  ```rust
  pub fn saturating_mul(a: i32, b: i32) -> i32 { a.saturating_mul(b) }
  ```

- A branchless `const fn` alternative:

  ```rust
  #[inline]
  const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
      // If cond is false: not_mask is -1 == all ones
      // If cond is true: not_mask is 0 == all zeros
      let not_mask = (cond as i32).wrapping_sub(1);
      // If cond is false: (a & !all_ones) | (b & all_ones) == b
      // If cond is true: (a & !all_zeros) | (b & all_zeros) == a
      (a & !not_mask) | (b & not_mask)
  }

  pub const fn saturating_mul(a: i32, b: i32) -> i32 {
      let (val, overflow) = a.overflowing_mul(b);
      cond_if_else(
          !overflow,
          val,
          cond_if_else(
              (a < 0) != (b < 0),
              i32::min_value(),
              i32::max_value(),
          ),
      )
  }
  ```
The `cond_if_else` function acts kind of like C's ternary operator, including its constness, but works only for integers.
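Just to illustrate the intended behavior (an example only, not part of the proposed change):

```rust
// cond_if_else(cond, a, b) is meant to compute the same value as
// `if cond { a } else { b }` (C's `cond ? a : b`), but without a branch,
// which also lets it be evaluated in const contexts:
const X: i32 = cond_if_else(true, 1, 2);  // == 1
const Y: i32 = cond_if_else(false, 1, 2); // == 2
```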
According to llvm-mca, the second implementation has a reciprocal throughput that is better by a factor of about 1.5: https://godbolt.org/z/6PnCwB
For unsigned, the second case can become:
```rust
pub const fn saturating_mul(a: u32, b: u32) -> u32 {
    let (val, overflow) = a.overflowing_mul(b);
    cond_if_else(!overflow, val, u32::max_value())
}
```
In this case there is no performance improvement (LLVM reduces both versions to the same IR), so the only advantage here would be constness: https://godbolt.org/z/0Fs0AD
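Note that the unsigned snippet assumes a `u32` counterpart of `cond_if_else` (the helper shown earlier is written for `i32`; in practice it would need its own name or a generic/macro formulation). A minimal sketch using the same mask trick:

```rust
// Hypothetical u32 variant of cond_if_else, shown only to make the unsigned
// example self-contained; same mask-based select as the i32 version.
#[inline]
const fn cond_if_else(cond: bool, a: u32, b: u32) -> u32 {
    // cond == false -> not_mask is u32::MAX (all ones), so b is selected
    // cond == true  -> not_mask is 0 (all zeros), so a is selected
    let not_mask = (cond as u32).wrapping_sub(1);
    (a & !not_mask) | (b & not_mask)
}
```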
I found some similar improvements for `saturating_sub`, even though currently the implementation uses `intrinsics::saturating_sub`; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE
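For reference, a branchless formulation in the same style as the signed `saturating_mul` above; this is only a sketch, and the exact variant measured in the godbolt link may differ:

```rust
// Sketch only: branchless saturating_sub built on the same i32 cond_if_else helper.
pub const fn saturating_sub(a: i32, b: i32) -> i32 {
    let (val, overflow) = a.overflowing_sub(b);
    cond_if_else(
        !overflow,
        val,
        // On overflow, the sign of `a` tells which bound was crossed:
        // a < 0 means the result went below i32::MIN, otherwise above i32::MAX.
        cond_if_else(a < 0, i32::min_value(), i32::max_value()),
    )
}
```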
My concern is that since the LLVM IR is different, even though llvm-mca indicates the change is an improvement, there might be some case I don't know about where performance regresses. Maybe there are other platforms where the saturating intrinsics result in better performance? I think @nikic has a better understanding of these concerns.