Skip to content

performance of saturating_mul can be improved by removing branches #65309

Closed
@tspiteri

Description

@tspiteri

While playing with saturating_mul to see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signed saturating_mul can actually be improved.

I tested two implementations:

  1. pub fn saturating_mul(a: i32, b: i32) -> i32 {
        a.saturating_mul(b)
    }
  2. #[inline]
    const fn cond_if_else(cond: bool, a: i32, b: i32) -> i32 {
        // If cond is false: not_mask is -1 == all ones
        // If cond is true: not_mask is 0 == all zeros
        let not_mask = (cond as i32).wrapping_sub(1);
        // If cond is false: (a & !all_ones) | (b & all_ones) == b
        // If cond is true: (a & !all_zeros) | (b & all_zeros) == a
        (a & !not_mask) | (b & not_mask)
    }
    
    pub const fn saturating_mul(a: i32, b: i32) -> i32 {
        let (val, overflow) = a.overflowing_mul(b);
        cond_if_else(
            !overflow,
            val,
            cond_if_else(
                (a < 0) != (b < 0),
                i32::min_value(),
                i32::max_value(),
            ),
        )
    }

The cond_if_else function acts kind of like C's ternary operator, including its constness, but works only for integers.

The second case seems to have a reciprocal throughput better by a factor of about 1.5 according to llvm-mca: https://godbolt.org/z/6PnCwB

For unsigned, the second case can become:

pub const fn saturating_mul(a: u32, b: u32) -> u32 {
    let (val, overflow) = a.overflowing_mul(b);
    cond_if_else(!overflow, val, u32::max_value())
}

In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD

I found some similar improvements for saturating_sub, even though currently the implementation uses intrinsics::saturating_sub; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeE

My concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result better performance? I think @nikic has a better understanding of these concerns.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.C-bugCategory: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions