Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions #55202

Open
@Hendiadyoin1

x87 and SSE provide simple round-and-convert store instructions that are essentially equivalent to the l{0,2}rint[fl]? family (rint, lrint, llrint and their f/l variants).

Clang/LLVM does not seem to replace calls to rint with these instructions, nor does it always vectorise such calls when they are used to round/convert vectors.
(Truncation is replaced properly.)

Some examples follow.

GCC is listed as well.
The main differences are that GCC schedules its fldcw for truncation earlier, replaces rintl, and uses some bit magic for rintf.

Note: Using f32x4 for float __vector(4) and i32x4 for int __vector(4)
Note: cvtss2si != cvttss2si
Note: Assuming overflows etc. are UB, so the hardware's behaviour is acceptable
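To make the cvtss2si vs. cvttss2si note concrete, here is a minimal sketch using the SSE intrinsics that map to those two instructions. The function names are hypothetical, and it assumes the default MXCSR rounding mode (round to nearest, ties to even).

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_cvtss_si32, _mm_cvttss_si32 */

/* cvtss2si: converts using the current MXCSR rounding mode
   (assumed here to be the default, round-to-nearest-even). */
int convert_rounding(float x) {
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: always truncates toward zero, regardless of MXCSR. */
int convert_truncating(float x) {
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode the two differ for any input with a non-zero fractional part that rounds away from zero, e.g. 2.7f gives 3 with the first function and 2 with the second.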

| Scenario | LLVM | GCC | Effective instruction(s) |
| --- | --- | --- | --- |
| rintl | call rintl@PLT | frndint | frndint |
| (int)rintl | call rintl@PLT + truncation | call rintl@PLT + truncation | fistp m16/m32/m64 |
| lrintl | call lrintl | call lrintl | fistp m16/m32/m64 |
| lrint | call lrintl | call lrintl | cvtss2si r32/r64, xmmX |
| (int)rintf | call rintf@PLT; cvttss2si | Bit magic + cvttss2si | cvtss2si r32, xmmX |
| (int)rintf (SSE4.2) | roundss + cvttss2si | roundss + cvttss2si | cvtss2si r32, xmmX |
| 4x lrintf (f32x4->i32x4) | 4x (shuffle + call lrintl) | 4x (shuffle + call lrintl) | cvtps2dq xmmY, xmmX |
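For the x87 rows, the following sketch shows what the "effective instruction" column amounts to, written as GCC/Clang extended inline assembly. It is an illustration only, not what either compiler emits: the function names are hypothetical, it assumes an x86 target with the default x87 control word (round to nearest), and, per the note above, it ignores overflow.

```c
/* rintl equivalent: frndint rounds st(0) in place using the
   rounding mode in the x87 control word. */
long double x87_round(long double x) {
    __asm__("frndint" : "+t"(x));
    return x;
}

/* lrintl-style conversion to a 32-bit int: fistp rounds st(0),
   stores it to memory and pops the x87 stack (hence the "st" clobber). */
int x87_round_to_int(long double x) {
    int result;
    __asm__ __volatile__("fistpl %0" : "=m"(result) : "t"(x) : "st");
    return result;
}
```

Both functions round rather than truncate, so no fldcw is needed as long as the control word is left at its default.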

Tested using godbolt (Compiler Explorer) with x86_64 Clang 14.0.0 and x86_64 GCC 11.2 at -O2 and -O3


Update: It seems most cases are now caught; only coalescing multiple cvtss2si instructions into a single cvtps2dq remains.
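The remaining case can be sketched as follows, assuming the GCC/Clang vector extension for the f32x4 and i32x4 types and the default MXCSR rounding mode. The function names are hypothetical, and intrinsics stand in for the instructions named above; in the original scenario the scalar form would come from four lrintf calls.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtps_epi32; also pulls in the SSE scalar intrinsics */

typedef float f32x4 __attribute__((vector_size(16)));
typedef int   i32x4 __attribute__((vector_size(16)));

/* Lane-by-lane form: four separate cvtss2si conversions. */
i32x4 convert_scalar(f32x4 v) {
    i32x4 out = {0, 0, 0, 0};
    for (int i = 0; i < 4; ++i)
        out[i] = _mm_cvtss_si32(_mm_set_ss(v[i]));
    return out;
}

/* Coalesced form: one cvtps2dq converts all four lanes at once. */
i32x4 convert_packed(f32x4 v) {
    return (i32x4)_mm_cvtps_epi32((__m128)v);
}
```

Both forms use the same rounding mode, so they should produce identical lanes for in-range inputs, which is what makes the coalescing legal under the overflow assumption stated earlier.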
