Description
x87 and SSE have simple rounding and converting store instructions, which are essentially equivalent to `l{0,2}rint[fl]?`.
Clang/LLVM does not seem to replace calls to `rint` with these, nor does it vectorise them in all cases when they are used to round/convert vectors.
(Truncation is properly replaced.)
Some examples follow below. GCC is listed as well.
The main differences are that GCC schedules its `fldcw` for truncation earlier, replaces `rintl`, and uses some bit magic for `rintf`.
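
The exact bit-magic sequence GCC emits for `rintf` is not reproduced in this report. As a rough illustration of what this kind of trick usually looks like, here is a minimal sketch of the well-known "add and subtract 2^23" approach. The function name `rint_magic` is hypothetical, and the sketch assumes the default round-to-nearest mode and no `-ffast-math`; it is not claimed to be GCC's actual expansion.

```c
#include <math.h>  /* fabsf, copysignf */

/* Round to the nearest integer (ties to even), assuming the default
   round-to-nearest FP mode and no -ffast-math. Hypothetical helper. */
float rint_magic(float x) {
    const float two23 = 8388608.0f;       /* 2^23: floats at or above this have no fraction bits */
    float ax = fabsf(x);
    if (!(ax < two23))                    /* already integral, infinite, or NaN */
        return x;
    volatile float shifted = ax + two23;  /* the addition rounds away the fraction */
    float r = shifted - two23;
    return copysignf(r, x);               /* restore the sign, including -0.0f */
}
```

The result still needs a conversion to `int`, which is where the `cvttss2si` in the GCC column comes from.
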
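Note: Using `f32x4` for `float __vector(4)` and `i32x4` for `int __vector(4)`.
Note: `cvtss2si` != `cvttss2si` (the former rounds according to the current MXCSR rounding mode, the latter always truncates toward zero).
Note: Assuming overflows etc. are UB, and the hardware's behaviour is acceptable.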
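
To make the `cvtss2si` vs. `cvttss2si` distinction concrete, here is a small sketch using the SSE intrinsics that map to those two instructions. The wrapper names `convert_rounding` and `convert_truncating` are made up for illustration, and the stated results assume the default MXCSR rounding mode (round to nearest, ties to even).

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_cvtss_si32, _mm_cvttss_si32 */

/* cvtss2si: converts using the current MXCSR rounding mode
   (round-to-nearest-even unless the program changed it). */
int convert_rounding(float x) {
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: always truncates toward zero. */
int convert_truncating(float x) {
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, `convert_rounding(2.7f)` gives 3 while `convert_truncating(2.7f)` gives 2, so `rintf` followed by a truncating conversion could in principle be a single `cvtss2si`.
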
Scenario | LLVM | GCC | Effective instruction(s) |
---|---|---|---|
`rintl` | `call rintl@PLT` | `frndint` | `frndint` |
`(int)rintl` | `call rintl@PLT` + truncation | `call rintl@PLT` + truncation | `fistp m16/m32/m64` |
`lrintl` | `call lrintl` | `call lrintl` | `fistp m16/m32/m64` |
`lrint` | `call lrintl` | `call lrintl` | `cvtss2si r32/r64, xmmX` |
`(int)rintf` | `call rintf@PLT; cvttss2si` | Bit magic + `cvttss2si` | `cvtss2si r32, xmmX` |
`(int)rintf` (SSE4.2) | `roundss` + `cvttss2si` | `roundss` + `cvttss2si` | `cvtss2si r32, xmmX` |
4x `lrintf` (f32x4->i32x4) | 4x (shuffle + `call lrintl`) | 4x (shuffle + `call lrintl`) | `cvtps2dq xmmY, xmmX` |
Tested using Godbolt (Compiler Explorer) with x86_64 Clang 14.0.0 and x86_64 GCC 11.2 at `-O2` and `-O3`.
Update: Seems like most cases are now caught, only coalescing `cvtss2si`s to `cvtps2dq`