Improve Float Parsing Speeds by up to 99.94% Through Improvements to the Bellerophon Algorithm

# Summary

When the fast-path algorithm cannot be used (see #85198), Rust defaults back to a the Bellerophon algorithm, based off this [paper](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4152). Examples of floats that can be correctly parsed via the Bellerophon algorithm include `"9007199254740992.0"` (`1 << 53`), while near-halfway cases such `"9007199254740992992e-3"` must fall back to slower algorithms (just less than halfway of `(1 << 53) + 1`). Unfortunately, the current implementation of the Bellerophon algorithm requires the use of arbitrary-precision arithmetic, which can lead to a 10,000x performance penalty.

Please see the "Sample Repository" below for the exact specifics, or in order to replicate these changes. This is an [initial attempt](https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670) as part of an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back in the core library.

# Issue

When parsing floating-point numbers, if the number cannot be exactly parsed by the fast-path algorithm, it falls back to an extended-precision representation (often consisting of 64-bits for the significant digits, or mantissa, and 16-bits for the exponent). For a more detailed description of halfway cases, see the halfway cases section below.

If this extended-precision algorithm can be unambiguously rounded to the nearest native float, by showing that the max error is less than the different to the nearest halfway case, then we have an accurate representation and can skip slower algorithms.

These slower algorithms make use of arbitrary-precision arithmetic to exactly represent the significant digits of the float, and therefore round to the nearest native float. The current implementation of Bellerophon, however, generates the significant digits from a [big integer](https://github.com/rust-lang/rust/blob/28e2b29b8952485679367cc05699fb5154f4e5c3/library/core/src/num/dec2flt/algorithm.rs#L161-L178), which leads to significantly reduced performance.

By using a 64-bit representation of the significant digits parsed from the first 19-20 digits of the float, we can improve performance by orders of magnitude.

# Halfway Cases

When parsing floats, the most significant problem is determining how to round the resulting value. The IEEE-754 standard specifies rounding to nearest, then tie even.

For example, using this rounding scheme to decimal numbers:
- `8.9` would round to `9.0`.
- `9.1` would round to `9.0`.
- `9.5` would round to `10.0`.
- `10.5` would round to `10.0`.

With parsing from decimal strings to binary, fixed-width floating point numbers, we must round to the nearest float. This becomes tricky when values are near their halfway point. For example, with a single-precision float `f32`, we would round as follows:
- `16777216.9` rounds to `16777216.0`
- `16777217.0` rounds to `16777216.0`
- `16777217.1` rounds to `16777218.0`

This is easier illustrated if we represent the float in binary. First, here's the layout of an IEEE-754 single-precision float as bits:

🟦🟩🟩🟩🟩🟩🟩🟩🟩🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪

Where:
- 🟦 is the sign bit.
- 🟩 are the exponent bits.
- 🟪 are the mantissa, or significant digit, bits.

We'll ignore the exponent and sign bits right now, and only consider the mantissa, or significant digits. The lowest exponent bit, also called the hidden bit, is used as an implicit, extra bit of precision for normal floats, meaning we have 24-bits of precision. For 3 numbers, we would therefore have the following representations, where the last bit is truncated off:

- `16777216.0` => `100000000000000000000000 0`
- `16777217.0` => `100000000000000000000000 1`
- `16777218.0` => `100000000000000000000001 0`

Therefore, `16777217.0` is exactly halfway between `16777216.0` and `16777218.0`. Although solving these halfway cases can superficially seem easy, simple algorithms will fail even when parsing the shortest, accurate decimal representation.

# Binary Sizes

These were compiled on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The sizes reflect the binary sizes reported by `ls -sh`, both before and after running the `strip` command. The debug profile was used for opt-levels `0` and `1`, and was as follows:

```toml
[profile.dev]
opt-level = "..."
debug = true
lto = false
```

The release profile was used for opt-levels `2`, `3`, `s` and `z` and was as follows:

```toml
[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true
```

**core**

These are the binary sizes prior to making changes.

opt-level|size|size(stripped)
|:-:|:-:|:-:|
0|3.6M|360K
1|3.5M|316K
2|1.3M|236K
3|1.3M|248K
s|1.3M|244K
z|1.3M|248K

**moderate**

These are the binary sizes after making changes to speed up the Bellerophon algorithm.

opt-level|size|size(stripped)
|:-:|:-:|:-:|
0|3.6M|364K
1|3.5M|316K
2|1.3M|248K
3|1.3M|252K
s|1.3M|244K
z|1.3M|244K

# Performance

Overall, the changes to speed up Bellerophon algorithm led to a:
- ~-79% change in performance for the `MODERATE` float.
- ~-99.7% change in performance for the `LARGE` float.
- ~-99.94% change in performance for the `DENORMAL` float.

And it did not affect the performance of the fast-path algorithm.

These benchmarks were run on an `i7-6560U CPU @ 2.20GHz`, on a target of `x86_64-unknown-linux-gnu`, running kernel version `5.11.16-100`, on a Rust version of `rustc 1.53.0-nightly (132b4e5d1 2021-04-13)`. The performance CPU governor was used for all benchmarks, and were run on A/C power with only tmux and Sublime Text open for all benchmarks. The floats that were parsed are as follows:

```rust
// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
```

**core**

These are the benchmarks prior to making changes.

|float|speed|
|:-:|:-:|
|fast|32.952ns|
|disguised|129.86ns|
|moderate|237.08ns|
|halfway|371.21ns|
|large|287.81us|
|denormal|122.36us|

**moderate**

These are the binary sizes after making changes to speed up the Bellerophon algorithm.

|float|speed|
|:-:|:-:|
|fast|26.668ns|
|disguised|34.599ns|
|moderate|49.378ns|
|halfway|224.81ns|
|large|796.34ns|
|denormal|63.763ns|

# Correctness Concerns

There are a few correctness concerns, since this uses a potentially truncated representation of the significant digits for error calculation. I've therefore made the error detection stricter, so it rejects more [halfway cases](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/algorithm.rs#L168-L184) than before and correctly compounds error with truncated cases and [non-normalized representations](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/algorithm.rs#L186-L188) after multiplication.

In practice, this only rejects a [handful of cases](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/tests/bellerophon.rs) that would be normally accept by the algorithm, with a major benefit to overall performance.

I've also extended the powers-of-10 to handle [denormal floats](https://github.com/Alexhuszagh/rust-dec2flt/blob/ad4596cc85ee3647cb20d594651a6061b16ca6a9/src/dec2flt/table.rs#L4-L6), as well as values that could lead to Infinity, and updated the internal logic to ensure correct rounding.

This passes all of Rust's float parsing [tests](https://github.com/rust-lang/rust/blob/master/src/etc/test-float-parse/runtests.py), as well as carefully crafted examples to try to detect errors, and therefore is unlikely to have any correctness issues.

# Sample Repository

I've created a simple, minimal repository tracking these changes on [rust-dec2flt](https://github.com/Alexhuszagh/rust-dec2flt), which has a [core branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/core) that is identical to Rust's current implementation in the core library. The [moderate branch](https://github.com/Alexhuszagh/rust-dec2flt/tree/moderate) contains the changes to improve parsing speeds for Bellerophon algorithm.  This currently relies on changes made to [infer binary exponents](https://github.com/rust-lang/rust/issues/85214), however, can be trivially re-written to explicitly store them.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Float Parsing Speeds by up to 99.94% Through Improvements to the Bellerophon Algorithm #85234

Summary

Issue

Halfway Cases

Binary Sizes

Performance

Correctness Concerns

Sample Repository

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

float	speed
fast	32.952ns
disguised	129.86ns
moderate	237.08ns
halfway	371.21ns
large	287.81us
denormal	122.36us

float	speed
fast	26.668ns
disguised	34.599ns
moderate	49.378ns
halfway	224.81ns
large	796.34ns
denormal	63.763ns

opt-level	size	size(stripped)
0	3.6M	360K
1	3.5M	316K
2	1.3M	236K
3	1.3M	248K
s	1.3M	244K
z	1.3M	248K

opt-level	size	size(stripped)
0	3.6M	364K
1	3.5M	316K
2	1.3M	248K
3	1.3M	252K
s	1.3M	244K
z	1.3M	244K

Improve Float Parsing Speeds by up to 99.94% Through Improvements to the Bellerophon Algorithm #85234

Description

Summary

Issue

Halfway Cases

Binary Sizes

Performance

Correctness Concerns

Sample Repository

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions