Rewrite libcore's UTF-8 validation for performance #107760

Conversation
r? @m-ou-se (rustbot has picked a reviewer for you, use r? to override)

Hey! It looks like you've submitted a new PR for the library teams!

@rustbot author
@bors try @rust-timer queue

⌛ Trying commit f254d4c with merge f6005e27d21dc675f50fe61b6992a431700cefe5...
```rust
let was_mid_char = state != END;
debug_assert!(state != ERR);
if !tail.is_empty() {
    // Check and early return if the last CHUNK_LEN bytes were all ASCII. The
```
(note to self: reword this comment, since it seems like I got distracted halfway through it)
```rust
// Use a generic to help the compiler out some — we pass both `&[u8]` and
// `&[u8; CHUNK_LEN]` in here, and would like it to know about the constant.
#[must_use]
#[inline]
```
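As an aside, the `&[u8]`/`&[u8; CHUNK_LEN]` trick that comment describes can be sketched in isolation (a toy example; `all_ascii` is not the PR's function): a single generic bound lets the same function accept both types, while the array's length stays a compile-time constant at the array call sites.

```rust
// Generic over anything that views into a byte slice. When called with a
// `&[u8; N]`, the monomorphized copy knows N at compile time; when called
// with a `&[u8]`, it works on the dynamic length.
fn all_ascii<B: AsRef<[u8]> + ?Sized>(bytes: &B) -> bool {
    bytes.as_ref().iter().all(|&b| b < 0x80)
}

fn main() {
    let arr: [u8; 16] = *b"0123456789abcdef";
    let slice: &[u8] = b"caf\xC3\xA9";
    assert!(all_ascii(&arr)); // length 16 is a constant in this instantiation
    assert!(!all_ascii(slice)); // plain slices work too
}
```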
Ideally this would not be inlined for the chunk case (so split into separate functions for chunk vs tail), since LLVM sometimes completely beefs it after inlining for whatever reason, and perf tanks.

I had thought that we needed all of `str::from_utf8` inlinable so that LLVM could const-fold it, hence the old version being `inline(always)`, but apparently `str::from_utf8` itself is not inline, so that must just be a case where I was mistaken.
```rust
// check the length first, which can end up having some pretty disastrous
// impacts on performance, seemingly due to inlining(?). In any case.
//
// Note that doing this for many sizes
```
Forgot to finish this comment. Was going to mention that this seems to still beat the naïve version even if the branch on length is hard to predict, but that stops holding if you add more length conditions.
☀️ Try build successful - checks-actions
```rust
debug_assert!(!inp.is_empty() && inp.get(..pos).is_some());
while pos != 0 {
    pos -= 1;
    let is_cont = (inp[pos] & 0b1100_0000) == 0b1000_0000;
```
Well, this isn't hot code so probably not significant, but the `as i8 >= -64` construction from the old impl should take one less instruction, since it can use status flags for signed operations for that comparison. There is also the `utf8_is_cont_byte` method for that.
Yeah good point. I think there's a function in the parent module too.
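For the curious, the two checks really are equivalent; a standalone sketch (the function names here are illustrative, not libcore's helpers):

```rust
/// Both expressions test whether `b` is a UTF-8 continuation byte
/// (0b10xx_xxxx, i.e. 0x80..=0xBF).
fn is_cont_mask(b: u8) -> bool {
    // Mask-and-compare: needs an AND before the comparison.
    (b & 0b1100_0000) == 0b1000_0000
}

fn is_cont_signed(b: u8) -> bool {
    // Signed compare: 0x80..=0xBF reinterpreted as i8 is -128..=-65,
    // so a single signed comparison against -64 suffices.
    (b as i8) < -64
}

fn main() {
    // Exhaustively confirm the two checks agree on every byte value.
    for b in 0..=255u8 {
        assert_eq!(is_cont_mask(b), is_cont_signed(b));
    }
}
```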
8kb can be pretty big in an embedded context. Can we keep the older, small no-table function available, perhaps under some other name or something?
We have a bunch of places in the standard library where that's a concern. Ideally we'd have some umbrella cfg to make that tradeoff.
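A rough sketch of what such an umbrella cfg could look like (the feature name and both function bodies below are placeholders, not existing libcore code):

```rust
// Hypothetical umbrella cfg for a size/speed tradeoff: pick the
// table-driven validator by default, and a small table-free one when
// the user opts into minimizing code size.

#[cfg(not(feature = "optimize_for_size"))]
fn run_utf8_validation(input: &[u8]) -> bool {
    // Default: the fast, table-driven DFA validator (table in rodata).
    std::str::from_utf8(input).is_ok() // placeholder body
}

#[cfg(feature = "optimize_for_size")]
fn run_utf8_validation(input: &[u8]) -> bool {
    // Opt-in: the old, small, table-free validator.
    std::str::from_utf8(input).is_ok() // placeholder body
}

fn main() {
    assert!(run_utf8_validation("héllo".as_bytes()));
}
```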
Finished benchmarking commit (f6005e27d21dc675f50fe61b6992a431700cefe5): comparison URL.

Overall result: no relevant changes - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: this benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): this benchmark run did not return any relevant results for this metric.

Cycles: this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Genuinely not sure why I thought it was 8kb; the table is a `[u32; 256]`, so 1KiB.
We don't track cycles in perf because it's not very reliable, but it's pretty nice to see that this is a 3.5% reduction there. (Could be noise ofc, but you'd expect something like this from an algo that is designed to leverage ILP/pipelining.)
One useful feature that I added to the Julia port of the DFA was to create an ASCII state for the machine, using one of the unused states in the original design. This means that even though you should be bulk-checking for ASCII some other way, if on a short string you just put it through the DFA, it will tell you whether it is ASCII-only. You can see that in the diagram here: ndinsmore/julia. Also important: in the Julia port I changed the order of the ops so that the state returned is always "clean".
@ndinsmore thanks a lot. I think the adjustment you describe to the DFA isn't directly useful, but it gives me a really good idea for a similar change.
Making that change does have a downside in Rust though — it would lead to additional branches in the code when overflow checks were enabled. Avoiding that would require what we're doing here (or using wrapping arithmetic).
I'm surprised you saw a perf difference, but this kind of thing can be very fiddly (I don't see one, but my states are 32 bits, and I'm using Rust, not Julia — hard to say). That said, I don't really expect this is the kind of code that LLVM can vectorize no matter how you write it.
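For context on the overflow-check point, a minimal sketch (not code from this PR): with overflow checks enabled, plain `+` compiles to an add plus a branch to a panic, while the wrapping form is a bare add in all build modes.

```rust
fn bump_checked(x: u32) -> u32 {
    // With overflow checks on (e.g. debug builds), this is an add plus
    // a conditional branch to a panic on overflow.
    x + 1
}

fn bump_wrapping(x: u32) -> u32 {
    // Always a bare add; wraps around on overflow instead of panicking.
    x.wrapping_add(1)
}

fn main() {
    assert_eq!(bump_wrapping(u32::MAX), 0);
    assert_eq!(bump_checked(41), 42);
}
```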
@thomcc how did you manage to fit the encoding in 32 bits? You need at least 9 states by my count and I can't figure out how you packed them in.
While 32-bit rows aren't sufficient to represent all 9-state DFAs, some 9-state DFAs are still representable. Soon after the original article, @dougallj used an SMT solver to find a 32-bit-compatible encoding of the DFA transitions for UTF-8 validation. That's what Postgres uses, and @thomcc's implementation appears to take the same approach: https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1791
There don't seem to be that many resources available online on DFA-based UTF-8 validation; the best one I could find was this: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
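As a small aside for readers landing here cold: the shift-DFA trick itself (from Per Vognsen's gist linked elsewhere in this thread) is compact enough to show in full on a toy machine. This is an illustrative sketch, not the UTF-8 automaton: states are encoded as bit offsets into a packed row, so a single shift performs a whole transition.

```rust
// Toy shift-based DFA: track whether we've seen an even or odd number
// of b'1' bytes. States are bit offsets into a packed 32-bit row:
const EVEN: u32 = 0; // bits 0..5 of each row hold next-state-from-EVEN
const ODD: u32 = 5;  // bits 5..10 of each row hold next-state-from-ODD

fn row(byte: u8) -> u32 {
    if byte == b'1' {
        // EVEN -> ODD, ODD -> EVEN
        (ODD << EVEN) | (EVEN << ODD)
    } else {
        // Both states map to themselves.
        (EVEN << EVEN) | (ODD << ODD)
    }
}

fn odd_number_of_ones(input: &[u8]) -> bool {
    let mut state = EVEN;
    for &b in input {
        // One shift per byte, no branches. The high bits of `state` are
        // garbage from neighboring fields, but the next shift amount and
        // the final comparison only look at the low 5 bits.
        state = row(b) >> (state & 31);
    }
    state & 31 == ODD
}

fn main() {
    assert!(odd_number_of_ones(b"0100"));
    assert!(!odd_number_of_ones(b"11"));
}
```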
Alright, I understand now how it's possible to use 9 states in a 32-bit DFA. The states aren't numbered 0-8; instead they use non-contiguous (I think?) values in the range 0-31. That also explains why you'd need an SMT solver to come up with the transitions and state numbers. I still don't see why we're using a DFA with 9 states instead of one with 8, though.
Oh right, I should've read the article more carefully. The diagrams in there omit the error state.
@Sp00ph Sorry, I'll be writing some more documentation soon. I'm aware that at the moment it would be unmaintainable.
@thomcc Ping from triage: Can you post your status on this PR?
I've been busy with various issues, but I'll try to get back to this sooner rather than later so that there can be a comparison between the different approaches for this vs what's used in #111367 (perhaps taking the best of both could be done, though I'm not sure; the ASCII path is not one I spent much time on in this PR).
Ping from triage: can you post your status on this PR? This PR has not received an update in a few months. Thank you!
Ping from triage: I'm closing this due to inactivity. Please reopen when you are ready to continue with this. @rustbot label: +S-inactive
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc @thomcc)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760:

> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seem to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level ([1](rust-lang#107760 (comment))).

### Rationale

1. Performance: this algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
2. Generality: it does not use SIMD instructions and does not rely on the branch predictor for good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata. The main algorithm consists of the following parts:

1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I chose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: check 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:

1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. This shows a 200%+ speedup compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. It did introduce an extra 32-bit shift; I believe it's almost free, but I have not benchmarked it yet.

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benching. It also tests various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations.

Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare), and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, and +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, and +33% on zh. This is expected: the larger the ASCII bypass chunk is, the better it performs on ASCII, but the worse on mixed content like es, because the bypass branch keeps flipping. To me, the difference between 27GiB/s and 47GiB/s on en is minimal in absolute time (144.61ns - 79.86ns = 64.75ns), compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in the PR.

On x86_64-linux, Ryzen 7 5700G @ 3.775GHz:

| Algorithm         | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std               | en             | 47.768 +-0.301       |
| shift-dfa-m16-a16 | en             | 27.337 +-0.002       |
| shift-dfa-m16-a32 | en             | 43.627 +-0.006       |
| std               | es             | 6.339 +-0.010        |
| shift-dfa-m16-a16 | es             | 9.721 +-0.014        |
| shift-dfa-m16-a32 | es             | 8.013 +-0.009        |
| std               | zh             | 1.463 +-0.000        |
| shift-dfa-m16-a16 | zh             | 3.401 +-0.002       |
| shift-dfa-m16-a32 | zh             | 3.407 +-0.001        |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another Tier 1 target. I don't have a machine to play with.
- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.
- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) with a call to the new validation function? It has very similar code doing almost the same thing.
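To make the approach concrete, here is a self-contained sketch of a 64-bit shift-based DFA validator in the spirit of the description above. It is an illustration written directly from the RFC 3629 grammar, not the PR's actual code: the state names, offsets, and table layout are mine, and it omits the chunking, ASCII bypass, and error-position tracking discussed above.

```rust
// 9 states, each encoded as a bit offset (a multiple of 6) into a packed
// 64-bit row, so one shift performs a whole state transition.
const ERR: u64 = 0;  // error (absorbing; its fields in every row stay 0)
const ACC: u64 = 6;  // accept: at a character boundary
const CB1: u64 = 12; // expect 1 continuation byte (0x80..=0xBF)
const CB2: u64 = 18; // expect 2 continuation bytes
const CB3: u64 = 24; // expect 3 continuation bytes
const E0: u64 = 30;  // after 0xE0: expect 0xA0..=0xBF, then 1 more
const ED: u64 = 36;  // after 0xED: expect 0x80..=0x9F, then 1 more
const F0: u64 = 42;  // after 0xF0: expect 0x90..=0xBF, then 2 more
const F4: u64 = 48;  // after 0xF4: expect 0x80..=0x8F, then 2 more

// The RFC 3629 transition function, written out directly.
const fn next(state: u64, byte: u8) -> u64 {
    match state {
        ACC => match byte {
            0x00..=0x7F => ACC,
            0xC2..=0xDF => CB1,
            0xE0 => E0,
            0xE1..=0xEC | 0xEE..=0xEF => CB2,
            0xED => ED,
            0xF0 => F0,
            0xF1..=0xF3 => CB3,
            0xF4 => F4,
            _ => ERR,
        },
        CB1 => if matches!(byte, 0x80..=0xBF) { ACC } else { ERR },
        CB2 => if matches!(byte, 0x80..=0xBF) { CB1 } else { ERR },
        CB3 => if matches!(byte, 0x80..=0xBF) { CB2 } else { ERR },
        E0 => if matches!(byte, 0xA0..=0xBF) { CB1 } else { ERR },
        ED => if matches!(byte, 0x80..=0x9F) { CB1 } else { ERR },
        F0 => if matches!(byte, 0x90..=0xBF) { CB2 } else { ERR },
        F4 => if matches!(byte, 0x80..=0x8F) { CB2 } else { ERR },
        _ => ERR,
    }
}

// For every byte value, pack the next state for each current state into
// one u64 row: 9 states x 6 bits = 54 of the 64 bits used.
const TRANS: [u64; 256] = {
    let mut table = [0u64; 256];
    let states = [ACC, CB1, CB2, CB3, E0, ED, F0, F4]; // ERR fields stay 0
    let mut b = 0;
    while b < 256 {
        let mut row = 0u64;
        let mut i = 0;
        while i < states.len() {
            row |= next(states[i], b as u8) << states[i];
            i += 1;
        }
        table[b] = row;
        b += 1;
    }
    table
};

fn validate_utf8(input: &[u8]) -> bool {
    let mut state = ACC;
    for &b in input {
        // One shift per byte. The high bits of `state` are garbage from
        // neighboring fields, but both the shift amount and the final
        // comparison only use the low 6 bits.
        state = TRANS[b as usize] >> (state & 63);
    }
    state & 63 == ACC
}

fn main() {
    assert!(validate_utf8("héllo, 世界".as_bytes()));
    assert!(!validate_utf8(&[0xE0, 0x80, 0x80])); // overlong encoding
}
```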
This optimizes the `core::str::from_utf8` function significantly (initial measurements indicate that it's often 1.5x faster, especially on non-ASCII input, where it can be up to 3x faster).[^2] It does this mostly by leveraging the shift-based DFA technique (a recent obsession), but also by adding SIMD to the ASCII fast path (and it really just completely rewrites and restructures how the validation is done).
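As an illustration of the kind of branch-light ASCII fast path described here (a generic sketch, not the PR's code; the chunk size and names are mine): OR each chunk's bytes together and test the high bit, a shape LLVM readily auto-vectorizes.

```rust
/// Returns the length of the leading all-ASCII prefix, scanning in
/// 16-byte chunks. The OR-reduction per chunk has no data-dependent
/// branches, so LLVM can auto-vectorize it.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    const CHUNK: usize = 16; // illustrative; the real code tunes this
    let mut i = 0;
    while let Some(chunk) = bytes.get(i..i + CHUNK) {
        let combined = chunk.iter().fold(0u8, |acc, &b| acc | b);
        if combined & 0x80 != 0 {
            break; // this chunk contains a non-ASCII byte
        }
        i += CHUNK;
    }
    // Finish the tail (or locate the non-ASCII byte) one byte at a time.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}

fn main() {
    assert_eq!(ascii_prefix_len(b"hello"), 5);
    assert_eq!(ascii_prefix_len("abcdéf".as_bytes()), 4);
}
```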
For prior art: shift-DFAs are now used for UTF-8 validation in PostgreSQL, and seem to be in progress or under consideration for use in Julia and perhaps Go. Of these, PG's impl is the most similar to this one, at least at a high level.[^1]
This PR is not quite ready for review. I'm mostly getting this up now so I can check some perf runs and such. Assuming they (or further benchmarks) don't reveal issues, I believe the approach is basically complete, but the PR still needs some cleanup work (which I intend to do) before it's ready for review.
I'm deliberately leaving anything that touches other functions as follow-up work that I'll do after this lands. That includes improving `String::from_utf8_lossy` or sharing any logic with `is_ascii`.
(Note: some stuff came up at $dayjob, so it may be a week or two before I finish all this; I just wanted to get this up in the meantime so that it's not sitting in the back of my mind as much.)
Appendix: FAQ

Potential Drawbacks

The const and runtime implementations are split via `const_eval_select`. This is both because LLVM was doing worse on the `while ...` version of some of the loops, and because `core::simd` isn't const-compatible (trait usage). The function we call would still exist either way, though.

Why not `simdutf8`'s algorithm?

Short version: it would mainly benefit `-Zbuild-std` users (okay, and `aarch64` users).

Long version: https://gist.github.com/thomcc/f153a122f680023f937f2c912978b8e6.
Footnotes
[^1]: The main similarity is that PG also uses 32-bit rows in the transition table, and has a special case for ASCII (even if the way we special-case it is very different). Beyond that, the impls are really totally different (theirs has fewer optimizations, doesn't need to track any info for error positions, uses a different UTF-8 automaton, and is just completely different code).
[^2]: The benchmarks I have locally have too much PII to be published as-is, since I derived them from real use including browser history (among other things), but that's a temporary situation. Some preliminary benchmarks based on the `simdutf8` corpus are here. They demonstrate a speedup in basically every case (across all string sizes and character compositions), although the real-world improvement seems to be even higher (the impl this replaces has branch misprediction issues for non-ASCII, which are not reflected in these benchmarks).