Rewrite libcore's UTF-8 validation for performance #107760

Conversation
r? @m-ou-se (rustbot has picked a reviewer for you, use r? to override)

Hey! It looks like you've submitted a new PR for the library teams!

@rustbot author
@bors try @rust-timer queue

⌛ Trying commit f254d4c with merge f6005e27d21dc675f50fe61b6992a431700cefe5...
```rust
let was_mid_char = state != END;
debug_assert!(state != ERR);
if !tail.is_empty() {
    // Check and early return if the last CHUNK_LEN bytes were all ASCII. The
```
(note to self: reword this comment, since it seems like I got distracted halfway through it)
```rust
// Use a generic to help the compiler out some — we pass both `&[u8]` and
// `&[u8; CHUNK_LEN]` in here, and would like it to know about the constant.
#[must_use]
#[inline]
```
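As an aside, the `&[u8]`/`&[u8; CHUNK_LEN]` trick that comment describes can be sketched in isolation (a toy example; `all_ascii` is not the PR's function): a single generic bound lets the same function accept both types, while the array's length stays a compile-time constant at the array call sites.

```rust
// Generic over anything that views into a byte slice. When called with a
// `&[u8; N]`, the monomorphized copy knows N at compile time; when called
// with a `&[u8]`, it works on the dynamic length.
fn all_ascii<B: AsRef<[u8]> + ?Sized>(bytes: &B) -> bool {
    bytes.as_ref().iter().all(|&b| b < 0x80)
}

fn main() {
    let arr: [u8; 16] = *b"0123456789abcdef";
    let slice: &[u8] = b"caf\xC3\xA9";
    assert!(all_ascii(&arr)); // length 16 is a constant in this instantiation
    assert!(!all_ascii(slice)); // plain slices work too
}
```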
Ideally this would not be inlined for the chunk case (so split into separate functions for chunk vs tail), since LLVM sometimes completely beefs it after inlining for whatever reason, and perf tanks.

I had thought that we needed all of `str::from_utf8` inlinable so that LLVM could const-fold it, hence the old version being `inline(always)`, but apparently `str::from_utf8` itself is not inline, so that must just be a case where I was mistaken.
```rust
// check the length first, which can end up having some pretty disastrous
// impacts on performance, seemingly due to inlining(?). In any case.
//
// Note that doing this for many sizes
```
Forgot to finish this comment. Was going to mention that this seems to still beat the naïve version even if the branch on length is hard to predict, but that stops holding if you add more length conditions.
☀️ Try build successful - checks-actions
```rust
debug_assert!(!inp.is_empty() && inp.get(..pos).is_some());
while pos != 0 {
    pos -= 1;
    let is_cont = (inp[pos] & 0b1100_0000) == 0b1000_0000;
```
Well, this isn't hot code so probably not significant, but the `as i8 >= -64` construction from the old impl should take one less instruction, since it can use status flags for signed operations for that comparison. There is also the `utf8_is_cont_byte` method for that.
Yeah good point. I think there's a function in the parent module too.
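For the curious, the two checks really are equivalent; a standalone sketch (the function names here are illustrative, not libcore's helpers):

```rust
/// Both expressions test whether `b` is a UTF-8 continuation byte
/// (0b10xx_xxxx, i.e. 0x80..=0xBF).
fn is_cont_mask(b: u8) -> bool {
    // Mask-and-compare: needs an AND before the comparison.
    (b & 0b1100_0000) == 0b1000_0000
}

fn is_cont_signed(b: u8) -> bool {
    // Signed compare: 0x80..=0xBF reinterpreted as i8 is -128..=-65,
    // so a single signed comparison against -64 suffices.
    (b as i8) < -64
}

fn main() {
    // Exhaustively confirm the two checks agree on every byte value.
    for b in 0..=255u8 {
        assert_eq!(is_cont_mask(b), is_cont_signed(b));
    }
}
```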
8kb can be pretty big in an embedded context. Can we keep the older, small no-table function available, perhaps under some other name or something?
We have a bunch of places in the standard library where that's a concern. Ideally we'd have some umbrella cfg to make that tradeoff.
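A rough sketch of what such an umbrella cfg could look like (the feature name and both function bodies below are placeholders, not existing libcore code):

```rust
// Hypothetical umbrella cfg for a size/speed tradeoff: pick the
// table-driven validator by default, and a small table-free one when
// the user opts into minimizing code size.

#[cfg(not(feature = "optimize_for_size"))]
fn run_utf8_validation(input: &[u8]) -> bool {
    // Default: the fast, table-driven DFA validator (table in rodata).
    std::str::from_utf8(input).is_ok() // placeholder body
}

#[cfg(feature = "optimize_for_size")]
fn run_utf8_validation(input: &[u8]) -> bool {
    // Opt-in: the old, small, table-free validator.
    std::str::from_utf8(input).is_ok() // placeholder body
}

fn main() {
    assert!(run_utf8_validation("héllo".as_bytes()));
}
```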
Finished benchmarking commit (f6005e27d21dc675f50fe61b6992a431700cefe5): comparison URL.

Overall result: no relevant changes - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: this benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): this benchmark run did not return any relevant results for this metric.

Cycles: this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Genuinely not sure why I thought it was 8kb; the table is a `[u32; 256]`, so 1KiB.
We don't track cycles in perf because it's not very reliable, but it's pretty nice to see that this is a 3.5% reduction there. (Could be noise ofc, but you'd expect something like this from an algo that is designed to leverage ILP/pipelining.)
One useful feature that I added to the Julia port of the DFA was to create an ASCII state for the machine, using one of the unused states in the original design. This means that even though you should be bulk-checking for ASCII some other way, if on a short string you just put it through the DFA, it will tell you whether it is ASCII-only. You can see that in the diagram here: ndinsmore/julia. Also important: in the Julia port I changed the order of the ops so that the state returned is always "clean".
@ndinsmore thanks a lot. I think the adjustment you describe to the DFA isn't directly useful, but it gives me a really good idea for a similar change.
Making that change does have a downside in Rust though — it would lead to additional branches in the code when overflow checks were enabled. Avoiding that would require what we're doing here (or using wrapping arithmetic).
I'm surprised you saw a perf difference, but this kind of thing can be very fiddly (I don't see one, but my states are 32 bits, and I'm using Rust, not Julia — hard to say). That said, I don't really expect this is the kind of code that LLVM can vectorize no matter how you write it.
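For context on the overflow-check point, a minimal sketch (not code from this PR): with overflow checks enabled, plain `+` compiles to an add plus a branch to a panic, while the wrapping form is a bare add in all build modes.

```rust
fn bump_checked(x: u32) -> u32 {
    // With overflow checks on (e.g. debug builds), this is an add plus
    // a conditional branch to a panic on overflow.
    x + 1
}

fn bump_wrapping(x: u32) -> u32 {
    // Always a bare add; wraps around on overflow instead of panicking.
    x.wrapping_add(1)
}

fn main() {
    assert_eq!(bump_wrapping(u32::MAX), 0);
    assert_eq!(bump_checked(41), 42);
}
```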
@thomcc how did you manage to fit the encoding in 32 bits? You need at least 9 states by my count and I can't figure out how you packed them in.
While 32-bit rows aren't sufficient to represent all 9-state DFAs, some 9-state DFAs are still representable. Soon after the original article, @dougallj used an SMT solver to find a 32-bit-compatible encoding of the DFA transitions for UTF-8 validation. That's what Postgres uses, and @thomcc's implementation appears to take the same approach: https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1791
There don't seem to be that many resources available online on DFA-based UTF-8 validation; the best one I could find was this: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
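As a small aside for readers landing here cold: the shift-DFA trick itself (from Per Vognsen's gist linked elsewhere in this thread) is compact enough to show in full on a toy machine. This is an illustrative sketch, not the UTF-8 automaton: states are encoded as bit offsets into a packed row, so a single shift performs a whole transition.

```rust
// Toy shift-based DFA: track whether we've seen an even or odd number
// of b'1' bytes. States are bit offsets into a packed 32-bit row:
const EVEN: u32 = 0; // bits 0..5 of each row hold next-state-from-EVEN
const ODD: u32 = 5;  // bits 5..10 of each row hold next-state-from-ODD

fn row(byte: u8) -> u32 {
    if byte == b'1' {
        // EVEN -> ODD, ODD -> EVEN
        (ODD << EVEN) | (EVEN << ODD)
    } else {
        // Both states map to themselves.
        (EVEN << EVEN) | (ODD << ODD)
    }
}

fn odd_number_of_ones(input: &[u8]) -> bool {
    let mut state = EVEN;
    for &b in input {
        // One shift per byte, no branches. The high bits of `state` are
        // garbage from neighboring fields, but the next shift amount and
        // the final comparison only look at the low 5 bits.
        state = row(b) >> (state & 31);
    }
    state & 31 == ODD
}

fn main() {
    assert!(odd_number_of_ones(b"0100"));
    assert!(!odd_number_of_ones(b"11"));
}
```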
Alright, I understand now how it's possible to use 9 states in a 32-bit DFA. The states aren't numbered 0-8; instead they use non-contiguous (I think?) values in the range 0-31. That also explains why you'd need an SMT solver to come up with the transitions and state numbers. I still don't see why we're using a DFA with 9 states instead of one with 8, though.
Oh right, I should've read the article more carefully. The diagrams in there omit the error state.
@Sp00ph Sorry, I'll be writing some more documentation soon. I'm aware that at the moment it would be unmaintainable.
@thomcc Ping from triage: Can you post your status on this PR?
I've been busy with various issues, but I'll try to get back to this sooner rather than later so that there can be a comparison between the different approaches for this vs what's used in #111367 (perhaps taking the best of both could be done, though I'm not sure; the ASCII path is not one I spent much time on in this PR).
Ping from triage: can you post your status on this PR? This PR has not received an update in a few months. Thank you!
Ping from triage: I'm closing this due to inactivity. Please reopen when you are ready to continue with this. @rustbot label: +S-inactive
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc @thomcc)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760:

> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seem to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level ([1](rust-lang#107760 (comment))).

### Rationale

1. Performance: this algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
2. Generality: it does not use SIMD instructions and does not rely on the branch predictor for good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata. The main algorithm consists of the following parts:

1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I chose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: check 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:

1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. This shows a 200%+ speedup compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. It did introduce an extra 32-bit shift; I believe it's almost free, but I have not benchmarked it yet.

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benching. It also tests various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations.

Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare), and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, and +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, and +33% on zh. This is expected: the larger the ASCII bypass chunk is, the better it performs on ASCII, but the worse on mixed content like es, because the bypass branch keeps flipping. To me, the difference between 27GiB/s and 47GiB/s on en is minimal in absolute time (144.61ns - 79.86ns = 64.75ns), compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in the PR.

On x86_64-linux, Ryzen 7 5700G @ 3.775GHz:

| Algorithm         | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std               | en             | 47.768 +-0.301       |
| shift-dfa-m16-a16 | en             | 27.337 +-0.002       |
| shift-dfa-m16-a32 | en             | 43.627 +-0.006       |
| std               | es             | 6.339 +-0.010        |
| shift-dfa-m16-a16 | es             | 9.721 +-0.014        |
| shift-dfa-m16-a32 | es             | 8.013 +-0.009        |
| std               | zh             | 1.463 +-0.000        |
| shift-dfa-m16-a16 | zh             | 3.401 +-0.002       |
| shift-dfa-m16-a32 | zh             | 3.407 +-0.001        |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another Tier 1 target. I don't have a machine to play with.
- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.
- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) with a call to the new validation function? It has very similar code doing almost the same thing.
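To make the approach concrete, here is a self-contained sketch of a 64-bit shift-based DFA validator in the spirit of the description above. It is an illustration written directly from the RFC 3629 grammar, not the PR's actual code: the state names, offsets, and table layout are mine, and it omits the chunking, ASCII bypass, and error-position tracking discussed above.

```rust
// 9 states, each encoded as a bit offset (a multiple of 6) into a packed
// 64-bit row, so one shift performs a whole state transition.
const ERR: u64 = 0;  // error (absorbing; its fields in every row stay 0)
const ACC: u64 = 6;  // accept: at a character boundary
const CB1: u64 = 12; // expect 1 continuation byte (0x80..=0xBF)
const CB2: u64 = 18; // expect 2 continuation bytes
const CB3: u64 = 24; // expect 3 continuation bytes
const E0: u64 = 30;  // after 0xE0: expect 0xA0..=0xBF, then 1 more
const ED: u64 = 36;  // after 0xED: expect 0x80..=0x9F, then 1 more
const F0: u64 = 42;  // after 0xF0: expect 0x90..=0xBF, then 2 more
const F4: u64 = 48;  // after 0xF4: expect 0x80..=0x8F, then 2 more

// The RFC 3629 transition function, written out directly.
const fn next(state: u64, byte: u8) -> u64 {
    match state {
        ACC => match byte {
            0x00..=0x7F => ACC,
            0xC2..=0xDF => CB1,
            0xE0 => E0,
            0xE1..=0xEC | 0xEE..=0xEF => CB2,
            0xED => ED,
            0xF0 => F0,
            0xF1..=0xF3 => CB3,
            0xF4 => F4,
            _ => ERR,
        },
        CB1 => if matches!(byte, 0x80..=0xBF) { ACC } else { ERR },
        CB2 => if matches!(byte, 0x80..=0xBF) { CB1 } else { ERR },
        CB3 => if matches!(byte, 0x80..=0xBF) { CB2 } else { ERR },
        E0 => if matches!(byte, 0xA0..=0xBF) { CB1 } else { ERR },
        ED => if matches!(byte, 0x80..=0x9F) { CB1 } else { ERR },
        F0 => if matches!(byte, 0x90..=0xBF) { CB2 } else { ERR },
        F4 => if matches!(byte, 0x80..=0x8F) { CB2 } else { ERR },
        _ => ERR,
    }
}

// For every byte value, pack the next state for each current state into
// one u64 row: 9 states x 6 bits = 54 of the 64 bits used.
const TRANS: [u64; 256] = {
    let mut table = [0u64; 256];
    let states = [ACC, CB1, CB2, CB3, E0, ED, F0, F4]; // ERR fields stay 0
    let mut b = 0;
    while b < 256 {
        let mut row = 0u64;
        let mut i = 0;
        while i < states.len() {
            row |= next(states[i], b as u8) << states[i];
            i += 1;
        }
        table[b] = row;
        b += 1;
    }
    table
};

fn validate_utf8(input: &[u8]) -> bool {
    let mut state = ACC;
    for &b in input {
        // One shift per byte. The high bits of `state` are garbage from
        // neighboring fields, but both the shift amount and the final
        // comparison only use the low 6 bits.
        state = TRANS[b as usize] >> (state & 63);
    }
    state & 63 == ACC
}

fn main() {
    assert!(validate_utf8("héllo, 世界".as_bytes()));
    assert!(!validate_utf8(&[0xE0, 0x80, 0x80])); // overlong encoding
}
```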
This optimizes the `core::str::from_utf8` function significantly (initial measurements indicate that it's often 1.5x faster, especially on non-ASCII input, where it can be up to 3x faster).[^2] It does this mostly by leveraging the shift-based DFA technique (a recent obsession), but also by adding SIMD to the ASCII fast path (and it really just completely rewrites and restructures how the validation is done).
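As an illustration of the kind of branch-light ASCII fast path described here (a generic sketch, not the PR's code; the chunk size and names are mine): OR each chunk's bytes together and test the high bit, a shape LLVM readily auto-vectorizes.

```rust
/// Returns the length of the leading all-ASCII prefix, scanning in
/// 16-byte chunks. The OR-reduction per chunk has no data-dependent
/// branches, so LLVM can auto-vectorize it.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    const CHUNK: usize = 16; // illustrative; the real code tunes this
    let mut i = 0;
    while let Some(chunk) = bytes.get(i..i + CHUNK) {
        let combined = chunk.iter().fold(0u8, |acc, &b| acc | b);
        if combined & 0x80 != 0 {
            break; // this chunk contains a non-ASCII byte
        }
        i += CHUNK;
    }
    // Finish the tail (or locate the non-ASCII byte) one byte at a time.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}

fn main() {
    assert_eq!(ascii_prefix_len(b"hello"), 5);
    assert_eq!(ascii_prefix_len("abcdéf".as_bytes()), 4);
}
```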
For prior art: shift-DFAs are now used for UTF-8 validation in PostgreSQL, and seem to be in progress or under consideration for use in Julia and perhaps Go. Of these, PG's impl is the most similar to this one, at least at a high level.[^1]
This PR is not quite ready for review. I'm mostly getting this up now so I can check some perf runs and such. Assuming they (or further benchmarks) don't reveal issues, I believe the approach is basically complete, but the PR still needs some cleanup work (which I intend to do) before it's ready for review.
I'm deliberately leaving anything that touches other functions as follow-up work that I'll do after this lands. That includes improving `String::from_utf8_lossy` or sharing any logic with `is_ascii`.
(Note: some stuff came up at $dayjob, so it may be a week or two before I finish all this; I just wanted to get this up in the meantime so that it's not sitting in the back of my mind as much.)
Appendix: FAQ

Potential Drawbacks

The const and runtime implementations are split via `const_eval_select`. This is both because LLVM was doing worse on the `while ...` version of some of the loops, and because `core::simd` isn't const-compatible (trait usage). The function we call would still exist either way, though.

Why not `simdutf8`'s algorithm?

Short version: it would mainly benefit `-Zbuild-std` users (okay, and `aarch64` users).

Long version: https://gist.github.com/thomcc/f153a122f680023f937f2c912978b8e6.
Footnotes
[^1]: The main similarity is that PG also uses 32-bit rows in the transition table, and has a special case for ASCII (even if the way we special-case it is very different). Beyond that, the impls are really totally different (theirs has fewer optimizations, doesn't need to track any info for error positions, uses a different UTF-8 automaton, and is just completely different code).
[^2]: The benchmarks I have locally have too much PII to be published as-is, since I derived them from real use including browser history (among other things), but that's a temporary situation. Some preliminary benchmarks based on the `simdutf8` corpus are here. They demonstrate a speedup in basically every case (across all string sizes and character compositions), although the real-world improvement seems to be even higher (the impl this replaces has branch misprediction issues for non-ASCII, which are not reflected in these benchmarks).