
Rewrite libcore's UTF-8 validation for performance #107760


Closed
wants to merge 1 commit

Conversation

thomcc
Member

@thomcc thomcc commented Feb 7, 2023

This optimizes the core::str::from_utf8 function significantly (initial measurements indicate that it's often 1.5x faster, especially for non-ASCII, where it can be up to 3x faster).

It does this mostly by leveraging the shift-based DFA technique (a recent obsession), but also by adding SIMD to the ASCII fast path (and really, it just completely rewrites and restructures how the validation is done).

For prior art: shift-DFAs are now used for UTF-8 validation in PostgreSQL, and seem to be in progress or under consideration for use in Julia and perhaps Go. Of these, PG's impl is the most similar to this one, at least at a high level1.
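For readers unfamiliar with the technique: the core of a shift-based DFA is that each byte indexes a table row in which every state's transition is packed as a small bit-field, so stepping the automaton is one load, one shift, and one mask, with no branches and no dependent lookups. Here is a minimal, runnable toy (a two-state "is this pure ASCII?" checker in the classic 64-bit formulation with 6-bit fields; it is not this PR's actual tables, which pack a UTF-8 automaton into 32-bit rows, as discussed below):

```rust
// Toy shift-DFA, for illustration only. States are encoded as bit
// offsets into each 64-bit row: OK = 0, ERR = 6. Each row packs the
// next state (as a 6-bit field) for every current state.
const OK: u64 = 0;
const ERR: u64 = 6;

fn build_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    for b in 0..256usize {
        let next_from_ok = if b < 0x80 { OK } else { ERR };
        // ERR is a trap state: it always transitions back to itself.
        t[b] = (next_from_ok << OK) | (ERR << ERR);
    }
    t
}

fn is_ascii_dfa(table: &[u64; 256], input: &[u8]) -> bool {
    let mut state = OK;
    for &b in input {
        // The entire hot loop: load, shift, mask. No branches.
        state = (table[b as usize] >> state) & 63;
    }
    state == OK
}
```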


This PR is not quite ready for review. I'm mostly getting this up now so I can check some perf runs and such. Assuming they (or further benchmarks) don't reveal issues, I believe the approach is basically complete, but before the PR is ready for review I intend to:

  • Add more comments/docs, it's totally impossible to follow at the moment.
  • Rewrite the code for the table so that it's comprehensible instead of a total black box, and include a link to a generator for the constants.
  • Rewrite the ASCII scan loop (which is pretty clownshoes at the moment, and probably should still handle alignment after the first load).
  • Ensure the tests cover all the optimized cases (both short/long strings).
  • Integrate some of my benchmarks2.

I'm deliberately leaving anything that touches other functions as follow-up work that I'll do after this lands. That includes improving String::from_utf8_lossy or sharing any logic with is_ascii.

(Note: some stuff came up in $dayjob so it may be a week or two before I finish all this, just wanted to get this up in the meantime so that it's not really in the back of my mind as much)

Appendix: FAQ

Potential Drawbacks

  • Weird algo involving a magic table. I will document this way better so that this drawback goes away.
  • Slightly slower in the invalid-UTF-8 path, as we may re-validate up to 16 bytes. I'll probably fix this for truncation, but even so it is pretty minimal. It's very likely that if the error isn't right at the start of the string, this version is still much faster.
  • Probably worse on some machines, including ones with slow unaligned loads (slower than doing scalar operations) and slow dynamic right shifts (slower than branching).
  • Additional 1kb table, although I'm very careful not to touch it for the pure-ASCII path (doing this without eliminating the benefit of the shift-DFA was hard).
  • Need const_eval_select. This is both because LLVM was doing worse on the while ... version of some of the loops, and because core::simd isn't const-compatible (trait usage). The function we call would still exist either way though.

Why not simdutf8's algorithm?

Short version:

  • Can't use algos requiring new instructions in libcore for various reasons. Not worth it to maintain just for -Zbuild-std users (okay, and aarch64 users).
  • This impl is pretty small, mostly portable (a couple special cases in the ASCII handling, but nothing complex or conditionally compiled), and does not require much target-specific complexity.
  • On the bright side, we're faster for strings up to around 40-120 bytes anyway. At least on my machines.

Long version: https://gist.github.com/thomcc/f153a122f680023f937f2c912978b8e6.

Footnotes

  1. The main similarity is that PG also uses 32-bit rows in the transition table, and has a special case for ASCII (even if the way we special-case it is very different). Beyond that, the impls are really totally different (theirs has fewer optimizations, doesn't need to track any info for error positions, uses a different UTF-8 automaton, and is just completely different code).

  2. The benchmarks I have locally have too much PII to be published as-is, since I derived them from real use including browser history (among other things), but that's a temporary situation. Some preliminary benchmarks based on the simdutf8 corpus are here. They demonstrate a speedup in basically every case (across all string sizes and character compositions), although the real-world improvement seems to be even higher (the impl this replaces has branch misprediction issues for non-ASCII, which are not reflected in these benchmarks).

@thomcc thomcc added T-libs Relevant to the library team, which will review and decide on the PR/issue. A-str Area: str and String labels Feb 7, 2023
@rustbot
Collaborator

rustbot commented Feb 7, 2023

r? @m-ou-se

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 7, 2023
@rustbot
Collaborator

rustbot commented Feb 7, 2023

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@thomcc
Member Author

thomcc commented Feb 7, 2023

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 7, 2023
@thomcc
Member Author

thomcc commented Feb 7, 2023

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 7, 2023
@bors
Collaborator

bors commented Feb 7, 2023

⌛ Trying commit f254d4c with merge f6005e27d21dc675f50fe61b6992a431700cefe5...

```rust
let was_mid_char = state != END;
debug_assert!(state != ERR);
if !tail.is_empty() {
    // Check and early return if the last CHUNK_LEN bytes were all ASCII. The
```
Member Author
(note to self: reword this comment, since it seems like I got distracted halfway through it)

```rust
// Use a generic to help the compiler out some — we pass both `&[u8]` and `&[u8;
// CHUNK_LEN]` in here, and would like it to know about the constant.
#[must_use]
#[inline]
```
Member Author

Ideally this would not be inlined for the chunk case (so split into separate functions for chunk vs tail), since LLVM sometimes completely beefs it after inlining for whatever reason, and perf tanks.

I had thought that we need all of str::from_utf8 inlinable so that LLVM can const-fold it, hence the old version being inline(always), but apparently str::from_utf8 itself is not inline, so that must just be a case where I'm mistaken.

```rust
// check the length first, which can end up having some pretty disastrous
// impacts on performance, seemingly due to inlining(?). In any case.
//
// Note that doing this for many sizes
```
Member Author

Forgot to finish this comment. Was going to mention that this seems to still beat the naïve version even if the branch on length is hard to predict, but that stops holding if you add more length conditions.

@bors
Collaborator

bors commented Feb 7, 2023

☀️ Try build successful - checks-actions
Build commit: f6005e27d21dc675f50fe61b6992a431700cefe5 (f6005e27d21dc675f50fe61b6992a431700cefe5)


```rust
debug_assert!(!inp.is_empty() && inp.get(..pos).is_some());
while pos != 0 {
    pos -= 1;
    let is_cont = (inp[pos] & 0b1100_0000) == 0b1000_0000;
```
Member

@the8472 the8472 Feb 7, 2023

Well, this isn't hot code so probably not significant, but the as i8 >= -64 construction from the old impl should take 1 less instruction since it can use status flags for signed operations for that comparison.

There is also the utf8_is_cont_byte method for that.
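For reference, a sketch of the two equivalent continuation-byte tests being compared here; both are true exactly for bytes 0x80..=0xBF:

```rust
fn is_cont_mask(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000 // mask off the top two bits, then compare
}

fn is_cont_signed(b: u8) -> bool {
    (b as i8) < -64 // continuation bytes are -128..=-65 as i8: one signed compare
}
```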

Member Author

Yeah good point. I think there's a function in the parent module too.

@Lokathor
Contributor

Lokathor commented Feb 7, 2023

8kb can be pretty big in an embedded context. Can we keep the older, small no-table function available, perhaps under some other name or something?

@the8472
Member

the8472 commented Feb 7, 2023

8kb can be pretty big in an embedded context.

We have a bunch of places in the standard library where that's a concern. Ideally we'd have some umbrella cfg to make that tradeoff.

@rust-timer
Collaborator

Finished benchmarking commit (f6005e27d21dc675f50fe61b6992a431700cefe5): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | -     | -              | 0     |
| Regressions ❌ (secondary)  | -     | -              | 0     |
| Improvements ✅ (primary)   | -3.5% | [-5.5%, -1.2%] | 8     |
| Improvements ✅ (secondary) | -2.1% | [-2.1%, -2.0%] | 3     |
| All ❌✅ (primary)          | -3.5% | [-5.5%, -1.2%] | 8     |

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 7, 2023
@thomcc
Member Author

thomcc commented Feb 7, 2023

8kb can be a pretty big in an embedded context. Can we keep the older, small no-table function available, perhaps under some other name or something?

Genuinely not sure why I thought it was 8kb; the table is a [u32; 256], so it's 1kb. Also the current version has a 256b table which might be removable after all is said and done.

@thomcc
Member Author

thomcc commented Feb 7, 2023

We don't track cycles in perf because it's not very reliable, but it's pretty nice to see that this is a 3.5% reduction there. (Could be noise ofc, but you'd expect something like this from an algo that is designed to leverage ILP/pipelining.)

@ndinsmore

One useful feature that I added to the Julia port of the DFA was to create an ASCII state for the machine, using one of the unused states in the original design. This means that even though you should be bulk-checking for ASCII some other way, if on a short string you just put it through the DFA, it will tell you whether it is ASCII-only.

You can see that in the diagram here: ndinsmore/julia
In this case the diagram is a modified version of the diagram by @hoehrmann: @hoehrmann website

So the states become UTF8_ASCII = 0, UTF8_ACCEPT = 1, and UTF8_INVALID = 2, ordered in decreasing specificity. Now, for example, valid ready states are `state <= UTF8_ACCEPT` and valid stopping states are `state <= UTF8_INVALID`.

Also important: in the Julia port I changed the order of the ops so that the state returned is always "clean".
Whereas @pervognsen had: state = row >> (state & 63)
The Julia port does: state = (row >> state) & 63
This has a few benefits (see the sketch after this list):

  1. The state is always "clean" so it can be used in checks without blindly having to & 63 everywhere
  2. The code is a bit faster, and LLVM vectorizes it just as well.
  3. It gets rid of the & 63 before you exit the state machine
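A side-by-side sketch of the two orderings (illustrative, not the exact code from either implementation):

```rust
// Original formulation: mask the shift amount. The returned state keeps
// garbage in its high bits, so every later comparison needs another & 63.
fn step_original(row: u64, state: u64) -> u64 {
    row >> (state & 63)
}

// Reordered formulation: shift first, then mask. The returned state is
// always in 0..=63 and can be compared directly (e.g. state <= UTF8_ACCEPT).
// Note: with Rust overflow checks enabled, `row >> state` panics if
// state >= 64, which is the concern raised in the reply below.
fn step_clean(row: u64, state: u64) -> u64 {
    (row >> state) & 63
}
```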

@thomcc
Member Author

thomcc commented Feb 8, 2023

@ndinsmore thanks a lot. I think the adjustment you describe to the DFA isn't directly useful, but it gives me a really good idea for a similar change.

Also important in the julia port I changed the order of the ops so that the state returned is always "clean"

Changing to state = (row >> state) & $mask should result in the same code on most architectures, and it seems to: https://godbolt.org/z/ha6v6EEoq.

It does have a downside in Rust though: it would lead to additional branches in the code when overflow checks are enabled (see the sketch below). Avoiding that would require what we're doing here (or using wrapping_shr, so I'm not sure there's a point). Regarding having a "clean" state this way: given how the code is structured, this isn't an issue. The code that drives the DFA is localized to dedicated functions, and always returns "clean" states. The code outside of that does not have to mask before comparisons.
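A sketch of that overflow-check interaction (assuming 64-bit rows with a 6-bit mask): in debug builds, a plain `>>` by an unmasked amount compiles to a shift plus a panic branch, so the "clean" form has to either pre-mask the shift amount or use a wrapping shift.

```rust
fn step_wrapping(row: u64, state: u64) -> u64 {
    // `row >> state` would panic in overflow-checked builds if state >= 64;
    // wrapping_shr shifts by `state % 64` with no check, matching the
    // hardware behavior the masked form relies on anyway.
    row.wrapping_shr(state as u32) & 63
}
```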

The code is a bit faster, and LLVM vectorizes it just as well

I'm surprised you saw a perf difference, but this kind of thing can be very fiddly (I don't see one, but my states are 32 bits, and I'm using Rust not Julia — hard to say). That said, I don't really expect this is the kind of code that LLVM can vectorize no matter how you write it.

@ndinsmore

@thomcc how did you manage to fit the encoding in 32 bits? You need at least 9 states by my count and I can't figure out how you packed them in.

@pervognsen

pervognsen commented Feb 9, 2023

@thomcc how did you manage to fit the encoding in 32 bits? You need at least 9 states by my count and I can't figure out how you packed them in.

While 32-bit rows aren't sufficient to represent all 9-state DFAs, some 9-state DFAs are still representable. Soon after the original article @dougallj used an SMT solver to find a 32-bit-compatible encoding of the DFA transitions for UTF-8 validation. That's what Postgres uses and @thomcc's implementation appears to take the same approach: https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1791
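To make the solver's job concrete: it must assign each abstract DFA state a shift offset in 0..32 (possibly non-contiguous, with fields free to overlap) and pick a 32-bit row per byte such that shift-and-mask lands exactly on the next state's offset for every (state, byte) pair. A hypothetical checker for such an encoding (the names and mask here are illustrative, not from the PR):

```rust
fn verify_encoding(
    offsets: &[u32],                    // offsets[s]: shift amount encoding state s
    rows: &[u32; 256],                  // rows[b]: packed transitions for byte b
    delta: impl Fn(usize, u8) -> usize, // the abstract DFA: (state, byte) -> state
    mask: u32,                          // field mask, e.g. 31
) -> bool {
    (0..offsets.len()).all(|s| {
        (0u8..=255).all(|b| (rows[b as usize] >> offsets[s]) & mask == offsets[delta(s, b)])
    })
}
```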

@Sp00ph
Member

Sp00ph commented Feb 10, 2023

There don't seem to be many resources available online on DFA-based UTF-8 validation; the best one I could find was this: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
The DFAs in that article only have 8 states. Why are we using one with 9 states? Also, how can we ever cram 9 states into 32 bits with the shift-based approach? The bits per transition must be constant across all states, so at most 3 bits per state for 9 states, which would only leave 8 of those states addressable, right?

@Sp00ph
Member

Sp00ph commented Feb 10, 2023

Alright, I understand now how it's possible to use 9 states in a 32-bit DFA. The states aren't numbered 0-8, and instead use non-contiguous (I think?) values in the range 0-31. That also explains why you'd need an SMT solver to come up with the transitions and state numbers. I still don't see why we're using a DFA with 9 states instead of one with 8, though.

@Sp00ph
Member

Sp00ph commented Feb 10, 2023

Oh right, I should've read the article more carefully. The diagrams in there are omitting the error state.

@thomcc
Member Author

thomcc commented Feb 11, 2023

@Sp00ph Sorry, I'll be writing some more documentation soon. I'm aware that at the moment it would be unmaintainable.

@JohnCSimon
Member

@thomcc Ping from triage: Can you post your status on this PR?

@thomcc
Member Author

thomcc commented May 9, 2023

I've been busy with various issues, but I'll try to get back to this sooner rather than later so that there can be a comparison between the different approaches here vs. what's used in #111367 (perhaps taking the best of both could be done, although I'm not sure; the ASCII path is not one I spent much time on in this PR).

@workingjubilee workingjubilee added the A-Unicode Area: Unicode label Jul 22, 2023
@JohnCSimon
Member

@thomcc

ping from triage - can you post your status on this PR? This PR has not received an update in a few months. Thank you!

@oskgo
Contributor

oskgo commented Jul 26, 2024

@thomcc

Ping from triage: I'm closing this due to inactivity. Please reopen when you are ready to continue with this.
Note: if you are going to continue, please reopen the PR BEFORE you push to it, or else you won't be able to reopen it (this is a quirk of GitHub).
Thanks for your contribution.

@rustbot label: +S-inactive

@oskgo oskgo closed this Jul 26, 2024
@rustbot rustbot added the S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. label Jul 26, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 7, 2025
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc `@thomcc`)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760,
> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level[1](rust-lang#107760 (comment)).

### Rationales

1. Performance: This algorithm gives a substantial performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.

2. Generality: It does not use SIMD instructions and does not rely on the branch predictor to get good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts (a structural sketch follows the list):
1. Main loop: taking a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check if the state is in ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I choose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: taking 16 bytes at a time to check for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep it simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, will be traversed twice, in exchange for a tighter and more efficient hot loop.
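
A structural sketch of those three parts (hypothetical state constants; it assumes ERROR is a trap state, as shift-DFA error states normally are, so checking once per chunk is sound):

```rust
const ACCEPT: u64 = 0; // assumed encodings; the real offsets differ
const ERROR: u64 = 6;
const MAIN_CHUNK_SIZE: usize = 16;

fn validate(bytes: &[u8], table: &[u64; 256]) -> bool {
    let mut state = ACCEPT;
    let mut chunks = bytes.chunks_exact(MAIN_CHUNK_SIZE);
    for chunk in &mut chunks {
        // ASCII bypass: only valid when not mid-sequence (state == ACCEPT).
        if state == ACCEPT && chunk.iter().all(|&b| b < 0x80) {
            continue; // an all-ASCII chunk cannot change the state
        }
        // Main loop: run the DFA over the chunk...
        for &b in chunk {
            state = (table[b as usize] >> state) & 63;
        }
        // ...and check for the (trapping) error state once per chunk.
        if state == ERROR {
            return false; // the real code re-walks the chunk for the error offset
        }
    }
    // Trailing chunk: step byte by byte.
    for &b in chunks.remainder() {
        state = (table[b as usize] >> state) & 63;
    }
    state == ACCEPT
}
```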

There are also some small tricks being used (both sketched after this list):
1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. It shows a 200%+ speedup compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. It did introduce an extra 32-bit shift, which I believe is almost free, but I have not benchmarked it yet.
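
Hedged sketches of both tricks; the field positions and layout here are assumptions, not the actual constants from the patch:

```rust
const MASK: u32 = 63;

// Trick 1: store each 64-bit row as two u32 halves and pick the half with a
// conditional move, so i686 never needs a slow 64-bit SHRD. This assumes the
// encoding was chosen so that no state's field straddles the 32-bit boundary.
fn step_i686(row: [u32; 2], state: u32) -> u32 {
    let half = if state < 32 { row[0] } else { row[1] }; // lowers to a cmov
    (half >> (state & 31)) & MASK
}

// Trick 2: read the encoded-sequence width for a leading byte out of the
// otherwise-unused high bits of that byte's row (bit position assumed).
const WIDTH_SHIFT: u32 = 60;

fn utf8_char_width(table: &[u64; 256], first_byte: u8) -> u32 {
    ((table[first_byte as usize] >> WIDTH_SHIFT) & 0xF) as u32
}
```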

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benching. It also tested various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations. Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare) and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, +33% on zh. This is expected: the larger the ASCII bypass chunk, the better it performs on pure ASCII, but the worse on mixed content like es, because the taken branch keeps flipping.

To me, the difference between 27GB/s and 47GB/s on en is minimal in absolute time (144.61ns - 79.86ns = 64.75ns), compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in the PR.

On x86\_64-linux, Ryzen 7 5700G `@3.775GHz`:

| Algorithm         | Input language | Throughput / (GiB/s)  |
|-------------------|----------------|-----------------------|
| std               | en             | 47.768 +-0.301        |
| shift-dfa-m16-a16 | en             | 27.337 +-0.002        |
| shift-dfa-m16-a32 | en             | 43.627 +-0.006        |
| std               | es             |  6.339 +-0.010        |
| shift-dfa-m16-a16 | es             |  9.721 +-0.014        |
| shift-dfa-m16-a32 | es             |  8.013 +-0.009        |
| std               | zh             |  1.463 +-0.000        |
| shift-dfa-m16-a16 | zh             |  3.401 +-0.002        |
| shift-dfa-m16-a32 | zh             |  3.407 +-0.001        |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another tier 1 target.
  I don't have a machine to play with.

- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.

- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) with a call to the new validation function?
  It has very similar code doing almost the same thing.
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 9, 2025
Labels
A-str Area: str and String
A-Unicode Area: Unicode
S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs.
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.
T-libs Relevant to the library team, which will review and decide on the PR/issue.