-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Add fast AVX2-based implementation of mb_check_encoding for UTF-8 #10436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…string This code is a few percent faster for short UTF-8 strings. For long (~10,000 byte) strings, it is also consistently faster on my local microbenchmarks, but by less than 1%.
For your convenience, here is the commit log message for the big commit here:
|
I tried also making an AVX2-based version of |
Hmm, I need to figure out why this code won't compile on a Mac. |
In the past I found UTF-8 validation like:
to perform very fast - how does the speed compare with |
@mvorisek Both mbstring and PCRE cache the results of UTF-8 validity checking (by setting a flag in the string object header) and can use this flag to bypass validating the same string a second time. So to compare the speed of these two validation functions, I had to disable that 'bypass' in both mbstring and PCRE. After doing so, results are: mbstring (with AVX2-based validation) is 9.3x faster for strings around 10,000 bytes, 5.8x faster for strings around 100 bytes, and 3.4x faster for strings around 5 bytes. I previously used a fuzzer to search for strings where PCRE's UTF-8 validation would return different results than mbstring's UTF-8 validation, and it didn't find any. |
@mvorisek Just one thought. I'm not sure how you tested |
Maybe, I also tested the perf with PHP 7.4 or 8.0... The performance of this PR sounds very promising! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know AVX2 but I read the paper yesterday and it seems like it's properly implemented (which is backed by the test suite). Just confused what the very large table is used for.
BAD_BYTE, BAD_BYTE, BAD_BYTE, BAD_BYTE, | ||
BAD_BYTE, BAD_BYTE, BAD_BYTE, INVALID_CP, | ||
0, 0, OVERLONG_2BYTE, OVERLONG_2BYTE | OVERLONG_3BYTE | OVERLONG_4BYTE); | ||
const __m256i bad_hi_nibble = _mm256_set_epi8( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused what's the purpose of this table again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The three 32-byte tables are used for looking up the "possible bad 2-byte combination" bitmasks for the high nibble of the first byte, low nibble of the first byte, and high nibble of the second byte. If any bit is set in all 3 of the resulting bitmasks, then it means all three of those nibbles match the same bad 2-byte combination... meaning the string is invalid.
Each of the 32-byte tables are actually the same 16-byte table, repeated twice. This is because the AVX2 instruction which is used for table lookups splits a 256-bit YMM register into two halves, and uses the first 16 bytes as the lookup table for the first half, and the second 16 bytes as the lookup table for the second half.
We don't really want this behavior, as for our use case, we want to use the same lookup table for all the bytes in the entire register, but it's how the instruction works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right okay, thanks for the explanation. :)
My quick performance tests shows an incredible performance improvement between PHP 8.0 and 8.2 for However, running my experiment on the master branch seems to show that mbstring is unbeatable in terms of performance. https://3v4l.org/1Ghji/rfc#vgit.master If this is all true then these are amazing speed optimizations for mbstring. |
Dear @kamil-tekiela, thanks for the great testing! What I think we are seeing here is that while PCRE has cached the UTF-8 validity of strings for a long time, mbstring just started doing this in 8.3. In 8.2, If you look at the times which you kindly helped to measure, mbstring takes around 1000x longer for a 10,000-byte string than for a 10-byte string (actually a bit better), which is what we expect of a function which performs the validation every time. On the other hand, PCRE has almost the same performance for 10-byte strings and for 10,000-byte strings. That is not possible if it was actually performing the validation every time. The reason, again, is because it is only validating each string once, then when you call |
Just seeing if I can get this to build properly on Mac... |
This comment was marked as spam.
This comment was marked as spam.
@smokefire69 I didn't send anything to you deliberately. You can look at the above message in this thread and see your name isn't there. Looks like some bug in GitHub. |
From some local benchmarks which I ran, the AVX2-based version is about 2.8x faster than the SSE2-based version on long (~10,000 byte) strings, 1.6x faster on medium (~100 byte) strings, and just about the same on very short strings. I followed the example of the code in the 'standard' module, using preprocessor directives so that the code can be compiled in any of 4 ways: 1) With no AVX2 support at all (for example, when PHP is compiled for CPU architectures other than AMD64) 2) For CPUs with AVX2 only (for example, when PHP is built with CCFLAGS='-march=native' on a host which implements AVX2) 3) With runtime detection of AVX2 performed by the dynamic linker; this requires a dynamic linker which supports the STT_GNU_IFUNC symbol type extension to the ELF binary standard. This is true of glibc's dynamic linker, as of late 2009. 4) With runtime detection of AVX2 performed by the module init function. The detection is done by checking the output of CPUID and then a function pointer is set accordingly. In this case, all calls to the UTF-8 validation routine are indirect calls through that function pointer.
At least the code compiles on Mac now, let's see if the tests pass. It would be really nice to have someone with an x86-64 based Mac, 2014 or later, to test and see if they can actually see the increased performance from AVX2 acceleration or not. |
OK, the test run on Mac timed out... but reviewing the CI logs reveals that the tests which ran very slowly, causing CI to time out, had nothing to do with this PR. |
It looks like we are good to merge here, unless anyone has more ideas on how to improve. I am just trying to implement UTF-16 decoding/encoding using SSE2. I have UTF-16LE decoding/encoding routines which together are 2.9x faster than the non-vectorized ones... but I have used SSE4.1 instructions in one place 😓 It would be nice if I could figure out a fast way to do the same with just SSE2. |
Merged. Thank you very much to everyone who reviewed. |
I used libfuzzer to search for test cases where the output of the new routine would differ from the existing one, but didn't find any. It looks like the test cases in our test suite were good enough to flush out all the bugs.
FYA @cmb69 @Girgias @nikic @nielsdos @kamil-tekiela @youkidearitai