Add fast AVX2-based implementation of mb_check_encoding for UTF-8 #10436


Closed
wants to merge 2 commits

Conversation

alexdowad
Contributor

I used libfuzzer to search for test cases where the output of the new routine would differ from the existing one, but didn't find any. It looks like the test cases in our test suite were good enough to flush out all the bugs.

FYA @cmb69 @Girgias @nikic @nielsdos @kamil-tekiela @youkidearitai

…string

This code is a few percent faster for short UTF-8 strings. For long
(~10,000 byte) strings, it is also consistently faster on my local
microbenchmarks, but by less than 1%.
@alexdowad
Contributor Author

For your convenience, here is the commit log message for the big commit here:

From some local benchmarks which I ran, the AVX2-based version is about
2.8x faster than the SSE2-based version on long (~10,000 byte) strings,
1.6x faster on medium (~100 byte) strings, and just about the same
on very short strings.

I followed the example of the code in the 'standard' module, using
preprocessor directives so that the code can be compiled in any of
4 ways:

1) With no AVX2 support at all (for example, when PHP is compiled for
   CPU architectures other than AMD64)
2) For CPUs with AVX2 only (for example, when PHP is built with
   CCFLAGS='-march=native' on a host which implements AVX2)
3) With runtime detection of AVX2 performed by the dynamic linker;
   this requires a dynamic linker which supports the STT_GNU_IFUNC
   symbol type extension to the ELF binary standard. This is true of
   glibc's dynamic linker, as of late 2009.
4) With runtime detection of AVX2 performed by the module init function.
   The detection is done by checking the output of CPUID and then a
   function pointer is set accordingly. In this case, all calls to the
   UTF-8 validation routine are indirect calls through that
   function pointer.

@alexdowad
Contributor Author

I tried also making an AVX2-based version of mb_fast_strlen_utf8. On strings with about 10,000 bytes, it is just about twice as fast. But, interestingly, on strings with about 100,000 bytes, it's only about 25% faster than the SSE2-based version. I haven't investigated to find out the reason.

@alexdowad
Contributor Author

Hmm, I need to figure out why this code won't compile on a Mac.

@mvorisek
Contributor

In the past I found UTF-8 validation like:

if (!preg_match('~~u', $value)) {
    throw new Exception('Value is not valid UTF-8');
}

to perform very fast. How does its speed compare with mb_check_encoding, and is the result really 1:1 for all possible inputs?

@alexdowad
Contributor Author

In the past I found UTF-8 validation like:

if (!preg_match('~~u', $value)) {
    throw new Exception('Value is not valid UTF-8');
}

to perform very fast. How does its speed compare with mb_check_encoding, and is the result really 1:1 for all possible inputs?

@mvorisek Both mbstring and PCRE cache the results of UTF-8 validity checking (by setting a flag in the string object header) and can use this flag to bypass validating the same string a second time. So to compare the speed of these two validation functions, I had to disable that 'bypass' in both mbstring and PCRE.

After doing so, results are: mbstring (with AVX2-based validation) is 9.3x faster for strings around 10,000 bytes, 5.8x faster for strings around 100 bytes, and 3.4x faster for strings around 5 bytes.

I previously used a fuzzer to search for strings where PCRE's UTF-8 validation would return different results than mbstring's UTF-8 validation, and it didn't find any.

@alexdowad
Contributor Author

@mvorisek Just one thought. I'm not sure how you tested preg_match previously and determined it to be very fast for UTF-8 validation... but if you tested it by validating the same string repeatedly in a loop, then you were just measuring how fast it can bypass validation by checking the string's GC_FLAGS.

@mvorisek
Contributor

Maybe, I also tested the perf with PHP 7.4 or 8.0... The performance of this PR sounds very promising!

Member

@Girgias Girgias left a comment


I don't really know AVX2, but I read the paper yesterday and it seems to be properly implemented (which is backed by the test suite). I'm just confused about what the very large table is used for.

BAD_BYTE, BAD_BYTE, BAD_BYTE, BAD_BYTE,
BAD_BYTE, BAD_BYTE, BAD_BYTE, INVALID_CP,
0, 0, OVERLONG_2BYTE, OVERLONG_2BYTE | OVERLONG_3BYTE | OVERLONG_4BYTE);
const __m256i bad_hi_nibble = _mm256_set_epi8(
Member


I'm confused what's the purpose of this table again?

Contributor Author

@alexdowad alexdowad Jan 25, 2023


The three 32-byte tables are used for looking up the "possible bad 2-byte combination" bitmasks for the high nibble of the first byte, low nibble of the first byte, and high nibble of the second byte. If any bit is set in all 3 of the resulting bitmasks, then it means all three of those nibbles match the same bad 2-byte combination... meaning the string is invalid.

Each of the 32-byte tables is actually the same 16-byte table, repeated twice. This is because the AVX2 instruction used for table lookups splits a 256-bit YMM register into two halves, using the first 16 bytes as the lookup table for the first half and the second 16 bytes as the lookup table for the second half.

We don't really want this behavior; for our use case, we want the same lookup table to apply to every byte in the entire register. But that's how the instruction works.

Member


Right okay, thanks for the explanation. :)

@kamil-tekiela
Member

Maybe, I also tested the perf with PHP 7.4 or 8.0... The performance of this PR sounds very promising!

My quick performance tests show an incredible improvement between PHP 8.0 and 8.2 for mb_check_encoding. It is now faster than PCRE for short strings (https://3v4l.org/sM8Dr#v8.2.1). It seems to be much slower for long strings, but that could also be the fault of my imperfect testing. With 10k+ characters, the performance is absolutely terrible. If Alex can confirm this, it would be great. https://3v4l.org/1Ghji#v8.2.1

However, running my experiment on the master branch seems to show that mbstring is unbeatable in terms of performance. https://3v4l.org/1Ghji/rfc#vgit.master

If this is all true then these are amazing speed optimizations for mbstring.

@alexdowad
Contributor Author

My quick performance tests show an incredible improvement between PHP 8.0 and 8.2 for mb_check_encoding. It is now faster than PCRE for short strings (https://3v4l.org/sM8Dr#v8.2.1). It seems to be much slower for long strings, but that could also be the fault of my imperfect testing. With 10k+ characters, the performance is absolutely terrible. If Alex can confirm this, it would be great. https://3v4l.org/1Ghji#v8.2.1

However, running my experiment on the master branch seems to show that mbstring is unbeatable in terms of performance. https://3v4l.org/1Ghji/rfc#vgit.master

If this is all true then these are amazing speed optimizations for mbstring.

Dear @kamil-tekiela, thanks for the great testing!

What I think we are seeing here is that while PCRE has cached the UTF-8 validity of strings for a long time, mbstring just started doing this in 8.3. In 8.2, mb_check_encoding would perform the validation every time, even if you call it repeatedly on the same string object.

If you look at the times which you kindly helped to measure, mbstring takes around 1000x longer for a 10,000-byte string than for a 10-byte string (actually a bit better), which is what we expect from a function that performs the validation every time.

On the other hand, PCRE has almost the same performance for 10-byte strings and for 10,000-byte strings. That would not be possible if it were actually performing the validation every time. The reason, again, is that it only validates each string once; when you call preg_match another 99 times on the same string object, it uses the cached UTF-8 validity status of the string.

@alexdowad
Contributor Author

Just seeing if I can get this to build properly on Mac...

@smokefire69

This comment was marked as spam.

@alexdowad
Contributor Author

What?? You must be sending this to me by mistake wtf is it and what is it about or for??

@smokefire69 I didn't send anything to you deliberately. You can look at the above message in this thread and see your name isn't there.

Looks like some bug in GitHub.

@alexdowad
Contributor Author

At least the code compiles on Mac now, let's see if the tests pass.

It would be really nice if someone with an x86-64 Mac (2014 or later) could test whether they actually see the increased performance from AVX2 acceleration.

@alexdowad
Contributor Author

OK, the test run on Mac timed out... but reviewing the CI logs reveals that the tests which ran slowly enough to cause the timeout had nothing to do with this PR.

@alexdowad
Contributor Author

It looks like we are good to merge here, unless anyone has more ideas on how to improve.

I am just trying to implement UTF-16 decoding/encoding using SSE2. I have UTF-16LE decoding/encoding routines which together are 2.9x faster than the non-vectorized ones... but I have used SSE4.1 instructions in one place 😓

It would be nice if I could figure out a fast way to do the same with just SSE2.

@alexdowad
Contributor Author

Merged. Thank you very much to everyone who reviewed.
