Add fast AVX2-based implementation of mb_check_encoding for UTF-8 #10436


Closed
wants to merge 2 commits

Conversation

alexdowad
Contributor

I used libfuzzer to search for test cases where the output of the new routine would differ from the existing one, but didn't find any. It looks like the test cases in our test suite were good enough to flush out all the bugs.

FYA @cmb69 @Girgias @nikic @nielsdos @kamil-tekiela @youkidearitai

…string

This code is a few percent faster for short UTF-8 strings. For long
(~10,000 byte) strings, it is also consistently faster on my local
microbenchmarks, but by less than 1%.
@alexdowad
Contributor Author

For your convenience, here is the commit log message for the big commit here:

From some local benchmarks which I ran, the AVX2-based version is about
2.8x faster than the SSE2-based version on long (~10,000 byte) strings,
1.6x faster on medium (~100 byte) strings, and just about the same
on very short strings.

I followed the example of the code in the 'standard' module, using
preprocessor directives so that the code can be compiled in any of
4 ways:

1) With no AVX2 support at all (for example, when PHP is compiled for
   CPU architectures other than AMD64)
2) For CPUs with AVX2 only (for example, when PHP is built with
   CCFLAGS='-march=native' on a host which implements AVX2)
3) With runtime detection of AVX2 performed by the dynamic linker;
   this requires a dynamic linker which supports the STT_GNU_IFUNC
   symbol type extension to the ELF binary standard. This is true of
   glibc's dynamic linker, as of late 2009.
4) With runtime detection of AVX2 performed by the module init function.
   The detection is done by checking the output of CPUID and then a
   function pointer is set accordingly. In this case, all calls to the
   UTF-8 validation routine are indirect calls through that
   function pointer.

@alexdowad
Contributor Author

I tried also making an AVX2-based version of mb_fast_strlen_utf8. On strings with about 10,000 bytes, it is just about twice as fast. But, interestingly, on strings with about 100,000 bytes, it's only about 25% faster than the SSE2-based version. I haven't investigated to find out the reason.

@alexdowad
Contributor Author

Hmm, I need to figure out why this code won't compile on a Mac.

@mvorisek
Contributor

In the past I found UTF-8 validation like:

if (!preg_match('~~u', $value)) {
    throw new Exception('Value is not valid UTF-8');
}

to perform very fast. How does its speed compare with mb_check_encoding, and is the result really 1:1 for all possible inputs?

@alexdowad
Contributor Author

In the past I found UTF-8 validation like:

if (!preg_match('~~u', $value)) {
    throw new Exception('Value is not valid UTF-8');
}

to perform very fast. How does its speed compare with mb_check_encoding, and is the result really 1:1 for all possible inputs?

@mvorisek Both mbstring and PCRE cache the results of UTF-8 validity checking (by setting a flag in the string object header) and can use this flag to bypass validating the same string a second time. So to compare the speed of these two validation functions, I had to disable that 'bypass' in both mbstring and PCRE.

After doing so, results are: mbstring (with AVX2-based validation) is 9.3x faster for strings around 10,000 bytes, 5.8x faster for strings around 100 bytes, and 3.4x faster for strings around 5 bytes.

I previously used a fuzzer to search for strings where PCRE's UTF-8 validation would return different results than mbstring's UTF-8 validation, and it didn't find any.

@alexdowad
Contributor Author

@mvorisek Just one thought. I'm not sure how you tested preg_match previously and determined it to be very fast for UTF-8 validation... but if you tested it by validating the same string repeatedly in a loop, then you were just measuring how fast it can bypass validation by checking the string's GC_FLAGS.

@mvorisek
Contributor

Maybe, I also tested the perf with PHP 7.4 or 8.0... The performance of this PR sounds very promising!

Member

@Girgias Girgias left a comment


I don't really know AVX2, but I read the paper yesterday and it seems to be properly implemented (which is backed by the test suite). I'm just confused about what the very large table is used for.

BAD_BYTE, BAD_BYTE, BAD_BYTE, BAD_BYTE,
BAD_BYTE, BAD_BYTE, BAD_BYTE, INVALID_CP,
0, 0, OVERLONG_2BYTE, OVERLONG_2BYTE | OVERLONG_3BYTE | OVERLONG_4BYTE);
const __m256i bad_hi_nibble = _mm256_set_epi8(
Member


I'm confused what's the purpose of this table again?

Contributor Author

@alexdowad alexdowad Jan 25, 2023


The three 32-byte tables are used for looking up the "possible bad 2-byte combination" bitmasks for the high nibble of the first byte, low nibble of the first byte, and high nibble of the second byte. If any bit is set in all 3 of the resulting bitmasks, then it means all three of those nibbles match the same bad 2-byte combination... meaning the string is invalid.

Each of the 32-byte tables is actually the same 16-byte table, repeated twice. This is because the AVX2 instruction used for table lookups splits a 256-bit YMM register into two halves, using the first 16 bytes as the lookup table for the first half and the second 16 bytes as the lookup table for the second half.

We don't really want this behavior; for our use case, we want the same lookup table to apply to every byte in the entire register. But that's how the instruction works.

Member


Right okay, thanks for the explanation. :)

@kamil-tekiela
Member

Maybe, I also tested the perf with PHP 7.4 or 8.0... The performance of this PR sounds very promising!

My quick performance tests show an incredible improvement between PHP 8.0 and 8.2 for mb_check_encoding. It is now faster than PCRE for short strings (https://3v4l.org/sM8Dr#v8.2.1). It seems to be much slower for long strings, but that could also be the fault of my imperfect testing. With 10k+ characters, the performance is absolutely terrible. If Alex can confirm this, it would be great. https://3v4l.org/1Ghji#v8.2.1

However, running my experiment on the master branch seems to show that mbstring is unbeatable in terms of performance. https://3v4l.org/1Ghji/rfc#vgit.master

If this is all true then these are amazing speed optimizations for mbstring.

@alexdowad
Contributor Author

My quick performance tests show an incredible improvement between PHP 8.0 and 8.2 for mb_check_encoding. It is now faster than PCRE for short strings (https://3v4l.org/sM8Dr#v8.2.1). It seems to be much slower for long strings, but that could also be the fault of my imperfect testing. With 10k+ characters, the performance is absolutely terrible. If Alex can confirm this, it would be great. https://3v4l.org/1Ghji#v8.2.1

However, running my experiment on the master branch seems to show that mbstring is unbeatable in terms of performance. https://3v4l.org/1Ghji/rfc#vgit.master

If this is all true then these are amazing speed optimizations for mbstring.

Dear @kamil-tekiela, thanks for the great testing!

What I think we are seeing here is that while PCRE has cached the UTF-8 validity of strings for a long time, mbstring just started doing this in 8.3. In 8.2, mb_check_encoding would perform the validation every time, even if you call it repeatedly on the same string object.

If you look at the times which you kindly helped to measure, mbstring takes around 1000x longer for a 10,000-byte string than for a 10-byte string (actually a bit better), which is what we expect from a function that performs the validation every time.

On the other hand, PCRE has almost the same performance for 10-byte strings and for 10,000-byte strings. That would not be possible if it were actually performing the validation every time. The reason, again, is that it only validates each string once; when you call preg_match another 99 times on the same string object, it uses the cached UTF-8 validity status of the string.

@alexdowad
Contributor Author

Just seeing if I can get this to build properly on Mac...

@smokefire69

This comment was marked as spam.

@alexdowad
Contributor Author

What?? You must be sending this to me by mistake wtf is it and what is it about or for??

@smokefire69 I didn't send anything to you deliberately. You can look at the above message in this thread and see your name isn't there.

Looks like some bug in GitHub.

@alexdowad
Contributor Author

At least the code compiles on Mac now, let's see if the tests pass.

It would be really nice if someone with an x86-64 Mac (2014 or later) could test whether they actually see the increased performance from AVX2 acceleration.

@alexdowad
Contributor Author

OK, the test run on Mac timed out... but reviewing the CI logs reveals that the tests which ran slowly enough to cause the timeout had nothing to do with this PR.

@alexdowad
Contributor Author

It looks like we are good to merge here, unless anyone has more ideas on how to improve.

I am just trying to implement UTF-16 decoding/encoding using SSE2. I have UTF-16LE decoding/encoding routines which together are 2.9x faster than the non-vectorized ones... but I have used SSE4.1 instructions in one place 😓

It would be nice if I could figure out a fast way to do the same with just SSE2.

@alexdowad
Contributor Author

Merged. Thank you very much to everyone who reviewed.
