UTF-8 Validation optimization for NEON using mb_check_encoding #11076
Conversation
However, it does not seem to give a speed acceleration on M1 macOS...
@youkidearitai Thanks very much!!

@cmb69 @Girgias Can we import MIT-licensed code into php-src? Should we?

@youkidearitai So it looks like this code validates 16 bytes at a time. If so, a 1.7x speedup does seem less than expected. I don't have a Mac, so I can't build this code and test it. I wonder, what happens if you build the benchmark code from cyb70289/utf8 and run it? Do you get results which are similar to theirs? For x86 SIMD (SSE2, AVX, etc.), I have found that one possible performance drain is constantly reloading constant values into an SIMD register. So I wonder if that is happening in this code or not. I can see that code like
@iluuu1994 I think you are currently working on pulling my x86 SIMD UTF-8 validation routine into Zend core... please take note of what @youkidearitai is doing here as well. |
@alexdowad Thank you very much. Let me do it slowly and carefully. When I was implementing it, I found a similar source and used it as a reference. I'm trying to read the assembly code. If the license is a problem or the implementation feels too difficult, I will close this PR.
It is harder than I thought. I'm closing this PR. Thanks for the advice, @alexdowad and @Girgias.
@youkidearitai If you are no longer interested in this problem, that is fine; this is OSS and no developer is 'forced' to work on something they don't want to work on. However, if you are still interested in this problem, but are closing the PR because figuring out what is going on seems difficult, I would encourage you to give it more time. Technical challenges which seem overwhelming can often be overcome if one is persistent and tries different approaches until something works. And if you do so, you may find that you learn a lot in the process. Up to you either way. |
@alexdowad Thank you very much. I don't want to give up.
Thanks again. Please give me a little more time.
@youkidearitai I'm glad to hear you are hoping to give this PR more time. I think it will be very valuable. As mentioned above, I can't directly help you on this, but I can give you ideas of what to try. I think a good first step would be to compile @cyb70289's benchmarking code on your machine, run it, and see if you get results comparable to theirs or not. |
@alexdowad Thanks for the advice! I took the benchmark below. Base code: https://gist.github.com/youkidearitai/7cd8771f6f6e40e21708129707b40204 (master is 5823955). I think the M1 Mac is fast to begin with; on the Raspberry Pi SoC it is particularly effective (2.4x faster).

[Benchmark output elided: average timings for the master and neonutf8 branches on M1 macOS and on Raspbian (Raspberry Pi 4B+)]
I'm reading the Arm Neon Intrinsics Reference (PDF file) to study it further.
@youkidearitai So if I have this right, it looks like your Mac M1 was able to process 1.4GB (14000 byte file repeated 100,000 times) of UTF-8 text in 461ms. Is that right? If so, your computer was able to process 2896MB/sec (1024 * 1024 bytes / MB). Looks like your computer is a lot faster than @cyb70289's. @cyb70289 found that his 'range2' NEON code was 2.8 times faster than his 'naive' code. But maybe our scalar validation function might be faster than that 'naive' one. When I have a bit of time I may try benchmarking to see whether that is true. If so, it would explain the difference between your results and @cyb70289's results. |
Apple M1 is indeed much faster than the machines I used. |
Indeed, thanks for the comment. |
I'm not sure of your use case. Just want to mention that if the strings to be verified contain mostly ASCII chars, with few multi-byte chars, this library is not good for that condition. It may hurt performance.
@cyb70289 Thanks for the comment. I used your code as a reference. Thank you again.
I will consider logic to determine whether the input is ASCII.
@alexdowad Okay, I'll try the benchmark.
if all registers (16 bytes) lower than 0x7F, assumed to be ASCII.
ASCII logic is included, but it is 2:30 JST. I will benchmark further after some sleep.
If all bytes in the SIMD register are 0x7F or lower, reset the previous struct.
@cyb70289 From the git commit logs in your repository, it looks like you and @easyaspi314 are the authors of the NEON-accelerated range2 UTF-8 validation implementation. A question, please... in order for more people to benefit from your work, would you be willing to give permission for your code to be incorporated in the PHP codebase and distributed under the PHP license? I don't expect that you will allow this, but if you do, it would be appreciated. (Of course, code comments would be included identifying you as the authors and pointing readers to the original code repository.) |
@alexdowad , I'm glad you find my utf8 library useful. It's okay to use it in php under php license. |
Investigation and my opinion. One of use case of
Therefore, I want to use |
Just FYI, simdjson (also from Lemire) implements a utf-8 validation said to be much faster than other libraries. |
@cyb70289 Thanks for pointing us to simdjson. This is the actual implementation of UTF-8 validation in that library: https://github.com/simdjson/simdjson/blob/d4ac1b51d0aeb2d4f792136fe7792de709006afa/src/generic/stage1/utf8_lookup4_algorithm.h It's using Lemire's algorithm, same as simdutf. If someone is interested in benchmarking it, that might be interesting, but (at the moment) I don't see any reason to suspect that it will be faster than your implementation of Lemire's algorithm or your range/range2 algorithms. |
Thanks for those good points. Please note that you can easily add an @cyb70289 has kindly given permission for his code to be distributed under the PHP license. Both the range and range2 code includes contributions from @easyaspi314, so it would be nice to hear from him/her as well. In the meantime, @youkidearitai, I would suggest you try importing the range/range2 NEON code (whichever you choose) and start testing. There is another important issue which needs to be addressed here, but first let's just confirm that everything works fine when range/range2 is imported into mbstring. |
I honestly forgot I wrote this code lol 😅 I give my permission to use it. |
@cyb70289 @easyaspi314 Thanks for approving use of the algorithm.
fixed: I tried compiling on Raspberry Pi 1, which uses
M1 Mac benchmark (neonutf8 branch vs. master branch):
48542 / 6167 = 7.8712502026x faster

Raspberry Pi 4B+ benchmark (neonutf8 branch vs. master branch):
102887 / 58592 = 1.7559905789x faster
On second look at this code (and the original), there is a major problem: there is no short-circuit. If there is an error at the beginning of a very long string, it would still go through the entire string, forcing a full O(n) check. Perhaps the
@easyaspi314 Thank you very much!
I read https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/UMAXV--Unsigned-Maximum-across-Vector-?lang=en. Certainly it seems that it cannot be decided unless it is compiled. For example, can we use
Well, come to think of it, if we are going to be checking it frequently in the main loop, this could be more optimal. The timing for the reducing instructions is pretty bad, so instead of vmaxvq:

```c
/* Merge the error vectors */
uint8x16_t error = vorrq_u8(error1, error2);
/*
 * Take the max of each adjacent element, selecting the errors (0xFF) into
 * the low 8 elements of the vector. The upper bits are ignored.
 */
uint8x16_t error_paired = vpmaxq_u8(error, error);
/* Extract the raw bit pattern of the low 8 elements. */
uint64_t error_raw = vgetq_lane_u64(vreinterpretq_u64_u8(error_paired), 0);
/* If any bits are nonzero, there is an error. */
if (error_raw != 0) {
    return false;
}
```

This avoids the pipeline-stalling umaxv:

```asm
orr   v0.16b, v0.16b, v1.16b
umaxp v0.16b, v0.16b, v0.16b
fmov  x0, d0
cbnz  x0, .Lfalse
```

Edit: also
Use vpmaxq instead of vmaxvq and extract the low 64 bits, which has better timing and avoids the pipeline-stalling umaxv.
@easyaspi314 Thank you very much for advice. I pushed your code.
Thanks again. I fixed what I had missed 😂
I did some on-device benchmarking, and checking every 64 bytes is about 15% faster on my Tensor G1 (Cortex-X1) with
It also fixes an endianness bug, because some people like to see things burn. I will make a PR for the original repo later today if you want to take it from there, but this is the gist:

```c
#define PROCESS_NEON(num_bytes) \
    do { \
        /* Avoid a dependency on other iterations */ \
        uint8x16_t error1 = vdupq_n_u8(0); \
        uint8x16_t error2 = vdupq_n_u8(0); \
        size_t num_iters = num_bytes / sizeof(uint8x16_t); \
        /* Parse a block of data, marking any errors in error1 and error2 */ \
        for (size_t i = 0; i < num_iters; i++) { \
            (parsing code) \
        } \
        /* Check the error flags */ \
        (Test error flags) \
    } while (0)

/* How much data to process before checking the error flag. */
size_t block_size = 4 * sizeof(uint8x16_t); /* 64 bytes */

/* Process 64 bytes at a time */
while (len >= block_size) {
    PROCESS_NEON(block_size);
}

/* Process the remaining data */
if (len >= sizeof(uint8x16_t)) {
    PROCESS_NEON(len);
}

/* Check if in the middle of a sequence */
if (len) {
    const int8_t *token = (const int8_t *)(data - 3);
    size_t lookahead = 0;
    if (token[2] > (int8_t)0xBF) {
        lookahead = 1;
    } else if (token[1] > (int8_t)0xBF) {
        lookahead = 2;
    } else if (token[0] > (int8_t)0xBF) {
        lookahead = 3;
    }
    data -= lookahead;
    len += lookahead;
}
```
...
Something tells me that is a better option... 😅 Although simdjson is apache 2.0 and written in C++ so that might be a problem. It also doesn't seem to short circuit but I don't think it needs to at that speed. |
Wow!! I guess we need to look more carefully at simdjson and figure out what their secret is. I gave it a cursory look-over, but it appeared to just be an implementation of the same Lemire algorithm. Not sure what I missed. |
wow...! |
memo: Running simdjson under GDB, it seems to use multiple chunks; possibly that is the reason why it is fast?
Almost 1.48 times faster with this improvement on Raspberry Pi 4B+, but maybe this is the limit of this approach.
I took a benchmark of https://github.com/simdutf/simdutf on Raspberry Pi 4B+ with the range2 algorithm. Maybe
is_utf8, simdutf, and simdjson all use the same code for UTF-8 validation. Just with a different namespace. Also yes there is a check to determine if a block is entirely ASCII which lets the code fly by all the twiddling and stuff. That is the reason it gets 9 GB/s (or 37 GB/s in my case) on an ASCII only file. |
@easyaspi314 Thanks for the advice. I tried to make the ASCII check more efficient, which brought a small speedup. I took a benchmark using simdutf on Raspberry Pi 4B+. If we want more speed, I need some ideas 🙇
Add UTF-8 validation optimization for NEON.
However, it does not seem to give a speed acceleration on M1 macOS...
On macOS, compiling with CPPFLAGS='-g -O3 -Wall' set in configure makes it maybe 1.7x faster. The Ubuntu (GCC) build is also maybe 1.7x faster.
This pull request is mostly copied from other open-source code (MIT License).
Referred from below:
Possibly there may be an omission; I would be happy if you could find it.
I made it because I thought it would be good to discuss whether NEON should be included.
FYA @alexdowad @Girgias @pakutoma