-
Notifications
You must be signed in to change notification settings - Fork 7.9k
UTF-8 validate strings before interning #10870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8 | ||
}; | ||
|
||
ZEND_API bool zend_string_validate_utf8(zend_string *string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/alexdowad/php-src/blob/d882511808d973a354f63bb5821f552d46c09d8e/ext/mbstring/mbstring.c#L4621 should be used (and moved to non-mbstring/core if needed)
To fix #10853, a new flag should be added if the UTF-8 validity was checked or not, otherwise invalid UTF-8 string will need to be check on each call. |
439fa8a
to
8f9322b
Compare
This looks great to me. Thanks. The table-driven state machine for UTF-8 validity checking is very interesting. I'm interested to know how it compares when benchmarked against the 'fallback' UTF-8 validity checking implementation in mbstring (which was borrowed from PCRE). |
👍 It looks like availability of bits in the object header should not be a problem... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits, but LGTM.
Might make sense to pull the MBString UTF-8 checking code as it may be faster, but I'll let @alexdowad benchmark the different implementation :-)
if (state == UTF8_REJECT) | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CS Nit:
if (state == UTF8_REJECT) | |
break; | |
if (state == UTF8_REJECT) { | |
break; | |
} |
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8 | ||
}; | ||
|
||
ZEND_API bool zend_string_validate_utf8(zend_string *string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem to modify the pointer
ZEND_API bool zend_string_validate_utf8(zend_string *string) { | |
ZEND_API bool zend_string_validate_utf8(const zend_string *string) { |
I wrote a test script https://gist.github.com/youkidearitai/566e348e2e23301063ef5a95579d4efd(I put to |
@alexdowad Did you have time benchmarking the two implementations? If either implementation is faster it might make sense to move that one to core and use it from mbstring. I'll also check whether we can see a performance hit during opcache persisting. |
// Copyright (c) 2008-2009 Bjoern Hoehrmann <[email protected]> | ||
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details. | ||
// https://stackoverflow.com/a/22135005/1320374 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is under BSD licence so it should be pointed here as well or ideally new header created so it doesn't mix licenses in the single file. I have been actually using this in jsond for some time and I separated the headers for that.
Hi @iluuu1994. Sorry for the delay. I just put together a lil' benchmark program which runs both functions on a bunch of random strings and compares how long they take. Using gcc with no optimization, the function from PCRE is somewhat faster. With |
Looks like compiler must have been optimizing calls to Bjoern's function out because the result was not used. After making sure the result is used, the PCRE function is consistently faster. |
Looks like @iluuu1994 would do better to use We already have a mechanism whereby Zend core calls into some mbstring function through function pointers which are set via I think you should go ahead and move both |
Maybe also rename |
Thank you @alexdowad for your investigation! Moving the functions into core sounds good to me. I'll do that and check if there are any performance regressions for startup/compilation. |
Fixes GH-10853
@alexdowad Sorry, I've only realized that you mentioned in #10409 that you'd like to work on this once I was already done.
I see it was also suggested that atomics could be used. I don't know C11 atomics at all but from what I've read it seems the
type_info
would have to be made atomic which is probably not a good idea as it's not usually necessary. But please correct me if I'm wrong. I'd also prefer if we didn't need to check all strings when interning, as it is only used in few cases atm and mbstrings implementation of UTF-8 validate is much more optimized.