Skip to content

UTF-8 validate strings before interning #10870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

iluuu1994
Copy link
Member

Fixes GH-10853

@alexdowad Sorry, I've only realized that you mentioned in #10409 that you'd like to work on this once I was already done.

I see it was also suggested that atomics could be used. I don't know C11 atomics at all but from what I've read it seems the type_info would have to be made atomic which is probably not a good idea as it's not usually necessary. But please correct me if I'm wrong. I'd also prefer if we didn't need to check all strings when interning, as it is only used in few cases atm and mbstrings implementation of UTF-8 validate is much more optimized.

1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

ZEND_API bool zend_string_validate_utf8(zend_string *string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvorisek
Copy link
Contributor

To fix #10853, a new flag should be added if the UTF-8 validity was checked or not, otherwise invalid UTF-8 string will need to be check on each call.

@iluuu1994 iluuu1994 force-pushed the interned-strings-validate-utf8 branch from 439fa8a to 8f9322b Compare March 17, 2023 17:36
@alexdowad
Copy link
Contributor

This looks great to me. Thanks.

The table-driven state machine for UTF-8 validity checking is very interesting. I'm interested to know how it compares when benchmarked against the 'fallback' UTF-8 validity checking implementation in mbstring (which was borrowed from PCRE).

@alexdowad
Copy link
Contributor

To fix #10853, a new flag should be added if the UTF-8 validity was checked or not, otherwise invalid UTF-8 string will need to be check on each call.

👍

It looks like availability of bits in the object header should not be a problem... type_info, where the GC flags are kept, is 32 bits wide, and it looks like there are currently only 9 flags for zend_string objects.

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, but LGTM.

Might make sense to pull the MBString UTF-8 checking code as it may be faster, but I'll let @alexdowad benchmark the different implementation :-)

Comment on lines +536 to +537
if (state == UTF8_REJECT)
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CS Nit:

Suggested change
if (state == UTF8_REJECT)
break;
if (state == UTF8_REJECT) {
break;
}

1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

ZEND_API bool zend_string_validate_utf8(zend_string *string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem to modify the pointer

Suggested change
ZEND_API bool zend_string_validate_utf8(zend_string *string) {
ZEND_API bool zend_string_validate_utf8(const zend_string *string) {

@youkidearitai
Copy link
Contributor

I wrote a test script https://gist.github.com/youkidearitai/566e348e2e23301063ef5a95579d4efd(I put to ext/mbstring/tests) that this PR works fine. Would you like add this .phpt file?

@iluuu1994
Copy link
Member Author

@alexdowad Did you have time benchmarking the two implementations? If either implementation is faster it might make sense to move that one to core and use it from mbstring. I'll also check whether we can see a performance hit during opcache persisting.

Comment on lines +501 to +503
// Copyright (c) 2008-2009 Bjoern Hoehrmann <[email protected]>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
// https://stackoverflow.com/a/22135005/1320374
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is under BSD licence so it should be pointed here as well or ideally new header created so it doesn't mix licenses in the single file. I have been actually using this in jsond for some time and I separated the headers for that.

@alexdowad
Copy link
Contributor

@alexdowad Did you have time benchmarking the two implementations?

Hi @iluuu1994. Sorry for the delay. I just put together a lil' benchmark program which runs both functions on a bunch of random strings and compares how long they take.

Using gcc with no optimization, the function from PCRE is somewhat faster. With gcc -O3... Bjoern's function is so ridiculously fast that it's hard to believe. I am just investigating.

@alexdowad
Copy link
Contributor

Looks like compiler must have been optimizing calls to Bjoern's function out because the result was not used.

After making sure the result is used, the PCRE function is consistently faster.

@alexdowad
Copy link
Contributor

Looks like @iluuu1994 would do better to use mb_fast_check_utf8_default from mbstring, or mb_fast_check_utf8_avx2 in cases where it can be used.

We already have a mechanism whereby Zend core calls into some mbstring function through function pointers which are set via zend_multibyte_set_functions... but I guess you won't want to use that for this purpose, because then you could only use the UTF-8 checking functions if mbstring extension is present.

I think you should go ahead and move both mb_fast_check_utf8_{default,avx2} into core, rename them (perhaps change mb_ to zend_), and expose them to callers in extension modules.

@alexdowad
Copy link
Contributor

Maybe also rename _fast_check_ to just _check_.

@iluuu1994
Copy link
Member Author

Thank you @alexdowad for your investigation! Moving the functions into core sounds good to me. I'll do that and check if there are any performance regressions for startup/compilation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check UTF-8 validity for all constant strings on compile time
6 participants