UTF-8 validate strings before interning #10870

iluuu1994 · 2023-03-17T11:26:40Z

@alexdowad Sorry, I only realized that you mentioned in #10409 that you'd like to work on this after I was already done.

I see it was also suggested that atomics could be used. I don't know C11 atomics at all, but from what I've read it seems type_info would have to be made atomic, which is probably not a good idea since atomic access is not usually necessary. But please correct me if I'm wrong. I'd also prefer not to check every string when interning, as the validity information is currently used in only a few cases and mbstring's UTF-8 validation implementation is much more optimized.

mvorisek · 2023-03-17T14:04:47Z

Zend/zend_string.c

+    1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
+};
+
+ZEND_API bool zend_string_validate_utf8(zend_string *string) {


https://github.com/alexdowad/php-src/blob/d882511808d973a354f63bb5821f552d46c09d8e/ext/mbstring/mbstring.c#L4621 should be used (and moved to non-mbstring/core if needed)

mvorisek · 2023-03-17T14:07:47Z

To fix #10853, a new flag should be added to record whether UTF-8 validity has been checked; otherwise an invalid UTF-8 string will need to be checked on each call.

alexdowad · 2023-03-17T18:21:38Z

This looks great to me. Thanks.

The table-driven state machine for UTF-8 validity checking is very interesting. I'm interested to know how it compares when benchmarked against the 'fallback' UTF-8 validity checking implementation in mbstring (which was borrowed from PCRE).
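For comparing implementations, a slow but easy-to-audit reference validator can be handy as a correctness oracle. The sketch below is a plain branch-based check, not the table-driven DFA from this PR and not the PCRE-derived code in mbstring; the name `ref_validate_utf8` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference validator: plain branches, no lookup table.
 * Rejects overlong forms, surrogates (U+D800..U+DFFF) and code points
 * above U+10FFFF. */
static bool is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80; /* continuation bytes look like 10xxxxxx */
}

bool ref_validate_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        if (b < 0x80) {                        /* ASCII */
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF) {   /* 2 bytes; C0/C1 would be overlong */
            if (len - i < 2 || !is_cont(s[i + 1])) {
                return false;
            }
            i += 2;
        } else if (b >= 0xE0 && b <= 0xEF) {   /* 3 bytes */
            if (len - i < 3 || !is_cont(s[i + 1]) || !is_cont(s[i + 2])) {
                return false;
            }
            if (b == 0xE0 && s[i + 1] < 0xA0) {
                return false;                  /* overlong encoding */
            }
            if (b == 0xED && s[i + 1] > 0x9F) {
                return false;                  /* UTF-16 surrogate range */
            }
            i += 3;
        } else if (b >= 0xF0 && b <= 0xF4) {   /* 4 bytes */
            if (len - i < 4 || !is_cont(s[i + 1]) || !is_cont(s[i + 2])
                    || !is_cont(s[i + 3])) {
                return false;
            }
            if (b == 0xF0 && s[i + 1] < 0x90) {
                return false;                  /* overlong encoding */
            }
            if (b == 0xF4 && s[i + 1] > 0x8F) {
                return false;                  /* above U+10FFFF */
            }
            i += 4;
        } else {
            return false; /* stray continuation byte, C0, C1, or F5..FF */
        }
    }
    return true;
}
```

Any faster implementation should agree with this function on every input, which makes it a reasonable cross-check before timing anything.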

alexdowad · 2023-03-17T18:25:47Z

> To fix #10853, a new flag should be added to record whether UTF-8 validity has been checked; otherwise an invalid UTF-8 string will need to be checked on each call.

👍

It looks like availability of bits in the object header should not be a problem... type_info, where the GC flags are kept, is 32 bits wide, and it looks like there are currently only 9 flags for zend_string objects.
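To make the flag idea concrete, here is a minimal sketch of caching the result in two spare bits, so that both valid and invalid strings are scanned only once. Everything in it is an assumption for illustration: `demo_string` is not the real zend_string layout, the flag names are made up, and `demo_scan` is an ASCII-only placeholder rather than a real UTF-8 validator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; not the real zend_string flags. */
#define DEMO_FLAG_VALID_UTF8   (1u << 0) /* string is known to be valid */
#define DEMO_FLAG_UTF8_CHECKED (1u << 1) /* validation has already run */

/* Hypothetical stand-in for a string header with a 32-bit flags word. */
typedef struct {
    uint32_t flags;
    size_t len;
    const unsigned char *val;
} demo_string;

/* Counts full scans so the caching effect can be observed. */
static unsigned demo_scan_count = 0;

/* Placeholder check (ASCII only); a real validator would go here. */
static bool demo_scan(const demo_string *s)
{
    demo_scan_count++;
    for (size_t i = 0; i < s->len; i++) {
        if (s->val[i] >= 0x80) {
            return false;
        }
    }
    return true;
}

/* Scans at most once per string, then answers from the cached bits. */
bool demo_is_valid_utf8(demo_string *s)
{
    if (!(s->flags & DEMO_FLAG_UTF8_CHECKED)) {
        if (demo_scan(s)) {
            s->flags |= DEMO_FLAG_VALID_UTF8;
        }
        s->flags |= DEMO_FLAG_UTF8_CHECKED;
    }
    return (s->flags & DEMO_FLAG_VALID_UTF8) != 0;
}
```

With only a "valid" bit, a string that fails the check would be rescanned on every call; the second bit is what avoids that.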

Girgias

Minor nits, but LGTM.

Might make sense to pull in the MBString UTF-8 checking code as it may be faster, but I'll let @alexdowad benchmark the different implementations :-)

Girgias · 2023-03-17T22:18:27Z

Zend/zend_string.c

+        if (state == UTF8_REJECT)
+            break;


CS Nit:

Suggested change

-        if (state == UTF8_REJECT)
-            break;
+        if (state == UTF8_REJECT) {
+            break;
+        }

Girgias · 2023-03-17T22:23:40Z

Zend/zend_string.c

+    1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
+};
+
+ZEND_API bool zend_string_validate_utf8(zend_string *string) {


Doesn't seem to modify the string through the pointer, so the parameter can be const.

Suggested change

-ZEND_API bool zend_string_validate_utf8(zend_string *string) {
+ZEND_API bool zend_string_validate_utf8(const zend_string *string) {

youkidearitai · 2023-03-18T08:57:48Z

I wrote a test script, https://gist.github.com/youkidearitai/566e348e2e23301063ef5a95579d4efd (which I put in ext/mbstring/tests), to confirm that this PR works fine. Would you like to add this .phpt file?

iluuu1994 · 2023-03-25T17:09:24Z

@alexdowad Did you have time to benchmark the two implementations? If either implementation is faster, it might make sense to move that one to core and use it from mbstring. I'll also check whether we can see a performance hit during opcache persisting.

bukka · 2023-03-25T20:43:44Z

Zend/zend_string.c

+// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
+// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
+// https://stackoverflow.com/a/22135005/1320374


This is under a BSD licence, so that should be pointed out here as well, or ideally a new header should be created so that licenses are not mixed in a single file. I have actually been using this in jsond for some time, and I separated the headers for that.

alexdowad · 2023-03-27T19:00:59Z

> @alexdowad Did you have time to benchmark the two implementations?

Hi @iluuu1994. Sorry for the delay. I just put together a lil' benchmark program which runs both functions on a bunch of random strings and compares how long they take.

Using gcc with no optimization, the function from PCRE is somewhat faster. With gcc -O3... Bjoern's function is so ridiculously fast that it's hard to believe. I am just investigating.

alexdowad · 2023-03-27T19:11:08Z

Looks like the compiler must have been optimizing out the calls to Bjoern's function because the result was not used.

After making sure the result is used, the PCRE function is consistently faster.
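For anyone repeating this kind of measurement, the sketch below shows one way to keep the result live: accumulate the return values and store the total into a `volatile` variable. The names (`bench_check`, `bench_run`, `bench_input`, `bench_sink`) are hypothetical, and `bench_check` is an ASCII-only placeholder for whichever validator is being timed; this is not the benchmark program described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* One input buffer for the benchmark. */
typedef struct {
    const unsigned char *data;
    size_t len;
} bench_input;

/* Placeholder for the function under test (ASCII-only check). */
static bool bench_check(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            return false;
        }
    }
    return true;
}

/* Writes to a volatile object must actually happen, so the value
 * stored here, and the calls that produced it, cannot be discarded. */
static volatile size_t bench_sink;

/* Checks every input once, stores the number of valid inputs in
 * *valid_out and in the sink, and returns elapsed CPU seconds. */
double bench_run(const bench_input *in, size_t n, size_t *valid_out)
{
    size_t valid = 0;
    clock_t start = clock();
    for (size_t i = 0; i < n; i++) {
        valid += bench_check(in[i].data, in[i].len) ? 1 : 0;
    }
    clock_t end = clock();
    bench_sink = valid;
    *valid_out = valid;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Feeding it many different inputs generated at run time also matters, since a compiler may still fold or hoist a call whose arguments are constant.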

alexdowad · 2023-03-27T19:15:54Z

Looks like @iluuu1994 would do better to use mb_fast_check_utf8_default from mbstring, or mb_fast_check_utf8_avx2 in cases where it can be used.

We already have a mechanism whereby Zend core calls into some mbstring function through function pointers which are set via zend_multibyte_set_functions... but I guess you won't want to use that for this purpose, because then you could only use the UTF-8 checking functions if mbstring extension is present.
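The drawback of the function-pointer route can be sketched as follows. This is only an illustration of the general pattern, not the actual zend_multibyte_set_functions interface; all `demo_` names are hypothetical, and `demo_ascii_only` stands in for whatever checker an extension would register.

```c
#include <stdbool.h>
#include <stddef.h>

/* Signature of a checker that an extension could register. */
typedef bool (*demo_utf8_check_fn)(const unsigned char *s, size_t len);

/* Stays NULL until some extension registers an implementation. */
static demo_utf8_check_fn demo_utf8_check_hook = NULL;

/* Called by the extension at startup (hypothetical registration API). */
void demo_set_utf8_check(demo_utf8_check_fn fn)
{
    demo_utf8_check_hook = fn;
}

/* Core-side wrapper. Returns false when no checker is registered, in
 * which case *ok is left untouched; otherwise stores the result in *ok. */
bool demo_utf8_check(const unsigned char *s, size_t len, bool *ok)
{
    if (demo_utf8_check_hook == NULL) {
        return false;
    }
    *ok = demo_utf8_check_hook(s, len);
    return true;
}

/* Stand-in for an extension's checker (ASCII only, for illustration). */
bool demo_ascii_only(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            return false;
        }
    }
    return true;
}
```

Every caller in core has to handle the "nothing registered" case, which is the cost that goes away if the checking functions live in core itself.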

I think you should go ahead and move both mb_fast_check_utf8_{default,avx2} into core, rename them (perhaps change mb_ to zend_), and expose them to callers in extension modules.

alexdowad · 2023-03-27T19:16:35Z

Maybe also rename _fast_check_ to just _check_.

iluuu1994 · 2023-03-27T20:48:16Z

Thank you @alexdowad for your investigation! Moving the functions into core sounds good to me. I'll do that and check if there are any performance regressions for startup/compilation.

iluuu1994 requested review from alexdowad and Girgias March 17, 2023 11:26

github-actions bot added Category: Engine Extension: mbstring Extension: opcache Extension: pcre Extension: zend_test labels Mar 17, 2023

mvorisek reviewed Mar 17, 2023

UTF-8 validate strings before interning

8f9322b

iluuu1994 force-pushed the interned-strings-validate-utf8 branch from 439fa8a to 8f9322b Compare March 17, 2023 17:36

Girgias approved these changes Mar 17, 2023

bukka reviewed Mar 25, 2023

mvorisek mentioned this pull request Apr 12, 2023

Imply UTF8 validity in explode function #10805

Closed

mvorisek mentioned this pull request Dec 5, 2023

Use ZSTR_IS_VALID_UTF8 where possible #12869

Merged
