mb_scrub does not attempt to scrub known-valid UTF-8 strings #10409

alexdowad · 2023-01-21T21:33:04Z

Just a little performance optimization here for mb_scrub.

@cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

nielsdos · 2023-01-21T21:37:37Z

Looks correct to me :) Thank you.

ext/mbstring/mbstring.c

This means the same thing and makes the code read a tiny bit better. Thanks to Nikita Popov for the tip.

cmb69

Thank you!

alexdowad · 2023-01-22T11:54:18Z

Thanks, all. Just landing on master.

youkidearitai · 2023-01-22T14:43:46Z

ext/mbstring/tests/mb_scrub.phpt

+// This will enable optimized implementation of mb_scrub
+if (!mb_check_encoding($utf8str, 'UTF-8'))
+    die("Test string should be valid UTF-8");
+var_dump(mb_scrub($utf8str));


Sorry for the late reply; I have a question. In this test case, $utf8str does not seem to be marked as valid UTF-8.
I ran it under gdb, and the valid-UTF-8 fast path works fine in the case below.

>>> r -r 'mb_scrub("a", "UTF-8");'
!5082     if (enc == &mbfl_encoding_utf8 && (GC_FLAGS(str) & IS_STR_VALID_UTF8)) {
 5083         /* A valid UTF-8 string will not be changed by mb_scrub; so just increment the refcount and return it */
 5084         RETURN_STR_COPY(str);
 5085     }

mb_check_encoding seems to mark a string as valid UTF-8 only when it is not an interned string.

Thanks for pointing this out! You are right!

I don't really understand why we don't mark interned strings as valid UTF-8. I copied the test for ZSTR_IS_INTERNED from the PCRE extension (PCRE only marks strings as valid UTF-8 if they are not interned).
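To make the interaction concrete, here is a minimal, self-contained sketch of why an interned string never reaches the fast path. It is not the php-src code: `demo_string`, the `DEMO_*` flag values, and both functions are hypothetical stand-ins for `zend_string`, `IS_STR_VALID_UTF8`, the `ZSTR_IS_INTERNED` test, and the check in `mb_scrub`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; the real definitions live in php-src. */
#define DEMO_STR_INTERNED   (1u << 0)
#define DEMO_STR_VALID_UTF8 (1u << 1)

/* Stripped-down stand-in for zend_string: just a refcount and GC flags. */
typedef struct {
    uint32_t refcount;
    uint32_t flags;
} demo_string;

/* Models what happens after a successful validity check:
 * the result is cached on the string unless the string is interned. */
static void demo_remember_valid_utf8(demo_string *s)
{
    assert(s != NULL);
    if (!(s->flags & DEMO_STR_INTERNED)) {
        s->flags |= DEMO_STR_VALID_UTF8;
    }
}

/* Models the fast path: returns s with its refcount bumped when the flag
 * is set, or NULL when the caller would have to do a full scrub. */
static demo_string *demo_scrub_fast_path(demo_string *s)
{
    assert(s != NULL);
    if (s->flags & DEMO_STR_VALID_UTF8) {
        s->refcount++;
        return s;
    }
    return NULL;
}
```

In this model, a non-interned string that passed validation is returned as-is by `demo_scrub_fast_path`, while an interned one falls through to the slow path because the flag was never set on it.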

I would love to remove that test and mark interned strings as valid UTF-8 if that's what they are... but Chesterton's fence.

For now I will add another test case like the one you showed above.

@nikic I have just seen that you were the author of the ZSTR_IS_INTERNED check, in 2b9acd3.

Can you clarify why it is not OK to set the IS_STR_VALID_UTF8 flag on interned strings?

@alexdowad Interned strings are immutable. We could set the flag with an atomic rmw op. Ideally we'd just check validity of all strings during interning though.

@nikic The concern is for ZTS interpreters running multi-threaded programs, is that right?

What if I add a static inline function to mark a zend_string as valid UTF-8, then use preprocessor directives so on ZTS builds (and for interned strings only) it uses atomic ops to set that bit, but for non-ZTS builds it just uses normal, non-atomic stores? Any concerns about that?
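A rough sketch of what such a helper might look like, purely for discussion. Everything here is an assumption: `demo_string`, the `DEMO_*` flag values, and the `DEMO_ZTS` macro are hypothetical stand-ins, and the atomic operations are the GCC/Clang `__atomic` builtins, so this would not compile on MSVC as written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not the real php-src values. */
#define DEMO_STR_INTERNED   (1u << 0)
#define DEMO_STR_VALID_UTF8 (1u << 1)

/* Stand-in for the flags word of a zend_string. */
typedef struct {
    uint32_t flags;
} demo_string;

/* Marks a string as valid UTF-8.
 * With DEMO_ZTS defined (standing in for a thread-safe build), interned
 * strings may be visible to other threads, so the bit is set with an
 * atomic read-modify-write. Everywhere else a plain store is used. */
static inline void demo_mark_valid_utf8(demo_string *s)
{
    assert(s != NULL);
#ifdef DEMO_ZTS
    if (__atomic_load_n(&s->flags, __ATOMIC_RELAXED) & DEMO_STR_INTERNED) {
        __atomic_fetch_or(&s->flags, DEMO_STR_VALID_UTF8, __ATOMIC_RELAXED);
        return;
    }
#endif
    s->flags |= DEMO_STR_VALID_UTF8;
}
```

Relaxed ordering is used here on the assumption that the flag is only a cached hint and no other data is published through it; that assumption would need checking against real readers of the flag.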

Regarding the idea of checking validity of all interned strings... I do have concerns about performance. Just benchmarked locally and the new AVX2-based UTF-8 validation takes about 13ms on my computer for a 10MB string.

The non-vectorized validation function which I just merged (derived from PCRE) takes about 200ms for a 10MB string.

Any thoughts on the performance issue?

Interned strings are usually very short (at a guess, I'd expect 90% to be < 128 bytes). Of course, there's still a cost to it.
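For a back-of-the-envelope feel for that cost, the benchmark figures above can be scaled down to a short string. This is only a linear extrapolation under stated assumptions: "10MB" is taken as 10,000,000 bytes, cost is assumed proportional to length, and per-call overhead (which probably dominates for very short strings) is ignored. `demo_estimate_ns` is a hypothetical helper, not part of php-src.

```c
#include <assert.h>

/* Linearly extrapolates a measured validation time to a string of n_bytes.
 * measured_ms: time taken for the benchmark input, in milliseconds.
 * measured_bytes: size of the benchmark input, in bytes.
 * Returns the estimated time for n_bytes, in nanoseconds. */
static double demo_estimate_ns(double measured_ms, double measured_bytes, double n_bytes)
{
    assert(measured_bytes > 0.0);
    double ns_per_byte = (measured_ms * 1e6) / measured_bytes;
    return ns_per_byte * n_bytes;
}
```

With the numbers from this thread, `demo_estimate_ns(200, 1e7, 128)` works out to about 2560 ns for the non-vectorized validator and `demo_estimate_ns(13, 1e7, 128)` to about 166 ns for the AVX2 one, per 128-byte string.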

Certainly, there is a cost to it. So, is the ideal solution to implement atomic update for interned strings, or is it to validate all interned strings as UTF-8, at the time of interning?

I could (hopefully) implement either of those solutions in the next couple days, but am just trying to figure out which one to go for.

> Certainly, there is a cost to it. So, is the ideal solution to implement atomic update for interned strings, or is it to validate all interned strings as UTF-8, at the time of interning?

I think we would need to measure for realistic cases (such as actually running some real-world apps, simulating multiple concurrent clients), but that might be difficult; I'm not sure whether any of the devs has a suitable environment available (the Windows team once had one, but that is now gone; maybe @dstogov has such an environment available).

> I think we would need to measure for realistic cases (such as actually running some real-world apps, simulating multiple concurrent clients), but that might be difficult; I'm not sure whether any of the devs has a suitable environment available (the Windows team once had one, but that is now gone; maybe @dstogov has such an environment available).

If we are going to test/benchmark on "realistic cases", or even on unrealistic ones, I think it means I need to implement both solutions so performance comparisons can be done. Does that sound right? Otherwise, if neither solution has been implemented, I don't know what could actually be measured.

> If we are going to test/benchmark on "realistic cases", or even on unrealistic ones, I think it means I need to implement both solutions so performance comparisons can be done. Does that sound right?

Yes; another drawback. :(

> Yes; another drawback. :(

🤷 I'm not too worried about it.

dstogov · 2023-01-30T08:05:15Z

@alexdowad interned strings may reside in opcache shared memory. This memory shouldn't be updated without a lock. What's more, it may be read-only (see opcache.protect_memory).

In general, PHP/opcache may provide an API to update string flags.

alexdowad · 2023-01-30T08:12:25Z

@dstogov Thanks. Should I work on adding that API to opcache?

What do we do if interned strings are in read-only memory? Perhaps the API function for updating GC flags should just do nothing and return false in that case?

dstogov · 2023-01-30T08:31:50Z

@alexdowad see how zend_accel_inheritance_cache_add() is implemented and called

php-src/ext/opcache/ZendAccelerator.c, line 2335 in 908d954:

static zend_class_entry* zend_accel_inheritance_cache_add(zend_class_entry *ce, zend_class_entry *proto, zend_class_entry *parent, zend_class_entry **traits_and_interfaces, HashTable *dependencies)

The function itself should make the update after unprotecting the memory and while holding the lock:

static void zend_accel_add_interned_string_flags(zend_string *str, uint32_t flags)
{
    ZEND_ASSERT(ZSTR_IS_INTERNED(str));
    if ((GC_FLAGS(str) & flags) != flags) {
        SHM_UNPROTECT();
        zend_shared_alloc_lock();
        GC_FLAGS(str) |= flags;
        zend_shared_alloc_unlock();
        SHM_PROTECT();
    }
}
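One property of the snippet above worth spelling out is the check before the lock: once the flags are already set, later calls touch neither the protection nor the lock. The following standalone model illustrates that, with the caveat that every name in it is hypothetical. The `demo_*` functions only record state; they are mock stand-ins for `SHM_UNPROTECT`/`SHM_PROTECT` and the shared allocator lock, not real opcache calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the flags word of an interned zend_string. */
typedef struct {
    uint32_t flags;
} demo_string;

/* Mock state standing in for shared-memory protection and the lock. */
static bool demo_writable = false;
static bool demo_locked = false;
static int demo_lock_count = 0;

static void demo_shm_unprotect(void) { demo_writable = true; }
static void demo_shm_protect(void)   { demo_writable = false; }
static void demo_lock(void)          { demo_locked = true; demo_lock_count++; }
static void demo_unlock(void)        { demo_locked = false; }

/* Adds flags to s; returns true if an update was actually performed. */
static bool demo_add_flags(demo_string *s, uint32_t flags)
{
    assert(s != NULL);
    if ((s->flags & flags) == flags) {
        return false; /* already set: skip unprotect and lock entirely */
    }
    demo_shm_unprotect();
    demo_lock();
    assert(demo_writable && demo_locked); /* the write happens only in this window */
    s->flags |= flags;
    demo_unlock();
    demo_shm_protect();
    return true;
}
```

In this model the first call for a given flag takes the lock once, and any repeat call returns early without touching it.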

github-actions bot added the Extension: mbstring label Jan 21, 2023

nikic reviewed Jan 21, 2023

alexdowad added 2 commits January 22, 2023 07:38

mb_scrub does not attempt to scrub known-valid UTF-8 strings

197f810

Use RETURN_STR_COPY in mb_output_handler

69c5af8

This means the same thing and makes the code read a tiny bit better. Thanks to Nikita Popov for the tip.

alexdowad force-pushed the mbscrub branch from 7a57919 to 69c5af8 Compare January 22, 2023 05:41

cmb69 approved these changes Jan 22, 2023

alexdowad closed this Jan 22, 2023

alexdowad deleted the mbscrub branch January 22, 2023 11:55

youkidearitai reviewed Jan 22, 2023

youkidearitai mentioned this pull request Mar 17, 2023

Check UTF-8 validity for all constant strings on compile time #10853

Open

iluuu1994 mentioned this pull request Mar 17, 2023

UTF-8 validate strings before interning #10870

Draft
