Skip to content

mb_scrub does not attempt to scrub known-valid UTF-8 strings #10409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

alexdowad
Copy link
Contributor

Just a little performance optimization here for mb_scrub.

@cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

@nielsdos
Copy link
Member

Looks correct to me :) Thank you.

This means the same thing and makes the code read a tiny bit better.

Thanks to Nikita Popov for the tip.
Copy link
Member

@cmb69 cmb69 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@alexdowad
Copy link
Contributor Author

Thanks, all. Just landing on master.

@alexdowad alexdowad closed this Jan 22, 2023
@alexdowad alexdowad deleted the mbscrub branch January 22, 2023 11:55
// This will enable optimized implementation of mb_scrub
if (!mb_check_encoding($utf8str, 'UTF-8'))
die("Test string should be valid UTF-8");
var_dump(mb_scrub($utf8str));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late, I have a question. This testcase $utf8str is not seem marked valid UTF-8.
I ran gdb, Marked UTF-8 is works fine below case.

>>> r -r 'mb_scrub("a", "UTF-8");'

!5082      if (enc == &mbfl_encoding_utf8 && (GC_FLAGS(str) & IS_STR_VALID_UTF8)) {
 5083          /* A valid UTF-8 string will not be changed by mb_scrub; so just increment the refcount and return it */
 5084          RETURN_STR_COPY(str);
 5085      }

mb_check_encoding seems marks valid UTF-8 when it is not interned string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! You are right!

I don't really understand why we don't mark interned strings as valid UTF-8. I copied the test for ZSTR_IS_INTERNED from the PCRE extension (PCRE only marks strings as valid UTF-8 if they are not interned).

I would love to remove that test and mark interned strings as valid UTF-8 if that's what they are... but Chesterson's fence.

For now I will add another test case like the one you showed above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikic I have just seen that you were the author of the ZSTR_IS_INTERNED check, in 2b9acd3.

Can you clarify why it is not OK to set the IS_STR_VALID_UTF8 flag on interned strings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdowad Interned strings are immutable. We could set the flag with an atomic rmw op. Ideally we'd just check validity of all strings during interning though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikic The concern is for ZTS interpreters running multi-threaded programs, is that right?

What if I add a static inline function to mark a zend_string as valid UTF-8, then use preprocessor directives so on ZTS builds (and for interned strings only) it uses atomic ops to set that bit, but for non-ZTS builds it just uses normal, non-atomic stores? Any concerns about that?

Regarding the idea of checking validity of all interned strings... I do have concerns about performance. Just benchmarked locally and the new AVX2-based UTF-8 validation takes about 13ms on my computer for a 10MB string.

The non-vectorized validation function which I just merged (derived from PCRE) takes about 200ms for a 10MB string.

Any thoughts on the performance issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interned strings are usually very short (at a guess, I'd expected 90% to be < 128 bytes). Of course, there's still a cost to it.

Certainly, there is a cost to it. So, is the ideal solution to implement atomic update for interned strings, or is it to validate all interned strings as UTF-8, at the time of interning?

I could (hopefully) implement either of those solutions in the next couple days, but am just trying to figure out which one to go for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly, there is a cost to it. So, is the ideal solution to implement atomic update for interned strings, or is it to validate all interned strings as UTF-8, at the time of interning?

I think we would need to measure for realistic cases (such as actually running some real world apps, simulating multiple concurrent clients), but that might be difficult; not sure if any of the devs has a respective enviroment available (the Windows team once had, but that is now gone; maybe @dstogov has such an environment available).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would need to measure for realistic cases (such as actually running some real world apps, simulating multiple concurrent clients), but that might be difficult; not sure if any of the devs has a respective enviroment available (the Windows team once had, but that is now gone; maybe @dstogov has such an environment available).

If we are going to test/benchmark on "realistic cases", or even on unrealistic ones, I think it means I need to implement both solutions so performance comparisons can be done. Does that sound right? Otherwise, if neither solution has been implemented, I don't know what could actually be measured.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to test/benchmark on "realistic cases", or even on unrealistic ones, I think it means I need to implement both solutions so performance comparisons can be done. Does that sound right?

Yes; another drawback. :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; another drawback. :(

🤷 I'm not too worried about it.

@dstogov
Copy link
Member

dstogov commented Jan 30, 2023

@alexdowad interned strings may lay in opcache shared memory. This memory shouldn't be updated without a lock. Even more, it may be read-only (see opcache.protect_memory).

In general, PHP/opcache may provide an API to update string flags.

@alexdowad
Copy link
Contributor Author

@dstogov Thanks. Should I work on adding that API to opcache?

What do we do if interned strings are in read-only memory? Perhaps the API function for updating GC flags should just do nothing and return false in that case?

@dstogov
Copy link
Member

dstogov commented Jan 30, 2023

@alexdowad see how zend_accel_inheritance_cache_add() is implemented and called

static zend_class_entry* zend_accel_inheritance_cache_add(zend_class_entry *ce, zend_class_entry *proto, zend_class_entry *parent, zend_class_entry **traits_and_interfaces, HashTable *dependencies)

The function itself should make an update after unprotecting memory and under lock

static void zend_accel_add_interned_string_flags(zend_string *str, uint32_t flags)
{
    ZEND_ASSERT(ZSTR_IS_INTERNED(str));;
    if ((GC_FLAGS(str) & flags) != flags) {
	SHM_UNPROTECT();
	zend_shared_alloc_lock();
	GC_FLAGS(str) |= flags;
	zend_shared_alloc_unlock();
	SHM_PROTECT();
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants