Skip to content

mb_scrub does not attempt to scrub known-valid UTF-8 strings #10409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 10 additions & 7 deletions ext/mbstring/mbstring.c
Original file line number Diff line number Diff line change
Expand Up @@ -1534,7 +1534,7 @@ PHP_FUNCTION(mb_output_handler)

const mbfl_encoding *encoding = MBSTRG(current_http_output_encoding);
if (encoding == &mbfl_encoding_pass) {
RETURN_STR(zend_string_copy(str));
RETURN_STR_COPY(str);
}

if (arg_status & PHP_OUTPUT_HANDLER_START) {
Expand Down Expand Up @@ -1574,7 +1574,7 @@ PHP_FUNCTION(mb_output_handler)
}

if (!MBSTRG(outconv_enabled)) {
RETURN_STR(zend_string_copy(str));
RETURN_STR_COPY(str);
}

mb_convert_buf buf;
Expand Down Expand Up @@ -5066,12 +5066,10 @@ PHP_FUNCTION(mb_chr)
/* {{{ */
PHP_FUNCTION(mb_scrub)
{
char* str;
size_t str_len;
zend_string *enc_name = NULL;
zend_string *str, *enc_name = NULL;

ZEND_PARSE_PARAMETERS_START(1, 2)
Z_PARAM_STRING(str, str_len)
Z_PARAM_STR(str)
Z_PARAM_OPTIONAL
Z_PARAM_STR_OR_NULL(enc_name)
ZEND_PARSE_PARAMETERS_END();
Expand All @@ -5081,7 +5079,12 @@ PHP_FUNCTION(mb_scrub)
RETURN_THROWS();
}

RETURN_STR(php_mb_convert_encoding_ex(str, str_len, enc, enc));
if (enc == &mbfl_encoding_utf8 && (GC_FLAGS(str) & IS_STR_VALID_UTF8)) {
/* A valid UTF-8 string will not be changed by mb_scrub; so just increment the refcount and return it */
RETURN_STR_COPY(str);
}

RETURN_STR(php_mb_convert_encoding_ex(ZSTR_VAL(str), ZSTR_LEN(str), enc, enc));
}
/* }}} */

Expand Down
8 changes: 8 additions & 0 deletions ext/mbstring/tests/mb_scrub.phpt
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,15 @@ var_dump(
"?" === mb_scrub("\x80"),
"?" === mb_scrub("\x80", 'UTF-8')
);

$utf8str = "abc 日本語 Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν ⡍⠔⠙⠖ ⡊ ⠙⠕⠝⠰⠞";
// Check $utf8str so it is marked as 'valid UTF-8'
// This will enable optimized implementation of mb_scrub
if (!mb_check_encoding($utf8str, 'UTF-8'))
die("Test string should be valid UTF-8");
var_dump(mb_scrub($utf8str));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late, I have a question. This testcase $utf8str is not seem marked valid UTF-8.
I ran gdb, Marked UTF-8 is works fine below case.

>>> r -r 'mb_scrub("a", "UTF-8");'

!5082      if (enc == &mbfl_encoding_utf8 && (GC_FLAGS(str) & IS_STR_VALID_UTF8)) {
 5083          /* A valid UTF-8 string will not be changed by mb_scrub; so just increment the refcount and return it */
 5084          RETURN_STR_COPY(str);
 5085      }

mb_check_encoding seems marks valid UTF-8 when it is not interned string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! You are right!

I don't really understand why we don't mark interned strings as valid UTF-8. I copied the test for ZSTR_IS_INTERNED from the PCRE extension (PCRE only marks strings as valid UTF-8 if they are not interned).

I would love to remove that test and mark interned strings as valid UTF-8 if that's what they are... but Chesterson's fence.

For now I will add another test case like the one you showed above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikic I have just seen that you were the author of the ZSTR_IS_INTERNED check, in 2b9acd3.

Can you clarify why it is not OK to set the IS_STR_VALID_UTF8 flag on interned strings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexdowad Interned strings are immutable. We could set the flag with an atomic rmw op. Ideally we'd just check validity of all strings during interning though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikic The concern is for ZTS interpreters running multi-threaded programs, is that right?

What if I add a static inline function to mark a zend_string as valid UTF-8, then use preprocessor directives so on ZTS builds (and for interned strings only) it uses atomic ops to set that bit, but for non-ZTS builds it just uses normal, non-atomic stores? Any concerns about that?

Regarding the idea of checking validity of all interned strings... I do have concerns about performance. Just benchmarked locally and the new AVX2-based UTF-8 validation takes about 13ms on my computer for a 10MB string.

The non-vectorized validation function which I just merged (derived from PCRE) takes about 200ms for a 10MB string.

Any thoughts on the performance issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interned strings are usually very short (at a guess, I'd expected 90% to be < 128 bytes). Of course, there's still a cost to it.

Certainly, there is a cost to it. So, is the ideal solution to implement atomic update for interned strings, or is it to validate all interned strings as UTF-8, at the time of interning?

I could (hopefully) implement either of those solutions in the next couple days, but am just trying to figure out which one to go for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly, there is a cost to it. So, is the ideal solution to implement atomic update for interned strings, or is it to validate all interned strings as UTF-8, at the time of interning?

I think we would need to measure for realistic cases (such as actually running some real world apps, simulating multiple concurrent clients), but that might be difficult; not sure if any of the devs has a respective enviroment available (the Windows team once had, but that is now gone; maybe @dstogov has such an environment available).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would need to measure for realistic cases (such as actually running some real world apps, simulating multiple concurrent clients), but that might be difficult; not sure if any of the devs has a respective enviroment available (the Windows team once had, but that is now gone; maybe @dstogov has such an environment available).

If we are going to test/benchmark on "realistic cases", or even on unrealistic ones, I think it means I need to implement both solutions so performance comparisons can be done. Does that sound right? Otherwise, if neither solution has been implemented, I don't know what could actually be measured.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to test/benchmark on "realistic cases", or even on unrealistic ones, I think it means I need to implement both solutions so performance comparisons can be done. Does that sound right?

Yes; another drawback. :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; another drawback. :(

🤷 I'm not too worried about it.

?>
--EXPECT--
bool(true)
bool(true)
string(122) "abc 日本語 Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν ⡍⠔⠙⠖ ⡊ ⠙⠕⠝⠰⠞"