Skip to content

Imply UTF8 validity in explode function #10805

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

mvorisek
Copy link
Contributor

@mvorisek mvorisek commented Mar 7, 2023

like #10780, needs #10853/#10870

when valid UTF-8 string is exploded by valid UTF-8 delimiter, the result is always valid UTF-8 string

invalid UTF-8 string might became valid, this case requires check on the whole string, thus cannot be implied from the source

@mvorisek
Copy link
Contributor Author

mvorisek commented Mar 8, 2023

Any idea why for the following test case:

$string = 'můj žlutý kůň';
var_dump(explode('ů', $string, -2));

the result is marked as NOT valid UTF-8 string? The real result is an array with single character: https://3v4l.org/oRGNM/rfc#vgit.master, maybe due some interning?

@nielsdos
Copy link
Member

Both the delimiter and string don't have the "valid UTF8" flag. Using mb_check_encoding to attempt to add the flag doesn't work because they're both indeed interned.

@mvorisek
Copy link
Contributor Author

Both the delimiter and string don't have the "valid UTF8" flag. Using mb_check_encoding to attempt to add the flag doesn't work because they're both indeed interned.

If the delimiter is valid UTF-8 and the string is valid as well, the result "is the same but has the delimiter cut away", ie. leading to a valid UTF-8 string. What do you exactly mean?

@nielsdos
Copy link
Member

What do you exactly mean?

Print the flags of the delimiter and the input string and you'll see they both don't have the UTF-8 valid flag. They were interned by the compiler and the compiler doesn't set that flag. Trying to add that flag using mbstring wouldn't work either because a flag can't be added once they're interned.

@mvorisek
Copy link
Contributor Author

Thank you for the hint. I created #10853 feature request, as the check is quite cheap, all interned/compiled strings should be checked upfront to utilize opcache.

@nielsdos
Copy link
Member

Thank you for the hint. I created #10853 feature request, as the check is quite cheap, all interned/compiled strings should be checked upfront to utilize opcache.

You're welcome. I think Alex was going to look into something like this. I'm not sure anymore but I thought I saw him say something like that in a comment on a PR some time ago...

@mvorisek
Copy link
Contributor Author

maybe later

@mvorisek mvorisek closed this Mar 23, 2025
@mvorisek mvorisek deleted the imply_more_utf8 branch March 23, 2025 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants