
ext/mbstring: Update to Unicode 15.1 #14680


Closed

Conversation

Member

@Ayesh Ayesh commented Jun 26, 2024

Updates UCD to Unicode 15.1 (released in September 2023). Unicode 16 is expected to be released around September 2024.

Previously: 0fdffc1, #7502

UCD 15.1 DerivedNormalizationProps contains multiple properties in the same line, which breaks the parser. This also updates the ucgendat.php script to allow two or three fields in each line, and to look for the Cased and Case_Ignorable properties in either of the fields, mimicking the previous behavior.

@Ayesh Ayesh force-pushed the mbstring-ucd-15-1 branch from 854bfa4 to 8ca11ac Compare June 26, 2024 19:17
@Ayesh Ayesh marked this pull request as ready for review June 26, 2024 19:42
@Ayesh Ayesh requested a review from alexdowad as a code owner June 26, 2024 19:42
@alexdowad
Contributor

Hi, @Ayesh, and thanks very much for the contribution.

This generally looks great, though I would love to see a few added tests... for example, it seems that a couple entries in the EAW table have changed in Unicode 15.1. A couple of added test cases for mb_strwidth could be used to verify that we are returning the expected values for the new Unicode standard.

I'd have to look at this a bit more to figure out what other test cases we could add to make sure that the updated parser for DerivedNormalizationProps is correctly extracting the data which we need.

@Ayesh
Member Author

Ayesh commented Jun 26, 2024

Thank you @alexdowad - I will read more about the changed code points and add tests soon.

@Ayesh Ayesh force-pushed the mbstring-ucd-15-1 branch from 8ca11ac to bbf7b37 Compare June 27, 2024 12:23
@Ayesh
Member Author

Ayesh commented Jun 27, 2024

I added a new unicode_versions.phpt test that checks mb_strwidth with various scripts, existing emoji, and the width-2 WiFi emoji (U+1F6DC) added in Unicode 15. On the master branch, mb_strwidth("\u{1F6DC}") returns 1; on this PR branch it returns 2, as expected. Thank you.
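The width check described above can be illustrated with a small Python sketch. This is not the mbstring implementation or the `unicode_versions.phpt` test; it assumes that a code point counts as width 2 when its East Asian Width class is `W` or `F` and as width 1 otherwise. The helper names (`load_wide_ranges`, `display_width`) and the sample lines are hypothetical, hand-written in the style of `EastAsianWidth.txt` rather than copied from it.

```python
from bisect import bisect_right

# Hand-written sample data (an assumption, not an excerpt of the real file).
OLD_TABLE = [
    "0041..005A ; Na",   # LATIN CAPITAL LETTER A..Z (narrow)
    "3042       ; W",    # HIRAGANA LETTER A (wide)
    "FF21..FF3A ; F",    # FULLWIDTH LATIN CAPITAL LETTER A..Z
]
# Same table plus the WiFi emoji, standing in for the updated data.
NEW_TABLE = OLD_TABLE + ["1F6DC      ; W"]


def load_wide_ranges(lines):
    """Return sorted (first, last) ranges whose width class is W or F."""
    ranges = []
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        cp_range, width_class = (f.strip() for f in data.split(";"))
        if width_class in ("W", "F"):
            first, _, last = cp_range.partition("..")
            ranges.append((int(first, 16), int(last or first, 16)))
    return sorted(ranges)


def display_width(text, ranges):
    """Sum 2 for code points inside a wide range, 1 for everything else."""
    starts = [first for first, _ in ranges]
    total = 0
    for ch in text:
        i = bisect_right(starts, ord(ch)) - 1  # last range starting <= ch
        wide = i >= 0 and ord(ch) <= ranges[i][1]
        total += 2 if wide else 1
    return total


old_ranges = load_wide_ranges(OLD_TABLE)
new_ranges = load_wide_ranges(NEW_TABLE)
print(display_width("\U0001F6DC", old_ranges))  # 1, like the master branch
print(display_width("\U0001F6DC", new_ranges))  # 2, like the PR branch
```

With the older table the emoji falls outside every wide range and counts as 1; once the range is present it counts as 2, which is the difference the new test is meant to catch.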

Ayesh added 2 commits June 29, 2024 20:00
@Ayesh Ayesh force-pushed the mbstring-ucd-15-1 branch from bbf7b37 to 6c5b6e4 Compare June 29, 2024 13:00
@alexdowad
Contributor

@Ayesh, I'm just looking at the UCD 15.1 DerivedNormalizationProps file and trying to find which lines have "multiple properties in the same line". I haven't seen any yet. Could you clarify?

@Ayesh
Member Author

Ayesh commented Jun 29, 2024

The parser was tripping on the Indic_Conjunct_Break section of the DerivedCoreProperties.txt file (line 12613 in the Unicode 15.1 version). There is a relevant discussion/PR on Python.

094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA

My understanding is that the parser always expected two fields (codepoint or range, and the property), but because 15.1 has sections with two properties (making it three fields in total), it tripped the parser.
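As a rough illustration of the two-versus-three-field issue, here is a minimal Python sketch. It is not the code in `ucgendat.php`; the function name `collect_case_properties` and the sample list are hypothetical, and the sample lines are written in the style of `DerivedCoreProperties.txt` for demonstration only. It accepts lines with either two or three fields and picks up `Cased` / `Case_Ignorable` from whichever non-code-point field carries them.

```python
WANTED = {"Cased", "Case_Ignorable"}

# Illustrative lines in the style of DerivedCoreProperties.txt (assumed sample).
SAMPLE = [
    "# comment-only lines and blank lines are skipped",
    "",
    "0041..005A    ; Cased # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0027          ; Case_Ignorable # Po       APOSTROPHE",
    "094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA",
]


def collect_case_properties(lines):
    """Map each wanted property to a list of (first, last) code point ranges."""
    found = {name: [] for name in WANTED}
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # strip the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        # Older files only had "range ; property"; 15.1 adds
        # "range ; property ; value" lines such as the InCB entries.
        if len(fields) not in (2, 3):
            raise ValueError(f"unexpected number of fields: {raw!r}")
        first, _, last = fields[0].partition("..")
        cp_range = (int(first, 16), int(last or first, 16))
        # Look for the wanted property in any field after the code points.
        for field in fields[1:]:
            if field in WANTED:
                found[field].append(cp_range)
    return found


result = collect_case_properties(SAMPLE)
print(result["Cased"])           # [(65, 90)]
print(result["Case_Ignorable"])  # [(39, 39)]
```

The three-field `InCB` line is parsed without error and simply contributes nothing, since neither of its property fields is one the case tables need.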

Thank you.

@alexdowad
Contributor

@Ayesh Thanks for that!

I wanted to know which properties are actually written two per line, and whether they affect anything else in mbstring.

It looks like the new properties relate to new rules for grapheme segmentation... that doesn't affect mbstring, but it will affect the built-in grapheme functions!

@alexdowad
Contributor

Landed on master. Thank you so much, @Ayesh!

@alexdowad alexdowad closed this Jun 29, 2024
@Ayesh
Member Author

Ayesh commented Jun 29, 2024

Yay, thank you so much! I will check with Intl and PCRE2 to see whether they can work out the Unicode 15.1 changes.

@Ayesh Ayesh deleted the mbstring-ucd-15-1 branch June 29, 2024 15:35
jorgsowa referenced this pull request Jun 30, 2024
Ayesh added a commit to Ayesh/php-src that referenced this pull request Sep 15, 2024
Updates UCD to Unicode 16.0 (released 2024 Sept).

Previously: 0fdffc1, php#7502, php#14680

Unicode 16 adds several new character sets and case folding rules.
However, the existing ucgendat script can still parse them.

This also adds a couple test cases to make sure the new rules for
East Asian Wide characters and case folding work correctly. These
tests fail on Unicode 15.1 and older because those versions do not
contain those rules.
Ayesh added a commit to Ayesh/php-src that referenced this pull request Sep 16, 2024
alexdowad pushed a commit that referenced this pull request Sep 17, 2024
2 participants