ext/mbstring: Update to Unicode 15.1 #14680
Hi, @Ayesh, and thanks very much for the contribution. This generally looks great, though I would love to see a few added tests... for example, it seems that a couple entries in the EAW table have changed in Unicode 15.1. A couple of added test cases for I'd have to look at this a bit more to figure out what other test cases we could add to make sure that the updated parser for |
Thank you @alexdowad - I will read more about the changed code points and add tests soon. |
I added a new |
Updates UCD to Unicode 15.1 (released September 2023). Unicode 16 is expected around September 2024. Previously: 0fdffc1, php#7502. UCD 15.1 `DerivedNormalizationProps` contains multiple properties in the same line, which breaks the parser. This also updates the `ucgendat.php` script to allow two or three fields in each line, and to look for the `Cased` and `Case_Ignorable` properties in either of the fields to mimic the previous behavior.
@Ayesh, I'm just looking at the UC 15.1 |
Parser was tripping on the
My understanding is that the parser always expected two fields (codepoint or range, and the property), but because 15.1 has sections with two properties (making it three fields in total), it tripped the parser. Thank you. |
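To make the field-count issue concrete, here is a minimal sketch in Python of a line parser that tolerates both layouts. It is not the code from `ucgendat.php`; the function name `parse_prop_line`, the `WANTED` set, and the sample lines are assumptions made for illustration, modeled on the usual semicolon-separated UCD layout.

```python
from typing import Optional, Tuple

# Properties we care about (per the PR description); everything else is skipped.
WANTED = {"Cased", "Case_Ignorable"}


def parse_prop_line(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, property) for a wanted property, else None."""
    # Drop the trailing "# ..." comment and surrounding whitespace.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    # Accept two fields (range; property) or three (range; property; value).
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected field count: {line!r}")
    # Look for a wanted property name in any field after the code point range.
    prop = next((f for f in fields[1:] if f in WANTED), None)
    if prop is None:
        return None
    # "0041..005A" is a range; "0027" is a single code point.
    start, _, end = fields[0].partition("..")
    return int(start, 16), int(end or start, 16), prop


# Illustrative sample lines (hypothetical excerpt, not copied from a real file).
SAMPLE = """\
0041..005A    ; Cased # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0027          ; Case_Ignorable # Po       APOSTROPHE
094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
"""

parsed = [r for r in map(parse_prop_line, SAMPLE.splitlines()) if r]
```

With this sketch, the three-field line is read without error and simply skipped, while the two-field `Cased` and `Case_Ignorable` lines are still picked up, which is the behavior the description above is aiming for.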
@Ayesh Thanks for that! I wanted to know which properties are actually written two per line, and whether they affect anything else in mbstring. It looks like the new properties relate to new rules for grapheme segmentation... that doesn't affect mbstring, but it will affect the built-in grapheme functions! |
Landed on master. Thank you so much, @Ayesh! |
Yay, thank you so much! I will check with Intl and PCRE2 to see whether they can work out the Unicode 15.1 changes. |
Updates UCD to Unicode 16.0 (released 2024 Sept). Previously: 0fdffc1, php#7502, php#14680 Unicode 16 adds several new character sets and case folding rules. However, the existing ucgendat script can still parse them. This also adds a couple test cases to make sure the new rules for East Asian Wide characters and case folding work correctly. These tests fail on Unicode 15.1 and older because those versions do not contain those rules.