
Commit 3c6a957

ext/mbstring: Update to Unicode 16
Updates UCD to Unicode 16.0 (released September 2024). Previously: 0fdffc1, php#7502, php#14680.

Unicode 16 adds several new scripts and characters, along with new case folding rules. However, the existing ucgendat script can still parse them.

This also adds a couple of test cases to make sure the new rules for East Asian Wide characters and case folding work correctly. These tests fail on Unicode 15.1 and older because those versions do not contain those rules.
1 parent 5121aca commit 3c6a957

3 files changed: +3900 -3767 lines changed

ext/mbstring/libmbfl/mbfl/eaw_table.h

Lines changed: 11 additions & 9 deletions
@@ -28,8 +28,10 @@ static const struct {
 	{ 0x23f3, 0x23f3 },
 	{ 0x25fd, 0x25fe },
 	{ 0x2614, 0x2615 },
+	{ 0x2630, 0x2637 },
 	{ 0x2648, 0x2653 },
 	{ 0x267f, 0x267f },
+	{ 0x268a, 0x268f },
 	{ 0x2693, 0x2693 },
 	{ 0x26a1, 0x26a1 },
 	{ 0x26aa, 0x26ab },
@@ -63,11 +65,10 @@ static const struct {
 	{ 0x3099, 0x30ff },
 	{ 0x3105, 0x312f },
 	{ 0x3131, 0x318e },
-	{ 0x3190, 0x31e3 },
+	{ 0x3190, 0x31e5 },
 	{ 0x31ef, 0x321e },
 	{ 0x3220, 0x3247 },
-	{ 0x3250, 0x4dbf },
-	{ 0x4e00, 0xa48c },
+	{ 0x3250, 0xa48c },
 	{ 0xa490, 0xa4c6 },
 	{ 0xa960, 0xa97c },
 	{ 0xac00, 0xd7a3 },
@@ -82,7 +83,7 @@ static const struct {
 	{ 0x16ff0, 0x16ff1 },
 	{ 0x17000, 0x187f7 },
 	{ 0x18800, 0x18cd5 },
-	{ 0x18d00, 0x18d08 },
+	{ 0x18cff, 0x18d08 },
 	{ 0x1aff0, 0x1aff3 },
 	{ 0x1aff5, 0x1affb },
 	{ 0x1affd, 0x1affe },
@@ -92,6 +93,8 @@ static const struct {
 	{ 0x1b155, 0x1b155 },
 	{ 0x1b164, 0x1b167 },
 	{ 0x1b170, 0x1b2fb },
+	{ 0x1d300, 0x1d356 },
+	{ 0x1d360, 0x1d376 },
 	{ 0x1f004, 0x1f004 },
 	{ 0x1f0cf, 0x1f0cf },
 	{ 0x1f18e, 0x1f18e },
@@ -132,11 +135,10 @@ static const struct {
 	{ 0x1f93c, 0x1f945 },
 	{ 0x1f947, 0x1f9ff },
 	{ 0x1fa70, 0x1fa7c },
-	{ 0x1fa80, 0x1fa88 },
-	{ 0x1fa90, 0x1fabd },
-	{ 0x1fabf, 0x1fac5 },
-	{ 0x1face, 0x1fadb },
-	{ 0x1fae0, 0x1fae8 },
+	{ 0x1fa80, 0x1fa89 },
+	{ 0x1fa8f, 0x1fac6 },
+	{ 0x1face, 0x1fadc },
+	{ 0x1fadf, 0x1fae9 },
 	{ 0x1faf0, 0x1faf8 },
 	{ 0x20000, 0x2fffd },
 	{ 0x30000, 0x3fffd },
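
For readers unfamiliar with this kind of table: each entry in the diff above is an inclusive range of code points treated as East Asian Wide, and the entries are sorted in ascending order. A table shaped like this can be searched with a binary search. The sketch below illustrates that idea only. The names `width_range`, `sample_eaw_table`, and `is_wide` are hypothetical, the table is a small excerpt of ranges touched by this commit, and the actual lookup code in mbstring may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one inclusive range of wide code points. */
struct width_range {
    uint32_t begin;
    uint32_t end;
};

/* Hypothetical excerpt of the table: only ranges added or changed in this
 * commit, kept sorted by `begin` so that binary search is valid. */
static const struct width_range sample_eaw_table[] = {
    { 0x2630, 0x2637 },
    { 0x268a, 0x268f },
    { 0x3190, 0x31e5 },
    { 0x3250, 0xa48c },
    { 0x1d300, 0x1d356 },
    { 0x1d360, 0x1d376 },
};

/* Returns 1 if code point `c` falls inside any range of the sample table,
 * 0 otherwise. */
static int is_wide(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof(sample_eaw_table) / sizeof(sample_eaw_table[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < sample_eaw_table[mid].begin) {
            hi = mid;          /* c lies before this range */
        } else if (c > sample_eaw_table[mid].end) {
            lo = mid + 1;      /* c lies after this range */
        } else {
            return 1;          /* begin <= c <= end */
        }
    }
    return 0;
}
```

With this excerpt, `is_wide(0x2630)` reports a wide character while `is_wide(0x2638)` does not. The merged entry `{ 0x3250, 0xa48c }` also covers U+4DC0 through U+4DFF, which sat in the gap between the two entries it replaces.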

ext/mbstring/tests/unicode_versions.phpt

Lines changed: 13 additions & 0 deletions
@@ -5,6 +5,8 @@ mbstring
 --FILE--
 <?php
 
+echo "Char widths:\n";
+
 print "ASCII (PHP): " . mb_strwidth('PHP', 'UTF-8') . "\n";
 
 print "Vietnamese (Xin chào): " . mb_strwidth('Xin chào', 'UTF-8') . "\n";
@@ -18,11 +20,22 @@ print "Emoji (\u{1F418}): " . mb_strwidth("\u{1F418}", 'UTF-8') . "\n";
 // New in Unicode 15.0, width=2
 print "Emoji (\u{1F6DC}): " . mb_strwidth("\u{1F6DC}", 'UTF-8') . "\n";
 
+// Changed in Unicode 16.0, U+2630...U+2637 are wide
+print "Emoji (\u{2630}): " . mb_strwidth("\u{2630}", 'UTF-8') . "\n";
+
+echo "Char case changes:\n";
+
+print "Upper(\u{019b}) = \u{a7dc} : ";
+var_dump(mb_strtoupper("\u{019b}", 'UTF-8') === "\u{a7dc}");
 ?>
 --EXPECT--
+Char widths:
 ASCII (PHP): 3
 Vietnamese (Xin chào): 8
 Traditional Chinese (你好): 4
 Sinhalese (අයේෂ්): 5
 Emoji (🐘): 2
 Emoji (🛜): 2
+Emoji (☰): 2
+Char case changes:
+Upper(ƛ) = Ƛ : bool(true)

0 commit comments
