Commit f9b55bd

SnoopJ authored and benjaminp committed
fixes pythongh-109559: Update unicodedata for Unicode 15.1.0 (pythonGH-109560)
--------- Co-authored-by: Benjamin Peterson <[email protected]>
1 parent cd91e0b commit f9b55bd

9 files changed, +19006 -18588 lines changed

Doc/library/stdtypes.rst

Lines changed: 4 additions & 4 deletions
@@ -1641,7 +1641,7 @@ expression support in the :mod:`re` module).

    The casefolding algorithm is
    `described in section 3.13 'Default Case Folding' of the Unicode Standard
-   <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
+   <https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.

    .. versionadded:: 3.3

@@ -1805,7 +1805,7 @@ expression support in the :mod:`re` module).
    property being one of "Lm", "Lt", "Lu", "Ll", or "Lo". Note that this is different
    from the `Alphabetic property defined in the section 4.10 'Letters, Alphabetic, and
    Ideographic' of the Unicode Standard
-   <https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf>`_.
+   <https://www.unicode.org/versions/Unicode15.1.0/ch04.pdf>`_.


 .. method:: str.isascii()
@@ -1941,7 +1941,7 @@ expression support in the :mod:`re` module).

    The lowercasing algorithm used is
    `described in section 3.13 'Default Case Folding' of the Unicode Standard
-   <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
+   <https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.


 .. method:: str.lstrip([chars])
@@ -2290,7 +2290,7 @@ expression support in the :mod:`re` module).

    The uppercasing algorithm used is
    `described in section 3.13 'Default Case Folding' of the Unicode Standard
-   <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
+   <https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.


 .. method:: str.zfill(width)
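
Review note: the three methods whose documentation links change above (casefold, lower, upper) can give different results for the same character. A minimal sketch, using U+00DF only as an illustrative character whose case mappings are long-standing and not affected by this update:

```python
# U+00DF LATIN SMALL LETTER SHARP S, chosen only for illustration.
sharp_s = "\u00df"

folded = sharp_s.casefold()   # full case folding expands it to two letters
lowered = sharp_s.lower()     # already lowercase, so it is unchanged
uppered = sharp_s.upper()     # full uppercase mapping also expands it

print(repr(folded), lowered == sharp_s, repr(uppered))
```

For caseless comparison, casefold() is therefore the method to reach for; lower() would leave the two spellings unequal.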

Doc/library/unicodedata.rst

Lines changed: 4 additions & 4 deletions
@@ -17,8 +17,8 @@

 This module provides access to the Unicode Character Database (UCD) which
 defines character properties for all Unicode characters. The data contained in
-this database is compiled from the `UCD version 15.0.0
-<https://www.unicode.org/Public/15.0.0/ucd>`_.
+this database is compiled from the `UCD version 15.1.0
+<https://www.unicode.org/Public/15.1.0/ucd>`_.

 The module uses the same names and symbols as defined by Unicode
 Standard Annex #44, `"Unicode Character Database"
@@ -175,6 +175,6 @@ Examples:

 .. rubric:: Footnotes

-.. [#] https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt
+.. [#] https://www.unicode.org/Public/15.1.0/ucd/NameAliases.txt

-.. [#] https://www.unicode.org/Public/15.0.0/ucd/NamedSequences.txt
+.. [#] https://www.unicode.org/Public/15.1.0/ucd/NamedSequences.txt
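
Review note: code that depends on the updated database can check which UCD version an interpreter was built with through `unicodedata.unidata_version`. The helper name `ucd_version_tuple` below is hypothetical and only illustrates the comparison:

```python
import unicodedata


def ucd_version_tuple(version: str = unicodedata.unidata_version) -> tuple:
    """Parse a dotted UCD version string such as "15.1.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# True only on interpreters whose unicodedata was built from UCD 15.1.0 or later.
has_unicode_15_1 = ucd_version_tuple() >= (15, 1, 0)
print(unicodedata.unidata_version, has_unicode_15_1)
```

Comparing tuples rather than strings avoids mis-ordering versions such as "9.0.0" and "15.1.0".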

Doc/reference/lexical_analysis.rst

Lines changed: 4 additions & 4 deletions
@@ -315,16 +315,16 @@ The Unicode category codes mentioned above stand for:
 * *Nd* - decimal numbers
 * *Pc* - connector punctuations
 * *Other_ID_Start* - explicit list of characters in `PropList.txt
-  <https://www.unicode.org/Public/15.0.0/ucd/PropList.txt>`_ to support backwards
+  <https://www.unicode.org/Public/15.1.0/ucd/PropList.txt>`_ to support backwards
   compatibility
 * *Other_ID_Continue* - likewise

 All identifiers are converted into the normal form NFKC while parsing; comparison
 of identifiers is based on NFKC.

 A non-normative HTML file listing all valid identifier characters for Unicode
-15.0.0 can be found at
-https://www.unicode.org/Public/15.0.0/ucd/DerivedCoreProperties.txt
+15.1.0 can be found at
+https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt


 .. _keywords:
@@ -1045,4 +1045,4 @@ occurrence outside string literals and comments is an unconditional error:

 .. rubric:: Footnotes

-.. [#] https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt
+.. [#] https://www.unicode.org/Public/15.1.0/ucd/NameAliases.txt
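
Review note: the NFKC rule quoted in the lexical_analysis.rst context above can be observed from Python itself. A small sketch, using the ligature U+FB01 purely as an illustrative compatibility character:

```python
import unicodedata

# "\ufb01" is LATIN SMALL LIGATURE FI; NFKC maps it to the two letters "fi".
source_name = "\ufb01le"
normalized = unicodedata.normalize("NFKC", source_name)

# The parser applies the same normalization to identifiers, so the
# assignment below binds the name "file" rather than the ligature spelling.
namespace = {}
exec(source_name + " = 1", namespace)

print(normalized, "file" in namespace)
```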
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Update :mod:`unicodedata` database to Unicode 15.1.0.

Modules/unicodedata.c

Lines changed: 1 addition & 0 deletions
@@ -1035,6 +1035,7 @@ is_unified_ideograph(Py_UCS4 code)
         (0x2B740 <= code && code <= 0x2B81D) || /* CJK Ideograph Extension D */
         (0x2B820 <= code && code <= 0x2CEA1) || /* CJK Ideograph Extension E */
         (0x2CEB0 <= code && code <= 0x2EBE0) || /* CJK Ideograph Extension F */
+        (0x2EBF0 <= code && code <= 0x2EE5D) || /* CJK Ideograph Extension I */
         (0x30000 <= code && code <= 0x3134A) || /* CJK Ideograph Extension G */
         (0x31350 <= code && code <= 0x323AF);   /* CJK Ideograph Extension H */
 }
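
Review note: the range test in this C function can be mirrored in a few lines of Python, which may help when checking the new Extension I boundaries. This is only a sketch; the bounds are copied from the `cjk_ranges` list in Tools/unicode/makeunicodedata.py as changed by this commit, and the Python names are hypothetical:

```python
# (first, last) code point pairs, inclusive, copied from cjk_ranges in this commit.
CJK_RANGES = [
    (0x3400, 0x4DBF),    # CJK Ideograph Extension A
    (0x4E00, 0x9FFF),    # CJK Ideograph
    (0x20000, 0x2A6DF),  # CJK Ideograph Extension B
    (0x2A700, 0x2B739),  # CJK Ideograph Extension C
    (0x2B740, 0x2B81D),  # CJK Ideograph Extension D
    (0x2B820, 0x2CEA1),  # CJK Ideograph Extension E
    (0x2CEB0, 0x2EBE0),  # CJK Ideograph Extension F
    (0x2EBF0, 0x2EE5D),  # CJK Ideograph Extension I (added in this commit)
    (0x30000, 0x3134A),  # CJK Ideograph Extension G
    (0x31350, 0x323AF),  # CJK Ideograph Extension H
]


def is_unified_ideograph(code: int) -> bool:
    """Return True if the code point falls inside any of the listed ranges."""
    return any(first <= code <= last for first, last in CJK_RANGES)


print(is_unified_ideograph(0x2EBF0), is_unified_ideograph(0x2EBE1))
```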

Modules/unicodedata_db.h

Lines changed: 1486 additions & 1144 deletions
Some generated files are not rendered by default.

Modules/unicodename_db.h

Lines changed: 16476 additions & 16470 deletions
Some generated files are not rendered by default.

Objects/unicodetype_db.h

Lines changed: 1013 additions & 950 deletions
Some generated files are not rendered by default.

Tools/unicode/makeunicodedata.py

Lines changed: 17 additions & 12 deletions
@@ -44,7 +44,7 @@
 # * Doc/library/stdtypes.rst, and
 # * Doc/library/unicodedata.rst
 # * Doc/reference/lexical_analysis.rst (two occurrences)
-UNIDATA_VERSION = "15.0.0"
+UNIDATA_VERSION = "15.1.0"
 UNICODE_DATA = "UnicodeData%s.txt"
 COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"
 EASTASIAN_WIDTH = "EastAsianWidth%s.txt"
@@ -101,15 +101,16 @@

 # these ranges need to match unicodedata.c:is_unified_ideograph
 cjk_ranges = [
-    ('3400', '4DBF'),
-    ('4E00', '9FFF'),
-    ('20000', '2A6DF'),
-    ('2A700', '2B739'),
-    ('2B740', '2B81D'),
-    ('2B820', '2CEA1'),
-    ('2CEB0', '2EBE0'),
-    ('30000', '3134A'),
-    ('31350', '323AF'),
+    ('3400', '4DBF'), # CJK Ideograph Extension A CJK
+    ('4E00', '9FFF'), # CJK Ideograph
+    ('20000', '2A6DF'), # CJK Ideograph Extension B
+    ('2A700', '2B739'), # CJK Ideograph Extension C
+    ('2B740', '2B81D'), # CJK Ideograph Extension D
+    ('2B820', '2CEA1'), # CJK Ideograph Extension E
+    ('2CEB0', '2EBE0'), # CJK Ideograph Extension F
+    ('2EBF0', '2EE5D'), # CJK Ideograph Extension I
+    ('30000', '3134A'), # CJK Ideograph Extension G
+    ('31350', '323AF'), # CJK Ideograph Extension H
 ]


@@ -1105,11 +1106,15 @@ def __init__(self, version, cjk_check=True):
                 table[i].east_asian_width = widths[i]
         self.widths = widths

-        for char, (p,) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded():
+        for char, (propname, *propinfo) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded():
+            if propinfo:
+                # this is not a binary property, ignore it
+                continue
+
             if table[char]:
                 # Some properties (e.g. Default_Ignorable_Code_Point)
                 # apply to unassigned code points; ignore them
-                table[char].binary_properties.add(p)
+                table[char].binary_properties.add(propname)

         for char_range, value in UcdFile(LINE_BREAK, version):
             if value not in MANDATORY_LINE_BREAKS:
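
Review note: the switch from `(p,)` to `(propname, *propinfo)` matters because the old pattern only unpacks records with exactly one field after the code point, while the new one tolerates records that carry an extra value and lets the loop skip them. A minimal sketch of that behaviour, assuming records shaped like the `(char, fields)` pairs the loop iterates over; the function name and the sample records are hypothetical, not taken from the data file:

```python
def binary_properties(records):
    """Yield (char, propname) for records with no extra value field."""
    for char, (propname, *propinfo) in records:
        if propinfo:
            # A value follows the property name, so this is not a binary property.
            continue
        yield char, propname


# Illustrative records only: two binary properties and one valued property.
sample = [
    (0x0041, ["Alphabetic"]),
    (0x094D, ["InCB", "Linker"]),
    (0x0061, ["Lowercase"]),
]

print(list(binary_properties(sample)))

# The previous single-element pattern fails on the valued record.
try:
    (p,) = sample[1][1]
except ValueError as exc:
    print("old pattern:", exc)
```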
