gh-109559: Update unicodedata for Unicode 15.1 #109560
Changes from all commits
21e297c
122a732
6d5238e
cd9cbf5
818a36c
24088ca
110c552
d8d9f98
27b1c13
af730eb
0db6920
44f6770
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Update :mod:`unicodedata` database to Unicode 15.1.0. |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -44,7 +44,7 @@ | |
# * Doc/library/stdtypes.rst, and | ||
# * Doc/library/unicodedata.rst | ||
# * Doc/reference/lexical_analysis.rst (two occurrences) | ||
-UNIDATA_VERSION = "15.0.0" | ||
+UNIDATA_VERSION = "15.1.0" | ||
UNICODE_DATA = "UnicodeData%s.txt" | ||
COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt" | ||
EASTASIAN_WIDTH = "EastAsianWidth%s.txt" | ||
|
@@ -101,15 +101,16 @@ | |
|
||
# these ranges need to match unicodedata.c:is_unified_ideograph | ||
cjk_ranges = [ | ||
-('3400', '4DBF'), | ||
-('4E00', '9FFF'), | ||
-('20000', '2A6DF'), | ||
-('2A700', '2B739'), | ||
-('2B740', '2B81D'), | ||
-('2B820', '2CEA1'), | ||
-('2CEB0', '2EBE0'), | ||
-('30000', '3134A'), | ||
-('31350', '323AF'), | ||
+('3400', '4DBF'), # CJK Ideograph Extension A CJK | ||
+('4E00', '9FFF'), # CJK Ideograph | ||
+('20000', '2A6DF'), # CJK Ideograph Extension B | ||
+('2A700', '2B739'), # CJK Ideograph Extension C | ||
+('2B740', '2B81D'), # CJK Ideograph Extension D | ||
+('2B820', '2CEA1'), # CJK Ideograph Extension E | ||
+('2CEB0', '2EBE0'), # CJK Ideograph Extension F | ||
+('2EBF0', '2EE5D'), # CJK Ideograph Extension I | ||
+('30000', '3134A'), # CJK Ideograph Extension G | ||
+('31350', '323AF'), # CJK Ideograph Extension H | ||
] | ||
|
||
|
||
|
@@ -1105,11 +1106,15 @@ def __init__(self, version, cjk_check=True): | |
table[i].east_asian_width = widths[i] | ||
self.widths = widths | ||
|
||
-for char, (p,) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded(): | ||
+for char, (propname, *propinfo) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded(): | ||
+if propinfo: | ||
+# this is not a binary property, ignore it | ||
+continue | ||
Comment on lines +1109 to +1112
All the properties defined in As of Unicode 15.1, this file also includes definitions that use the With this change, the loop skips over any non-binary properties, since we have nothing to do with them.
It seems like it would be safer to explicitly ignore
Is there a particular failure mode you have in mind? My rationale here was that the current internalized DB only cares about binary properties in this file, but in practice any of the property types enumerated by UAX#44 could appear in a future revision. I'm not strongly opposed to ignoring the specific property that breaks the tool against the current revision, but my rationale was that it seems safer to prevent this class of failure in the future if/when additional non-binary properties are added. |
||
|
||
if table[char]: | ||
# Some properties (e.g. Default_Ignorable_Code_Point) | ||
# apply to unassigned code points; ignore them | ||
-table[char].binary_properties.add(p) | ||
+table[char].binary_properties.add(propname) | ||
|
||
for char_range, value in UcdFile(LINE_BREAK, version): | ||
if value not in MANDATORY_LINE_BREAKS: | ||
|
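The star-unpacking used in the hunk above can be illustrated in isolation. The following is a minimal sketch, not the tool's actual parser: `parse_records`, `binary_properties`, the sample records, and the property name `Example_Enum` are all hypothetical and exist only to show how records with extra fields get skipped while single-field (binary) records are kept.

```python
# Hypothetical sample records in a "code point(s) ; field ; field" shape.
# They are illustrative only and are not copied from the real data file.
SAMPLE = """\
0041..0043 ; Alphabetic # binary property
00AD       ; Default_Ignorable_Code_Point
0915       ; Example_Enum ; Some_Value # non-binary: carries an extra field
"""


def parse_records(text):
    """Yield (first, last, fields) for each non-empty, non-comment record."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        char_range, *fields = [part.strip() for part in line.split(";")]
        first, _, last = char_range.partition("..")
        yield int(first, 16), int(last or first, 16), fields


def binary_properties(text):
    """Map code point -> set of binary property names, skipping other kinds."""
    result = {}
    for first, last, (propname, *propinfo) in parse_records(text):
        if propinfo:
            # More than one field: not a binary property, so ignore it.
            continue
        for cp in range(first, last + 1):
            result.setdefault(cp, set()).add(propname)
    return result


props = binary_properties(SAMPLE)
```

With the older single-element pattern `(p,)`, the three-field record in this sample would raise a `ValueError` during unpacking; the starred target accepts any number of trailing fields, which is the class of failure the author describes wanting to avoid.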
The range check that occurs later in this file implicitly assumes this list is in sorted order. It seems simpler to have an idiosyncratic order here than to try to introduce `sorted()` or somesuch.
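The ordering assumption mentioned in this comment could be checked mechanically. Below is a small sketch under stated assumptions: the range list is copied from the diff above, while `check_sorted_disjoint` is a hypothetical helper that is not part of the PR. It only verifies that the ranges ascend by code point and do not overlap, which is why Extension I sits before Extensions G and H.

```python
# Ranges copied from the diff; the helper below is a hypothetical check.
cjk_ranges = [
    ('3400', '4DBF'),
    ('4E00', '9FFF'),
    ('20000', '2A6DF'),
    ('2A700', '2B739'),
    ('2B740', '2B81D'),
    ('2B820', '2CEA1'),
    ('2CEB0', '2EBE0'),
    ('2EBF0', '2EE5D'),
    ('30000', '3134A'),
    ('31350', '323AF'),
]


def check_sorted_disjoint(ranges):
    """Return ranges as (int, int) pairs, or raise ValueError if misordered."""
    bounds = [(int(lo, 16), int(hi, 16)) for lo, hi in ranges]
    for lo, hi in bounds:
        if lo > hi:
            raise ValueError(f"empty range {lo:X}..{hi:X}")
    # Each range must start strictly after the previous one ends.
    for (_, prev_hi), (lo, _) in zip(bounds, bounds[1:]):
        if lo <= prev_hi:
            raise ValueError(f"range starting at {lo:X} is out of order")
    return bounds


bounds = check_sorted_disjoint(cjk_ranges)
```

Listing the extensions alphabetically instead (moving the `2EBF0` range to the end) would make this check raise, since `2EBF0` is lower than the end of the preceding `31350..323AF` range.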