Description
These two methods should not be stabilized as-is. They should be changed to return a variable number of code points (between one and three), per Unicode’s `SpecialCasing.txt`.
Such results could be represented as `&'static str` or `&'static [char]` slices of a static table in libunicode. The former avoids re-encoding to UTF-8 when accumulating results in a `String`. To avoid having an entry in that table for every one of the 1,114,112 code points, the return type could be an `Option`, where `None` means that the code point is unchanged by the mapping. (This is by far the common case.) Or it could be a new special-purpose type like `enum CaseMappingResult { Unchanged, MappedTo(&'static str) }`.
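As a rough illustration of the enum variant of this idea, here is a minimal sketch. The table name `UPPERCASE_TABLE`, the free function `to_uppercase`, and the handful of entries are assumptions made for this example; a real table would be generated from the Unicode data files and live in libunicode.

```rust
/// Hypothetical result type, following the enum suggested above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaseMappingResult {
    Unchanged,
    MappedTo(&'static str),
}

/// Illustrative excerpt only (not the real libunicode table).
/// Entries are sorted by code point so that binary search works.
static UPPERCASE_TABLE: &[(char, &str)] = &[
    ('\u{0061}', "A"),                        // a
    ('\u{00DF}', "SS"),                       // sharp s: two code points
    ('\u{0149}', "\u{02BC}N"),                // n preceded by apostrophe
    ('\u{0390}', "\u{0399}\u{0308}\u{0301}"), // three code points
    ('\u{FB01}', "FI"),                       // fi ligature
];

/// Looks up the uppercase mapping of `c`. Code points without a table
/// entry are reported as `Unchanged`, so the table stays small.
pub fn to_uppercase(c: char) -> CaseMappingResult {
    match UPPERCASE_TABLE.binary_search_by_key(&c, |&(key, _)| key) {
        Ok(i) => CaseMappingResult::MappedTo(UPPERCASE_TABLE[i].1),
        Err(_) => CaseMappingResult::Unchanged,
    }
}
```

Because the mapped values are already UTF-8 string slices, a caller can append them to a `String` with `push_str` without re-encoding.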
Since the `char` methods become less convenient to use, there should be `str::to_{lower,upper}case() -> String` wrappers.
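Such a wrapper could look roughly like the sketch below. It assumes the `Option<&'static str>` form of the per-character mapping; `char_to_uppercase` and `str_to_uppercase` are hypothetical names, and the mapping covers only a few characters for illustration.

```rust
/// Hypothetical per-char mapping in the `Option` form: `None` means the
/// code point is unchanged. Only a few entries are listed for illustration.
fn char_to_uppercase(c: char) -> Option<&'static str> {
    match c {
        'a' => Some("A"),
        'e' => Some("E"),
        'r' => Some("R"),
        's' => Some("S"),
        't' => Some("T"),
        '\u{00DF}' => Some("SS"), // sharp s expands to two code points
        '\u{FB01}' => Some("FI"), // fi ligature expands to two code points
        _ => None,
    }
}

/// Sketch of a `str`-level wrapper that accumulates the results in a `String`.
pub fn str_to_uppercase(s: &str) -> String {
    // The result can be longer than the input, so this is only a starting size.
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match char_to_uppercase(c) {
            Some(mapped) => out.push_str(mapped), // already UTF-8, no re-encoding
            None => out.push(c),
        }
    }
    out
}
```

With this sketch, `str_to_uppercase("straße")` yields `"STRASSE"`, one character longer than the input, which is why the mapping cannot return a single `char`.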
`SpecialCasing.txt` also defines some language-sensitive mappings for Turkish and Lithuanian, but I suggest not including them, for a few reasons:
- Using the system’s locale is a very bad idea. Programs behaving differently on different systems is a source of countless bugs, and the system’s locale may not even be that of the end users (e.g. for server-side software).
- Forcing users to specify a language is counter-productive, since it might often end up being hard-coded to English or something. There should be a default.
- Users who do care about language-specific tailoring may want to do more anyway.
`SpecialCasing.txt` says: “Note that the preferred mechanism for defining tailored casing operations is the Unicode Common Locale Data Repository (CLDR).”
Finally, there are conditional mappings that depend on the context of surrounding code points, but not on the language. They could be special cases in the `str` methods, but I don’t know if it’s worth the bother, since there is currently only one such special case: Greek capital sigma at the end of a word, which lowercases to the final form ς rather than σ.
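To give an idea of what that special case would involve, here is a simplified sketch. It is an approximation, not the Unicode `Final_Sigma` condition itself: it uses `char::is_alphabetic` as a stand-in for "cased letter" and does not skip case-ignorable characters. The function name `lowercase_sigma` is hypothetical, and it maps only the capital sigma, leaving every other character untouched.

```rust
/// Simplified sketch: lowercases only U+03A3 (capital sigma), choosing the
/// final form when it ends a word. Assumption: "ends a word" is approximated
/// as "preceded by a letter and not followed by a letter".
pub fn lowercase_sigma(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    let mut out = String::with_capacity(s.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == '\u{03A3}' {
            let preceded = i > 0 && chars[i - 1].is_alphabetic();
            let followed = i + 1 < chars.len() && chars[i + 1].is_alphabetic();
            if preceded && !followed {
                out.push('\u{03C2}'); // final small sigma
            } else {
                out.push('\u{03C3}'); // ordinary small sigma
            }
        } else {
            out.push(c); // other characters are left as-is in this sketch
        }
    }
    out
}
```

The point of the sketch is that the result depends on neighbouring characters, so this mapping can only live in a `str`-level method, never in a per-`char` one.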
More background on Unicode case mappings:
http://unicode.org/faq/casemap_charprop.html
http://www.unicode.org/reports/tr44/tr44-14.html#Casemapping