Char::to_{lower,upper}case should return Option<&'static str> instead of char #20333

Closed
@SimonSapin

These two methods should not be stabilized as-is. They should be changed to return a variable number of code points (between one and three), per Unicode’s SpecialCasing.txt.

Such results could be represented as &'static str or &'static [char] slices of a static table in libunicode. The former avoids re-encoding to UTF-8 when accumulating results in a String. To avoid having an entry in that table for every one of the 1,114,112 code points, the return type could be an Option, where None means that the code point is unchanged by the mapping. (This is by far the common case.) Or it could be a new special-purpose type like enum CaseMappingResult { Unchanged, MappedTo(&'static str) }.
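A rough sketch of the enum variant of this idea, to make the shape concrete. The table here is a tiny hand-written excerpt (the real one would be generated from UnicodeData.txt and SpecialCasing.txt), and the names UPPERCASE_TABLE and to_uppercase_mapping are made up for illustration, not an existing libunicode API:

```rust
/// Result type as named in the proposal; not an existing std type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaseMappingResult {
    Unchanged,
    MappedTo(&'static str),
}

/// Hypothetical excerpt of an uppercase table, sorted by code point so it
/// can be binary-searched. Only code points that actually change get an
/// entry.
static UPPERCASE_TABLE: &[(char, &str)] = &[
    ('\u{0061}', "A"),         // a
    ('\u{00DF}', "SS"),        // ß expands to two code points
    ('\u{00E9}', "\u{00C9}"),  // é maps to É
    ('\u{0149}', "\u{02BC}N"), // ŉ expands to ʼN
    ('\u{FB03}', "FFI"),       // ﬃ ligature expands to three code points
];

/// Hypothetical lookup: a missing table entry means "unchanged".
pub fn to_uppercase_mapping(c: char) -> CaseMappingResult {
    match UPPERCASE_TABLE.binary_search_by_key(&c, |&(key, _)| key) {
        Ok(index) => CaseMappingResult::MappedTo(UPPERCASE_TABLE[index].1),
        Err(_) => CaseMappingResult::Unchanged,
    }
}
```

Since the table only stores code points that change, everything else costs one failed binary search and no allocation.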

Since the Char methods become less convenient to use, there should be str::to_{lower,upper}case() -> String wrappers.
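Something like the following is what I have in mind for the wrapper, here using the Option<&'static str> form. char_to_uppercase is a hypothetical stand-in that only knows ASCII letters and two special cases; a real implementation would consult the full generated table:

```rust
const ASCII_UPPER: &str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// Hypothetical stand-in for the proposed char-level method.
/// Returns None when the code point is unchanged by the mapping.
fn char_to_uppercase(c: char) -> Option<&'static str> {
    match c {
        'a'..='z' => {
            // Slice one byte out of a static string, so nothing is allocated.
            let i = (c as u8 - b'a') as usize;
            Some(&ASCII_UPPER[i..i + 1])
        }
        '\u{00DF}' => Some("SS"), // ß
        '\u{FB01}' => Some("FI"), // ﬁ ligature
        _ => None,
    }
}

/// Sketch of the str-level wrapper: accumulate mapped slices in a String.
pub fn str_to_uppercase(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match char_to_uppercase(c) {
            // &'static str results are appended without re-encoding.
            Some(mapped) => out.push_str(mapped),
            None => out.push(c),
        }
    }
    out
}
```

With this, callers who just want an uppercased String never have to deal with the Option themselves.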

SpecialCasing.txt also defines some language-sensitive mappings for Turkish and Lithuanian, but I suggest not including them, for a few reasons:

  • Using the system’s locale is a very bad idea. Programs behaving differently on different systems is a source of countless bugs, and the system’s locale may not even be that of the end users (e.g. for server-side software).

  • Forcing users to specify a language is counter-productive since it might often end up being hard-coded to English or something. There should be a default.

  • Users who do care about language-specific tailoring may want to do more anyway. SpecialCasing.txt says:

    Note that the preferred mechanism for defining tailored casing operations is the Unicode Common Locale Data Repository (CLDR).

Finally, there are conditional mappings that depend on the context of surrounding code points, but not on the language. They could be special cases in the str methods, but I don’t know if it’s worth the bother since there is currently only one such special case: Greek capital sigma at the end of a word, which lowercases to final sigma (ς) rather than σ.
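For reference, handling the sigma case in a str method could look roughly like this. The condition is deliberately simplified (preceded by a letter, not followed by one); the actual Final_Sigma condition in SpecialCasing.txt is defined in terms of cased and case-ignorable code points. The function name is hypothetical, and the other characters go through char::to_lowercase as it exists in std today, which returns an iterator of chars:

```rust
/// Sketch of context-sensitive lowercasing for Greek capital sigma.
/// Assumption: "end of a word" is approximated with is_alphabetic().
pub fn lowercase_with_final_sigma(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    let mut out = String::with_capacity(s.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == '\u{03A3}' {
            // Capital sigma: look at the neighbouring code points.
            let after_letter = i > 0 && chars[i - 1].is_alphabetic();
            let before_letter = chars.get(i + 1).map_or(false, |n| n.is_alphabetic());
            if after_letter && !before_letter {
                out.push('\u{03C2}'); // final sigma ς
            } else {
                out.push('\u{03C3}'); // regular small sigma σ
            }
        } else {
            // Context-free mapping for everything else.
            out.extend(c.to_lowercase());
        }
    }
    out
}
```

This needs one code point of lookbehind and lookahead, which is why it belongs in the str method rather than the Char one.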

More background on Unicode case mappings:

http://unicode.org/faq/casemap_charprop.html
http://www.unicode.org/reports/tr44/tr44-14.html#Casemapping

CC @huonw, @aturon
