unicode: add CategoryAliases, LC, Cn #70780

Closed
@rsc

The Unicode specification defines aliases for some of the general category names. For example, the category "L" has the alias "Letter".

The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.
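As a rough illustration of the short form that regexp accepts, the sketch below matches strings made up only of category L code points. The helper name onlyLetters is hypothetical. The last lines only print whatever error the long form produces, since that behavior depends on the Go release in use.

```go
package main

import (
	"fmt"
	"regexp"
)

// letterRE uses the short general category name, which regexp understands.
var letterRE = regexp.MustCompile(`^\p{L}+$`)

// onlyLetters (hypothetical helper) reports whether s is non-empty and
// consists entirely of code points in general category L.
func onlyLetters(s string) bool {
	return letterRE.MatchString(s)
}

func main() {
	fmt.Println(onlyLetters("h\u00e9llo")) // true
	fmt.Println(onlyLetters("abc1"))       // false

	// The long alias is the case this issue is about. No result is
	// assumed here; the error (or nil) is simply printed.
	_, err := regexp.Compile(`^\p{Letter}+$`)
	fmt.Println(`\p{Letter} compile error:`, err)
}
```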

In order to support \p{Letter}, I propose to add a new, small table to unicode:

var CategoryAliases = map[string]string{
	"Other": "C",
	"Control": "Cc",
	...,
	"Letter": "L",
	...
}
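A minimal sketch of how a caller might use such a table, assuming it maps alias to short name as shown above. The local categoryAliases map is a hand-written stand-in with a few entries, and lookupCategory and inCategory are hypothetical helpers, not part of the unicode package.

```go
package main

import (
	"fmt"
	"unicode"
)

// categoryAliases is a hand-written stand-in for the proposed table;
// only a handful of alias -> short name entries are included.
var categoryAliases = map[string]string{
	"Other":   "C",
	"Control": "Cc",
	"Letter":  "L",
	"Number":  "N",
}

// lookupCategory (hypothetical) resolves a short category name or one of
// its aliases to the corresponding table in unicode.Categories.
func lookupCategory(name string) (*unicode.RangeTable, bool) {
	if short, ok := categoryAliases[name]; ok {
		name = short
	}
	t, ok := unicode.Categories[name]
	return t, ok
}

// inCategory (hypothetical) reports whether r belongs to the named category.
func inCategory(name string, r rune) bool {
	t, ok := lookupCategory(name)
	return ok && unicode.Is(t, r)
}

func main() {
	fmt.Println(inCategory("Letter", 'x'))         // true
	fmt.Println(inCategory("Letter", '7'))         // false
	fmt.Println(inCategory("Number", '7'))         // true
	fmt.Println(inCategory("NoSuchCategory", 'x')) // false
}
```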

The table would be auto-generated from the Unicode database, like all our other tables. For Unicode 15, it would have only 38 entries, listed below.

% grep '^gc' PropertyValueAliases.txt
gc ; C                                ; Other                            # Cc | Cf | Cn | Co | Cs
gc ; Cc                               ; Control                          ; cntrl
gc ; Cf                               ; Format
gc ; Cn                               ; Unassigned
gc ; Co                               ; Private_Use
gc ; Cs                               ; Surrogate
gc ; L                                ; Letter                           # Ll | Lm | Lo | Lt | Lu
gc ; LC                               ; Cased_Letter                     # Ll | Lt | Lu
gc ; Ll                               ; Lowercase_Letter
gc ; Lm                               ; Modifier_Letter
gc ; Lo                               ; Other_Letter
gc ; Lt                               ; Titlecase_Letter
gc ; Lu                               ; Uppercase_Letter
gc ; M                                ; Mark                             ; Combining_Mark                   # Mc | Me | Mn
gc ; Mc                               ; Spacing_Mark
gc ; Me                               ; Enclosing_Mark
gc ; Mn                               ; Nonspacing_Mark
gc ; N                                ; Number                           # Nd | Nl | No
gc ; Nd                               ; Decimal_Number                   ; digit
gc ; Nl                               ; Letter_Number
gc ; No                               ; Other_Number
gc ; P                                ; Punctuation                      ; punct                            # Pc | Pd | Pe | Pf | Pi | Po | Ps
gc ; Pc                               ; Connector_Punctuation
gc ; Pd                               ; Dash_Punctuation
gc ; Pe                               ; Close_Punctuation
gc ; Pf                               ; Final_Punctuation
gc ; Pi                               ; Initial_Punctuation
gc ; Po                               ; Other_Punctuation
gc ; Ps                               ; Open_Punctuation
gc ; S                                ; Symbol                           # Sc | Sk | Sm | So
gc ; Sc                               ; Currency_Symbol
gc ; Sk                               ; Modifier_Symbol
gc ; Sm                               ; Math_Symbol
gc ; So                               ; Other_Symbol
gc ; Z                                ; Separator                        # Zl | Zp | Zs
gc ; Zl                               ; Line_Separator
gc ; Zp                               ; Paragraph_Separator
gc ; Zs                               ; Space_Separator
%
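The listing above could be turned into an alias map along the following lines. This is only a sketch, not the actual generator: parseCategoryAliases and the sample constant are hypothetical. It also assumes that every alias field on a gc line is kept, including secondary ones such as cntrl, which the proposal does not spell out.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseCategoryAliases (hypothetical) reads PropertyValueAliases.txt-style
// text and returns a map from each general category alias to its short name.
// Assumption: all alias fields on a "gc" line are included.
func parseCategoryAliases(data string) map[string]string {
	aliases := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		// Drop the trailing comment, e.g. "# Ll | Lm | Lo | Lt | Lu".
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Split(line, ";")
		// Keep only general category ("gc") lines with at least one alias.
		if len(fields) < 3 || strings.TrimSpace(fields[0]) != "gc" {
			continue
		}
		short := strings.TrimSpace(fields[1])
		for _, f := range fields[2:] {
			if alias := strings.TrimSpace(f); alias != "" {
				aliases[alias] = short
			}
		}
	}
	return aliases
}

// sample holds a few lines in the same format as the listing, plus one
// illustrative non-gc line that the parser should skip.
const sample = `gc ; Cc ; Control ; cntrl
gc ; L  ; Letter  # Ll | Lm | Lo | Lt | Lu
gc ; M  ; Mark ; Combining_Mark # Mc | Me | Mn
sc ; Latn ; Latin
`

func main() {
	m := parseCategoryAliases(sample)
	fmt.Println(m["Letter"], m["cntrl"], m["Combining_Mark"], len(m)) // L Cc M 5
}
```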


Status: Accepted
