|
| 1 | +# Unicode conformance |
| 2 | + |
| 3 | +This document describes the regex crate's conformance to Unicode's |
| 4 | +[UTS#18](http://unicode.org/reports/tr18/) |
| 5 | +report, which lays out 3 levels of support: Basic, Extended and Tailored. |
| 6 | + |
| 7 | +Full support for Level 1 ("Basic Unicode Support") is provided with two |
| 8 | +exceptions: |
| 9 | + |
| 10 | +1. Line boundaries are not Unicode aware. Namely, only the `\n` |
| 11 | + (`END OF LINE`) character is recognized as a line boundary. |
| 12 | +2. The compatibility properties specified by |
| 13 | + [RL1.2a](http://unicode.org/reports/tr18/#RL1.2a) |
| 14 | + are ASCII-only definitions. |
| 15 | + |
| 16 | +Little to no support is provided for either Level 2 or Level 3. For the most |
| 17 | +part, this is because the features are either hard to implement outright or, |
| 18 | +at the very least, hard to implement without sacrificing performance. |
| 19 | +For example, tackling canonical equivalence such that matching worked as one |
| 20 | +would expect regardless of normalization form would be a significant |
| 21 | +undertaking. This is at least partially a result of the fact that this regex |
| 22 | +engine is based on finite automata, which admit less of the flexibility |
| 23 | +normally associated with backtracking implementations. |
| 24 | + |
| 25 | + |
| 26 | +## RL1.1 Hex Notation |
| 27 | + |
| 28 | +[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation) |
| 29 | + |
| 30 | +Hex Notation refers to the ability to specify a Unicode code point in a regular |
| 31 | +expression via its hexadecimal code point representation. This is useful in |
| 32 | +environments that have poor Unicode font rendering or if you need to express a |
| 33 | +code point that is not normally displayable. All forms of hexadecimal notation |
| 34 | +are supported: |
| 35 | + |
| 36 | + \x7F hex character code (exactly two digits) |
| 37 | + \x{10FFFF} any hex character code corresponding to a Unicode code point |
| 38 | + \u007F hex character code (exactly four digits) |
| 39 | + \u{7F} any hex character code corresponding to a Unicode code point |
| 40 | + \U0000007F hex character code (exactly eight digits) |
| 41 | + \U{7F} any hex character code corresponding to a Unicode code point |
| 42 | + |
| 43 | +Briefly, `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways |
| 44 | +of expressing hexadecimal code points. Any number of digits can be written |
| 45 | +within the braces. In contrast, `\xNN`, `\uNNNN` and `\UNNNNNNNN` are all |
| 46 | +fixed-width variants of the same idea. |
| 47 | + |
| 48 | +Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is |
| 49 | +banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode |
| 50 | +mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint |
| 51 | +U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches |
| 52 | +the literal byte `\xFF`. |
| 53 | + |
| 54 | + |
| 55 | +## RL1.2 Properties |
| 56 | + |
| 57 | +[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories) |
| 58 | + |
| 59 | +Full support for Unicode property syntax is provided. Unicode properties |
| 60 | +provide a convenient way to construct character classes of groups of code |
| 61 | +points specified by Unicode. The regex crate does not support every Unicode |
| 62 | +property, but covers a useful subset. In particular: |
| 63 | + |
| 64 | +* [General categories](http://unicode.org/reports/tr18/#General_Category_Property) |
| 65 | +* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property) |
| 66 | +* [Age](http://unicode.org/reports/tr18/#Age) |
| 67 | +* A smattering of boolean properties, including all of those specified by |
| 68 | + [RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly. |
| 69 | + |
| 70 | +In all cases, property name and value abbreviations are supported, and all |
| 71 | +names/values are matched loosely without regard for case, whitespace or |
| 72 | +underscores. Property name aliases can be found in Unicode's |
| 73 | +[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt) |
| 74 | +file, while property value aliases can be found in Unicode's |
| 75 | +[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt) |
| 76 | +file. |
| 77 | + |
| 78 | +The syntax supported is also consistent with the UTS#18 recommendation: |
| 79 | + |
| 80 | +* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow: |
| 81 | + `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`, |
| 82 | + `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and |
| 83 | + `Script_Extensions` (or `scx` for short). |
| 84 | +* `\p{age:3.2}` selects all code points assigned as of Unicode 3.2. |
| 85 | +* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated |
| 86 | + via `\p{alpha}` (for example). |
| 87 | +* Single letter variants for properties with single letter abbreviations. |
| 88 | + For example, `\p{Letter}` can be equivalently written as `\pL`. |
| 89 | + |
| 90 | +The following is a list of all properties supported by the regex crate (starred |
| 91 | +properties correspond to properties required by RL1.2): |
| 92 | + |
| 93 | +* `General_Category` \* (including `Any`, `ASCII` and `Assigned`) |
| 94 | +* `Script` \* |
| 95 | +* `Script_Extensions` \* |
| 96 | +* `Age` |
| 97 | +* `ASCII_Hex_Digit` |
| 98 | +* `Alphabetic` \* |
| 99 | +* `Bidi_Control` |
| 100 | +* `Case_Ignorable` |
| 101 | +* `Cased` |
| 102 | +* `Changes_When_Casefolded` |
| 103 | +* `Changes_When_Casemapped` |
| 104 | +* `Changes_When_Lowercased` |
| 105 | +* `Changes_When_Titlecased` |
| 106 | +* `Changes_When_Uppercased` |
| 107 | +* `Dash` |
| 108 | +* `Default_Ignorable_Code_Point` \* |
| 109 | +* `Deprecated` |
| 110 | +* `Diacritic` |
| 111 | +* `Extender` |
| 112 | +* `Grapheme_Base` |
| 113 | +* `Grapheme_Extend` |
| 114 | +* `Hex_Digit` |
| 115 | +* `IDS_Binary_Operator` |
| 116 | +* `IDS_Trinary_Operator` |
| 117 | +* `ID_Continue` |
| 118 | +* `ID_Start` |
| 119 | +* `Join_Control` |
| 120 | +* `Logical_Order_Exception` |
| 121 | +* `Lowercase` \* |
| 122 | +* `Math` |
| 123 | +* `Noncharacter_Code_Point` \* |
| 124 | +* `Pattern_Syntax` |
| 125 | +* `Pattern_White_Space` |
| 126 | +* `Prepended_Concatenation_Mark` |
| 127 | +* `Quotation_Mark` |
| 128 | +* `Radical` |
| 129 | +* `Regional_Indicator` |
| 130 | +* `Sentence_Terminal` |
| 131 | +* `Soft_Dotted` |
| 132 | +* `Terminal_Punctuation` |
| 133 | +* `Unified_Ideograph` |
| 134 | +* `Uppercase` \* |
| 135 | +* `Variation_Selector` |
| 136 | +* `White_Space` \* |
| 137 | +* `XID_Continue` |
| 138 | +* `XID_Start` |
| 139 | + |
| 140 | + |
| 141 | +## RL1.2a Compatibility Properties |
| 142 | + |
| 143 | +[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a) |
| 144 | + |
| 145 | +The regex crate only provides ASCII definitions of the |
| 146 | +[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties) |
| 147 | +(sans the `\X` class, for matching grapheme clusters, which isn't provided |
| 148 | +at all). This is because it seems to be consistent with most other regular |
| 149 | +expression engines, and in particular, because these are often referred to as |
| 150 | +"ASCII" or "POSIX" character classes. |
| 151 | + |
| 152 | +Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware. |
| 153 | +Their traditional ASCII definition can be used by disabling Unicode. That is, |
| 154 | +`[[:word:]]` and `(?-u)\w` are equivalent. |
| 155 | + |
| 156 | + |
| 157 | +## RL1.3 Subtraction and Intersection |
| 158 | + |
| 159 | +[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection) |
| 160 | + |
| 161 | +The regex crate provides full support for nested character classes, along with |
| 162 | +union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`) |
| 163 | +operations on arbitrary character classes. |
| 164 | + |
| 165 | +For example, to match all non-ASCII letters, you could use either |
| 166 | +`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]` |
| 167 | +(intersecting the negation). |
| 168 | + |
| 169 | + |
| 170 | +## RL1.4 Simple Word Boundaries |
| 171 | + |
| 172 | +[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries) |
| 173 | + |
| 174 | +The regex crate provides basic Unicode aware word boundary assertions. A word |
| 175 | +boundary assertion can be written as `\b`, or `\B` as its negation. A word |
| 176 | +boundary assertion corresponds to a zero-width match, where its adjacent |
| 177 | +characters correspond to word and non-word, or non-word and word characters. |
| 178 | + |
| 179 | +For conformance purposes, a word character is defined in the same way that |
| 180 | +the `\w` character class is defined: a code point that is a member of one of |
| 181 | +the following classes: |
| 182 | + |
| 183 | +* `\p{Alphabetic}` |
| 184 | +* `\p{Join_Control}` |
| 185 | +* `\p{gc:Mark}` |
| 186 | +* `\p{gc:Decimal_Number}` |
| 187 | +* `\p{gc:Connector_Punctuation}` |
| 188 | + |
| 189 | +In particular, this differs slightly from the |
| 190 | +[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries) |
| 191 | +but is permissible according to |
| 192 | +[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties). |
| 193 | +Namely, it is convenient and simpler to have `\w` and `\b` be in sync with |
| 194 | +one another. |
| 195 | + |
| 196 | +Finally, Unicode word boundaries can be disabled, which will cause ASCII word |
| 197 | +boundaries to be used instead. That is, `\b` is a Unicode word boundary while |
| 198 | +`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial |
| 199 | +if performance is important, since the implementation of Unicode word |
| 200 | +boundaries is currently sub-optimal on non-ASCII text. |
| 201 | + |
| 202 | + |
| 203 | +## RL1.5 Simple Loose Matches |
| 204 | + |
| 205 | +[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches) |
| 206 | + |
| 207 | +The regex crate provides full support for case insensitive matching in |
| 208 | +accordance with RL1.5. That is, it uses the "simple" case folding mapping. The |
| 209 | +"simple" mapping was chosen because of a key convenient property: every |
| 210 | +"simple" mapping is a mapping from exactly one code point to exactly one other |
| 211 | +code point. This makes case insensitive matching of character classes, for |
| 212 | +example, straightforward to implement. |
| 213 | + |
| 214 | +When case insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`), |
| 215 | +then all character classes are case folded as well. |
| 216 | + |
| 217 | + |
| 218 | +## RL1.6 Line Boundaries |
| 219 | + |
| 220 | +[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries) |
| 221 | + |
| 222 | +The regex crate only provides support for recognizing the `\n` (`END OF LINE`) |
| 223 | +character as a line boundary. This choice was made mostly for implementation |
| 224 | +convenience, and to avoid performance cliffs that Unicode word boundaries are |
| 225 | +subject to. |
| 226 | + |
| 227 | +Ideally, it would be nice to at least support `\r\n` as a line boundary as |
| 228 | +well, and in theory, this could be done efficiently. |
| 229 | + |
| 230 | + |
| 231 | +## RL1.7 Code Points |
| 232 | + |
| 233 | +[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters) |
| 234 | + |
| 235 | +The regex crate provides full support for Unicode code point matching. Namely, |
| 236 | +the fundamental atom of any match is always a single code point. |
| 237 | + |
| 238 | +Given Rust's strong ties to UTF-8, the following guarantees are also provided: |
| 239 | + |
| 240 | +* All matches are reported on valid UTF-8 code unit boundaries. That is, any |
| 241 | + match range returned by the public regex API is guaranteed to successfully |
| 242 | + slice the string that was searched. |
| 243 | +* As a consequence, it is impossible to match surrogate code points. |
| 244 | + No support for UTF-16 is provided, so this is never necessary. |
| 245 | + |
| 246 | +Note that when Unicode mode is disabled, the fundamental atom of matching is |
| 247 | +no longer a code point but a single byte. When Unicode mode is disabled, many |
| 248 | +Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid |
| 249 | +regex, but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the |
| 250 | +literal byte `\xFF`) is. |