
Commit 453198d

doc: document Unicode support
This commit provides exhaustive documentation for the regex crate's support for Level 1 ("Basic Unicode Support") as documented in UTS#18. We also document the small number of additions added to the concrete syntax as a result of the regex-syntax rewrite. See: http://unicode.org/reports/tr18/
1 parent 3951b93 commit 453198d

2 files changed

+277
-16
lines changed

UNICODE.md

+250
@@ -0,0 +1,250 @@
1+
# Unicode conformance
2+
3+
This document describes the regex crate's conformance to Unicode's
4+
[UTS#18](http://unicode.org/reports/tr18/)
5+
report, which lays out 3 levels of support: Basic, Extended and Tailored.
6+
7+
Full support for Level 1 ("Basic Unicode Support") is provided with two
8+
exceptions:
9+
10+
1. Line boundaries are not Unicode aware. Namely, only the `\n`
11+
(`END OF LINE`) character is recognized as a line boundary.
12+
2. The compatibility properties specified by
13+
[RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
14+
are ASCII-only definitions.
15+
16+
Little to no support is provided for either Level 2 or Level 3. For the most
17+
part, this is because the features are either complex/hard to implement, or at
18+
the very least, very difficult to implement without sacrificing performance.
19+
For example, tackling canonical equivalence such that matching worked as one
20+
would expect regardless of normalization form would be a significant
21+
undertaking. This is at least partially a result of the fact that this regex
22+
engine is based on finite automata, which admit less of the flexibility normally
23+
associated with backtracking implementations.
24+
25+
26+
## RL1.1 Hex Notation
27+
28+
[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)
29+
30+
Hex Notation refers to the ability to specify a Unicode code point in a regular
31+
expression via its hexadecimal code point representation. This is useful in
32+
environments that have poor Unicode font rendering or if you need to express a
33+
code point that is not normally displayable. All forms of hexadecimal notation
34+
are supported:
35+
36+
\x7F hex character code (exactly two digits)
37+
\x{10FFFF} any hex character code corresponding to a Unicode code point
38+
\u007F hex character code (exactly four digits)
39+
\u{7F} any hex character code corresponding to a Unicode code point
40+
\U0000007F hex character code (exactly eight digits)
41+
\U{7F} any hex character code corresponding to a Unicode code point
42+
43+
Briefly, the `\x{...}`, `\u{...}` and `\U{...}` forms are all exactly equivalent ways
44+
of expressing hexadecimal code points. Any number of digits can be written
45+
within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
46+
fixed-width variants of the same idea.
47+
48+
Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
49+
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
50+
mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint
51+
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
52+
the literal byte `\xFF`.
53+
54+
55+
## RL1.2 Properties
56+
57+
[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)
58+
59+
Full support for Unicode property syntax is provided. Unicode properties
60+
provide a convenient way to construct character classes of groups of code
61+
points specified by Unicode. The regex crate does not support every Unicode
61+
property, but covers a useful subset. In particular:
63+
64+
* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
65+
* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
66+
* [Age](http://unicode.org/reports/tr18/#Age)
67+
* A smattering of boolean properties, including all of those specified by
68+
[RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.
69+
70+
In all cases, property name and value abbreviations are supported, and all
71+
names/values are matched loosely without regard for case, whitespace or
72+
underscores. Property name aliases can be found in Unicode's
73+
[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
74+
file, while property value aliases can be found in Unicode's
75+
[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
76+
file.
77+
78+
The syntax supported is also consistent with the UTS#18 recommendation:
79+
80+
* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
81+
`\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
82+
`\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
83+
`Script_Extensions` (or `scx` for short).
84+
* `\p{age:3.2}` selects all code points assigned as of Unicode 3.2.
85+
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
86+
via `\p{alpha}` (for example).
87+
* Single letter variants for properties with single letter abbreviations.
88+
For example, `\p{Letter}` can be equivalently written as `\pL`.
89+
90+
The following is a list of all properties supported by the regex crate (starred
91+
properties correspond to properties required by RL1.2):
92+
93+
* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
94+
* `Script` \*
95+
* `Script_Extensions` \*
96+
* `Age`
97+
* `ASCII_Hex_Digit`
98+
* `Alphabetic` \*
99+
* `Bidi_Control`
100+
* `Case_Ignorable`
101+
* `Cased`
102+
* `Changes_When_Casefolded`
103+
* `Changes_When_Casemapped`
104+
* `Changes_When_Lowercased`
105+
* `Changes_When_Titlecased`
106+
* `Changes_When_Uppercased`
107+
* `Dash`
108+
* `Default_Ignorable_Code_Point` \*
109+
* `Deprecated`
110+
* `Diacritic`
111+
* `Extender`
112+
* `Grapheme_Base`
113+
* `Grapheme_Extend`
114+
* `Hex_Digit`
115+
* `IDS_Binary_Operator`
116+
* `IDS_Trinary_Operator`
117+
* `ID_Continue`
118+
* `ID_Start`
119+
* `Join_Control`
120+
* `Logical_Order_Exception`
121+
* `Lowercase` \*
122+
* `Math`
123+
* `Noncharacter_Code_Point` \*
124+
* `Pattern_Syntax`
125+
* `Pattern_White_Space`
126+
* `Prepended_Concatenation_Mark`
127+
* `Quotation_Mark`
128+
* `Radical`
129+
* `Regional_Indicator`
130+
* `Sentence_Terminal`
131+
* `Soft_Dotted`
132+
* `Terminal_Punctuation`
133+
* `Unified_Ideograph`
134+
* `Uppercase` \*
135+
* `Variation_Selector`
136+
* `White_Space` \*
137+
* `XID_Continue`
138+
* `XID_Start`
139+
140+
141+
## RL1.2a Compatibility Properties
142+
143+
[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
144+
145+
The regex crate only provides ASCII definitions of the
146+
[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
147+
(sans the `\X` class, for matching grapheme clusters, which isn't provided
148+
at all). This is because it seems to be consistent with most other regular
149+
expression engines, and in particular, because these are often referred to as
150+
"ASCII" or "POSIX" character classes.
151+
152+
Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
153+
Their traditional ASCII definition can be used by disabling Unicode. That is,
154+
`[[:word:]]` and `(?-u)\w` are equivalent.
155+
156+
157+
## RL1.3 Subtraction and Intersection
158+
159+
[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)
160+
161+
The regex crate provides full support for nested character classes, along with
162+
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
163+
operations on arbitrary character classes.
164+
165+
For example, to match all non-ASCII letters, you could use either
166+
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
167+
(intersecting the negation).
168+
169+
170+
## RL1.4 Simple Word Boundaries
171+
172+
[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
173+
174+
The regex crate provides basic Unicode aware word boundary assertions. A word
175+
boundary assertion can be written as `\b`, or `\B` as its negation. A word
176+
boundary assertion corresponds to a zero-width match, where its adjacent
177+
characters correspond to word and non-word, or non-word and word characters.
178+
179+
Conformance in this case chooses to define word character in the same way that
180+
the `\w` character class is defined: a code point that is a member of one of
181+
the following classes:
182+
183+
* `\p{Alphabetic}`
184+
* `\p{Join_Control}`
185+
* `\p{gc:Mark}`
186+
* `\p{gc:Decimal_Number}`
187+
* `\p{gc:Connector_Punctuation}`
188+
189+
In particular, this differs slightly from the
190+
[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
191+
but is permissible according to
192+
[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
193+
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
194+
one another.
195+
196+
Finally, Unicode word boundaries can be disabled, which will cause ASCII word
197+
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
198+
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
199+
if performance is important, since the implementation of Unicode word
200+
boundaries is currently sub-optimal on non-ASCII text.
201+
202+
203+
## RL1.5 Simple Loose Matches
204+
205+
[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)
206+
207+
The regex crate provides full support for case insensitive matching in
208+
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
209+
"simple" mapping was chosen because of a key convenient property: every
210+
"simple" mapping is a mapping from exactly one code point to exactly one other
211+
code point. This makes case insensitive matching of character classes, for
212+
example, straightforward to implement.
213+
214+
When case insensitive mode is enabled, all character classes are case folded
215+
as well (e.g., `(?i)[a]` is equivalent to `a|A`).
216+
217+
218+
## RL1.6 Line Boundaries
219+
220+
[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)
221+
222+
The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
223+
character as a line boundary. This choice was made mostly for implementation
224+
convenience, and to avoid performance cliffs that Unicode word boundaries are
225+
subject to.
226+
227+
Ideally, it would be nice to at least support `\r\n` as a line boundary as
228+
well, and in theory, this could be done efficiently.
229+
230+
231+
## RL1.7 Code Points
232+
233+
[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)
234+
235+
The regex crate provides full support for Unicode code point matching. Namely,
236+
the fundamental atom of any match is always a single code point.
237+
238+
Given Rust's strong ties to UTF-8, the following guarantees are also provided:
239+
240+
* All matches are reported on valid UTF-8 code unit boundaries. That is, any
241+
match range returned by the public regex API is guaranteed to successfully
242+
slice the string that was searched.
243+
* By consequence of the above, it is impossible to match surrogate code points.
244+
No support for UTF-16 is provided, so this is never necessary.
245+
246+
Note that when Unicode mode is disabled, the fundamental atom of matching is
247+
no longer a code point but a single byte. When Unicode mode is disabled, many
248+
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
249+
regex but `\pL(?-u)\xFF` (matches any Unicode `Letter` followed by the literal
250+
byte `\xFF`) is.

src/lib.rs

+27-16
@@ -217,9 +217,8 @@ This implementation executes regular expressions **only** on valid UTF-8
217217
while exposing match locations as byte indices into the search string.
218218
219219
Only simple case folding is supported. Namely, when matching
220-
case-insensitively, the characters are first mapped using the [simple case
221-
folding](ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt) mapping
222-
before matching.
220+
case-insensitively, the characters are first mapped using the "simple" case
221+
folding rules defined by Unicode.
223222
224223
Regular expressions themselves are **only** interpreted as a sequence of
225224
Unicode scalar values. This means you can use Unicode characters directly
@@ -248,9 +247,9 @@ are some examples:
248247
recognize `\n` and not any of the other forms of line terminators defined
249248
by Unicode.
250249
251-
Finally, Unicode general categories and scripts are available as character
252-
classes. For example, you can match a sequence of numerals, Greek or
253-
Cherokee letters:
250+
Unicode general categories, scripts, script extensions, ages and a smattering
251+
of boolean properties are available as character classes. For example, you can
252+
match a sequence of numerals, Greek or Cherokee letters:
254253
255254
```rust
256255
# extern crate regex; use regex::Regex;
@@ -261,6 +260,12 @@ assert_eq!((mat.start(), mat.end()), (3, 23));
261260
# }
262261
```
263262
263+
For a more detailed breakdown of Unicode support with respect to
264+
[UTS#18](http://unicode.org/reports/tr18/),
265+
please see the
266+
[UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
267+
document in the root of the regex repository.
268+
264269
# Opt out of Unicode support
265270
266271
The `bytes` sub-module provides a `Regex` type that can be used to match
@@ -307,6 +312,8 @@ a separate crate, [`regex-syntax`](../regex_syntax/index.html).
307312
[x[^xyz]] Nested/grouping character class (matching any character except y and z)
308313
[a-y&&xyz] Intersection (matching x or y)
309314
[0-9&&[^4]] Subtraction using intersection and negation (matching 0-9 except 4)
315+
[0-9--4] Direct subtraction (matching 0-9 except 4)
316+
[a-g~~b-h] Symmetric difference (matching `a` and `h` only)
310317
[\[\]] Escaping in character classes (matching [ or ])
311318
</pre>
312319
@@ -431,16 +438,20 @@ assert_eq!(&cap[0], "abc");
431438
## Escape sequences
432439
433440
<pre class="rust">
434-
\* literal *, works for any punctuation character: \.+*?()|[]{}^$
435-
\a bell (\x07)
436-
\f form feed (\x0C)
437-
\t horizontal tab
438-
\n new line
439-
\r carriage return
440-
\v vertical tab (\x0B)
441-
\123 octal character code (up to three digits)
442-
\x7F hex character code (exactly two digits)
443-
\x{10FFFF} any hex character code corresponding to a Unicode code point
441+
\* literal *, works for any punctuation character: \.+*?()|[]{}^$
442+
\a bell (\x07)
443+
\f form feed (\x0C)
444+
\t horizontal tab
445+
\n new line
446+
\r carriage return
447+
\v vertical tab (\x0B)
448+
\123 octal character code (up to three digits)
449+
\x7F hex character code (exactly two digits)
450+
\x{10FFFF} any hex character code corresponding to a Unicode code point
451+
\u007F hex character code (exactly four digits)
452+
\u{7F} any hex character code corresponding to a Unicode code point
453+
\U0000007F hex character code (exactly eight digits)
454+
\U{7F} any hex character code corresponding to a Unicode code point
444455
</pre>
445456
446457
## Perl character classes (Unicode friendly)
