doc: document Unicode support

BurntSushi · BurntSushi · commit 453198d5c321 · 2018-03-07T16:53:19.000-05:00
This commit provides exhaustive documentation for the regex crate's support for Level 1 ("Basic Unicode Support") as documented in UTS#18. We also document the small number of additions added to the concrete syntax as a result of the regex-syntax rewrite. See: http://unicode.org/reports/tr18/
diff --git a/UNICODE.md b/UNICODE.md
@@ -0,0 +1,250 @@
+# Unicode conformance
+
+This document describes the regex crate's conformance to Unicode's
+[UTS#18](http://unicode.org/reports/tr18/)
+report, which lays out 3 levels of support: Basic, Extended and Tailored.
+
+Full support for Level 1 ("Basic Unicode Support") is provided with two
+exceptions:
+
+1. Line boundaries are not Unicode aware. Namely, only the `\n`
+   (`END OF LINE`) character is recognized as a line boundary.
+2. The compatibility properties specified by
+   [RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
+   are ASCII-only definitions.
+
+Little to no support is provided for either Level 2 or Level 3. For the most
+part, this is because the features are either complex/hard to implement, or at
+the very least, very difficult to implement without sacrificing performance.
+For example, tackling canonical equivalence such that matching worked as one
+would expect regardless of normalization form would be a significant
+undertaking. This is at least partially a result of the fact that this regex
+engine is based on finite automata, which admits less flexibility normally
+associated with backtracking implementations.
+
+
+## RL1.1 Hex Notation
+
+[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)
+
+Hex Notation refers to the ability to specify a Unicode code point in a regular
+expression via its hexadecimal code point representation. This is useful in
+environments that have poor Unicode font rendering or if you need to express a
+code point that is not normally displayable. All forms of hexadecimal notation
+are supported
+
+    \x7F        hex character code (exactly two digits)
+    \x{10FFFF}  any hex character code corresponding to a Unicode code point
+    \u007F      hex character code (exactly four digits)
+    \u{7F}      any hex character code corresponding to a Unicode code point
+    \U0000007F  hex character code (exactly eight digits)
+    \U{7F}      any hex character code corresponding to a Unicode code point
+
+Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
+of expressing hexadecimal code points. Any number of digits can be written
+within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
+fixed-width variants of the same idea.
+
+Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
+banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
+mode is disabled. That is, the regex `\xFF` matches the Unicode codepoint
+U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
+the literal byte `\xFF`.
+
+
+## RL1.2 Properties
+
+[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)
+
+Full support for Unicode property syntax is provided. Unicode properties
+provide a convenient way to construct character classes of groups of code
+points specified by Unicode. The regex crate does not provide exhaustive
+support, but covers a useful subset. In particular:
+
+* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
+* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
+* [Age](http://unicode.org/reports/tr18/#Age)
+* A smattering of boolean properties, including all of those specified by
+  [RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.
+
+In all cases, property name and value abbreviations are supported, and all
+names/values are matched loosely without regard for case, whitespace or
+underscores. Property name aliases can be found in Unicode's
+[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
+file, while property value aliases can be found in Unicode's
+[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
+file.
+
+The syntax supported is also consistent with the UTS#18 recommendation:
+
+* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
+  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
+  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
+  `Script_Extensions` (or `scx` for short).
+* `\p{age:3.2}` selects all code points in Unicode 3.2.
+* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
+  via `\p{alpha}` (for example).
+* Single letter variants for properties with single letter abbreviations.
+  For example, `\p{Letter}` can be equivalently written as `\pL`.
+
+The following is a list of all properties supported by the regex crate (starred
+properties correspond to properties required by RL1.2):
+
+* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
+* `Script` \*
+* `Script_Extensions` \*
+* `Age`
+* `ASCII_Hex_Digit`
+* `Alphabetic` \*
+* `Bidi_Control`
+* `Case_Ignorable`
+* `Cased`
+* `Changes_When_Casefolded`
+* `Changes_When_Casemapped`
+* `Changes_When_Lowercased`
+* `Changes_When_Titlecased`
+* `Changes_When_Uppercased`
+* `Dash`
+* `Default_Ignorable_Code_Point` \*
+* `Deprecated`
+* `Diacritic`
+* `Extender`
+* `Grapheme_Base`
+* `Grapheme_Extend`
+* `Hex_Digit`
+* `IDS_Binary_Operator`
+* `IDS_Trinary_Operator`
+* `ID_Continue`
+* `ID_Start`
+* `Join_Control`
+* `Logical_Order_Exception`
+* `Lowercase` \*
+* `Math`
+* `Noncharacter_Code_Point` \*
+* `Pattern_Syntax`
+* `Pattern_White_Space`
+* `Prepended_Concatenation_Mark`
+* `Quotation_Mark`
+* `Radical`
+* `Regional_Indicator`
+* `Sentence_Terminal`
+* `Soft_Dotted`
+* `Terminal_Punctuation`
+* `Unified_Ideograph`
+* `Uppercase` \*
+* `Variation_Selector`
+* `White_Space` \*
+* `XID_Continue`
+* `XID_Start`
+
+
+## RL1.2a Compatibility Properties
+
+[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
+
+The regex crate only provides ASCII definitions of the
+[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
+(sans the `\X` class, for matching grapheme clusters, which isn't provided
+at all). This is because it seems to be consistent with most other regular
+expression engines, and in particular, because these are often referred to as
+"ASCII" or "POSIX" character classes.
+
+Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
+Their traditional ASCII definition can be used by disabling Unicode. That is,
+`[[:word:]]` and `(?-u)\w` are equivalent.
+
+
+## RL1.3 Subtraction and Intersection
+
+[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)
+
+The regex crate provides full support for nested character classes, along with
+union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
+operations on arbitrary character classes.
+
+For example, to match all non-ASCII letters, you could use either
+`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
+(intersecting the negation).
+
+
+## RL1.4 Simple Word Boundaries
+
+[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
+
+The regex crate provides basic Unicode aware word boundary assertions. A word
+boundary assertion can be written as `\b`, or `\B` as its negation. A word
+boundary negation corresponds to a zero-width match, where its adjacent
+characters correspond to word and non-word, or non-word and word characters.
+
+Conformance in this case chooses to define word character in the same way that
+the `\w` character class is defined: a code point that is a member of one of
+the following classes:
+
+* `\p{Alphabetic}`
+* `\p{Join_Control}`
+* `\p{gc:Mark}`
+* `\p{gc:Decimal_Number}`
+* `\p{gc:Connector_Punctuation}`
+
+In particular, this differs slightly from the
+[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
+but is permissible according to
+[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
+Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
+one another.
+
+Finally, Unicode word boundaries can be disabled, which will cause ASCII word
+boundaries to be used instead. That is, `\b` is a Unicode word boundary while
+`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
+if performance is important, since the implementation of Unicode word
+boundaries is currently sub-optimal on non-ASCII text.
+
+
+## RL1.5 Simple Loose Matches
+
+[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)
+
+The regex crate provides full support for case insensitive matching in
+accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
+"simple" mapping was chosen because of a key convenient property: every
+"simple" mapping is a mapping from exactly one code point to exactly one other
+code point. This makes case insensitive matching of character classes, for
+example, straight-forward to implement.
+
+When case insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to `a|A`),
+then all characters classes are case folded as well.
+
+
+## RL1.6 Line Boundaries
+
+[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)
+
+The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
+character as a line boundary. This choice was made mostly for implementation
+convenience, and to avoid performance cliffs that Unicode word boundaries are
+subject to.
+
+Ideally, it would be nice to at least support `\r\n` as a line boundary as
+well, and in theory, this could be done efficiently.
+
+
+## RL1.7 Code Points
+
+[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)
+
+The regex crate provides full support for Unicode code point matching. Namely,
+the fundamental atom of any match is always a single code point.
+
+Given Rust's strong ties to UTF-8, the following guarantees are also provided:
+
+* All matches are reported on valid UTF-8 code unit boundaries. That is, any
+  match range returned by the public regex API is guaranteed to successfully
+  slice the string that was searched.
+* By consequence of the above, it is impossible to match surrogode code points.
+  No support for UTF-16 is provided, so this is never necessary.
+
+Note that when Unicode mode is disabled, the fundamental atom of matching is
+no longer a code point but a single byte. When Unicode mode is disabled, many
+Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
+regex but `\pL(?-u)\xFF` (matches any Unicode `Letter` followed by the literal
+byte `\xFF`) is, for example.
diff --git a/src/lib.rs b/src/lib.rs
@@ -217,9 +217,8 @@ This implementation executes regular expressions **only** on valid UTF-8
 while exposing match locations as byte indices into the search string.
 
 Only simple case folding is supported. Namely, when matching
-case-insensitively, the characters are first mapped using the [simple case
-folding](ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt) mapping
-before matching.
+case-insensitively, the characters are first mapped using the "simple" case
+folding rules defined by Unicode.
 
 Regular expressions themselves are **only** interpreted as a sequence of
 Unicode scalar values. This means you can use Unicode characters directly
@@ -248,9 +247,9 @@ are some examples:
   recognize `\n` and not any of the other forms of line terminators defined
   by Unicode.
 
-Finally, Unicode general categories and scripts are available as character
-classes. For example, you can match a sequence of numerals, Greek or
-Cherokee letters:
+Unicode general categories, scripts, script extensions, ages and a smattering
+of boolean properties are available as character classes. For example, you can
+match a sequence of numerals, Greek or Cherokee letters:
 
 ```rust
 # extern crate regex; use regex::Regex;
@@ -261,6 +260,12 @@ assert_eq!((mat.start(), mat.end()), (3, 23));
 # }
 ```
 
+For a more detailed breakdown of Unicode support with respect to
+[UTS#18](http://unicode.org/reports/tr18/),
+please see the
+[UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
+document in the root of the regex repository.
+
 # Opt out of Unicode support
 
 The `bytes` sub-module provides a `Regex` type that can be used to match
@@ -307,6 +312,8 @@ a separate crate, [`regex-syntax`](../regex_syntax/index.html).
 [x[^xyz]]     Nested/grouping character class (matching any character except y and z)
 [a-y&&xyz]    Intersection (matching x or y)
 [0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
+[0-9--4]      Direct subtraction (matching 0-9 except 4)
+[a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
 [\[\]]        Escaping in character classes (matching [ or ])
 </pre>
 
@@ -431,16 +438,20 @@ assert_eq!(&cap[0], "abc");
 ## Escape sequences
 
 <pre class="rust">
-\*         literal *, works for any punctuation character: \.+*?()|[]{}^$
-\a         bell (\x07)
-\f         form feed (\x0C)
-\t         horizontal tab
-\n         new line
-\r         carriage return
-\v         vertical tab (\x0B)
-\123       octal character code (up to three digits)
-\x7F       hex character code (exactly two digits)
-\x{10FFFF} any hex character code corresponding to a Unicode code point
+\*          literal *, works for any punctuation character: \.+*?()|[]{}^$
+\a          bell (\x07)
+\f          form feed (\x0C)
+\t          horizontal tab
+\n          new line
+\r          carriage return
+\v          vertical tab (\x0B)
+\123        octal character code (up to three digits)
+\x7F        hex character code (exactly two digits)
+\x{10FFFF}  any hex character code corresponding to a Unicode code point
+\u007F      hex character code (exactly four digits)
+\u{7F}      any hex character code corresponding to a Unicode code point
+\U0000007F  hex character code (exactly eight digits)
+\U{7F}      any hex character code corresponding to a Unicode code point
 </pre>
 
 ## Perl character classes (Unicode friendly)