syntax: remove guarantees in the HIR related to 'u' flag

BurntSushi · BurntSushi · commit a7aa7ebffa0e · 2023-10-06T11:39:10.000-04:00
Basically, we never should have guaranteed that a particular HIR would (or wouldn't) be used if the 'u' flag was present (or absent). Such a guarantee generally results in too little flexibility, particularly when it comes to HIR's smart constructors. We could probably uphold that guarantee, but it's somewhat gnarly to do and would require rejiggering some of the HIR types. For example, we would probably need a literal that is an enum of `&str` or `&[u8]` that correctly preserves the Unicode flag. This in turn comes with a bigger complexity cost in various rewriting rules. In general, it's much simpler to require the caller to be prepared for any kind of HIR regardless of what the flags are. I feel somewhat justified in this position due to the fact that part of the point of the HIR is to erase all of the regex flags so that callers no longer need to worry about them. That is, the erasure is the point that provides a simplification for everyone downstream. Closes #1088
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,9 @@ TBD
 * [BUG #1046](https://github.com/rust-lang/regex/issues/1046):
 Fix a bug that could result in incorrect match spans when using a Unicode word
 boundary and searching non-ASCII strings.
+* [BUG(regex-syntax) #1088](https://github.com/rust-lang/regex/issues/1088):
+Remove guarantees in the API that connect the `u` flag with a specific HIR
+representation.
 
 
 1.9.6 (2023-09-30)
diff --git a/regex-syntax/src/hir/mod.rs b/regex-syntax/src/hir/mod.rs
@@ -797,13 +797,18 @@ impl core::fmt::Debug for Literal {
 /// The high-level intermediate representation of a character class.
 ///
 /// A character class corresponds to a set of characters. A character is either
-/// defined by a Unicode scalar value or a byte. Unicode characters are used
-/// by default, while bytes are used when Unicode mode (via the `u` flag) is
-/// disabled.
+/// defined by a Unicode scalar value or a byte.
 ///
 /// A character class, regardless of its character type, is represented by a
 /// sequence of non-overlapping non-adjacent ranges of characters.
 ///
+/// There are no guarantees about which class variant is used. Generally
+/// speaking, the Unicode variat is used whenever a class needs to contain
+/// non-ASCII Unicode scalar values. But the Unicode variant can be used even
+/// when Unicode mode is disabled. For example, at the time of writing, the
+/// regex `(?-u:a|\xc2\xa0)` will compile down to HIR for the Unicode class
+/// `[a\u00A0]` due to optimizations.
+///
 /// Note that `Bytes` variant may be produced even when it exclusively matches
 /// valid UTF-8. This is because a `Bytes` variant represents an intention by
 /// the author of the regular expression to disable Unicode mode, which in turn
@@ -1326,8 +1331,9 @@ impl ClassUnicodeRange {
     }
 }
 
-/// A set of characters represented by arbitrary bytes (where one byte
-/// corresponds to one character).
+/// A set of characters represented by arbitrary bytes.
+///
+/// Each byte corresponds to one character.
 #[derive(Clone, Debug, Eq, PartialEq)]
 pub struct ClassBytes {
     set: IntervalSet<ClassBytesRange>,