Commit a7aa7eb

syntax: remove guarantees in the HIR related to 'u' flag
Basically, we never should have guaranteed that a particular HIR would (or wouldn't) be used if the 'u' flag was present (or absent). Such a guarantee generally results in too little flexibility, particularly when it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do and would require rejiggering some of the HIR types. For example, we would probably need a literal that is an enum of `&str` or `&[u8]` that correctly preserves the Unicode flag. This in turn comes with a bigger complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for any kind of HIR regardless of what the flags are. I feel somewhat justified in this position due to the fact that part of the point of the HIR is to erase all of the regex flags so that callers no longer need to worry about them. That is, the erasure is the point that provides a simplification for everyone downstream.

Closes #1088
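As a rough illustration of what "prepared for any kind of HIR" can mean for a caller, here is a minimal sketch. The `Class` enum below is a hypothetical, simplified stand-in for the real `regex_syntax::hir::Class` type (which wraps `ClassUnicode` and `ClassBytes` rather than plain vectors), and `is_ascii` is an illustrative helper, not part of the crate's API. The point is only that the caller matches on both variants instead of inferring the variant from the `u` flag.

```rust
/// Hypothetical stand-in for `regex_syntax::hir::Class`.
/// Each range is an inclusive `(start, end)` pair with `start <= end`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Class {
    /// Ranges of Unicode scalar values.
    Unicode(Vec<(char, char)>),
    /// Ranges of arbitrary bytes.
    Bytes(Vec<(u8, u8)>),
}

/// Returns true if the class can only match ASCII characters.
///
/// Both variants are handled explicitly: the caller makes no assumption
/// that `Unicode` implies the `u` flag was set, or that `Bytes` implies
/// it was cleared.
pub fn is_ascii(class: &Class) -> bool {
    match class {
        // Ranges are inclusive and ordered, so checking the end suffices.
        Class::Unicode(ranges) => ranges.iter().all(|&(_, end)| end <= '\x7F'),
        Class::Bytes(ranges) => ranges.iter().all(|&(_, end)| end <= 0x7F),
    }
}
```

A caller written this way keeps working if a future release picks a different variant for the same pattern.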
1 parent cafd46f commit a7aa7eb

2 files changed: +14 -5 lines changed

CHANGELOG.md

+3
@@ -4,6 +4,9 @@ TBD
 * [BUG #1046](https://github.com/rust-lang/regex/issues/1046):
 Fix a bug that could result in incorrect match spans when using a Unicode word
 boundary and searching non-ASCII strings.
+* [BUG(regex-syntax) #1088](https://github.com/rust-lang/regex/issues/1088):
+Remove guarantees in the API that connect the `u` flag with a specific HIR
+representation.


 1.9.6 (2023-09-30)

regex-syntax/src/hir/mod.rs

+11 -5
@@ -797,13 +797,18 @@ impl core::fmt::Debug for Literal {
 /// The high-level intermediate representation of a character class.
 ///
 /// A character class corresponds to a set of characters. A character is either
-/// defined by a Unicode scalar value or a byte. Unicode characters are used
-/// by default, while bytes are used when Unicode mode (via the `u` flag) is
-/// disabled.
+/// defined by a Unicode scalar value or a byte.
 ///
 /// A character class, regardless of its character type, is represented by a
 /// sequence of non-overlapping non-adjacent ranges of characters.
 ///
+/// There are no guarantees about which class variant is used. Generally
+/// speaking, the Unicode variant is used whenever a class needs to contain
+/// non-ASCII Unicode scalar values. But the Unicode variant can be used even
+/// when Unicode mode is disabled. For example, at the time of writing, the
+/// regex `(?-u:a|\xc2\xa0)` will compile down to HIR for the Unicode class
+/// `[a\u00A0]` due to optimizations.
+///
 /// Note that `Bytes` variant may be produced even when it exclusively matches
 /// valid UTF-8. This is because a `Bytes` variant represents an intention by
 /// the author of the regular expression to disable Unicode mode, which in turn
@@ -1326,8 +1331,9 @@ impl ClassUnicodeRange {
     }
 }

-/// A set of characters represented by arbitrary bytes (where one byte
-/// corresponds to one character).
+/// A set of characters represented by arbitrary bytes.
+///
+/// Each byte corresponds to one character.
 #[derive(Clone, Debug, Eq, PartialEq)]
 pub struct ClassBytes {
     set: IntervalSet<ClassBytesRange>,
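
The `(?-u:a|\xc2\xa0)` example in the new doc comment relies on the fact that the bytes `0xC2 0xA0` are the UTF-8 encoding of U+00A0, so each branch of the alternation spells exactly one Unicode scalar value. The sketch below checks that property using only the standard library. `single_scalar` is a hypothetical helper written for illustration; it is not how regex-syntax implements its optimization, only a way to see why the class `[a\u00A0]` matches the same haystacks.

```rust
/// Decodes `bytes` as exactly one Unicode scalar value.
///
/// Returns `None` if the bytes are not valid UTF-8 or if they encode
/// zero or more than one scalar value.
pub fn single_scalar(bytes: &[u8]) -> Option<char> {
    let s = std::str::from_utf8(bytes).ok()?;
    let mut chars = s.chars();
    let first = chars.next()?;
    // Reject inputs that contain a second scalar value.
    if chars.next().is_some() {
        None
    } else {
        Some(first)
    }
}
```

Under this check, `a` and `\xc2\xa0` both qualify, while a lone `\xc2` does not. That is consistent with the doc comment's note that a `Bytes` class can still be produced when Unicode mode is disabled.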
