Description
When Unicode mode is disabled, negating a character class includes all *bytes* except those in the class. Namely, the only way to write a character class over codepoints is to enable Unicode mode.
Usually, disabling Unicode mode reduces the number of features you can use. For example, `(?-u)\pL` fails with a parse error because `\pL` is fundamentally a Unicode construct with no "ASCII-only" interpretation. However, the "Perl" character classes (`\w`, `\d` and `\s`) all revert to their corresponding ASCII definitions when Unicode mode is disabled.
That's all fine. It's also correct that the negated "Perl" character classes (`\W`, `\D` and `\S`) revert to their ASCII definitions. That's fine too.
But when you use something like `\W` with Unicode mode disabled, it includes bytes that can never appear in valid UTF-8 (like `\xFF`, since it isn't a word "character"). This should cause the regex parser to return an error, because the parser is supposed to guarantee that you can't build a regex that matches invalid UTF-8 while UTF-8 mode is enabled, regardless of whether Unicode mode is enabled.
Case in point, this code:
```rust
fn main() {
    let re = regex::Regex::new(r"(?-u)\W").unwrap();
    println!("{:?}", re.find("☃"));
}
```
outputs:
```
Some(Match { text: "☃", start: 0, end: 1 })
```
Which is clearly wrong. `☃` is encoded as three UTF-8 bytes, so attempting to slice it at the range `0..1` will panic. The top-level `Regex` API is never supposed to return match offsets that would make a subslice operation panic. i.e., The match offsets must always fall on valid UTF-8 code unit boundaries.