Description
What version of regex are you using?
Using regex-syntax 0.7.5.
Describe the bug at a high level.
When generating the HIR for certain regular expressions with Unicode support turned off, the resulting HIR may contain Unicode character classes. I'm not sure whether this is a bug or the intended behaviour, but the documentation seems to suggest that it is not expected. Specifically, the documentation for hir::Class says:
A character class corresponds to a set of characters. A character is either defined by a Unicode scalar value or a byte. Unicode characters are used by default, while bytes are used when Unicode mode (via the u flag) is disabled.
I assumed that the HIR produced without Unicode support would contain only character classes of the Class::Bytes variant. However, this is not the case.
What are the steps to reproduce the behavior?
Consider this example:
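use regex_syntax; // 0.7.5

fn main() {
    // Disable both the UTF-8 invariant and Unicode mode.
    let mut parser = regex_syntax::ParserBuilder::new()
        .utf8(false)
        .unicode(false)
        .build();
    // \xc2\xa0 are two escaped bytes in the pattern.
    let hir = parser.parse(r"(a|\xc2\xa0)");
    println!("{:?}", hir);
}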
It produces the following output:
Ok(Capture(Capture { index: 1, name: None, sub: Class({'a'..='a', '\u{a0}'..='\u{a0}'}) }))
Here, sub is a class of the Class::Unicode variant.
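One detail that may explain the '\u{a0}' in the output above: the two escaped bytes \xc2\xa0 happen to be exactly the UTF-8 encoding of U+00A0. The std-only sketch below just checks that relationship. It says nothing about what regex-syntax does internally, and the helper names utf8_bytes and single_scalar are hypothetical, made up for this illustration.

```rust
/// Returns the UTF-8 encoding of a single Unicode scalar value.
/// (Hypothetical helper, not part of regex-syntax.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Decodes `bytes` as UTF-8 and returns the scalar value they encode,
/// but only if they encode exactly one scalar value.
/// (Hypothetical helper, not part of regex-syntax.)
fn single_scalar(bytes: &[u8]) -> Option<char> {
    let s = std::str::from_utf8(bytes).ok()?;
    let mut chars = s.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

fn main() {
    // The two escaped bytes from the second branch of (a|\xc2\xa0).
    let bytes = [0xC2u8, 0xA0];

    // Taken together, they are valid UTF-8 for a single scalar value.
    println!("{:?}", single_scalar(&bytes));

    // Going the other way, U+00A0 encodes to the same two bytes.
    println!("{:x?}", utf8_bytes('\u{a0}'));
}
```

So the byte sequence in the pattern and the code point shown in the class describe the same two bytes. My question is only about which HIR representation should be used when Unicode mode is disabled.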
What is the expected behavior?
I was expecting (a|\xc2\xa0) to be represented as an alternation of two literals, not as a Class::Unicode