Unexpected unicode character class in HIR generated without unicode support #1088

Closed
@plusvic

Description

What version of regex are you using?

Using regex-syntax 0.7.5.

Describe the bug at a high level.

When generating the HIR for certain regular expressions with Unicode support turned off, the resulting HIR may contain Unicode character classes. I'm not sure whether this is a bug or the intended behaviour, but the documentation suggests that it is not expected. Specifically, the documentation for hir::Class says:

A character class corresponds to a set of characters. A character is either defined by a Unicode scalar value or a byte. Unicode characters are used by default, while bytes are used when Unicode mode (via the u flag) is disabled.

I assumed that an HIR produced without Unicode support would contain only character classes of the Class::Bytes variant. However, this is not the case.

What are the steps to reproduce the behavior?

Consider this example:

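use regex_syntax; // 0.7.5

fn main() {
    // Build a parser with both UTF-8 enforcement and Unicode mode disabled.
    let mut parser = regex_syntax::ParserBuilder::new()
        .utf8(false)
        .unicode(false)
        .build();

    // An alternation between the literal `a` and the two bytes 0xC2 0xA0.
    let hir = parser.parse(r"(a|\xc2\xa0)");

    // Print the Debug representation of the resulting Result<Hir, Error>.
    println!("{:?}", hir);
}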

It produces the following output:

Ok(Capture(Capture { index: 1, name: None, sub: Class({'a'..='a', '\u{a0}'..='\u{a0}'}) }))

Here sub is a class of the Class::Unicode variant.
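
One observation that may help when reading this output: the two bytes 0xC2 0xA0 written in the pattern are the UTF-8 encoding of the single code point U+00A0, which is the character shown in the class. Whether regex-syntax relies on this when building the HIR is an assumption here, not something the output itself confirms. The sketch below uses only the standard library to check the byte-level fact; decode_single_char is a hypothetical helper introduced for illustration and is not part of regex-syntax.

```rust
/// Hypothetical helper (not part of regex-syntax): returns the single `char`
/// encoded by `bytes` if they are valid UTF-8 for exactly one Unicode scalar
/// value, and `None` otherwise.
fn decode_single_char(bytes: &[u8]) -> Option<char> {
    // Invalid UTF-8 yields None.
    let s = std::str::from_utf8(bytes).ok()?;
    let mut chars = s.chars();
    // Empty input yields None.
    let first = chars.next()?;
    // Reject inputs that hold more than one scalar value.
    if chars.next().is_some() {
        return None;
    }
    Some(first)
}

fn main() {
    // The two escaped bytes from the pattern `\xc2\xa0`.
    println!("{:?}", decode_single_char(&[0xC2, 0xA0]));
    // The other branch of the alternation, the ASCII literal `a`.
    println!("{:?}", decode_single_char(b"a"));
    // A lone 0xC2 is an incomplete UTF-8 sequence.
    println!("{:?}", decode_single_char(&[0xC2]));
}
```

If both branches of the alternation can each be read as one character in this way, a class containing 'a' and '\u{a0}' would match the same byte sequences as the alternation of the two literals, even though its variant is Class::Unicode.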

What is the expected behavior?

I expected (a|\xc2\xa0) to be represented as an alternation of two literals, not as a Class::Unicode.
