regex parser permits '(?-u)\W' when UTF-8 mode is enabled #895

Closed
@BurntSushi

When you negate a character class while Unicode mode is disabled, the negation includes all bytes except those in the class. Put another way, the only way to write a character class over codepoints is to enable Unicode mode.

Usually, disabling Unicode means reducing the number of features you can use. For example, (?-u)\pL will fail with a parse error because \pL is fundamentally a Unicode construct with no "ASCII-only" interpretation. However, the "Perl" character classes (\w, \d and \s) all revert to their corresponding ASCII definitions when Unicode mode is disabled.

That's all fine. It's also correct that the negated "Perl" character classes (\W, \D and \S) revert to the negations of their ASCII definitions.

But when you use something like \W when Unicode mode is disabled, then it includes bytes that match invalid UTF-8 (like \xFF, since it isn't a word "character"). This should cause the regex parser to return an error, because the regex parser is supposed to guarantee that you can't build a regex that can match invalid UTF-8 when UTF-8 mode is enabled, regardless of whether Unicode mode is enabled.
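As a rough illustration of why a byte-oriented \W is a problem, here is a minimal sketch using only the standard library. The helpers `is_ascii_word_byte` and `is_negated_ascii_word_byte` are hypothetical names invented for this example; they model the ASCII definition of \w and its complement over all 256 byte values, and are not how the regex crate implements its classes.

```rust
/// Hypothetical helper: the ASCII-only definition of `\w`, i.e. [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical helper: a byte-oriented `\W`, modeled as the complement of
/// the ASCII `\w` over every possible byte value.
fn is_negated_ascii_word_byte(b: u8) -> bool {
    !is_ascii_word_byte(b)
}

fn main() {
    // 0xFF can never appear in valid UTF-8, yet the complement contains it.
    assert!(is_negated_ascii_word_byte(0xFF));
    assert!(std::str::from_utf8(&[0xFF]).is_err());

    // The snowman is three bytes in UTF-8. Its first byte is not an ASCII
    // word byte, so a byte-level `\W` can match that single byte alone.
    let snowman = "☃".as_bytes();
    assert_eq!(snowman.to_vec(), vec![0xE2u8, 0x98, 0x83]);
    assert!(is_negated_ascii_word_byte(snowman[0]));

    // That one-byte match is not valid UTF-8 on its own.
    assert!(std::str::from_utf8(&snowman[0..1]).is_err());
}
```

Under this model, the class matches individual bytes of a multi-byte codepoint, which is exactly what UTF-8 mode is supposed to rule out.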

Case in point, this code:

fn main() {
    let re = regex::Regex::new(r"(?-u)\W").unwrap();
    println!("{:?}", re.find("☃"));
}

outputs:

Some(Match { text: "☃", start: 0, end: 1 })

This is clearly wrong. Attempting to slice the haystack at the range 0..1 will result in a panic, because "☃" is three bytes long in UTF-8 and offset 1 falls in the middle of it. The top-level Regex API is not supposed to ever return match offsets that would cause a subslice operation to panic. That is, the match offsets must always fall on valid UTF-8 codepoint boundaries.
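The boundary problem can be checked without the regex crate at all. The sketch below uses only `str::is_char_boundary` and `str::get` from the standard library; `span_is_sliceable` is a hypothetical helper written for this example, not part of any regex API.

```rust
/// Hypothetical helper: reports whether `start..end` can be sliced out of
/// `haystack` without panicking, i.e. both offsets are in bounds and fall
/// on UTF-8 codepoint boundaries.
fn span_is_sliceable(haystack: &str, start: usize, end: usize) -> bool {
    start <= end
        && end <= haystack.len()
        && haystack.is_char_boundary(start)
        && haystack.is_char_boundary(end)
}

fn main() {
    let haystack = "☃";

    // The reported match span 0..1 ends inside the three-byte snowman.
    assert!(!span_is_sliceable(haystack, 0, 1));
    // `get` returns None where `&haystack[0..1]` would panic.
    assert_eq!(haystack.get(0..1), None);

    // The whole codepoint, 0..3, is a valid span.
    assert!(span_is_sliceable(haystack, 0, 3));
    assert_eq!(haystack.get(0..3), Some("☃"));
}
```

Using `get` rather than indexing keeps the example from panicking while still showing that the span 0..1 is not a valid subslice.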
