Description
When Unicode mode is disabled, negating a character class includes all *bytes* except those in the class. Namely, the only way to write a character class over codepoints is to enable Unicode mode.
Usually, disabling Unicode mode reduces the number of features you can use. For example, `(?-u)\pL` fails with a parse error because `\pL` is fundamentally a Unicode construct with no "ASCII-only" interpretation. However, the "Perl" character classes (`\w`, `\d` and `\s`) all revert to their corresponding ASCII definitions when Unicode mode is disabled.
That's all fine. It's also correct that the negated "Perl" character classes (`\W`, `\D` and `\S`) revert to their ASCII definitions. That's fine too.
But when you use something like `\W` with Unicode mode disabled, it includes bytes that can never appear in valid UTF-8 (like `\xFF`, since it isn't a word "character"). This should cause the regex parser to return an error, because the parser is supposed to guarantee that you can't build a regex that matches invalid UTF-8 while UTF-8 mode is enabled, regardless of whether Unicode mode is enabled.
Case in point, this code:
```rust
fn main() {
    let re = regex::Regex::new(r"(?-u)\W").unwrap();
    println!("{:?}", re.find("☃"));
}
```
outputs:
```
Some(Match { text: "☃", start: 0, end: 1 })
```
Which is clearly wrong. `☃` is encoded as three UTF-8 bytes, so attempting to slice it at the range `0..1` will panic. The top-level `Regex` API is never supposed to return match offsets that would make a subslice operation panic. i.e., The match offsets must always fall on valid UTF-8 code unit boundaries.