UnicodeSetsMode support (`v` flag mode, `\q`)

ECMAScript RegExp supports an interesting feature: v-mode enables `\p` to match properties of strings (e.g. `\p{RGI_Emoji_Flag_Sequence}`, roughly `\p{Regional_Indicator}{2}`), allows `[]` character classes to match finite-length strings, and adds the <code>\q{<i>regexp</i>}</code> escape for writing "string literals" in character classes (a nested regular expression where the only valid syntax is literals, escaped literals, and `|` disjunction).

This support is quite interesting because it allows doing set operations on (finite) sets of (finite) strings (e.g. grapheme clusters / emoji), which can enable matching patterns that might otherwise be achieved with lookahead, e.g. `^[\p{RGI_Emoji_Flag_Sequence}--\q{🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷}]$` versus `^(?!🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷)\p{Regional_Indicator}{2}$` (example [from MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class#matching_strings)). This functionality is required for implementing things such as the [UAX#31 Emoji Profile for identifiers](https://unicode.org/reports/tr31/#Emoji_Profile), which "requires the use of character sequences, rather than individual code points, in the sets _Start_ and _Continue_ defined by UAX31-D1."

I don't think the regex crate particularly needs to be carrying around additional Unicode tables[^icu4x] for extended `\p`, but `\q` seems reasonably straightforward to support without impacting users who don't use it (modulo some code size / compile time), and it makes good chunks of the problem space at least *possible* to address, if still annoying[^gen]. Because `\q` is so restricted, it would even be sufficient to not permit alternation within the `\q` and require writing multiple `\q` instead, e.g. making the prior regex `^[\p{RGI_Emoji_Flag_Sequence}--[\q{🇺🇸}\q{🇨🇳}\q{🇷🇺}\q{🇬🇧}\q{🇫🇷}]]$`, at no loss of expressiveness, just convenience.

[^icu4x]: At least not until such a day as [icu4x](https://github.com/unicode-org/icu4x) allows it to be plausible for other crates to share the same underlying tables and/or for the binary package to have some control over what tables are built into the executable. Along that line, `\p` *would* be quite interesting — probably something like a `fn(&self, &str) -> Option<impl ExactSizeIterator<Item=&str>>` callback, to handle v-mode string sets, with a fast path for [CodePointInversionList](https://docs.rs/icu/latest/icu/collections/codepointinvlist/struct.CodePointInversionList.html)s — but that's drifting far off topic.
[^gen]: Listing out each string in the relevant sets in `|`-delimited lists, oh my.

Is it possible that the regex crate could someday support `\q` in character classes? I do understand it could be a significant chunk of extra work to allow classes to match longer strings. On the other hand, since sets already need to handle UTF-8's variable 1..=4 byte code point encoding, generalizing that to 1.. bytes might not be completely different. And while that kind of processing *could* be done, it doesn't need to be for what is a quite niche feature: after resolving set operations and longest-first ordering, a string class is no different from regular top-level alternation.
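To make the "no different from alternation" point concrete, here is a minimal sketch using only the standard library. `ClassStrings` and its methods are hypothetical names for illustration, not an existing regex-syntax API, and the sketch uses a `BTreeSet` purely so the output order is deterministic. It resolves set operations on finite string sets and then lists the result longest first, which is the order in which the strings would become alternation branches.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a string class: a finite set of finite strings.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct ClassStrings {
    strings: BTreeSet<String>,
}

impl ClassStrings {
    /// Builds a class from any collection of string-like items.
    pub fn new<I, S>(items: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        ClassStrings { strings: items.into_iter().map(Into::into).collect() }
    }

    pub fn union(&self, other: &Self) -> Self {
        ClassStrings { strings: self.strings.union(&other.strings).cloned().collect() }
    }

    pub fn intersect(&self, other: &Self) -> Self {
        ClassStrings { strings: self.strings.intersection(&other.strings).cloned().collect() }
    }

    pub fn difference(&self, other: &Self) -> Self {
        ClassStrings { strings: self.strings.difference(&other.strings).cloned().collect() }
    }

    pub fn symmetric_difference(&self, other: &Self) -> Self {
        ClassStrings {
            strings: self.strings.symmetric_difference(&other.strings).cloned().collect(),
        }
    }

    /// Lists the strings as alternation branches, longest (in bytes) first,
    /// so a leftmost-first alternation never stops at a shorter prefix.
    pub fn to_alternation(&self) -> Vec<&str> {
        let mut branches: Vec<&str> = self.strings.iter().map(String::as_str).collect();
        // Stable sort: strings of equal length keep the set's lexicographic order.
        branches.sort_by(|a, b| b.len().cmp(&a.len()));
        branches
    }
}
```

For example, `ClassStrings::new(["a", "abc", "ab"]).to_alternation()` yields the branches `abc`, `ab`, `a`. Each branch would still need escaping (for instance with `regex::escape`) before being joined with `|` into a pattern.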

### Concrete changes

- Add a new [`ast::ClassSetItem`](https://docs.rs/regex-syntax/latest/regex_syntax/ast/enum.ClassSetItem.html) variant for `\q`
  a. consisting of an [`Alternation`](https://docs.rs/regex-syntax/latest/regex_syntax/ast/struct.Alternation.html) of [`Concat`](https://docs.rs/regex-syntax/latest/regex_syntax/ast/struct.Concat.html) of [`Literal`](https://docs.rs/regex-syntax/latest/regex_syntax/ast/struct.Literal.html); or
  b. which is a `LiteralString` structured like `Literal`, and `\q{one|two}` is actually two of them, e.g. roughly `[LiteralString { span: "\q{one", s: "one" }, LiteralString { span: "|two}", s: "two" }]`.
- When parsing inside a bracketed class, `\q{` introduces a string set with `macro_rules!`-ish syntax `\q{ $( $($lit:Literal)* )|* }`. Encountering any meta character except a literal escape is an error.
  - `\q` inside a negated bracketed class is an error.
- Add a new [`hir::Class`](https://docs.rs/regex-syntax/latest/regex_syntax/hir/enum.Class.html) variant for `\q`, i.e. a `ClassStrings`. It can't be negated, but it can be `union`ed, `intersect`ed, `difference`d, and `symmetric_difference`d. It's essentially a fancy wrapper around `HashSet<String>`.
  - Any bracketed class containing `\q` ends up being `ClassStrings`.
  - Alternatively, it could be replaced entirely with an alternation of literals in a validation pass.
- When compiling the regex engine from the HIR, string classes are processed like any other alternation of literals (e.g. they should be available for literal prefix optimizations).
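The parsing rule in the list above can be sketched roughly as follows. This is a standalone illustration, not code from regex-syntax: `parse_string_set`, `StringSetError`, and the reduced meta-character set in `is_meta` are all assumptions made for the example, and only escaped punctuation is treated as a "literal escape" here.

```rust
/// Errors this sketch can report; the names are illustrative only.
#[derive(Debug, PartialEq, Eq)]
pub enum StringSetError {
    MissingPrefix,
    UnescapedMeta(char),
    UnsupportedEscape(char),
    Unterminated,
}

/// A simplified meta-character set (an assumption; the real parser's set may differ).
fn is_meta(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$'
    )
}

/// Parses text of the form `\q{lit*|lit*|...}` into its list of strings.
/// Anything after the closing `}` is ignored here and left to the caller.
pub fn parse_string_set(pattern: &str) -> Result<Vec<String>, StringSetError> {
    let body = pattern
        .strip_prefix("\\q{")
        .ok_or(StringSetError::MissingPrefix)?;
    let mut strings = Vec::new();
    let mut current = String::new();
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        match c {
            // The closing brace finishes the last string and the whole set.
            '}' => {
                strings.push(current);
                return Ok(strings);
            }
            // `|` separates one string from the next.
            '|' => strings.push(std::mem::take(&mut current)),
            // A backslash may only escape a meta character in this sketch.
            '\\' => match chars.next() {
                Some(e) if is_meta(e) => current.push(e),
                Some(e) => return Err(StringSetError::UnsupportedEscape(e)),
                None => return Err(StringSetError::Unterminated),
            },
            // Any other unescaped meta character is rejected.
            c if is_meta(c) => return Err(StringSetError::UnescapedMeta(c)),
            c => current.push(c),
        }
    }
    Err(StringSetError::Unterminated)
}
```

With this sketch, `parse_string_set(r"\q{one|two}")` returns the two strings `one` and `two`, while `parse_string_set(r"\q{a+}")` fails with `UnescapedMeta('+')`. Span tracking and the "no `\q` inside a negated class" check are left out for brevity.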

If the general idea seems sound, I'm willing to try my hand at implementing `\q` (to be transparent to HIR, thus invisible to regex-automata).

### References

- [MDN `RegExp.prototype.unicodeSets`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicodeSets)
- [MDN v-mode character class](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class#v-mode_character_class) (`\q` documentation)

