Description
ECMAScript `RegExp` supports an interesting feature: v-mode enables `\p` to match string properties (e.g. `\p{RGI_Emoji_Flag_Sequence}`, roughly `\p{Regional_Indicator}{2}`), allows `[]` character classes to match finite-length strings, and adds the `\q{regexp}` escape for writing "string literals" in character classes (a nested regex in which the only valid syntax is literals, escaped literals, and `|` disjunction).
This support is quite interesting because it allows set operations on (finite) sets of (finite) strings (e.g. grapheme clusters / emoji), which can express patterns that would otherwise need lookahead, e.g. `^[\p{RGI_Emoji_Flag_Sequence}--\q{🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷}]$` versus `^(?!🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷)\p{Regional_Indicator}{2}$` (example from MDN). This functionality is required for implementing things such as the UAX #31 Emoji Profile for identifiers, which "requires the use of character sequences, rather than individual code points, in the sets Start and Continue defined by UAX31-D1."
I don't think the regex crate particularly needs to carry around additional Unicode tables [1] for extended `\p`, but `\q` seems reasonably straightforward to support without impacting users who don't use it (modulo some code size / compile time), and it makes good chunks of the problem space at least possible to address, if still annoying [2]. Because `\q` is so restricted, it would even be sufficient to not permit alternation within the `\q` and require writing multiple `\q` instead, e.g. making the prior regex `^[\p{RGI_Emoji_Flag_Sequence}--[\q{🇺🇸}\q{🇨🇳}\q{🇷🇺}\q{🇬🇧}\q{🇫🇷}]]$`, at no loss of expressiveness, just convenience.
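To make the restricted grammar concrete, here is a minimal sketch of parsing the text between the braces of `\q{...}` into the strings it denotes. Everything in it is an assumption for illustration: `parse_class_string_body`, `ClassStringError`, and the exact list of meta characters are hypothetical and are not part of the regex crate.

```rust
/// Hypothetical error type for the sketch below.
#[derive(Debug, PartialEq)]
enum ClassStringError {
    /// A meta character appeared without a preceding backslash.
    UnescapedMeta(char),
    /// A backslash was followed by something other than a meta character.
    UnsupportedEscape(char),
    /// The body ended with a lone backslash.
    DanglingEscape,
}

/// Parse the body of `\q{...}` (the text between the braces) into the
/// strings it denotes. Only literals, escaped meta characters, and `|`
/// are accepted; this is an assumed simplification of the real grammar.
fn parse_class_string_body(body: &str) -> Result<Vec<String>, ClassStringError> {
    // Assumed set of meta characters for this sketch.
    const META: &[char] = &[
        '\\', '|', '{', '}', '[', ']', '(', ')', '.', '*', '+', '?', '^', '$',
    ];

    let mut alternatives = Vec::new();
    let mut current = String::new();
    let mut chars = body.chars();

    while let Some(c) = chars.next() {
        match c {
            // `|` ends the current string and starts the next one.
            '|' => alternatives.push(std::mem::take(&mut current)),
            // A backslash may only escape a meta character.
            '\\' => match chars.next() {
                Some(e) if META.contains(&e) => current.push(e),
                Some(e) => return Err(ClassStringError::UnsupportedEscape(e)),
                None => return Err(ClassStringError::DanglingEscape),
            },
            // Any other bare meta character is rejected.
            m if META.contains(&m) => return Err(ClassStringError::UnescapedMeta(m)),
            // Everything else is a plain literal.
            other => current.push(other),
        }
    }
    alternatives.push(current);
    Ok(alternatives)
}
```

Under this sketch `\q{one|two}` and `[\q{one}\q{two}]` denote the same set: the first yields two strings from one call, the second yields one string from each of two calls.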
Is it possible that the regex crate could someday support `\q` in character classes? I do understand it could be a significant chunk of extra work to allow classes to match longer strings. On the other hand, since sets already need to handle UTF-8's variable-length (`1..=4` byte) code point encoding, generalizing that to `1..` bytes might not be completely different. And while that kind of processing could be done, it doesn't need to be for what is a quite niche feature: after resolving set operations and longest-first ordering, a string class is no different from a regular top-level alternation.
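As a rough illustration of that last point, the sketch below lowers an already-resolved set of strings into a longest-first alternation pattern. The names `escape_literal` and `lower_to_alternation` are hypothetical; `escape_literal` only stands in for something like `regex::escape` so that the example needs nothing beyond the standard library.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an escaping helper such as `regex::escape`:
/// prefixes each meta character with a backslash.
fn escape_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if matches!(
            c,
            '\\' | '|' | '.' | '+' | '*' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '^' | '$'
        ) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Lower a resolved, non-empty set of strings to a non-capturing alternation.
/// An empty set would need a never-matching pattern; that case is not
/// handled in this sketch.
fn lower_to_alternation(strings: &HashSet<String>) -> String {
    let mut sorted: Vec<&str> = strings.iter().map(String::as_str).collect();
    // Longest first (by byte length) so that a string is always tried before
    // any of its proper prefixes; ties are broken lexicographically to keep
    // the output deterministic.
    sorted.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b)));
    let alternatives: Vec<String> = sorted.into_iter().map(escape_literal).collect();
    format!("(?:{})", alternatives.join("|"))
}
```

Sorting by byte length is enough here because a proper prefix is always strictly shorter than the string it prefixes, so under leftmost-first matching the longer candidate gets the first chance.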
Concrete changes
- Add a new `ast::ClassSetItem` variant for `\q`
  a. consisting of an `Alternation` of `Concat` of `Literal`; or
  b. which is a `LiteralString` structured like `Literal`, and `\q{one|two}` is actually two of them, e.g. roughly `[LiteralString { span: "\q{one", s: "one" }, LiteralString { span: "|two}", s: "two" }]`.
- When parsing inside a bracketed class, `\q{` introduces a string set with `macro_rules!`-ish syntax `\q{ $( $($lit:Literal)* )|* }`. Encountering any meta character except a literal escape is an error. `\q` inside a negated bracketed class is an error.
- Add a new `hir::Class` variant for `\q`, i.e. a `ClassStrings`. It can't be negated, but it can be `union`ed, `intersect`ed, `difference`d, and `symmetric_difference`d. It's essentially a fancy wrapper around `HashSet<String>`.
  - Any bracketed class containing `\q` ends up being `ClassStrings`.
  - Potentially replaced entirely with an alternation of literals in a validation pass.
- When compiling the regex engine from the HIR, string classes are processed like any other alternation of literals (e.g. they should be available for literal prefix optimizations).
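The proposed HIR type could look roughly like the sketch below. This is only an assumption about its shape: `ClassStrings` does not exist in regex-syntax today, and the field and method names here are hypothetical, mirroring the set operations listed above on top of `HashSet<String>`.

```rust
use std::collections::HashSet;

/// Hypothetical `ClassStrings`: a finite set of finite strings.
/// There is deliberately no `negate`, since the complement of a finite
/// set of strings is not finite.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct ClassStrings {
    strings: HashSet<String>,
}

impl ClassStrings {
    /// Build a class from anything yielding string-like items.
    fn new<I, S>(items: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        ClassStrings { strings: items.into_iter().map(Into::into).collect() }
    }

    /// Strings in either class (plain juxtaposition inside `[...]`).
    fn union(&self, other: &Self) -> Self {
        ClassStrings { strings: self.strings.union(&other.strings).cloned().collect() }
    }

    /// Strings in both classes (`&&`).
    fn intersect(&self, other: &Self) -> Self {
        ClassStrings { strings: self.strings.intersection(&other.strings).cloned().collect() }
    }

    /// Strings in `self` but not in `other` (`--`).
    fn difference(&self, other: &Self) -> Self {
        ClassStrings { strings: self.strings.difference(&other.strings).cloned().collect() }
    }

    /// Strings in exactly one of the two classes.
    fn symmetric_difference(&self, other: &Self) -> Self {
        ClassStrings {
            strings: self.strings.symmetric_difference(&other.strings).cloned().collect(),
        }
    }
}
```

With this shape, `[\q{one|two|three}--\q{one|two}]` would resolve to `ClassStrings::new(["one", "two", "three"]).difference(&ClassStrings::new(["one", "two"]))`, leaving a single string that a later pass could turn into an ordinary literal.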
If the general idea seems sound, I'm willing to try my hand at implementing `\q` (to be transparent to HIR, thus invisible to regex-automata).
References
- MDN `RegExp.prototype.unicodeSets`
- MDN v-mode character class (`\q` documentation)
Footnotes
1. At least not until such a day as icu4x allows it to be plausible for other crates to share the same underlying tables and/or for the binary package to have some control over what tables are built into the executable. Along that line, `\p` would be quite interesting (probably something like a `fn(&self, &str) -> Option<impl ExactSizeIterator<Item=&str>>` callback, to handle v-mode string sets, with a fast path for `CodePointInversionList`s), but that's drifting far off topic.
2. Listing out each string in the relevant sets in `|`-delimited lists, oh my.