UnicodeSetsMode support (v flag mode, \q) #1142

Open
@CAD97

ECMAScript RegExp supports an interesting feature: v-mode enables \p to match string properties (e.g. \p{RGI_Emoji_Flag_Sequence}, roughly \p{Regional_Indicator}{2}), allows [] character classes to match finite-length strings, and adds the \q{regexp} escape for writing "string literals" in character classes (a nested regex in which the only valid syntax is literals, escaped literals, and | disjunction).

This support is quite interesting because it allows doing set operations on (finite) sets of (finite) strings (e.g. grapheme clusters / emoji), which can enable matching patterns that might otherwise be achieved with lookahead, e.g. ^[\p{RGI_Emoji_Flag_Sequence}--\q{🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷}]$ versus ^(?!🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷)\p{Regional_Indicator}{2}$ (example from MDN). This functionality is required for implementing things such as the UAX#31 Emoji Profile for identifiers, which "requires the use of character sequences, rather than individual code points, in the sets Start and Continue defined by UAX31-D1."

I don't think the regex crate particularly needs to carry additional Unicode tables[1] for extended \p, but \q seems reasonably straightforward to support without impacting users who don't use it (modulo some code size / compile time), and it makes good chunks of the problem space at least possible to address, if still annoying[2]. Because \q is so restricted, it would even be sufficient to not permit alternation within \q and to require writing multiple \q instead, e.g. making the prior regex ^[\p{RGI_Emoji_Flag_Sequence}--[\q{🇺🇸}\q{🇨🇳}\q{🇷🇺}\q{🇬🇧}\q{🇫🇷}]]$, at no loss of expressiveness, just convenience.

Is it possible that the regex crate could someday support \q in character classes? I understand it could be a significant chunk of extra work to allow classes to match longer strings. On the other hand, since sets already need to handle UTF-8's variable-length encoding of 1..=4 bytes per code point, generalizing that to 1.. bytes might not be completely different. And while that kind of processing could be done, it doesn't need to be for what is a quite niche feature: after resolving set operations and longest-first ordering, a string class is no different from a regular top-level alternation.
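To make the "no different from alternation" point concrete, here is a rough sketch (std only, not actual regex-syntax code) of resolving a set difference over strings and lowering the result to a longest-first alternation. The names `escape_literal` and `lower_difference` are hypothetical, and the hand-rolled escaper only stands in for what `regex::escape` would do, so that the sketch needs no dependencies.

```rust
use std::collections::HashSet;

/// Hypothetical helper: backslash-escape regex meta characters in a literal.
/// Reimplemented here only to keep the sketch dependency-free.
fn escape_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if r"\.+*?()|[]{}^$#&-~".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Resolve `include -- exclude`, then lower what is left to an alternation
/// of literals. Returns `None` for an empty set, because an empty group
/// would match the empty string instead of matching nothing.
fn lower_difference(include: &HashSet<String>, exclude: &HashSet<String>) -> Option<String> {
    let mut strings: Vec<&String> = include.difference(exclude).collect();
    if strings.is_empty() {
        return None;
    }
    // Longest first, so a leftmost-first engine never stops at a shorter
    // string that is a prefix of a longer one. Ties are broken
    // lexicographically to keep the output deterministic.
    strings.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b)));
    let alternatives: Vec<String> = strings.iter().map(|s| escape_literal(s)).collect();
    Some(format!("(?:{})", alternatives.join("|")))
}
```

With this, a class like [\q{a|ab|abc}--\q{ab}] would come out as (?:abc|a), which the existing engines already know how to handle.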

Concrete changes

  • Add a new ast::ClassSetItem variant for \q
    a. consisting of an Alternation of Concat of Literal; or
    b. which is a LiteralString structured like Literal, and \q{one|two} is actually two of them, e.g. roughly [LiteralString { span: "\q{one", s: "one" }, LiteralString { span: "|two}", s: "two" }].
  • When parsing inside a bracketed class, \q{ introduces a string set with macro_rules!-ish syntax \q{ $( $($lit:Literal)* )|* }. Encountering any meta character except a literal escape is an error.
    • \q inside a negated bracketed class is an error.
  • Add a new hir::Class variant for \q, i.e. a ClassStrings. It can't be negated, but it can be unioned, intersected, differenced, and symmetric_differenced. It's essentially a fancy wrapper around HashSet<String>.
    • Any bracketed class containing \q ends up being ClassStrings.
    • Potentially replaced entirely with an alternation of literals in a validation pass.
  • When compiling the regex engine from the HIR, string classes are processed like any other alternation of literals (e.g. they should be available for literal prefix optimizations).
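As a minimal sketch of the parsing and HIR steps listed above (again std only, and not the actual regex-syntax API): `ClassStrings` here is a stand-in for the proposed HIR node, while `QError`, `is_meta`, and `parse_q_body` are hypothetical names made up for illustration. It uses a `BTreeSet` rather than the `HashSet` mentioned above purely for deterministic iteration, and it assumes an empty alternative denotes the empty string.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type for this sketch; not part of regex-syntax.
#[derive(Debug, PartialEq)]
enum QError {
    /// An unescaped meta character appeared inside `\q{...}`.
    UnexpectedMeta(char),
    /// A backslash was followed by something other than a meta character.
    UnsupportedEscape(char),
    /// The body ended right after a backslash.
    DanglingEscape,
}

/// Characters this sketch treats as meta characters (an assumption).
fn is_meta(c: char) -> bool {
    matches!(c, '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$')
}

/// Stand-in for the proposed `ClassStrings` HIR node: a set of strings.
#[derive(Debug, Default, PartialEq)]
struct ClassStrings {
    strings: BTreeSet<String>,
}

impl ClassStrings {
    /// Parse the text between `\q{` and `}` into a set of strings.
    /// Only literals, escaped meta characters, and `|` are accepted.
    fn parse_q_body(body: &str) -> Result<ClassStrings, QError> {
        let mut strings = BTreeSet::new();
        let mut current = String::new();
        let mut chars = body.chars();
        while let Some(c) = chars.next() {
            match c {
                // `|` ends the current alternative and starts a new one.
                '|' => {
                    strings.insert(std::mem::take(&mut current));
                }
                // A backslash may only escape a meta character.
                '\\' => match chars.next() {
                    Some(e) if is_meta(e) => current.push(e),
                    Some(e) => return Err(QError::UnsupportedEscape(e)),
                    None => return Err(QError::DanglingEscape),
                },
                c if is_meta(c) => return Err(QError::UnexpectedMeta(c)),
                c => current.push(c),
            }
        }
        strings.insert(current);
        Ok(ClassStrings { strings })
    }

    /// Set union, as when several `\q{...}` items share one bracketed class.
    fn union(&self, other: &ClassStrings) -> ClassStrings {
        ClassStrings { strings: self.strings.union(&other.strings).cloned().collect() }
    }

    /// Set difference, i.e. the `--` operator.
    fn difference(&self, other: &ClassStrings) -> ClassStrings {
        ClassStrings { strings: self.strings.difference(&other.strings).cloned().collect() }
    }
}
```

Intersection and symmetric difference would follow the same pattern. Rejecting \q inside a negated class is left out here, since that check belongs to the surrounding class parser.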

If the general idea seems sound, I'm willing to try my hand at implementing \q (to be transparent to HIR, thus invisible to regex-automata).

Footnotes

  1. At least not until such a day as icu4x allows it to be plausible for other crates to share the same underlying tables and/or for the binary package to have some control over what tables are built into the executable. Along that line, \p would be quite interesting — probably something like a fn(&self, &str) -> Option<impl ExactSizeIterator<Item=&str>> callback, to handle v-mode string sets, with a fast path for CodePointInversionLists — but that's drifting far off topic.

  2. Listing out each string in the relevant sets in |-delimited lists, oh my.
