Description
RegexSet can be used to build a tokenizer/lexer fairly easily by creating a set of regexes to match each token type:
```rust
use regex::RegexSet;

let rs = RegexSet::new(&[
    r"^\s+",                                     // 0
    r"^[+-]",                                    // 1
    r"^[*/]",                                    // 2
    r"^\(",                                      // 3
    r"^\)",                                      // 4
    r"^([0-9]*\.)?[0-9]+([eE][+-]?[0-9]+)?",     // 5
    r"^[\p{Alphabetic}_][\d\p{Alphabetic}_]*",   // 6
]).unwrap();
```
Lexing is done by running the input string against the `RegexSet` and constructing a token based on which regex(es) matched (there might be multiple matches if identifiers and keywords are separate regexes, but selecting between conflicting matches is up to the developer of the lexer). However, `RegexSet` only supports checking whether any regex matches, or which regexes matched. For fixed-size tokens this is acceptable: if the `[+-]` regex matches, then you know to extract exactly one character. But for a variable-length token like a number (5) or identifier (6), currently the only way to extract the actual matched text is to keep a separate regex for just that token and run it over the input a second time to find the match length.
This would be much simpler if `RegexSet` supported a mode that captured the complete match from each regex, either as a string slice or as indexes into the input string. This doesn't require handling capture groups, just the complete match.
Another workaround is to try each regex in sequence, but this has its own drawbacks. The biggest is exactly the problem `RegexSet` is intended to solve: the input must be scanned repeatedly, once per regex in the set. It also makes certain kinds of overlapping regexes harder to handle; for example, a keyword like `and` or `nil` is also a valid identifier, so the lexer would have to match both with a single regex and then disambiguate.