Description
regex: (?m)(^|a)+
haystack: a\naaa\n
iteration output: (0, 0) (2, 2) (3, 5) (6, 6)
but...
regex: (?m)(^|a)(^|a)*
haystack: a\naaa\n
iteration output: (0, 1) (2, 5) (6, 6)
The actual problem here, we (myself, @rsc and @junyer) believe, is that (^|a)*
(and more generally, (|a)*
), is what's wrong here. In particular:
regex: (?m)(^|a)*
haystack: a\naaa\n
iteration output: (0, 1) (2, 5) (6, 6)
But the output should actually be (0, 0) (1, 1) (2, 2) (3, 5) (6, 6)
, and this matches precisely what PCRE2 outputs.
This crate, RE2 and Go's regexp
standard library package all have the same bug. @rsc has already put up a patch for Go: https://go-review.googlesource.com/c/go/+/318750
I hope to fix this as part of #656, once the new NFA compiler lands. In particular, @rsc points out an elegant fix for this: compile x*
as (x+)?
. One thing that isn't clear is whether it's wise to actually fix it or not, since it could break existing regexes. But I believe it's limited to fairly pathological cases (repetition of a sub-expression that can match the empty string preferentially). It might be worth just doing.