Skip to content

some regexes fail to satisfy that (re)+ is always equal to (re)(re)* #779

Closed
@BurntSushi

Description

@BurntSushi

regex: (?m)(^|a)+

haystack: a\naaa\n

iteration output: (0, 0) (2, 2) (3, 5) (6, 6)

but...

regex: (?m)(^|a)(^|a)*

haystack: a\naaa\n

iteration output: (0, 1) (2, 5) (6, 6)

The actual problem here, we (myself, @rsc and @junyer) believe, is that (^|a)* (and more generally, (|a)*), is what's wrong here. In particular:

regex: (?m)(^|a)*

haystack: a\naaa\n

iteration output: (0, 1) (2, 5) (6, 6)

But the output should actually be (0, 0) (1, 1) (2, 2) (3, 5) (6, 6), and this matches precisely what PCRE2 outputs.

This crate, RE2 and Go's regexp standard library package all have the same bug. @rsc has already put up a patch for Go: https://go-review.googlesource.com/c/go/+/318750

I hope to fix this as part of #656, once the new NFA compiler lands. In particular, @rsc points out an elegant fix for this: compile x* as (x+)?. One thing that isn't clear is whether it's wise to actually fix it or not, since it could break existing regexes. But I believe it's limited to fairly pathological cases (repetition of a sub-expression that can match the empty string preferentially). It might be worth just doing.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions