Skip to content

regex with many literal alternates can erroneously match the empty string #999

Closed
@BurntSushi

Description

@BurntSushi

This program panics:

fn main() {
    let needles = vec![
        "aA", "bA", "cA", "dA", "eA", "fA", "gA", "hA", "iA", "jA", "kA", "lA",
        "mA", "nA", "oA", "pA", "qA", "rA", "sA", "tA", "uA", "vA", "wA", "xA",
        "yA", "zA",
    ];
    let pattern = needles.join("|");
    let re = regex::Regex::new(&pattern).unwrap();
    let hay = "FUBAR";
    assert_eq!(0, re.find_iter(hay).count());
}

But it should run without panicking as the regex should not match anything in FUBAR.

This is another bug caused by the literal optimizer. This one is pretty hard to hit. All of the following need to be true:

  • The literals extracted need to be "complete." That is, the language described by the regex is small and finite.
  • There needs to be at least 26 distinct starting bytes among all of the elements in the language described by the regex.
  • There needs to be fewer than 26 distinct ending bytes among all of the elements in the language described by the regex.
  • Possibly other criteria...

This causes a weird code path in src/exec.rs that results in using an "empty" prefix literal matcher that matches at every position.

I'll plan to get a fix up for this soon, but this bug does not impact the rewrite. (Its literal infrastructure is far more principled than the hodge podge on master. We'll see if it lives up to my expectations though.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions