Increased memory usage when updating to regex 1.10 #1116

Open
@Marwes

What version of regex are you using?

1.10; I used 1.7 before. The issue seems to be mainly due to the rewrite in 1.9.

Describe the bug at a high level.

After updating to regex 1.10 I am seeing greatly increased memory usage (measured with the dhat crate, see the example below). Part of the issue seems to come from the use of capture groups in the regex. These groups only serve to group the alternatives, so they could (and should) be non-capturing groups, and I have fixed this on my end. However, since captures do not seem to matter on 1.7, I guess there may be a missed optimization here? (#1059 comes to mind.)

(The regex in the example has been altered but it remains the same in spirit and exhibits the same memory increase)
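To make the difference between the two variants concrete, here is a minimal sketch of how the same alternation can be generated with either capturing or non-capturing groups. The helper name build_pattern is hypothetical and not part of the regex crate. It only escapes literal dots and leaves out the (?ux-mUis) flag prefix used in the reproduction, so it illustrates the shape of the pattern rather than replacing it.

```rust
/// Hypothetical helper (not part of the regex crate): builds an alternation
/// of end-anchored domain suffixes, wrapping each alternative in either a
/// capturing group `( ... )` or a non-capturing group `(?: ... )`.
fn build_pattern(domains: &[&str], capturing: bool) -> String {
    let open = if capturing { "(" } else { "(?:" };
    let alternatives: Vec<String> = domains
        .iter()
        .map(|domain| {
            // Assumption: '.' is the only metacharacter in the input.
            let escaped = domain.replace('.', r"\.");
            format!("{}{}$)", open, escaped)
        })
        .collect();
    alternatives.join("|")
}

fn main() {
    let domains = ["is.gd", "ow.ly"];
    // Capturing variant, as in with_captures().
    println!("{}", build_pattern(&domains, true));
    // Non-capturing variant, as in without_captures().
    println!("{}", build_pattern(&domains, false));
}
```

Both strings match the same inputs; the only difference is whether each alternative also defines a capture group that the engine has to account for.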

What are the steps to reproduce the behavior?

The following code can be used to reproduce the behavior by using dhat to track memory and changing the regex version.

// Cargo.toml
// regex = "=1.10"
// dhat = "0.3"

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    without_captures();
    with_captures();
}

fn without_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (?:craigslist\.org$)|
        (?:utexas\.edu$)|
        (?:blogs\.com$)|
        (?:is\.gd$)|
        (?:vkontakte\.ru$)|
        (?:google\.com\.hk$)|
        (?:vimeo\.com$)|
        (?:simplemachines\.org$)|
        (?:plala\.or\.jp$)|
        (?:npr\.org$)|
        (?:census\.gov$)|
        (?:360\.cn$)|
        (?:wisc\.edu$)|
        (?:princeton\.edu$)|
        (?:addthis\.com$)|
        (?:google\.de$)|
        (?:ox\.ac\.uk$)|
        (?:free13runpool\.com$)|
        (?:berkeley\.edu$)|
        (?:fda\.gov$)|
        (?:soundcloud\.com$)|
        (?:ftc\.gov$)|
        (?:cloudflare\.com$)|
        (?:com\.com$)|
        (?:statcounter\.com$)|
        (?:tumblr\.com$)|
        (?:alexa\.com$)|
        (?:canalblog\.com$)|
        (?:uiuc\.edu$)|
        (?:msu\.edu$)|
        (?:bravesites\.com$)|
        (?:usatoday\.com$)|
        (?:edublogs\.org$)|
        (?:forbes\.com$)|
        (?:patch\.com$)|
        (?:1688\.com$)|
        (?:ihg\.com$)|
        (?:ow\.ly$)|
        (?:usda\.gov$)|
        (?:yellowbook\.com$)|
        (?:wired\.com$)|
        (?:homestead\.com$)|
        (?:state\.tx\.us$)|
        (?:webnode\.com$)|
        (?:123-reg\.co\.uk$)|
        (?:irs\.gov$)|
        (?:yale\.edu$)|
        (?:naver\.com$)|
        (?:elpais\.com$)|
        (?:example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}

fn with_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (craigslist\.org$)|
        (utexas\.edu$)|
        (blogs\.com$)|
        (is\.gd$)|
        (vkontakte\.ru$)|
        (google\.com\.hk$)|
        (vimeo\.com$)|
        (simplemachines\.org$)|
        (plala\.or\.jp$)|
        (npr\.org$)|
        (census\.gov$)|
        (360\.cn$)|
        (wisc\.edu$)|
        (princeton\.edu$)|
        (addthis\.com$)|
        (google\.de$)|
        (ox\.ac\.uk$)|
        (free13runpool\.com$)|
        (berkeley\.edu$)|
        (fda\.gov$)|
        (soundcloud\.com$)|
        (ftc\.gov$)|
        (cloudflare\.com$)|
        (com\.com$)|
        (statcounter\.com$)|
        (tumblr\.com$)|
        (alexa\.com$)|
        (canalblog\.com$)|
        (uiuc\.edu$)|
        (msu\.edu$)|
        (bravesites\.com$)|
        (usatoday\.com$)|
        (edublogs\.org$)|
        (forbes\.com$)|
        (patch\.com$)|
        (1688\.com$)|
        (ihg\.com$)|
        (ow\.ly$)|
        (usda\.gov$)|
        (yellowbook\.com$)|
        (wired\.com$)|
        (homestead\.com$)|
        (state\.tx\.us$)|
        (webnode\.com$)|
        (123-reg\.co\.uk$)|
        (irs\.gov$)|
        (yale\.edu$)|
        (naver\.com$)|
        (elpais\.com$)|
        (example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}

Memory stats from running the example

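For each version, the first block is the output of without_captures() and the second is the output of with_captures() (both functions print the same "with captures" label). On 1.7.3 the two variants are nearly identical. On 1.10.2, capturing groups raise max_bytes from 228249 to 1242568, roughly a 5x increase, and curr_bytes grows by an even larger factor.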

https://docs.rs/dhat/latest/dhat/struct.HeapStats.html

1.7.3
Match `true`, with captures: HeapStats {
    total_blocks: 4137,
    total_bytes: 1189678,
    curr_blocks: 48,
    curr_bytes: 114285,
    max_blocks: 212,
    max_bytes: 247538,
}
Match `true`, with captures: HeapStats {
    total_blocks: 4152,
    total_bytes: 1201606,
    curr_blocks: 48,
    curr_bytes: 121921,
    max_blocks: 212,
    max_bytes: 247338,
}

1.10.2

Match `true`, with captures: HeapStats {
    total_blocks: 3486,
    total_bytes: 763125,
    curr_blocks: 221,
    curr_bytes: 160832,
    max_blocks: 1215,
    max_bytes: 228249,
}
Match `true`, with captures: HeapStats {
    total_blocks: 3694,
    total_bytes: 1871135,
    curr_blocks: 221,
    curr_bytes: 1242544,
    max_blocks: 216,
    max_bytes: 1242568,
}
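As a quick sanity check on the "5x" figure, the ratios can be computed directly from the numbers above. This is just arithmetic on the reported HeapStats values; the ratio helper is a hypothetical name introduced for illustration.

```rust
/// Hypothetical helper: how many times larger `with_captures` is
/// compared to `without_captures`.
fn ratio(with_captures: u64, without_captures: u64) -> f64 {
    with_captures as f64 / without_captures as f64
}

fn main() {
    // Values copied from the 1.10.2 output above.
    println!("1.10.2 max_bytes:  {:.2}x", ratio(1_242_568, 228_249));
    println!("1.10.2 curr_bytes: {:.2}x", ratio(1_242_544, 160_832));
    // Values copied from the 1.7.3 output above, for comparison.
    println!("1.7.3 max_bytes:   {:.2}x", ratio(247_338, 247_538));
}
```

On 1.10.2 the peak heap size is a bit over five times larger with capturing groups, while on 1.7.3 the two variants are within a fraction of a percent of each other.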
