Poor performance in some cases compared to oniguruma #604

Closed
@lopopolo

I'm looking at replacing oniguruma with regex in some situations for the Ruby that I'm building.

I am benchmarking the following three Regexps against this several-megabyte text corpus:

bench('Email', '[\w\.+-]+@[\w\.-]+\.[\w\.-]+')
bench('URI', 'https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')
bench('IP', '\b(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b')
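For readers without the Artichoke checkout, here is a minimal sketch of what a `bench`-style helper could measure. The names `time_ms` and `bench_regexp` are hypothetical and are not taken from `string_scan.rb`; the three timed phases (compile, scan, scan with block) and the 50-iteration default are assumptions inferred from the output shown in this issue.

```ruby
# Hypothetical helper (not from string_scan.rb): runs the block `iterations`
# times and returns the total elapsed wall-clock time in milliseconds.
def time_ms(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
end

# Hypothetical helper: times Regexp compilation, String#scan, and
# String#scan with a block for one pattern over one corpus.
def bench_regexp(pattern, corpus, iterations: 50)
  regexp = Regexp.new(pattern)
  {
    matches: corpus.scan(regexp).length,
    compile_ms: time_ms(iterations) { Regexp.new(pattern) },
    scan_ms: time_ms(iterations) { corpus.scan(regexp) },
    scan_block_ms: time_ms(iterations) { corpus.scan(regexp) { |m| m } }
  }
end
```

Calling `bench_regexp('[\w\.+-]+@[\w\.-]+\.[\w\.-]+', corpus)` on a string loaded from the corpus file would return the match count plus total milliseconds per phase; dividing each total by `iterations` gives the per-iteration average.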

Comparing average scan times: for Email, regex is about 10x faster than oniguruma. For URI, regex is about 1.7x slower than oniguruma. For IP, regex is about 18x slower than oniguruma.
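As a sanity check on these ratios, the sketch below divides the average `scan` times reported in this issue (milliseconds per iteration). The variable names are illustrative only; the numbers are copied from the benchmark output.

```ruby
# Average `scan` time per iteration in milliseconds, copied from the
# benchmark output in this issue.
regex_ms = { 'Email' => 31.38, 'URI' => 46.72, 'IP' => 513.87 }
onig_ms  = { 'Email' => 326.7, 'URI' => 27.33, 'IP' => 29.3 }

# Ratio of regex time to oniguruma time: below 1.0 means regex is faster,
# above 1.0 means regex is slower.
ratios = regex_ms.to_h { |name, ms| [name, (ms / onig_ms[name]).round(2)] }

ratios.each do |name, ratio|
  puts format('%-5s regex/oniguruma = %.2f', name, ratio)
end
```

A ratio near 0.1 for Email corresponds to regex being roughly 10x faster, while the URI and IP ratios show how many times slower regex is on those patterns.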

regex performance

Email: 92 matches
..................................................

    compile: 85.52ms elapsed in 50 iterations (avg. 1.71ms / iteration)
    scan: 1569.22ms elapsed in 50 iterations (avg. 31.38ms / iteration)
    scan with block: 1025.49ms elapsed in 50 iterations (avg. 20.5ms / iteration)

URI: 5388 matches
..................................................

    compile: 71.84ms elapsed in 50 iterations (avg. 1.43ms / iteration)
    scan: 2336.46ms elapsed in 50 iterations (avg. 46.72ms / iteration)
    scan with block: 2045.79ms elapsed in 50 iterations (avg. 40.91ms / iteration)

IP: 6 matches
..................................................

    compile: 10.79ms elapsed in 50 iterations (avg. 0.21ms / iteration)
    scan: 25693.73ms elapsed in 50 iterations (avg. 513.87ms / iteration)
    scan with block: 25642.21ms elapsed in 50 iterations (avg. 512.84ms / iteration)

oniguruma performance (via rust-onig)

Email: 92 matches
..................................................

    compile: 5.89ms elapsed in 50 iterations (avg. 0.11ms / iteration)
    scan: 16335.45ms elapsed in 50 iterations (avg. 326.7ms / iteration)
    scan with block: 16228.96ms elapsed in 50 iterations (avg. 324.57ms / iteration)

URI: 5388 matches
..................................................

    compile: 1.68ms elapsed in 50 iterations (avg. 0.03ms / iteration)
    scan: 1366.95ms elapsed in 50 iterations (avg. 27.33ms / iteration)
    scan with block: 1349.82ms elapsed in 50 iterations (avg. 26.99ms / iteration)

IP: 6 matches
..................................................

    compile: 3.79ms elapsed in 50 iterations (avg. 0.07ms / iteration)
    scan: 1465.14ms elapsed in 50 iterations (avg. 29.3ms / iteration)
    scan with block: 1431.35ms elapsed in 50 iterations (avg. 28.62ms / iteration)

If you're interested in doing so, you can invoke this benchmark in Artichoke with:

cargo run --release --bin string_scan_bench -- artichoke-frontend/ruby/benches/string_scan.rb

The benchmark on master (which uses oniguruma) differs from the benchmark on this branch: I've tweaked the Regexps on this branch to remove lookahead patterns, which regex does not support.
