[5.7] Merge benchmarker improvements and character class bitset optimization #532

Merged

rctcwyvrn
Contributor

Cherry-pick of the benchmarker improvements (#501, #509, #512) as well as the performance improvement in #511.

Based off of #531

natecook1000 and others added 11 commits June 30, 2022 11:32
^ and $ should match the start and end of the callee, even if that
callee is a substring. Right now ^ and $ match the start and end of
the callee's base string, instead. In addition, ^ and $ should only
match the start and end of the callee when replacing a subrange, not
the start and end of the subrange.
This prepares for adopting an opaque result type for matches(of:)
and ranges(of:). The old, CollectionConsumer-based model moves
index-by-index and isn't aware of the regex's semantic level,
which produces inaccurate results for regexes that match at a
mid-character index.
* Avoid double execution by avoiding Array init

* De-genericize processor, engine, etc.

Provides only modest performance improvements (it was already getting
specialized), but makes it possible to add String-specific specializations.
* Allow CustomConsuming types to match w/ zero width

We previously asserted when a custom consuming type matched with zero
width, but that restriction is neither necessary nor desirable: a custom
type can legitimately implement a lookaround assertion or act as a tracer.

* Rename Processor.advance(to:) to resume(at:)

Since the given index doesn’t need to advance, this name is less
misleading.
This separates the two different ideas for boundaries in
the base input:

- subjectBounds: These represent the actual subject in the input
  string. For a `String` callee, this will cover the entire bounds,
  while for a `Substring` these will represent the bounds of the
  substring in the base.
- searchBounds: These represent the current search range in the
  subject. These bounds can be the same as `subjectBounds` or a
  subrange when searching for subsequent matches or replacing only
  in a subrange of a string.

* firstMatch shouldn't update searchBounds on iteration

When we move forward while searching for the first match, the search
bounds should stay the same. Only the currentPosition needs to move
forward. This will allow us to implement the \G start of match anchor,
with which /\Gab/ matches "abab" twice, compared with /^ab/, which
only matches once.

* Make matches(of:) and ranges(of:) boundary-aware

With this change, RegexMatchesCollection keeps the subject bounds
and search bounds separately, modifying the search bounds with each
iteration. In addition, the replace methods that only operate on a
subrange can specify that subrange explicitly, getting the correct anchor
behavior while only matching within a portion of a string.
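
To make the split between the two kinds of bounds concrete, here is a minimal toy sketch. It is not the library's implementation: `ToyAnchor`, `ToyLiteralSearcher`, and `toyMatches` are hypothetical names, indices are plain `Int` offsets into a character array, and only literal patterns are modeled. It assumes `^` is checked against the subject bounds and `\G` against the current search bounds, as the commit messages above describe.

```swift
// Toy model only: hypothetical names, not the library's types.
enum ToyAnchor {
    case none          // no anchor
    case subjectStart  // models ^
    case searchStart   // models \G
}

struct ToyLiteralSearcher {
    let input: [Character]         // the base string
    let subjectBounds: Range<Int>  // extent of the callee within the base
    let literal: [Character]
    let anchor: ToyAnchor

    // Try to match exactly at `position`.
    func match(at position: Int, in searchBounds: Range<Int>) -> Range<Int>? {
        switch anchor {
        case .none:
            break
        case .subjectStart:
            // ^ looks at the subject, never at the base or the search range.
            if position != subjectBounds.lowerBound { return nil }
        case .searchStart:
            // \G looks at where the current search began.
            if position != searchBounds.lowerBound { return nil }
        }
        let end = position + literal.count
        guard end <= searchBounds.upperBound else { return nil }
        guard input[position..<end].elementsEqual(literal) else { return nil }
        return position..<end
    }

    // Scan forward: only the position moves, the search bounds stay fixed.
    func firstMatch(in searchBounds: Range<Int>) -> Range<Int>? {
        for position in searchBounds.lowerBound...searchBounds.upperBound {
            if let range = match(at: position, in: searchBounds) { return range }
        }
        return nil
    }

    // Collect every match, narrowing the search bounds after each one
    // while leaving the subject bounds untouched.
    func allMatches() -> [Range<Int>] {
        var results: [Range<Int>] = []
        var searchBounds = subjectBounds
        while let range = firstMatch(in: searchBounds) {
            results.append(range)
            let next = range.isEmpty ? range.upperBound + 1 : range.upperBound
            guard next <= subjectBounds.upperBound else { break }
            searchBounds = next..<subjectBounds.upperBound
        }
        return results
    }
}

// The subject "abab" sits at offsets 2..<6 of the base "xxabab".
func toyMatches(_ literal: String, anchor: ToyAnchor) -> [Range<Int>] {
    let searcher = ToyLiteralSearcher(
        input: Array("xxabab"),
        subjectBounds: 2..<6,
        literal: Array(literal),
        anchor: anchor)
    return searcher.allMatches()
}

print(toyMatches("ab", anchor: .subjectStart))  // ^ab
print(toyMatches("ab", anchor: .searchStart))   // \Gab
```

In this model the `^`-style anchor yields one match and the `\G`-style anchor yields two, which mirrors the /^ab/ versus /\Gab/ contrast described above. Because `firstMatch` never moves `searchBounds` while scanning, a `\G`-style match cannot succeed at a position the scan merely stepped to.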
* [benchmark] Add no-capture version of grapheme breaking exercise

* [benchmark] Add cross-engine benchmark helpers

* [benchmark] Hangul Syllable finding benchmark
- Adds benchmarks for html and email regexes
- Adds support to save and compare benchmarker runs

Co-authored-by: Michael Ilseman <[email protected]>
- Space out the names properly instead of relying on tabs
- Add a decimal point to the percentage
- Filter out NS benchmarks from the comparison
- Sort comparisons by amount of improvement/regression
  (by seconds, not %, because we have lots of variance and low-runtime benchmarks)
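
A rough sketch of the comparison output described in the list above, under stated assumptions: `ToyComparison` and `toyReport` are hypothetical names rather than the benchmarker's types, NS benchmarks are assumed to be identifiable by an `_NS` name suffix, and "most improved first" is an assumed sort direction.

```swift
import Foundation

// Hypothetical record for one benchmark present in both runs.
struct ToyComparison {
    let name: String
    let baselineSeconds: Double
    let latestSeconds: Double

    var deltaSeconds: Double { latestSeconds - baselineSeconds }
    var deltaPercent: Double { deltaSeconds / baselineSeconds * 100 }
}

func toyReport(_ comparisons: [ToyComparison]) -> [String] {
    // Assumption: NS benchmarks carry an "_NS" suffix in their name.
    let kept = comparisons.filter { !$0.name.hasSuffix("_NS") }
    // Pad names with spaces to a common width instead of relying on tabs.
    let width = (kept.map { $0.name.count }.max() ?? 0) + 2
    return kept
        // Sort by absolute time saved or lost, not by percentage, so a
        // tiny, noisy benchmark cannot dominate the top of the list.
        .sorted { $0.deltaSeconds < $1.deltaSeconds }
        .map { comparison in
            let padding = String(repeating: " ", count: width - comparison.name.count)
            // One decimal place on the percentage.
            let percent = String(format: "%.1f%%", comparison.deltaPercent)
            return comparison.name + padding + percent
        }
}

let toyRows = toyReport([
    ToyComparison(name: "Html", baselineSeconds: 0.010, latestSeconds: 0.008),
    ToyComparison(name: "Tiny", baselineSeconds: 0.0001, latestSeconds: 0.0002),
    ToyComparison(name: "Email", baselineSeconds: 0.5, latestSeconds: 0.45),
    ToyComparison(name: "Email_NS", baselineSeconds: 0.2, latestSeconds: 0.2),
])
toyRows.forEach { print($0) }
```

With this sample data, sorting by percentage would put Html ahead of Email, while sorting by seconds puts Email first because it saves far more wall-clock time.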
…acter classes in unicode scalars mode (swiftlang#511)

- Adds AsciiBitset as a conditional optimization for custom character classes that contain only ASCII characters
- Adds CompileOptions to turn off optimizations
- Adds basic infrastructure for testing whether compilation emitted certain instructions and whether the optimized regex returns the same result as the unoptimized one

Co-authored-by: Michael Ilseman <[email protected]>
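
For readers unfamiliar with the idea behind an ASCII bitset, the sketch below shows one possible shape, assuming two 64-bit words cover the 128 ASCII scalars. `ToyASCIIBitset` and `toyLinearContains` are hypothetical names; the actual `AsciiBitset` in #511 may be laid out differently and handles cases (ranges, inversion, case-insensitivity) that this sketch ignores.

```swift
// Toy model only: a 128-bit membership set for ASCII scalars.
struct ToyASCIIBitset {
    private var low: UInt64 = 0   // scalar values 0...63
    private var high: UInt64 = 0  // scalar values 64...127

    // Fails when any member is non-ASCII, so a caller can fall back to
    // the general, unoptimized matching path.
    init?(_ members: String) {
        for scalar in members.unicodeScalars {
            guard scalar.isASCII else { return nil }
            let value = UInt64(scalar.value)
            if value < 64 {
                low |= UInt64(1) << value
            } else {
                high |= UInt64(1) << (value - 64)
            }
        }
    }

    // Membership is a shift and a mask, with no loop over the members.
    func contains(_ scalar: Unicode.Scalar) -> Bool {
        guard scalar.isASCII else { return false }
        let value = UInt64(scalar.value)
        if value < 64 {
            return ((low >> value) & 1) == 1
        } else {
            return ((high >> (value - 64)) & 1) == 1
        }
    }
}

// Unoptimized reference: a linear scan over the members.
func toyLinearContains(_ members: String, _ scalar: Unicode.Scalar) -> Bool {
    members.unicodeScalars.contains(scalar)
}

let toyVowels = ToyASCIIBitset("aeiouAEIOU")!
print(toyVowels.contains("e"), toyVowels.contains("z"))
```

In the spirit of the testing infrastructure described in the list above, the bitset can be checked against the linear scan over every ASCII scalar to confirm that the optimized and unoptimized paths agree.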
@milseman
Member

Testing the combination of this on top of #531 in swiftlang/swift#59817

stephentyrone self-requested a review July 1, 2022 00:59
milseman merged commit 857b6f3 into swiftlang:swift/release/5.7 Jul 1, 2022