Skip to content

Remove most consumer functions #660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 27, 2024

Conversation

natecook1000
Copy link
Member

This is based heavily off the work in #590, rebased onto main, with some changes to remove even more consumer uses. Consumer functions only have two remaining uses: non-ASCII ranges and Unicode lookups (for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations around their contents, with set operations emitted as instructions instead of implemented via consumer function.

This is based heavily off the work in swiftlang#590, rebased onto main, with
some changes to remove even more consumer uses. Consumer functions
only have two remaining uses: non-ASCII ranges and Unicode lookups
(for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations
around their contents, with set operations emitted as instructions
instead of implemented via consumer function.
@natecook1000 natecook1000 requested a review from milseman April 14, 2023 02:53
@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000
Copy link
Member Author

@swift-ci Please test

@rctcwyvrn
Copy link
Contributor

Oh nice, glad that PR was put to use 🎉

}
result.append(contentsOf: characters.map { .atom(.char($0)) })
result.append(contentsOf: scalars.map { .atom(.scalar($0)) })
return result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check, do we want to need to preserve ordering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to preserve this ordering — at this level, everything is a union.

case let .intersection(lhs, rhs):
return lhs.guaranteesForwardProgress && rhs.guaranteesForwardProgress
case let .subtraction(lhs, _):
return lhs.guaranteesForwardProgress
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the LHS's forward progress member is subtracted by the RHS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually haven't been able to get a trivia-only CCC to parse. Do you have a way of constructing one? If not, CCCs might just always guarantee forward progress.

Note that this is "guarantees forward progress" only upon matching, so even if we continue to allow empty CCCs (like [[a]--[a]]), that would just fail to match against any character, not match against a zero length position. Some regex engines (e.g. Rust) prohibit empty character classes, but others just treat them like something that's impossible to match (e.g. ICU/NSRE). Since we're currently doing the second, we could eventually eliminate branches that include these if we recognize them.

case let .subtraction(lhs, _):
return lhs.guaranteesForwardProgress
case let .symmetricDifference(lhs, rhs):
return lhs.guaranteesForwardProgress && rhs.guaranteesForwardProgress
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, what if the symmetric difference only contains trivia?

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In-progress review

} else {
updatedCCC = ccc
}
let filteredMembers = updatedCCC.members.filter({!$0.isOnlyTrivia})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note: I think it makes sense for this to be a single operation that will recursively traverse and produce a code-gen-relevant representation. That will also help with testing as we can test that the flattened output is properly associated and has the members we expect. It will also help divide out linearizing the set structure into the instruction stream from removing consumer functions themselves.

E.g.

extension DSLTree.CustomCharacterClass {
  enum CodeGenClassMember {
    case bitset: ASCIIBitset
    case consumerFunction: () -> () ...
    case range: ...
    case character: Character
    case scalar: Unicode.Scalar
  }

  enum Operation {
    case union, intersection, ...
    case endOfOps
  }

  func codeGenRepresentation() -> [(CodeGenClassMember, Operation)]
}

Comment on lines +1073 to +1086
let done = builder.makeAddress()
for member in filteredMembers.dropLast() {
let next = builder.makeAddress()
builder.buildSave(next)
try emitCCCMember(member)
builder.buildClear()
builder.buildFail()
builder.label(next)
}
builder.buildSave(done)
try emitCCCMember(filteredMembers.last!)
builder.buildClear()
builder.buildFail()
builder.label(done)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add an inverted parameter to emitAlternationGen to cover this case as well...

}
result.append(contentsOf: characters.map { .atom(.char($0)) })
result.append(contentsOf: scalars.map { .atom(.scalar($0)) })
return result
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to preserve this ordering — at this level, everything is a union.

case let .intersection(lhs, rhs):
return lhs.guaranteesForwardProgress && rhs.guaranteesForwardProgress
case let .subtraction(lhs, _):
return lhs.guaranteesForwardProgress
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually haven't been able to get a trivia-only CCC to parse. Do you have a way of constructing one? If not, CCCs might just always guarantee forward progress.

Note that this is "guarantees forward progress" only upon matching, so even if we continue to allow empty CCCs (like [[a]--[a]]), that would just fail to match against any character, not match against a zero length position. Some regex engines (e.g. Rust) prohibit empty character classes, but others just treat them like something that's impossible to match (e.g. ICU/NSRE). Since we're currently doing the second, we could eventually eliminate branches that include these if we recognize them.

@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000
Copy link
Member Author

Benchmark results for this change on latest show big improvements for anything moving off consumer functions:

=== Regressions ================================================================
- CompilerMessages_All                    88.1ms    86.9ms     1.22ms       1.4%
- DiceRollsInText_All_Scalar              42.2ms    41.6ms      604µs       1.5%
- EmojiRegex_All_Scalar                   45.3ms    44.8ms      487µs       1.1%
- AnchoredNotFound_All                    13.5ms    13.1ms      374µs       2.8%
- EmailLookaheadNoMatches_All_Scalar      20.5ms    20.1ms      336µs       1.7%
- EmailLookaheadNoMatches_All             22.7ms    22.4ms      329µs       1.5%
- ReluctantQuantWithTerminal_Whole_Scalar 6.44ms    6.14ms      299µs       4.9%
- NotFound_All                            6.06ms    5.79ms      276µs       4.8%
- BasicBuiltinCharacterClass_All           7.7ms    7.48ms      227µs       3.0%
- EmailBuiltinCharacterClass_All          11.6ms    11.4ms      205µs       1.8%
- AnchoredNotFound_First_Scalar           5.69ms    5.49ms      202µs       3.7%
- HangulSyllable_All                         6ms     5.8ms      196µs       3.4%
- AnchoredNotFound_First                   8.8ms    8.63ms      169µs       2.0%
- Numbers_All_Scalar                      4.66ms     4.5ms      164µs       3.6%
- BasicBuiltinCharacterClass_All_Scalar   6.52ms    6.38ms      142µs       2.2%
- EmailLookahead_All_Scalar               16.8ms    16.7ms      125µs       0.8%
- EmailBuiltinCharacterClass_All_Scalar   11.5ms    11.3ms      125µs       1.1%
- NotFound_All_Scalar                        5ms    4.88ms      121µs       2.5%
- BasicRangeCCC_All_Scalar                5.95ms    5.85ms      102µs       1.7%
- Css_All                                 3.34ms    3.24ms      101µs       3.1%
- EmailLookaheadList                      3.83ms    3.73ms     99.2µs       2.7%
- ReluctantQuantWithTerminal_Whole        6.26ms    6.16ms     96.8µs       1.6%
- EmailLookaheadList_Scalar               3.63ms    3.53ms     93.8µs       2.7%
- BasicCCC_All_Scalar                     5.34ms    5.28ms     59.7µs       1.1%
- AnchoredNotFound_All_Scalar             9.07ms    9.01ms     59.5µs       0.7%
- LiteralSearch_All_Scalar                4.67ms    4.62ms     48.2µs       1.0%
- HangulSyllable_First                    2.76ms    2.72ms     37.1µs       1.4%
- HangulSyllable_First_Scalar             2.37ms    2.34ms       33µs       1.4%
- IPv6Address                              2.7ms    2.68ms     23.1µs       0.9%
- EagarQuantWithTerminal_Whole_Scalar      382µs     364µs     18.7µs       5.1%
=== Improvements ===============================================================
- EmailRFCNoMatches_All                     52ms     157ms     -105ms     -66.9%
- EmailRFCNoMatches_All_Scalar            48.4ms     126ms    -77.6ms     -61.6%
- symDiffCCC_All                          20.9ms    53.7ms    -32.8ms     -61.0%
- symDiffCCC_All_Scalar                     21ms    53.7ms    -32.8ms     -61.0%
- EmailRFC_All                            34.8ms    64.4ms    -29.6ms     -45.9%
- EmailRFC_All_Scalar                     34.1ms    47.7ms    -13.6ms     -28.5%
- IntersectionCCC_All                      7.8ms    19.8ms      -12ms     -60.7%
- IntersectionCCC_All_Scalar              7.82ms    19.7ms    -11.9ms     -60.4%
- SubtractionCCC_All                      8.61ms    19.1ms    -10.5ms     -55.0%
- SubtractionCCC_All_Scalar               8.71ms    19.2ms    -10.5ms     -54.7%
- MACAddress                              2.56ms    2.65ms    -92.9µs      -3.5%
- DiceNotation                            4.87ms    4.95ms    -85.5µs      -1.7%
- CaseInsensitiveCCC_All_Scalar           5.71ms    5.79ms    -78.7µs      -1.4%
- GraphemeBreakNoCap_All                  3.15ms    3.22ms    -72.8µs      -2.3%
- GraphemeBreakNoCap_All_Scalar           3.03ms    3.09ms    -67.1µs      -2.2%
- LiteralSearchNotFound_All_Scalar        4.35ms    4.41ms    -56.3µs      -1.3%
- IPv4Address                             2.42ms    2.47ms    -52.3µs      -2.1%
- IPv4Address_Scalar                      2.29ms    2.31ms    -16.3µs      -0.7%

@natecook1000 natecook1000 force-pushed the shrink_consumer_interface branch from bef686b to e719e63 Compare March 12, 2024 04:03
When searching in a substring with an endIndex that isn't on a character boundary,
advancing from the penultimate character position to the substring's endIndex can end
up failing, since advancing by a character finds the character-aligned index that is
_past_ the substring end. This change handles that case.

# Conflicts:
#	Sources/_StringProcessing/ConsumerInterface.swift
@natecook1000 natecook1000 force-pushed the shrink_consumer_interface branch from e719e63 to 4b382cc Compare March 12, 2024 04:15
@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000 natecook1000 merged commit 12db649 into swiftlang:main Mar 27, 2024
@natecook1000 natecook1000 deleted the shrink_consumer_interface branch March 27, 2024 02:09
natecook1000 added a commit that referenced this pull request Mar 27, 2024
This is based heavily off the work in #590, rebased onto main, with
some changes to remove even more consumer uses. Consumer functions
only have two remaining uses: non-ASCII ranges and Unicode lookups
(for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations
around their contents, with set operations emitted as instructions
instead of implemented via consumer function.
natecook1000 added a commit that referenced this pull request Mar 27, 2024
This is based heavily off the work in #590, rebased onto main, with
some changes to remove even more consumer uses. Consumer functions
only have two remaining uses: non-ASCII ranges and Unicode lookups
(for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations
around their contents, with set operations emitted as instructions
instead of implemented via consumer function.
natecook1000 added a commit that referenced this pull request Mar 27, 2024
This is based heavily off the work in #590, rebased onto main, with
some changes to remove even more consumer uses. Consumer functions
only have two remaining uses: non-ASCII ranges and Unicode lookups
(for things like general category, binary properties, name, etc.).

This change primarily treats custom character classes as alternations
around their contents, with set operations emitted as instructions
instead of implemented via consumer function.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants