-
Notifications
You must be signed in to change notification settings - Fork 49
Remove most consumer functions #660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove most consumer functions #660
Conversation
This is based heavily off the work in swiftlang#590, rebased onto main, with some changes to remove even more consumer uses. Consumer functions only have two remaining uses: non-ASCII ranges and Unicode lookups (for things like general category, binary properties, name, etc.). This change primarily treats custom character classes as alternations around their contents, with set operations emitted as instructions instead of implemented via consumer function.
@swift-ci Please test |
@swift-ci Please test |
Oh nice, glad that PR was put to use 🎉 |
} | ||
result.append(contentsOf: characters.map { .atom(.char($0)) }) | ||
result.append(contentsOf: scalars.map { .atom(.scalar($0)) }) | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to double check, do we want to need to preserve ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to preserve this ordering — at this level, everything is a union.
case let .intersection(lhs, rhs): | ||
return lhs.guaranteesForwardProgress && rhs.guaranteesForwardProgress | ||
case let .subtraction(lhs, _): | ||
return lhs.guaranteesForwardProgress |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the LHS's forward progress member is subtracted by the RHS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually haven't been able to get a trivia-only CCC to parse. Do you have a way of constructing one? If not, CCCs might just always guarantee forward progress.
Note that this is "guarantees forward progress" only upon matching, so even if we continue to allow empty CCCs (like [[a]--[a]]
), that would just fail to match against any character, not match against a zero length position. Some regex engines (e.g. Rust) prohibit empty character classes, but others just treat them like something that's impossible to match (e.g. ICU/NSRE). Since we're currently doing the second, we could eventually eliminate branches that include these if we recognize them.
case let .subtraction(lhs, _): | ||
return lhs.guaranteesForwardProgress | ||
case let .symmetricDifference(lhs, rhs): | ||
return lhs.guaranteesForwardProgress && rhs.guaranteesForwardProgress |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, what if the symmetric difference only contains trivia?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In-progress review
} else { | ||
updatedCCC = ccc | ||
} | ||
let filteredMembers = updatedCCC.members.filter({!$0.isOnlyTrivia}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick note: I think it makes sense for this to be a single operation that will recursively traverse and produce a code-gen-relevant representation. That will also help with testing as we can test that the flattened output is properly associated and has the members we expect. It will also help divide out linearizing the set structure into the instruction stream from removing consumer functions themselves.
E.g.
extension DSLTree.CustomCharacterClass {
enum CodeGenClassMember {
case bitset: ASCIIBitset
case consumerFunction: () -> () ...
case range: ...
case character: Character
case scalar: Unicode.Scalar
}
enum Operation {
case union, intersection, ...
case endOfOps
}
func codeGenRepresentation() -> [(CodeGenClassMember, Operation)]
}
let done = builder.makeAddress() | ||
for member in filteredMembers.dropLast() { | ||
let next = builder.makeAddress() | ||
builder.buildSave(next) | ||
try emitCCCMember(member) | ||
builder.buildClear() | ||
builder.buildFail() | ||
builder.label(next) | ||
} | ||
builder.buildSave(done) | ||
try emitCCCMember(filteredMembers.last!) | ||
builder.buildClear() | ||
builder.buildFail() | ||
builder.label(done) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add an inverted
parameter to emitAlternationGen
to cover this case as well...
} | ||
result.append(contentsOf: characters.map { .atom(.char($0)) }) | ||
result.append(contentsOf: scalars.map { .atom(.scalar($0)) }) | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to preserve this ordering — at this level, everything is a union.
case let .intersection(lhs, rhs): | ||
return lhs.guaranteesForwardProgress && rhs.guaranteesForwardProgress | ||
case let .subtraction(lhs, _): | ||
return lhs.guaranteesForwardProgress |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually haven't been able to get a trivia-only CCC to parse. Do you have a way of constructing one? If not, CCCs might just always guarantee forward progress.
Note that this is "guarantees forward progress" only upon matching, so even if we continue to allow empty CCCs (like [[a]--[a]]
), that would just fail to match against any character, not match against a zero length position. Some regex engines (e.g. Rust) prohibit empty character classes, but others just treat them like something that's impossible to match (e.g. ICU/NSRE). Since we're currently doing the second, we could eventually eliminate branches that include these if we recognize them.
@swift-ci Please test |
Benchmark results for this change on latest show big improvements for anything moving off consumer functions:
|
bef686b
to
e719e63
Compare
When searching in a substring with an endIndex that isn't on a character boundary, advancing from the penultimate character position to the substring's endIndex can end up failing, since advancing by a character finds the character-aligned index that is _past_ the substring end. This change handles that case. # Conflicts: # Sources/_StringProcessing/ConsumerInterface.swift
e719e63
to
4b382cc
Compare
@swift-ci Please test |
@swift-ci Please test |
@swift-ci Please test |
This is based heavily off the work in #590, rebased onto main, with some changes to remove even more consumer uses. Consumer functions only have two remaining uses: non-ASCII ranges and Unicode lookups (for things like general category, binary properties, name, etc.). This change primarily treats custom character classes as alternations around their contents, with set operations emitted as instructions instead of implemented via consumer function.
This is based heavily off the work in #590, rebased onto main, with some changes to remove even more consumer uses. Consumer functions only have two remaining uses: non-ASCII ranges and Unicode lookups (for things like general category, binary properties, name, etc.). This change primarily treats custom character classes as alternations around their contents, with set operations emitted as instructions instead of implemented via consumer function.
This is based heavily off the work in #590, rebased onto main, with some changes to remove even more consumer uses. Consumer functions only have two remaining uses: non-ASCII ranges and Unicode lookups (for things like general category, binary properties, name, etc.). This change primarily treats custom character classes as alternations around their contents, with set operations emitted as instructions instead of implemented via consumer function.
This is based heavily off the work in #590, rebased onto main, with some changes to remove even more consumer uses. Consumer functions only have two remaining uses: non-ASCII ranges and Unicode lookups (for things like general category, binary properties, name, etc.).
This change primarily treats custom character classes as alternations around their contents, with set operations emitted as instructions instead of implemented via consumer function.