Use a bitset for ascii-only character classes #511
Conversation
It did, in fact, not need @escaping. I was wondering why.

Using a regex with fewer matches (3k instead of 15k) and a similar number of members to … Using more elements resulted in an even larger improvement, as expected, because the old code would create a closure for every character in the character class.
Looking good so far; some early feedback.
mutating func matchBitset(
  _ bitset: DSLTree.CustomCharacterClass.AsciiBitset
) -> Bool {
  guard let cur = load(), bitset.matches(char: cur) else {
Future work: we can implement this on the UTF-8 view, but we'd have to handle grapheme breaking ourselves.
BTW, @natecook1000 what is the model for semantic mode processing around here?
If you want to try a fast check, IIRC you could have this at the top:
guard bitset.matches(input.utf8[currentPosition]),
input._isOnGraphemeBoundarySomething(input.utf8.index(after: currentPosition))
else {
That'd be for measuring or approximating the potential benefit. I think we'd want to have a more consistent series of helper functions surrounding sub-grapheme cluster processing though.
When matching with grapheme cluster semantics:
- this should only match a single-scalar ASCII character (unless inverted)
- it should advance to the next character after successfully matching
When matching with Unicode scalar semantics:
- this should only check the current Unicode scalar value
- it should advance to the next Unicode scalar value after successfully matching
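The two advancement rules above can be sketched as follows. This is a minimal illustration with assumed names (`matchedRange`, `graphemeSemantics`, the `matches` closure); it is not the engine's actual API:

```swift
// Sketch only: how a bitset match would advance under each semantic mode.
func matchedRange(
  in input: String, at currentPosition: String.Index,
  graphemeSemantics: Bool, matches: (UInt8) -> Bool
) -> String.Index? {
  if graphemeSemantics {
    // Only a single-scalar ASCII Character may match; advance by Character.
    let char = input[currentPosition]
    guard char.unicodeScalars.count == 1,
          let ascii = char.asciiValue,
          matches(ascii) else { return nil }
    return input.index(after: currentPosition)
  } else {
    // Scalar semantics: check only the current Unicode scalar; advance by scalar.
    let scalars = input.unicodeScalars
    let scalar = scalars[currentPosition]
    guard scalar.isASCII, matches(UInt8(scalar.value)) else { return nil }
    return scalars.index(after: currentPosition)
  }
}
```

Under this sketch, `[a]` against "a\u{301}" fails in grapheme mode (the Character has two scalars) but succeeds in scalar mode (the first scalar is "a").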
Yes, but what is the model for the engine? The engine isn't querying options on every loop.
So IIUC, this optimization only applies to grapheme-semantic mode right now, which is an unfortunate limitation. Lily, can you make sure to write a test for this somehow? We may need to revise our compilation testing approach.
I think another approach is to have a bit in some instructions or payloads (whether that is really a dedicated bit or a virtual bit because we expand opcodes around it) that signals whether it should end in a grapheme break check. That would allow us to have a specialized matchScalar instruction, and we'd not bother to check grapheme boundaries for scalar sequences that we statically know are NFC invariant and don't need a check between every scalar.
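One way to picture that payload bit; a sketch with made-up names (`MatchScalarPayload`), not the engine's actual instruction encoding:

```swift
// Sketch only: pack an ASCII value and a "verify grapheme boundary after
// consuming" flag into one payload word. Names are hypothetical.
struct MatchScalarPayload {
  private var bits: UInt32
  init(asciiValue: UInt8, checkBoundary: Bool) {
    bits = UInt32(asciiValue) | (checkBoundary ? 1 << 8 : 0)
  }
  var asciiValue: UInt8 { UInt8(bits & 0xFF) }
  // Scalar runs that are statically known to be NFC-invariant could clear
  // this bit and skip the per-scalar boundary check entirely.
  var checkBoundary: Bool { bits & (1 << 8) != 0 }
}
```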
I think it would make sense to do optimizations for scalar mode (new instructions, new processor functions) as a future PR; for now I just made it not generate the bitset when in scalar mode. I also added some support to Compiler and CompileTests to check for the existence of certain opcodes under different semantic levels.
("💿", true),
("A", true),
("a", false))
Could you add tests around the CR-LF corner case? Also for the cases where an ASCII character class is matched against an input that has a decomposed Character, such as "a\u{301}"?
Similarly, test that inversion and case line up
I added some, does that cover the cases you were thinking of?
Can you add a test against the input of "a\u{301}"? That is, we need to guarantee the grapheme cluster boundary after the "a".
Oh, I did add one; the diff it's showing is outdated for some reason:
matchTest(#"[^a]"#,
("💿", true),
("a\u{301}", true),
("A", true),
("a", false))
matchTest("[a]",
("a\u{301}", false))
Right after this lands, can you add case-insensitive test variants?
It was already being folded into the value on initialization, so there's no reason to keep it.
matchTest("[\r\n]",
  ("\r\n", true),
  ("\n", false),
  ("\r", false))
@natecook1000 does this depend on matching semantics mode? That is, if we see [\r\n], is that either CR or LF, or is it exactly CR-LF, or does it depend on mode?
Can you add a scalar-semantic version of these tests? That would help illustrate the grapheme boundary issue if there is one.
Overall I'm pro this change. I have some concerns and want to make sure we're not doing the wrong thing in scalar semantic mode.
@swift-ci test
Cleanup for future noted, otherwise LGTM
builder.buildConsume(by: consumer)
if let asciiBitset = ccc.asAsciiBitset(options),
   options.semanticLevel == .graphemeCluster {
  // future work: add a bit to .matchBitset to consume either a character
Can you make sure we do this soon? I want to have as much of a unified performance story between grapheme semantic and scalar semantic as possible. Ideally a lot of perf analysis will be downgrading grapheme to scalar operations as permitted.
Having two different paths also complicates testing: many tests that were exhaustively exercising the engine now only test one path through it. We'll need to meet to discuss testing and validation as we add special-case optimizations.
switch semanticLevel?.base {
case .graphemeCluster:
  let sequence = AST.MatchingOptionSequence(adding: [.init(.graphemeClusterSemantics, location: .fake)])
  dsl = DSLTree(.nonCapturingGroup(.init(ast: .changeMatchingOptions(sequence)), ast.dslTree.root))
Future: we'll want a DSLTree node for changing options that's not via groups
      return idx
    }
  }
  return nil
This looks ok to me for now.
Future: we seriously need to kill this code. The FIXME above states that it's only called for character classes, yet it is emitting atom consumers, so there's some technical debt here.
var singleScalarASCIIValue: UInt8? {
  switch kind {
  case let .char(c) where c != "\r\n":
Future cleanup: something like the below to consolidate logic
extension Character {
var _singleScalarASCIIValue: UInt8? { ... }
}
Again, looking good. You can feel free to merge or keep working off the same PR
@swift-ci test
…acter classes in unicode scalars mode (swiftlang#511)
- Add AsciiBitset as a conditional optimization for custom character classes that only contain ascii characters
- Adds CompileOptions to turn off optimizations
- Adds basic testing infrastructure for testing whether compilation emitted certain instructions and whether the optimized regex returned the same result as the unoptimized one

Co-authored-by: Michael Ilseman <[email protected]>
In the cases where a custom character class is made up of ASCII characters, scalars, and ranges, use a 128-bit set to represent the character class instead of the mess of closures we were using before.
Branched from #509
Results show big improvements on the custom character class microbenchmarks, and smaller improvements in the real-world benchmarks where the closure overhead represented less of the runtime, notably the email regexes, whose character classes contain many non-range items.
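For illustration, the 128-bit set can be pictured as two 64-bit words, one bit per ASCII code point. This is a sketch under assumed names; the real `AsciiBitset` lives in `DSLTree.CustomCharacterClass` and also handles inversion and case-insensitivity:

```swift
// Sketch only: an ASCII-only character class as a 128-bit set,
// stored in two UInt64 words (values 0..<64 and 64..<128).
struct AsciiBitsetSketch {
  var low: UInt64 = 0   // bits for code points 0..<64
  var high: UInt64 = 0  // bits for code points 64..<128

  mutating func insert(_ v: UInt8) {
    precondition(v < 128, "ASCII only")
    if v < 64 { low |= 1 << UInt64(v) }
    else      { high |= 1 << UInt64(v - 64) }
  }

  // O(1) membership test: one shift and one mask, no closure calls.
  func matches(_ v: UInt8) -> Bool {
    guard v < 128 else { return false }
    return v < 64
      ? low & (1 << UInt64(v)) != 0
      : high & (1 << UInt64(v - 64)) != 0
  }
}

// Usage: build the set for [a-z] once at compile time, then test bytes.
var set = AsciiBitsetSketch()
for b in UInt8(ascii: "a")...UInt8(ascii: "z") { set.insert(b) }
```

The point of the representation is that each member (character, scalar, or range) becomes bit manipulation at regex-compile time, so matching is a constant-time lookup rather than a walk over per-member closures.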