Use a bitset for ascii-only character classes #511

rctcwyvrn · 2022-06-22T21:52:21Z

When a custom character class is made up only of ASCII characters, scalars, and ranges, represent it with a 128-bit bitset instead of the chain of closures we were using before.
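As a rough illustration of the idea (not the PR's actual `AsciiBitset`; the type name `ASCIIBitsetSketch` and its members are hypothetical), a 128-bit set can be held in two 64-bit words, so that membership is a shift and a mask rather than a call per class member:

```swift
/// Hypothetical sketch: one bit per ASCII code point, split over two words.
struct ASCIIBitsetSketch {
  var low: UInt64 = 0    // bits for code points 0..<64
  var high: UInt64 = 0   // bits for code points 64..<128

  /// Adds a single ASCII code point to the set.
  mutating func insert(_ value: UInt8) {
    precondition(value < 128, "only ASCII values are representable")
    if value < 64 {
      low |= UInt64(1) << UInt64(value)
    } else {
      high |= UInt64(1) << UInt64(value - 64)
    }
  }

  /// Adds every code point in a closed ASCII range, such as `0...9`.
  mutating func insert(_ range: ClosedRange<UInt8>) {
    for value in range { insert(value) }
  }

  /// Tests membership with a shift and a mask; non-ASCII values never match.
  func contains(_ value: UInt8) -> Bool {
    guard value < 128 else { return false }
    if value < 64 {
      return (low >> UInt64(value)) & 1 == 1
    } else {
      return (high >> UInt64(value - 64)) & 1 == 1
    }
  }
}

// Usage: a class roughly like [aeiou0-9].
var sketchSet = ASCIIBitsetSketch()
for byte in "aeiou".utf8 { sketchSet.insert(byte) }
sketchSet.insert(UInt8(ascii: "0")...UInt8(ascii: "9"))
```

The lookup cost here does not depend on how many members the class has, which is the property the benchmarks below exercise.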

Branched from #509

Results show big improvements on the custom character class microbenchmarks and smaller improvements on the real-world benchmarks, where the closure overhead represented less of the runtime. The most notable real-world gains are in the email regexes, which contain character classes with many non-range items.

=== Improvements ====================================================
- basicCCC				11.2ms	33.5ms	-22.3ms	-67.0%
- basicRangeCCC				12ms	13.5ms	-1.58ms	-12.0%
- invertedCCC				30.6ms	48.9ms	-18.2ms	-37.0%
- caseInsensitiveCCC			13.1ms	124ms	-111ms	-89.0%

- GraphemeBreakNoCapFirst		85.5µs	122µs	-37µs	-30.0%
- emailLookaheadNoMatchesFirst		62.5ms	88.3ms	-25.8ms	-29.0%
- emailRFCAll				51.4ms	54.3ms	-2.89ms	-5.0%
- cssAll				5.03ms	5.33ms	-297µs	-6.0%
- emailLookaheadAll			86.1ms	103ms	-16.6ms	-16.0%
- GraphemeBreakNoCapAll			9.49ms	10.1ms	-621µs	-6.0%
- emailLookaheadNoMatchesAll		62.5ms	88.3ms	-25.7ms	-29.0%
- emailLookaheadFirst			19.2µs	22.7µs	-3.5µs	-15.0%
- emailRFCFirst				8.33µs	9.75µs	-1.42µs	-15.0%

rctcwyvrn · 2022-06-22T22:34:26Z

I was wondering why invertedCCC had less of an improvement than expected compared to basicCCC; it seems to come down to the number of matches.

Using a regex with fewer matches (3k instead of 15k) and a similar number of members to basicCCC yielded a similar level of improvement:
let inverted = #"[^ABCDEeIOU1234 ]{4,6}"#

invertedCCC		16.3ms	53ms	-36.7ms	-69.0%

Using more elements resulted in an even larger improvement, as expected, because the old code would create a closure for every character in the character class:

let inverted = #"[^abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ0123456789 ]{4,6}"#

invertedCCC		11.1ms	123ms	-112ms	-91.0%
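To make the scaling argument concrete, here is a minimal sketch contrasting the two strategies. It is only a stand-in for the general shape of "one closure per member" versus "one bitset"; both `makeClosureMatcher` and `makeBitsetMatcher` are hypothetical names and neither is the engine's real code.

```swift
// Hypothetical: one closure per member, tried in order.
func makeClosureMatcher(members: String, inverted: Bool) -> (Character) -> Bool {
  var checks: [(Character) -> Bool] = []
  for member in members {
    checks.append({ candidate in candidate == member })
  }
  return { c in
    // Up to `members.count` closure calls per input character.
    let hit = checks.contains { check in check(c) }
    return hit != inverted
  }
}

// Hypothetical: all ASCII members folded into two 64-bit words up front.
func makeBitsetMatcher(members: String, inverted: Bool) -> (Character) -> Bool {
  var low: UInt64 = 0
  var high: UInt64 = 0
  for byte in members.utf8 {
    precondition(byte < 128, "sketch handles ASCII members only")
    if byte < 64 {
      low |= UInt64(1) << UInt64(byte)
    } else {
      high |= UInt64(1) << UInt64(byte - 64)
    }
  }
  let words = (low: low, high: high)
  return { c in
    // CR-LF reports an asciiValue even though it is two scalars, so exclude it.
    guard c != "\r\n", let value = c.asciiValue else { return inverted }
    let word = value < 64 ? words.low : words.high
    let hit = (word >> UInt64(value % 64)) & 1 == 1   // constant-time lookup
    return hit != inverted
  }
}

// The two strategies should agree on every input.
let sketchMembers = "ABCDEeIOU1234 "
let viaClosures = makeClosureMatcher(members: sketchMembers, inverted: true)
let viaBitset = makeBitsetMatcher(members: sketchMembers, inverted: true)
```

With an inverted class, every input character that is not a member has to be rejected by each member closure before it can match, so the closure version's cost grows with the class size while the bitset lookup stays constant.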

milseman

Looking good so far; some early feedback.

Sources/_StringProcessing/ConsumerInterface.swift

milseman · 2022-06-23T13:29:14Z

Sources/_StringProcessing/Engine/Processor.swift

+  mutating func matchBitset(
+    _ bitset: DSLTree.CustomCharacterClass.AsciiBitset
+  ) -> Bool {
+    guard let cur = load(), bitset.matches(char: cur) else {


Future work: we can implement this on the UTF-8 view, but we'd have to handle grapheme breaking ourselves.

BTW, @natecook1000 what is the model for semantic mode processing around here?

If you want to try a fast check, IIRC you could have this at the top:

guard bitset.matches(input.utf8[currentPosition]), input._isOnGraphemeBoundarySomething(input.utf8.index(after: currentPosition)) else {

That'd be for measuring or approximating the potential benefit. I think we'd want to have a more consistent series of helper functions surrounding sub-grapheme cluster processing though.
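A minimal sketch of that kind of fast check, using only public `String` API, might look like the following. The function name `matchASCIIFast` and the `isMember` predicate are hypothetical, and the grapheme-boundary test is approximated by comparing the next UTF-8 index with the next `Character` index, which assumes `position` is already on a `Character` boundary.

```swift
/// Hypothetical fast path: test one UTF-8 byte, then confirm that a grapheme
/// cluster boundary immediately follows it.
func matchASCIIFast(
  _ input: String,
  at position: String.Index,
  isMember: (UInt8) -> Bool
) -> String.Index? {
  guard position < input.endIndex else { return nil }
  let byte = input.utf8[position]
  guard byte < 0x80, isMember(byte) else { return nil }
  let nextByte = input.utf8.index(after: position)
  // `String.index(after:)` steps over a whole Character, so the two indices
  // agree only when the Character consists of this single ASCII byte.
  guard nextByte == input.index(after: position) else { return nil }
  return nextByte
}

let isLowercaseA: (UInt8) -> Bool = { $0 == UInt8(ascii: "a") }
let plainInput = "abc"
let decomposedInput = "a\u{301}bc"
let plainResult = matchASCIIFast(plainInput, at: plainInput.startIndex, isMember: isLowercaseA)
let decomposedResult = matchASCIIFast(decomposedInput, at: decomposedInput.startIndex, isMember: isLowercaseA)
```

This sketch only covers the non-inverted case; an inverted class would still need a slower path for non-ASCII input.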

When matching with grapheme cluster semantics:

- this should only match a single-scalar ASCII character (unless inverted)
- it should advance to the next character after successfully matching

When matching with Unicode scalar semantics:

- this should only check the current Unicode scalar value
- it should advance to the next Unicode scalar value after successfully matching
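The two models above can be sketched side by side. This is an illustration under assumptions, not the engine's code: `SemanticLevelSketch`, `matchASCIIMember`, and the `isMember` predicate are hypothetical, and only the non-inverted case is handled.

```swift
enum SemanticLevelSketch { case graphemeCluster, unicodeScalar }

/// Hypothetical: returns the position after a successful match, or nil.
func matchASCIIMember(
  _ input: String,
  at position: String.Index,
  level: SemanticLevelSketch,
  isMember: (UInt8) -> Bool
) -> String.Index? {
  guard position < input.endIndex else { return nil }
  switch level {
  case .graphemeCluster:
    let character = input[position]
    // Only a single-scalar ASCII Character can be a member.
    guard character.unicodeScalars.count == 1,
          let scalar = character.unicodeScalars.first,
          scalar.isASCII,
          isMember(UInt8(scalar.value))
    else { return nil }
    return input.index(after: position)                 // next Character
  case .unicodeScalar:
    // Only the current scalar is inspected; what follows it is irrelevant.
    let scalar = input.unicodeScalars[position]
    guard scalar.isASCII, isMember(UInt8(scalar.value)) else { return nil }
    return input.unicodeScalars.index(after: position)  // next scalar
  }
}

let accented = "a\u{301}"
let matchesA: (UInt8) -> Bool = { $0 == UInt8(ascii: "a") }
let graphemeResult = matchASCIIMember(
  accented, at: accented.startIndex, level: .graphemeCluster, isMember: matchesA)
let scalarResult = matchASCIIMember(
  accented, at: accented.startIndex, level: .unicodeScalar, isMember: matchesA)
```

For the decomposed input, the grapheme-level call fails because the `Character` has two scalars, while the scalar-level call succeeds and stops just before the combining mark.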

Yes, but what is the model for the engine? The engine isn't querying options on every loop.

So IIUC, this optimization only applies to grapheme-semantic mode right now, which is an unfortunate limitation. Lily, can you make sure to write a test for this somehow? We may need to revise our compilation testing approach.

I think another approach is to have a bit in some instructions or payloads (whether that is really a dedicated bit or a virtual bit because we expand opcodes around it) that signals whether it should end in a grapheme break check. That would allow us to have a specialized matchScalar instruction, and we'd not bother to check grapheme boundaries for scalar sequences that we statically know are NFC invariant and don't need a check between every scalar.

I think it would make sense to do optimizations for scalar mode (new instructions, new processor functions) in a future PR. For now, I just made it not generate the bitset when in scalar mode.

I also added some support to Compiler and CompileTests to check for the existence of certain opcodes under different semantic levels.

Sources/_StringProcessing/Regex/DSLTree.swift

milseman · 2022-06-23T13:36:53Z

Tests/RegexTests/MatchTests.swift

+              ("💿", true),
+              ("A", true),
+              ("a", false))
+


Could you add tests around the CR-LF corner case? Also for the cases where an ASCII character class is matched against an input that has a decomposed Character, such as "a\u{301}"?

Similarly, test that inversion and case line up

I added some, does that cover the cases you were thinking of?

Can you add a test against the input of "a\u{301}"? That is, we need to guarantee the grapheme cluster boundary after the "a".

Oh, I did add one; the diff it's showing is outdated for some reason.

matchTest(#"[^a]"#, ("💿", true), ("a\u{301}", true), ("A", true), ("a", false))
matchTest("[a]", ("a\u{301}", false))

Right after this lands, can you add case-insensitive test variants?

Sources/_StringProcessing/Regex/DSLTree.swift

milseman · 2022-06-24T16:50:31Z

Tests/RegexTests/MatchTests.swift

+    matchTest("[\r\n]",
+      ("\r\n", true),
+      ("\n", false),
+      ("\r", false))


@natecook1000 does this depend on matching semantics mode? That is, if we see [\r\n] is that either CR or LF, or is it exactly CR-LF, or does it depend on mode?

Can you add a scalar-semantic version of these tests? That would help illustrate the grapheme boundary issue if there is one.
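For context on why the mode matters here, the following sketch shows how Swift's `String` itself views CR-LF. It only demonstrates standard-library behavior and makes no claim about what the regex engine should do with `[\r\n]`; the variable names are illustrative.

```swift
// The pattern text from the test above, written as an ordinary string literal.
let patternText = "[\r\n]"

// As Characters, CR-LF is a single element: "[", "\r\n", "]".
let characterCount = patternText.count              // 3
// As Unicode scalars, CR and LF are separate: "[", CR, LF, "]".
let scalarCount = patternText.unicodeScalars.count  // 4

// A set of the individual Characters CR and LF does not contain CR-LF.
let individualMembers: Set<Character> = ["\r", "\n"]
let containsCRLF = individualMembers.contains(Character("\r\n"))  // false

// At the scalar level, both halves of CR-LF are members on their own.
let individualScalars: Set<Unicode.Scalar> = ["\r", "\n"]
let scalarHits = "\r\n".unicodeScalars.map { individualScalars.contains($0) }
```

Under grapheme cluster semantics the class written this way holds one member, the CR-LF `Character`, whereas a scalar-level reading would see two members.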

milseman · 2022-06-24T16:51:21Z

Tests/RegexTests/MatchTests.swift

+              ("💿", true),
+              ("A", true),
+              ("a", false))
+


Can you add a test against the input of "a\u{301}"? That is, we need to guarantee the grapheme cluster boundary after the "a".

milseman

Overall I'm pro this change. I have some concerns and want to make sure we're not doing the wrong thing in scalar semantic mode.

Sources/_StringProcessing/ConsumerInterface.swift

milseman · 2022-06-27T13:49:26Z

Sources/_StringProcessing/Engine/Processor.swift

+  mutating func matchBitset(
+    _ bitset: DSLTree.CustomCharacterClass.AsciiBitset
+  ) -> Bool {
+    guard let cur = load(), bitset.matches(char: cur) else {


So IIUC, this optimization only applies to grapheme-semantic mode right now, which is an unfortunate limitation. Lily, can you make sure to write a test for this somehow? We may need to revise our compilation testing approach.

milseman · 2022-06-27T13:51:24Z

Sources/_StringProcessing/Engine/Processor.swift

+  mutating func matchBitset(
+    _ bitset: DSLTree.CustomCharacterClass.AsciiBitset
+  ) -> Bool {
+    guard let cur = load(), bitset.matches(char: cur) else {


I think another approach is to have a bit in some instructions or payloads (whether that is really a dedicated bit or a virtual bit because we expand opcodes around it) that signals whether it should end in a grapheme break check. That would allow us to have a specialized matchScalar instruction, and we'd not bother to check grapheme boundaries for scalar sequences that we statically know are NFC invariant and don't need a check between every scalar.

Sources/_StringProcessing/Regex/DSLTree.swift

milseman · 2022-06-27T13:53:29Z

Tests/RegexTests/MatchTests.swift

+    matchTest("[\r\n]",
+      ("\r\n", true),
+      ("\n", false),
+      ("\r", false))


Can you add a scalar-semantic version of these tests? That would help illustrate the grapheme boundary issue if there is one.

rctcwyvrn · 2022-06-27T23:00:46Z

@swift-ci test

milseman

Cleanup for future noted, otherwise LGTM

milseman · 2022-06-28T15:55:53Z

Sources/_StringProcessing/ByteCodeGen.swift

-    builder.buildConsume(by: consumer)
+    if let asciiBitset = ccc.asAsciiBitset(options),
+        options.semanticLevel == .graphemeCluster {
+      // future work: add a bit to .matchBitset to consume either a character


Can you make sure we do this soon? I want to have as much of a unified performance story between grapheme semantic and scalar semantic as possible. Ideally a lot of perf analysis will be downgrading grapheme to scalar operations as permitted.

Having two different paths also complicates testing, as many tests that were exhaustively testing the engine are now only testing one path in the engine. We'll need to meet to discuss testing and validation as we add special-case optimizations.

milseman · 2022-06-28T15:58:22Z

Sources/_StringProcessing/Compiler.swift

+  switch semanticLevel?.base {
+  case .graphemeCluster:
+    let sequence = AST.MatchingOptionSequence(adding: [.init(.graphemeClusterSemantics, location: .fake)])
+    dsl = DSLTree(.nonCapturingGroup(.init(ast: .changeMatchingOptions(sequence)), ast.dslTree.root))


Future: we'll want a DSLTree node for changing options that's not via groups

milseman · 2022-06-28T16:06:10Z

Sources/_StringProcessing/ConsumerInterface.swift

+              return idx
+            }
+          }
+          return nil


This looks ok to me for now.

Future: we seriously need to kill this code. The FIXME above states that it's only called for character classes, yet it is emitting atom consumers, so there's some technical debt here.

milseman · 2022-06-28T16:07:33Z

Sources/_StringProcessing/ConsumerInterface.swift

+
+  var singleScalarASCIIValue: UInt8? {
+    switch kind {
+    case let .char(c) where c != "\r\n":


Future cleanup: something like the below to consolidate logic

extension Character { var _singleScalarASCIIValue: UInt8? { ... } }
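One possible body for that helper, as a sketch only: the property name comes from the suggestion above, but the implementation below is an assumption. It is written against the scalar view so that CR-LF is excluded without a special case, since `Character.asciiValue` reports 10 for `"\r\n"`.

```swift
extension Character {
  /// Hypothetical consolidation helper: the ASCII value of this Character,
  /// but only if it consists of exactly one Unicode scalar.
  var _singleScalarASCIIValue: UInt8? {
    guard unicodeScalars.count == 1,
          let scalar = unicodeScalars.first,
          scalar.isASCII
    else { return nil }
    return UInt8(scalar.value)
  }
}

let plainA: Character = "a"
let crlfCharacter: Character = "\r\n"
let decomposedA: Character = "a\u{301}"

let plainValue = plainA._singleScalarASCIIValue        // 97
let crlfValue = crlfCharacter._singleScalarASCIIValue  // nil
let crlfASCIIValue = crlfCharacter.asciiValue          // 10 (standard library behavior)
let decomposedValue = decomposedA._singleScalarASCIIValue  // nil
```

The difference between `crlfValue` and `crlfASCIIValue` is the reason the diff above guards with `where c != "\r\n"`.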

milseman · 2022-06-28T16:11:59Z

Tests/RegexTests/MatchTests.swift

+              ("💿", true),
+              ("A", true),
+              ("a", false))
+


Right after this lands, can you add case-insensitive test variants?

milseman

Again, looking good. You can feel free to merge or keep working off the same PR

Sources/_StringProcessing/ByteCodeGen.swift

Sources/_StringProcessing/Compiler.swift

Sources/_StringProcessing/Regex/Core.swift

Tests/RegexTests/MatchTests.swift

rctcwyvrn · 2022-06-28T22:23:51Z

@swift-ci test

…acter classes in unicode scalars mode (swiftlang#511)

- Add AsciiBitset as a conditional optimization for custom character classes that only contain ascii characters
- Adds CompileOptions to turn off optimizations
- Adds basic testing infrastructure for testing if compilation emitted certain instructions and if the optimized regex returned the same result as the unoptimized

Co-authored-by: Michael Ilseman <[email protected]>

milseman and others added 19 commits June 19, 2022 09:18

[benchmark] Add no-capture version of grapheme breaking exercise

5fd8840

[benchmark] Add cross-engine benchmark helpers

03fe8d6

[benchmark] Hangul Syllable finding benchmark

5667705

Add debug mode

bde259b

Fix typo in css regex

bf95e81

Add HTML benchmark

243ec7b

Add email regex benchmarks

eeb0852

Add save/compare functionality to the benchmarker

49efd67

Clean up compare and add cli flags

b3a61a7

Merge branch 'main' into more_more_benchmarks

926d208

Make fixes

752ea76

Merge branch 'more_more_benchmarks' of github.com:rctcwyvrn/swift-exp…

7327e74

…erimental-string-processing into more_more_benchmarks

oops, remove some leftover code

7a900b6

Fix linux build issue + add cli option for specifying compare file

50e8e6d

First ver of bitset character classes

3c7f62c

Did a dumb and didn't use the new api I had added...

b71b177

Fix bug in inverted character sets

e2a011c

Remove nested chararcter class cases

f7900e5

Remove comment

e9d1902

it did in fact, not need @escaping

rctcwyvrn requested review from milseman and natecook1000 June 22, 2022 21:52

Merge branch 'main' into many-closures-vs-one-bitset-boi

cf59091

Cleanup handling of isInverted

f4019d4

milseman reviewed Jun 23, 2022

rctcwyvrn added 3 commits June 23, 2022 12:04

Cleanup

ed82cb0

Remove isCaseInsensitive property

cc1ac9d

It was already being folded into the value on initialization, no reason to keep it

Add tests for special cases

ccf6ade

milseman reviewed Jun 24, 2022

Use switch on ranges instead of if

7b83e0c

milseman approved these changes Jun 27, 2022

rctcwyvrn added 8 commits June 27, 2022 11:39

Rename asciivalue to singleScalarAsciiValue

5121076

Properly handle unicode scalars mode in custom character classes

3607b65

I most definitely did not forget to commit the tests

291a974

Cleanup

ddcf40f

Add support for testing if compilation contains certain opcodes

f87b325

Forgot the tests again, twice in one day...

2d8ac2d

Spelling mistakes

fd66693

Make expectProgram take sets of opcodes

22c8213

milseman approved these changes Jun 28, 2022

Add compiler options + validation testing against unoptimized regexes

0781b93

milseman reviewed Jun 28, 2022

Cleanup, clear cache of Regex.Program when setting new compile options

ffff944

milseman approved these changes Jun 29, 2022

rctcwyvrn merged commit 711c6e3 into swiftlang:main Jun 29, 2022

rctcwyvrn mentioned this pull request Jun 30, 2022

Optimize matching to match on scalar values when possible #525

Merged

rctcwyvrn mentioned this pull request Jun 30, 2022

[5.7] Merge benchmarker improvements and character class bitset optimization #532

Merged
