Handle boundaries when matching in substrings #675

natecook1000 · 2023-05-22T17:56:46Z

Some of our existing matching routines use the start/endIndex of the input, which is basically never the right thing to do.

This change revises those checks to use the search bounds instead. Where possible, the boundary check moves out of the matching method. Where the boundary is part of what needs to be matched (e.g. word boundaries behave differently at the start/end than in the middle of a string), the search bounds are passed into the matching method.

Testing is currently handled by piggy-backing on the existing match tests; we should add more tests to handle substring-specific edge cases.
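To make the failure mode concrete, here is a minimal sketch that is not taken from the PR. The helper names `isAtEndUsingInputBounds` and `isAtEndUsingSearchBounds` are hypothetical; they only contrast a check against the base string's `endIndex` with a check against the search bounds.

```swift
// Hypothetical helpers, for illustration only; they are not part of the PR.

/// Buggy: asks whether `pos` is at the end of the whole base string.
func isAtEndUsingInputBounds(_ input: String, _ pos: String.Index) -> Bool {
  pos == input.endIndex
}

/// Fixed: asks whether `pos` is at the end of the range being searched.
func isAtEndUsingSearchBounds(
  _ pos: String.Index, _ bounds: Range<String.Index>
) -> Bool {
  pos == bounds.upperBound
}

let base = "hello world"
let sub = base.prefix(5)                      // "hello"
let bounds = sub.startIndex..<sub.endIndex    // search bounds within `base`

// The end of the substring is not the end of the base string.
let atEndOfInput = isAtEndUsingInputBounds(base, sub.endIndex)
let atEndOfBounds = isAtEndUsingSearchBounds(sub.endIndex, bounds)
```

A substring shares its indices with its base string, so any check written against `input.endIndex` silently looks past the end of the substring.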

natecook1000 · 2023-05-22T22:20:53Z

@swift-ci Please test

milseman · 2023-05-23T13:17:22Z

Sources/_StringProcessing/Engine/MEQuantify.swift

@@ -27,8 +27,7 @@ extension Processor {
        isStrictASCII: payload.builtinIsStrict,
        isScalarSemantics: isScalarSemantics)
    case .any:
-      // FIXME: endIndex or end?
-      guard currentPosition < input.endIndex else { return nil }
+      guard currentPosition < end else { return nil }


Were you able to write a test case for this one?

This is hit by the existing "match any" test cases when you pass a substring instead of a string.

Should we hoist this bounds check to the top of the method, or else leave it as a concern for the String methods? Either way, we want a consistent place and the body of the switch would ideally just be doing dispatch.

milseman · 2023-05-23T13:18:33Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

-    guard currentPosition < endIndex else {
-      return nil
-    }
+    assert(currentPosition < endIndex)


Out of curiosity, why hoist these out? We'll probably want a naming convention for the assumption in this case, such as atAssumingInBounds currentPosition.

As written, the check can't be done in this method, since we're missing the processor's end. With the limitedBy parameter we can add the guard back in.

milseman · 2023-05-23T13:20:12Z

Sources/_StringProcessing/Unicode/WordBreaking.swift

    using cache: inout Set<String.Index>?,
    _ maxIndex: inout String.Index?
  ) -> Bool {
    // TODO: needs benchmark coverage
-    guard i != startIndex, i != endIndex else {
+    guard i != range.lowerBound, i != range.upperBound else {


What if i is outside of range?

That's a programming error, added an assertion for that here.
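As a rough illustration of why word boundaries need the search range passed in, the sketch below uses a deliberately simplified rule (letters, digits, and `_` are word characters) rather than the Unicode algorithm in WordBreaking.swift. The method name `isSimpleWordBoundarySketch` is hypothetical.

```swift
extension String {
  /// Hypothetical, simplified word-boundary check. This is not the
  /// Unicode word-breaking algorithm used by the regex engine.
  func isSimpleWordBoundarySketch(
    at i: String.Index, in range: Range<String.Index>
  ) -> Bool {
    // An index outside the search range is a programming error.
    assert(range.lowerBound <= i && i <= range.upperBound)
    func isWord(_ c: Character) -> Bool {
      c.isLetter || c.isNumber || c == "_"
    }
    // Nothing before the lower bound or after the upper bound is visible.
    let wordBefore = i > range.lowerBound && isWord(self[index(before: i)])
    let wordAfter = i < range.upperBound && isWord(self[i])
    return wordBefore != wordAfter
  }
}

let text = "foobar"
let mid = text.index(text.startIndex, offsetBy: 3)
let whole = text.startIndex..<text.endIndex
let tail = mid..<text.endIndex   // the substring "bar"

let boundaryInWhole = text.isSimpleWordBoundarySketch(at: mid, in: whole)
let boundaryInTail = text.isSimpleWordBoundarySketch(at: mid, in: tail)
```

The same index is a boundary when the search starts there and not a boundary when the search covers the whole string, which is why the range has to be part of the check.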

milseman · 2023-05-23T13:21:44Z

Tests/RegexTests/MatchTests.swift

+  func validateSubstring(_ substringInput: Substring) throws {
+    // Sometimes the characters we add to a substring merge with existing
+    // string members. This messes up cross-validation, so skip the test.
+    guard input == substringInput else { return }


I think that we will want to programmatically test these kinds of situations, but maybe check with @lorentey for what the semantics should be when a substring splits a grapheme cluster

This change passes the end boundary down into matching methods, and uses it to find the actual character that is part of the input substring, even if the substring's end boundary is in the middle of a grapheme cluster.
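For readers unfamiliar with the situation being described, this small sketch (not from the PR) builds a substring whose end falls between a base letter and its combining accent. It assumes a toolchain where slicing at a scalar-aligned index keeps that position.

```swift
let base = "cafe\u{301}!"   // "café!" where "é" is "e" + U+0301
let scalars = base.unicodeScalars

// A scalar-aligned index between "e" and the combining accent.
let split = scalars.index(scalars.startIndex, offsetBy: 4)

// The substring ends in the middle of the "é" grapheme cluster.
let sub = base[..<split]
let copy = String(sub)
```

In the base string the fourth character is "é", but the character that is actually part of the substring is the plain "e".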

natecook1000 · 2023-05-25T19:34:59Z

@swift-ci Please test

milseman · 2023-05-29T13:59:24Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+  func characterAndEnd(at pos: String.Index, limitedBy end: String.Index) -> (Character, String.Index)? {
+    guard pos < end else { return nil }
+    let next = index(pos, offsetBy: 1, limitedBy: end) ?? end
+    return self[pos..<next].first.map { ($0, next) }


The below seems simpler to understand and is more efficient:

    guard pos < end else { return nil }
    let next = index(after: pos)
    guard next <= end else { return nil }
    return (self[pos], next)

milseman · 2023-05-29T14:12:53Z

Sources/_StringProcessing/Unicode/ASCII.swift

-
-    if idx == endIndex {
+    assert(String.Index(idx, within: unicodeScalars) != nil)
+    assert(idx <= end)


Somebody in the call graph should be checking this. Where does that happen?

milseman

More feedback. Let's try to get this in, especially as we're doing perf work on the quantifications in which bounds checks start to have a measurable impact.

milseman · 2023-07-30T16:48:16Z

Sources/_StringProcessing/Engine/MEQuantify.swift

@@ -27,8 +27,7 @@ extension Processor {
        isStrictASCII: payload.builtinIsStrict,
        isScalarSemantics: isScalarSemantics)
    case .any:
-      // FIXME: endIndex or end?
-      guard currentPosition < input.endIndex else { return nil }
+      guard currentPosition < end else { return nil }


Should we hoist this bounds check to the top of the method, or else leave it as a concern for the String methods? Either way, we want a consistent place and the body of the switch would ideally just be doing dispatch.

milseman · 2023-07-30T16:49:51Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+    // Substring will round down non-scalar aligned indices
+    let substr = self[pos..<next]
+    return substr.first.map { ($0, substr.endIndex) }
+  }


Can you run the benchmark suite over this approach vs just (self[idx], self.index(after: idx))?

Post-merge performance tweak. We should hand-outline the rare and slow path, no need to pollute the i-cache and confuse the optimizer.

milseman · 2023-08-03T12:29:23Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

    isInverted: Bool,
    isStrictASCII: Bool,
    isScalarSemantics: Bool
  ) -> QuickResult<String.Index?> {
-    assert(currentPosition < endIndex)
+    guard currentPosition < end else { return .definite(nil) }


(similarly)

milseman · 2023-08-03T12:30:25Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

@@ -291,7 +336,7 @@ extension String {
      if isScalarSemantics {
        matched = scalar.isNewline && asciiCheck
        if matched && scalar == "\r"
-            && next != endIndex && unicodeScalars[next] == "\n" {
+            && next != end && unicodeScalars[next] == "\n" {


BTW, is it != or <?

Should probably be < here

milseman · 2023-08-03T12:31:57Z

Sources/_StringProcessing/Engine/Processor.swift

-            let nextIndex = registers[reg](
-              input, currentPosition..<searchBounds.upperBound)
+            let nextIndex = consumer(input, currentPosition..<searchBounds.upperBound),
+            nextIndex <= end


I'm ok with being extra cautious, especially in the context of 3rd party consumer code. Technically this might mean that they are breaking the contract by not doing their own bounds checking, but that's such an onerous requirement that I think this is better.
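A minimal sketch of that defensive check, assuming a consumer is a closure that takes the input and the remaining search range and returns the index after what it consumed. The `ConsumerSketch` alias and `runConsumerSketch` function are hypothetical stand-ins, not the engine's types.

```swift
// Hypothetical stand-in for a custom consumer.
typealias ConsumerSketch = (String, Range<String.Index>) -> String.Index?

func runConsumerSketch(
  _ consumer: ConsumerSketch,
  on input: String,
  from pos: String.Index,
  limitedBy end: String.Index
) -> String.Index? {
  guard pos <= end,
        let next = consumer(input, pos..<end),
        next <= end   // reject results that ran past the search bounds
  else { return nil }
  return next
}

let input = "abcdef"
let end = input.index(input.startIndex, offsetBy: 3)   // search only "abc"

// Misbehaving consumer: ignores the range and consumes the whole input.
let greedy: ConsumerSketch = { s, _ in s.endIndex }
// Well-behaved consumer: consumes one character inside the range.
let single: ConsumerSketch = { s, range in
  range.isEmpty ? nil : s.index(after: range.lowerBound)
}

let greedyResult = runConsumerSketch(
  greedy, on: input, from: input.startIndex, limitedBy: end)
let singleResult = runConsumerSketch(
  single, on: input, from: input.startIndex, limitedBy: end)
```

The extra comparison turns an out-of-bounds result into a failed match instead of letting it corrupt the processor's position.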

Substrings cannot have sub-Unicode scalar boundaries as of Swift 5.7; we can remove a check for this when matching an individual scalar.

natecook1000 · 2023-08-03T21:23:52Z

Benchmarks show mostly mild churn with some increases, as we might expect from additional bounds checking. Something seems off with the EmojiRegex_All benchmark, however, which is showing a nearly 100% increase in execution time. (Columns: time on this branch, baseline time, difference, percent change.)

=== Regressions ======================================================================
- EmojiRegex_All                          126ms	63.9ms	62.2ms		97.3%
- EmailRFCNoMatches_All                   166ms	140ms	26.6ms		19.0%
- symDiffCCC_All_Scalar                   58.3ms	47.5ms	10.8ms		22.7%
- symDiffCCC_All                          58.4ms	47.7ms	10.8ms		22.6%
- EmailRFC_All                            70.6ms	65.6ms	4.99ms		7.6%
- IntersectionCCC_All_Scalar              22.3ms	20.6ms	1.68ms		8.1%
- IntersectionCCC_All                     22.1ms	20.6ms	1.48ms		7.2%
- EmailLookaheadNoMatches_All             40.6ms	39.3ms	1.32ms		3.4%
- BasicRangeCCC_All                       10.7ms	9.66ms	1.08ms		11.2%
- BasicRangeCCC_All_Scalar                10.7ms	9.69ms	1.03ms		10.7%
- CaseInsensitiveCCC_All                  11.4ms	10.4ms	1.01ms		9.7%
- BasicCCC_All                            10.2ms	9.22ms	1e+03µs		10.8%
- CaseInsensitiveCCC_All_Scalar           11.5ms	10.5ms	997µs		9.5%
- BasicCCC_All_Scalar                     10.1ms	9.15ms	983µs		10.7%
- DiceRollsInText_All_Scalar              43.4ms	42.6ms	783µs		1.8%
- SubtractionCCC_All_Scalar               20.9ms	20.3ms	667µs		3.3%
- SubtractionCCC_All                      20.9ms	20.3ms	600µs		3.0%
- EmailLookahead_All                      41.5ms	41.1ms	322µs		0.8%
- EmailLookaheadNoMatches_All_Scalar      24.9ms	24.7ms	194µs		0.8%
- EmailBuiltinCharacterClass_All          10.9ms	10.8ms	150µs		1.4%
- NotFound_All                            5.97ms	5.82ms	148µs		2.5%
- BasicBuiltinCharacterClass_All_Scalar   6.29ms	6.19ms	102µs		1.6%
- IPv4Address                             2.62ms	2.52ms	93.7µs		3.7%
- IPv4Address_Scalar                      2.3ms	2.22ms	79.3µs		3.6%
- LiteralSearch_All_Scalar                4.5ms	4.43ms	70.9µs		1.6%
- DiceNotation                            5.21ms	5.15ms	58.4µs		1.1%
- EmailLookaheadList_Scalar               5.18ms	5.13ms	49.6µs		1.0%
- HangulSyllable_First                    2.72ms	2.69ms	30.7µs		1.1%
- HangulSyllable_First_Scalar             2.27ms	2.24ms	27.3µs		1.2%
=== Improvements =====================================================================
- CompilerMessages_All                    93.6ms	99.6ms	-6.01ms		-6.0%
- CompilerMessages_All_Scalar             76.4ms	81.9ms	-5.47ms		-6.7%
- EmailRFCNoMatches_All_Scalar            125ms	127ms	-2.51ms		-2.0%
- EagarQuantWithTerminal_Whole_Scalar     835µs	1.4ms	-564µs		-40.3%
- EagarQuantWithTerminal_Whole            837µs	1.39ms	-552µs		-39.7%
- EmailRFC_All_Scalar                     47.1ms	47.5ms	-405µs		-0.9%
- GraphemeBreakNoCap_All_Scalar           3.51ms	3.83ms	-326µs		-8.5%
- ReluctantQuantWithTerminal_Whole_Scalar 5.33ms	5.6ms	-263µs		-4.7%
- GraphemeBreakNoCap_All                  4.22ms	4.39ms	-167µs		-3.8%
- LiteralSearchNotFound_All               5.15ms	5.29ms	-137µs		-2.6%
- AnchoredNotFound_First                  9.3ms	9.41ms	-118µs		-1.2%
- HangulSyllable_All_Scalar               5.15ms	5.26ms	-114µs		-2.2%
- Words_All                               14.2ms	14.3ms	-95.9µs		-0.7%
- MACAddress_Scalar                       2.63ms	2.67ms	-46.7µs		-1.7%
- Lines_All                               1.89ms	1.93ms	-38.8µs		-2.0%

milseman

LGTM, let's get this in. This is a very important correctness fix and we can get the perf back after merge.

milseman · 2023-08-04T12:47:34Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+    // Substring will round down non-scalar aligned indices
+    let substr = self[pos..<next]
+    return substr.first.map { ($0, substr.endIndex) }
+  }


Post-merge performance tweak. We should hand-outline the rare and slow path, no need to pollute the i-cache and confuse the optimizer.
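One way the suggested outlining could look is sketched below. The names `characterAndEndSketch` and `partialCharacterAndEndSketch` are hypothetical, and the expected results assume Swift 5.7 or later, where a substring treats its bounds as hard limits for grapheme breaking.

```swift
extension String {
  /// Sketch only: the names here are hypothetical, not the merged API.
  func characterAndEndSketch(
    at pos: String.Index, limitedBy end: String.Index
  ) -> (Character, String.Index)? {
    guard pos < end else { return nil }
    let next = index(after: pos)
    // Fast path: the whole grapheme cluster fits before `end`.
    if next <= end { return (self[pos], next) }
    return partialCharacterAndEndSketch(at: pos, limitedBy: end)
  }

  /// Rare path, kept out of line: `end` falls inside the cluster at `pos`.
  @inline(never)
  func partialCharacterAndEndSketch(
    at pos: String.Index, limitedBy end: String.Index
  ) -> (Character, String.Index)? {
    // Slicing lets Substring work out the truncated character.
    let substr = self[pos..<end]
    return substr.first.map { ($0, substr.endIndex) }
  }
}

let base = "cafe\u{301}!"                                    // "café!"
let ePos = base.index(base.startIndex, offsetBy: 3)          // start of "é"
let scalars = base.unicodeScalars
let split = scalars.index(scalars.startIndex, offsetBy: 4)   // after "e"

let full = base.characterAndEndSketch(at: ePos, limitedBy: base.endIndex)
let partial = base.characterAndEndSketch(at: ePos, limitedBy: split)
```

Keeping the substring slicing in a separate `@inline(never)` function leaves only the comparison and the plain subscript on the hot path.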

milseman · 2023-08-04T12:50:58Z

Sources/_StringProcessing/Engine/Processor.swift

@@ -720,7 +716,7 @@ extension String {
    }

    let idx = unicodeScalars.index(after: pos)
-    guard idx <= end else { return nil }
+    assert(idx <= end, "Input is a substring with a sub-scalar endIndex.")


Can we also put the assertion near the top of Processor proper? We'll want to rely on it throughout the engine

milseman · 2023-08-04T12:53:29Z

Sources/_StringProcessing/Unicode/ASCII.swift

+    assert(String.Index(idx, within: unicodeScalars) != nil)
+    assert(idx <= end)
+
+    if idx == end {


Post-merge cleanup: it seems like anything that's not permissive of an empty match should fail long before reaching this code

milseman · 2023-08-04T12:54:17Z

Sources/_StringProcessing/Unicode/WordBreaking.swift

-    let priorIdx = input.index(before: currentPosition)
+    let priorIdx = semanticLevel == .graphemeCluster
+      ? input.index(before: currentPosition)
+      : input.unicodeScalars.index(before: currentPosition)


Do we have any new test cases that hit this?
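The difference between the two branches of that expression shows up whenever the prior character has more than one scalar. A small sketch, not taken from the PR's tests:

```swift
let text = "e\u{301}x"   // "éx": "e" + combining acute accent, then "x"
let xPos = text.index(before: text.endIndex)   // position of "x"

// Grapheme-cluster semantics: step back over the whole "é" cluster.
let priorCharacter = text.index(before: xPos)

// Scalar semantics: step back over just the combining accent.
let priorScalar = text.unicodeScalars.index(before: xPos)
```

A test input for this code path would need a multi-scalar character immediately before the position being checked, under both semantic levels.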

milseman · 2023-08-04T12:56:49Z

Tests/RegexTests/MatchTests.swift

@@ -30,6 +30,68 @@ func _firstMatch(
 ) throws -> (String, [String?])? {
  var regex = try Regex(regexStr, syntax: syntax).matchingSemantics(semanticLevel)
  let result = try regex.firstMatch(in: input)
+
+  func validateSubstring(_ substringInput: Substring) throws {


Post-PR testing: let's make a test input that contains many kinds of complex grapheme clusters and, for each test regex and each substring of that input, check that any match in that substring is equal to the match in a String copy of that substring.
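A possible shape for such a check is sketched below. The function name `substringMatchesAgreeSketch` is hypothetical, it compares only the matched text of the first match, and it cuts the input at every Unicode scalar boundary so that some substrings split grapheme clusters. It assumes the Swift 5.7+ `Regex` API.

```swift
/// Hypothetical cross-validation helper: for every scalar-aligned substring
/// of `input`, the first match in the substring should equal the first match
/// in a String copy of that substring.
@available(macOS 13.0, iOS 16.0, tvOS 16.0, watchOS 9.0, *)
func substringMatchesAgreeSketch(pattern: String, input: String) throws -> Bool {
  let regex = try Regex(pattern)
  // Every scalar boundary, including the end of the input.
  let cuts = Array(input.unicodeScalars.indices) + [input.endIndex]
  for (n, lower) in cuts.enumerated() {
    for upper in cuts[n...] {
      let sub = input[lower..<upper]
      let copy = String(sub)
      let subMatch = try regex.firstMatch(in: sub).map { String(sub[$0.range]) }
      let copyMatch = try regex.firstMatch(in: copy).map { String(copy[$0.range]) }
      if subMatch != copyMatch { return false }
    }
  }
  return true
}
```

Running this over each test regex with an input built from emoji sequences, combining marks, and CRLF pairs would exercise the mid-cluster cases; the quadratic number of substrings means the input should stay short.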

milseman · 2023-08-04T12:58:37Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

@@ -291,7 +336,7 @@ extension String {
      if isScalarSemantics {
        matched = scalar.isNewline && asciiCheck
        if matched && scalar == "\r"
-            && next != endIndex && unicodeScalars[next] == "\n" {
+            && next < end && unicodeScalars[next] == "\n" {


E.g. here is a place where we implicitly rely on end being scalar aligned (which is a good assumption to make and would be good to assert at the very top of a match operation).
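To illustrate the CRLF case, here is a reduced sketch of a scalar-semantics newline matcher with an explicit end bound. The function `matchNewlineScalarSketch` and its two-scalar newline set are simplifications for illustration, not the engine's implementation.

```swift
/// Hypothetical, reduced newline matcher under scalar semantics.
/// Returns the index after the newline at `pos`, or nil if there is none.
func matchNewlineScalarSketch(
  in input: String, at pos: String.Index, limitedBy end: String.Index
) -> String.Index? {
  let scalars = input.unicodeScalars
  guard pos < end else { return nil }
  let scalar = scalars[pos]
  // Simplified newline set, for illustration only.
  guard scalar == "\n" || scalar == "\r" else { return nil }
  var next = scalars.index(after: pos)
  // Treat "\r\n" as one newline only if the "\n" lies inside the bounds.
  if scalar == "\r" && next < end && scalars[next] == "\n" {
    next = scalars.index(after: next)
  }
  return next
}

let input = "a\r\nb"
let scalars = input.unicodeScalars
let cr = scalars.index(scalars.startIndex, offsetBy: 1)
let lf = scalars.index(after: cr)
let b = scalars.index(after: lf)

let inWhole = matchNewlineScalarSketch(in: input, at: cr, limitedBy: input.endIndex)
let inTruncated = matchNewlineScalarSketch(in: input, at: cr, limitedBy: lf)
```

With a scalar-aligned `end`, `next` can never exceed `end` here, so `!=` and `<` agree; `<` stays safe even if that assumption were ever violated.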

milseman · 2023-08-11T18:48:26Z

@swift-ci please test

natecook1000 requested a review from milseman May 22, 2023 17:56

milseman reviewed May 23, 2023

Handle sub-character substring boundaries

e694693

Clean up, test mid-scalar substring boundaries

9024254

natecook1000 marked this pull request as ready for review May 26, 2023 15:31

Use substring end to handle unaligned indices

6160220

milseman reviewed May 29, 2023

milseman reviewed Jul 30, 2023

natecook1000 added 3 commits July 31, 2023 10:32

Remove unused asserts

b331bc0

Merge branch 'main' into substantial_substrings

3cd121a

Improve characterAndEnd algorithm a bit

5114ea4

milseman reviewed Aug 3, 2023

natecook1000 force-pushed the substantial_substrings branch from ce8d8a2 to 651d3a3 Compare August 3, 2023 20:54

Remove an unnecessary end check

656c388

natecook1000 force-pushed the substantial_substrings branch from 651d3a3 to 656c388 Compare August 3, 2023 21:06

Remove an unnecessary check for sub-scalar upper bound

0f4f005

milseman approved these changes Aug 4, 2023

milseman merged commit f5b0b5e into swiftlang:main Aug 11, 2023

natecook1000 deleted the substantial_substrings branch October 2, 2023 04:15

natecook1000 mentioned this pull request Oct 2, 2023

[swift/main] Substring boundaries during matching #695

Merged

natecook1000 mentioned this pull request Oct 3, 2023

[5.9] Substring boundaries during matching #696

Merged

natecook1000 mentioned this pull request Oct 3, 2023

[5.10] Substring boundaries during matching #697

Merged
