Skip to content

Optimize matching to match on scalar values when possible #525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Jul 12, 2022
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
e518bdf
Add unicode and dna benchmarks
rctcwyvrn Jun 23, 2022
eb64036
Fixup naming, turn off firstmatch, add benchmark filtering by regex
rctcwyvrn Jun 23, 2022
1237f9a
Make --exclude-ns actually work correctly
rctcwyvrn Jun 23, 2022
fdf2c23
Add usage comment for generateFasta
rctcwyvrn Jun 23, 2022
eeb38e9
First ver
rctcwyvrn Jun 28, 2022
6f76f36
Merge branch 'main' into match-scalar
rctcwyvrn Jun 29, 2022
3a2b324
Remove matchseq entirely
rctcwyvrn Jun 29, 2022
df8919e
Merge branch 'just-one-more-benchmark-suite' into temp
rctcwyvrn Jun 29, 2022
c2ee8cc
Finish up matchScalar
rctcwyvrn Jun 29, 2022
809b085
Factor out nextScalarIndex for matchBitsetScalar
rctcwyvrn Jun 29, 2022
06c77c7
Add scalar mode support for matching bitsets + fix bug
rctcwyvrn Jun 29, 2022
e3d7ad7
Emit matchScalar in quotedLiteral when in unicode scalar mode
rctcwyvrn Jun 29, 2022
3e1e088
Add tests
rctcwyvrn Jun 29, 2022
b4c7c8c
Cleanup
rctcwyvrn Jun 30, 2022
5359e31
Revert "Merge branch 'just-one-more-benchmark-suite' into temp"
rctcwyvrn Jun 30, 2022
6e4c2bd
Cleanup
rctcwyvrn Jul 4, 2022
1a359b4
Add case-insensitive match instructions
rctcwyvrn Jul 4, 2022
097ffeb
Remove extra instructions and use payload bits instead
rctcwyvrn Jul 4, 2022
76667a2
Comment out compiletests for now
rctcwyvrn Jul 4, 2022
6a1f6e9
Fix compile tests
rctcwyvrn Jul 5, 2022
fce8f9a
Merge branch 'main' into scalar-optimizations-clean
rctcwyvrn Jul 7, 2022
0860368
Fix scalar matching in grapheme semantic mode
hamishknight Jul 8, 2022
c6bc811
Preserve scalar syntax in DSL conversion
hamishknight Jul 8, 2022
bca1d2b
Change scalar semantics to match #565
rctcwyvrn Jul 11, 2022
79e60ac
Merge branch 'not-to-scale' into scalar-optimizations-clean
rctcwyvrn Jul 11, 2022
22cc9d5
Add edge case test
rctcwyvrn Jul 11, 2022
7a9923d
Always match .scalar under grapheme semantics
rctcwyvrn Jul 11, 2022
47f8e66
Merge branch 'main' into scalar-optimizations-clean
rctcwyvrn Jul 11, 2022
7aa98d1
Add new instructions to compile tests
rctcwyvrn Jul 11, 2022
113cfe3
Add an XFAIL test for scalar coalescing
rctcwyvrn Jul 12, 2022
9949b8e
Fix XCTExpectFailure for linux
rctcwyvrn Jul 12, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
137 changes: 92 additions & 45 deletions Sources/_StringProcessing/ByteCodeGen.swift
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,77 @@ fileprivate extension Compiler.ByteCodeGen {
}
}

mutating func emitQuotedLiteral(_ s: String) {
if options.semanticLevel == .graphemeCluster {
if options.isCaseInsensitive {
// future work: if all ascii, emit matchBitset instructions with
// case insensitive bitsets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other ideas are to have instructions that say to do case-insensitive matching, that is we have variants of match-character/scalar and similarly for match-sequence.

A special case for ASCII could be a dedicated instruction for the parts of ASCII that are case-sensitive. It could store both scalars to match, or just store the lowercase and the engine would do the arithmetic when matching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. we'll want to get all of this off of using consumer interfaces at some point, but doesn't have to be this PR.


// TODO: buildCaseInsensitiveMatchSequence(c) or alternative
builder.buildConsume { input, bounds in
var iterator = s.makeIterator()
var currentIndex = bounds.lowerBound
while let ch = iterator.next() {
guard currentIndex < bounds.upperBound,
ch.lowercased() == input[currentIndex].lowercased()
else { return nil }
input.formIndex(after: &currentIndex)
}
return currentIndex
}
} else {
if optimizationsEnabled && s.allSatisfy({char in char.isASCII}) {
for char in s.dropLast(1) {
// Note: only cr-lf is multiple scalars
for scalar in char.unicodeScalars {
builder.buildMatchScalar(scalar, boundaryCheck: false)
}
}
let lastChar = s.last!
for scalar in lastChar.unicodeScalars {
// Only boundary check if we are the last scalar in the last character
// to make sure that there isn't a combining scalar after the quoted literal
let boundaryCheck = scalar == lastChar.unicodeScalars.last!
builder.buildMatchScalar(scalar, boundaryCheck: boundaryCheck)
}
} else {
builder.buildMatchSequence(s)
}
}
} else {
if optimizationsEnabled && !options.isCaseInsensitive {
// Match all scalars exactly, never boundary check because we're in
// unicode scalars mode
for char in s {
for scalar in char.unicodeScalars {
builder.buildMatchScalar(scalar, boundaryCheck: false)
}
}
} else {
builder.buildConsume {
[caseInsensitive = options.isCaseInsensitive] input, bounds in
// TODO: Case folding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you double check our testing story here? If we're missing something, let's try to add XFAIL tests.

var iterator = s.unicodeScalars.makeIterator()
var currentIndex = bounds.lowerBound
while let scalar = iterator.next() {
guard currentIndex < bounds.upperBound else { return nil }
if caseInsensitive {
if scalar.properties.lowercaseMapping != input.unicodeScalars[currentIndex].properties.lowercaseMapping {
return nil
}
} else {
if scalar != input.unicodeScalars[currentIndex] {
return nil
}
}
input.unicodeScalars.formIndex(after: &currentIndex)
}
return currentIndex
}
}
}
}

mutating func emitBackreference(
_ ref: AST.Reference
) throws {
Expand Down Expand Up @@ -216,6 +287,11 @@ fileprivate extension Compiler.ByteCodeGen {
builder.buildConsume(by: consumeScalar {
$0.properties.lowercaseMapping == s.properties.lowercaseMapping
})
return
}

if optimizationsEnabled { // should we just do this unconditionally?
builder.buildMatchScalar(s, boundaryCheck: false)
} else {
builder.buildConsume(by: consumeScalar {
$0 == s
Expand All @@ -241,9 +317,17 @@ fileprivate extension Compiler.ByteCodeGen {
? input.index(after: bounds.lowerBound)
: nil
}
} else {
builder.buildMatch(c)
}

if optimizationsEnabled && c.isASCII {
for scalar in c.unicodeScalars {
let boundaryCheck = scalar == c.unicodeScalars.last!
builder.buildMatchScalar(scalar, boundaryCheck: boundaryCheck)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea for the builder is to have a convenience function that adds a grapheme boundary check at the current position. It could either update the most recent instruction if that instruction is capable of supporting it or insert the anchor otherwise.

}
return
}

builder.buildMatch(c)
}

mutating func emitAny() {
Expand Down Expand Up @@ -652,11 +736,12 @@ fileprivate extension Compiler.ByteCodeGen {
_ ccc: DSLTree.CustomCharacterClass
) throws {
if let asciiBitset = ccc.asAsciiBitset(options),
options.semanticLevel == .graphemeCluster,
optimizationsEnabled {
// future work: add a bit to .matchBitset to consume either a character
// or a scalar so we can have this optimization in scalar mode
builder.buildMatchAsciiBitset(asciiBitset)
if options.semanticLevel == .unicodeScalar {
builder.buildScalarMatchAsciiBitset(asciiBitset)
} else {
builder.buildMatchAsciiBitset(asciiBitset)
}
} else {
let consumer = try ccc.generateConsumer(options)
builder.buildConsume(by: consumer)
Expand Down Expand Up @@ -733,45 +818,7 @@ fileprivate extension Compiler.ByteCodeGen {
try emitAtom(a)

case let .quotedLiteral(s):
if options.semanticLevel == .graphemeCluster {
if options.isCaseInsensitive {
// TODO: buildCaseInsensitiveMatchSequence(c) or alternative
builder.buildConsume { input, bounds in
var iterator = s.makeIterator()
var currentIndex = bounds.lowerBound
while let ch = iterator.next() {
guard currentIndex < bounds.upperBound,
ch.lowercased() == input[currentIndex].lowercased()
else { return nil }
input.formIndex(after: &currentIndex)
}
return currentIndex
}
} else {
builder.buildMatchSequence(s)
}
} else {
builder.buildConsume {
[caseInsensitive = options.isCaseInsensitive] input, bounds in
// TODO: Case folding
var iterator = s.unicodeScalars.makeIterator()
var currentIndex = bounds.lowerBound
while let scalar = iterator.next() {
guard currentIndex < bounds.upperBound else { return nil }
if caseInsensitive {
if scalar.properties.lowercaseMapping != input.unicodeScalars[currentIndex].properties.lowercaseMapping {
return nil
}
} else {
if scalar != input.unicodeScalars[currentIndex] {
return nil
}
}
input.unicodeScalars.formIndex(after: &currentIndex)
}
return currentIndex
}
}
emitQuotedLiteral(s)

case let .regexLiteral(l):
return try emitNode(l.ast.dslTreeNode)
Expand Down
15 changes: 11 additions & 4 deletions Sources/_StringProcessing/ConsumerInterface.swift
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,13 @@

@_implementationOnly import _RegexParser

extension Character {
var singleScalarAsciiValue: UInt8? {
guard self != "\r\n" else { return nil }
return asciiValue
}
}

extension DSLTree.Node {
/// Attempt to generate a consumer from this AST node
///
Expand Down Expand Up @@ -60,8 +67,8 @@ extension DSLTree._AST.Atom {
extension DSLTree.Atom {
var singleScalarASCIIValue: UInt8? {
switch self {
case let .char(c) where c != "\r\n":
return c.asciiValue
case let .char(c):
return c.singleScalarAsciiValue
case let .scalar(s) where s.isASCII:
return UInt8(ascii: s)
case let .unconverted(atom):
Expand Down Expand Up @@ -214,8 +221,8 @@ extension AST.Atom {

var singleScalarASCIIValue: UInt8? {
switch kind {
case let .char(c) where c != "\r\n":
return c.asciiValue
case let .char(c):
return c.singleScalarAsciiValue
case let .scalar(s) where s.value.isASCII:
return UInt8(ascii: s.value)
default:
Expand Down
7 changes: 7 additions & 0 deletions Sources/_StringProcessing/Engine/InstPayload.swift
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,13 @@ extension Instruction.Payload {
var string: StringRegister {
interpret()
}

init(scalar: Unicode.Scalar) {
self.init(UInt64(scalar.value))
}
var scalar: Unicode.Scalar {
return Unicode.Scalar(_value: UInt32(self.rawValue))
}

init(sequence: SequenceRegister) {
self.init(sequence)
Expand Down
5 changes: 5 additions & 0 deletions Sources/_StringProcessing/Engine/Instruction.swift
Original file line number Diff line number Diff line change
Expand Up @@ -83,10 +83,15 @@ extension Instruction {
///
/// Operand: Sequence register to compare against.
case matchSequence

case matchScalar
case matchScalarUnchecked

/// Match against a set of valid ascii values stored in a bitset
/// Operand: Ascii bitset register containing the bitset
case matchBitset
/// matchBitset but emitted in unicode scalar semantic mode, matches and advances a single scalar
case matchBitsetScalar

/// TODO: builtin assertions and anchors
case builtinAssertion
Expand Down
15 changes: 15 additions & 0 deletions Sources/_StringProcessing/Engine/MEBuilder.swift
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,14 @@ extension MEProgram.Builder {
.matchSequence,
.init(sequence: sequences.store(.init(s)))))
}

mutating func buildMatchScalar(_ s: Unicode.Scalar, boundaryCheck: Bool) {
if boundaryCheck {
instructions.append(.init(.matchScalar, .init(scalar: s)))
} else {
instructions.append(.init(.matchScalarUnchecked, .init(scalar: s)))
}
}

mutating func buildMatchAsciiBitset(
_ b: DSLTree.CustomCharacterClass.AsciiBitset
Expand All @@ -155,6 +163,13 @@ extension MEProgram.Builder {
.matchBitset, .init(bitset: makeAsciiBitset(b))))
}

mutating func buildScalarMatchAsciiBitset(
_ b: DSLTree.CustomCharacterClass.AsciiBitset
) {
instructions.append(.init(
.matchBitsetScalar, .init(bitset: makeAsciiBitset(b))))
}

mutating func buildConsume(
by p: @escaping MEProgram.ConsumeFunction
) {
Expand Down
55 changes: 55 additions & 0 deletions Sources/_StringProcessing/Engine/Processor.swift
Original file line number Diff line number Diff line change
Expand Up @@ -227,6 +227,30 @@ extension Processor {
return true
}

func loadScalar() -> Unicode.Scalar? {
currentPosition < end ? input.unicodeScalars[currentPosition] : nil
}

func nextScalarIndex(offsetBy n: Int, boundaryCheck: Bool) -> Input.Index? {
if let idx = input.unicodeScalars.index(currentPosition, offsetBy: 1, limitedBy: end),
(!boundaryCheck || input.isOnGraphemeClusterBoundary(idx)) {
return idx
}
return nil
}

mutating func matchScalar(_ s: Unicode.Scalar, boundaryCheck: Bool) -> Bool {
guard let curScalar = loadScalar(),
curScalar == s,
let idx = nextScalarIndex(offsetBy: 1, boundaryCheck: boundaryCheck)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make more sense to drop the helper function and just do this directly:

guard s == loadScalar(), !boundaryCheck || isOnGrapheme... else { ... }

Longer term we will want to use a lower-level String interface that returns both the current scalar and the next index. E.g. a nicer version of https://github.com/apple/swift/blob/main/stdlib/public/core/UnicodeHelpers.swift#L107.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could consider adding that now if you like, something like the below and we can micro-optimize it later:

extension String {
  func _consumeScalar(in bounds: Range<String.Index>) -> (Unicode.Scalar, Index)? {
    assert(indices.contains(bounds))
    ... 
  }
}

else {
signalFailure()
return false
}
currentPosition = idx
return true
}

// If we have a bitset we know that the CharacterClass only matches against
// ascii characters, so check if the current input element is ascii then
// check if it is set in the bitset
Expand All @@ -241,6 +265,20 @@ extension Processor {
return true
}

// Equivalent of matchBitset but emitted when in unicode scalar semantic mode
mutating func matchBitsetScalar(
_ bitset: DSLTree.CustomCharacterClass.AsciiBitset
) -> Bool {
guard let curScalar = loadScalar(),
bitset.matches(scalar: curScalar),
let idx = nextScalarIndex(offsetBy: 1, boundaryCheck: false) else {
signalFailure()
return false
}
currentPosition = idx
return true
}

mutating func signalFailure() {
guard let (pc, pos, stackEnd, capEnds, intRegisters) =
savePoints.popLast()?.destructure
Expand Down Expand Up @@ -378,12 +416,29 @@ extension Processor {
controller.step()
}

case .matchScalar:
let scalar = payload.scalar
if matchScalar(scalar, boundaryCheck: true) {
controller.step()
}
case .matchScalarUnchecked:
let scalar = payload.scalar
if matchScalar(scalar, boundaryCheck: false) {
controller.step()
}

case .matchBitset:
let reg = payload.bitset
let bitset = registers[reg]
if matchBitset(bitset) {
controller.step()
}
case .matchBitsetScalar:
let reg = payload.bitset
let bitset = registers[reg]
if matchBitsetScalar(bitset) {
controller.step()
}

case .consumeBy:
let reg = payload.consumer
Expand Down
Loading