Skip to content

Use a bitset for ascii-only character classes #511

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Jun 29, 2022
Merged
Show file tree
Hide file tree
Changes from 34 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
5fd8840
[benchmark] Add no-capture version of grapheme breaking exercise
milseman Jun 19, 2022
03fe8d6
[benchmark] Add cross-engine benchmark helpers
milseman Jun 19, 2022
5667705
[benchmark] Hangul Syllable finding benchmark
milseman Jun 19, 2022
bde259b
Add debug mode
rctcwyvrn Jun 20, 2022
bf95e81
Fix typo in css regex
rctcwyvrn Jun 20, 2022
243ec7b
Add HTML benchmark
rctcwyvrn Jun 20, 2022
eeb0852
Add email regex benchmarks
rctcwyvrn Jun 20, 2022
49efd67
Add save/compare functionality to the benchmarker
rctcwyvrn Jun 20, 2022
b3a61a7
Clean up compare and add cli flags
rctcwyvrn Jun 20, 2022
926d208
Merge branch 'main' into more_more_benchmarks
milseman Jun 21, 2022
752ea76
Make fixes
rctcwyvrn Jun 21, 2022
7327e74
Merge branch 'more_more_benchmarks' of github.com:rctcwyvrn/swift-exp…
rctcwyvrn Jun 21, 2022
7a900b6
oops, remove some leftover code
rctcwyvrn Jun 21, 2022
50e8e6d
Fix linux build issue + add cli option for specifying compare file
rctcwyvrn Jun 21, 2022
3c7f62c
First ver of bitset character classes
rctcwyvrn Jun 22, 2022
b71b177
Did a dumb and didn't use the new api I had added...
rctcwyvrn Jun 22, 2022
e2a011c
Fix bug in inverted character sets
rctcwyvrn Jun 22, 2022
f7900e5
Remove nested chararcter class cases
rctcwyvrn Jun 22, 2022
e9d1902
Remove comment
rctcwyvrn Jun 22, 2022
cf59091
Merge branch 'main' into many-closures-vs-one-bitset-boi
rctcwyvrn Jun 22, 2022
f4019d4
Cleanup handling of isInverted
rctcwyvrn Jun 23, 2022
ed82cb0
Cleanup
rctcwyvrn Jun 23, 2022
cc1ac9d
Remove isCaseInsensitive property
rctcwyvrn Jun 23, 2022
ccf6ade
Add tests for special cases
rctcwyvrn Jun 23, 2022
7b83e0c
Use switch on ranges instead of if
rctcwyvrn Jun 24, 2022
5121076
Rename asciivalue to singleScalarAsciiValue
rctcwyvrn Jun 27, 2022
3607b65
Properly handle unicode scalars mode in custom character classes
rctcwyvrn Jun 27, 2022
291a974
I most definitely did not forget to commit the tests
rctcwyvrn Jun 27, 2022
ddcf40f
Cleanup
rctcwyvrn Jun 27, 2022
f87b325
Add support for testing if compilation contains certain opcodes
rctcwyvrn Jun 27, 2022
2d8ac2d
Forgot the tests again, twice in one day...
rctcwyvrn Jun 27, 2022
fd66693
Spelling mistakes
rctcwyvrn Jun 27, 2022
22c8213
Make expectProgram take sets of opcodes
rctcwyvrn Jun 27, 2022
0781b93
Add compiler options + validation testing against unoptimized regexes
rctcwyvrn Jun 28, 2022
ffff944
Cleanup, clear cache of Regex.Program when setting new compile options
rctcwyvrn Jun 28, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 17 additions & 3 deletions Sources/_StringProcessing/ByteCodeGen.swift
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,19 @@
extension Compiler {
struct ByteCodeGen {
var options: MatchingOptions
private let compileOptions: CompileOptions
var builder = MEProgram.Builder()
/// A Boolean indicating whether the first matchable atom has been emitted.
/// This is used to determine whether to apply initial options.
var hasEmittedFirstMatchableAtom = false

init(options: MatchingOptions, captureList: CaptureList) {
init(
options: MatchingOptions,
compileOptions: CompileOptions,
captureList: CaptureList
) {
self.options = options
self.compileOptions = compileOptions
self.builder.captureList = captureList
}
}
Expand Down Expand Up @@ -643,8 +649,16 @@ fileprivate extension Compiler.ByteCodeGen {
mutating func emitCustomCharacterClass(
_ ccc: DSLTree.CustomCharacterClass
) throws {
let consumer = try ccc.generateConsumer(options)
builder.buildConsume(by: consumer)
if let asciiBitset = ccc.asAsciiBitset(options),
options.semanticLevel == .graphemeCluster,
!compileOptions.contains(.unoptimized) {
// future work: add a bit to .matchBitset to consume either a character
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make sure we do this soon? I want to have as much of a unified performance story between grapheme semantic and scalar semantic as possible. Ideally a lot of perf analysis will be downgrading grapheme to scalar operations as permitted.

Having two different paths also complicates testing, as now many tests that were exhaustively testing the engine are now only testing one path in the engine. We'll need to meet to discuss testing and validation as we add special-case optimizations.

// or a scalar so we can have this optimization in scalar mode
builder.buildMatchAsciiBitset(asciiBitset)
} else {
let consumer = try ccc.generateConsumer(options)
builder.buildConsume(by: consumer)
}
}

@discardableResult
Expand Down
52 changes: 42 additions & 10 deletions Sources/_StringProcessing/Compiler.swift
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ class Compiler {

// TODO: Or are these stored on the tree?
var options = MatchingOptions()
private var compileOptions: CompileOptions = .default

init(ast: AST) {
self.tree = ast.dslTree
Expand All @@ -25,23 +26,22 @@ class Compiler {
self.tree = tree
}

init(tree: DSLTree, compileOptions: CompileOptions) {
self.tree = tree
self.compileOptions = compileOptions
}

__consuming func emit() throws -> MEProgram {
// TODO: Handle global options
var codegen = ByteCodeGen(
options: options, captureList: tree.captureList
)
options: options,
compileOptions:
compileOptions,
captureList: tree.captureList)
return try codegen.emitRoot(tree.root)
}
}

func _compileRegex(
_ regex: String, _ syntax: SyntaxOptions = .traditional
) throws -> Executor {
let ast = try parse(regex, .semantic, syntax)
let program = try Compiler(ast: ast).emit()
return Executor(program: program)
}

// An error produced when compiling a regular expression.
enum RegexCompilationError: Error, CustomStringConvertible {
// TODO: Source location?
Expand All @@ -54,3 +54,35 @@ enum RegexCompilationError: Error, CustomStringConvertible {
}
}
}

// Testing support
@available(SwiftStdlib 5.7, *)
func _compileRegex(
_ regex: String,
_ syntax: SyntaxOptions = .traditional,
_ semanticLevel: RegexSemanticLevel? = nil
) throws -> Executor {
let ast = try parse(regex, .semantic, syntax)
let dsl: DSLTree

switch semanticLevel?.base {
case .graphemeCluster:
let sequence = AST.MatchingOptionSequence(adding: [.init(.graphemeClusterSemantics, location: .fake)])
dsl = DSLTree(.nonCapturingGroup(.init(ast: .changeMatchingOptions(sequence)), ast.dslTree.root))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future: we'll want a DSLTree node for changing options that's not via groups

case .unicodeScalar:
let sequence = AST.MatchingOptionSequence(adding: [.init(.unicodeScalarSemantics, location: .fake)])
dsl = DSLTree(.nonCapturingGroup(.init(ast: .changeMatchingOptions(sequence)), ast.dslTree.root))
case .none:
dsl = ast.dslTree
}
let program = try Compiler(tree: dsl).emit()
return Executor(program: program)
}

extension Compiler {
struct CompileOptions: OptionSet {
let rawValue: Int
static let unoptimized = CompileOptions(rawValue: 1)
static let `default`: CompileOptions = []
}
}
110 changes: 98 additions & 12 deletions Sources/_StringProcessing/ConsumerInterface.swift
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,26 @@ extension DSLTree.Node {
}
}

extension DSLTree._AST.Atom {
var singleScalarASCIIValue: UInt8? {
return ast.singleScalarASCIIValue
}
}

extension DSLTree.Atom {
var singleScalarASCIIValue: UInt8? {
switch self {
case let .char(c) where c != "\r\n":
return c.asciiValue
case let .scalar(s) where s.isASCII:
return UInt8(ascii: s)
case let .unconverted(atom):
return atom.singleScalarASCIIValue
default:
return nil
}
}

// TODO: If ByteCodeGen switches first, then this is unnecessary for
// top-level nodes, but it's also invoked for `.atom` members of a custom CC
func generateConsumer(
Expand All @@ -61,17 +80,32 @@ extension DSLTree.Atom {

switch self {
case let .char(c):
// TODO: Match level?
return { input, bounds in
let low = bounds.lowerBound
if isCaseInsensitive && c.isCased {
return input[low].lowercased() == c.lowercased()
? input.index(after: low)
: nil
} else {
return input[low] == c
? input.index(after: low)
: nil
if opts.semanticLevel == .graphemeCluster {
return { input, bounds in
let low = bounds.lowerBound
if isCaseInsensitive && c.isCased {
return input[low].lowercased() == c.lowercased()
? input.index(after: low)
: nil
} else {
return input[low] == c
? input.index(after: low)
: nil
}
}
} else {
let consumers = c.unicodeScalars.map { s in consumeScalar {
isCaseInsensitive
? $0.properties.lowercaseMapping == s.properties.lowercaseMapping
: $0 == s
}}
return { input, bounds in
for fn in consumers {
if let idx = fn(input, bounds) {
return idx
}
}
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok to me for now.

Future: we seriously need to kill this code. The FIXME above states that it's only called for character classes, yet it is emitting atom consumers, so there's some technical debt here.

}
}
case let .scalar(s):
Expand Down Expand Up @@ -177,7 +211,18 @@ extension AST.Atom {
default: return nil
}
}


var singleScalarASCIIValue: UInt8? {
switch kind {
case let .char(c) where c != "\r\n":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future cleanup: something like the below to consolidate logic

extension Character {
  var _singleScalarASCIIValue: UInt8? { ... }
}

return c.asciiValue
case let .scalar(s) where s.value.isASCII:
return UInt8(ascii: s.value)
default:
return nil
}
}

func generateConsumer(
_ opts: MatchingOptions
) throws -> MEProgram.ConsumeFunction? {
Expand Down Expand Up @@ -235,6 +280,34 @@ extension AST.Atom {
}

extension DSLTree.CustomCharacterClass.Member {
func asAsciiBitset(
_ opts: MatchingOptions,
_ isInverted: Bool
) -> DSLTree.CustomCharacterClass.AsciiBitset? {
switch self {
case let .atom(a):
if let val = a.singleScalarASCIIValue {
return DSLTree.CustomCharacterClass.AsciiBitset(
val,
isInverted,
opts.isCaseInsensitive
)
}
case let .range(low, high):
if let lowVal = low.singleScalarASCIIValue, let highVal = high.singleScalarASCIIValue {
return DSLTree.CustomCharacterClass.AsciiBitset(
low: lowVal,
high: highVal,
isInverted: isInverted,
isCaseInsensitive: opts.isCaseInsensitive
)
}
default:
return nil
}
return nil
}

func generateConsumer(
_ opts: MatchingOptions
) throws -> MEProgram.ConsumeFunction {
Expand Down Expand Up @@ -342,6 +415,19 @@ extension DSLTree.CustomCharacterClass.Member {
}

extension DSLTree.CustomCharacterClass {
func asAsciiBitset(_ opts: MatchingOptions) -> AsciiBitset? {
return members.reduce(
.init(isInverted: isInverted),
{result, member in
if let next = member.asAsciiBitset(opts, isInverted) {
return result?.union(next)
} else {
return nil
}
}
)
}

func generateConsumer(
_ opts: MatchingOptions
) throws -> MEProgram.ConsumeFunction {
Expand Down
8 changes: 8 additions & 0 deletions Sources/_StringProcessing/Engine/InstPayload.swift
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ extension Instruction.Payload {
case bool(BoolRegister)
case element(ElementRegister)
case consumer(ConsumeFunctionRegister)
case bitset(AsciiBitsetRegister)
case assertion(AssertionFunctionRegister)
case addr(InstructionAddress)
case capture(CaptureRegister)
Expand Down Expand Up @@ -196,6 +197,13 @@ extension Instruction.Payload {
interpret()
}

init(bitset: AsciiBitsetRegister) {
self.init(bitset)
}
var bitset: AsciiBitsetRegister {
interpret()
}

init(consumer: ConsumeFunctionRegister) {
self.init(consumer)
}
Expand Down
4 changes: 4 additions & 0 deletions Sources/_StringProcessing/Engine/Instruction.swift
Original file line number Diff line number Diff line change
Expand Up @@ -84,6 +84,10 @@ extension Instruction {
/// Operand: Sequence register to compare against.
case matchSequence

/// Match against a set of valid ascii values stored in a bitset
/// Operand: Ascii bitset register containing the bitset
case matchBitset

/// TODO: builtin assertions and anchors
case builtinAssertion

Expand Down
17 changes: 17 additions & 0 deletions Sources/_StringProcessing/Engine/MEBuilder.swift
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ extension MEProgram {
var elements = TypedSetVector<Input.Element, _ElementRegister>()
var sequences = TypedSetVector<[Input.Element], _SequenceRegister>()

var asciiBitsets: [DSLTree.CustomCharacterClass.AsciiBitset] = []
var consumeFunctions: [ConsumeFunction] = []
var assertionFunctions: [AssertionFunction] = []
var transformFunctions: [TransformFunction] = []
Expand Down Expand Up @@ -147,6 +148,13 @@ extension MEProgram.Builder {
.init(sequence: sequences.store(.init(s)))))
}

mutating func buildMatchAsciiBitset(
_ b: DSLTree.CustomCharacterClass.AsciiBitset
) {
instructions.append(.init(
.matchBitset, .init(bitset: makeAsciiBitset(b))))
}

mutating func buildConsume(
by p: @escaping MEProgram.ConsumeFunction
) {
Expand Down Expand Up @@ -273,6 +281,7 @@ extension MEProgram.Builder {
regInfo.sequences = sequences.count
regInfo.ints = nextIntRegister.rawValue
regInfo.values = nextValueRegister.rawValue
regInfo.bitsets = asciiBitsets.count
regInfo.consumeFunctions = consumeFunctions.count
regInfo.assertionFunctions = assertionFunctions.count
regInfo.transformFunctions = transformFunctions.count
Expand All @@ -283,6 +292,7 @@ extension MEProgram.Builder {
instructions: InstructionList(instructions),
staticElements: elements.stored,
staticSequences: sequences.stored,
staticBitsets: asciiBitsets,
staticConsumeFunctions: consumeFunctions,
staticAssertionFunctions: assertionFunctions,
staticTransformFunctions: transformFunctions,
Expand Down Expand Up @@ -414,6 +424,13 @@ extension MEProgram.Builder {
// TODO: A register-mapping helper struct, which could release
// registers without monotonicity required

mutating func makeAsciiBitset(
_ b: DSLTree.CustomCharacterClass.AsciiBitset
) -> AsciiBitsetRegister {
defer { asciiBitsets.append(b) }
return AsciiBitsetRegister(asciiBitsets.count)
}

mutating func makeConsumeFunction(
_ f: @escaping MEProgram.ConsumeFunction
) -> ConsumeFunctionRegister {
Expand Down
1 change: 1 addition & 0 deletions Sources/_StringProcessing/Engine/MEProgram.swift
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ struct MEProgram {

var staticElements: [Input.Element]
var staticSequences: [[Input.Element]]
var staticBitsets: [DSLTree.CustomCharacterClass.AsciiBitset]
var staticConsumeFunctions: [ConsumeFunction]
var staticAssertionFunctions: [AssertionFunction]
var staticTransformFunctions: [TransformFunction]
Expand Down
21 changes: 21 additions & 0 deletions Sources/_StringProcessing/Engine/Processor.swift
Original file line number Diff line number Diff line change
Expand Up @@ -226,6 +226,20 @@ extension Processor {
}
return true
}

// If we have a bitset we know that the CharacterClass only matches against
// ascii characters, so check if the current input element is ascii then
// check if it is set in the bitset
mutating func matchBitset(
_ bitset: DSLTree.CustomCharacterClass.AsciiBitset
) -> Bool {
guard let cur = load(), bitset.matches(char: cur) else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future work: we can implement this on the UTF-8 view, but we'd have to handle grapheme breaking ourselves.

BTW, @natecook1000 what is the model for semantic mode processing around here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to try a fast check, IIRC you could have this at the top:

guard bitset.matches(input.utf8[currentPosition]),  
      input._isOnGraphemeBoundarySomething(input.utf8.index(after: currentPosition)) 
else {

That'd be for measuring or approximating the potential benefit. I think we'd want to have a more consistent series of helper functions surrounding sub-grapheme cluster processing though.

Copy link
Member

@natecook1000 natecook1000 Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When matching with grapheme cluster semantics:

  • this should only match a single-scalar ASCII character (unless inverted)
  • it should advance to the next character after successfully matching

When matching with Unicode scalar semantics:

  • this should only check the current Unicode scalar value
  • it should advance to the next Unicode scalar value after successfully matching

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but what is the model for the engine? The engine isn't querying options on every loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC, this optimization only applies to grapheme-semantic mode right now, which is an unfortunate limitation. Lily, can you make sure to write a test for this somehow? We may need to revise our compilation testing approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another approach is to have a bit in some instructions or payloads (whether that is really a dedicated bit or a virtual bit because we expand opcodes around it) that signals whether it should end in a grapheme break check. That would allow us to have a specialized matchScalar instruction, and we'd not bother to check grapheme boundaries for scalar sequences that we statically know are NFC invariant and don't need a check between every scalar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to do optimizations for scalar mode (new instructions, new processor functions) as a future PR, for now I just made it not generate the bitset when in scalar mode.

I also added some support to Compiler and CompileTests to check for the existence of certain opcodes under different semantic levels

signalFailure()
return false
}
_uncheckedForcedConsumeOne()
return true
}

mutating func signalFailure() {
guard let (pc, pos, stackEnd, capEnds, intRegisters) =
Expand Down Expand Up @@ -364,6 +378,13 @@ extension Processor {
controller.step()
}

case .matchBitset:
let reg = payload.bitset
let bitset = registers[reg]
if matchBitset(bitset) {
controller.step()
}

case .consumeBy:
let reg = payload.consumer
guard currentPosition < searchBounds.upperBound,
Expand Down
Loading