Use a bitset for ascii-only character classes #511

Merged
merged 35 commits into from
Jun 29, 2022
Changes from 1 commit
35 commits
5fd8840
[benchmark] Add no-capture version of grapheme breaking exercise
milseman Jun 19, 2022
03fe8d6
[benchmark] Add cross-engine benchmark helpers
milseman Jun 19, 2022
5667705
[benchmark] Hangul Syllable finding benchmark
milseman Jun 19, 2022
bde259b
Add debug mode
rctcwyvrn Jun 20, 2022
bf95e81
Fix typo in css regex
rctcwyvrn Jun 20, 2022
243ec7b
Add HTML benchmark
rctcwyvrn Jun 20, 2022
eeb0852
Add email regex benchmarks
rctcwyvrn Jun 20, 2022
49efd67
Add save/compare functionality to the benchmarker
rctcwyvrn Jun 20, 2022
b3a61a7
Clean up compare and add cli flags
rctcwyvrn Jun 20, 2022
926d208
Merge branch 'main' into more_more_benchmarks
milseman Jun 21, 2022
752ea76
Make fixes
rctcwyvrn Jun 21, 2022
7327e74
Merge branch 'more_more_benchmarks' of github.com:rctcwyvrn/swift-exp…
rctcwyvrn Jun 21, 2022
7a900b6
oops, remove some leftover code
rctcwyvrn Jun 21, 2022
50e8e6d
Fix linux build issue + add cli option for specifying compare file
rctcwyvrn Jun 21, 2022
3c7f62c
First ver of bitset character classes
rctcwyvrn Jun 22, 2022
b71b177
Did a dumb and didn't use the new api I had added...
rctcwyvrn Jun 22, 2022
e2a011c
Fix bug in inverted character sets
rctcwyvrn Jun 22, 2022
f7900e5
Remove nested chararcter class cases
rctcwyvrn Jun 22, 2022
e9d1902
Remove comment
rctcwyvrn Jun 22, 2022
cf59091
Merge branch 'main' into many-closures-vs-one-bitset-boi
rctcwyvrn Jun 22, 2022
f4019d4
Cleanup handling of isInverted
rctcwyvrn Jun 23, 2022
ed82cb0
Cleanup
rctcwyvrn Jun 23, 2022
cc1ac9d
Remove isCaseInsensitive property
rctcwyvrn Jun 23, 2022
ccf6ade
Add tests for special cases
rctcwyvrn Jun 23, 2022
7b83e0c
Use switch on ranges instead of if
rctcwyvrn Jun 24, 2022
5121076
Rename asciivalue to singleScalarAsciiValue
rctcwyvrn Jun 27, 2022
3607b65
Properly handle unicode scalars mode in custom character classes
rctcwyvrn Jun 27, 2022
291a974
I most definitely did not forget to commit the tests
rctcwyvrn Jun 27, 2022
ddcf40f
Cleanup
rctcwyvrn Jun 27, 2022
f87b325
Add support for testing if compilation contains certain opcodes
rctcwyvrn Jun 27, 2022
2d8ac2d
Forgot the tests again, twice in one day...
rctcwyvrn Jun 27, 2022
fd66693
Spelling mistakes
rctcwyvrn Jun 27, 2022
22c8213
Make expectProgram take sets of opcodes
rctcwyvrn Jun 27, 2022
0781b93
Add compiler options + validation testing against unoptimized regexes
rctcwyvrn Jun 28, 2022
ffff944
Cleanup, clear cache of Regex.Program when setting new compile options
rctcwyvrn Jun 28, 2022
11 changes: 9 additions & 2 deletions Sources/_StringProcessing/ByteCodeGen.swift
@@ -3,13 +3,19 @@
 extension Compiler {
   struct ByteCodeGen {
     var options: MatchingOptions
+    private let compileOptions: CompileOptions
     var builder = MEProgram.Builder()
     /// A Boolean indicating whether the first matchable atom has been emitted.
     /// This is used to determine whether to apply initial options.
     var hasEmittedFirstMatchableAtom = false

-    init(options: MatchingOptions, captureList: CaptureList) {
+    init(
+      options: MatchingOptions,
+      compileOptions: CompileOptions,
+      captureList: CaptureList
+    ) {
       self.options = options
+      self.compileOptions = compileOptions
       self.builder.captureList = captureList
     }
   }
@@ -644,7 +650,8 @@ fileprivate extension Compiler.ByteCodeGen {
     _ ccc: DSLTree.CustomCharacterClass
   ) throws {
     if let asciiBitset = ccc.asAsciiBitset(options),
-        options.semanticLevel == .graphemeCluster {
+        options.semanticLevel == .graphemeCluster,
+        !compileOptions.contains(.unoptimized) {
       // future work: add a bit to .matchBitset to consume either a character
Member

Can you make sure we do this soon? I want to have as much of a unified performance story between grapheme semantic and scalar semantic as possible. Ideally a lot of perf analysis will be downgrading grapheme to scalar operations as permitted.

Having two different paths also complicates testing: many tests that used to exercise the engine exhaustively now cover only one path through it. We'll need to meet to discuss testing and validation as we add special-case optimizations.

       // or a scalar so we can have this optimization in scalar mode
       builder.buildMatchAsciiBitset(asciiBitset)
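
For readers unfamiliar with the optimization being gated in this hunk, the following is a minimal, self-contained sketch of an ASCII bitset membership test. `AsciiBitsetSketch` and its methods are hypothetical names used only for illustration; this is not the PR's `asAsciiBitset` or `buildMatchAsciiBitset` implementation. It assumes one bit per ASCII code point, stored in two `UInt64` words, and assumes that an inverted set accepts any non-ASCII character.

```swift
/// Hypothetical illustration: a 128-bit set with one bit per ASCII code point.
struct AsciiBitsetSketch {
  private var low: UInt64 = 0   // bits for code points 0...63
  private var high: UInt64 = 0  // bits for code points 64...127
  var isInverted = false

  /// Adds a single ASCII code point; values above 127 are ignored.
  mutating func insert(_ value: UInt8) {
    guard value < 128 else { return }
    if value < 64 {
      low |= 1 << UInt64(value)
    } else {
      high |= 1 << UInt64(value - 64)
    }
  }

  /// Adds every code point in a closed range, for example "a"..."z".
  mutating func insert(range: ClosedRange<UInt8>) {
    for value in range { insert(value) }
  }

  /// Tests one character. Non-ASCII characters never have a bit set,
  /// so (by assumption) they match only when the set is inverted.
  func matches(_ character: Character) -> Bool {
    guard let value = character.asciiValue else { return isInverted }
    let isSet: Bool
    if value < 64 {
      isSet = (low >> UInt64(value)) & 1 == 1
    } else {
      isSet = (high >> UInt64(value - 64)) & 1 == 1
    }
    return isSet != isInverted
  }
}
```

A class such as `[0-9]` would be built with `insert(range: UInt8(ascii: "0")...UInt8(ascii: "9"))`. Each membership test is then a shift and a mask rather than a chain of per-member comparisons.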
43 changes: 29 additions & 14 deletions Sources/_StringProcessing/Compiler.swift
@@ -16,6 +16,7 @@ class Compiler {

   // TODO: Or are these stored on the tree?
   var options = MatchingOptions()
+  private var compileOptions: CompileOptions = .default

   init(ast: AST) {
     self.tree = ast.dslTree
@@ -25,17 +26,36 @@ class Compiler {
     self.tree = tree
   }

+  init(tree: DSLTree, compileOptions: CompileOptions) {
+    self.tree = tree
+    self.compileOptions = compileOptions
+  }
+
   __consuming func emit() throws -> MEProgram {
     // TODO: Handle global options
     var codegen = ByteCodeGen(
-      options: options, captureList: tree.captureList
-    )
+      options: options,
+      compileOptions:
+        compileOptions,
+      captureList: tree.captureList)
     return try codegen.emitRoot(tree.root)
   }
 }

-/// Regex.Program and CompilerInterface.swift call these parse/compilation methods directly, this method is
-/// only for testing purposes (see CompileTest.swift)
+// An error produced when compiling a regular expression.
+enum RegexCompilationError: Error, CustomStringConvertible {
+  // TODO: Source location?
+  case uncapturedReference
+
+  var description: String {
+    switch self {
+    case .uncapturedReference:
+      return "Found a reference used before it captured any match."
+    }
+  }
+}
+
+// Testing support
 @available(SwiftStdlib 5.7, *)
 func _compileRegex(
   _ regex: String,
@@ -59,15 +79,10 @@ func _compileRegex(
   return Executor(program: program)
 }

-// An error produced when compiling a regular expression.
-enum RegexCompilationError: Error, CustomStringConvertible {
-  // TODO: Source location?
-  case uncapturedReference
-
-  var description: String {
-    switch self {
-    case .uncapturedReference:
-      return "Found a reference used before it captured any match."
-    }
+extension Compiler {
+  struct CompileOptions: OptionSet {
+    let rawValue: Int
+    static let unoptimized = CompileOptions(rawValue: 1)
+    static let `default`: CompileOptions = []
   }
 }
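
As a standalone illustration of how an `OptionSet` flag like `.unoptimized` can switch off a fast path, here is a small sketch. `SketchCompileOptions` and `chooseStrategy` are hypothetical names that mirror the shape of the PR's `Compiler.CompileOptions` but are not part of it, and the returned strings are placeholders for whatever instructions a real code generator would emit.

```swift
/// Hypothetical stand-in for the PR's `Compiler.CompileOptions`.
struct SketchCompileOptions: OptionSet {
  let rawValue: Int
  static let unoptimized = SketchCompileOptions(rawValue: 1)
  static let `default`: SketchCompileOptions = []
}

/// Hypothetical code-generation decision: take the fast path only when the
/// character class is ASCII-only and optimizations have not been disabled.
func chooseStrategy(isAsciiOnly: Bool, options: SketchCompileOptions) -> String {
  if isAsciiOnly, !options.contains(.unoptimized) {
    return "bitset"
  }
  return "general"
}
```

Because the default is the empty set, existing call sites keep the optimized behavior, and only tests that pass `.unoptimized` see the general path.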
12 changes: 11 additions & 1 deletion Sources/_StringProcessing/Regex/Core.swift
@@ -80,6 +80,9 @@ extension Regex {
     /// likely, compilation/caching.
     let tree: DSLTree

+    /// OptionSet of compiler options for testing purposes
+    fileprivate var compileOptions: Compiler.CompileOptions = .default
+
     private final class ProgramBox {
       let value: MEProgram
       init(_ value: MEProgram) { self.value = value }
@@ -93,7 +96,7 @@ extension Regex {
       if let loweredObject = _loweredProgramStorage as? ProgramBox {
         return loweredObject.value
       }
-      let lowered = try! Compiler(tree: tree).emit()
+      let lowered = try! Compiler(tree: tree, compileOptions: compileOptions).emit()
       _stdlib_atomicInitializeARCRef(object: &_loweredProgramStorage, desired: ProgramBox(lowered))
       return lowered
     }
@@ -132,3 +135,10 @@ extension Regex {
     self.program = Program(tree: .init(node))
   }
 }
+
+@available(SwiftStdlib 5.7, *)
+extension Regex {
+  internal mutating func _setCompilerOptionsForTesting(_ opts: Compiler.CompileOptions) {
+    program.compileOptions = opts
+  }
+}
24 changes: 20 additions & 4 deletions Tests/RegexTests/MatchTests.swift
@@ -14,22 +14,32 @@ import XCTest
 @testable import _StringProcessing

 struct MatchError: Error {
-    var message: String
-    init(_ message: String) {
-        self.message = message
-    }
+  var message: String
+  init(_ message: String) {
+    self.message = message
+  }
 }

 func _firstMatch(
   _ regexStr: String,
   input: String,
+  validate: Bool,
   syntax: SyntaxOptions = .traditional
 ) throws -> (String, [String?]) {
   let regex = try Regex(regexStr, syntax: syntax)
   guard let result = try regex.firstMatch(in: input) else {
     throw MatchError("match not found for \(regexStr) in \(input)")
   }
   let caps = result.output.slices(from: input)
+
+  if validate {
+    var unoptRegex = try Regex(regexStr, syntax: syntax)
+    unoptRegex._setCompilerOptionsForTesting(.unoptimized)
+    guard let unoptResult = try unoptRegex.firstMatch(in: input) else {
+      throw MatchError("match not found for unoptimized \(regexStr) in \(input)")
+    }
+    XCTAssertEqual(String(input[result.range]), String(input[unoptResult.range]))
+  }
   return (String(input[result.range]), caps.map { $0.map(String.init) })
 }
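
The `validate` branch above is a form of differential testing: the optimized result is checked against an unoptimized reference on the same input. A generic version of that idea can be sketched without XCTest or the regex engine. The helper `disagreements` and the two digit predicates below are hypothetical and exist only to show the pattern.

```swift
/// Hypothetical helper: runs two implementations on every input and
/// returns the inputs on which their results differ.
func disagreements<Input, Output: Equatable>(
  inputs: [Input],
  optimized: (Input) -> Output,
  reference: (Input) -> Output
) -> [Input] {
  inputs.filter { optimized($0) != reference($0) }
}

/// Fast path (illustrative): a range check on the ASCII value.
func isDigitFast(_ c: Character) -> Bool {
  guard let value = c.asciiValue else { return false }
  return (48...57).contains(value)
}

/// Reference path (illustrative): the standard library's character properties.
func isDigitReference(_ c: Character) -> Bool {
  c.isASCII && c.isNumber
}

// An empty result means the fast path agrees with the reference on every input.
let mismatches = disagreements(
  inputs: Array("a1 9é٣z0"),
  optimized: isDigitFast,
  reference: isDigitReference)
```

Returning the disagreeing inputs, rather than a single Boolean, makes a failure report point directly at the cases where the two paths diverge.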

@@ -41,6 +51,7 @@ func flatCaptureTest(
   syntax: SyntaxOptions = .traditional,
   dumpAST: Bool = false,
   xfail: Bool = false,
+  validate: Bool = true,
   file: StaticString = #file,
   line: UInt = #line
 ) {
@@ -49,6 +60,7 @@ func flatCaptureTest(
     guard var (_, caps) = try? _firstMatch(
       regex,
       input: test,
+      validate: validate,
       syntax: syntax
     ) else {
       if expect == nil {
@@ -98,6 +110,7 @@ func matchTest(
   enableTracing: Bool = false,
   dumpAST: Bool = false,
   xfail: Bool = false,
+  validate: Bool = true,
   file: StaticString = #file,
   line: UInt = #line
 ) {
@@ -110,6 +123,7 @@ func matchTest(
       enableTracing: enableTracing,
       dumpAST: dumpAST,
       xfail: xfail,
+      validate: validate,
       file: file,
       line: line)
   }
@@ -126,13 +140,15 @@ func firstMatchTest(
   enableTracing: Bool = false,
   dumpAST: Bool = false,
   xfail: Bool = false,
+  validate: Bool = true,
   file: StaticString = #filePath,
   line: UInt = #line
 ) {
   do {
     let (found, _) = try _firstMatch(
       regex,
       input: input,
+      validate: validate,
       syntax: syntax)

     if xfail {