Commit 820ab38

Regex Type and Overview V2 and accompanying tests/changes (#241)
* Clarify contractions
* Motivation tests, API updates, and text
1 parent c45450f commit 820ab38

12 files changed: +524 -114 lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 28 additions & 5 deletions

@@ -2,27 +2,38 @@
 Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
 -->

-# Regex Syntax
+# Run-time Regex Construction

-- Authors: Hamish Knight, Michael Ilseman
+- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)

 ## Introduction

-A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. Regexes can be created from a string at run time or from a literal at compile time. The contents of that run-time string, or the contents in-between the compile-time literal's delimiters, uses regex syntax. We present a detailed and comprehensive treatment of regex syntax.
-
-This is part of a larger effort in supporting regex literals, which in turn is part of a larger effort towards better string processing using regex. See [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107), which tracks each relevant piece. This proposal regards _syntactic_ support, and does not necessarily mean that everything that can be written will be supported by Swift's runtime engine in the initial release. Support for more obscure features may appear over time, see [MatchingEngine Capabilities and Roadmap](https://github.com/apple/swift-experimental-string-processing/issues/99) for status.
+A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. We propose the ability to create a regex at run time from a string containing regex syntax (detailed here), API for accessing the match and captures, and a means to convert between an existential capture representation and concrete types.

+The overall story is laid out in [Regex Type and Overview](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexTypeOverview.md) and each individual component is tracked in [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107).

 ## Motivation

 Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.

+<!--
+... tools need run time construction
+... ns regular expression operates over a fundamentally different model and has limited syntactic and semantic support
+... we prpose a best-in-class treatment of familiar regex syntax
+-->
+
 The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.

 This proposal specifically hones in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.

 ## Proposed Solution

+<!--
+... regex compiling and existential match type
+-->
+
+### Syntax
+
 We propose accepting a syntactic "superset" of the following existing regular expression engines:

 - [PCRE 2][pcre2-syntax], an "industry standard" and a rough superset of Perl, Python, etc.
@@ -40,6 +51,10 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b

 ## Detailed Design

+<!--
+... init, dynamic match, conversion to static
+-->
+
 We propose the following syntax for regex.

 <details><summary>Grammar Notation</summary>
@@ -832,6 +847,14 @@ Regex syntax will become part of Swift's source and binary-compatibility story,
 Even though it is more work up-front and creates a longer proposal, it is less risky to support the full intended syntax. The proposed superset maximizes the familiarity benefit of regex syntax.


+<!--
+
+### TODO: Semantic capabilities
+
+This proposal regards _syntactic_ support, and does not necessarily mean that everything that can be parsed will be supported by Swift's engine in the initial release. Support for more obscure features may appear over time, see [MatchingEngine Capabilities and Roadmap](https://github.com/apple/swift-experimental-string-processing/issues/99) for status.
+
+-->
+
 [pcre2-syntax]: https://www.pcre.org/current/doc/html/pcre2syntax.html
 [oniguruma-syntax]: https://github.com/kkos/oniguruma/blob/master/doc/RE
 [icu-syntax]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
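
As a rough illustration of the run-time construction this document now proposes, the sketch below compiles a pattern from a string and reads a capture through the existential output. It assumes a Swift 5.7 or later toolchain, where the initializer is spelled `Regex(_:)` rather than the `Regex(compiling:)` spelling used in this commit. `firstCapture(of:in:)` is a hypothetical helper and not part of the proposal.

```swift
// Assumption: Swift 5.7+ standard library. This commit spells the
// initializer `Regex(compiling:)`; shipped toolchains spell it `Regex(_:)`.

/// Hypothetical helper: text of the first capture group of the first match.
/// Throws if `pattern` is not valid regex syntax.
func firstCapture(of pattern: String, in input: String) throws -> String? {
  // Parsing happens at run time, so a syntax error surfaces as a thrown error.
  let regex = try Regex(pattern)  // Regex<AnyRegexOutput>
  guard let match = input.firstMatch(of: regex) else { return nil }
  // Element 0 is the whole match; captures start at index 1.
  guard match.output.count > 1, let text = match.output[1].substring else {
    return nil
  }
  return String(text)
}

do {
  print(try firstCapture(of: "a(.*)b", in: "cbaxb") ?? "no capture")  // x
  _ = try firstCapture(of: "a(", in: "cbaxb")  // unbalanced group, throws
} catch {
  print("invalid pattern: \(error)")
}
```

Because the pattern is only known at run time, the capture count and types cannot be checked by the compiler, which is why the result is read through an index and an optional `substring`.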

Documentation/Evolution/RegexTypeOverview.md

Lines changed: 41 additions & 17 deletions

@@ -149,11 +149,11 @@ Type mismatches and invalid regex syntax are diagnosed at construction time by `
 When the pattern is known at compile time, regexes can be created from a literal containing the same regex syntax, allowing the compiler to infer the output type. Regex literals enable source tools, e.g. syntax highlighting and actions to refactor into a result builder equivalent.

 ```swift
-let regex = re'(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)'
+let regex = /(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)/
 // regex: Regex<(Substring, Substring, Substring, Substring, Substring)>
 ```

-*Note*: Regex literals, most notably the choice of delimiter, are discussed in [Regex Literals][pitches]. For this example, I used the less technically-problematic option of `re'...'`.
+*Note*: Regex literals, most notably the choice of delimiter, are discussed in [Regex Literals][pitches].

 This same regex can be created from a result builder, a refactoring-friendly representation:

@@ -193,13 +193,13 @@ A `Regex<Output>.Match` contains the result of a match, surfacing captures by nu

 ```swift
 func processEntry(_ line: String) -> Transaction? {
-  let regex = re'''
-    (?x) # Ignore whitespace and comments
+  // Multiline literal implies `(?x)`, i.e. non-semantic whitespace with line-ending comments
+  let regex = #/
     (?<kind> \w+) \s\s+
     (?<date> \S+) \s\s+
     (?<account> (?: (?!\s\s) . )+) \s\s+
     (?<amount> .*)
-    '''
+  /#
   // regex: Regex<(
   //   Substring,
   //   kind: Substring,
@@ -291,7 +291,7 @@ A regex describes an algorithm to be ran over some model of string, and Swift's

 Calling `dropFirst()` will not drop a leading byte or `Unicode.Scalar`, but rather a full `Character`. Similarly, a `.` in a regex will match any extended grapheme cluster. A regex will match canonical equivalents by default, strengthening the connection between regex and the equivalent `String` operations.

-Additionally, word boundaries (`\b`) follow [UTS\#29 Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries), meaning contractions ("don't") and script changes are detected and separated, without incurring significant binary size costs associated with language dictionaries.
+Additionally, word boundaries (`\b`) follow [UTS\#29 Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries). Contractions ("don't") are correctly detected and script changes are separated, without incurring significant binary size costs associated with language dictionaries.

 Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_Unicode_Support) by default, but provides options to switch to scalar-level processing as well as compatibility character classes. Detailed rules on how we infer necessary grapheme cluster breaks inside regexes, as well as options and other concerns, are discussed in [Unicode for String Processing][pitches].

@@ -300,18 +300,47 @@ Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_U

 ```swift
 /// A regex represents a string processing algorithm.
+///
+///     let regex = try Regex(compiling: "a(.*)b")
+///     let match = "cbaxb".firstMatch(of: regex)
+///     print(match.0) // "axb"
+///     print(match.1) // "x"
+///
 public struct Regex<Output> {
   /// Match a string in its entirety.
   ///
   /// Returns `nil` if no match and throws on abort
-  public func matchWhole(_: String) throws -> Match?
+  public func matchWhole(_ s: String) throws -> Regex<Output>.Match?

-  /// Match at the front of a string
+  /// Match part of the string, starting at the beginning.
   ///
   /// Returns `nil` if no match and throws on abort
-  public func matchFront(_: String) throws -> Match?
+  public func matchPrefix(_ s: String) throws -> Regex<Output>.Match?
+
+  /// Find the first match in a string
+  ///
+  /// Returns `nil` if no match is found and throws on abort
+  public func firstMatch(in s: String) throws -> Regex<Output>.Match?
+
+  /// Match a substring in its entirety.
+  ///
+  /// Returns `nil` if no match and throws on abort
+  public func matchWhole(_ s: Substring) throws -> Regex<Output>.Match?
+
+  /// Match part of the string, starting at the beginning.
+  ///
+  /// Returns `nil` if no match and throws on abort
+  public func matchPrefix(_ s: Substring) throws -> Regex<Output>.Match?
+
+  /// Find the first match in a substring
+  ///
+  /// Returns `nil` if no match is found and throws on abort
+  public func firstMatch(_ s: Substring) throws -> Regex<Output>.Match?

   /// The result of matching a regex against a string.
+  ///
+  /// A `Match` forwards API to the `Output` generic parameter,
+  /// providing direct access to captures.
   @dynamicMemberLookup
   public struct Match {
     /// The range of the overall match
@@ -320,7 +349,7 @@ public struct Regex<Output> {
     /// The produced output from the match operation
     public var output: Output

-    /// Lookup a capture by number
+    /// Lookup a capture by name or number
     public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T

     /// Lookup a capture by number
@@ -342,11 +371,6 @@ public struct Regex<Output> {
 extension Regex: RegexComponent {
   public var regex: Regex<Output> { self }

-  /// Create a regex out of a single component
-  public init<Content: RegexComponent>(
-    _ content: Content
-  ) where Content.Output == Output
-
   /// Result builder interface
   public init<Content: RegexComponent>(
     @RegexComponentBuilder _ content: () -> Content
@@ -360,11 +384,11 @@ extension Regex.Match {

 // Run-time compilation interfaces
 extension Regex {
-  /// Parse and compile `pattern`.
+  /// Parse and compile `pattern`, resulting in a strongly-typed capture list.
   public init(compiling pattern: String, as: Output.Type = Output.self) throws
 }
 extension Regex where Output == AnyRegexOutput {
-  /// Parse and compile `pattern`.
+  /// Parse and compile `pattern`, resulting in an existentially-typed capture list.
   public init(compiling pattern: String) throws
 }
 ```
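
The difference between the three matching entry points in the listing above can be sketched as follows. This assumes a Swift 5.7 or later toolchain, where the corresponding methods are spelled `wholeMatch(in:)`, `prefixMatch(in:)`, and `firstMatch(in:)` rather than `matchWhole`, `matchPrefix`, and `firstMatch` as in this commit. `matchedTexts(for:)` is a hypothetical helper.

```swift
// Assumption: Swift 5.7+ standard library spellings, not this commit's names.

/// Hypothetical helper: the text covered by each kind of match, or nil.
func matchedTexts(
  for input: String
) throws -> (whole: String?, prefix: String?, first: String?) {
  let regex = try Regex("a(.*)b")
  // Turn an optional match into the text its range covers.
  func text(_ match: Regex<AnyRegexOutput>.Match?) -> String? {
    match.map { String(input[$0.range]) }
  }
  return (
    whole: text(try regex.wholeMatch(in: input)),    // must cover all of input
    prefix: text(try regex.prefixMatch(in: input)),  // must start at the front
    first: text(try regex.firstMatch(in: input))     // may start anywhere
  )
}

do {
  // "axb":   whole, prefix, and first all cover "axb".
  // "axbc":  whole is nil; prefix and first cover "axb".
  // "cbaxb": whole and prefix are nil; first covers "axb".
  for input in ["axb", "axbc", "cbaxb"] {
    print(input, try matchedTexts(for: input))
  }
} catch {
  print("matching failed: \(error)")
}
```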

Sources/Exercises/Participants/RegexParticipant.swift

Lines changed: 2 additions & 2 deletions

@@ -63,7 +63,7 @@ private func graphemeBreakPropertyData<RP: RegexComponent>(
   forLine line: String,
   using regex: RP
 ) -> GraphemeBreakEntry? where RP.Output == (Substring, Substring, Substring?, Substring) {
-  line.match(regex).map(\.output).flatMap(extractFromCaptures)
+  line.matchWhole(regex).map(\.output).flatMap(extractFromCaptures)
 }

 private func graphemeBreakPropertyDataLiteral(
@@ -80,7 +80,7 @@ private func graphemeBreakPropertyDataLiteral(
 private func graphemeBreakPropertyData(
   forLine line: String
 ) -> GraphemeBreakEntry? {
-  line.match {
+  line.matchWhole {
     TryCapture(OneOrMore(.hexDigit)) { Unicode.Scalar(hex: $0) }
     Optionally {
       ".."

Sources/RegexBuilder/Match.swift

Lines changed: 16 additions & 4 deletions

@@ -12,17 +12,29 @@
 import _StringProcessing

 extension String {
-  public func match<R: RegexComponent>(
+  public func matchWhole<R: RegexComponent>(
     @RegexComponentBuilder _ content: () -> R
   ) -> Regex<R.Output>.Match? {
-    match(content())
+    matchWhole(content())
+  }
+
+  public func matchPrefix<R: RegexComponent>(
+    @RegexComponentBuilder _ content: () -> R
+  ) -> Regex<R.Output>.Match? {
+    matchPrefix(content())
   }
 }

 extension Substring {
-  public func match<R: RegexComponent>(
+  public func matchWhole<R: RegexComponent>(
+    @RegexComponentBuilder _ content: () -> R
+  ) -> Regex<R.Output>.Match? {
+    matchWhole(content())
+  }
+
+  public func matchPrefix<R: RegexComponent>(
     @RegexComponentBuilder _ content: () -> R
   ) -> Regex<R.Output>.Match? {
-    match(content())
+    matchPrefix(content())
   }
 }
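
The builder-closure overloads added above can be exercised roughly as follows. This is a sketch that assumes a Swift 5.7 or later toolchain, where the equivalent methods are spelled `wholeMatch` and `prefixMatch` instead of `matchWhole` and `matchPrefix`. `leadingHex(of:)` and `isAllHex(_:)` are hypothetical helpers.

```swift
import RegexBuilder  // Assumption: Swift 5.7+ toolchain

/// Hypothetical helper: the leading run of hex digits in `line`, if any.
func leadingHex(of line: String) -> Substring? {
  // The trailing closure is a result-builder body, like `matchPrefix { ... }`.
  let match = line.prefixMatch {
    Capture { OneOrMore(.hexDigit) }
  }
  return match?.output.1  // output is (whole match, capture)
}

/// Hypothetical helper: whether `line` consists only of hex digits.
func isAllHex(_ line: String) -> Bool {
  let match = line.wholeMatch { OneOrMore(.hexDigit) }
  return match != nil
}

print(leadingHex(of: "0041..005A") ?? "none")  // 0041
print(isAllHex("0041..005A"))                  // false
```

Splitting the old `match` into whole and prefix variants makes the anchoring explicit at the call site: the same builder body succeeds as a prefix match on `"0041..005A"` but fails as a whole match.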

Sources/_StringProcessing/Algorithms/Consumers/RegexConsumer.swift

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ extension RegexConsumer {
   func _matchingConsuming(
     _ consumed: Substring, in range: Range<String.Index>
   ) -> (upperBound: String.Index, match: Match)? {
-    guard let result = regex._match(
+    guard let result = try! regex._match(
       consumed.base,
       in: range, mode: .partialFromFront
     ) else { return nil }

Sources/_StringProcessing/Regex/AnyRegexOutput.swift

Lines changed: 12 additions & 1 deletion

@@ -12,7 +12,18 @@
 import _RegexParser

 extension Regex where Output == AnyRegexOutput {
-  public init(_ pattern: String) throws {
+  /// Parse and compile `pattern`, resulting in an existentially-typed capture list.
+  public init(compiling pattern: String) throws {
+    self.init(ast: try parse(pattern, .traditional))
+  }
+}
+
+extension Regex {
+  /// Parse and compile `pattern`, resulting in a strongly-typed capture list.
+  public init(
+    compiling pattern: String,
+    as: Output.Type = Output.self
+  ) throws {
     self.init(ast: try parse(pattern, .traditional))
   }
 }
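
The two initializers above differ only in how captures are typed. A minimal sketch of that difference, assuming a Swift 5.7 or later toolchain where they are spelled `Regex(_:)` and `Regex(_:as:)` instead of `Regex(compiling:)` and `Regex(compiling:as:)`; both helper functions are hypothetical:

```swift
// Assumption: Swift 5.7+ standard library spellings, not this commit's names.

/// Hypothetical helper: first match with captures typed as a tuple.
func typedCaptures(in input: String) throws -> (Substring, Substring)? {
  // `as:` states the expected output type for the run-time pattern.
  let regex = try Regex("a(.*)b", as: (Substring, Substring).self)
  return try regex.firstMatch(in: input)?.output
}

/// Hypothetical helper: the same match read through the existential output.
func erasedCaptures(in input: String) throws -> [Substring?] {
  let regex = try Regex("a(.*)b")  // Regex<AnyRegexOutput>
  guard let match = try regex.firstMatch(in: input) else { return [] }
  // Each element carries an optional substring; index 0 is the whole match.
  return match.output.map { $0.substring }
}

do {
  if let (whole, capture) = try typedCaptures(in: "cbaxb") {
    print(whole, capture)  // axb x
  }
  print(try erasedCaptures(in: "cbaxb").count)  // 2
} catch {
  print("regex error: \(error)")
}
```

The typed form gives tuple access checked by the compiler at the use site, while the existential form suits tools that only learn the pattern, and therefore its capture structure, at run time.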

Sources/_StringProcessing/Regex/Core.swift

Lines changed: 43 additions & 33 deletions

@@ -18,8 +18,48 @@ public protocol RegexComponent {
   var regex: Regex<Output> { get }
 }

-/// A regular expression.
+/// A regex represents a string processing algorithm.
+///
+///     let regex = try Regex(compiling: "a(.*)b")
+///     let match = "cbaxb".firstMatch(of: regex)
+///     print(match.0) // "axb"
+///     print(match.1) // "x"
+///
 public struct Regex<Output>: RegexComponent {
+  let program: Program
+
+  var hasCapture: Bool {
+    program.tree.hasCapture
+  }
+
+  init(ast: AST) {
+    self.program = Program(ast: ast)
+  }
+  init(ast: AST.Node) {
+    self.program = Program(ast: .init(ast, globalOptions: nil))
+  }
+
+  // Compiler interface. Do not change independently.
+  @usableFromInline
+  init(_regexString pattern: String) {
+    self.init(ast: try! parse(pattern, .traditional))
+  }
+
+  // Compiler interface. Do not change independently.
+  @usableFromInline
+  init(_regexString pattern: String, version: Int) {
+    assert(version == currentRegexLiteralFormatVersion)
+    // The version argument is passed by the compiler using the value defined
+    // in libswiftParseRegexLiteral.
+    self.init(ast: try! parseWithDelimiters(pattern))
+  }
+
+  public var regex: Regex<Output> {
+    self
+  }
+}
+
+extension Regex {
   /// A program representation that caches any lowered representation for
   /// execution.
   internal class Program {
@@ -41,49 +81,19 @@ public struct Regex<Output>: RegexComponent {
       self.tree = tree
     }
   }
+}

-  let program: Program
-  // var ast: AST { program.ast }
-
+extension Regex {
   @_spi(RegexBuilder)
   public var root: DSLTree.Node {
     program.tree.root
   }

-  var hasCapture: Bool {
-    program.tree.hasCapture
-  }
-
-  init(ast: AST) {
-    self.program = Program(ast: ast)
-  }
-  init(ast: AST.Node) {
-    self.program = Program(ast: .init(ast, globalOptions: nil))
-  }
-
   @_spi(RegexBuilder)
   public init(node: DSLTree.Node) {
     self.program = Program(tree: .init(node, options: nil))
   }

-  // Compiler interface. Do not change independently.
-  @usableFromInline
-  init(_regexString pattern: String) {
-    self.init(ast: try! parse(pattern, .traditional))
-  }
-
-  // Compiler interface. Do not change independently.
-  @usableFromInline
-  init(_regexString pattern: String, version: Int) {
-    assert(version == currentRegexLiteralFormatVersion)
-    // The version argument is passed by the compiler using the value defined
-    // in libswiftParseRegexLiteral.
-    self.init(ast: try! parseWithDelimiters(pattern))
-  }
-
-  public var regex: Regex<Output> {
-    self
-  }
 }

 // MARK: - Primitive regex components
