Documentation/Evolution/RegexSyntax.md
+28 -5 (28 additions & 5 deletions)
@@ -2,27 +2,38 @@
2
2
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. Regexes can be created from a string at run time or from a literal at compile time. The contents of that run-time string, or the contents in-between the compile-time literal's delimiters, use regex syntax. We present a detailed and comprehensive treatment of regex syntax.
12
-
13
-
This is part of a larger effort in supporting regex literals, which in turn is part of a larger effort towards better string processing using regex. See [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107), which tracks each relevant piece. This proposal regards _syntactic_ support, and does not necessarily mean that everything that can be written will be supported by Swift's runtime engine in the initial release. Support for more obscure features may appear over time; see [MatchingEngine Capabilities and Roadmap](https://github.com/apple/swift-experimental-string-processing/issues/99) for status.
11
+
A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. We propose the ability to create a regex at run time from a string containing regex syntax (detailed here), API for accessing the match and captures, and a means to convert between an existential capture representation and concrete types.
14
12
13
+
The overall story is laid out in [Regex Type and Overview](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexTypeOverview.md) and each individual component is tracked in [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107).
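As a rough illustration of the run-time construction, match access, and existential-to-concrete conversion named in the introduction, here is a minimal sketch. It assumes the API as it later shipped in the Swift 5.7 standard library (`Regex.init(_:)`, `firstMatch(of:)`, `AnyRegexOutput`, and the failable converting initializer); this pitch does not itself fix those names. The function `parseKindAndDate` and the sample input are hypothetical.

```swift
// Sketch only: assumes the Swift 5.7+ standard library
// (macOS 13 / iOS 16 or later on Apple platforms).
// `parseKindAndDate` and the sample line are hypothetical, not part of the proposal.
func parseKindAndDate(_ line: String) throws -> (kind: Substring, date: Substring)? {
    // Run-time construction from a string; throws if the pattern has invalid syntax.
    let dynamicRegex = try Regex(#"(\w+)\s\s+(\S+)"#)   // Regex<AnyRegexOutput>

    // Dynamic access: the existential output is a collection of captures.
    if let dynamicMatch = line.firstMatch(of: dynamicRegex) {
        print("dynamic captures:", dynamicMatch.output.count)
    }

    // Conversion to a concrete output type; nil if the capture structure does not fit.
    guard let typedRegex = Regex<(Substring, Substring, Substring)>(dynamicRegex),
          let match = line.firstMatch(of: typedRegex) else {
        return nil
    }
    return (kind: match.output.1, date: match.output.2)
}

let entry = try parseKindAndDate("CREDIT  03/02/2022")
```

The conversion step is where a tool that accepts user-supplied patterns would decide whether to keep working dynamically or to commit to a static capture shape.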
15
14
16
15
## Motivation
17
16
18
17
Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.
19
18
19
+
<!--
20
+
... tools need run time construction
21
+
... NSRegularExpression operates over a fundamentally different model and has limited syntactic and semantic support
22
+
... we propose a best-in-class treatment of familiar regex syntax
23
+
-->
24
+
20
25
The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.
21
26
22
27
This proposal specifically homes in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.
23
28
24
29
## Proposed Solution
25
30
31
+
<!--
32
+
... regex compiling and existential match type
33
+
-->
34
+
35
+
### Syntax
36
+
26
37
We propose accepting a syntactic "superset" of the following existing regular expression engines:
27
38
28
39
-[PCRE 2][pcre2-syntax], an "industry standard" and a rough superset of Perl, Python, etc.
@@ -40,6 +51,10 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b
40
51
41
52
## Detailed Design
42
53
54
+
<!--
55
+
... init, dynamic match, conversion to static
56
+
-->
57
+
43
58
We propose the following syntax for regex.
44
59
45
60
<details><summary>Grammar Notation</summary>
@@ -832,6 +847,14 @@ Regex syntax will become part of Swift's source and binary-compatibility story,
832
847
Even though it is more work up-front and creates a longer proposal, it is less risky to support the full intended syntax. The proposed superset maximizes the familiarity benefit of regex syntax.
833
848
834
849
850
+
<!--
851
+
852
+
### TODO: Semantic capabilities
853
+
854
+
This proposal regards _syntactic_ support, and does not necessarily mean that everything that can be parsed will be supported by Swift's engine in the initial release. Support for more obscure features may appear over time; see [MatchingEngine Capabilities and Roadmap](https://github.com/apple/swift-experimental-string-processing/issues/99) for status.
Documentation/Evolution/RegexTypeOverview.md
+41 -17 (41 additions & 17 deletions)
@@ -149,11 +149,11 @@ Type mismatches and invalid regex syntax are diagnosed at construction time by `
149
149
When the pattern is known at compile time, regexes can be created from a literal containing the same regex syntax, allowing the compiler to infer the output type. Regex literals enable source tools, e.g. syntax highlighting and actions to refactor into a result builder equivalent.
150
150
151
151
```swift
152
-
let regex = re'(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)'
152
+
let regex = /(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)/
*Note*: Regex literals, most notably the choice of delimiter, are discussed in [Regex Literals][pitches]. For this example, I used the less technically-problematic option of `re'...'`.
156
+
*Note*: Regex literals, most notably the choice of delimiter, are discussed in [Regex Literals][pitches].
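To make the output-type inference concrete, here is a small sketch using the extended `#/.../#` delimiters. It assumes the literal syntax as it later shipped in Swift 5.7; the delimiter choice was still under discussion at the time of this pitch, and the sample line is made up.

```swift
// Sketch only: assumes Swift 5.7+, where `#/.../#` literals need no extra compiler flags.
let line = "DEBIT  03/05/2022"   // hypothetical input

// From the named captures, the compiler infers
// Regex<(Substring, kind: Substring, date: Substring)>.
let literal = #/(?<kind>\w+)\s\s+(?<date>\S+)/#

if let match = line.firstMatch(of: literal) {
    // Captures are statically typed and reachable by name.
    print(match.output.kind, match.output.date)
}
```

Because the pattern is visible to the compiler, a typo in a capture name or in the regex syntax would surface at build time rather than when the pattern is first constructed.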
157
157
158
158
This same regex can be created from a result builder, a refactoring-friendly representation:
159
159
@@ -193,13 +193,13 @@ A `Regex<Output>.Match` contains the result of a match, surfacing captures by nu
193
193
194
194
```swift
195
195
func processEntry(_ line: String) -> Transaction? {
196
-
let regex = re'''
197
-
(?x) # Ignore whitespace and comments
196
+
// Multiline literal implies `(?x)`, i.e. non-semantic whitespace with line-ending comments
197
+
let regex = #/
198
198
(?<kind>\w+) \s\s+
199
199
(?<date>\S+) \s\s+
200
200
(?<account> (?: (?!\s\s) . )+) \s\s+
201
201
(?<amount>.*)
202
-
'''
202
+
/#
203
203
// regex: Regex<(
204
204
// Substring,
205
205
// kind: Substring,
@@ -291,7 +291,7 @@ A regex describes an algorithm to be ran over some model of string, and Swift's
291
291
292
292
Calling `dropFirst()` will not drop a leading byte or `Unicode.Scalar`, but rather a full `Character`. Similarly, a `.` in a regex will match any extended grapheme cluster. A regex will match canonical equivalents by default, strengthening the connection between regex and the equivalent `String` operations.
293
293
294
-
Additionally, word boundaries (`\b`) follow [UTS\#29 Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries), meaning contractions ("don't") and script changes are detected and separated, without incurring significant binary size costs associated with language dictionaries.
294
+
Additionally, word boundaries (`\b`) follow [UTS\#29 Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries). Contractions ("don't") are correctly detected and script changes are separated, without incurring significant binary size costs associated with language dictionaries.
295
295
296
296
Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_Unicode_Support) by default, but provides options to switch to scalar-level processing as well as compatibility character classes. Detailed rules on how we infer necessary grapheme cluster breaks inside regexes, as well as options and other concerns, are discussed in [Unicode for String Processing][pitches].
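The `Character`-level model that `dropFirst()` and `.` share can be observed with plain `String`, independent of any regex API. A minimal sketch follows; the sample string is an assumption chosen to contain a combining mark.

```swift
// "é" written as "e" + U+0301 (combining acute accent): two scalars, one Character.
let decomposed = "e\u{301}clair"
let precomposed = "\u{E9}clair"

// String compares by canonical equivalence, so both spellings are equal.
let equal = decomposed == precomposed

// dropFirst() on String removes the whole grapheme cluster...
let afterCharacter = decomposed.dropFirst()

// ...whereas the scalar view removes only the base letter, leaving the accent behind.
let afterScalar = decomposed.unicodeScalars.dropFirst()

print(equal, afterCharacter, decomposed.count, decomposed.unicodeScalars.count)
print(afterScalar.first?.value == 0x301)
```

Under the default semantics described here, a regex `.` would correspond to the `Character`-level behaviour, while the scalar view shows what the opt-in scalar-level processing operates on.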