
revised algorithm 5 7 #2

Closed
wants to merge 47 commits into from
47 commits
93a894e
Merge pull request #225 from rxwei/main-integration-50ec05d
rxwei Mar 22, 2022
45ab195
Pitch: String processing algorithms (#188)
itingliu Mar 22, 2022
9789993
RegexBuilder module
rxwei Mar 24, 2022
79066a8
Merge pull request #231 from rxwei/main-integration-bb1f34a
rxwei Mar 29, 2022
9e330ba
Merge branch 'main' of github.com:apple/swift-experimental-string-pro…
rxwei Mar 30, 2022
044be96
Merge pull request #235 from rxwei/main-integration-d2ff78f6
rxwei Mar 31, 2022
a989eae
Merge branch 'main' into main-merge
hamishknight Apr 4, 2022
b583909
Merge pull request #244 from hamishknight/main-merge
hamishknight Apr 4, 2022
1a96ea8
Fill out remainder of options API (#246)
natecook1000 Apr 7, 2022
d34daf6
Clean up based on the String Processing Algorithms proposal (#247)
itingliu Apr 7, 2022
5f31de8
Move `CharacterClass` API into RegexBuilder (#254)
natecook1000 Apr 8, 2022
cc91315
Eliminate extra public API (#256)
natecook1000 Apr 8, 2022
b86ca70
Update regex syntax pitch (#258)
milseman Apr 8, 2022
e2e3d63
Throwing customization hooks (#261)
milseman Apr 10, 2022
57d8db7
Nominalize API names (#271)
milseman Apr 12, 2022
3f63265
Add SwiftStdlib 5.7 availability (#276)
rxwei Apr 14, 2022
f144abc
Rename RegexComponent.Output (#281)
natecook1000 Apr 14, 2022
315c418
Move RegexComponent conformances to RegexBuilder (#279)
natecook1000 Apr 14, 2022
ba032b6
Merge pull request #283 from rxwei/fix-availability
rxwei Apr 15, 2022
fde4c58
Add Substring algorithms tests (#289)
natecook1000 Apr 18, 2022
d002466
Merge pull request #273 from itingliu/throwing-hooks
rxwei Apr 18, 2022
3c43286
RegexBuilder quantifiers take an optional behavior (#293)
natecook1000 Apr 18, 2022
0d41bb2
Nominalize option methods (#295)
natecook1000 Apr 18, 2022
3cd65cd
Merge pull request #287 from apple/impl-import
rxwei Apr 15, 2022
51756fb
Remove compiling argument label
milseman Apr 20, 2022
2d9de48
Merge pull request #1 from milseman/5_7_azoy
Azoy Apr 20, 2022
115a937
Merge pull request #298 from Azoy/da-api-mon
Azoy Apr 21, 2022
65ef2ae
Expose `matches`, `ranges` and `split` (#304)
itingliu Apr 19, 2022
3f2832d
Fix HexDigit definition in RegexSyntax.md
hamishknight Apr 21, 2022
3cce15d
Remove AST CustomCharacterClass consumer generation
hamishknight Apr 21, 2022
577dc6e
Convert scalar escape sequences to DSL scalars
hamishknight Apr 21, 2022
2d1de9e
Allow custom character classes to begin with `:`
hamishknight Apr 21, 2022
5912ab4
Allow POSIX character properties outside of custom character classes
hamishknight Apr 21, 2022
c638486
Fix character class trivia matching
hamishknight Apr 21, 2022
f053dc3
Fix trivia parsing for set operations and initial `]` cases
hamishknight Apr 21, 2022
e84c93d
Throw error if we encounter stray opening '('
hamishknight Apr 21, 2022
1ea6f20
Change matching option scoping behavior to match PCRE
hamishknight Apr 21, 2022
5cc0ea0
Error on unknown character properties
hamishknight Apr 21, 2022
771e735
Don't parse a character property containing a backslash
hamishknight Apr 21, 2022
b8a1a81
Adds RegexBuilder.CharacterClass.anyUnicodeScalar (#315)
natecook1000 Apr 21, 2022
82fcf4a
Allow setting any of the three quant behaviors (#311)
natecook1000 Apr 21, 2022
eba0393
Fixup for missing AST import separation
natecook1000 Apr 21, 2022
dad77c5
Merge pull request #313 from Azoy/matches-ranges-split
Azoy Apr 21, 2022
fc46753
Merge pull request #316 from natecook1000/unicode_api_5.7
natecook1000 Apr 22, 2022
29bc5da
Merge pull request #309 from hamishknight/parser-changes-5.7
hamishknight Apr 22, 2022
2745de2
Updates for algorithms proposal (#319)
milseman Apr 22, 2022
978cce1
Rename CustomPrefixMatchRC to CustomConsumingRegexComponent
itingliu Apr 22, 2022
@@ -2,7 +2,7 @@
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
-->

# Run-time Regex Construction
# Regex Syntax and Run-time Construction

- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)

@@ -16,21 +16,50 @@ The overall story is laid out in [Regex Type and Overview](https://github.com/ap

Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.

<!--
... tools need run time construction
... ns regular expression operates over a fundamentally different model and has limited syntactic and semantic support
... we prpose a best-in-class treatment of familiar regex syntax
-->
`NSRegularExpression` can construct a processing pipeline from a string containing [ICU regular expression syntax][icu-syntax]. However, it is inherently tied to ICU's engine and thus it operates over a fundamentally different model of string than Swift's `String`. It is also limited in features and carries a fair amount of Objective-C baggage, such as the need to translate between `NSRange` and `Range`.

```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let nsRegEx = try! NSRegularExpression(pattern: pattern)

func processEntry(_ line: String) -> Transaction? {
  let range = NSRange(line.startIndex..<line.endIndex, in: line)
  guard let result = nsRegEx.firstMatch(in: line, range: range),
        let kindRange = Range(result.range(at: 1), in: line),
        let kind = Transaction.Kind(line[kindRange]),
        let dateRange = Range(result.range(at: 2), in: line),
        let date = try? Date(String(line[dateRange]), strategy: dateParser),
        let accountRange = Range(result.range(at: 3), in: line),
        let amountRange = Range(result.range(at: 4), in: line),
        let amount = try? Decimal(
          String(line[amountRange]), format: decimalParser)
  else {
    return nil
  }

  return Transaction(
    kind: kind, date: date, account: String(line[accountRange]), amount: amount)
}
```

Fixing these fundamental limitations requires migrating to a completely different engine and type system representation. This is the path we're proposing with `Regex`, outlined in [Regex Type and Overview][overview]. Details on the semantic differences between ICU's string model and Swift's `String` are discussed in [Unicode for String Processing][pitches].

The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.

This proposal specifically homes in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.

## Proposed Solution

<!--
... regex compiling and existential match type
-->
We propose run-time construction of `Regex` from a best-in-class treatment of familiar regular expression syntax. A `Regex` is generic over its `Output`, which includes capture information. This may be an existential `AnyRegexOutput`, or a concrete type provided by the user.

```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(pattern)
// regex: Regex<AnyRegexOutput>

let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
try! Regex(pattern)
```
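As a rough sketch of how a dynamically constructed regex might be used, the following assumes the API as it later shipped in Swift 5.7 (`Regex(_:)`, `firstMatch(in:)`, and the `substring` property of `AnyRegexOutput.Element`), which may differ in detail from this revision of the pitch. The pattern and input are made up for illustration.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
// Hypothetical pattern and input, for illustration only.
let pattern = #"(\w+)\s+(\d+)"#
let regex = try Regex(pattern)   // Regex<AnyRegexOutput>

var captured: [Substring?] = []
if let match = try regex.firstMatch(in: "item 42") {
    // Element 0 is the whole match; the captures follow in order.
    for element in match.output {
        captured.append(element.substring)
    }
}
print(captured)
```

Because the capture structure is only known at run time, each element is inspected dynamically rather than through a tuple type.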

### Syntax

@@ -51,11 +80,87 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b

## Detailed Design

<!--
... init, dynamic match, conversion to static
-->
We propose initializers to declare and compile a regex from syntax. Upon failure, these initializers throw compilation errors, such as for syntax or type errors. API for retrieving error information is future work.

```swift
extension Regex {
  /// Parse and compile `pattern`, resulting in a strongly-typed capture list.
  public init(compiling pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
  /// Parse and compile `pattern`, resulting in an existentially-typed capture list.
  public init(compiling pattern: String) throws
}
```
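A hedged usage sketch of the strongly typed initializer follows. It assumes the unlabeled spelling `Regex(_:as:)` that shipped in Swift 5.7 rather than the `compiling:` label in the declaration above; the pattern and input are hypothetical.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
// Output is (whole match, first capture, second capture).
let span = try Regex(
    #"(\d+)-(\d+)"#,
    as: (Substring, Substring, Substring).self)

var bounds: (Substring, Substring)? = nil
if let match = try span.wholeMatch(in: "10-20") {
    // Captures are available as statically typed tuple elements.
    bounds = (match.output.1, match.output.2)
}
```

Supplying the output type up front trades run-time flexibility for tuple-style access to captures at the use site.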

We propose `AnyRegexOutput` for capture types not known at compilation time, alongside casting API to convert to a strongly-typed capture list.

```swift
/// A type-erased regex output
public struct AnyRegexOutput {
  /// Creates a type-erased regex output from an existing output.
  ///
  /// Use this initializer to fit a regex with strongly typed captures into the
  /// use site of a dynamic regex, i.e. one that was created from a string.
  public init<Output>(_ match: Regex<Output>.Match)

We propose the following syntax for regex.
  /// Returns a typed output by converting the underlying value to the specified
  /// type.
  ///
  /// - Parameter type: The expected output type.
  /// - Returns: The output, if the underlying value can be converted to the
  ///   output type, or nil otherwise.
  public func `as`<Output>(_ type: Output.Type) -> Output?
}
extension AnyRegexOutput: RandomAccessCollection {
  public struct Element {
    /// The range over which a value was captured. `nil` for no-capture.
    public var range: Range<String.Index>?

    /// The slice of the input over which a value was captured. `nil` for no-capture.
    public var substring: Substring?

    /// The captured value. `nil` for no-capture.
    public var value: Any?
  }

  // Trivial collection conformance requirements

  public var startIndex: Int { get }

  public var endIndex: Int { get }

  public var count: Int { get }

  public func index(after i: Int) -> Int

  public func index(before i: Int) -> Int

  public subscript(position: Int) -> Element
}
```

We propose adding an API to `Regex<AnyRegexOutput>.Match` to cast the output type to a concrete one. A regex match will lazily create a `Substring` on demand, so casting the match itself saves ARC traffic vs extracting and casting the output.

```swift
extension Regex.Match where Output == AnyRegexOutput {
  /// Creates a type-erased regex match from an existing match.
  ///
  /// Use this initializer to fit a regex match with strongly typed captures into the
  /// use site of a dynamic regex match, i.e. one that was created from a string.
  public init<Output>(_ match: Regex<Output>.Match)

  /// Returns a typed match by converting the underlying values to the specified
  /// types.
  ///
  /// - Parameter type: The expected output type.
  /// - Returns: A match generic over the output type if the underlying values can be converted to the
  ///   output type. Returns `nil` otherwise.
  public func `as`<Output>(_ type: Output.Type) -> Regex<Output>.Match?
}
```

The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.

<details><summary>Grammar Notation</summary>

Expand Down Expand Up @@ -234,7 +339,7 @@ UnicodeScalar -> '\u{' HexDigit{1...} '}'
| '\o{' OctalDigit{1...} '}'
| '\0' OctalDigit{0...3}

HexDigit -> [0-9a-zA-Z]
HexDigit -> [0-9a-fA-F]
OctalDigit -> [0-7]

NamedScalar -> '\N{' ScalarName '}'
Expand Down Expand Up @@ -827,6 +932,12 @@ We are deferring runtime support for callouts from regex literals as future work

## Alternatives Considered

### Failable inits

There are many ways for compilation to fail, from syntactic errors to unsupported features to type mismatches. In the general case, run-time compilation errors are not recoverable by a tool without modifying the user's input. Even then, the thrown errors contain valuable information as to why compilation failed. For example, SwiftPM presents any errors directly to the user.

As proposed, the errors thrown will be the same errors presented to the Swift compiler, tracking fine-grained source locations with specific reasons why compilation failed. Defining a rich error API is future work, as these errors are rapidly evolving and it is too early to lock in the ABI.
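To illustrate why a throwing initializer is more useful to a tool than a failable one, here is a minimal sketch that surfaces the thrown error to the user. The helper `compile(_:)` is hypothetical, and the unlabeled `Regex(_:)` spelling is assumed from the API as shipped in Swift 5.7; the exact error text is not specified here.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
// `compile` is a hypothetical helper, not part of the proposal.
func compile(_ userPattern: String) -> Regex<AnyRegexOutput>? {
    do {
        return try Regex(userPattern)
    } catch {
        // The thrown error explains why compilation failed, so show it as-is.
        print("Invalid regex '\(userPattern)': \(error)")
        return nil
    }
}

let valid = compile(#"\d+"#)       // compiles
let invalid = compile("(unclosed") // missing ')' makes compilation throw
```

A failable initializer would collapse both outcomes into `nil` and lose the diagnostic that the `catch` block prints.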


### Skip the syntax

38 changes: 22 additions & 16 deletions Documentation/Evolution/RegexTypeOverview.md
@@ -1,4 +1,3 @@

# Regex Type and Overview

- Authors: [Michael Ilseman](https://github.com/milseman) and the Standard Library Team
Expand Down Expand Up @@ -135,11 +134,11 @@ Regexes can be created at run time from a string containing familiar regex synta

```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(compiling: pattern)
let regex = try! Regex(pattern)
// regex: Regex<AnyRegexOutput>

let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
try! Regex(compiling: pattern)
try! Regex(pattern)
```

*Note*: The syntax accepted and further details on run-time compilation, including `AnyRegexOutput` and extended syntaxes, are discussed in [Run-time Regex Construction][pitches].
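The three matching entry points differ in how much of the input must match. The sketch below contrasts them on a run-time regex, assuming the `wholeMatch(in:)`, `prefixMatch(in:)`, and `firstMatch(in:)` names as shipped in Swift 5.7; the inputs are invented for illustration.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
let digits = try Regex(#"\d+"#)

// The entire input must match: fails because of the trailing "abc".
let whole = try digits.wholeMatch(in: "123abc")

// The match must start at the beginning but may stop early.
let prefixResult = try digits.prefixMatch(in: "123abc")

// The match may start anywhere in the input.
let first = try digits.firstMatch(in: "abc456")

let prefixText = prefixResult?.output[0].substring
let firstText = first?.output[0].substring
```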
Expand Down Expand Up @@ -225,7 +224,7 @@ func processEntry(_ line: String) -> Transaction? {

The result builder allows for inline failable value construction, which participates in the overall string processing algorithm: returning `nil` signals a local failure and the engine backtracks to try an alternative. This not only relieves the use site from post-processing, it enables new kinds of processing algorithms, allows for search-space pruning, and enhances debuggability.

Swift regexes describe an unambiguous algorithm, were choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
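A small sketch of a transform taking part in matching, assuming the `RegexBuilder` API as shipped in Swift 5.7. The `attempts` log is a hypothetical stand-in for the `print()` statement mentioned above; the exact sequence of attempts depends on the engine and is not asserted here.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
import RegexBuilder

var attempts: [Substring] = []   // records every candidate the engine tried

let number = Regex {
    TryCapture {
        OneOrMore(.word)
    } transform: { text -> Int? in
        attempts.append(text)    // observable side effect
        return Int(text)         // nil signals a local failure
    }
}

// "abc" (and its shorter prefixes) fail the transform, so the engine
// moves on until the transform accepts "123".
let match = "abc 123".firstMatch(of: number)
let value = match?.output.1
```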

`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:

Expand Down Expand Up @@ -278,14 +277,14 @@ func processEntry(_ line: String) -> Transaction? {
*Note*: Details on how references work are discussed in [Regex Builders][pitches]. `Regex.Match` supports referring to _all_ captures by position (`match.1`, etc.) whether named or referenced or neither. Due to compiler limitations, result builders do not support forming labeled tuples for named captures.


### Algorithms, algorithms everywhere
### Regex-powered algorithms

Regexes can be used right out of the box with a variety of powerful and convenient algorithms, including trimming, splitting, and finding/replacing all matches within a string.

These algorithms are discussed in [String Processing Algorithms][pitches].
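For a flavor of these algorithms, the sketch below uses `split(separator:)`, `replacing(_:with:)`, and `trimmingPrefix(_:)` with a run-time regex. The names are taken from the API as shipped in Swift 5.7 and may not match this revision of the pitch exactly.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
let spaces = try Regex(#"\s+"#)
let text = "one  two   three"

// Split on each run of whitespace.
let words = text.split(separator: spaces)

// Collapse each run of whitespace into a single space.
let collapsed = text.replacing(spaces, with: " ")

// Remove a leading match, if there is one.
let trimmed = "   padded".trimmingPrefix(spaces)
```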


### Onward Unicode
### Unicode handling

A regex describes an algorithm to be run over some model of string, and Swift's `String` has a rather unique Unicode-forward model. `Character` is an [extended grapheme cluster](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) and equality is determined under [canonical equivalence](https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence).
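The string model can be seen directly in a short sketch. The `String` behavior shown is long-standing; the final regex check assumes the default grapheme-cluster matching semantics and the `Regex(_:)` initializer as shipped in Swift 5.7.

```swift
// Requires Swift 5.7+ for the `Regex` part (macOS 13+ on Apple platforms).
let decomposed = "e\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT
let precomposed = "\u{E9}"    // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// Equality is determined under canonical equivalence.
let sameString = (decomposed == precomposed)

// One Character (extended grapheme cluster), but two Unicode scalars.
let characterCount = decomposed.count
let scalarCount = decomposed.unicodeScalars.count

// `.` is expected to consume a whole Character, so it matches the full string.
let anyCharacter = try Regex(".")
let matchesOneCharacter = try anyCharacter.wholeMatch(in: decomposed) != nil
```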

@@ -301,7 +300,7 @@ Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_U
```swift
/// A regex represents a string processing algorithm.
///
/// let regex = try Regex(compiling: "a(.*)b")
/// let regex = try Regex("a(.*)b")
/// let match = "cbaxb".firstMatch(of: regex)
/// print(match.0) // "axb"
/// print(match.1) // "x"
@@ -310,12 +309,12 @@ public struct Regex<Output> {
/// Match a string in its entirety.
///
/// Returns `nil` if no match and throws on abort
public func matchWhole(_ s: String) throws -> Regex<Output>.Match?
public func wholeMatch(in s: String) throws -> Regex<Output>.Match?

/// Match part of the string, starting at the beginning.
///
/// Returns `nil` if no match and throws on abort
public func matchPrefix(_ s: String) throws -> Regex<Output>.Match?
public func prefixMatch(in s: String) throws -> Regex<Output>.Match?

/// Find the first match in a string
///
@@ -325,17 +324,17 @@ public struct Regex<Output> {
/// Match a substring in its entirety.
///
/// Returns `nil` if no match and throws on abort
public func matchWhole(_ s: Substring) throws -> Regex<Output>.Match?
public func wholeMatch(in s: Substring) throws -> Regex<Output>.Match?

/// Match part of the string, starting at the beginning.
///
/// Returns `nil` if no match and throws on abort
public func matchPrefix(_ s: Substring) throws -> Regex<Output>.Match?
public func prefixMatch(in s: Substring) throws -> Regex<Output>.Match?

/// Find the first match in a substring
///
/// Returns `nil` if no match is found and throws on abort
public func firstMatch(_ s: Substring) throws -> Regex<Output>.Match?
public func firstMatch(in s: Substring) throws -> Regex<Output>.Match?

/// The result of matching a regex against a string.
///
@@ -344,19 +343,19 @@ public struct Regex<Output> {
@dynamicMemberLookup
public struct Match {
/// The range of the overall match
public let range: Range<String.Index>
public var range: Range<String.Index> { get }

/// The produced output from the match operation
public var output: Output
public var output: Output { get }

/// Lookup a capture by name or number
public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T
public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T { get }

/// Lookup a capture by number
@_disfavoredOverload
public subscript(
dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output>
) -> Output
) -> Output { get }
// Note: this allows `.0` when `Match` is not a tuple.

}
Expand Down Expand Up @@ -482,6 +481,13 @@ We're also looking for more community discussion on what the default type system

The actual `Match` struct just stores ranges: the `Substrings` are lazily created on demand. This avoids unnecessary ARC traffic and memory usage.
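From the caller's side, this means a match exposes a range into the original input, and slicing the input with that range yields the same text as the match output. A minimal sketch, assuming the `Regex(_:as:)` and `firstMatch(in:)` spellings shipped in Swift 5.7 (the input is hypothetical):

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
let input = "aabbbcc"
let bs = try Regex("b+", as: Substring.self)   // no captures: Output is Substring

var slice: Substring? = nil
var output: Substring? = nil
if let match = try bs.firstMatch(in: input) {
    slice = input[match.range]   // slice the input using the stored range
    output = match.output        // Substring produced on demand
}
```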


### `Regex<Match, Captures>` instead of `Regex<Output>`

The generic parameter `Output` is proposed to contain both the whole match (the `.0` element if `Output` is a tuple) and captures. One alternative we have considered is separating `Output` into the entire match and the captures, i.e. `Regex<Match, Captures>`, and using `Void` for `Captures` when there are no captures.

The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familiarity and have the pitfall of introducing off-by-one errors.
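The alignment can be seen in a short sketch where tuple element `.1` and backreference `\1` denote the same capture. It assumes the `Regex(_:as:)` initializer and backreference support as shipped in Swift 5.7; the pattern and input are made up.

```swift
// Requires Swift 5.7 or later (macOS 13+ on Apple platforms).
// `\1` in the pattern refers to the same capture as `.1` in the output.
let doubled = try Regex(#"(\w)\1"#, as: (Substring, Substring).self)

let match = try doubled.firstMatch(in: "hello")
let wholeMatch = match?.output.0    // the entire match
let firstCapture = match?.output.1  // capture 1, the repeated character
```

Under the `Regex<Match, Captures>` alternative, that first capture would sit at position 0 of `Captures` while still being `\1` in the pattern.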

### Future work: static optimization and compilation

Swift's support for static compilation is still developing, and future work here is leveraging that to compile regexes when profitable. Many regexes describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).
2 changes: 1 addition & 1 deletion Documentation/Evolution/StringProcessingAlgorithms.md
@@ -187,7 +187,7 @@ public protocol CustomMatchingRegexComponent : RegexComponent {
_ input: String,
startingAt index: String.Index,
in bounds: Range<String.Index>
) -> (upperBound: String.Index, match: Match)?
) throws -> (upperBound: String.Index, match: Match)?
}
```
