[Integration] main (4d04019) -> swift/main #442

hamishknight · 2022-05-27T10:23:23Z

No description provided.

PCRE treats them as octal, but we require a `0` prefix.

Ban multi-scalar characters that start with ASCII, and are not letters, numbers, or `\r\n`. These may be confused with metacharacters and as such should be spelled explicitly.

Ban a balanced set of `<{...}>` delimiters for a potential future interpolation syntax.

_RegexParser does not need resilience as it's only ever going to be used by _StringProcessing and RegexBuilder.

`One` is a lightweight component that allows the use of the leading dot syntax to reference `RegexComponent` static members such as character classes as a non-first expression in a regex builder block.

---

Before:

```swift
Regex {
  .digit // works today but brittle; inserting anything above this line will break this
  OneOrMore(.whitespace)
  .word // ❌ error: 'OneOrMore' has no member named 'word' (because this is parsed as a member reference on the preceding expression)
}
```

After:

```swift
Regex {
  One(.digit) // recommended even though `.digit` works today
  OneOrMore(.whitespace)
  One(.word)
} // ✅
```

In a follow-up patch, we will propose adding an additional protocol inheriting from `RegexComponent` that will ban the use of the leading dot syntax even on the first line of `Regex { ... }`, as this will enforce the recommended style (use of `One`), and prevent surprises when the user inserts a pattern above the leading dot line.

PCRE, Oniguruma, and ICU allow `]` to appear as the first member of a custom character class, and treat it as literal, due to empty character classes being invalid. However this behavior isn't particularly intuitive, and makes lexing heuristics harder to implement properly. Instead, reject such character classes as being empty, and require escaping if `]` is meant as the first character.
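To make the rule above concrete, here is a small sketch that assumes the run-time `Regex(_:)` string initializer uses the parser with this change applied. The variable names are illustrative only.

```swift
// Assumption: `Regex(_:)` throws when the pattern fails to parse.

// `[]a]` begins with an unescaped `]`, so the class is read as empty
// and the pattern is expected to be rejected.
let unescaped = try? Regex("[]a]")

// Escaping the bracket makes it an ordinary member of the class.
let escaped = try? Regex(#"[\]a]"#)

print(unescaped == nil, escaped != nil)
```

The escaped spelling says explicitly that a literal `]` is intended, which is the form the commit message asks for.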

Introduce `One`

Wrap character classes around One

This fixes an issue where calling `matches(of:)` with a pattern that matches an empty substring gets stuck searching the same position over and over.
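The general technique for avoiding this kind of loop can be sketched without the regex engine at all. The code below is not the library's implementation: `allMatches` and `matchAt` are hypothetical names, and `matchAt` stands in for a single match attempt anchored at a given position.

```swift
// Hypothetical helper: collects every match, stepping forward by one
// position after an empty match so the search always makes progress.
func allMatches(
  in input: String,
  matchAt: (String, String.Index) -> String.Index?
) -> [Range<String.Index>] {
  var results: [Range<String.Index>] = []
  var position = input.startIndex
  while true {
    if let end = matchAt(input, position) {
      results.append(position..<end)
      if end > position {
        // Non-empty match: resume right after it.
        position = end
        continue
      }
    }
    // No match, or an empty match: advance by one, or stop at the end.
    if position == input.endIndex { break }
    position = input.index(after: position)
  }
  return results
}

// Stand-in for the pattern `x*`: consumes zero or more "x" characters.
let zeroOrMoreX: (String, String.Index) -> String.Index? = { text, start in
  var end = start
  while end < text.endIndex, text[end] == "x" {
    end = text.index(after: end)
  }
  return end
}

let input = "axxb"
let found = allMatches(in: input, matchAt: zeroOrMoreX).map { String(input[$0]) }
print(found)
```

Without the explicit step after an empty match, `position` would never change and the loop would not terminate.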

When working on overload resolution, it's trickier with these unnecessary name collisions. This includes a few symbols that might be public eventually, but we can de-underscore them at that point.

PatternConverter uses one last AST-based API defined in _StringProcessing, which is technically not allowed but not problematic b/c PatternConverter also imports _RegexParser. However, this can have compile-time problems later on, so this changes the entry point for PatternConverter to just use an Any parameter that is checked at runtime to actually be AST.

The initial options are stored in the lowered program, and include all options that are set before the first attempted match. Note that not all initial options are global - a leading option-setting group is included in initial options, even though it applies only to a portion of the overall regex.

Previously, searching via firstMatch or matches(of:) would only _start_ searches at a character index, even when a regex has Unicode scalar semantics.

Add validation testing for supported and unsupported Unicode properties, along with support for the following properties:

- age
- numeric type
- numeric value
- lower/upper/titlecase mapping
- canonical combining class

…string Keep substring bounds when searching in Regex.wholeMatch
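As an illustration of what keeping substring bounds means, the sketch below matches against a slice of a longer string. It assumes the `Regex(_:)` string initializer and the `wholeMatch(in:)` overloads for `String` and `Substring`; the variable names are illustrative.

```swift
let text = "xxabcxx"
// A slice covering only the middle "abc".
let middle = text.dropFirst(2).dropLast(2)

let abc = try Regex("abc")

// The whole *slice* is "abc", so a whole match is expected here...
let sliceMatch = try abc.wholeMatch(in: middle)
// ...but the full string has extra characters on both sides.
let fullMatch = try abc.wholeMatch(in: text)

print(sliceMatch != nil, fullMatch != nil)
```

The point is that the match is judged against the bounds of the slice, not against the bounds of the base string it was taken from.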

Implement .as for Regex and Unify Match and AnyRegexOutput

This change addresses two overload resolution problems with the collection-based algorithm methods. First, when RegexBuilder is imported, `String` gains `RegexComponent` conformance, which means the `RegexComponent`-based overloads win for strings, which is undesirable. Second, if a collection has an element type that can be expressed as an array literal, collection-based methods get selected ahead of any standard library counterpart.

These two problems combine in a tricky way for `split` and `contains`. For `split`, both the collection-based and regex-based versions need to be marked as `@_disfavoredOverload` so that the problems above can be resolved. Unfortunately, this sets up an ambiguity once `String` has `RegexComponent` conformance, so the `RegexBuilder` module includes separate overloads for `String` and `Substring` that act as tie-breakers. If introduced in the standard library, these would be a source-breaking change, as they would win over the `Element`-based split when referencing the `split` method, as with `let splitFunction = myString.split`.

For `contains`, the same requirements hold, with the added complication that the Foundation overlay defines its own `String.contains(_:)` method with different behavior than included in these additions. For this reason, the more specific overloads for `String` and `Substring` can't live in the `RegexBuilder` module, which creates a problem for source compatibility. As it stands now, this existing code does not compile with the new algorithm methods added, as the type of `vowelPredicate` changes from `(Character) -> Bool` to `(String) -> Bool`:

```
let str = "abcde"
let vowelPredicate = "aeiou".contains
print(str.filter(vowelPredicate))
```
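One possible way for calling code to sidestep the inference change, offered here as a suggestion rather than something the PR prescribes, is to state the intended function type explicitly so that the element-based `contains` is chosen regardless of which other overloads are visible:

```swift
let str = "abcde"

// The explicit type annotation pins the reference to the
// `(Character) -> Bool` overload instead of leaving it to inference.
let vowelPredicate: (Character) -> Bool = "aeiou".contains

let vowels = str.filter(vowelPredicate)
print(vowels)
```

This keeps the call site unchanged and only adds a type annotation at the point where the method reference is formed.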

Use this to replace the various places we're doing `var src = self`.

We don't have to handle bailing early, the loop will terminate if we don't lex another operator.

Make sure an inverted character class does not dump the same as a regular character class.

`expectQuoted` expects non-empty contents, which doesn't apply to comments.

PCRE does not allow whitespace here, instead treating the sequence as literal if whitespace is present. However this behavior is quite unintuitive. Instead, lex whitespace between range operands.
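A short sketch of the intended behavior, assuming the run-time `Regex(_:)` string initializer uses the parser with this change applied (the variable names are illustrative):

```swift
// With whitespace lexed between the range operands, `{2, 4}` is
// expected to be read as a quantifier rather than as literal text.
let spaced = try Regex("a{2, 4}")

let asQuantifier = try spaced.wholeMatch(in: "aaa")
let asLiteral = try spaced.wholeMatch(in: "a{2, 4}")

print(asQuantifier != nil, asLiteral != nil)
```

Under PCRE's rule the same pattern would instead match the literal text `a{2, 4}`.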

Previously we would only parse non-semantic whitespace. Expand this to also parse end-of-line comments, which are supported by ICU.

Factor out the logic that deals with parsing an individual character class member, and interleave `lexTrivia` calls between range operand parsing.

Use the CaptureList as the source of truth on which index a name corresponds to, and query it when emitting a named backreference.
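From the user's side, a named backreference looks like the sketch below. It assumes the `Regex(_:)` string initializer and `firstMatch(of:)`; the capture name `letter` and the variable names are made up for the example.

```swift
// `\k<letter>` refers back to whatever the group named `letter` captured,
// so this pattern looks for the same lowercase letter twice in a row.
let doubled = try Regex(#"(?<letter>[a-z])\k<letter>"#)

let word = "hello"
let doubledText = word.firstMatch(of: doubled).map { String(word[$0.range]) }

print(doubledText ?? "no match")
```

Resolving the name to a capture index is the step the commit message describes delegating to the CaptureList.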

This doesn't appear to be used, and should be available from the CaptureList.

`RegexCompilationError` was not meant to be exposed as API.

…n-error Make `RegexCompilationError` internal

Previously we only supported a subset of the Oniguruma spellings for these. Introduce them as an actual Unicode property with the key `blk` or `block`. Additionally, allow a non-Unicode shorthand syntax that uses the prefix `in`. This is supported by Oniguruma and Perl (though Perl discourages its usage). We may want to warn/error on it and suggest users switch to the more explicit form.

These correspond to various `is`-prefixed accessors on `java.lang.Character`. For now, parse them, but mark them unsupported.

Additional character property parsing

hamishknight · 2022-05-27T10:23:31Z

@swift-ci please test

Azoy and others added 30 commits May 3, 2022 14:03

Implement .as for Regex

3f54941

Unify Match and AnyRegexOutput

7e1ab7d

Ban numeric escapes in custom character classes

bc51e91

PCRE treats them as octal, but we require a `0` prefix.

Ban confusable multi-scalar ASCII characters

a4a4a9a

Ban multi-scalar characters that start with ASCII, and are not letters, numbers, or `\r\n`. These may be confused with metacharacters and as such should be spelled explicitly.

Reserve <{...}> for interpolation syntax

db58c1b

Ban a balanced set of `<{...}>` delimiters for a potential future interpolation syntax.

Remove the namedCaptureOffset and StructuredCapture

a53a40b

Disable resilience on _RegexParser (swiftlang#397)

87ea119

_RegexParser does not need resilience as it's only ever going to be used by _StringProcessing and RegexBuilder.

Merge pull request swiftlang#404 from hamishknight/ban-empty-cc

d3ea692

Subsume referencedCaptureOffsets

21f7910

Add optional tests

c7b70a4

Merge pull request swiftlang#403 from rxwei/1

b8178c2

Introduce `One`

Wrap character classes around One

9d86c21

fix intersection, subtraction, symmetricDiference

24c139a

Merge pull request swiftlang#410 from Azoy/more-patternconverter-updates

489c63c

Wrap character classes around One

Merge pull request swiftlang#393 from hamishknight/stricter-syntax

9cf3cfc

Don't get stuck on empty matches (swiftlang#415)

adf5688

This fixes an issue where calling `matches(of:)` with a pattern that matches an empty substring gets stuck searching the same position over and over.

Underscore internal algorithms methods (swiftlang#414)

4f1e0ee

When working on overload resolution, it's trickier with these unnecessary name collisions. This includes a few symbols that might be public eventually, but we can de-underscore them at that point.

More unicode properties (swiftlang#385)

c000596

Add validation testing for supported and unsupported Unicode properties, along with support for the following properties:

- age
- numeric type
- numeric value
- lower/upper/titlecase mapping
- canonical combining class

Keep substring bounds when searching in Regex.wholeMatch

812c394

Merge pull request swiftlang#421 from natecook1000/fix_wholematch_sub…

ba33c0d

…string Keep substring bounds when searching in Regex.wholeMatch

Merge pull request swiftlang#376 from Azoy/types-types-and-more-types

7969272

Implement .as for Regex and Unify Match and AnyRegexOutput

Add test fixtures for renderAsBuilderDSL (swiftlang#423)

74f3b99

Introduce Source.lookahead

06dbc16

Use this to replace the various places we're doing `var src = self`.

Remove throws from a couple of lexing methods

8242df6

Add ASTBuilder helper for char class set operations

e80322b

hamishknight and others added 16 commits May 24, 2022 11:05

Simplify character class parsing a little

1e57c5a

We don't have to handle bailing early, the loop will terminate if we don't lex another operator.

Dump the inverted bit of a custom character class

95dc487

Make sure an inverted character class does not dump the same as a regular character class.

Allow empty comments

9d84967

`expectQuoted` expects non-empty contents, which doesn't apply to comments.

Lex whitespace in range quantifiers

24b64cd

PCRE does not allow whitespace here, instead treating the sequence as literal if whitespace is present. However this behavior is quite unintuitive. Instead, lex whitespace between range operands.

Parse end-of-line comments in custom character classes

8388d0f

Previously we would only parse non-semantic whitespace. Expand this to also parse end-of-line comments, which are supported by ICU.

Allow trivia between character class range operands

5b0524a

Factor out the logic that deals with parsing an individual character class member, and interleave `lexTrivia` calls between range operand parsing.

Merge pull request swiftlang#431 from hamishknight/trivia-pursuit

bd9bf23

Implement named backreferences

720ddd2

Use the CaptureList as the source of truth on which index a name corresponds to, and query it when emitting a named backreference.

Remove namedCaptureOffsets from MECaptureList

4b7d534

This doesn't appear to be used, and should be available from the CaptureList.

Merge pull request swiftlang#433 from hamishknight/named-refs

471e073

Make RegexCompilationError internal

5495a75

`RegexCompilationError` was not meant to be exposed as API.

Merge pull request swiftlang#438 from rxwei/internal-regex-compilatio…

a936e9e

…n-error Make `RegexCompilationError` internal

Parse Java character properties

05f73db

These correspond to various `is`-prefixed accessors on `java.lang.Character`. For now, parse them, but mark them unsupported.

Merge pull request swiftlang#440 from hamishknight/chunk-loader

4d04019

Additional character property parsing

Merge branch 'main' into main-merge

6d1d146

hamishknight mentioned this pull request May 27, 2022

[DNM] Null PR swiftlang/swift#58827

Draft

hamishknight merged commit 62fd560 into swiftlang:swift/main May 27, 2022

hamishknight deleted the main-merge branch May 27, 2022 14:12
