[Integration] main (4d04019) -> swift/main #442

Merged
merged 46 commits into from
May 27, 2022

Conversation

hamishknight
Contributor

No description provided.

Azoy and others added 30 commits May 3, 2022 14:03
PCRE treats them as octal, but we require a `0`
prefix.
Ban multi-scalar characters that start with ASCII,
and are not letters, numbers, or `\r\n`. These
may be confused with metacharacters and as such
should be spelled explicitly.
Ban a balanced set of `<{...}>` delimiters for a
potential future interpolation syntax.
_RegexParser does not need resilience as it's only ever going to be used by _StringProcessing and RegexBuilder.
`One` is a lightweight component that allows the leading dot syntax to be used for referencing `RegexComponent` static members, such as character classes, as a non-first expression in a regex builder block.

---

Before:

```swift
Regex {
    .digit // works today but brittle; inserting anything above this line will break this

    OneOrMore(.whitespace)

    .word // ❌ error: 'OneOrMore' has no member named 'word' (because this is parsed as a member reference on the preceding expression)
}
```

After:

```swift
Regex {
    One(.digit)              // recommended even though `.digit` works today
    OneOrMore(.whitespace)
    One(.word)
} // ✅
```

In a follow-up patch, we will propose adding an additional protocol inheriting from `RegexComponent` that will ban the use of the leading dot syntax even on the first line of `Regex { ... }`, as this will enforce the recommended style (use of `One`), and prevent surprises when the user inserts a pattern above the leading dot line.
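As a usage sketch, the "After" pattern can be exercised against a couple of inputs. The input strings and the names `pattern`, `matched`, and `rejected` are illustrative assumptions, not part of the patch.

```swift
import RegexBuilder

// Same shape as the "After" block: every component is spelled with `One`,
// so lines can be inserted or reordered without changing how they parse.
let pattern = Regex {
    One(.digit)
    OneOrMore(.whitespace)
    One(.word)
}

// A digit, some whitespace, then a word character: should match.
let matched = "7   x".wholeMatch(of: pattern) != nil
// No whitespace between the digit and the word character: should not match.
let rejected = "7x".wholeMatch(of: pattern) != nil

print(matched, rejected)
```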
PCRE, Oniguruma, and ICU allow `]` to appear as
the first member of a custom character class, and
treat it as literal, due to empty character classes
being invalid.

However this behavior isn't particularly intuitive,
and makes lexing heuristics harder to implement
properly. Instead, reject such character classes
as being empty, and require escaping if `]` is
meant as the first character.
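A minimal sketch of what this means for callers, assuming the run-time `Regex(_:)` string initializer goes through the same parser. The helper `compiles` is hypothetical and exists only for illustration.

```swift
// Hypothetical helper: reports whether a pattern string compiles as a Regex.
func compiles(_ pattern: String) -> Bool {
    do {
        _ = try Regex(pattern)
        return true
    } catch {
        return false
    }
}

// Under the rule described above, a leading unescaped `]` closes an empty
// class, so this spelling is expected to be rejected rather than treated
// as "a class containing `]` and `a`".
print(compiles("[]a]"))

// Escaping makes the intent explicit.
let escaped = try! Regex(#"[\]a]"#)
let hit = "x]y".firstMatch(of: escaped).map { String("x]y"[$0.range]) }
print(hit ?? "no match")
```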
This fixes an issue where calling `matches(of:)` with a pattern
that matches an empty substring gets stuck searching the same position
over and over.
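A small sketch of the kind of call affected, assuming a pattern such as `x*` that can match the empty string at every position. The input and variable names are illustrative.

```swift
let input = "abc"

// `x*` matches zero or more "x", so it succeeds with an empty match
// at every position of an input that contains no "x".
let emptyFriendly = try! Regex("x*")

// With the fix, the search advances past each empty match and terminates.
let matches = input.matches(of: emptyFriendly)
let allEmpty = matches.allSatisfy { $0.range.isEmpty }

print(matches.count, allEmpty)
```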
When working on overload resolution, it's trickier with these
unnecessary name collisions. This includes a few symbols that might
be public eventually, but we can de-underscore them at that point.
PatternConverter uses one last AST-based API defined in
_StringProcessing, which is technically not allowed but not
problematic b/c PatternConverter also imports _RegexParser.
However, this can have compile-time problems later on, so
this changes the entry point for PatternConverter to just use
an Any parameter that is checked at runtime to actually be AST.
The initial options are stored in the lowered program, and include
all options that are set before the first attempted match. Note that
not all initial options are global - a leading option-setting group
is included in initial options, even though it applies only to a
portion of the overall regex.

Previously, searching via firstMatch or matches(of:) would only
_start_ searches at a character index, even when a regex has
Unicode scalar semantics.
Add validation testing for supported and unsupported Unicode properties,
along with support for the following properties:

- age
- numeric type
- numeric value
- lower/upper/titlecase mapping
- canonical combining class
…string

Keep substring bounds when searching in Regex.wholeMatch
Implement .as for Regex and Unify Match and AnyRegexOutput
This change addresses two overload resolution problems with the
collection-based algorithm methods. First, when RegexBuilder is
imported, `String` gains `RegexComponent` conformance, which means
the `RegexComponent`-based overloads win for strings, which is
undesirable. Second, if a collection has an element type that can
be expressed as an array literal, collection-based methods get
selected ahead of any standard library counterpart. These two problems
combine in a tricky way for `split` and `contains`.

For `split`, both the collection-based and regex-based versions need
to be marked as `@_disfavoredOverload` so that the problems above can
be resolved. Unfortunately, this sets up an ambiguity once `String`
has `RegexComponent` conformance, so the `RegexBuilder` module
includes separate overloads for `String` and `Substring` that act
as tie-breakers. If introduced in the standard library, these would
be a source-breaking change, as they would win over the `Element`-
based split when referencing the `split` method, as with
`let splitFunction = myString.split`.

For `contains`, the same requirements hold, with the added
complication that the Foundation overlay defines its own
`String.contains(_:)` method with different behavior than included
in these additions. For this reason, the more specific overloads for
`String` and `Substring` can't live in the `RegexBuilder` module,
which creates a problem for source compatibility. As it stands now,
this existing code does not compile with the new algorithm methods
added, as the type of `vowelPredicate` changes from `(Character) ->
Bool` to `(String) -> Bool`:

```swift
let str = "abcde"
let vowelPredicate = "aeiou".contains
print(str.filter(vowelPredicate))
```
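One possible workaround on the caller's side, shown here as an assumption rather than something the patch prescribes, is to annotate the function reference so the `Character`-based `contains` is selected explicitly:

```swift
let str = "abcde"

// The explicit type annotation pins the reference to the element-based
// `contains(_:)`, regardless of which other overloads are visible.
let vowelPredicate: (Character) -> Bool = "aeiou".contains

let vowels = str.filter(vowelPredicate)
print(vowels) // prints "ae"
```

This sidesteps the inference change but still requires editing existing code, so it does not remove the source compatibility concern described above.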
Use this to replace the various places we're doing
`var src = self`.
hamishknight and others added 16 commits May 24, 2022 11:05
We don't have to handle bailing early, the loop
will terminate if we don't lex another operator.
Make sure an inverted character class does not
dump the same as a regular character class.
`expectQuoted` expects non-empty contents, which
doesn't apply to comments.
PCRE does not allow whitespace here, instead
treating the sequence as literal if whitespace is
present. However this behavior is quite
unintuitive. Instead, lex whitespace between range
operands.
Previously we would only parse non-semantic
whitespace here. Expand this to also cover end-of-line
comments, which are supported by ICU.
Factor out the logic that deals with parsing an
individual character class member, and interleave
`lexTrivia` calls between range operand parsing.
Use the CaptureList as the source of truth on
which index a name corresponds to, and query it
when emitting a named backreference.
This doesn't appear to be used, and should be
available from the CaptureList.
`RegexCompilationError` was not meant to be exposed as API.
…n-error

Make `RegexCompilationError` internal
Previously we only supported a subset of the
Oniguruma spellings for these. Introduce them as
an actual Unicode property with the key `blk` or
`block`.

Additionally, allow a non-Unicode shorthand syntax
that uses the prefix `in`. This is supported by
Oniguruma and Perl (though Perl discourages its
usage). We may want to warn/error on it and suggest
users switch to the more explicit form.
These correspond to various `is`-prefixed
accessors on `java.lang.Character`. For now, parse
them, but mark them unsupported.
Additional character property parsing
@hamishknight
Contributor Author

@swift-ci please test

@hamishknight hamishknight merged commit 62fd560 into swiftlang:swift/main May 27, 2022
@hamishknight hamishknight deleted the main-merge branch May 27, 2022 14:12