[Integration] main (4d04019) -> swift/main #442
Merged
PCRE treats them as octal, but we require a `0` prefix.
Ban multi-scalar characters that start with ASCII, and are not letters, numbers, or `\r\n`. These may be confused with metacharacters and as such should be spelled explicitly.
Ban a balanced set of `<{...}>` delimiters for a potential future interpolation syntax.
_RegexParser does not need resilience as it's only ever going to be used by _StringProcessing and RegexBuilder.
`One` is a lightweight component that allows the use of the leading dot syntax to reference `RegexComponent` static members such as character classes as a non-first expression in a regex builder block.

---

Before:
```swift
Regex {
  .digit // works today but brittle; inserting anything above this line will break this
  OneOrMore(.whitespace)
  .word // ❌ error: 'OneOrMore' has no member named 'word' (because this is parsed as a member reference on the preceding expression)
}
```

After:
```swift
Regex {
  One(.digit) // recommended even though `.digit` works today
  OneOrMore(.whitespace)
  One(.word)
} // ✅
```

In a follow-up patch, we will propose adding an additional protocol inheriting from `RegexComponent` that will ban the use of the leading dot syntax even on the first line of `Regex { ... }`, as this will enforce the recommended style (use of `One`), and prevent surprises when the user inserts a pattern above the leading dot line.
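A runnable version of the recommended style, as a minimal sketch: it assumes a toolchain where `RegexBuilder` is available (Swift 5.7 or later), and the names `digitThenWord` and `sample` are illustrative only.

```swift
import RegexBuilder

// Every character class is wrapped in `One`, so no line relies on
// leading-dot member lookup against the preceding expression.
let digitThenWord = Regex {
    One(.digit)
    OneOrMore(.whitespace)
    One(.word)
}

// Hypothetical input: one digit, some whitespace, one word character.
let sample = "7   x"
let oneStyleMatched = sample.wholeMatch(of: digitThenWord) != nil
print(oneStyleMatched)
```

Because each line is a complete expression, a new component can be inserted anywhere in the block without changing how the following lines parse.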
PCRE, Oniguruma, and ICU allow `]` to appear as the first member of a custom character class, and treat it as literal, due to empty character classes being invalid. However this behavior isn't particularly intuitive, and makes lexing heuristics harder to implement properly. Instead, reject such character classes as being empty, and require escaping if `]` is meant as the first character.
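As an illustration of the escaped spelling this change asks for, here is a minimal sketch that assumes the run-time `Regex(_:)` initializer from Swift 5.7 or later; the pattern and the names `closers` and `closerInput` are made up for the example.

```swift
// `]` is meant as the first member of the class, so it is escaped.
// Per the commit message above, the unescaped spelling `[])}]` is
// rejected rather than treated as a literal `]`.
let closers = try Regex(#"[\])}]+"#)

let closerInput = "])}"
let closersMatched = closerInput.wholeMatch(of: closers) != nil
print(closersMatched)
```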
Introduce `One`
Wrap character classes around One
This fixes an issue where calling `matches(of:)` with a pattern that matches an empty substring gets stuck searching the same position over and over.
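The scenario can be sketched as follows, assuming the run-time `Regex(_:)` initializer from Swift 5.7 or later. The pattern `x*` and the names below are illustrative; the pattern matches the empty string at every position of an input that contains no `x`.

```swift
// `x*` can match zero characters, so every match against "abc" is empty.
let emptyFriendly = try Regex("x*")
let haystack = "abc"

// With the fix, the search advances past each empty match, so this
// call terminates instead of retrying the same position.
let emptyRanges = haystack.matches(of: emptyFriendly).map(\.range)

// Character offsets at which the (empty) matches were found.
let emptyOffsets = emptyRanges.map {
    haystack.distance(from: haystack.startIndex, to: $0.lowerBound)
}
print(emptyOffsets)
```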
When working on overload resolution, it's trickier with these unnecessary name collisions. This includes a few symbols that might be public eventually, but we can de-underscore them at that point.
PatternConverter uses one last AST-based API defined in _StringProcessing, which is technically not allowed but not problematic because PatternConverter also imports _RegexParser. However, this can cause compile-time problems later on, so this changes the entry point for PatternConverter to use an `Any` parameter that is checked at runtime to actually be an AST.
The initial options are stored in the lowered program, and include all options that are set before the first attempted match. Note that not all initial options are global - a leading option-setting group is included in initial options, even though it applies only to a portion of the overall regex.

Previously, searching via `firstMatch` or `matches(of:)` would only _start_ searches at a character index, even when a regex has Unicode scalar semantics.
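To illustrate the scalar-semantics search behavior, here is a minimal sketch that assumes the `matchingSemantics(_:)` method on `Regex` from Swift 5.7 or later. The input and the names `accent`, `cafe`, and `scalarOffset` are made up for the example.

```swift
// U+0301 (combining acute accent) is part of the final character "é",
// so it does not begin at a character boundary.
let cafe = "cafe\u{301}"

// Under Unicode scalar semantics the search may start at a scalar
// boundary inside a character.
let accent = try Regex(#"\u{301}"#).matchingSemantics(.unicodeScalar)

// Offset of the match start, counted in Unicode scalars.
let scalarOffset = cafe.firstMatch(of: accent).map { match in
    cafe.unicodeScalars.distance(
        from: cafe.unicodeScalars.startIndex,
        to: match.range.lowerBound
    )
}
print(scalarOffset as Any)
```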
Add validation testing for supported and unsupported Unicode properties, along with support for the following properties:
- age
- numeric type
- numeric value
- lower/upper/titlecase mapping
- canonical combining class
…string Keep substring bounds when searching in Regex.wholeMatch
Implement .as for Regex and Unify Match and AnyRegexOutput
This change addresses two overload resolution problems with the collection-based algorithm methods. First, when RegexBuilder is imported, `String` gains `RegexComponent` conformance, which means the `RegexComponent`-based overloads win for strings, which is undesirable. Second, if a collection has an element type that can be expressed as an array literal, collection-based methods get selected ahead of any standard library counterpart.

These two problems combine in a tricky way for `split` and `contains`. For `split`, both the collection-based and regex-based versions need to be marked as `@_disfavoredOverload` so that the problems above can be resolved. Unfortunately, this sets up an ambiguity once `String` has `RegexComponent` conformance, so the `RegexBuilder` module includes separate overloads for `String` and `Substring` that act as tie-breakers. If introduced in the standard library, these would be a source-breaking change, as they would win over the `Element`-based split when referencing the `split` method, as with `let splitFunction = myString.split`.

For `contains`, the same requirements hold, with the added complication that the Foundation overlay defines its own `String.contains(_:)` method whose behavior differs from the one included in these additions. For this reason, the more specific overloads for `String` and `Substring` can't live in the `RegexBuilder` module, which creates a problem for source compatibility. As it stands now, this existing code does not compile with the new algorithm methods added, as the type of `vowelPredicate` changes from `(Character) -> Bool` to `(String) -> Bool`:

```
let str = "abcde"
let vowelPredicate = "aeiou".contains
print(str.filter(vowelPredicate))
```
Use this to replace the various places we're doing `var src = self`.
We don't have to handle bailing early, the loop will terminate if we don't lex another operator.
Make sure an inverted character class does not dump the same as a regular character class.
`expectQuoted` expects non-empty contents, which doesn't apply to comments.
PCRE does not allow whitespace here, instead treating the sequence as literal if whitespace is present. However this behavior is quite unintuitive. Instead, lex whitespace between range operands.
Previously we would only parse non-semantic whitespace. Expand this to also cover end-of-line comments, which are supported by ICU.
Factor out the logic that deals with parsing an individual character class member, and interleave `lexTrivia` calls between range operand parsing.
Use the CaptureList as the source of truth on which index a name corresponds to, and query it when emitting a named backreference.
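From the user's side, a named backreference looks like the following minimal sketch, which assumes the run-time `Regex(_:)` initializer from Swift 5.7 or later. The pattern, the input, and the names `doubled`, `sentence`, and `repeated` are illustrative and do not come from this change.

```swift
// `\k<word>` refers back to the capture named `word`; the name has to
// be resolved to a capture index when the backreference is emitted.
let doubled = try Regex(#"(?<word>\w+) \k<word>"#)

let sentence = "it was was fine"
let repeated = sentence.firstMatch(of: doubled)

if let repeated {
    // Whole match, then the named capture looked up by name.
    print(sentence[repeated.range])
    print(repeated["word"]?.substring ?? "")
}
```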
This doesn't appear to be used, and should be available from the CaptureList.
`RegexCompilationError` was not meant to be exposed as API.
…n-error Make `RegexCompilationError` internal
Previously we only supported a subset of the Oniguruma spellings for these. Introduce them as an actual Unicode property with the key `blk` or `block`. Additionally, allow a non-Unicode shorthand syntax that uses the prefix `in`. This is supported by Oniguruma and Perl (though Perl discourages its usage). We may want to warn/error on it and suggest users switch to the more explicit form.
These correspond to various `is`-prefixed accessors on `java.lang.Character`. For now, parse them, but mark them unsupported.
Additional character property parsing
@swift-ci please test
No description provided.