Skip to content

Parse PCRE callouts, backtracking directives, and .NET balanced captures #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 21, 2022

Conversation

hamishknight
Copy link
Contributor

  • Parse PCRE callouts (?C) with an integer or string argument. This doesn't yet handle the Oniguruma-specific callout syntax, which is a little more involved.

  • Parse PCRE backtracking directives e.g (*ACCEPT). This requires generalizing canLexGroupLikeAtom a bit to treat all (* groups as being atoms, and as such we need to special-case the PCRE2 explicit group syntax. We do it this way around to accommodate the extended Oniguruma callout syntax which also uses (*, which we aim to support.

  • Parse .NET balanced captures. This requires imposing some restrictions on what can be used as a group name to allow for the syntax (?<a-b>). For now, restrict the characters to letters, numbers and _, and forbid the first character from being a number. This should be no stricter than the rules imposed by PCRE, Oniguruma, ICU, Java and .NET.

@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

.empty, .groupTransform:
return false
}
}
Copy link
Member

@milseman milseman Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used to emit a diagnostic for backreference directives that aren't quantifiable:

aea6d9d#diff-4f8c3a04c147c57dc383b837f459ee51c8bec6de78d061d54ac7e228f5228dc5R1328-R1334

extension AST.Atom {
public struct Callout: Hashable {
public enum Argument: Hashable {
case number(Int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these references? That is, could they be relative or are they always absolute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No they're just plain arguments to the callout function. As I understand it, PCRE expects you to give it a single callout function, and it gets called with the argument specified to let you differentiate which call it is.

/// (*SKIP)
case skip
/// (*THEN)
case then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a newline to separate cases? Is there anywhere in the repo where these terms or concepts are defined or have a quick blurb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, and no there currently is not. Should I elaborate on the comments here? Or split out a separate document to explain some of the more obscure features?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the comment at first, but I do think a document sounds like a really good idea. That can serve as fodder for pitches too. I suspect the very first fully-realized pitch will be the literal-interior syntax.

}
// '(?C)' is implicitly '(?C0)'.
if src.peek() == ")" {
return .number(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want a bit somewhere tracking the implicit-ness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could probably be derived from the source info, but yeah it might be worthwhile to store a bit to indicate this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning on refactoring the callout AST type as a part of handling the Oniguruma-specific callout syntax, so I'll take care of this in that patch

@@ -44,7 +44,10 @@ extension AST {
return .atom() + innerCaptures
case .namedCapture(let name):
return .atom(name: name.value) + innerCaptures
case .balancedCapture(let b):
return .atom(name: b.name?.value) + innerCaptures
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @rxwei who is fixing some bugs around this area

Parse the `(?C)` syntax with an integer or string
argument. This doesn't yet handle the Oniguruma
specific callout syntax, which is a little more
involved.
This requires generalizing `canLexGroupLikeAtom` a
bit to treat all `(*` groups as being atoms, and as
such we need to special-case the PCRE2 explicit
group syntax. We do it this way around to accommodate
the extended Oniguruma callout syntax which also
uses `(*`, which we aim to support.
This requires imposing some restrictions on what
can be used as a group name to allow for the
syntax `(?<a-b>)`. For now, restrict the characters
to letters, numbers and `_`, and forbid the first
character from being a number. This should be no
stricter than the rules imposed by PCRE, Oniguruma,
ICU, Java and .NET.
@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

@hamishknight hamishknight merged commit 5ee1a22 into swiftlang:main Jan 21, 2022
@hamishknight hamishknight deleted the groups-n-stuff branch January 21, 2022 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants