Skip to content

Allow unbounded quoted sequences \Q... #432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 15, 2022

Conversation

hamishknight
Copy link
Contributor

@hamishknight hamishknight commented May 24, 2022

PCRE and ICU both support quoted sequences that don't have a terminating \E. Update the parsing to allow this. Additionally, allow empty quoted sequences outside of custom character classes, which is consistent with ICU. Finally, don't allow quoted sequences to span multiple lines in extended syntax literals.

@hamishknight hamishknight requested a review from milseman May 24, 2022 12:00
@hamishknight hamishknight changed the title Allow unbounded and empty quoted sequences @hamishknight Allow unbounded quoted sequences \Q... May 24, 2022
@hamishknight hamishknight changed the title @hamishknight Allow unbounded quoted sequences \Q... Allow unbounded quoted sequences \Q... May 24, 2022
@hamishknight
Copy link
Contributor Author

@swift-ci please test

PCRE and ICU both support quoted sequences that
don't have a terminating `\E`. Update the parsing
to allow this.

Additionally, allow empty quoted sequences outside
of custom character classes, which is consistent
with ICU.

Finally, don't allow quoted sequences to span
multiple lines in extended syntax literals.
@hamishknight hamishknight force-pushed the unbounded-quote branch 2 times, most recently from 0029e7c to 57a5092 Compare May 26, 2022 16:50
@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight hamishknight requested a review from milseman May 26, 2022 16:52
@@ -579,7 +579,7 @@ extension Source {

/// Try to consume quoted content
///
/// Quote -> '\Q' (!'\E' .)* '\E'
/// Quote -> '\Q' (!'\E' .)* '\E'?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the quote get terminated by a group or does it have to be terminated by the end of the regex? In that case, it seems more like:

    Quote -> '\Q' (!'\E' .)* ('\E' | <end-of-input>)

(or whatever our notational convention was for end of input)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has to be terminated by the end of the regex yeah, I can update the grammar to be explicit about that. I don't believe we ever added a convention for it, <end-of-input> seems reasonable to me tho.

}
if context.experimentalQuotes, src.tryEat("\"") {
// TODO: Can experimental quotes be empty?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we'll be dropping experimental mode in favor of (?"quoted")

#/
\Q
/#
"""#, .quoteMayNotSpanMultipleLines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be the regex /\Q/, i.e. it's not spanning multiple lines because there's no additional line? I'm ok either way, and perhaps reporting an error is helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Currently we don't do any whitespace stripping, so if we just allowed it as-is, it would parse as \Q\n \E. That being said, it seems reasonable enough to strip everything after-and-including the last newline given we enforce that the closing delimiter must appear on a newline. Does that sound reasonable to you, or should be leave it as an error for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... for now, if someone wants verbatim multi-line content, they have to use a string inside builder for that. We'll eventually get at least interpolation if not quoting as well. Otherwise the multi-line literal ignores the lines and whitespaces. It might be the case these are all the same regex:

Regex { "re gex" }

#/
  re\Q gex
/#

#/
  re\Q gex\E
/#

#/
  \Qre gex
/#

#/
  re\ gex
/#

/re\Q gex/

/re\Q gex\E/

For errors:

#/
  re\Q  # error: it spans lines (same logic as isolated option setting?)
  gex
/#

What about:

#/
  re\Q gex
  

/#

Newlines are a consideration of the host language. So perhaps \Q in a literal cannot quote a newline and a matching \E has to appear on the same line or it's the last semantic line in the literal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By semantic line, do you mean any line that doesn't consist entirely of whitespace? So:

#/
 re\Q gex
  

/#

would be legal as we'd strip all trailing whitespace after the x character, forming the quote \Q gex\E?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A related question here is whether (?-x) should be allowed if it's used on the last semantic line, e.g:

// Okay
#/
(?-x) abc


/#

// Not okay
#/
(?-x)abc
d
/#

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably have a consistent behavior between (?-x) and unterminated \Q whichever way we go. For now at least, I'm tempted to say we should keep everything using the rule where semantic whitespace cannot span multiple lines anywhere in the literal, so (?-x) and unterminated \Q are always invalid. So this test case would remain as-is. Given this rule only affects literals, we'd be free to lift in the future if we wanted. Does that sound reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight
Copy link
Contributor Author

Going to merge this for now, happy to look into the trailing semantic whitespace rule in a follow-up

@hamishknight hamishknight merged commit 2265620 into swiftlang:main Jun 15, 2022
@hamishknight hamishknight deleted the unbounded-quote branch June 15, 2022 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants