Skip to content

Introduce scalar sequences \u{AA BB CC} #386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 10, 2022

Conversation

hamishknight
Copy link
Contributor

Allow a whitespace-separated list of scalars within the \u{...} syntax. This is syntactic sugar that gets implicitly splatted out, for example \u{A B C} becomes \u{A}\u{B}\u{C}.

@hamishknight hamishknight requested a review from milseman May 9, 2022 14:01
)
parseTest(
#"[\u{CC}-\u{AA BB}]"#,
charClass(range_m(scalar_a("\u{CC}"), scalarSeq_a("\u{AA}", "\u{BB}")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does a scalar sequence in a range entail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @natecook1000 on whether we want these inside custom ccs and whether this would do the whole NFD thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's syntax sugar, it's the same as [\u{CC}-\u{AA}\u{BB}], which is consistent with https://www.unicode.org/reports/tr18/#RL1.1. Though I agree it might not be immediately obvious, and might be worth a warning or just rejecting for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's error/reject

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was my frustration from the last time we discussed these. It would be great if this syntax could be used for writing a multi-scalar character in a custom CC, but that's not how it's defined at all. I agree that we should error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, we should reject them completely within a custom character class? Or just as a range operand?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was my frustration from the last time we discussed these. It would be great if this syntax could be used for writing a multi-scalar character in a custom CC, but that's not how it's defined at all. I agree that we should error.

Can we can define it that way? That makes way more sense to me. But would it be a sequence again in scalar-semantics mode? Would we error out if it's more than one grapheme cluster normally?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should reject altogether in CCs for now, so that we can discuss and decide on whether there's a reasonable behavior that works for different modes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to reject as unsupported in a custom character class for now

I wasn't aware of this Unicode property when
initially implementing this. It's a more restricted
set of whitespace that Unicode reccommends for
parsing patterns. It's the same set of whitespace
used for extended syntax.

UAX44-LM3 itself doesn't appear to specify the
exact set of whitespace to match against, but this
is no more restrictive than the engines I'm aware
of.
This allows us to store the source location of the
inner scalar value.
Allow a whitespace-separated list of scalars within
the `\u{...}` syntax. This is syntactic sugar that
gets implicitly splatted out, for example `\u{A B C}`
becomes `\u{A}\u{B}\u{C}`.
`curIdx` is an index of `astChildren`, not
`children`.
The `predicate` may independently advance the
location before bailing, and we don't want that
to affect the recorded location of the result. We
probably ought to replace `lexUntil` with a better
API.
@hamishknight
Copy link
Contributor Author

@swift-ci please test

@hamishknight hamishknight merged commit 5b30c5b into swiftlang:main May 10, 2022
@hamishknight hamishknight deleted the multiscalar branch May 10, 2022 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants