-
Notifications
You must be signed in to change notification settings - Fork 49
Introduce scalar sequences \u{AA BB CC}
#386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tests/RegexTests/ParseTests.swift
Outdated
) | ||
parseTest( | ||
#"[\u{CC}-\u{AA BB}]"#, | ||
charClass(range_m(scalar_a("\u{CC}"), scalarSeq_a("\u{AA}", "\u{BB}"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does a scalar sequence in a range entail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @natecook1000 on whether we want these inside custom ccs and whether this would do the whole NFD thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's syntax sugar, it's the same as [\u{CC}-\u{AA}\u{BB}]
, which is consistent with https://www.unicode.org/reports/tr18/#RL1.1. Though I agree it might not be immediately obvious, and might be worth a warning or just rejecting for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's error/reject
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was my frustration from the last time we discussed these. It would be great if this syntax could be used for writing a multi-scalar character in a custom CC, but that's not how it's defined at all. I agree that we should error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, we should reject them completely within a custom character class? Or just as a range operand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was my frustration from the last time we discussed these. It would be great if this syntax could be used for writing a multi-scalar character in a custom CC, but that's not how it's defined at all. I agree that we should error.
Can we can define it that way? That makes way more sense to me. But would it be a sequence again in scalar-semantics mode? Would we error out if it's more than one grapheme cluster normally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should reject altogether in CCs for now, so that we can discuss and decide on whether there's a reasonable behavior that works for different modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to reject as unsupported in a custom character class for now
I wasn't aware of this Unicode property when initially implementing this. It's a more restricted set of whitespace that Unicode reccommends for parsing patterns. It's the same set of whitespace used for extended syntax. UAX44-LM3 itself doesn't appear to specify the exact set of whitespace to match against, but this is no more restrictive than the engines I'm aware of.
This allows us to store the source location of the inner scalar value.
Allow a whitespace-separated list of scalars within the `\u{...}` syntax. This is syntactic sugar that gets implicitly splatted out, for example `\u{A B C}` becomes `\u{A}\u{B}\u{C}`.
`curIdx` is an index of `astChildren`, not `children`.
The `predicate` may independently advance the location before bailing, and we don't want that to affect the recorded location of the result. We probably ought to replace `lexUntil` with a better API.
@swift-ci please test |
Allow a whitespace-separated list of scalars within the
\u{...}
syntax. This is syntactic sugar that gets implicitly splatted out, for example\u{A B C}
becomes\u{A}\u{B}\u{C}
.