Introduce scalar sequences `\u{AA BB CC}` #386

hamishknight · 2022-05-09T14:00:53Z

Allow a whitespace-separated list of scalars within the `\u{...}` syntax. This is syntactic sugar that gets implicitly splatted out, for example `\u{A B C}` becomes `\u{A}\u{B}\u{C}`.
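As a rough illustration of the desugaring (not the parser's actual code), the body between the braces can be split on whitespace and each piece read as a hexadecimal scalar value. The helper name `expandScalarSequence` and the error type are hypothetical, and treating any `Character.isWhitespace` as a separator is an assumption of this sketch.

```swift
// Hypothetical helper, not part of _RegexParser: expands the text between
// the braces of `\u{...}` into the individual scalars it denotes.
enum ScalarSequenceError: Error, Equatable {
  case empty
  case invalidScalar(String)
}

func expandScalarSequence(_ body: String) throws -> [Unicode.Scalar] {
  // Assumption: any whitespace character separates scalars here.
  let parts = body.split(whereSeparator: { $0.isWhitespace })
  guard !parts.isEmpty else { throw ScalarSequenceError.empty }
  return try parts.map { part in
    // Each piece must be a hex number that names a valid Unicode scalar.
    guard let value = UInt32(part, radix: 16),
          let scalar = Unicode.Scalar(value) else {
      throw ScalarSequenceError.invalidScalar(String(part))
    }
    return scalar
  }
}
```

Under these assumptions, `expandScalarSequence("A B C")` yields the same three scalars as writing `\u{A}\u{B}\u{C}`.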

Sources/_RegexParser/Regex/Parse/LexicalAnalysis.swift

milseman · 2022-05-09T16:01:40Z

Tests/RegexTests/ParseTests.swift

+    )
+    parseTest(
+      #"[\u{CC}-\u{AA BB}]"#,
+      charClass(range_m(scalar_a("\u{CC}"), scalarSeq_a("\u{AA}", "\u{BB}")))


What does a scalar sequence in a range entail?

CC @natecook1000 on whether we want these inside custom ccs and whether this would do the whole NFD thing

Because it's syntactic sugar, it's the same as `[\u{CC}-\u{AA}\u{BB}]`, which is consistent with https://www.unicode.org/reports/tr18/#RL1.1. Though I agree it might not be immediately obvious, and might be worth a warning or just rejecting for now.

Let's error/reject

Yeah, this was my frustration from the last time we discussed these. It would be great if this syntax could be used for writing a multi-scalar character in a custom CC, but that's not how it's defined at all. I agree that we should error.

To be clear, we should reject them completely within a custom character class? Or just as a range operand?

> Yeah, this was my frustration from the last time we discussed these. It would be great if this syntax could be used for writing a multi-scalar character in a custom CC, but that's not how it's defined at all. I agree that we should error.

Can we define it that way? That makes way more sense to me. But would it be a sequence again in scalar-semantics mode? Would we error out if it's more than one grapheme cluster normally?

I think we should reject altogether in CCs for now, so that we can discuss and decide on whether there's a reasonable behavior that works for different modes.

Updated to reject as unsupported in a custom character class for now
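The outcome of this thread could be sketched roughly as follows. This is a toy model, not the real `_RegexParser` AST: the `Atom`, `Member`, and `UnsupportedError` types and the `validateCustomCharacterClass` function are all hypothetical names used only for illustration.

```swift
// Toy stand-ins for the parser's AST nodes (hypothetical).
enum Atom {
  case scalar(Unicode.Scalar)
  case scalarSequence([Unicode.Scalar])
}

enum Member {
  case atom(Atom)
  case range(Atom, Atom)
}

struct UnsupportedError: Error {
  let message: String
}

// Rejects a scalar sequence anywhere in a custom character class,
// whether it is a standalone member or a range operand.
func validateCustomCharacterClass(_ members: [Member]) throws {
  for member in members {
    let atoms: [Atom]
    switch member {
    case .atom(let atom): atoms = [atom]
    case .range(let lhs, let rhs): atoms = [lhs, rhs]
    }
    for atom in atoms {
      if case .scalarSequence = atom {
        throw UnsupportedError(
          message: "scalar sequence in custom character class")
      }
    }
  }
}
```

Rejecting in both positions keeps the door open to giving the syntax a different meaning inside custom character classes later, which is the concern raised above.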

I wasn't aware of this Unicode property when initially implementing this. It's a more restricted set of whitespace that Unicode recommends for parsing patterns. It's the same set of whitespace used for extended syntax. UAX44-LM3 itself doesn't appear to specify the exact set of whitespace to match against, but this is no more restrictive than the engines I'm aware of.

hamishknight · 2022-05-10T10:31:53Z

@swift-ci please test

hamishknight requested a review from milseman May 9, 2022 14:01

milseman approved these changes May 9, 2022

hamishknight force-pushed the multiscalar branch from fb06ed2 to 0261bb3 Compare May 10, 2022 10:26

hamishknight added 6 commits May 10, 2022 11:31

Improve the wording of a diagnostic

05e610a

Introduce AST.Atom.Scalar

7752015

This allows us to store the source location of the inner scalar value.

Introduce scalar sequences \u{AA BB CC}

f436cca

Allow a whitespace-separated list of scalars within the `\u{...}` syntax. This is syntactic sugar that gets implicitly splatted out, for example `\u{A B C}` becomes `\u{A}\u{B}\u{C}`.

Fix invalid indexing

0597164

`curIdx` is an index of `astChildren`, not `children`.

Fix source location tracking in lexUntil

0872d16

The `predicate` may independently advance the location before bailing, and we don't want that to affect the recorded location of the result. We probably ought to replace `lexUntil` with a better API.
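The location issue in that last commit can be illustrated with a toy lexer. Nothing here is the real `lexUntil`; `ToyLexer`, `tryEat`, and the tuple result are hypothetical, and the sketch assumes the predicate only advances the position when it returns `true` (for example by eating a closing delimiter).

```swift
// Toy lexer (hypothetical), tracking a position into an array of characters.
struct ToyLexer {
  let input: [Character]
  var pos = 0

  var isAtEnd: Bool { pos >= input.count }

  // Consumes `c` if it is the next character.
  mutating func tryEat(_ c: Character) -> Bool {
    guard !isAtEnd, input[pos] == c else { return false }
    pos += 1
    return true
  }

  // Consumes characters until `predicate` returns true. The predicate has
  // mutable access to the lexer and may advance `pos` itself, so the end
  // of the result is snapshotted before the predicate runs.
  mutating func lexUntil(
    _ predicate: (inout ToyLexer) -> Bool
  ) -> (text: String, range: Range<Int>) {
    let start = pos
    var end = pos
    var text = ""
    while !isAtEnd {
      let beforePredicate = pos
      if predicate(&self) {
        // Ignore whatever the predicate consumed when recording the range.
        end = beforePredicate
        break
      }
      text.append(input[pos])
      pos += 1
      end = pos
    }
    return (text, start..<end)
  }
}
```

With a predicate such as `{ $0.tryEat("}") }`, the returned range covers only the lexed text, while the lexer itself ends up positioned after the closing brace.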

hamishknight force-pushed the multiscalar branch from 0261bb3 to 0872d16 Compare May 10, 2022 10:31

hamishknight merged commit 5b30c5b into swiftlang:main May 10, 2022

hamishknight deleted the multiscalar branch May 10, 2022 11:25
