Better coalesce adjacent scalars #574

hamishknight · 2022-07-13T16:39:04Z

Previously we would only coalesce adjacent scalars in regex literals outside of custom character classes. Change the behavior such that we start coalescing scalars inside custom character classes in grapheme mode, e.g. `[e\u{301}]` matches `e\u{301}`, and start coalescing adjacent scalars in the DSL.

Previously for the DSL we would emit a series of scalars as a series of individual characters in grapheme semantic mode. Change the behavior such that we coalesce any adjacent scalars and characters, including those in regex literals and nested concatenations. We then perform grapheme breaking over the result, and can emit character matches for scalars that coalesced into a grapheme.

This transform subsumes a similar transform we performed for regex literals when converting them to a DSLTree. This has the nice side effect of allowing us to better preserve scalar syntax in the DSL transform.
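
As a rough illustration of the coalesce-then-break idea described above, here is a minimal standard-library sketch. The `coalesce` helper is hypothetical and is not the transform in this PR; it only shows that adjacent scalars, once joined, can be grapheme-broken into `Character`s.

```swift
// Hypothetical helper for illustration only; not part of the regex implementation.
// Joins adjacent scalars into one string, then lets String's own
// grapheme breaking split the result into Characters.
func coalesce(_ scalars: [Unicode.Scalar]) -> [Character] {
  var view = String.UnicodeScalarView()
  view.append(contentsOf: scalars)
  return Array(String(view))
}

// "e" followed by U+0301 (combining acute accent), then "x".
let parts: [Unicode.Scalar] = ["e", "\u{301}", "x"]
let graphemes = coalesce(parts)

// The first two scalars coalesce into a single grapheme; "x" stays separate.
print(graphemes.count)
print(graphemes[0] == "\u{E9}")
```

Because `Character` equality uses canonical equivalence, the coalesced `e` + U+0301 compares equal to the precomposed U+00E9.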

Resolves #572 (rdar://96942688)
Resolves #586 (rdar://97209131)
Resolves #573

milseman

What about adjacent content in a custom character class?

Tests/RegexTests/MatchTests.swift

hamishknight · 2022-07-18T19:24:34Z

@rctcwyvrn mind taking a look at 203868f?

rctcwyvrn

LGTM!

Previously we would emit a series of scalars written in the DSL as a series of individual characters in grapheme semantic mode. Change the behavior such that we coalesce any adjacent scalars and characters, including those in regex literals and nested concatenations. We then perform grapheme breaking over the result, and can emit character matches for scalars that coalesced into a grapheme. This transform subsumes a similar transform we performed for regex literals when converting them to a DSLTree. This has the nice side effect of allowing us to better preserve scalar syntax in the DSL transform. rdar://96942688

Previously we would only match entire characters. Update to use the generic Character consumer logic that can handle scalar semantic mode. rdar://97209131
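
For readers unfamiliar with the two semantic modes, the difference between matching whole characters and matching scalars can be seen with plain `String` views. This is only an analogy using the standard library, not the engine's consumer logic; the variable names are made up for the example.

```swift
// "e" followed by U+0301 forms a single Character (grapheme cluster).
let text = "e\u{301}"

// Grapheme view: the only Character is "e\u{301}", so a bare "e" is not found.
let foundAsCharacter = text.contains("e" as Character)

// Scalar view: the string is the two scalars "e" and U+0301, so "e" is found.
let foundAsScalar = text.unicodeScalars.contains("e" as Unicode.Scalar)

print(foundAsCharacter, foundAsScalar)
```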

In grapheme semantic mode, coalesce adjacent character and scalar members of a custom character class, over which we can perform grapheme breaking. This involves potentially re-writing ranges such that they contain a complete grapheme of adjacent scalars.

Make sure we throw the right error for ranges that are invalid in grapheme mode, but are valid in scalar mode.

I also noticed that `lexQuantifier` could silently eat trivia if it failed to lex a quantification, so also fix that.

hamishknight · 2022-07-26T14:08:40Z

@swift-ci please test

hamishknight requested a review from milseman July 13, 2022 16:39

milseman reviewed Jul 14, 2022

Tests/RegexTests/MatchTests.swift

hamishknight force-pushed the better-together branch from 73ebb61 to a32f08c July 18, 2022 19:17

hamishknight requested a review from rctcwyvrn July 18, 2022 19:24

rctcwyvrn approved these changes Jul 18, 2022

hamishknight changed the title ~~Coalesce adjacent scalars and characters in the DSL~~ Better coalesce adjacent scalars Jul 19, 2022

hamishknight force-pushed the better-together branch 2 times, most recently from ef29854 to 3595070 July 19, 2022 12:30

hamishknight mentioned this pull request Jul 19, 2022

[5.7] Character class and scalar coalescing fixes #588

Merged

hamishknight force-pushed the better-together branch from 3595070 to 3995e96 July 20, 2022 20:25

hamishknight added 6 commits July 25, 2022 19:13

Fix scalar mode for quoted sequences in character class

618325a

Previously we would only match entire characters. Update to use the generic Character consumer logic that can handle scalar semantic mode. rdar://97209131

Form ASCII bitsets for quoted sequences in character classes

47bd7c5

Throw RegexCompilationError for invalid character class bounds

96adc3c

Make sure we throw the right error for ranges that are invalid in grapheme mode, but are valid in scalar mode.

Allow coalescing through trivia

dc4171f

I also noticed that `lexQuantifier` could silently eat trivia if it failed to lex a quantification, so also fix that.

hamishknight force-pushed the better-together branch from 3995e96 to dc4171f July 25, 2022 18:14

milseman approved these changes Jul 26, 2022

hamishknight merged commit e47cc63 into swiftlang:main Jul 26, 2022

hamishknight deleted the better-together branch July 26, 2022 14:14
