Add additional Unicode API to RegexBuilder.CharacterClass #435

Closed
wants to merge 13 commits into from

Member

@natecook1000 natecook1000 commented May 24, 2022

This includes revisions to the options API and additions to the CharacterClass type that bring RegexBuilder into alignment with the functionality we're offering through regex literals. For example, /\p{NumericValue=1}/ can be written in RegexBuilder syntax as CharacterClass.numericValue(1).
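
To make the example concrete, here is a minimal sketch of which scalars a Numeric_Value=1 class is meant to select. It uses only the standard library's `Unicode.Scalar.Properties`, not the `CharacterClass.numericValue(_:)` API proposed in this PR, and `scalars(in:withNumericValue:)` is a hypothetical helper introduced purely for illustration.

```swift
// Sketch: the scalars that a Numeric_Value=1 property class is meant to
// select, computed with the standard library's Unicode.Scalar.Properties.
// `scalars(in:withNumericValue:)` is a hypothetical helper, not part of
// RegexBuilder or this PR.
func scalars(in text: String, withNumericValue value: Double) -> [Unicode.Scalar] {
  text.unicodeScalars.filter { $0.properties.numericValue == value }
}

// "1", CIRCLED DIGIT ONE, and ROMAN NUMERAL ONE all carry the value 1;
// "a" has no numeric value and "9" has a different one.
let ones = scalars(in: "1a①9Ⅰ", withNumericValue: 1)
print(ones.map { String($0) }.joined(separator: " "))
```

The point of the property class is that it selects by Unicode data rather than by appearance, so digits from other scripts and enclosed or Roman numerals qualify alongside ASCII "1".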

@natecook1000 natecook1000 marked this pull request as draft May 24, 2022 16:05
@natecook1000 natecook1000 changed the title More unicode api Add additional Unicode API to RegexBuilder.CharacterClass May 24, 2022
@natecook1000 natecook1000 marked this pull request as ready for review June 2, 2022 18:56
@natecook1000
Member Author

@swift-ci Please test

@natecook1000
Member Author

Note: Add single API for advancing in an input's character/scalar view depending on semantic level.
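
One possible shape for such an API, as a rough sketch only: the enum `SemanticLevelSketch` and the method `position(after:at:)` are hypothetical names, not anything defined in this PR or in the library.

```swift
// Hypothetical stand-in for the regex engine's semantic level.
enum SemanticLevelSketch {
  case graphemeCluster
  case unicodeScalar
}

extension String {
  // Hypothetical helper: returns the position after `i`, stepping by one
  // Character or by one Unicode scalar depending on `level`.
  func position(after i: Index, at level: SemanticLevelSketch) -> Index {
    switch level {
    case .graphemeCluster:
      return index(after: i)
    case .unicodeScalar:
      return unicodeScalars.index(after: i)
    }
  }
}

let input = "e\u{301}x"  // "e" plus a combining acute accent, then "x"
let afterCharacter = input.position(after: input.startIndex, at: .graphemeCluster)
let afterScalar = input.position(after: input.startIndex, at: .unicodeScalar)
print(input[afterCharacter])                     // the Character after the accented "e"
print(input.unicodeScalars[afterScalar].value)   // code point of the combining accent
```

Funneling both cases through one entry point keeps the choice of view in a single place instead of repeating the switch at every call site.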

@natecook1000
Member Author

@swift-ci Please test

Member

@milseman milseman left a comment

Overall in favor, though it's a little uncanny if we don't have an equivalent to regex literal character classes for things like `.`.

/// ``CharacterClass.anyNonNewline``.
///
/// This character class is equivalent to the regex syntax "dot"
/// metacharacter in single-line mode: `(?s:.)`.
Member

But that's not what this is. This is `.`.
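
The distinction being drawn can be checked with regex syntax alone. This is a minimal sketch, assuming a Swift 5.7 or later toolchain and runtime where `Regex` is available; it deliberately avoids the `CharacterClass` members under review.

```swift
// Assumes Swift 5.7+, where `Regex` is available at run time.
let dotAll = try Regex(#"(?s:.)"#)         // single-line mode enabled
let dotNoNewline = try Regex(#"(?-s:.)"#)  // single-line mode disabled

// With single-line mode enabled, the dot also matches a newline.
let newlineMatchesDotAll = "\n".wholeMatch(of: dotAll) != nil
// With single-line mode disabled, the dot rejects a newline...
let newlineMatchesNoNewline = "\n".wholeMatch(of: dotNoNewline) != nil
// ...but still matches an ordinary character.
let letterMatchesNoNewline = "a".wholeMatch(of: dotNoNewline) != nil
print(newlineMatchesDotAll, newlineMatchesNoNewline, letterMatchesNoNewline)
```

So a class documented as "any non-newline" lines up with `(?-s:.)`, not `(?s:.)`.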

/// This character class is equivalent to the regex syntax "dot"
/// metacharacter with single-line mode disabled: `(?-s:.)`.
public static var anyNonNewline: CharacterClass {
.init(DSLTree.CustomCharacterClass(members: [.atom(.any)]))
Member

`.any`? Aren't these two things the same?

/// A character class that matches any single `Character`, or extended
/// grapheme cluster, regardless of the current semantic level.
///
/// This character class is equivalent to `\X` in regex syntax.
Member

Including newlines, right? Is this the real "any" above?
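
A small sketch of what `\X` is expected to do under the default (grapheme cluster) semantics, assuming a Swift 5.7 or later runtime. It does not exercise other semantic levels, so it only illustrates the "one whole `Character`" part of the doc comment.

```swift
// Assumes Swift 5.7+, where `Regex` is available at run time.
let accented = "e\u{301}"  // one Character made of two Unicode scalars
print(accented.count, accented.unicodeScalars.count)

let cluster = try Regex(#"\X"#)
// A single \X consumes the whole two-scalar cluster.
let matchedWholeCluster = accented.wholeMatch(of: cluster) != nil
// A newline is itself a Character, so \X is expected to accept it too.
let matchedNewline = "\n".wholeMatch(of: cluster) != nil
print(matchedWholeCluster, matchedNewline)
```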


/// A character class that matches any digit.
///
/// This character class is equivalent to `\d` in regex syntax.
Member

Does it vary based on options? How is this different from `any` in that regard?
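
For context on the options question, here is a hedged sketch of how `\d` responds to the ASCII-only digit option. It assumes a Swift 5.7 or later runtime and uses the `asciiOnlyDigits()` method on `Regex`, not the `asciiOnlyClasses(_:)` spelling proposed in this PR.

```swift
// Assumes Swift 5.7+, where `Regex` and `asciiOnlyDigits()` are available.
let digit = try Regex(#"\d"#)
let asciiDigit = digit.asciiOnlyDigits()

let arabicIndicThree = "\u{0663}"  // ARABIC-INDIC DIGIT THREE
// By default \d accepts decimal digits from any script.
let defaultMatchesArabicIndic = arabicIndicThree.wholeMatch(of: digit) != nil
// With the ASCII-only option, only "0" through "9" qualify.
let asciiMatchesArabicIndic = arabicIndicThree.wholeMatch(of: asciiDigit) != nil
let asciiMatchesThree = "3".wholeMatch(of: asciiDigit) != nil
print(defaultMatchesArabicIndic, asciiMatchesArabicIndic, asciiMatchesThree)
```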

@@ -72,27 +98,58 @@ extension RegexComponent where Self == CharacterClass {
]))
}

public static var horizontalWhitespace: CharacterClass {
.init(unconverted: .horizontalWhitespace)
/// A character class that matches any element that is a "word character".
Member

Just to double check, is there any better description than "word character"? "Word character" can be mentioned as an aside but that's more of a historical note. @Azoy does Unicode have another name?

Contributor

The class of <word_character> includes all the Alphabetic values from the Unicode character database ...
https://unicode.org/reports/tr18/#RL1.4
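
As a rough illustration of the definition quoted above, the following sketch tests a scalar against the word-character properties listed in UTS #18 (Alphabetic, marks, decimal digits, connector punctuation, join controls). `isTR18WordScalar` is a hypothetical name, and this is not necessarily how the regex engine implements `\w`.

```swift
// Sketch of the UTS #18 compatibility definition of a word character.
// `isTR18WordScalar` is a hypothetical helper, not library API.
func isTR18WordScalar(_ scalar: Unicode.Scalar) -> Bool {
  let properties = scalar.properties
  // Alphabetic and Join_Control are binary properties.
  if properties.isAlphabetic || properties.isJoinControl { return true }
  // The remaining members are selected by general category.
  switch properties.generalCategory {
  case .nonspacingMark, .spacingMark, .enclosingMark,
       .decimalNumber, .connectorPunctuation:
    return true
  default:
    return false
  }
}

print(isTR18WordScalar("a"), isTR18WordScalar("_"), isTR18WordScalar("-"))
```

Under this definition the underscore qualifies through connector punctuation, while a hyphen and a space do not.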

astCharacterProperty(.generalCategory(category.extendedGeneralCategory!))
}

public static func binaryProperty(
Member

Can we get this code in a different file?

/// - Returns: The modified regular expression.
public func asciiOnlyWhitespace(_ useASCII: Bool = true) -> Regex<RegexOutput> {
wrapInOption(.asciiOnlySpace, addingIf: useASCII)
public func asciiOnlyClasses(_ kinds: RegexCharacterClassKind = .all) -> Regex<RegexOutput> {
Member

What does this do to .any, properties, etc?

@available(SwiftStdlib 5.7, *)
public struct RegexCharacterClassKind: OptionSet, Hashable {
Member

Did you consider or debate whether RegexBuilder.CharacterClass should be Swift.RegexCharacterClass? Then this would be a Kind under it.
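
For readers unfamiliar with the pattern, this is a minimal sketch of how an `OptionSet` of class kinds with an `.all` member could be declared. The excerpt does not show the members of `RegexCharacterClassKind`, so the type name `CharacterClassKindSketch` and the members `digit`, `whitespace`, and `wordCharacter` are assumptions made for illustration.

```swift
// Hypothetical option set; member names are assumptions, not the PR's API.
struct CharacterClassKindSketch: OptionSet, Hashable {
  let rawValue: Int

  static let digit = CharacterClassKindSketch(rawValue: 1 << 0)
  static let whitespace = CharacterClassKindSketch(rawValue: 1 << 1)
  static let wordCharacter = CharacterClassKindSketch(rawValue: 1 << 2)

  // Convenience member combining every kind, mirroring the `.all` default
  // seen in the `asciiOnlyClasses(_:)` signature.
  static let all: CharacterClassKindSketch = [.digit, .whitespace, .wordCharacter]
}

let selected: CharacterClassKindSketch = [.digit, .whitespace]
print(selected.contains(.digit), selected.contains(.wordCharacter))
print(CharacterClassKindSketch.all.isSuperset(of: selected))
```

With this shape, a single method taking a set of kinds can replace one method per class, which is the trade-off the `asciiOnlyClasses(_:)` signature appears to be making.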

@natecook1000
Member Author

Closing this stale PR

3 participants