Fix scalar matching in grapheme semantic mode #565

Closed · wants to merge 2 commits

15 changes: 11 additions & 4 deletions Sources/_StringProcessing/ByteCodeGen.swift
@@ -62,7 +62,10 @@ fileprivate extension Compiler.ByteCodeGen {
try emitCharacter(c)

case let .scalar(s):
try emitScalar(s)
// A scalar always matches the same as a single scalar character. This
// means it must match a whole grapheme in grapheme semantic mode, but
// can match a single scalar in scalar semantic mode.
try emitCharacter(Character(s))

Member: Shouldn't this be conditional on whether we're in grapheme semantics or scalar semantics?

Contributor Author: emitCharacter will call emitConsumeScalar for scalar mode.

case let .assertion(kind):
try emitAssertion(kind.ast)
@@ -244,8 +247,12 @@ fileprivate extension Compiler.ByteCodeGen {
}
}
}

mutating func emitScalar(_ s: UnicodeScalar) throws {

/// Emit a consume of a single scalar value. This must only be used in scalar
/// semantic mode.
mutating func emitConsumeScalar(_ s: UnicodeScalar) throws {
assert(options.semanticLevel == .unicodeScalar, "Wrong semantic level")

// TODO: Native instruction buildMatchScalar(s)
if options.isCaseInsensitive {
// TODO: e.g. buildCaseInsensitiveMatchScalar(s)
Expand All @@ -263,7 +270,7 @@ fileprivate extension Compiler.ByteCodeGen {
// Unicode scalar matches the specific scalars that comprise a character
if options.semanticLevel == .unicodeScalar {
for scalar in c.unicodeScalars {
try emitScalar(scalar)
try emitConsumeScalar(scalar)
}
return
}
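
For context, a small end-to-end sketch of the behavior this emission change targets. It mirrors the RegexDSLTests added later in this PR, so the API calls are real; the expected results are the ones asserted by those tests once this change is applied.

import RegexBuilder

let input = "a\u{301}"  // one grapheme cluster: "á" in decomposed form

// Grapheme-cluster semantics: the scalar "a" must match a whole grapheme,
// and "á" is not "a", so there is no match.
print(input.firstMatch(of: "a" as UnicodeScalar) == nil)   // true

// Scalar semantics: the scalar "a" may match just the first scalar.
let scalarMode = ("a" as UnicodeScalar).regex
  .matchingSemantics(.unicodeScalar)
print(input.firstMatch(of: scalarMode) != nil)             // true
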
88 changes: 51 additions & 37 deletions Sources/_StringProcessing/ConsumerInterface.swift
@@ -53,6 +53,45 @@ extension DSLTree._AST.Atom {
}
}

extension Character {
func generateConsumer(
_ opts: MatchingOptions
) throws -> MEProgram.ConsumeFunction? {
let isCaseInsensitive = opts.isCaseInsensitive
switch opts.semanticLevel {
case .graphemeCluster:
return { input, bounds in
let low = bounds.lowerBound
if isCaseInsensitive && isCased {
return input[low].lowercased() == lowercased()
? input.index(after: low)
: nil
} else {
return input[low] == self
? input.index(after: low)
: nil
}
}
case .unicodeScalar:
// TODO: This should only be reachable from character class emission, can
// we guarantee that? Otherwise we'd want a different matching behavior.
let consumers = unicodeScalars.map { s in consumeScalar {
isCaseInsensitive
? $0.properties.lowercaseMapping == s.properties.lowercaseMapping
: $0 == s
}}
return { input, bounds in
for fn in consumers {
if let idx = fn(input, bounds) {
return idx
}
}
return nil
}
}
}
}

extension DSLTree.Atom {
var singleScalarASCIIValue: UInt8? {
switch self {
@@ -72,44 +111,15 @@ extension DSLTree.Atom {
func generateConsumer(
_ opts: MatchingOptions
) throws -> MEProgram.ConsumeFunction? {
let isCaseInsensitive = opts.isCaseInsensitive

switch self {
case let .char(c):
if opts.semanticLevel == .graphemeCluster {
return { input, bounds in
let low = bounds.lowerBound
if isCaseInsensitive && c.isCased {
return input[low].lowercased() == c.lowercased()
? input.index(after: low)
: nil
} else {
return input[low] == c
? input.index(after: low)
: nil
}
}
} else {
let consumers = c.unicodeScalars.map { s in consumeScalar {
isCaseInsensitive
? $0.properties.lowercaseMapping == s.properties.lowercaseMapping
: $0 == s
}}
return { input, bounds in
for fn in consumers {
if let idx = fn(input, bounds) {
return idx
}
}
return nil
}
}
return try c.generateConsumer(opts)

case let .scalar(s):
return consumeScalar {
isCaseInsensitive
? $0.properties.lowercaseMapping == s.properties.lowercaseMapping
: $0 == s
}
// A scalar always matches the same as a single scalar character. This
// means it must match a whole grapheme in grapheme semantic mode, but
// can match a single scalar in scalar semantic mode.
return try Character(s).generateConsumer(opts)

Member: This seems pretty convoluted. Why do we not just emit a scalar-consuming instruction in scalar semantic mode and set the grapheme boundary bit? CC @rctcwyvrn to see if we need to get some of her instructions in first.

Contributor Author: Can we do that within a consumer matching function? I know we eventually want to migrate off consumer functions, but for now this will generate a consumeScalar consumer in scalar semantic mode.

Contributor: This path should only be reached for non-ASCII custom character classes, so I don't think we can emit an instruction instead of a consume fn in that case. Maybe we could if we made custom character classes emit in the same way we do alternations?

case .any:
// FIXME: Should this be a total ordering?
@@ -211,16 +221,20 @@ extension AST.Atom {
var singleScalar: UnicodeScalar? {
switch kind {
case .scalar(let s): return s.value
case .escaped(let e):
guard let s = e.scalarValue else { return nil }
return s
default: return nil
}
}

var singleScalarASCIIValue: UInt8? {
if let s = singleScalar, s.isASCII {
return UInt8(ascii: s)
}
switch kind {
case let .char(c) where c != "\r\n":
return c.asciiValue
case let .scalar(s) where s.value.isASCII:
return UInt8(ascii: s.value)
default:
return nil
}
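
As a reference for the consumer-function discussion above, a standalone sketch of the closure shape used by Character.generateConsumer. The ConsumeFunction typealias here is an assumption that mirrors the closures visible in this diff; the real MEProgram.ConsumeFunction may differ, and characterConsumer is a hypothetical free function, not part of the PR.

typealias ConsumeFunction = (String, Range<String.Index>) -> String.Index?

// Grapheme-semantic, optionally case-insensitive consumption of a single
// Character, following the logic of Character.generateConsumer above.
// (Assumes bounds is non-empty, as the closures above do.)
func characterConsumer(
  _ c: Character, caseInsensitive: Bool
) -> ConsumeFunction {
  return { input, bounds in
    let low = bounds.lowerBound
    let matched = (caseInsensitive && c.isCased)
      ? input[low].lowercased() == c.lowercased()
      : input[low] == c
    return matched ? input.index(after: low) : nil
  }
}

let consume = characterConsumer("é", caseInsensitive: true)
let text = "E\u{301}tude"   // starts with decomposed "É"
print(consume(text, text.startIndex..<text.endIndex) != nil)  // true
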
56 changes: 40 additions & 16 deletions Sources/_StringProcessing/PrintAsPattern.swift
@@ -315,8 +315,7 @@ extension PrettyPrinter {
return
}

var charMembers = ""

var charMembers = StringLiteralBuilder()

// This iterates through all of the character class members collecting all
// of the members who can be stuffed into a singular '.anyOf(...)' vs.
@@ -340,14 +339,10 @@
switch a {
case let .char(c):
charMembers.append(c)

if c == "\\" {
charMembers.append(c)
}

return false
case let .scalar(s):
charMembers += "\\u{\(String(s.value, radix: 16, uppercase: true))}"
charMembers.append(
unescaped: "\\u{\(String(s.value, radix: 16, uppercase: true))}")
return false
case .unconverted(_):
return true
Expand All @@ -356,7 +351,7 @@ extension PrettyPrinter {
}

case let .quotedLiteral(s):
charMembers += s
charMembers.append(s)
return false

case .trivia(_):
@@ -370,7 +365,7 @@
// Also in the same vein, if we have a few atom members but no
// nonAtomMembers, then we can emit a single .anyOf(...) for them.
if !charMembers.isEmpty, nonCharMembers.isEmpty {
let anyOf = ".anyOf(\(charMembers._quoted))"
let anyOf = ".anyOf(\(charMembers))"

indent()

@@ -393,7 +388,7 @@
printer.indent()

if !charMembers.isEmpty {
printer.output(".anyOf(\(charMembers._quoted))")
printer.output(".anyOf(\(charMembers))")

if nonCharMembers.count > 0 {
printer.output(",")
@@ -617,10 +612,39 @@
}

extension String {
// TODO: Escaping?
fileprivate var _escaped: String {
_replacing(#"\"#, with: #"\\"#)._replacing(#"""#, with: #"\""#)
}

fileprivate var _quoted: String {
"\"\(self._replacing(#"\"#, with: #"\\"#)._replacing(#"""#, with: #"\""#))\""
_escaped._bareQuoted
}

fileprivate var _bareQuoted: String {
#""\#(self)""#
}
}

/// A helper for building string literals, which handles escaping the contents
/// appended.
fileprivate struct StringLiteralBuilder {
private var contents = ""

var result: String { contents._bareQuoted }
var isEmpty: Bool { contents.isEmpty }

mutating func append(_ str: String) {
contents += str._escaped
}
mutating func append(_ c: Character) {
contents += String(c)._escaped
}
mutating func append(unescaped str: String) {
contents += str
}
}
extension StringLiteralBuilder: CustomStringConvertible {
var description: String { result }
}

extension AST.Atom.AssertionKind {
@@ -1107,8 +1131,8 @@ extension DSLTree.Atom {

case let .scalar(s):
let hex = String(s.value, radix: 16, uppercase: true)
return ("\\u{\(hex)}"._quoted, false)
return ("\\u{\(hex)}"._bareQuoted, false)

case let .unconverted(a):
if a.ast.isUnprintableAtom {
return ("#/\(a.ast._regexBase)/#", false)
@@ -1149,7 +1173,7 @@ extension DSLTree.Atom {

case let .scalar(s):
let hex = String(s.value, radix: 16, uppercase: true)
return "\\u{\(hex)}"._quoted
return "\\u{\(hex)}"._bareQuoted

case let .unconverted(a):
return a.ast._regexBase
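
For illustration, a self-contained approximation of the StringLiteralBuilder introduced above. It substitutes Foundation's replacingOccurrences(of:with:) for the internal _replacing helper, so it is a sketch of the escaping behavior rather than the PR's actual implementation; the usage at the bottom is illustrative only.

import Foundation

struct StringLiteralBuilder {
  private var contents = ""

  var result: String { "\"\(contents)\"" }
  var isEmpty: Bool { contents.isEmpty }

  private static func escape(_ s: String) -> String {
    s.replacingOccurrences(of: #"\"#, with: #"\\"#)
     .replacingOccurrences(of: #"""#, with: #"\""#)
  }

  mutating func append(_ str: String) { contents += Self.escape(str) }
  mutating func append(_ c: Character) { contents += Self.escape(String(c)) }
  // Scalar syntax like \u{301} is appended verbatim so it survives as-is.
  mutating func append(unescaped str: String) { contents += str }
}

var members = StringLiteralBuilder()
members.append(#"a\b"#)                 // the backslash gets escaped
members.append(unescaped: #"\u{301}"#)  // kept verbatim
print(".anyOf(\(members.result))")      // .anyOf("a\\b\u{301}")
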
2 changes: 1 addition & 1 deletion Sources/_StringProcessing/Regex/ASTConversion.swift
@@ -216,7 +216,7 @@ extension AST.Atom {

switch self.kind {
case let .char(c): return .char(c)
case let .scalar(s): return .char(Character(s.value))
case let .scalar(s): return .scalar(s.value)

Contributor: Do we want the behaviour of the regex to be different depending on whether we input a Unicode value as its scalar code or as a literal character, i.e. "ợ" vs "\u{1ee3}"? With this, one would be interpreted as a .char and the other as a .scalar in the DSLTree. I think the intent of the original code here was to make those two inputs equivalent.

Contributor Author: I think they should be semantically equivalent (i.e. they go through the exact same code paths in byte code emission), but this distinction allows us to have more accurate printing of the DSL tree when performing a literal -> DSL conversion.

Contributor Author: (If we didn't need the distinction for printing, IMO it would make sense to remove .scalar from the DSL tree outright.)

Contributor: If they behave the same, then why would we need to print them differently? Printing them differently would imply that they behave differently, I think.

Also, seeing as we're moving towards making a .scalar behave exactly the same as a .char, maybe it would make sense to remove it outright.

Contributor Author: On whether printing them differently implies different behaviour: for example, if the user wrote /e\u{301}/, ideally the transformed DSL tree would be:

Regex {
  "e\u{301}"
}

instead of:

Regex {
  "é"
}

Semantically they're the same, but IMO it's much nicer to preserve the original syntax the user wrote.

case .any: return .any
case let .backreference(r): return .backreference(.init(ast: r))
case let .changeMatchingOptions(seq): return .changeMatchingOptions(.init(ast: seq))
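
A quick standalone check of the equivalence discussed in the thread above, in plain Swift and independent of this PR: the precomposed and decomposed spellings compare equal and are one Character each, but their scalar make-up differs, which is exactly the information the .scalar case preserves for printing.

let decomposed = "e\u{301}"   // "e" + U+0301 combining acute accent
let precomposed = "\u{E9}"    // "é", U+00E9

print(decomposed == precomposed)                  // true (canonical equivalence)
print(decomposed.count, precomposed.count)        // 1 1
print(decomposed.unicodeScalars.count,
      precomposed.unicodeScalars.count)           // 2 1
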
39 changes: 39 additions & 0 deletions Tests/RegexBuilderTests/RegexDSLTests.swift
@@ -1120,6 +1120,45 @@ class RegexDSLTests: XCTestCase {
}
}

func testScalarMatching() throws {
// RegexBuilder provides a RegexComponent conformance for UnicodeScalar. In
// grapheme cluster mode, it should only match entire graphemes. It may
// match a single scalar of a grapheme cluster in scalar semantic mode.
XCTAssertNotNil("a".firstMatch(of: "a" as UnicodeScalar))
XCTAssertNil("a\u{301}".firstMatch(of: "a" as UnicodeScalar))
XCTAssertNotNil("a\u{301}".firstMatch(
of: ("a" as UnicodeScalar).regex.matchingSemantics(.unicodeScalar)))

let r1 = Regex {
"a" as UnicodeScalar
}
XCTAssertNil(try r1.firstMatch(in: "a\u{301}"))
XCTAssertNotNil(
try r1.matchingSemantics(.unicodeScalar).firstMatch(in: "a\u{301}")
)

let r2 = Regex {
CharacterClass.anyOf(["a" as UnicodeScalar, "👍"])
}
XCTAssertNil(try r2.firstMatch(in: "a\u{301}"))
XCTAssertNotNil(
try r2.matchingSemantics(.unicodeScalar).firstMatch(in: "a\u{301}")
)

let r3 = Regex {
"👨" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👨" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👧" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👦" as UnicodeScalar
}
XCTAssertNil(try r3.firstMatch(in: "👨‍👨‍👧‍👦"))

Comment on lines +1148 to +1157

Contributor Author (@hamishknight, Jul 8, 2022): Hmm, actually I'm not sure whether this is correct, as we allow matching for e.g.:

print("👨‍👨‍👧‍👦".firstMatch(of: /👨\u{200D}👨\u{200D}👧\u{200D}👦/))

Should we be combining the scalars in a DSL concatenation the same way we combine them in a literal? And if so, would it also apply to regex scalar children, e.g. /\u{...}/?

Or is a DSL concatenation semantically more like the following?

print("👨‍👨‍👧‍👦".firstMatch(of: /(?:👨)(?:\u{200D})(?:👨)(?:\u{200D})(?:👧)(?:\u{200D})(?:👦)/)) // nil

Member: The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escapes inside string literals. For the DSL, I'd treat them like a stream of scalars, à la /\u{...}\u{...}\u{...}/.

Contributor Author: Yeah, makes sense, though we'll need to define exactly when the combining happens. E.g. do we combine the following?

Regex {
  "a" as UnicodeScalar
  /\u{301}/
}

And would that still happen if there's further content in the regex (because this is now a concat within a concat)? E.g.:

Regex {
  "a" as UnicodeScalar
  /\u{301}b/
}

Should Regex { ... } calls serve as a boundary between combining scalars? E.g.:

Regex {
  "a" as UnicodeScalar
  Regex {
    "\u{301}" as UnicodeScalar
    "b"
  }
}

XCTAssertNotNil(try r3.matchingSemantics(.unicodeScalar).firstMatch(in: "👨‍👨‍👧‍👦"))
XCTAssertNotNil(try r3.matchingSemantics(.unicodeScalar).wholeMatch(in: "👨‍👨‍👧‍👦"))
}

struct SemanticVersion: Equatable {
var major: Int
var minor: Int
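
A plain-Swift illustration of the scalar "globbing" described in the thread above (the open DSL design question itself is not settled here): adjacent scalars undergo grapheme breaking once they sit in a single string, so the four person scalars joined by zero-width joiners form one Character.

let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(family == "👨‍👨‍👧‍👦")              // true
print(family.count)                   // 1 (one grapheme cluster)
print(family.unicodeScalars.count)    // 7 (4 emoji scalars + 3 zero-width joiners)
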
30 changes: 30 additions & 0 deletions Tests/RegexTests/RenderDSLTests.swift
@@ -117,4 +117,34 @@ extension RenderDSLTests {
}
"""#)
}

func testScalar() throws {
try testConversion(#"\u{B4}"#, #"""
Regex {
"\u{B4}"
}
"""#)
try testConversion(#"\u{301}"#, #"""
Regex {
"\u{301}"
}
"""#)
try testConversion(#"[\u{301}]"#, #"""
Regex {
One(.anyOf("\u{301}"))
}
"""#)
try testConversion(#"[abc\u{301}]"#, #"""
Regex {
One(.anyOf("abc\u{301}"))
}
"""#)

// TODO: We ought to try and preserve the scalar syntax here.
try testConversion(#"a\u{301}"#, #"""
Regex {
"á"
}
"""#)
}
}