Remove CodePoints parsers and use CodeUnits parsers #59
Conversation
I'll try to give a bit more background to help you understand which parsers are the problematic ones. The issue stems from there being two different ways of looking at strings:
- A sequence of Unicode code points
- A sequence of UTF-16 code units
A Unicode code point generally identifies a single character, like U+0061 'a', U+0051 'Q', U+003F '?', U+00EA 'ê', or U+1F431 '🐱'. Each code point can be identified with a number between 0 and 1114111, or 10FFFF in hexadecimal (although not every number in that range corresponds to a code point; more on that in a bit). The notation U+XXXX means "the Unicode code point identified by the hexadecimal number XXXX".
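For concreteness, these code point values can be inspected from JavaScript (which is what PureScript strings are at runtime); this is an illustrative sketch, not part of the library:

```javascript
// Print each character's code point in the U+XXXX notation described above.
const chars = ["a", "Q", "ê", "🐱"];
const points = chars.map(c =>
  // codePointAt(0) reads the full code point, even for characters above U+FFFF
  "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(points); // [ 'U+0061', 'U+0051', 'U+00EA', 'U+1F431' ]
```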
A naive encoding of Unicode text could be a list of 32-bit integers, since 32 bits is enough to uniquely represent each Unicode character. However, setting aside 32 bits for every character was considered a little bit wasteful, especially since so many text files (eg: ones written by speakers of English and other languages which use small alphabets) only use a tiny proportion of the total number of Unicode code points. Therefore, a few systems (including JavaScript) use an encoding called UTF-16 internally, which consists of a sequence of UTF-16 code units. A UTF-16 code unit is essentially a 16-bit integer, and the UTF-16 encoding is a set of rules for converting a sequence of code points to a sequence of UTF-16 code units and back.
In UTF-16, any code point between U+0000 and U+D7FF, or between U+E000 and U+FFFF, is just represented as itself. So e.g. U+0061 'a' is represented as 0x0061, and "hello" is represented as 0x0068 0x0065 0x006C 0x006C 0x006F. Any code point too big to fit in that range is split across two code units, and the result is called a surrogate pair. So in UTF-16, the text "hello", which consists of 5 code points, becomes 5 code units. The text "🍔🍺", which contains two emojis whose code point values are U+1F354 and U+1F37A, becomes 4 code units: each emoji needs two code units, because their values are both above U+FFFF.
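You can observe this split directly in JavaScript; the surrogate values below follow mechanically from the UTF-16 encoding rules (a sketch for illustration only):

```javascript
// "🍔" is U+1F354 and "🍺" is U+1F37A; both are above U+FFFF, so each is
// encoded as a surrogate pair: a high surrogate followed by a low surrogate.
const s = "🍔🍺";
const units = [];
for (let i = 0; i < s.length; i++) {
  // charCodeAt reads individual UTF-16 code units, not code points
  units.push(s.charCodeAt(i).toString(16).toUpperCase());
}
console.log(units); // [ 'D83C', 'DF54', 'D83C', 'DF7A' ]
```

Two code points, four code units: exactly the 4 that `.length` reports below.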
Most of the JavaScript String API is based on code units, not code points. For example, the `length` property of a string in JS counts the number of code units, not code points:

```
> "hello".length
5
> "🍔🍺".length
4
```

If `.length` were counting code points, we'd see 5 for the first example, but 2 for the second. As another example, the `slice` function on strings counts code units. This means that you can cut a surrogate pair in half using `slice`!
```
> "🍔🍺".slice(1)
'�🍺'
```
The question mark appears because we have an invalid sequence of UTF-16 code units; we have lost the first element of the surrogate pair for the burger and are only left with the second. The first thing in the result string there, the part which appears as a question mark symbol, is called a lone surrogate. Ideally you don't want this to ever happen; if your code splits up a surrogate pair, it's probably a bug.
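In JavaScript specifically, iterating a string with `for...of` or the spread operator walks code points rather than code units, which is one way to avoid creating lone surrogates; a small sketch (illustrative only, not library code):

```javascript
const s = "🍔🍺";
// slice counts code units, so slice(1) strands the burger's low surrogate:
const broken = s.slice(1); // starts with the lone surrogate 0xDF54
// The spread operator iterates by code point, so no pair is ever split:
const codePoints = [...s]; // [ '🍔', '🍺' ]
// A lone surrogate can be detected by checking the code unit range
// (0xD800-0xDFFF is reserved for surrogates):
const startsWithLoneSurrogate =
  broken.charCodeAt(0) >= 0xd800 && broken.charCodeAt(0) <= 0xdfff;
```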
This issue is why both `Data.String.CodePoints` and `Data.String.CodeUnits` exist. The `Data.String.CodePoints` module views strings as lists of code points, and prevents you from accidentally splitting up a surrogate pair. The `Data.String.CodeUnits` module views strings as lists of code units, and does nothing to prevent you from splitting up a surrogate pair. This is why most of `Data.String.CodePoints` is re-exported by `Data.String`: it's safer, so we have made it the default. Unfortunately, it turns out naively rewriting the string-parsers library in terms of `Data.String.CodePoints` makes it really slow, so we need to be a bit more clever than that.
Unify `Text.Parsing.StringParser.CodePoints` and `Text.Parsing.StringParser.CodeUnits` into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to `Text.Parsing.StringParser`.
Looking back, I've just realised that these two modules have the exact same API so I don't think there's anything from the `CodePoints` module we should keep; we should just get rid of the whole thing. We should also document which of the parsers in the current `Text.Parsing.StringParser.CodeUnits` can split up surrogate pairs, and provide alternative versions of them based on `CodePoints` which don't suffer from the same issue. For example, here's a parser which can split up surrogate pairs:
```purescript
-- | Match any character.
anyChar :: Parser Char
anyChar = Parser \{ str, pos } ->
  case charAt pos str of
    Just chr -> Right { result: chr, suffix: { str, pos: pos + 1 } }
    Nothing -> Left { pos, error: ParseError "Unexpected EOF" }
```
This can split up surrogate pairs because it is based on the `Data.String.CodeUnits.charAt` function. A safe version of this parser might be:
```purescript
anyCodePoint :: Parser CodePoint
anyCodePoint = Parser \{ str, pos } ->
  case SCP.codePointAt 0 (SCU.drop pos str) of
    Just cp -> Right { result: cp, suffix: { str, pos: pos + SCU.length (SCP.singleton cp) } }
    Nothing -> Left { pos, error: ParseError "Unexpected EOF" }
```
which will ensure that if there is a surrogate pair at the current position in the string, we take both code units which make it up and return them in the CodePoint return value, as well as advancing the position by 2 (since we have consumed 2 code units). At least, I think/hope it will do that. I haven't tested this code, though.
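The position-advancing arithmetic can be checked with a sketch in JavaScript; `anyCodePointAt` is a hypothetical helper written for illustration, not part of the library. It relies on `codePointAt` combining a surrogate pair when one starts at the given index:

```javascript
// Read one code point at a code-unit position, returning the code point and
// the new position: BMP characters advance by 1, surrogate pairs by 2.
function anyCodePointAt(str, pos) {
  if (pos >= str.length) return null; // "Unexpected EOF"
  const cp = str.codePointAt(pos);
  // Re-encoding the code point tells us how many code units it occupied.
  const consumed = String.fromCodePoint(cp).length; // 1 or 2
  return { result: cp, nextPos: pos + consumed };
}

const r = anyCodePointAt("🍔a", 0);
// r.result is 0x1F354 and r.nextPos is 2: the whole pair was consumed.
```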
```purescript
-- | This module defines the `Parser` type of string parsers, and its instances.
module Text.Parsing.StringParser.Parser where
```
I think this stuff should stay in `Text.Parsing.StringParser` ideally - I think moving it to a new module would be an unnecessary breaking change. I think the contents of the current `Text.Parsing.StringParser.CodeUnits` module can move into `Text.Parsing.StringParser` alongside what's there already.
Then again maybe we should just leave the current contents of `Text.Parsing.StringParser.CodeUnits` where they are, as moving them is probably also an unnecessary breaking change.
Per your original instructions, `CodeUnits` was to be moved into `Text.Parsing.StringParser`. However, doing that caused a cyclical dependency between that module and `Text.Parsing.StringParser.Combinators`. So, I moved this code out into its own module and then re-exported it in `Text.Parsing.StringParser`.
I think leaving `Text.Parsing.StringParser.CodeUnits` as is makes sense to reduce breakage. However, why have a separate module for that now that `CodePoints` won't be its own module? If we're going to make breaking changes, now is the time, right?
Yeah, good point, now is the time to make breaking changes. Since putting the current contents of `Text.Parsing.StringParser.CodeUnits` into `Text.Parsing.StringParser` directly won't work, how about putting them back where they were before the change which added the `CodeUnits` and `CodePoints` modules: `Text.Parsing.StringParser.String`?
I think I should implement this in a separate PR after this one gets merged. It'll make the diff easier to see.
Thanks for that detailed response! I'll add that to this repo's documentation to help clarify things.
I actually don't think this should be part of this repo's documentation; ideally users of this library shouldn't really have to care about this stuff. (As maintainers, we do have to care about it, though.) If we were planning to document UTF-16 (and the fact that strings in PureScript use it), I think the Prim module would be a more appropriate place for that, as that's where the String type is defined. A slightly longer explainer in the Data.String.CodePoints and Data.String.CodeUnits modules would probably also be beneficial. I also think that it may be worth looking for an existing resource on Unicode encoding and UTF-16 specifically and linking to that, because this issue affects way more people than just us.
@hdgarrood One form of documentation we're eager to add to more of these libraries is a sort of "maintainer's documentation", which is not meant to guide users but rather to guide contributors. For example, there's the Hooks Internals documentation in the Halogen Hooks library. It's also a place to establish what is and is not within scope for a library -- for example, Try PureScript needs some documentation to guide contributors on what sorts of changes are acceptable. I don't exactly know how to organize that documentation appropriately but I think it's important it exists. Otherwise, new contributors have to guess at the decisions made in the library (and maintainers like myself have to do the same).
It may still be better to have that documentation largely link to other resources, rather than type up something we have to maintain here. But in general I would like to see more maintainer-oriented (or contributor-oriented) documentation in these repositories.
I agree. I think in most cases, this kind of documentation is best kept alongside the code it relates to, i.e. in code comments - if it's kept in separate files it's much easier to forget to update it. For this library specifically, the choice that the
When I use that, the below tests pass:

```purescript
expectResult :: forall a. Eq a => a -> Parser a -> String -> Boolean
expectResult res p input = runParser p input == Right res

testCodeUnits :: Effect Unit
testCodeUnits = do
  assert $ expectResult (NonEmptyList ("🍔" :| "🍺":Nil)) (many1 $ map SCP.singleton anyCodePoint) "🍔🍺"
  assert $ expectResult "🍔" (anyChar *> (map SCP.singleton anyCodePoint) <* anyChar) "a🍔a"
  assert $ expectResult ({a: "🍔", b: "🍺"}) ({a:_, b:_} <$> ((map SCP.singleton anyCodePoint) <* (void anyChar)) <*> (map SCP.singleton anyCodePoint)) "🍔a🍺"
```

So... I guess it works?
Seems like it! It would be worth testing with basic multilingual plane code points too (the ones with code point values less than U+FFFF, which can be represented as one code unit).
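For reference, here is what "representable as one code unit" means for BMP characters, as a plain JavaScript illustration independent of the library (assuming the 'é' is the precomposed U+00E9, not a combining sequence):

```javascript
// BMP code points (at most U+FFFF, outside the surrogate range) occupy a
// single UTF-16 code unit, so code-unit length equals code-point count:
const bmp = "héllo"; // 'é' here is the precomposed U+00E9
const unitLength = bmp.length;       // counts code units
const pointLength = [...bmp].length; // counts code points
console.log(unitLength, pointLength); // 5 5
```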
How's this? Any other tests I should add?
Implementation was designed by @hdgarrood
Currently working on this.
Not sure what the test would need to be, but perhaps what I have above works?
That test looks good. I'd also like to add a test where the string being parsed is valid UTF-16, and it becomes invalid as a result of a surrogate pair being split by
Doesn't seem to work:
outputs
Sorry, I made a typo: it should be
New output:
I'll add this test in a commit.
Is this PR ready to be merged yet? It's currently holding up #63
I remember thinking this still needed work the last time I looked. I'll take a proper look before the end of this weekend.
What else needs to be done again? Your comment here?
There's that and there's also the checkboxes in your initial comment which aren't marked as done yet. I also think I need to go over the entire API and make sure it makes sense in the context of treating strings as sequences of code points by default; there's still a lot of usage of
#63 is now merged -- what followup tasks should be done for #59 (comment)? Should we open followup issues marked with
I think followup issues probably make sense. I started looking into this and started to become unsure about the approach of just having a single module; if we're treating Unicode properly, then we probably ought to use `CodePoint` instead of `Char`, which (to me) suggests that we might want to keep the two separate modules after all, and go through the old `Text.Parsing.StringParser.CodePoints` and replace each `Char` with `CodePoint`.
What does this pull request do?
WIP PR that attempts to fix #48 by following the instructions specified by Harry in his comment.
Jordan: I wasn't sure how this should now be rendered.
Jordan: I'm not sure which parsers in the original `CodeUnits` module should be removed.
Jordan: I'm not sure which ones these are.
Jordan: I'm not sure what to do here.
Where should the reviewer start?
Provide further direction about the above guidelines
How should this be manually tested?
We might need to add more tests, but I'm not sure whether that's necessary. I haven't looked at this repo's tests yet.