
Remove CodePoints parsers and use CodeUnits parsers #59


Merged: 4 commits merged into purescript-contrib:main on Dec 5, 2020

Conversation

JordanMartinez
Contributor

@JordanMartinez commented Sep 13, 2020

What does this pull request do?

WIP PR that attempts to fix #48 by following the instructions specified by Harry in his comment.

  • Say that the pos field in the parser state always counts code units.
    • Jordan: I wasn't sure how this should now be rendered.
  • Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.
    • Jordan: I'm not sure which parsers in the original CodeUnits module should be removed
  • Clearly demarcate parsers which have the ability to split up surrogate pairs, like anyChar. This could be done with doc-comments or we could move them into their own module Text.Parsing.StringParser.CodeUnits.
    • Jordan: I'm not sure which ones these are.
  • Provide CodePoint-based alternatives to any of the parsers which are currently based on Char, so that it is possible to do everything you might want to do without having to resort to using parsers like anyChar which can split up surrogate pairs.
    • Jordan: I'm not sure what to do here.

Where should the reviewer start?

Please provide further direction on the guidelines above.

How should this be manually tested?

We might need to add more tests, but I'm not sure whether that's necessary. I haven't looked at this repo's tests yet.

Contributor

@hdgarrood left a comment

I'll try to give a bit more background to help you understand which parsers are the problematic ones. The issue stems from there being two different ways of looking at strings:

  • A sequence of Unicode code points
  • A sequence of UTF-16 code units

A Unicode code point generally identifies a single character, like U+0061 'a', U+0051 'Q', U+003F '?', U+00EA 'ê', or U+1F431 '🐱'. Each code point can be identified with a number between 0 and 1114111, or 10FFFF in hexadecimal (although not every number in that range corresponds to a code point; more on that in a bit). The notation U+XXXX means "The unicode code point identified by the hexadecimal number XXXX".

A naive encoding of Unicode text could be a list of 32-bit integers, since 32 bits is enough to uniquely represent each Unicode character. However, setting aside 32 bits for every character was considered a little bit wasteful, especially since so many text files (eg: ones written by speakers of English and other languages which use small alphabets) only use a tiny proportion of the total number of Unicode code points. Therefore, a few systems (including JavaScript) use an encoding called UTF-16 internally, which consists of a sequence of UTF-16 code units. A UTF-16 code unit is essentially a 16-bit integer, and the UTF-16 encoding is a set of rules for converting a sequence of code points to a sequence of UTF-16 code units and back.

In UTF-16, any code point between U+0000 and U+D7FF, or between U+E000 and U+FFFF, is just represented as itself. So e.g. U+0061 'a' is represented as 0x0061, and "hello" is represented as 0x0068 0x0065 0x006C 0x006C 0x006F. Any code point too big to fit in that range is split across two code units, and the result is called a surrogate pair. So in UTF-16, the text "hello", which consists of 5 code points, becomes 5 code units. The text "🍔🍺", which contains two emojis whose code point values are U+1F354 and U+1F37A, becomes 4 code units: each emoji needs two code units, because their values are both above U+FFFF.
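To make that concrete, here is a quick check of my own (not from the original comment; it uses toCharCode from Data.Char and toCharArray from Data.String.CodeUnits, and the name burgerBeerUnits is just for illustration):

import Prelude
import Data.Char (toCharCode)
import Data.String.CodeUnits (toCharArray)

-- The UTF-16 code units of "🍔🍺", as integers. This evaluates to
-- [0xD83C, 0xDF54, 0xD83C, 0xDF7A]: two surrogate pairs, four code units,
-- for the two code points U+1F354 and U+1F37A.
burgerBeerUnits :: Array Int
burgerBeerUnits = map toCharCode (toCharArray "🍔🍺")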

Most of the JavaScript String API is based on code units, not code points. For example, the length property of a string in JS counts the number of code units, not code points:

> "hello".length
5
> "🍔🍺".length
4

If .length were counting code points, we'd see 5 for the first example, but 2 for the second. As another example, the slice function on strings counts code units. This means that you can cut a surrogate pair in half using slice!

> "🍔🍺".slice(1)
'�🍺'

The question mark appears because we have an invalid sequence of UTF-16 code units; we have lost the first element of the surrogate pair for the burger and are only left with the second. The first thing in the result string there, the part which appears as a question mark symbol, is called a lone surrogate. Ideally you don't want this to ever happen; if your code splits up a surrogate pair, it's probably a bug.

This issue is why both Data.String.CodePoints and Data.String.CodeUnits exist. The Data.String.CodePoints module views strings as lists of code points, and prevents you from accidentally splitting up a surrogate pair. The Data.String.CodeUnits module views strings as lists of code units, and does nothing to prevent you from splitting up a surrogate pair. This is why most of Data.String.CodePoints is re-exported by Data.String: it's safer, so we have made it the default. Unfortunately, it turns out naively rewriting the string-parsers library in terms of Data.String.CodePoints makes it really slow, so we need to be a bit more clever than that.
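As a small illustration of that difference (my own sketch, assuming the purescript-strings API; the demo function is not part of any library):

import Prelude
import Effect (Effect)
import Effect.Console (log, logShow)
import Data.String.CodePoints as SCP
import Data.String.CodeUnits as SCU

demo :: Effect Unit
demo = do
  logShow (SCP.length "🍔🍺")  -- 2: two code points
  logShow (SCU.length "🍔🍺")  -- 4: four code units
  log (SCU.drop 1 "🍔🍺")      -- starts with a lone surrogate, like the slice example above
  log (SCP.drop 1 "🍔🍺")      -- "🍺": dropping by code point keeps the string valid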

Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.

Looking back, I've just realised that these two modules have the exact same API so I don't think there's anything from the CodePoints module we should keep; we should just get rid of the whole thing. We should also document which of the parsers in the current Text.Parsing.StringParser.CodeUnits can split up surrogate pairs, and provide alternative versions of them based on CodePoints which don't suffer from the same issue. For example, here's a parser which can split up surrogate pairs:

-- | Match any character.
anyChar :: Parser Char
anyChar = Parser \{ str, pos } ->
  case charAt pos str of
    Just chr -> Right { result: chr, suffix: { str, pos: pos + 1 } }
    Nothing -> Left { pos, error: ParseError "Unexpected EOF" }

This can split up surrogate pairs because it is based on the Data.String.CodeUnits.charAt function. A safe version of this parser might be

anyCodePoint :: Parser CodePoint
anyCodePoint = Parser \{ str, pos } ->
  case SCP.codePointAt 0 (SCU.drop pos str) of
    Just cp -> Right { result: cp, suffix: { str, pos: pos + SCU.length (SCP.singleton cp) } }
    Nothing -> Left { pos, error: ParseError "Unexpected EOF" }

which will ensure that if there is a surrogate pair at the current position in the string, we take both code units which make it up and return them in the CodePoint return value, as well as advancing the position by 2 (since we have consumed 2 code units). At least, I think/hope it will do that. I haven't tested this code, though.
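A quick way to spot-check that (my sketch, in the same style as the snippets later in this thread; it assumes unParser from Text.Parsing.StringParser, the anyCodePoint parser above, and the usual imports, which are omitted here):

-- Expected output: "🍔 | next pos: 2" (one code point consumed, two code units)
checkAnyCodePoint :: Effect Unit
checkAnyCodePoint =
  case unParser anyCodePoint { str: "🍔🍺", pos: 0 } of
    Right r -> log $ SCP.singleton r.result <> " | next pos: " <> show r.suffix.pos
    Left e -> log $ show e.error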

@@ -0,0 +1,105 @@
-- | This module defines the `Parser` type of string parsers, and its instances.

module Text.Parsing.StringParser.Parser where
Contributor

I think this stuff should stay in Text.Parsing.StringParser ideally - I think moving it to a new module would be an unnecessary breaking change. I think the contents of the current Text.Parsing.StringParser.CodeUnits module can move into Text.Parsing.StringParser alongside what's there already.

Contributor

Then again maybe we should just leave the current contents of Text.Parsing.StringParser.CodeUnits where they are, as moving them is probably also an unnecessary breaking change.

Contributor Author

Per your original instructions, CodeUnits was to be moved into Text.Parsing.StringParser. However, doing that caused a cyclical dependency between that module and Text.Parsing.StringParser.Combinators. So, I moved this code out into its own module and then re-export it in Text.Parsing.StringParser.

I think leaving Text.Parsing.StringParser.CodeUnits as is makes sense to reduce breakage. However, why have a separate module for that now that CodePoints won't be its own module? If we're going to make breaking changes, now is the time, right?

Contributor

Yeah, good point, now is the time to make breaking changes. Since putting the current contents of Text.Parsing.StringParser.CodeUnits into Text.Parsing.StringParser directly won't work, how about putting them back where they were before the change which added the CodeUnits and CodePoints modules: Text.Parsing.StringParser.String?

Contributor Author

I think I should implement this in a separate PR after this one gets merged. It'll make the diff easier to see.

@JordanMartinez
Contributor Author

I'll try to give a bit more background to help you understand which parsers are the problematic ones. The issue stems from there being two different ways of looking at strings:

Thanks for that detailed response! I'll add that to this repo's documentation to help clarify things.

@hdgarrood
Contributor

I actually don't think this should be part of this repo's documentation; ideally users of this library shouldn't really have to care about this stuff. (As maintainers, we do have to care about it, though). If we were planning to document UTF-16 (and the fact that strings in PureScript use it), I think the Prim module would be a more appropriate place for that, as that's where the String type is defined. A slightly longer explainer in the Data.String.CodePoints and Data.String.CodeUnits modules would probably also be beneficial. I also think that it may be worth looking for an existing resource on Unicode encoding and UTF-16 specifically and linking to that, because this issue affects way more people than just us.

@thomashoneyman
Contributor

@hdgarrood One form of documentation we're eager to add to more of these libraries is a sort of "maintainer's documentation", which is not meant to guide users but rather to guide contributors. For example, there's the Hooks Internals documentation in the Halogen Hooks library. It's also a place to establish what is and is not within scope for a library -- for example, Try PureScript needs some documentation to guide contributors on what sorts of changes are acceptable.

I don't exactly know how to organize that documentation appropriately but I think it's important it exists. Otherwise, new contributors have to guess at the decisions made in the library (and maintainers like myself have to do the same).

@thomashoneyman
Contributor

It may still be better to have that documentation largely link to other resources, rather than type up something we have to maintain here. But in general I would like to see more maintainer-oriented (or contributor-oriented) documentation in these repositories.

@hdgarrood
Contributor

I agree. I think in most cases, this kind of documentation is best kept alongside the code it relates to, i.e. in code comments - if it's kept in separate files it's much easier to forget to update it.

For this library specifically, the choice that the pos in the Parser type represents the number of code units consumed is probably best documented alongside the definition of that type. We should also include the reasoning for that - which is that it allows O(1) indexing, which is necessary for good performance. The alternative (of having that index represent the number of code points consumed) is not workable, since indexing using code points is O(n), where n is the index. However, even for maintainer-oriented docs, I think the issue of Unicode and UTF-16 should be dealt with by linking to some other documentation, because it is not specific to this library: it affects almost any library which wants to work with strings.
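For example, the doc-comment might read roughly like this (my wording, assuming the library's PosString record of str and pos):

-- | The state of a string parser: the input string plus the current position.
-- |
-- | `pos` counts UTF-16 code units, not code points. Indexing by code unit
-- | (e.g. Data.String.CodeUnits.charAt pos str) is O(1), whereas finding the
-- | nth code point is O(n), which would make parsing quadratic overall.
type PosString = { str :: String, pos :: Int }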

@JordanMartinez
Contributor Author

A safe version of this parser might be

When I use that, the below tests pass:

expectResult :: forall a. Eq a => a -> Parser a -> String -> Boolean
expectResult res p input = runParser p input == Right res

testCodeUnits :: Effect Unit
testCodeUnits = do
  assert $ expectResult (NonEmptyList ("🍔" :| "🍺":Nil)) (many1 $ map SCP.singleton anyCodePoint) "🍔🍺"
  assert $ expectResult "🍔" (anyChar *> (map SCP.singleton anyCodePoint) <* anyChar) "a🍔a"
  assert $ expectResult ({a: "🍔", b: "🍺"}) ({a:_, b:_} <$> ((map SCP.singleton anyCodePoint) <* (void anyChar)) <*> (map SCP.singleton anyCodePoint)) "🍔a🍺"

So... I guess it works?

@hdgarrood
Contributor

Seems like it! It would be worth testing with basic multilingual plane code points too (the ones with code point values of U+FFFF or below, which can be represented as a single code unit).
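Something along these lines could cover that (a hypothetical assertion in the style of the tests above, reusing expectResult, anyCodePoint, and many1; 'é' is U+00E9, a BMP code point that fits in a single code unit):

assert $ expectResult (NonEmptyList ("é" :| "a" : "🍔" : Nil))
  (many1 $ map SCP.singleton anyCodePoint) "éa🍔"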

@JordanMartinez
Contributor Author

How's this? Any other tests I should add?

@JordanMartinez
Contributor Author

If the user supplies a string containing a lone surrogate to string, so that returning that string as the parse result would split a surrogate pair in half, I think we should actually split the surrogate pair in that case (since that's what the user has asked us to do). Come to think of it, we should have tests for that.

Currently working on this.

@JordanMartinez
Contributor Author

Not sure what the test would need to be, but perhaps what I have above works?

@hdgarrood
Contributor

That test looks good. I'd also like to add a test where the string being parsed is valid UTF-16, and it becomes invalid as a result of a surrogate pair being split by string. For example, given the input "🍔🍺" and a parser string "🍔\xd38c", the parser should match and consume the first three code units, and leave the last ('\xdf7a') unconsumed.

@JordanMartinez
Contributor Author

Doesn't seem to work:

case unParser (string "🍔\xd38c") {str:"🍔🍺", pos:0} of
    Left e -> log $ show e.error <> " @ " <> show e.pos
    Right r -> log $ show r.result <> " | Remaining: " <> (drop r.suffix.pos r.suffix.str)

outputs

Expected '🍔펌'. @ 0

@hdgarrood
Contributor

Sorry, I made a typo: it should be \xd83c rather than \xd38c.

@JordanMartinez
Contributor Author

New output:

"🍔�" | Remaining: �

I'll add this test in a commit.

@thomashoneyman added the purs-0.14 label ("A reminder to address this issue or merge this PR before we release PureScript v0.14.0") and removed the merge before 0.14 label on Dec 3, 2020
@JordanMartinez
Contributor Author

Is this PR ready to be merged yet? It's currently holding up #63

@hdgarrood
Contributor

I remember thinking this still needs work the last time I looked. I’ll take a proper look before the end of this weekend.

@hdgarrood
Contributor

I guess we could merge this now in order to unblock #63, but I think more work is required to consider #48 resolved.

@thomashoneyman merged commit aff7f2b into purescript-contrib:main on Dec 5, 2020
@JordanMartinez
Contributor Author

I guess we could merge this now in order to unblock #63, but I think more work is required to consider #48 resolved.

What else needs to be done again? Your comment here?

@JordanMartinez deleted the mergeParsers branch on December 5, 2020 at 17:11
@hdgarrood
Contributor

hdgarrood commented Dec 5, 2020

There's that and there's also the checkboxes in your initial comment which aren't marked as done yet. I also think I need to go over the entire API and make sure it makes sense in the context of treating strings as sequences of code points by default; there's still a lot of usage of Char which I think is a bit dodgy, for example.

@thomashoneyman
Contributor

#63 is now merged -- what followup tasks should be done for #59 (comment)? Should we open followup issues marked with fix before 0.14?

@hdgarrood
Contributor

I think followup issues probably make sense. I started looking into this and became unsure about the approach of just having a single module; if we're treating Unicode properly, then we probably ought to use CodePoint instead of Char, which (to me) suggests that we might want to keep the two separate modules after all, and go through the old Text.Parsing.StringParser.CodePoints and replace each Char with CodePoint.
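For example, the Char-based satisfy could get a CodePoint counterpart roughly like this (a sketch only; the name satisfyCodePoint is mine, and it assumes try and fail from Text.Parsing.StringParser plus the anyCodePoint parser discussed above):

satisfyCodePoint :: (CodePoint -> Boolean) -> Parser CodePoint
satisfyCodePoint f = try do
  cp <- anyCodePoint
  if f cp
    then pure cp
    else fail "Code point did not satisfy predicate"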

This was referenced Dec 27, 2020
thomashoneyman pushed a commit that referenced this pull request Dec 28, 2020
* Revert #59

* Update code in light of ParseError change

* Update CI to v0.14.0-rc5