Remove CodePoints parsers and use CodeUnits parsers #59
Conversation
I'll try to give a bit more background to help you understand which parsers are the problematic ones. The issue stems from there being two different ways of looking at strings:
- A sequence of Unicode code points
- A sequence of UTF-16 code units
A Unicode code point generally identifies a single character, like U+0061 'a', U+0051 'Q', U+003F '?', U+00EA 'ê', or U+1F431 '🐱'. Each code point can be identified with a number between 0 and 1114111, or 10FFFF in hexadecimal (although not every number in that range corresponds to a code point; more on that in a bit). The notation U+XXXX means "the Unicode code point identified by the hexadecimal number XXXX".
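For concreteness, these code point values can be inspected from JavaScript (which is what PureScript strings are at runtime); this is an illustrative sketch, not part of the library:

```javascript
// Print each character's code point in the U+XXXX notation described above.
const chars = ["a", "Q", "ê", "🐱"];
const points = chars.map(c =>
  // codePointAt(0) reads the full code point, even for characters above U+FFFF
  "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(points); // [ 'U+0061', 'U+0051', 'U+00EA', 'U+1F431' ]
```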
A naive encoding of Unicode text could be a list of 32-bit integers, since 32 bits is enough to uniquely represent each Unicode character. However, setting aside 32 bits for every character was considered a little bit wasteful, especially since so many text files (eg: ones written by speakers of English and other languages which use small alphabets) only use a tiny proportion of the total number of Unicode code points. Therefore, a few systems (including JavaScript) use an encoding called UTF-16 internally, which consists of a sequence of UTF-16 code units. A UTF-16 code unit is essentially a 16-bit integer, and the UTF-16 encoding is a set of rules for converting a sequence of code points to a sequence of UTF-16 code units and back.
In UTF-16, any code point between U+0000 and U+D7FF, or between U+E000 and U+FFFF, is just represented as itself. So e.g. U+0061 'a' is represented as 0x0061, and "hello" is represented as 0x0068 0x0065 0x006C 0x006C 0x006F. Any code point too big to fit in that range is split across two code units, and the result is called a surrogate pair. So in UTF-16, the text "hello", which consists of 5 code points, becomes 5 code units. The text "🍔🍺", which contains two emojis whose code point values are U+1F354 and U+1F37A, becomes 4 code units: each emoji needs two code units, because their values are both above U+FFFF.
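You can observe this split directly in JavaScript; the surrogate values below follow mechanically from the UTF-16 encoding rules (a sketch for illustration only):

```javascript
// "🍔" is U+1F354 and "🍺" is U+1F37A; both are above U+FFFF, so each is
// encoded as a surrogate pair: a high surrogate followed by a low surrogate.
const s = "🍔🍺";
const units = [];
for (let i = 0; i < s.length; i++) {
  // charCodeAt reads individual UTF-16 code units, not code points
  units.push(s.charCodeAt(i).toString(16).toUpperCase());
}
console.log(units); // [ 'D83C', 'DF54', 'D83C', 'DF7A' ]
```

Two code points, four code units: exactly the 4 that `.length` reports below.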
Most of the JavaScript String API is based on code units, not code points. For example, the `length` property of a string in JS counts the number of code units, not code points:

```
> "hello".length
5
> "🍔🍺".length
4
```

If `.length` were counting code points, we'd see 5 for the first example, but 2 for the second. As another example, the `slice` function on strings counts code units. This means that you can cut a surrogate pair in half using `slice`!
```
> "🍔🍺".slice(1)
'�🍺'
```
The question mark appears because we have an invalid sequence of UTF-16 code units; we have lost the first element of the surrogate pair for the burger and are only left with the second. The first thing in the result string there, the part which appears as a question mark symbol, is called a lone surrogate. Ideally you don't want this to ever happen; if your code splits up a surrogate pair, it's probably a bug.
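In JavaScript specifically, iterating a string with `for...of` or the spread operator walks code points rather than code units, which is one way to avoid creating lone surrogates; a small sketch (illustrative only, not library code):

```javascript
const s = "🍔🍺";
// slice counts code units, so slice(1) strands the burger's low surrogate:
const broken = s.slice(1); // starts with the lone surrogate 0xDF54
// The spread operator iterates by code point, so no pair is ever split:
const codePoints = [...s]; // [ '🍔', '🍺' ]
// A lone surrogate can be detected by checking the code unit range
// (0xD800-0xDFFF is reserved for surrogates):
const startsWithLoneSurrogate =
  broken.charCodeAt(0) >= 0xd800 && broken.charCodeAt(0) <= 0xdfff;
```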
This issue is why both `Data.String.CodePoints` and `Data.String.CodeUnits` exist. The `Data.String.CodePoints` module views strings as lists of code points, and prevents you from accidentally splitting up a surrogate pair. The `Data.String.CodeUnits` module views strings as lists of code units, and does nothing to prevent you from splitting up a surrogate pair. This is why most of `Data.String.CodePoints` is re-exported by `Data.String`: it's safer, so we have made it the default. Unfortunately, it turns out naively rewriting the string-parsers library in terms of `Data.String.CodePoints` makes it really slow, so we need to be a bit more clever than that.
Unify `Text.Parsing.StringParser.CodePoints` and `Text.Parsing.StringParser.CodeUnits` into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to `Text.Parsing.StringParser`.
Looking back, I've just realised that these two modules have the exact same API so I don't think there's anything from the `CodePoints` module we should keep; we should just get rid of the whole thing. We should also document which of the parsers in the current `Text.Parsing.StringParser.CodeUnits` can split up surrogate pairs, and provide alternative versions of them based on `CodePoints` which don't suffer from the same issue. For example, here's a parser which can split up surrogate pairs:
```purescript
-- | Match any character.
anyChar :: Parser Char
anyChar = Parser \{ str, pos } ->
  case charAt pos str of
    Just chr -> Right { result: chr, suffix: { str, pos: pos + 1 } }
    Nothing -> Left { pos, error: ParseError "Unexpected EOF" }
```
This can split up surrogate pairs because it is based on the `Data.String.CodeUnits.charAt` function. A safe version of this parser might be:
```purescript
anyCodePoint :: Parser CodePoint
anyCodePoint = Parser \{ str, pos } ->
  case SCP.codePointAt 0 (SCU.drop pos str) of
    Just cp -> Right { result: cp, suffix: { str, pos: pos + SCU.length (SCP.singleton cp) } }
    Nothing -> Left { pos, error: ParseError "Unexpected EOF" }
```
which will ensure that if there is a surrogate pair at the current position in the string, we take both code units which make it up and return them in the CodePoint return value, as well as advancing the position by 2 (since we have consumed 2 code units). At least, I think/hope it will do that. I haven't tested this code, though.
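The position-advancing arithmetic can be checked with a sketch in JavaScript; `anyCodePointAt` is a hypothetical helper written for illustration, not part of the library. It relies on `codePointAt` combining a surrogate pair when one starts at the given index:

```javascript
// Read one code point at a code-unit position, returning the code point and
// the new position: BMP characters advance by 1, surrogate pairs by 2.
function anyCodePointAt(str, pos) {
  if (pos >= str.length) return null; // "Unexpected EOF"
  const cp = str.codePointAt(pos);
  // Re-encoding the code point tells us how many code units it occupied.
  const consumed = String.fromCodePoint(cp).length; // 1 or 2
  return { result: cp, nextPos: pos + consumed };
}

const r = anyCodePointAt("🍔a", 0);
// r.result is 0x1F354 and r.nextPos is 2: the whole pair was consumed.
```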
```purescript
-- | This module defines the `Parser` type of string parsers, and its instances.
module Text.Parsing.StringParser.Parser where
```
I think this stuff should stay in `Text.Parsing.StringParser` ideally - I think moving it to a new module would be an unnecessary breaking change. I think the contents of the current `Text.Parsing.StringParser.CodeUnits` module can move into `Text.Parsing.StringParser` alongside what's there already.
Then again maybe we should just leave the current contents of `Text.Parsing.StringParser.CodeUnits` where they are, as moving them is probably also an unnecessary breaking change.
Per your original instructions, `CodeUnits` was to be moved into `Text.Parsing.StringParser`. However, doing that caused a cyclical dependency between that module and `Text.Parsing.StringParser.Combinators`. So, I moved this code out into its own module and then re-exported it in `Text.Parsing.StringParser`.
I think leaving `Text.Parsing.StringParser.CodeUnits` as is makes sense to reduce breakage. However, why have a separate module for that now that `CodePoints` won't be its own module? If we're going to make breaking changes, now is the time, right?
Yeah, good point, now is the time to make breaking changes. Since putting the current contents of `Text.Parsing.StringParser.CodeUnits` into `Text.Parsing.StringParser` directly won't work, how about putting them back where they were before the change which added the `CodeUnits` and `CodePoints` modules: `Text.Parsing.StringParser.String`?
I think I should implement this in a separate PR after this one gets merged. It'll make the diff easier to see.
Thanks for that detailed response! I'll add that to this repo's documentation to help clarify things.
I actually don't think this should be part of this repo's documentation; ideally users of this library shouldn't really have to care about this stuff. (As maintainers, we do have to care about it, though.) If we were planning to document UTF-16 (and the fact that strings in PureScript use it), I think the Prim module would be a more appropriate place for that, as that's where the String type is defined. A slightly longer explainer in the Data.String.CodePoints and Data.String.CodeUnits modules would probably also be beneficial. I also think that it may be worth looking for an existing resource on Unicode encoding and UTF-16 specifically and linking to that, because this issue affects way more people than just us.
@hdgarrood One form of documentation we're eager to add to more of these libraries is a sort of "maintainer's documentation", which is not meant to guide users but rather to guide contributors. For example, there's the Hooks Internals documentation in the Halogen Hooks library. It's also a place to establish what is and is not within scope for a library -- for example, Try PureScript needs some documentation to guide contributors on what sorts of changes are acceptable. I don't exactly know how to organize that documentation appropriately but I think it's important it exists. Otherwise, new contributors have to guess at the decisions made in the library (and maintainers like myself have to do the same).
It may still be better to have that documentation largely link to other resources, rather than type up something we have to maintain here. But in general I would like to see more maintainer-oriented (or contributor-oriented) documentation in these repositories.
I agree. I think in most cases, this kind of documentation is best kept alongside the code it relates to, i.e. in code comments - if it's kept in separate files it's much easier to forget to update it. For this library specifically, the choice that the
When I use that, the below tests pass:

```purescript
expectResult :: forall a. Eq a => a -> Parser a -> String -> Boolean
expectResult res p input = runParser p input == Right res

testCodeUnits :: Effect Unit
testCodeUnits = do
  assert $ expectResult (NonEmptyList ("🍔" :| "🍺":Nil)) (many1 $ map SCP.singleton anyCodePoint) "🍔🍺"
  assert $ expectResult "🍔" (anyChar *> (map SCP.singleton anyCodePoint) <* anyChar) "a🍔a"
  assert $ expectResult ({a: "🍔", b: "🍺"}) ({a:_, b:_} <$> ((map SCP.singleton anyCodePoint) <* (void anyChar)) <*> (map SCP.singleton anyCodePoint)) "🍔a🍺"
```

So... I guess it works?
Seems like it! It would be worth testing with basic multilingual plane code points too (the ones with code point values less than U+FFFF, which can be represented as one code unit).
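For reference, here is what "representable as one code unit" means for BMP characters, as a plain JavaScript illustration independent of the library (assuming the 'é' is the precomposed U+00E9, not a combining sequence):

```javascript
// BMP code points (at most U+FFFF, outside the surrogate range) occupy a
// single UTF-16 code unit, so code-unit length equals code-point count:
const bmp = "héllo"; // 'é' here is the precomposed U+00E9
const unitLength = bmp.length;       // counts code units
const pointLength = [...bmp].length; // counts code points
console.log(unitLength, pointLength); // 5 5
```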
How's this? Any other tests I should add?
Implementation was designed by @hdgarrood
Currently working on this.
Not sure what the test would need to be, but perhaps what I have above works?
That test looks good. I'd also like to add a test where the string being parsed is valid UTF-16, and it becomes invalid as a result of a surrogate pair being split by
Doesn't seem to work:
outputs
Sorry, I made a typo: it should be
New output:
I'll add this test in a commit.
Is this PR ready to be merged yet? It's currently holding up #63
I remember thinking this still needed work the last time I looked. I'll take a proper look before the end of this weekend.
What else needs to be done again? Your comment here?
There's that and there's also the checkboxes in your initial comment which aren't marked as done yet. I also think I need to go over the entire API and make sure it makes sense in the context of treating strings as sequences of code points by default; there's still a lot of usage of
#63 is now merged -- what followup tasks should be done for #59 (comment)? Should we open followup issues marked with
I think followup issues probably make sense. I started looking into this and started to become unsure about the approach of just having a single module; if we're treating Unicode properly, then we probably ought to use `CodePoint` instead of `Char`, which (to me) suggests that we might want to keep the two separate modules after all, and go through the old `Text.Parsing.StringParser.CodePoints` and replace each `Char` with `CodePoint`.
What does this pull request do?
WIP PR that attempts to fix #48 by following the instructions specified by Harry in his comment.
Jordan: I wasn't sure how this should now be rendered.
Jordan: I'm not sure which parsers in the original `CodeUnits` module should be removed.
Jordan: I'm not sure which ones these are.
Jordan: I'm not sure what to do here.
Where should the reviewer start?
Provide further direction about the above guidelines
How should this be manually tested?
We might need to add more tests, but I'm not sure whether that's necessary. I haven't looked at this repo's tests yet.