Unify CodePoints and CodeUnits parsers #48

Arising out of discussion in #46.

Right now, the modules `Text.Parsing.StringParser.CodePoints` and `Text.Parsing.StringParser.CodeUnits` provide parsers that share the same `Parser` data type but interpret the integer `pos` field in the parser state differently: the former treats it as the number of _code points_ consumed from the string being parsed, and the latter treats it as the number of _code units_. This can cause problems if they are mixed:

```
> runParser (Tuple <$> CP.string "🐱" <*> CP.anyChar) "🐱hi"
(Right (Tuple "🐱" 'h'))

> runParser (Tuple <$> CP.string "🐱" <*> CU.anyChar) "🐱hi"
(Right (Tuple "🐱" '�'))
```

Additionally, storing an index into code points is not really justifiable from a performance perspective, since indexing into a string using code points is an O(n) operation, where n is the index; it requires looking at every code point in the string up to the given index.
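To make the mismatch concrete, here is a small sketch using only `Data.String.CodePoints` and `Data.String.CodeUnits` from `purescript-strings`. The module name and the top-level names are made up for illustration. It shows that "🐱" counts as one code point but two code units, which is presumably why a `pos` of 1 left behind by `CP.string "🐱"` makes `CU.anyChar` read the second half of the surrogate pair.

```purescript
module IndexingSketch where

import Prelude

import Data.Maybe (Maybe)
import Data.String.CodePoints (CodePoint)
import Data.String.CodePoints as CP
import Data.String.CodeUnits as CU

input :: String
input = "🐱hi"

-- "🐱" is a single code point, so this is 3.
codePointCount :: Int
codePointCount = CP.length input

-- "🐱" is a surrogate pair (two UTF-16 code units), so this is 4.
codeUnitCount :: Int
codeUnitCount = CU.length input

-- Index 1 counted in code units lands inside the surrogate pair,
-- so this yields a lone low surrogate rather than 'h'.
unitAtOne :: Maybe Char
unitAtOne = CU.charAt 1 input

-- Index 1 counted in code points is 'h', but finding it means
-- walking over every code point before it.
pointAtOne :: Maybe CodePoint
pointAtOne = CP.codePointAt 1 input
```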

If we compare the APIs exported by the `Text.Parsing.StringParser.Code{Units,Points}` modules, they are basically the same; in particular, the `CodePoints` parsers still use `Char` almost everywhere, which limits their utility quite severely. As far as I can tell, the only difference between the `CodePoints` and `CodeUnits` parsers (now that #46 has been merged) is that the `CodePoints` ones will fail rather than splitting up surrogate pairs.

I think the ideal solution would be to do the following:

* Say that the `pos` field in the parser state always counts code units
* Unify `Text.Parsing.StringParser.CodePoints` and `Text.Parsing.StringParser.CodeUnits` into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to `Text.Parsing.StringParser`.
* Clearly demarcate parsers which have the ability to split up surrogate pairs, like `anyChar`. This could be done with doc-comments or we could move them into their own module `Text.Parsing.StringParser.CodeUnits`.
* Provide `CodePoint`-based alternatives to any of the parsers which are currently based on `Char`, so that it is possible to do everything you might want to do without having to resort to using parsers like `anyChar` which can split up surrogate pairs.
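As a rough illustration of the first and last points, a `CodePoint`-based primitive can keep `pos` in code units and simply advance by one or two units depending on what it read. The sketch below is standalone and does not use the library's actual `Parser` type; `PosString`, `Step`, and `anyCodePoint` are hypothetical names, and the error message is a placeholder.

```purescript
module AnyCodePointSketch where

import Prelude

import Data.Either (Either(..))
import Data.Maybe (Maybe(..))
import Data.String.CodePoints (CodePoint, codePointAt, singleton)
import Data.String.CodeUnits as CU

-- Hypothetical parser state: `pos` always counts code units.
type PosString = { str :: String, pos :: Int }

-- Hypothetical result of one parsing step.
type Step a = Either String { result :: a, suffix :: PosString }

-- Consume one whole code point, never splitting a surrogate pair.
anyCodePoint :: PosString -> Step CodePoint
anyCodePoint { str, pos } =
  case codePointAt 0 (CU.drop pos str) of
    Nothing -> Left "Unexpected EOF"
    Just cp ->
      -- Advance by the code point's width in code units (1 or 2).
      Right { result: cp, suffix: { str, pos: pos + CU.length (singleton cp) } }

-- After consuming "🐱", the position is 2 code units, not 1.
example :: Either String Int
example = map _.suffix.pos (anyCodePoint { str: "🐱hi", pos: 0 })
-- Right 2
```

With `pos` defined this way, a `Char`-based parser such as `anyChar` run afterwards would start at the `h` rather than in the middle of the emoji.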
