Description
Arising out of discussion in #46.
Right now, we have parsers in the modules Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits, which use the same Parser data type, except that the former treats the integer pos field in the parser state as the number of code points we have consumed in the string being parsed, and the latter treats the pos field as the number of code units. This can cause problems if they are mixed:
> runParser (Tuple <$> CP.string "🐱" <*> CP.anyChar) "🐱hi"
(Right (Tuple "🐱" 'h'))
> runParser (Tuple <$> CP.string "🐱" <*> CU.anyChar) "🐱hi"
(Right (Tuple "🐱" '�'))
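To make the mismatch concrete, here is a minimal sketch using only Data.String.CodePoints and Data.String.CodeUnits from purescript-strings (not the parser library itself). It assumes, as described above, that the position after consuming "🐱" is 1 when counted in code points, and shows what that same number points at under each interpretation.

```purescript
module Main where

import Prelude

import Data.Maybe (Maybe(..))
import Data.String.CodePoints as CP
import Data.String.CodeUnits as CU
import Effect (Effect)
import Effect.Console (logShow)

main :: Effect Unit
main = do
  let input = "🐱hi"
  -- "🐱" is one code point but two UTF-16 code units.
  logShow (CP.length input) -- 3
  logShow (CU.length input) -- 4
  -- Position 1 read as a code point index lands on 'h' ...
  logShow (CP.codePointAt 1 input == Just (CP.codePointFromChar 'h')) -- true
  -- ... but read as a code unit index it lands inside the surrogate pair.
  logShow (CU.charAt 1 input == Just 'h') -- false
```

The second half of the surrogate pair is not a valid character on its own, which would explain the replacement character in the second REPL result.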
Additionally, storing an index into code points is not really justifiable from a performance perspective, since indexing into a string using code points is an O(n) operation, where n is the index; it requires looking at every code point in the string up to the given index.
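As a rough illustration of that cost, the sketch below converts a code point index into a code unit offset. The helper name codeUnitOffset is hypothetical and not part of any library; the point is only that there is no way to compute the offset without walking the first n code points, which is what CP.take has to do here.

```purescript
module Main where

import Prelude

import Data.String.CodePoints as CP
import Data.String.CodeUnits as CU
import Effect (Effect)
import Effect.Console (logShow)

-- Hypothetical helper: the code unit offset of the n-th code point.
-- CP.take must inspect each of the first n code points to know
-- whether it occupies one or two code units.
codeUnitOffset :: Int -> String -> Int
codeUnitOffset n str = CU.length (CP.take n str)

main :: Effect Unit
main = do
  logShow (codeUnitOffset 1 "🐱hi") -- 2
  logShow (codeUnitOffset 2 "🐱hi") -- 3
```

A position stored in code units avoids this conversion entirely, since it can be used directly as an offset into the string.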
If we compare the APIs exported by the Text.Parsing.StringParser.Code{Units,Points} modules, they are basically the same; in particular, the CodePoints parsers still use Char almost everywhere, which limits their utility quite severely. As far as I can tell, the only difference between the CodePoints and CodeUnits parsers (now that #46 has been merged) is that the CodePoints ones will fail rather than splitting up surrogate pairs.
I think the ideal solution would be to do the following:
- Say that the pos field in the parser state always counts code units.
- Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.
- Clearly demarcate parsers which have the ability to split up surrogate pairs, like anyChar. This could be done with doc-comments or we could move them into their own module Text.Parsing.StringParser.CodeUnits.
- Provide CodePoint-based alternatives to any of the parsers which are currently based on Char, so that it is possible to do everything you might want to do without having to resort to using parsers like anyChar which can split up surrogate pairs.
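To sketch what the last point could look like: the State type and the anyCodePoint function below are hypothetical and deliberately simplified. They are not the library's actual Parser type or API, just an illustration of a CodePoint-returning parser that keeps pos in code units and advances it by one or two depending on the code point consumed.

```purescript
module Main where

import Prelude

import Data.Maybe (Maybe)
import Data.String.CodePoints (CodePoint)
import Data.String.CodePoints as CP
import Data.String.CodeUnits as CU
import Effect (Effect)
import Effect.Console (logShow)

-- Hypothetical, simplified parser state; pos counts code units.
type State = { str :: String, pos :: Int }

-- Hypothetical CodePoint-based counterpart to anyChar.
-- Returns Nothing at the end of the input.
anyCodePoint :: State -> Maybe { result :: CodePoint, next :: State }
anyCodePoint { str, pos } = do
  { head } <- CP.uncons (CU.drop pos str)
  -- Advance by the width of the code point in code units (1 or 2).
  let width = CU.length (CP.singleton head)
  pure { result: head, next: { str, pos: pos + width } }

main :: Effect Unit
main = do
  let start = { str: "🐱hi", pos: 0 }
  logShow (map _.next.pos (anyCodePoint start)) -- (Just 2)
  logShow (map _.next.pos (anyCodePoint (start { pos = 2 }))) -- (Just 3)
```

As long as every parser either consumes whole code points like this or is clearly marked as able to split surrogate pairs, the two kinds can share one pos field without the mismatch shown earlier.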