Unify CodePoints and CodeUnits parsers #48

Open
@hdgarrood

Arising out of discussion in #46.

Right now, we have parsers in the modules Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits, which use the same Parser data type, except that the former treats the integer pos field in the parser state as the number of code points we have consumed in the string being parsed, and the latter treats the pos field as the number of code units. This can cause problems if they are mixed:

> runParser (Tuple <$> CP.string "🐱" <*> CP.anyChar) "🐱hi"
(Right (Tuple "🐱" 'h'))

> runParser (Tuple <$> CP.string "🐱" <*> CU.anyChar) "🐱hi"
(Right (Tuple "🐱" '�'))
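The mismatch can be reproduced without the parser library at all. The following is a minimal sketch using only the `purescript-strings` modules `Data.String.CodePoints` and `Data.String.CodeUnits`; the `input` name and the `Main` module are illustrative. After consuming "🐱", a code point count of 1 is presumably what ends up in `pos`, and reading that same number as a code unit index lands in the middle of the surrogate pair.

```purescript
module Main where

import Prelude

import Data.Maybe (Maybe(..))
import Data.String.CodePoints as SCP
import Data.String.CodeUnits as SCU
import Effect (Effect)
import Effect.Console (logShow)

-- "🐱" (U+1F431) lies outside the BMP, so it occupies two UTF-16 code units.
input :: String
input = "🐱hi"

main :: Effect Unit
main = do
  logShow (SCP.length input) -- 3 code points
  logShow (SCU.length input) -- 4 code units
  -- Index 1 interpreted as a code point index is 'h'.
  logShow (SCP.codePointAt 1 input == Just (SCP.codePointFromChar 'h')) -- true
  -- Index 1 interpreted as a code unit index is the second half of the
  -- surrogate pair, not 'h'.
  logShow (SCU.charAt 1 input == Just 'h') -- false
  -- 'h' only appears at code unit index 2.
  logShow (SCU.charAt 2 input == Just 'h') -- true
```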

Additionally, storing an index into code points is not really justifiable from a performance perspective, since indexing into a string using code points is an O(n) operation, where n is the index; it requires looking at every code point in the string up to the given index.

If we compare the APIs exported by the Text.Parsing.StringParser.Code{Units,Points} modules, they are basically the same; in particular, the CodePoints parsers still use Char almost everywhere, which limits their utility quite severely. As far as I can tell, the only difference between the CodePoints and CodeUnits parsers (now that #46 has been merged) is that the CodePoints ones will fail rather than splitting up surrogate pairs.

I think the ideal solution would be to do the following:

  • Say that the pos field in the parser state always counts code units
  • Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.
  • Clearly demarcate parsers which have the ability to split up surrogate pairs, like anyChar. This could be done with doc-comments or we could move them into their own module Text.Parsing.StringParser.CodeUnits.
  • Provide CodePoint-based alternatives to any of the parsers which are currently based on Char, so that it is possible to do everything you might want to do without having to resort to using parsers like anyChar which can split up surrogate pairs.
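As a rough illustration of the first and last points, here is a sketch of what a CodePoint-based counterpart to anyChar might look like when pos counts code units. It deliberately does not use the library's actual Parser type: `State` and `anyCodePoint` are hypothetical names for this example only, and the step is modelled as a plain function returning `Maybe`. Only functions from `Data.String.CodePoints` and `Data.String.CodeUnits` are used.

```purescript
module Main where

import Prelude

import Data.Maybe (Maybe)
import Data.String.CodePoints (CodePoint)
import Data.String.CodePoints as SCP
import Data.String.CodeUnits as SCU
import Effect (Effect)
import Effect.Console (logShow)

-- Hypothetical, simplified parser state: `pos` always counts code units.
type State = { str :: String, pos :: Int }

-- Consume one whole code point, never splitting a surrogate pair.
-- The position advances by however many code units the code point occupies.
anyCodePoint :: State -> Maybe { result :: CodePoint, next :: State }
anyCodePoint { str: s, pos: p } = do
  -- Skip the already-consumed code units, then take one code point.
  { head: cp } <- SCP.uncons (SCU.drop p s)
  let width = SCU.length (SCP.singleton cp)
  pure { result: cp, next: { str: s, pos: p + width } }

main :: Effect Unit
main = do
  -- The cat occupies two code units, so the position moves from 0 to 2.
  logShow (map _.next.pos (anyCodePoint { str: "🐱hi", pos: 0 }))
  -- 'h' occupies one code unit, so the position moves from 2 to 3.
  logShow (map _.next.pos (anyCodePoint { str: "🐱hi", pos: 2 }))
```

Because the position is a code unit offset, the step goes straight to that offset rather than counting code points from the start of the string, and any Char-based parser that runs afterwards sees a position it can interpret consistently.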
