Correctly handle UTF-16 surrogate pairs in `String`s.
All prior tests pass with no modifications. Add a few new tests.
Non-breaking changes
====================
Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
`CodePoint`s.
Add the `match` combinator.
Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't
export it.
Breaking changes
================
Change the definition of `whiteSpace` and `skipSpaces` to use
`Data.CodePoint.Unicode.isSpace`.
Move the character class parsers from the `Text.Parsing.Parser.Token` module
into the `Text.Parsing.Parser.String` module.
To make this library handle Unicode correctly, it is necessary to
either alter the `StringLike` class or delete it.
We decided to delete it. The `String` module will now operate only
on inputs of the concrete `String` type.
`StringLike` has no laws, and during the five years of its life,
no one on GitHub has ever written another instance of `StringLike`.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code
The last time someone tried to alter `StringLike`, this is what
happened:
purescript-contrib#62
Breaking changes which won’t be caught by the compiler
======================================================
Fundamentally, we change the way we consume the next input character from
`Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.
`anyChar` will no longer always succeed. It will only succeed on a Basic
Multilingual Plane character. The new parser `anyCodePoint` will always succeed.
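The difference between the two `uncons` functions can be sketched in
TypeScript, which shares the UTF-16 string representation that PureScript
`String`s have at runtime. This is only an illustration under that
assumption, not the library's implementation. The names `unconsCodeUnit`,
`unconsCodePoint`, and `anyCharSketch` are hypothetical.

```typescript
// Hypothetical helpers for illustration; not part of purescript-parsing.
type Uncons<T> = { head: T; tail: string } | null;

// Analogue of Data.String.CodeUnits.uncons: splits off one UTF-16 code unit.
function unconsCodeUnit(s: string): Uncons<string> {
  return s.length === 0 ? null : { head: s.charAt(0), tail: s.slice(1) };
}

// Analogue of Data.String.CodePoints.uncons: splits off one whole code point.
function unconsCodePoint(s: string): Uncons<number> {
  const cp = s.codePointAt(0);
  if (cp === undefined) return null;
  // Code points above U+FFFF occupy two code units (a surrogate pair).
  const width = cp > 0xffff ? 2 : 1;
  return { head: cp, tail: s.slice(width) };
}

// Sketch of the new anyChar rule: succeed only on a BMP code point.
function anyCharSketch(s: string): Uncons<string> {
  const r = unconsCodePoint(s);
  if (r === null || r.head > 0xffff) return null;
  return { head: String.fromCharCode(r.head), tail: r.tail };
}
```

On an input that starts with an astral character such as U+1F600,
`unconsCodeUnit` splits the surrogate pair in half, `unconsCodePoint`
consumes both halves together, and `anyCharSketch` fails without consuming
anything.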
We are not quite “making the default `CodePoint`”, as was discussed in
purescript-contrib#76 (comment).
Rather we are keeping most of the current API and making it work
properly with astral Unicode.
We keep the `Char` parsers for backward compatibility.
We also keep the `Char` parsers for ergonomic reasons. For example
the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`.
This parser is usually called with a literal like `char 'a'`. It would
be annoying to call this parser with `char (codePointFromChar 'a')`.
Benchmarks
==========
For Unicode correctness, we're now consuming characters with
`Data.String.CodePoints.uncons` instead of
`Data.String.CodeUnits.uncons`. If that were going to affect
performance, then the effect would show up in the `runParser parse23`
benchmark, but it doesn’t.
Before
------
```
runParser parse23
mean = 43.36 ms
stddev = 6.75 ms
min = 41.12 ms
max = 124.65 ms
runParser parseSkidoo
mean = 22.53 ms
stddev = 3.86 ms
min = 21.40 ms
max = 61.76 ms
```
After
-----
```
runParser parse23
mean = 42.90 ms
stddev = 6.01 ms
min = 40.97 ms
max = 115.74 ms
runParser parseSkidoo
mean = 22.03 ms
stddev = 2.79 ms
min = 20.78 ms
max = 53.34 ms
```
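One rough way to read the two listings, sketched in TypeScript: compare the
change in each mean against the reported standard deviations. The `compare`
helper and the "within one standard deviation" threshold are assumptions made
for this illustration, not part of the benchmark suite.

```typescript
// Figures copied from the Before and After listings (milliseconds).
type Bench = "parse23" | "parseSkidoo";
type Stats = { mean: number; stddev: number };

const before: Record<Bench, Stats> = {
  parse23: { mean: 43.36, stddev: 6.75 },
  parseSkidoo: { mean: 22.53, stddev: 3.86 },
};
const after: Record<Bench, Stats> = {
  parse23: { mean: 42.9, stddev: 6.01 },
  parseSkidoo: { mean: 22.03, stddev: 2.79 },
};

// Hypothetical helper: change in the mean, and whether that change is
// smaller than the smaller of the two reported standard deviations.
function compare(name: Bench): { deltaMs: number; withinNoise: boolean } {
  const deltaMs = after[name].mean - before[name].mean;
  const noise = Math.min(before[name].stddev, after[name].stddev);
  return { deltaMs, withinNoise: Math.abs(deltaMs) < noise };
}
```

For both benchmarks the change in the mean is smaller than one standard
deviation, which is consistent with the claim that the switch to
`Data.String.CodePoints.uncons` has no measurable cost here.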