Use slices instead of cursors #83

chtenb · 2022-02-17T18:35:59Z

Proof of concept to fix #77. This change would be breaking because it changes the underlying representation.

Use slices instead of cursors
Do a startsWith check in the string parser instead of a call to indexOf
Make the CodePoint parsers anyDigit, anyLetter and derivatives use the CodeUnit version of anyChar for speed
Added the change to the changelog's "Unreleased" section with a link to this PR and your username
Linked any existing issues or proposals that this pull request should close

Benchmark before	Benchmark after
StringParser.runParser parse23AnyCharPoints	StringParser.runParser parse23AnyCharPoints
mean = 1.01 s	mean = 16.14 ms
stddev = 51.48 ms	stddev = 13.64 ms
min = 963.28 ms	min = 8.55 ms
max = 1.16 s	max = 53.29 ms
StringParser.runParser parse23AnyCharUnits	StringParser.runParser parse23AnyCharUnits
mean = 8.53 ms	mean = 8.73 ms
stddev = 2.77 ms	stddev = 2.95 ms
min = 7.25 ms	min = 7.24 ms
max = 38.72 ms	max = 40.49 ms
StringParser.runParser parse23DigitPoints	StringParser.runParser parse23DigitPoints
mean = 994.93 ms	mean = 10.02 ms
stddev = 16.97 ms	stddev = 1.61 ms
min = 974.33 ms	min = 8.65 ms
max = 1.04 s	max = 23.21 ms
StringParser.runParser parse23DigitUnits	StringParser.runParser parse23DigitUnits
mean = 10.87 ms	mean = 10.28 ms
stddev = 1.78 ms	stddev = 1.56 ms
min = 9.51 ms	min = 8.71 ms
max = 23.81 ms	max = 22.51 ms
StringParser.runParser parse23StringPoints	StringParser.runParser parse23StringPoints
mean = 1.06 s	mean = 6.91 ms
stddev = 16.93 ms	stddev = 954.29 μs
min = 1.01 s	min = 5.80 ms
max = 1.09 s	max = 15.04 ms
StringParser.runParser parse23StringUnits	StringParser.runParser parse23StringUnits
mean = 4.52 ms	mean = 4.40 ms
stddev = 958.15 μs	stddev = 757.47 μs
min = 3.81 ms	min = 3.61 ms
max = 13.09 ms	max = 8.46 ms
StringParser.runParser parse23RegexPoints	StringParser.runParser parse23RegexPoints
mean = 1.03 s	mean = 11.77 ms
stddev = 44.87 ms	stddev = 2.11 ms
min = 978.55 ms	min = 9.76 ms
max = 1.15 s	max = 28.49 ms
StringParser.runParser parse23RegexUnits	StringParser.runParser parse23RegexUnits
mean = 5.90 ms	mean = 5.88 ms
stddev = 900.33 μs	stddev = 1.05 ms
min = 4.89 ms	min = 4.93 ms
max = 12.99 ms	max = 16.71 m

chtenb · 2022-02-19T19:11:48Z

We can close purescript/purescript-strings#155 once this PR gets merged

jamesdbrock · 2022-02-26T14:50:15Z

Nice, I will look at this.

jamesdbrock

Why is this a breaking change, besides changing the names of the record fields in PosString?

jamesdbrock · 2022-02-27T15:16:55Z

src/Text/Parsing/StringParser/CodePoints.purs

-      Nothing -> Left { pos, error: "CodePoint " <> show cp <> " is not a character" }
-    Nothing -> Left { pos, error: "Unexpected EOF" }
+      Just chr -> Right { result: chr, suffix: { substring: SCP.drop 1 substring, position: position + 1 } }
+      Nothing -> Left { pos: position, error: "CodePoint " <> show cp <> " is not a character" }


So here we're in the CodePoints parser, but I think this says that any astral CodePoint is not allowed? Is that what we want?

I mean it's true that because this parser returns Char there is no way for it to parse an astral CodePoint.

But I think this will surprise and confuse people? Maybe the CodePoints parsing module should not include this parser?

Yeah, I agree with you. I'm in favour of exporting an anyCodePoint instead or something in this module. I think the Char type is not a useful abstraction in PureScript (at least, it shouldn't be), but perhaps we should also keep supporting the anyChar parser as long as Char is a thing? I think it also depends on what direction purescript/purescript#3662 is taking.

chtenb · 2022-02-27T21:51:06Z

Why is this a breaking change, besides changing the names of the record fields in PosString?

Because the Parser type has changed and it is part of the public API of this package. If people have written a parser in terms of this definition, their code will have to be adjusted. Or does that not qualify as a breaking change?

jamesdbrock · 2022-02-28T14:30:23Z

Because the Parser type has changed and it is part of the public API of this package.

Sure, that makes sense. But the only thing that has changed about the type of the Parser is the record field names in PosString, right?

If it were up to me, I think I would do the work you've already done, and then also

add anyCodePoint :: Parser CodePoint
Document that anyChar will not always succeed, like here https://pursuit.purescript.org/packages/purescript-parsing/8.2.0/docs/Text.Parsing.Parser.String#v:anyChar

chtenb · 2022-02-28T14:36:44Z

But the only thing that has changed about the type of the Parser is the record field names in PosString, right?

Yes

I'll make an issue for anyCodePoint

chtenb added 2 commits February 17, 2022 19:33

proof of concept

c4567b7

remove unused import

72dc284

This was referenced Feb 18, 2022

StringParser.CodePoints quadratic scaling #77

Closed

Unify CodePoints and CodeUnits parsers #48

Open

chtenb added 4 commits February 20, 2022 10:42

improve string parser performance

5119643

improve performance of Char parsers with fixed domain

18e85ed

improve performance of code unit string parser

ef1fb7b

fix anyChar position administration

7cc3b36

chtenb force-pushed the slicing branch from 5a089d1 to 7cc3b36 Compare February 20, 2022 10:13

chtenb mentioned this pull request Feb 21, 2022

Efficient CodePoint indexing function? purescript/purescript-strings#155

Closed

chtenb added 4 commits February 26, 2022 12:12

Merge branch 'main' into slicing

31c1e07

revert usage of code unit positions in code point parsers for now

219799c

update changelog

61705a0

renames

998e1d1

chtenb marked this pull request as ready for review February 26, 2022 14:31

chtenb requested a review from jamesdbrock February 26, 2022 14:31

jamesdbrock approved these changes Feb 27, 2022

View reviewed changes

Merge branch 'main' into slicing

ede9ce8

chtenb merged commit 04fdd95 into main Mar 5, 2022

chtenb deleted the slicing branch March 5, 2022 08:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use slices instead of cursors #83

Use slices instead of cursors #83

Uh oh!

chtenb commented Feb 17, 2022 •

edited

Loading

Uh oh!

chtenb commented Feb 19, 2022

Uh oh!

jamesdbrock commented Feb 26, 2022

Uh oh!

jamesdbrock left a comment

Uh oh!

jamesdbrock Feb 27, 2022

Uh oh!

jamesdbrock Feb 27, 2022

Uh oh!

chtenb Feb 27, 2022

Uh oh!

chtenb commented Feb 27, 2022

Uh oh!

jamesdbrock commented Feb 28, 2022

Uh oh!

chtenb commented Feb 28, 2022

Uh oh!

Uh oh!

Use slices instead of cursors #83

Use slices instead of cursors #83

Uh oh!

Conversation

chtenb commented Feb 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chtenb commented Feb 19, 2022

Uh oh!

jamesdbrock commented Feb 26, 2022

Uh oh!

jamesdbrock left a comment

Choose a reason for hiding this comment

Uh oh!

jamesdbrock Feb 27, 2022

Choose a reason for hiding this comment

Uh oh!

jamesdbrock Feb 27, 2022

Choose a reason for hiding this comment

Uh oh!

chtenb Feb 27, 2022

Choose a reason for hiding this comment

Uh oh!

chtenb commented Feb 27, 2022

Uh oh!

jamesdbrock commented Feb 28, 2022

Uh oh!

chtenb commented Feb 28, 2022

Uh oh!

Uh oh!

chtenb commented Feb 17, 2022 •

edited

Loading