Tokenizer: pretranslate lowercase element and attribute names #520

jayaddison · 2020-12-29T17:34:32Z

During tokenization, some element names, attributes, and temporary buffered strings are compared in a case-insensitive mode.

To avoid repeated string transformation operations, this change lowercases those strings at construction time.

This change builds upon and includes #519 and prepares for further refactoring work aiming towards resolving #24. Those further changes should de-duplicate a number of the references to str.translate added here.

jayaddison · 2020-12-29T18:31:49Z

Ah; I've now discovered the html5lib-tests.git submodule and the tests that are failing here. The requirement to retain casing in constructed tree data might rule out this approach.

gsnedders · 2020-12-30T12:31:12Z

> Ah; I've now discovered the html5lib-tests.git submodule and the tests that are failing here. The requirement to retain casing in constructed tree data might rule out this approach.

I have, on some branch (potentially local, for a bunch of reasons that are delaying me pushing anything right now), moved the lowercasing to the tokenizer from the tree constructor; there's no real reason for it living in the tree constructor (besides the tokenizer previously being shared with a lax XML parser, the legacy of which has been pretty slowly removed).

IIRC, when I looked at this (and similar changes) before, the cost of doing the lowercasing everywhere (versus once at the end) ultimately led this to be a net loss.

You could probably get much of the same benefit by reordering the if statements in rcdataEndTagNameState/rawtextEndTagNameState/scriptDataEndTagNameState: first check whether the character is an ASCII letter, and only then compute whether the end tag is appropriate, as we're then leaving the state either way.

jayaddison · 2020-12-30T13:00:39Z

> IIRC, when I looked at this (and similar changes) before, the cost of doing the lowercasing everywhere (versus once at the end) ultimately led this to be a net loss.

Thanks, that makes sense. I've had to back-out the on-the-fly lowercasing of content written into the temporaryBuffer variable because that can be emitted as character data during script escaping.
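A toy illustration of that constraint (not the actual tokenizer code: the helper name and the token dictionary shape are made up for this sketch). When a candidate end tag turns out not to be appropriate, the buffered text is flushed as character data, so lowercasing the buffer while it is being filled would change the emitted text:

```python
import string

# ASCII-only lowercasing table, in the same spirit as html5lib's asciiUpper2Lower.
ASCII_UPPER_TO_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


def flush_inappropriate_end_tag(temporary_buffer):
    """Hypothetical helper: emit '</' plus the buffered name as character data."""
    return {"type": "Characters", "data": "</" + temporary_buffer}


# Assume the buffered name does not match the open element, so it is plain text.
buffered = "TiTlE"
preserved = flush_inappropriate_end_tag(buffered)
eager = flush_inappropriate_end_tag(buffered.translate(ASCII_UPPER_TO_LOWER))

print(preserved["data"])  # </TiTlE
print(eager["data"])      # </title  (original casing lost)
```

The second result is what an eagerly lowercased buffer would produce, which is why the casing has to survive until the comparison itself.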

That said, there are still a few cases where I think lowercasing has been occurring repeatedly on the same string data -- particularly tag and attribute names -- and we can remove those.

In the current diff view for this PR, lines where a comparison is removed (without a replacement comparison being added) should correspond to those cases. I don't think it's going to make a huge impact, but some of these do appear to be redundant.

The one remaining case that I'm puzzling over is the way that temporaryBuffer.lower is called during double-escaped script processing (example). That's proving tricky to reason about, but I'm optimistic there'll be a way to clean that up too.

gsnedders · 2021-01-04T16:20:29Z

> The one remaining case that I'm puzzling over is the way that temporaryBuffer.lower is called during double-escaped script processing (example). That's proving tricky to reason about, but I'm optimistic there'll be a way to clean that up too.

What's confusing you about that case? It's built up in the scriptDataEscapedLessThanSignState and scriptDataDoubleEscapeStartState. Or is this about how it doesn't get output ever?

gsnedders · 2021-01-04T16:29:42Z

html5lib/_tokenizer.py

-        appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower()
+        name = self.temporaryBuffer.translate(asciiUpper2Lower)
+        appropriate = self.currentToken and self.currentToken["name"] == name


FWIW: my thinking from my previous comment was to instead change this state (and similar ones) to:

    if data in asciiLetters:
        ...
    elif (data in spaceCharacters or data in ("/", ">")) and self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower():
        if data in spaceCharacters:
            ...
        elif data == "/":
            ...

etc.

At the absolute least, I think we should avoid computing appropriate in the if data in asciiLetters case? Otherwise we're doing this lower-casing after each time we add a character.
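As a rough, self-contained sketch of that ordering (a toy stand-in, not the real state method: the function name, the returned action strings, and the tuple return value are all invented for illustration), the letter check comes first and the lowercased comparison only runs once the name is complete:

```python
import string

ASCII_LETTERS = frozenset(string.ascii_letters)
SPACE_CHARACTERS = frozenset("\t\n\x0c \r")


def end_tag_name_step(data, current_name, temporary_buffer):
    """Toy stand-in for one step of an *EndTagNameState handler.

    Returns (action, new_buffer). Only the branch ordering is the point.
    """
    if data in ASCII_LETTERS:
        # Cheap check first: keep buffering, no lowercasing needed yet.
        return "stay", temporary_buffer + data
    if data in SPACE_CHARACTERS or data in ("/", ">"):
        # The name is complete, so lowercase and compare exactly once.
        if current_name is not None and current_name.lower() == temporary_buffer.lower():
            if data in SPACE_CHARACTERS:
                return "before-attribute-name", temporary_buffer
            if data == "/":
                return "self-closing-start-tag", temporary_buffer
            return "emit-end-tag", temporary_buffer
    # Anything else: not an end tag after all, flush the buffer as text.
    return "emit-as-characters", temporary_buffer


buffer, action = "", "stay"
for ch in "TITLE>":
    action, buffer = end_tag_name_step(ch, "title", buffer)
print(action, buffer)  # emit-end-tag TITLE
```

With this shape, a name of n letters triggers one comparison instead of n + 1.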

jayaddison · 2021-01-04

Ok, yep - I didn't really register that after your first comment; I'll take a look at re-ordering these conditionals soon. Thanks for detailing that a bit further.

jayaddison · 2021-01-05

That's applied now; I didn't find a noticeable performance difference as a result (cpython 3.9.1), but it may be logically clearer.

jayaddison · 2021-01-05

I feel like the modifications in this PR are getting blended together slightly confusingly, so I'll do a more thorough analysis soon to pick apart the individual suggestions, analyze performance for them individually, and then keep the ones that still seem useful and show performance benefits (or seem worthwhile enough to include anyway).

gsnedders · 2021-01-04T16:31:17Z

html5lib/_tokenizer.py

@@ -448,7 +449,7 @@ def tagNameState(self):
                                    "data": "invalid-codepoint"})
            self.currentToken["name"] += "\uFFFD"
        else:
-            self.currentToken["name"] += data
+            self.currentToken["name"] += data.translate(asciiUpper2Lower)


I'm very skeptical about this being a perf win, versus it being in emitCurrentToken. What do the benchmarks say?

Yes, emitCurrentToken's lowercasing becomes redundant in the RCDATA/RAWTEXT/script cases, but I expect the cost of this will negate any gains.

jayaddison · 2021-01-04

That's fair, yep - especially for short element names it seems likely that the translate method call overhead (especially if called repeatedly) could negate any benefits provided by simpler comparisons.

I hadn't assessed the performance of this code path separately; it felt worth maintaining consistency but I don't believe there's a noticeable performance change.
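One way this code path could be measured in isolation is a small timeit comparison along these lines. It is only a sketch: the two functions are simplified stand-ins for appending to the token name in tagNameState versus lowercasing once in emitCurrentToken, and the sample tag names and iteration count are arbitrary.

```python
import string
import timeit

ASCII_UPPER_TO_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}


def lower_per_character(chars):
    """Translate every character as it is appended (the approach in this diff)."""
    name = ""
    for ch in chars:
        name += ch.translate(ASCII_UPPER_TO_LOWER)
    return name


def lower_once_at_end(chars):
    """Append raw characters, then translate the finished name a single time."""
    name = ""
    for ch in chars:
        name += ch
    return name.translate(ASCII_UPPER_TO_LOWER)


for tag in ("p", "DIV", "BlockQuote"):
    # Both strategies must agree before their timings are worth comparing.
    assert lower_per_character(tag) == lower_once_at_end(tag)
    per_char = timeit.timeit(lambda: lower_per_character(tag), number=20_000)
    at_end = timeit.timeit(lambda: lower_once_at_end(tag), number=20_000)
    print(f"{tag!r}: per-character {per_char:.4f}s, once at end {at_end:.4f}s")
```

Timings will vary by interpreter and machine, so no numbers are claimed here.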

jayaddison · 2021-01-04T17:10:44Z

> > The one remaining case that I'm puzzling over is the way that temporaryBuffer.lower is called during double-escaped script processing (example). That's proving tricky to reason about, but I'm optimistic there'll be a way to clean that up too.
>
> What's confusing you about that case? It's built up in the scriptDataEscapedLessThanSignState and scriptDataDoubleEscapeStartState. Or is this about how it doesn't get output ever?

I should have been clearer: there are two things I found a little confusing. One is that these are the only usages of the str.lower method to perform ASCII lowercasing; I imagine there are historical reasons why the character-map str.translate approach is used instead. Either way, for future cleanup and replacement work it might be wise to use the same approach for lowercasing so that it can be substituted out more easily.

I saw that you'd previously taken a go at consolidating the ASCII lowercasing logic in a branch, and I also seemed to find some small performance wins by calling str.lower everywhere (which is kind-of-understandable if it is a compiled builtin in some implementations of Python), but I was wary of behaviour changes and didn't end up collecting those stats or publishing a pull request.

The other thing I found a bit confusing was the transitions between the escape and double-escape states. I trust that they do make sense since the test cases prove the behaviour, but I found it really hard to reason about from a logical code flow and parser state point-of-view. It doesn't help that it's all to do with complex HTML escaping scenarios which can be a bit mind-bending in themselves.

gsnedders · 2021-01-05T13:30:02Z

> I should have been clearer: there are two things I found a little confusing. One is that these are the only usages of the str.lower method to perform ASCII lowercasing; I imagine there are historical reasons why the character-map str.translate approach is used instead. Either way, for future cleanup and replacement work it might be wise to use the same approach for lowercasing so that it can be substituted out more easily.

The important thing is that str.lower doesn't do ASCII lowercasing, it follows the Unicode Default Case Conversion algorithm.
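A quick illustration of the difference, using a translation table built the same way as a plain ASCII upper-to-lower map (the table name and the sample characters are chosen only for this example):

```python
import string

ASCII_UPPER_TO_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}

kelvin = "\u212a"    # KELVIN SIGN, looks like "K"
dotted_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# str.lower applies Unicode case mappings:
print(kelvin.lower() == "k")   # True: a non-ASCII character became ASCII "k"
print(len(dotted_i.lower()))   # 2: "i" followed by U+0307 COMBINING DOT ABOVE

# An ASCII-only table leaves both characters untouched:
print(kelvin.translate(ASCII_UPPER_TO_LOWER) == kelvin)      # True
print(dotted_i.translate(ASCII_UPPER_TO_LOWER) == dotted_i)  # True

# For pure-ASCII input the two approaches agree:
print("DiV".lower() == "DiV".translate(ASCII_UPPER_TO_LOWER) == "div")  # True
```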

The current approach in html5lib in principle is to use str.lower when we can guarantee we have a pure-ASCII string, on the assumption it's quicker. (Though I was looking at the implementation in CPython yesterday, and there are definitely wins to be had there!)

I had locally a C implementation of ASCII lowercasing which does nothing if the string is already lowercased, which probably wins even over str.lower currently. Unfortunately I accidentally deleted it… Not too hard to recreate (and improve upon), and I'll try to do that soon.
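The idea can be sketched in pure Python, with the caveat that this is not the C implementation mentioned above and says nothing about its speed; ascii_lower is a hypothetical helper name:

```python
import string

ASCII_UPPER_TO_LOWER = {ord(c): ord(c.lower()) for c in string.ascii_uppercase}
ASCII_UPPERCASE = frozenset(string.ascii_uppercase)


def ascii_lower(s):
    """Hypothetical helper: ASCII-lowercase s, returning s itself when nothing changes."""
    if ASCII_UPPERCASE.isdisjoint(s):
        return s  # fast path: no A-Z present, so no new string is built
    return s.translate(ASCII_UPPER_TO_LOWER)


name = "blockquote"
print(ascii_lower(name) is name)            # True: same object returned
print(ascii_lower("BlockQuote"))            # blockquote
print(ascii_lower("\u212a") == "\u212a")    # True: the Kelvin sign is left alone
```

Whether the early return pays off would depend on how often input is already lowercase.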

I'll push my Cython branch sometime this week, which includes moving all the ASCII case conversion into an _ascii module (though the extra function calls might be a loss without Cython).

> The other thing I found a bit confusing was the transitions between the escape and double-escape states. I trust that they do make sense since the test cases prove the behaviour, but I found it really hard to reason about from a logical code flow and parser state point-of-view. It doesn't help that it's all to do with complex HTML escaping scenarios which can be a bit mind-bending in themselves.

It would take me a fair while to get my head around the semantics there!

jayaddison · 2021-01-09T17:53:26Z

> The important thing is that str.lower doesn't do ASCII lowercasing, it follows the Unicode Default Case Conversion algorithm.

Ok, yep - that is an important detail, as is the resulting implication that non-ASCII Unicode characters can be transformed by lowercasing.

That said, in the HTML5 living spec (currently @ b49b9d970a5bded83b4ea019034f448fc2233e11), valid element and attribute names consist of ASCII alphanumerics - so using str.lower should handle matching of tags and attribute names correctly.

I've opened #526 to try this out, and perhaps you can catch me out by suggesting a test case if there's a situation this doesn't handle.

> I had locally a C implementation of ASCII lowercasing which does nothing if the string is already lowercased, which probably wins even over str.lower currently. Unfortunately I accidentally deleted it… Not too hard to recreate (and improve upon), and I'll try to do that soon.

That'd be neat to see, but don't hurry or feel a need to display that on my behalf at least. I'd be curious about whether that results in a probabilistic performance win (i.e. dependent on dataset).

jayaddison · 2022-12-24T01:03:15Z

Cleaning up some old / stale pull requests; please let me know if this changeset is considered worthwhile and I'll reopen if so.

jayaddison added 3 commits December 29, 2020 14:44

Consistency: consume a single character at a time during attribute name state

183d8a0

Refactor: pretranslate lowercase element and attribute names

2e86373

Restore self.currentToken safety check

8f96b17

Alternate approach: do not pretranslate temporary buffered data

a912842

Consistency: character consumption within double-escaped state

f9f370e

jayaddison mentioned this pull request Dec 30, 2020

Tokenizer: use Python objects to represent tokens #521

Closed

gsnedders reviewed Jan 4, 2021

Check ASCII character data condition before performing temporary buffer data translation

fa62671

jayaddison mentioned this pull request Jan 9, 2021

Use Python built-in str.lower in preference to asciiUpper2Lower character table translation #526

Closed

Merge branch 'master' into tokenizer/pretranslate-lowercase-names

df94e2d

jayaddison closed this Dec 24, 2022
