Skip to content

Tokenizer: pretranslate lowercase element and attribute names #520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

Tokenizer: pretranslate lowercase element and attribute names #520

wants to merge 7 commits into from

Conversation

jayaddison
Copy link
Contributor

During tokenization, some element names, attributes, and temporary buffered strings are compared in a case-insensitive mode.

To avoid repeat string transformation operations, this change performs lowercasing on those strings at construction-time.

This change builds upon and includes #519 and prepares for further refactoring work aiming towards resolving #24. Those further changes should de-duplicate a number of the references to str.translate added here.

@jayaddison
Copy link
Contributor Author

Ah; I've now discovered the html5lib-tests.git submodule and the tests that are failing here. The requirement to retain casing in constructed tree data might rule out this approach.

@gsnedders
Copy link
Member

Ah; I've now discovered the html5lib-tests.git submodule and the tests that are failing here. The requirement to retain casing in constructed tree data might rule out this approach.

I have, on some branch (potentially local, for a bunch of reasons that are delaying me pushing anything right now), moved the lowercasing to the tokenizer from the tree constructor; there's real reason for it living in the tree constructor (besides the tokenizer previously being shared with a lax XML parser, the legacy of which has been pretty slowly removed).

IIRC, when I looked at this (and similar changes) before, the cost of doing the lowercasing everywhere (versus once at the end) ultimately lead this to be a net loss.

You could probably get much of the same benefit by reordering the if statements in rcdataEndTagNameState/rawtextEndTagOpenState/scriptDataEndTagNameState, firstly checking if the character is an ASCII letter, then computing whether it's appropriate, as we're then leaving the state either way.

@jayaddison
Copy link
Contributor Author

IIRC, when I looked at this (and similar changes) before, the cost of doing the lowercasing everywhere (versus once at the end) ultimately lead this to be a net loss.

Thanks, that makes sense. I've had to back-out the on-the-fly lowercasing of content written into the temporaryBuffer variable because that can be emitted as character data during script escaping.

That said, there are still a few cases where I think lowercasing has been occurring repeatedly on the same string data -- particularly tag and attribute names -- and we can remove those.

In the current diff view for this PR, lines where a comparison is removed (without a replacement comparison being added) should correspond to those cases. I don't think it's going to make a huge impact, but some of these do appear to be redundant.

The one remaining case that I'm puzzling over is the way that temporaryBuffer.lower is called during double-escaped script processing (example). That's proving tricky to reason about, but I'm optimistic there'll be a way to clean that up too.

@gsnedders
Copy link
Member

The one remaining case that I'm puzzling over is the way that temporaryBuffer.lower is called during double-escaped script processing (example). That's proving tricky to reason about, but I'm optimistic there'll be a way to clean that up too.

What's confusing you about that case? It's built up in the scriptDataEscapedLessThanSignState and scriptDataDoubleEscapeStartState. Or is this about how it doesn't get output ever?

Comment on lines 479 to 481
appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower()
name = self.temporaryBuffer.translate(asciiUpper2Lower)
appropriate = self.currentToken and self.currentToken["name"] == name
Copy link
Member

@gsnedders gsnedders Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: my thinking from my previous comment was to instead change this state (and similar ones) to:

if data in asciiLetters:
    ...
elif (data in spaceCharacters or data in ("/", ">")) and self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower():
    if data in spaceCharacters:
        ...
    elif data == "/":
        ...

etc.

At the absolute least, I think we should avoid computing appropriate in the if data in asciiLetters case? Otherwise we're doing this lower-casing after each time we add a character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yep - I didn't really register that after your first comment; I'll take a look at re-ordering these conditionals soon. Thanks for detailing that a bit further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's applied now; I didn't find a noticeable performance difference as a result (cpython 3.9.1), but it may be logically clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the modifications in this PR are getting blended together slightly confusingly, so I'll do a more thorough analysis soon to pick apart the individual suggestions, analyze performance for them individually, and then keep the ones that still seem useful and show performance benefits (or seem worthwhile enough to include anyway).

@@ -448,7 +449,7 @@ def tagNameState(self):
"data": "invalid-codepoint"})
self.currentToken["name"] += "\uFFFD"
else:
self.currentToken["name"] += data
self.currentToken["name"] += data.translate(asciiUpper2Lower)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very skeptical about this being a perf win, versus it being in emitCurrentToken. What do the benchmarks say?

Yes, emitCurrentToken's lowercasing becomes redundant in the RCDATA/RAWTEXT/script cases, but I expect the cost of this will negate any gains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, yep - especially for short element names it seems likely that the translate method call overhead (especially if called repeatedly) could negate any benefits provided by simpler comparisons.

I hadn't assessed the performance of this code path separately; it felt worth maintaining consistency but I don't believe there's a noticeable performance change.

@jayaddison
Copy link
Contributor Author

The one remaining case that I'm puzzling over is the way that temporaryBuffer.lower is called during double-escaped script processing (example). That's proving tricky to reason about, but I'm optimistic there'll be a way to clean that up too.

What's confusing you about that case? It's built up in the scriptDataEscapedLessThanSignState and scriptDataDoubleEscapeStartState. Or is this about how it doesn't get output ever?

I should have been clearer: there are two things I found a little confusing. One is that these are the only usages of the str.lower method to perform ASCII lowercasing; I imagine there are historical reasons why the character-map str.translate approach is used instead. Either way for future cleanup and replacement work it might be wise to use the same approach for lowercasing so that it can be substituted out more easily.

I saw that you'd previously taken a go at consolidating the ASCII lowercasing logic in a branch, and I also seemed to find some small performance wins by calling str.lower everywhere (which is kind-of-understandable if it is a compiled builtin in some implementations of Python), but I was wary of behaviour changes and didn't end up collecting those stats or publishing a pull request.

The other thing I found a bit confusing was the transitions between the escape and double-escape states. I trust that they do make sense since the test cases prove the behaviour, but I found it really hard to reason about from a logical code flow and parser state point-of-view. It doesn't help that it's all to do with complex HTML escaping scenarios which can a bit mind-bending in themselves.

@gsnedders
Copy link
Member

I should have been clearer: there are two things I found a little confusing. One is that these are the only usages of the str.lower method to perform ASCII lowercasing; I imagine there are historical reasons why the character-map str.translate approach is used instead. Either way for future cleanup and replacement work it might be wise to use the same approach for lowercasing so that it can be substituted out more easily.

The important thing is that str.lower doesn't do ASCII lowercasing, it follows the Unicode Default Case Conversion algorithm.

The current approach in html5lib in principle is to use str.lower when we can guarantee we have a pure-ASCII string, on the assumption it's quicker. (Though I was looking at the implementation in CPython yesterday, and there's definitely wins to be had there!)

I had locally a C implementation of ASCII lowercasing which does nothing if the string is already lowercased, which probably wins even over str.lower currently. Unfortunately I accidentally deleted it… Not too hard to recreate (and improve upon), and I'll try do that soon.

I'll push my Cython branch sometime this week, which includes replacing all the ASCII case conversion to an _ascii module (though the extra function calls might be a loss without Cython).

The other thing I found a bit confusing was the transitions between the escape and double-escape states. I trust that they do make sense since the test cases prove the behaviour, but I found it really hard to reason about from a logical code flow and parser state point-of-view. It doesn't help that it's all to do with complex HTML escaping scenarios which can a bit mind-bending in themselves.

It would take me a fair while to get me head around the semantics there!

@jayaddison
Copy link
Contributor Author

The important thing is that str.lower doesn't do ASCII lowercasing, it follows the Unicode Default Case Conversion algorithm.

Ok, yep - that is an important detail, as is the resulting implication that non-ASCII unicode characters can be transformed by lowercasing.

That said, in the HTML5 living spec (currently @ b49b9d970a5bded83b4ea019034f448fc2233e11), valid element and attribute names consist of ASCII alphanumerics - so using str.lower should handle matching of tags and attribute names correctly.

I've opened #526 to try this out, and perhaps you can catch me out by suggesting a test case if there's a situation this doesn't handle.

I had locally a C implementation of ASCII lowercasing which does nothing if the string is already lowercased, which probably wins even over str.lower currently. Unfortunately I accidentally deleted it… Not too hard to recreate (and improve upon), and I'll try do that soon.

That'd be neat to see, but don't hurry or feel a need to display that on my behalf at least. I'd be curious about whether that results in a probabilistic performance win (i.e. dependent on dataset).

@jayaddison
Copy link
Contributor Author

Cleaning up some old / stale pull requests; please let me know if this changeset is considered worthwhile and I'll reopen if so.

@jayaddison jayaddison closed this Dec 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants