-
Notifications
You must be signed in to change notification settings - Fork 294
Tokenizer: pretranslate lowercase element and attribute names #520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenizer: pretranslate lowercase element and attribute names #520
Conversation
Ah; I've now discovered the |
I have, on some branch (potentially local, for a bunch of reasons that are delaying me pushing anything right now), moved the lowercasing to the tokenizer from the tree constructor; there's real reason for it living in the tree constructor (besides the tokenizer previously being shared with a lax XML parser, the legacy of which has been pretty slowly removed). IIRC, when I looked at this (and similar changes) before, the cost of doing the lowercasing everywhere (versus once at the end) ultimately lead this to be a net loss. You could probably get much of the same benefit by reordering the if statements in rcdataEndTagNameState/rawtextEndTagOpenState/scriptDataEndTagNameState, firstly checking if the character is an ASCII letter, then computing whether it's appropriate, as we're then leaving the state either way. |
Thanks, that makes sense. I've had to back-out the on-the-fly lowercasing of content written into the That said, there are still a few cases where I think lowercasing has been occurring repeatedly on the same string data -- particularly tag and attribute names -- and we can remove those. In the current diff view for this PR, lines where a comparison is removed (without a replacement comparison being added) should correspond to those cases. I don't think it's going to make a huge impact, but some of these do appear to be redundant. The one remaining case that I'm puzzling over is the way that |
What's confusing you about that case? It's built up in the |
html5lib/_tokenizer.py
Outdated
appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower() | ||
name = self.temporaryBuffer.translate(asciiUpper2Lower) | ||
appropriate = self.currentToken and self.currentToken["name"] == name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: my thinking from my previous comment was to instead change this state (and similar ones) to:
if data in asciiLetters:
...
elif (data in spaceCharacters or data in ("/", ">")) and self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower():
if data in spaceCharacters:
...
elif data == "/":
...
etc.
At the absolute least, I think we should avoid computing appropriate
in the if data in asciiLetters
case? Otherwise we're doing this lower-casing after each time we add a character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yep - I didn't really register that after your first comment; I'll take a look at re-ordering these conditionals soon. Thanks for detailing that a bit further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's applied now; I didn't find a noticeable performance difference as a result (cpython 3.9.1), but it may be logically clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the modifications in this PR are getting blended together slightly confusingly, so I'll do a more thorough analysis soon to pick apart the individual suggestions, analyze performance for them individually, and then keep the ones that still seem useful and show performance benefits (or seem worthwhile enough to include anyway).
@@ -448,7 +449,7 @@ def tagNameState(self): | |||
"data": "invalid-codepoint"}) | |||
self.currentToken["name"] += "\uFFFD" | |||
else: | |||
self.currentToken["name"] += data | |||
self.currentToken["name"] += data.translate(asciiUpper2Lower) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very skeptical about this being a perf win, versus it being in emitCurrentToken
. What do the benchmarks say?
Yes, emitCurrentToken
's lowercasing becomes redundant in the RCDATA/RAWTEXT/script cases, but I expect the cost of this will negate any gains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, yep - especially for short element names it seems likely that the translate
method call overhead (especially if called repeatedly) could negate any benefits provided by simpler comparisons.
I hadn't assessed the performance of this code path separately; it felt worth maintaining consistency but I don't believe there's a noticeable performance change.
I should have been clearer: there are two things I found a little confusing. One is that these are the only usages of the I saw that you'd previously taken a go at consolidating the ASCII lowercasing logic in a branch, and I also seemed to find some small performance wins by calling The other thing I found a bit confusing was the transitions between the escape and double-escape states. I trust that they do make sense since the test cases prove the behaviour, but I found it really hard to reason about from a logical code flow and parser state point-of-view. It doesn't help that it's all to do with complex HTML escaping scenarios which can a bit mind-bending in themselves. |
The important thing is that The current approach in html5lib in principle is to use I had locally a C implementation of ASCII lowercasing which does nothing if the string is already lowercased, which probably wins even over I'll push my Cython branch sometime this week, which includes replacing all the ASCII case conversion to an
It would take me a fair while to get me head around the semantics there! |
…er data translation
Ok, yep - that is an important detail, as is the resulting implication that non-ASCII unicode characters can be transformed by lowercasing. That said, in the HTML5 living spec (currently @ b49b9d970a5bded83b4ea019034f448fc2233e11), valid element and attribute names consist of ASCII alphanumerics - so using I've opened #526 to try this out, and perhaps you can catch me out by suggesting a test case if there's a situation this doesn't handle.
That'd be neat to see, but don't hurry or feel a need to display that on my behalf at least. I'd be curious about whether that results in a probabilistic performance win (i.e. dependent on dataset). |
Cleaning up some old / stale pull requests; please let me know if this changeset is considered worthwhile and I'll reopen if so. |
During tokenization, some element names, attributes, and temporary buffered strings are compared in a case-insensitive mode.
To avoid repeat string transformation operations, this change performs lowercasing on those strings at construction-time.
This change builds upon and includes #519 and prepares for further refactoring work aiming towards resolving #24. Those further changes should de-duplicate a number of the references to
str.translate
added here.