The etree walker with implementation lxml.etree doesn't work when passed a full HTML document (an object of type lxml.etree._ElementTree).
To reproduce:
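import html5lib
import lxml.etree
from html5lib.serializer import HTMLSerializer

def serialize(element, treebuilder, implementation=None):
    # Look up the walker class for the given tree type and walk the parsed tree.
    walker_cls = html5lib.getTreeWalker(treebuilder, implementation=implementation)
    walker = walker_cls(element)
    serializer = HTMLSerializer(omit_optional_tags=False)
    html = serializer.render(walker)
    print(html)
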
html = """<!DOCTYPE html>
<html>
<head>
<title>foo</title>
</head>
<body>
<p>a</p><p>b</p>
</body>
</html>
"""
builder = html5lib.getTreeBuilder('lxml')
parser = html5lib.HTMLParser(builder, namespaceHTMLElements=False)
element = parser.parse(html)
serialize(element, 'lxml')
serialize(element, 'etree', implementation=lxml.etree)
The last line fails with the following error:
Traceback (most recent call last):
  File "test-html5lib.py", line 98, in <module>
    parse_and_serialize(element, 'etree', implementation=lxml.etree)
  File "test-html5lib.py", line 79, in serialize
    html = serializer.render(walker)
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 323, in render
    return "".join(list(self.serialize(treewalker)))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 209, in serialize
    for token in treewalker:
  File "/.../python3.6/site-packages/html5lib/treewalkers/base.py", line 128, in __iter__
    firstChild = self.getFirstChild(currentNode)
  File "/.../python3.6/site-packages/html5lib/treewalkers/etree.py", line 88, in getFirstChild
    if element.text:
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text'
The walker should probably call root = element.getroot() first, since an _ElementTree wraps the root element and has no text attribute of its own. This seems to be along the same lines as the issue with treewalkers/etree.py that I described in this comment: #338 (comment)
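
As a possible caller-side workaround until the walker handles this itself, the tree wrapper can be unwrapped before it is handed to the walker. The sketch below is only an illustration: unwrap_root is a hypothetical helper (not part of html5lib or lxml), and the standard library's xml.etree.ElementTree stands in for lxml.etree so the snippet runs without extra dependencies. Both libraries expose getroot() on their tree wrapper objects.

```python
import xml.etree.ElementTree as ET


def unwrap_root(node):
    """Hypothetical helper: return the root element if `node` is a tree
    wrapper exposing getroot(), otherwise return `node` unchanged."""
    getroot = getattr(node, "getroot", None)
    return getroot() if callable(getroot) else node


# Stand-in for the parsed document; with lxml this would be the
# lxml.etree._ElementTree returned by parser.parse(html).
tree = ET.ElementTree(ET.fromstring("<html><body><p>a</p><p>b</p></body></html>"))

root = unwrap_root(tree)   # tree wrapper -> root element
same = unwrap_root(root)   # plain element -> returned unchanged
print(root.tag)
```

In the reproduction script this would presumably be used as serialize(unwrap_root(element), 'etree', implementation=lxml.etree). Note that passing only the root element likely means the doctype is not emitted, and this has not been checked against the related problem mentioned in #338.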