Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
Each of these triggers the exception below:
html5lib.parse('<p>', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p id="">', treebuilder='lxml')
html5lib.parse('<p id="\x01">', treebuilder='lxml')
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:
DataLossWarning: Text cannot contain U+000C
libxml2’s HTML parser removes these characters entirely, which I slightly prefer. Either way, every character that lxml doesn’t like should probably be handled this way (replaced or removed, with a warning) rather than raising an exception.
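To make the suggestion concrete, here is a minimal sketch of the kind of filtering I have in mind, applied to a string before it reaches lxml. The helper name `strip_c0_controls` and its `replacement` parameter are hypothetical; this is not html5lib's existing code, only an illustration using the standard `re` module.

```python
import re

# The code points listed in this report: U+0001-U+0008, U+000B, U+000C,
# U+000E-U+001F. Tab, LF and CR are deliberately left alone.
_C0_CONTROLS = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def strip_c0_controls(text, replacement=""):
    """Hypothetical helper: remove (or replace) the control characters above.

    replacement="" mimics what libxml2's HTML parser does (drop them);
    replacement=" " mimics the current U+000C handling in text.
    """
    return _C0_CONTROLS.sub(replacement, text)


print(repr(strip_c0_controls("a\x01b\tc\x0cd")))            # 'ab\tcd'
print(repr(strip_c0_controls("a\x0cb", replacement=" ")))   # 'a b'
```

A real fix would presumably also emit a DataLossWarning for each affected string, as already happens for U+000C in text.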