Description
HTML 5 Proposed Recommendation §8.2.2 The input byte stream, HTML 5.1 Draft §8.2.2 The input byte stream:
Note: Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]
Test case:
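    import io
    import unittest
    import html5lib

    class TestInvalidSequences(unittest.TestCase):
        def test_invalid_sequences(self):
            parser = html5lib.HTMLParser()
            # 0xA0 is not a valid ASCII byte; BytesIO needs a bytes literal.
            doc = parser.parse(io.BytesIO(b'<!DOCTYPE html>\xA0'), encoding='ascii')
            self.assertTrue(parser.errors)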
Expected behavior: parser.errors is not empty.
Observed behavior: parser.errors is empty; doc contains a tree which contains the \uFFFD replacement character in place of the invalid byte.
Cause: In HTMLBinaryInputStream.reset, the codec is constructed with the option 'replace'; the HTMLUnicodeInputStream only reports errors for Unicode code points which were successfully decoded but are either non-characters or surrogates.
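
The effect of the 'replace' option can be reproduced with the standard library alone. The sketch below is not html5lib code: it only shows that 'replace' discards the information a conformance checker would need, and that a custom error handler registered through codecs.register_error could in principle keep the U+FFFD substitution while still recording the error. The names decode_errors and record_and_replace are hypothetical and chosen for illustration only.

```python
import codecs

data = b'<!DOCTYPE html>\xA0'

# With 'replace', the invalid byte becomes U+FFFD and no error is surfaced.
replaced = data.decode('ascii', errors='replace')

# With 'strict', the same input raises, exposing the offset of the bad byte.
try:
    data.decode('ascii', errors='strict')
    bad_offset = None
except UnicodeDecodeError as exc:
    bad_offset = exc.start

# Hypothetical error log, standing in for something like parser.errors.
decode_errors = []

def record_and_replace(exc):
    """Record the invalid byte sequence, then substitute U+FFFD and continue."""
    if isinstance(exc, UnicodeDecodeError):
        decode_errors.append((exc.start, exc.object[exc.start:exc.end]))
        return ('\ufffd', exc.end)
    raise exc

# The handler name is an arbitrary label chosen for this sketch.
codecs.register_error('record_and_replace', record_and_replace)

recorded = data.decode('ascii', errors='record_and_replace')
```

Here recorded is the same string that 'replace' produces, but decode_errors additionally holds the offset and raw bytes of each invalid sequence, which is the kind of information the parser would need in order to report the error.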