Warn about invalid byte sequences #167

Open
@yurikhan

HTML 5 Proposed Recommendation §8.2.2 The input byte stream, HTML 5.1 Draft §8.2.2 The input byte stream:

Note: Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]

Test case:

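import io
import unittest
import html5lib

class TestInvalidSequences(unittest.TestCase):
    def test_invalid_sequences(self):
        parser = html5lib.HTMLParser()
        # 0xA0 is not a valid ASCII byte, so decoding it should be reported.
        doc = parser.parse(io.BytesIO(b'<!DOCTYPE html>\xA0'), encoding='ascii')
        self.assertTrue(parser.errors)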

Expected behavior: parser.errors is not empty

Observed behavior: parser.errors is empty; doc contains a tree which contains the \uFFFD replacement character in place of the invalid byte.

Cause: In HTMLBinaryInputStream.reset, the codec is constructed with the error handler 'replace', so invalid bytes are silently turned into \uFFFD during decoding. The HTMLUnicodeInputStream only reports errors for Unicode code points which were successfully decoded but are either non-characters or surrogates.
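
One possible direction, shown here only as a sketch using the standard library rather than html5lib's actual code: register a custom codec error handler that behaves like 'replace' but also records where each invalid byte run occurred. The names decode_reporting_errors and "record-and-replace" are hypothetical and exist only for this illustration.

```python
import codecs


def decode_reporting_errors(data, encoding="utf-8"):
    """Decode like errors='replace', but also record each invalid byte run.

    Hypothetical helper for illustration; not part of html5lib.
    """
    errors = []

    def record_and_replace(exc):
        # Only decoding errors are handled here; anything else is re-raised.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        # Remember the offset and the offending bytes for later reporting.
        errors.append((exc.start, exc.object[exc.start:exc.end]))
        # Emit U+FFFD and resume after the bad bytes, as 'replace' would.
        return ("\ufffd", exc.end)

    # Re-registering under the same name overwrites the previous handler.
    codecs.register_error("record-and-replace", record_and_replace)
    text = data.decode(encoding, "record-and-replace")
    return text, errors


text, errors = decode_reporting_errors(b"<!DOCTYPE html>\xa0", "ascii")
```

With the byte string from the test case, text matches what the plain 'replace' handler produces, while errors holds one entry for the 0xA0 byte at offset 15. A parser could turn such entries into parse errors. Wiring this into a streaming reader would need more care than this one-shot decode suggests.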
