Condition for handling malformed UTF-8; also an interface to iconv #4837

Closed
@lifthrasiir

Currently even this simple cat program:

use io::ReaderUtil;
fn main() {
    for io::stdin().each_line |line| { io::println(line); }
}

...fails on broken or invalid UTF-8 input (or on input in some other character encoding, as this example illustrates):

$ echo 깨진 글자 | iconv -f utf-8 -t cp949 | ./test
rust: task failed at 'Assertion is_utf8(vv) failed', [...]/rust/src/libcore/str.rs:50
rust: domain main @0x7fcf32815e10 root task failed

...because the byte sequence is assumed to be UTF-8 (which it is not). There is currently no standard way to repair broken UTF-8 strings by replacing the offending substrings with some other valid UTF-8, so this kind of bug is hard to fix.
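For illustration, the kind of repair described above (substituting something valid for each offending byte sequence) can be sketched in present-day Rust. This is only a sketch under assumptions: it uses `String::from_utf8_lossy` from the current standard library, which is not the API that existed when this issue was filed, and `decode_lossy` is a hypothetical wrapper name.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode `bytes` as UTF-8, replacing every invalid
/// sequence with U+FFFD (the Unicode replacement character).
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    // Valid input is borrowed as-is; invalid input is copied and repaired.
    String::from_utf8_lossy(bytes)
}

fn main() {
    // 0xFF can never appear in well-formed UTF-8.
    let broken: &[u8] = b"ab\xFFcd";
    let repaired = decode_lossy(broken);
    println!("{}", repaired);
}
```

Returning a `Cow` means well-formed input costs no allocation, so the repair step can sit on the common path without penalizing valid data.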

This issue is ultimately linked to general character encoding handling (a libiconv binding, perhaps?) and to a strict distinction between byte sequences and Unicode (UTF-8) strings. I found Python's approach reasonable (bytes and str are separate types, converted to each other via the encode and decode methods; a normal file open reads bytes; codecs.open with an encoding converts them to str), but I'm really not sure what the actual interface should look like.
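As a rough sketch of what a bytes/string separation could look like for the cat program, here is a version in present-day Rust that reads raw bytes and makes the decode and encode steps explicit. It is not a proposed interface: `cat_lossy` is a hypothetical name, only UTF-8 is handled (no iconv binding is involved), and the standard library calls are the current ones rather than those available when the issue was written.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical helper: copy `input` to `output` line by line, treating
/// the input as raw bytes and decoding each line explicitly.
fn cat_lossy<R: BufRead, W: Write>(mut input: R, mut output: W) -> io::Result<()> {
    let mut line = Vec::new();
    loop {
        line.clear();
        // read_until works on bytes, so invalid UTF-8 is not an error here.
        if input.read_until(b'\n', &mut line)? == 0 {
            return Ok(()); // end of input
        }
        // Decode step: bytes -> text, with U+FFFD for invalid sequences.
        let text = String::from_utf8_lossy(&line);
        // Encode step: text -> bytes (always valid UTF-8 on the way out).
        output.write_all(text.as_bytes())?;
    }
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    cat_lossy(stdin.lock(), stdout.lock())
}
```

A caller that prefers failure over replacement could use `std::str::from_utf8` at the decode step instead, which returns a `Result` rather than substituting characters.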

Metadata

    Labels

    A-Unicode (Area: Unicode)
    C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
    E-easy (Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.)
