Description
Currently, even this simple cat program:
use io::ReaderUtil;

fn main() {
    for io::stdin().each_line |line| { io::println(line); }
}
...fails on broken or invalid UTF-8 input (for example, bytes in another character encoding, as this run illustrates):
$ echo 깨진 글자 | iconv -f utf-8 -t cp949 | ./test
rust: task failed at 'Assertion is_utf8(vv) failed', [...]/rust/src/libcore/str.rs:50
rust: domain main @0x7fcf32815e10 root task failed
...because the byte sequence is assumed to be UTF-8, which it is not. But there is currently no standard way to repair a broken UTF-8 string by replacing the offending bytes with some other valid UTF-8, so this kind of bug is hard to fix.
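As a sketch of the kind of repair being asked for, each invalid byte sequence could be replaced with U+FFFD (the Unicode replacement character). In today's Rust this is what `String::from_utf8_lossy` does; that function is an assumption here, since it did not exist in the standard library at the time of this report:

```rust
fn main() {
    // "hi" followed by 0xFF (never valid in UTF-8) and "!".
    let bytes: &[u8] = &[0x68, 0x69, 0xFF, 0x21];
    // Lossy conversion keeps valid sequences and substitutes
    // U+FFFD for each invalid one, so it never fails.
    let fixed = String::from_utf8_lossy(bytes);
    assert_eq!(fixed, "hi\u{FFFD}!");
    println!("{}", fixed);
}
```

With such a primitive available, the cat program above could degrade gracefully on non-UTF-8 input instead of asserting.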
This issue is ultimately linked to general character-encoding handling (a libiconv binding, perhaps?) and a strict distinction between byte sequences and Unicode (UTF-8) strings. I find Python's approach reasonable (bytes and str are separate types, converted to each other via the encode and decode methods; the normal file open reads bytes, while codecs.open with an encoding converts them to str), but I'm really not sure about the actual interface.
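The bytes/str separation described above can be sketched in Rust terms, using today's `std::str::from_utf8` (a validating "decode") and `as_bytes` (an "encode" back to bytes); these APIs are assumptions of a modern toolchain, not part of the Rust this report was filed against:

```rust
use std::str;

fn main() {
    // The UTF-8 encoding of the character '깨' (U+AE68).
    let bytes: Vec<u8> = vec![0xEA, 0xB9, 0xA8];
    // "decode": bytes -> str, with explicit validation that can fail.
    let s: &str = str::from_utf8(&bytes).expect("input was valid UTF-8");
    assert_eq!(s, "깨");
    // "encode": str -> bytes, which is always safe.
    assert_eq!(s.as_bytes(), &bytes[..]);
}
```

The key design point is that the failure is surfaced as an explicit, recoverable result at the decode boundary, rather than as an assertion deep inside the string library.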