Read::chars reform #33801

SimonSapin · 2016-05-22T23:09:17Z

Don’t swallow valid code point on UTF-8 decoding errors, as discussed in Tracking issue for Read::chars #27802 (comment)
Rename/split error variants
Rename the method to utf8_chars, as previously proposed and discussed in Rename std::io::Read::chars to utf8_chars. #33761.
Add Read::utf8_chars_lossy which replaces UTF-8 errors with U+FFFD, leaving only I/O errors.
Edit 2016-05-25: per discussion in Tracking issue for Read::chars #27802, move the methods to BufRead. Read::chars is now deprecated and returns Utf8Chars<BufReader<Self>> with a one-byte buffer to stay closer to the previous behavior.

Internally this duplicates the UTF-8 decoding logic from String::from_utf8_lossy, but let’s decide on the desired behavior first and we can refactor internals later.

With this, I’d like to propose the affected methods and types for stabilization (which would complete #27802).

r? @alexcrichton

…rrors.

Unlike `str::chars` where UTF-8 is implied since that’s always the encoding of `str` (whose contents is guaranteed to be well-formed), the bytes read from `io::Read` are arbitrary. Fixes rust-lang#33761.

rust-highfive · 2016-05-22T23:09:30Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

aturon · 2016-05-23T21:19:59Z

cc @rust-lang/libs

troplin · 2016-05-24T20:52:47Z

Why does it need its own error type?

InvalidUtf8 -> io::ErrorKind::InvalidData
IncompleteUtf8 -> io::ErrorKind::UnexpectedEof

SimonSapin · 2016-05-24T21:27:30Z

The error type existed before, and at least for lossy decoding we want to distinguish UTF-8 decoding errors from other kinds of invalid data or unexpected EOF the underlying Read stream might return.

SimonSapin · 2016-05-25T08:30:40Z

Added a commit: per discussion in #27802, move the methods to BufRead. Read::chars is now deprecated and returns Utf8Chars<BufReader<Self>> with a one-byte buffer to stay closer to the previous behavior.

taralx · 2016-05-25T15:22:12Z

src/libstd/io/mod.rs

+                                    $on_eof
+                                }
+                            }
+                            Err(ref e) if e.kind() == ErrorKind::Interrupted => {}


Is this really necessary? Most users of BufRead are not going to check this.

Looping on ErrorKind::Interrupted? We do it because users are unlikely to do it themselves. Also, saving state in the middle of an UTF-8 byte sequence to return this error to the user and let them resume would be more complicated.

I wonder how difficult it would be to get the contract for BufRead to prohibit this return. It's just not useful.

Why should it not be useful?
Without it there's no possibility to interrupt a blocking call.

I already had to reimplement a large part of std::io in my crate netio, just to be able to interrupt those calls (and not lose any data).
If you remove it also from Read::read, BufRead::fill_buf and Write::write, then this will be impossible without reimplementing everything from scratch!

prohibit this return

This is not a return! On the contrary, on ErrorKind::Interrupted this code loops and reads again from the stream.

Sorry, I meant this return value from fill_buf. In my opinion, we should not be retrying errors from lower layers like this -- either they wanted to interrupt blocking operations (as with @troplin's case) or they should be doing the retry.

I’m following the pattern of other functions/methods in this module. If this is to change, it should be a decision for the entire module and so is out of the scope of this PR. Please file a separate issue or RFC.

Additionally, if Iterator::next returns Err on ErrorKind::Interrupted while in the middle of an UTF-8 byte sequence and we want to allow users to resume without losing data, we’ll need to have additional state/buffers in the Utf8Chars struct. BufRead is not enough, since at least with the BufReader implementation the only way to get "new" bytes in the buffer is to consume the previous buffer entirely.

And the situation is similar for other helpers: for example, what should read_line do if interrupted in the middle of a line?

per discussion after rust-lang#27802 (comment)

jwilm · 2016-05-30T19:16:15Z

Unless I'm missing something, it appears that the Utf8Chars iterator swallows bytes in the case of a partial read. There's calls to self.inner.consume(1) during parsing; if the IncompleteUtf8 error is returned, there's no way to recover those bytes. It would be nice to only consume the bytes when a valid char is returned or in the case of InvalidUtf8.

eefriedman · 2016-05-30T19:37:50Z

It would be nice to only consume the bytes when a valid char is returned

This is impossible given the current design of BufRead: you have to consume the whole buffer before requesting more data with fill_buf. (If you don't, fill_buf is a no-op).

taralx · 2016-06-21T00:44:16Z

Since this is related to the FCP, I will reiterate that I feel strongly that Utf8CharsError should contain the failed bytes, for debugging purposes if nothing else.

alexcrichton · 2016-06-28T17:25:33Z

Discussed briefly during the @rust-lang/libs triage meeting yesterday, unfortunately there were none who had yet had a chance to closely review this yet. We're still considering perhaps moving this into FCP for the next cycle if we decide to merge, however.

Looking over this now, some thoughts I have are:

I don't personally think there's motivation to move this to BufRead because:
- This will always be slower than just reading a ton of bytes and then using str::from_utf8
- Partial reads and multibyte chars straddling buffer boundaries still need to be handled
- Due to buffer boundaries, there's no way that to have commit-like semantics where bytes aren't read if we don't need them
I'm a little worried that if we add utf8_chars it'll be expected that something like utf16_chars will pop up somewhere. It's unclear to me where we'd add this and whether it's even appropriate.

SimonSapin · 2016-06-28T18:32:27Z

Due to buffer boundaries, there's no way that to have commit-like semantics where bytes aren't read if we don't need them

There is a way: using a buffer so that you can "rewind" by one byte. To me this does sound like a reason to use BufRead.

I'm a little worried that if we add utf8_chars it'll be expected that something like utf16_chars will pop up somewhere. It's unclear to me where we'd add this and whether it's even appropriate.

FWIW, everything related to UTF-16 currently in std is based on u16 code units. (Unicode calls this an encoding form.) This is useful e.g. for calling Windows APIs.

To use UTF-16 in I/O, you really need to pick a byte serialization (little or big endian) of these code units, and overall use UTF-16LE or UTF-16BE. (Unicode calls this an encoding scheme.) std doesn’t support this at all currently, and I don’t think there’s a good reason to add it without adding more general support for a number of character encodings.

bors · 2016-07-19T12:12:57Z

☔ The latest upstream changes (presumably #33974) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2016-10-31T20:28:53Z

I think this has basically stalled out at this point, unfortunately. Would it be possible to perhaps include this in a separate crate and deprecate in favor of that? If steam picks up externally we can certainly move back towards the standard library.

brson · 2016-11-09T01:23:28Z

Closing because old and inactive.

SimonSapin added 5 commits May 23, 2016 00:53

Make Read::chars not swallow valid chars and emit correct number of e…

7ab3450

…rrors.

Rename CharsError::Other to CharsError::Io

fd31eb6

Split CharsError::NotUtf8 into InvalidUtf8 and IncompleteUtf8 variants.

bffa87b

Rename std::io::Read::chars to utf8_chars.

fae703e

Unlike `str::chars` where UTF-8 is implied since that’s always the encoding of `str` (whose contents is guaranteed to be well-formed), the bytes read from `io::Read` are arbitrary. Fixes rust-lang#33761.

Add Read::utf8_chars_lossy

f81aee4

rust-highfive assigned alexcrichton May 22, 2016

SimonSapin mentioned this pull request May 22, 2016

Tracking issue for Read::chars #27802

Closed

alexcrichton mentioned this pull request May 23, 2016

Rename std::io::Read::chars to utf8_chars. #33761

Closed

alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 23, 2016

taralx reviewed May 25, 2016
View reviewed changes

Move Read::utf8_chars and utf8_chars_lossy to BufRead

a218f27

per discussion after rust-lang#27802 (comment)

SimonSapin force-pushed the io-chars-error-handling branch from 73353f2 to a218f27 Compare May 25, 2016 20:04

alexcrichton mentioned this pull request Jun 10, 2016

Retry on EINTR in Bytes and Chars. #33974

Merged

brson closed this Nov 9, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read::chars reform #33801

Read::chars reform #33801

SimonSapin commented May 22, 2016 •

edited

Loading

rust-highfive commented May 22, 2016

aturon commented May 23, 2016

troplin commented May 24, 2016

SimonSapin commented May 24, 2016

SimonSapin commented May 25, 2016

taralx May 25, 2016

SimonSapin May 25, 2016

taralx May 25, 2016

troplin May 25, 2016

SimonSapin May 25, 2016

taralx May 25, 2016

SimonSapin May 25, 2016

jwilm commented May 30, 2016 •

edited

Loading

eefriedman commented May 30, 2016

taralx commented Jun 21, 2016

alexcrichton commented Jun 28, 2016

SimonSapin commented Jun 28, 2016

bors commented Jul 19, 2016

alexcrichton commented Oct 31, 2016

brson commented Nov 9, 2016

Read::chars reform #33801

Read::chars reform #33801

Conversation

SimonSapin commented May 22, 2016 • edited Loading

rust-highfive commented May 22, 2016

aturon commented May 23, 2016

troplin commented May 24, 2016

SimonSapin commented May 24, 2016

SimonSapin commented May 25, 2016

taralx May 25, 2016

Choose a reason for hiding this comment

SimonSapin May 25, 2016

Choose a reason for hiding this comment

taralx May 25, 2016

Choose a reason for hiding this comment

troplin May 25, 2016

Choose a reason for hiding this comment

SimonSapin May 25, 2016

Choose a reason for hiding this comment

taralx May 25, 2016

Choose a reason for hiding this comment

SimonSapin May 25, 2016

Choose a reason for hiding this comment

jwilm commented May 30, 2016 • edited Loading

eefriedman commented May 30, 2016

taralx commented Jun 21, 2016

alexcrichton commented Jun 28, 2016

SimonSapin commented Jun 28, 2016

bors commented Jul 19, 2016

alexcrichton commented Oct 31, 2016

brson commented Nov 9, 2016

SimonSapin commented May 22, 2016 •

edited

Loading

jwilm commented May 30, 2016 •

edited

Loading