-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Read::chars reform #33801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read::chars reform #33801
Conversation
Unlike `str::chars` where UTF-8 is implied since that’s always the encoding of `str` (whose contents is guaranteed to be well-formed), the bytes read from `io::Read` are arbitrary. Fixes rust-lang#33761.
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
cc @rust-lang/libs |
Why does it need its own error type?
|
The error type existed before, and at least for lossy decoding we want to distinguish UTF-8 decoding errors from other kinds of invalid data or unexpected EOF the underlying |
Added a commit: per discussion in #27802, move the methods to |
$on_eof | ||
} | ||
} | ||
Err(ref e) if e.kind() == ErrorKind::Interrupted => {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really necessary? Most users of BufRead are not going to check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looping on ErrorKind::Interrupted
? We do it because users are unlikely to do it themselves. Also, saving state in the middle of an UTF-8 byte sequence to return this error to the user and let them resume would be more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how difficult it would be to get the contract for BufRead to prohibit this return. It's just not useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should it not be useful?
Without it there's no possibility to interrupt a blocking call.
I already had to reimplement a large part of std::io
in my crate netio, just to be able to interrupt those calls (and not lose any data).
If you remove it also from Read::read
, BufRead::fill_buf
and Write::write
, then this will be impossible without reimplementing everything from scratch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prohibit this return
This is not a return! On the contrary, on ErrorKind::Interrupted
this code loops and reads again from the stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant this return value from fill_buf
. In my opinion, we should not be retrying errors from lower layers like this -- either they wanted to interrupt blocking operations (as with @troplin's case) or they should be doing the retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m following the pattern of other functions/methods in this module. If this is to change, it should be a decision for the entire module and so is out of the scope of this PR. Please file a separate issue or RFC.
Additionally, if Iterator::next
returns Err
on ErrorKind::Interrupted
while in the middle of an UTF-8 byte sequence and we want to allow users to resume without losing data, we’ll need to have additional state/buffers in the Utf8Chars
struct. BufRead
is not enough, since at least with the BufReader
implementation the only way to get "new" bytes in the buffer is to consume the previous buffer entirely.
And the situation is similar for other helpers: for example, what should read_line
do if interrupted in the middle of a line?
per discussion after rust-lang#27802 (comment)
73353f2
to
a218f27
Compare
Unless I'm missing something, it appears that the |
This is impossible given the current design of BufRead: you have to consume the whole buffer before requesting more data with fill_buf. (If you don't, fill_buf is a no-op). |
Since this is related to the FCP, I will reiterate that I feel strongly that Utf8CharsError should contain the failed bytes, for debugging purposes if nothing else. |
Discussed briefly during the @rust-lang/libs triage meeting yesterday, unfortunately there were none who had yet had a chance to closely review this yet. We're still considering perhaps moving this into FCP for the next cycle if we decide to merge, however. Looking over this now, some thoughts I have are:
|
There is a way: using a buffer so that you can "rewind" by one byte. To me this does sound like a reason to use
FWIW, everything related to UTF-16 currently in To use UTF-16 in I/O, you really need to pick a byte serialization (little or big endian) of these code units, and overall use UTF-16LE or UTF-16BE. (Unicode calls this an encoding scheme.) |
☔ The latest upstream changes (presumably #33974) made this pull request unmergeable. Please resolve the merge conflicts. |
I think this has basically stalled out at this point, unfortunately. Would it be possible to perhaps include this in a separate crate and deprecate in favor of that? If steam picks up externally we can certainly move back towards the standard library. |
Closing because old and inactive. |
utf8_chars
, as previously proposed and discussed in Rename std::io::Read::chars to utf8_chars. #33761.Read::utf8_chars_lossy
which replaces UTF-8 errors with U+FFFD, leaving only I/O errors.BufRead
.Read::chars
is now deprecated and returnsUtf8Chars<BufReader<Self>>
with a one-byte buffer to stay closer to the previous behavior.Internally this duplicates the UTF-8 decoding logic from
String::from_utf8_lossy
, but let’s decide on the desired behavior first and we can refactor internals later.With this, I’d like to propose the affected methods and types for stabilization (which would complete #27802).
r? @alexcrichton