Skip to content

Read::chars reform #33801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

SimonSapin
Copy link
Contributor

@SimonSapin SimonSapin commented May 22, 2016

Internally this duplicates the UTF-8 decoding logic from String::from_utf8_lossy, but let’s decide on the desired behavior first and we can refactor internals later.

With this, I’d like to propose the affected methods and types for stabilization (which would complete #27802).

r? @alexcrichton

Unlike `str::chars` where UTF-8 is implied since that’s always
the encoding of `str` (whose contents is guaranteed to be well-formed),
the bytes read from `io::Read` are arbitrary.

Fixes rust-lang#33761.
@rust-highfive
Copy link
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 23, 2016
@aturon
Copy link
Member

aturon commented May 23, 2016

cc @rust-lang/libs

@troplin
Copy link
Contributor

troplin commented May 24, 2016

Why does it need its own error type?

  • InvalidUtf8 -> io::ErrorKind::InvalidData
  • IncompleteUtf8 -> io::ErrorKind::UnexpectedEof

@SimonSapin
Copy link
Contributor Author

The error type existed before, and at least for lossy decoding we want to distinguish UTF-8 decoding errors from other kinds of invalid data or unexpected EOF the underlying Read stream might return.

@SimonSapin
Copy link
Contributor Author

Added a commit: per discussion in #27802, move the methods to BufRead. Read::chars is now deprecated and returns Utf8Chars<BufReader<Self>> with a one-byte buffer to stay closer to the previous behavior.

$on_eof
}
}
Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary? Most users of BufRead are not going to check this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looping on ErrorKind::Interrupted? We do it because users are unlikely to do it themselves. Also, saving state in the middle of an UTF-8 byte sequence to return this error to the user and let them resume would be more complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how difficult it would be to get the contract for BufRead to prohibit this return. It's just not useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should it not be useful?
Without it there's no possibility to interrupt a blocking call.

I already had to reimplement a large part of std::io in my crate netio, just to be able to interrupt those calls (and not lose any data).
If you remove it also from Read::read, BufRead::fill_buf and Write::write, then this will be impossible without reimplementing everything from scratch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prohibit this return

This is not a return! On the contrary, on ErrorKind::Interrupted this code loops and reads again from the stream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant this return value from fill_buf. In my opinion, we should not be retrying errors from lower layers like this -- either they wanted to interrupt blocking operations (as with @troplin's case) or they should be doing the retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m following the pattern of other functions/methods in this module. If this is to change, it should be a decision for the entire module and so is out of the scope of this PR. Please file a separate issue or RFC.

Additionally, if Iterator::next returns Err on ErrorKind::Interrupted while in the middle of an UTF-8 byte sequence and we want to allow users to resume without losing data, we’ll need to have additional state/buffers in the Utf8Chars struct. BufRead is not enough, since at least with the BufReader implementation the only way to get "new" bytes in the buffer is to consume the previous buffer entirely.

And the situation is similar for other helpers: for example, what should read_line do if interrupted in the middle of a line?

@SimonSapin SimonSapin force-pushed the io-chars-error-handling branch from 73353f2 to a218f27 Compare May 25, 2016 20:04
@jwilm
Copy link

jwilm commented May 30, 2016

Unless I'm missing something, it appears that the Utf8Chars iterator swallows bytes in the case of a partial read. There's calls to self.inner.consume(1) during parsing; if the IncompleteUtf8 error is returned, there's no way to recover those bytes. It would be nice to only consume the bytes when a valid char is returned or in the case of InvalidUtf8.

@eefriedman
Copy link
Contributor

It would be nice to only consume the bytes when a valid char is returned

This is impossible given the current design of BufRead: you have to consume the whole buffer before requesting more data with fill_buf. (If you don't, fill_buf is a no-op).

@taralx
Copy link
Contributor

taralx commented Jun 21, 2016

Since this is related to the FCP, I will reiterate that I feel strongly that Utf8CharsError should contain the failed bytes, for debugging purposes if nothing else.

@alexcrichton
Copy link
Member

Discussed briefly during the @rust-lang/libs triage meeting yesterday, unfortunately there were none who had yet had a chance to closely review this yet. We're still considering perhaps moving this into FCP for the next cycle if we decide to merge, however.


Looking over this now, some thoughts I have are:

  • I don't personally think there's motivation to move this to BufRead because:
    • This will always be slower than just reading a ton of bytes and then using str::from_utf8
    • Partial reads and multibyte chars straddling buffer boundaries still need to be handled
    • Due to buffer boundaries, there's no way that to have commit-like semantics where bytes aren't read if we don't need them
  • I'm a little worried that if we add utf8_chars it'll be expected that something like utf16_chars will pop up somewhere. It's unclear to me where we'd add this and whether it's even appropriate.

@SimonSapin
Copy link
Contributor Author

Due to buffer boundaries, there's no way that to have commit-like semantics where bytes aren't read if we don't need them

There is a way: using a buffer so that you can "rewind" by one byte. To me this does sound like a reason to use BufRead.

I'm a little worried that if we add utf8_chars it'll be expected that something like utf16_chars will pop up somewhere. It's unclear to me where we'd add this and whether it's even appropriate.

FWIW, everything related to UTF-16 currently in std is based on u16 code units. (Unicode calls this an encoding form.) This is useful e.g. for calling Windows APIs.

To use UTF-16 in I/O, you really need to pick a byte serialization (little or big endian) of these code units, and overall use UTF-16LE or UTF-16BE. (Unicode calls this an encoding scheme.) std doesn’t support this at all currently, and I don’t think there’s a good reason to add it without adding more general support for a number of character encodings.

@bors
Copy link
Collaborator

bors commented Jul 19, 2016

☔ The latest upstream changes (presumably #33974) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Copy link
Member

I think this has basically stalled out at this point, unfortunately. Would it be possible to perhaps include this in a separate crate and deprecate in favor of that? If steam picks up externally we can certainly move back towards the standard library.

@brson
Copy link
Contributor

brson commented Nov 9, 2016

Closing because old and inactive.

@brson brson closed this Nov 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants