stdio: Handle unicode boundaries better on Windows

The stdio support on Windows has two separate modes. One is when stdout is not a console, which implies byte orientation; the other is when stdout is a console, which implies [u16] orientation. In the stdout-is-a-console case we require all input and output to be valid Unicode so we can translate between [u8] and [u16].
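As a rough illustration of the console-mode translation described above, here is a minimal sketch. The helper name `to_console_units` is hypothetical and is not the actual function in the standard library; it only shows why invalid UTF-8 is currently rejected as a whole before anything reaches the console.

```rust
/// Hypothetical helper (not the real std implementation): translate bytes
/// destined for a console into UTF-16 code units. The whole input must be
/// valid UTF-8, otherwise nothing is translated.
fn to_console_units(bytes: &[u8]) -> Result<Vec<u16>, std::str::Utf8Error> {
    // Fails if any part of `bytes` is not valid UTF-8.
    let text = std::str::from_utf8(bytes)?;
    // Each `char` becomes one or two UTF-16 code units.
    Ok(text.encode_utf16().collect())
}
```

Note that byte counts and unit counts differ: "é" is 2 bytes in UTF-8 but 1 unit in UTF-16, while "😀" is 4 bytes and 2 units (a surrogate pair).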

In this console case, however, our translation may not be quite right for a number of scenarios:
- When reading data, the console will give us a block of `u16`s which we need to translate to UTF-8. In theory the console can give us half of a surrogate pair (with the other half available on the next call to `read`), but we do not currently handle this case well. This should be fixed by simply adding a "buffer of size 1" to hold half of a surrogate pair if necessary.
- When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`.
- When writing data, we require that the entire block of input is valid UTF-8. We should instead take a similar approach to the one above, where we try to interpret as much of the input as possible as UTF-8 and simply ignore the rest for that one call to `write` (returning an error if the first byte is invalid UTF-8).
- When writing data, we don't handle the case where a multibyte character straddles calls to `write`. Like the reading case, this could be alleviated with a 4-byte buffer for unwritten-but-valid-utf8 characters.
- When writing, we translate a block of valid UTF-8 to a block of UTF-16. Upon writing this UTF-16, however, not all code units may be written. There are two problems here:
  - We need to translate the number of UTF-16 code units written back into the number of UTF-8 bytes written.
  - If half of a surrogate pair is written, the other half should be "buffered" and the length of UTF-8 written should include the half-written character (as the extra data is buffered).
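The buffering and length-translation ideas in the list above could be sketched roughly as follows. All names here (`Utf16Reader`, `push`, `valid_utf8_prefix`, `utf8_bytes_written`) are hypothetical and only for illustration; this is not the standard library's implementation, and it assumes nothing about what the Windows console APIs guarantee. For brevity the read side reports an error for the whole chunk rather than returning the valid prefix first.

```rust
use std::char::decode_utf16;

/// Hypothetical read-side state kept between console reads.
#[derive(Default)]
struct Utf16Reader {
    /// The "buffer of size 1": a high surrogate left over from the last chunk.
    pending_high: Option<u16>,
}

impl Utf16Reader {
    /// Decode one chunk of UTF-16 units, holding back a trailing high
    /// surrogate until its partner arrives. On failure, returns the first
    /// unpaired surrogate found.
    fn push(&mut self, chunk: &[u16]) -> Result<String, u16> {
        let mut units: Vec<u16> = self.pending_high.take().into_iter().collect();
        units.extend_from_slice(chunk);
        if let Some(&last) = units.last() {
            // 0xD800..=0xDBFF is the high (leading) surrogate range.
            if (0xD800..=0xDBFF).contains(&last) {
                self.pending_high = Some(last);
                units.pop();
            }
        }
        decode_utf16(units.iter().copied())
            .collect::<Result<String, _>>()
            .map_err(|e| e.unpaired_surrogate())
    }
}

/// Write side: the longest prefix of `bytes` that is valid UTF-8.
/// An empty result means the very first byte was already invalid
/// (or the input was empty or started with an incomplete sequence).
fn valid_utf8_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        // `valid_up_to` is the length of the valid leading portion.
        Err(e) => std::str::from_utf8(&bytes[..e.valid_up_to()]).unwrap(),
    }
}

/// Write side: how many UTF-8 bytes of `text` are covered by the first
/// `units_written` UTF-16 units. A half-written surrogate pair counts as
/// fully written, on the assumption that the caller buffers the other half.
fn utf8_bytes_written(text: &str, units_written: usize) -> usize {
    let mut units = 0;
    let mut bytes = 0;
    for c in text.chars() {
        if units >= units_written {
            break;
        }
        units += c.len_utf16();
        bytes += c.len_utf8();
    }
    bytes
}
```

With this sketch, a chunk ending in the high surrogate of "😀" decodes only the text before it, and the emoji appears once the next chunk supplies the low surrogate. On the write side, reporting one UTF-16 unit written for "😀" maps to 4 UTF-8 bytes, matching the "include the half-written character" rule above.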

At this time I'm not sure what guarantees Windows actually gives us for these APIs. For example, does Windows guarantee that it never splits a surrogate pair across a read or write? If Windows were to guarantee properties like this (or that it always writes the entire buffer, for example), our lives would be a lot easier! For now, though, it is unclear what we can rely on, so this issue is written as if we can rely on nothing.


stdio: Handle unicode boundaries better on Windows #23344
