stdio: Handle unicode boundaries better on Windows #23344

Closed
@alexcrichton

The stdio support on Windows has two separate modes. One is when stdout is not a console, which implies byte orientation; the other is when stdout is a console, which implies [u16] orientation. In the stdout-is-a-console case we require all input and output to be valid Unicode so we can translate [u8] to [u16].
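
As a rough illustration of that translation (a sketch only, not the actual libstd code; the helper name `utf8_to_utf16` is made up for this example), a strict whole-buffer conversion looks something like this:

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper: translate a byte buffer into UTF-16 code units.
/// The whole buffer must be valid UTF-8, otherwise nothing is translated.
fn utf8_to_utf16(bytes: &[u8]) -> Result<Vec<u16>, Utf8Error> {
    // Validation is all-or-nothing here, which is the strictness
    // this issue is about.
    let text = str::from_utf8(bytes)?;
    // Characters outside the BMP become a surrogate pair (two u16s).
    Ok(text.encode_utf16().collect())
}
```

Note that a single non-BMP character such as U+1F600 turns into two code units, which is where the surrogate-pair concerns in the rest of this issue come from.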

In this console case, however, our translation may not be quite right for a number of scenarios:

  • When reading data, the console will give us a block of u16's which we need to translate to UTF-8. In theory the console can give us half of a surrogate pair (with the next half available on the next call to read), but we do not handle this case well currently. This should be fixed by simply adding a "buffer of size 1" to hold half of a surrogate pair if necessary.
  • When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to read, then return an error, then return the remaining on the next call to read.
  • When writing data, we require that the entire block of input is valid UTF-8. We should instead take an approach similar to the one above, where we try to interpret as much of the input as possible as UTF-8 and simply ignore the rest for that one call to write (returning an error if the first byte is invalid UTF-8).
  • When writing data, we don't handle the case where a multibyte character straddles calls to write. Like the reading case, this could be alleviated with a 4-byte buffer for unwritten-but-valid-utf8 characters.
  • When writing, we translate a block of valid UTF-8 to a block of UTF-16. Upon writing this UTF-16, however, not all characters may be written. There are two problems here:
    • We need to translate a length of UTF-16 characters to a number of UTF-8 bytes that were written.
    • If half of a surrogate pair is written, the other half should be "buffered" and the length of UTF-8 written should include the half-written character (as the extra data is buffered).
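
The two read-side items above could be sketched roughly as follows. This is only an illustration under assumptions: `Utf16Reader`, `Step`, and `feed` are hypothetical names, and the `Vec<Step>` stands in for what successive calls to read would return. The only std API relied on is `char::decode_utf16`.

```rust
use std::char::decode_utf16;

/// Hypothetical read-side state: the "buffer of size 1" that holds
/// a high surrogate whose other half has not arrived yet.
#[derive(Default)]
struct Utf16Reader {
    pending_high: Option<u16>,
}

/// What successive calls to read would hand back, in order.
#[derive(Debug, PartialEq)]
enum Step {
    /// A run of successfully decoded text.
    Text(String),
    /// An unpaired surrogate that would surface as an error.
    Invalid(u16),
}

impl Utf16Reader {
    /// Decode one block of code units as received from the console.
    fn feed(&mut self, block: &[u16]) -> Vec<Step> {
        // Prepend the half pair left over from the previous block, if any.
        let mut units: Vec<u16> = self.pending_high.take().into_iter().collect();
        units.extend_from_slice(block);

        // A trailing high surrogate may be completed by the next block,
        // so hold it back instead of reporting it as invalid.
        if let Some(last) = units.last().copied() {
            if (0xD800..=0xDBFF).contains(&last) {
                self.pending_high = units.pop();
            }
        }

        let mut steps = Vec::new();
        let mut text = String::new();
        for result in decode_utf16(units.iter().copied()) {
            match result {
                Ok(c) => text.push(c),
                Err(e) => {
                    // Flush the valid run first, then report the bad unit.
                    if !text.is_empty() {
                        steps.push(Step::Text(std::mem::take(&mut text)));
                    }
                    steps.push(Step::Invalid(e.unpaired_surrogate()));
                }
            }
        }
        if !text.is_empty() {
            steps.push(Step::Text(text));
        }
        steps
    }
}
```

With this shape, a block ending in half of a pair yields only the text before it, and the pair is completed when the next block is fed in. A lone low surrogate in the middle of a block splits the output into text, error, text, matching the "first 5, then an error, then the remaining" behavior described above.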

At this time I'm not sure what sort of guarantees Windows actually gives us on these APIs. For example, does Windows never deal with only half of a surrogate pair on a read/write? If Windows were to guarantee various properties like this (or that it always writes the entire buffer, for example), our lives would be a lot easier! For now, though, it is unclear what we can rely on, so this issue is written as if we rely on nothing.
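
Relying on nothing, the write-side items from the list could look roughly like the sketch below. The names `Plan`, `plan_write`, and `utf8_len_for_utf16_units` are hypothetical and nothing here calls a Windows API; it only shows the bookkeeping, using `Utf8Error::valid_up_to` and `Utf8Error::error_len` from std.

```rust
use std::str;

/// Hypothetical description of what one call to write would do.
#[derive(Debug, PartialEq)]
enum Plan<'a> {
    /// Write `text` now; `buffered` is the start of a character that was
    /// cut off by the end of the input (at most 3 bytes).
    Write { text: &'a str, buffered: &'a [u8] },
    /// The very first byte is invalid UTF-8, so the call should fail.
    Invalid,
}

fn plan_write(input: &[u8]) -> Plan<'_> {
    match str::from_utf8(input) {
        Ok(text) => Plan::Write { text, buffered: &[] },
        Err(e) => {
            let (valid, rest) = input.split_at(e.valid_up_to());
            // `error_len() == None` means the input ended in the middle of
            // a character, so the tail may be completed by the next write.
            // Otherwise the tail is invalid and is ignored for this call.
            let buffered: &[u8] = if e.error_len().is_none() { rest } else { &[] };
            if valid.is_empty() && buffered.is_empty() {
                return Plan::Invalid;
            }
            let text = str::from_utf8(valid).expect("prefix was already validated");
            Plan::Write { text, buffered }
        }
    }
}

/// Translate "the console accepted `units_written` UTF-16 code units of
/// `text`" into a count of UTF-8 bytes. A character whose surrogate pair
/// was only half written is counted in full, on the assumption that the
/// other half gets buffered by the caller.
fn utf8_len_for_utf16_units(text: &str, units_written: usize) -> usize {
    let mut units = 0;
    let mut bytes = 0;
    for c in text.chars() {
        if units >= units_written {
            break;
        }
        units += c.len_utf16();
        bytes += c.len_utf8();
    }
    bytes
}
```

For example, with the text "a€😀" (1 + 1 + 2 code units, 1 + 3 + 4 bytes), a short write of 3 code units would be reported as 8 bytes, since the emoji's second half is treated as buffered rather than lost.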

Labels

  • A-Unicode (Area: Unicode)
  • C-bug (Category: This is a bug.)
  • T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)
