Description
I tried this code:
println!("{:?}", "12345678901234".encode_utf16().size_hint());
let mut it = "\u{101234}".encode_utf16();
it.next().unwrap();
println!("{:?}", it.size_hint());
I expected to see this happen:
(5, Some(14))
(1, Some(1))
Instead, this happened:
(4, Some(28))
(0, Some(0))
Meta
`rustc --version --verbose`:
rustc 1.73.0-nightly (39f42ad9e 2023-07-19)
binary: rustc
commit-hash: 39f42ad9e8430a8abb06c262346e89593278c515
commit-date: 2023-07-19
host: x86_64-pc-windows-msvc
release: 1.73.0-nightly
LLVM version: 16.0.5
The reason is that the `EncodeUtf16` iterator calculates its size hint from the size hint of the contained `Chars` iterator, assuming that each character can correspond to either 1 or 2 code units.
When the iterator is NOT in the middle of a surrogate pair, this leads to lower bounds that are too low and upper bounds that are too high.
When the iterator IS in the middle of a surrogate pair, the pending low surrogate is not counted, because the underlying `Chars` iterator has already advanced past that character.
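To make the two failure modes concrete, here is a minimal sketch of a `Chars`-based calculation like the one described above. The function name `chars_based_hint` is hypothetical and the code is not copied from the standard library; it assumes the hint simply takes the `Chars` lower bound and doubles the upper bound.

```rust
use std::str::Chars;

/// Hypothetical stand-in for the calculation described above:
/// keep the lower bound of `Chars` and double its upper bound,
/// since each char becomes one or two UTF-16 code units.
fn chars_based_hint(chars: &Chars<'_>) -> (usize, Option<usize>) {
    let (low, high) = chars.size_hint();
    (low, high.and_then(|n| n.checked_mul(2)))
}

fn main() {
    // Case 1: 14 ASCII bytes, nothing consumed yet.
    let chars = "12345678901234".chars();
    println!("{:?}", chars_based_hint(&chars));

    // Case 2: the only char has been consumed (its high surrogate was
    // yielded), so `Chars` is empty and the pending low surrogate is lost.
    let mut chars = "\u{101234}".chars();
    chars.next();
    println!("{:?}", chars_based_hint(&chars));
}
```

Nothing in this calculation knows about a buffered second code unit, which is why the second case cannot report the remaining element.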
The actual calculation should be done in terms of the remaining bytes:
- The lower bound is achieved by assuming the remaining bytes consist of as many 3-byte sequences as possible (each of which encodes to a single code unit), optionally followed by a 1- or 2-byte sequence, leading to a lower bound of `(bytes_remaining + 2) / 3`.
- The upper bound is achieved by assuming the remaining bytes consist of 1-byte sequences, leading to an upper bound of `bytes_remaining`.
In the case of the iterator being positioned in the middle of a surrogate pair, both these values should be increased by 1.
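
The proposed calculation can be sketched as a free function. The names `utf16_size_hint`, `bytes_remaining` and `pending_low_surrogate` are hypothetical and chosen for illustration; a real fix would read these values from the iterator's internal state.

```rust
/// Hypothetical helper sketching the byte-based size hint proposed above.
///
/// `bytes_remaining` is the number of UTF-8 bytes not yet consumed, and
/// `pending_low_surrogate` is true when the first half of a surrogate pair
/// has been yielded but the second half has not.
fn utf16_size_hint(
    bytes_remaining: usize,
    pending_low_surrogate: bool,
) -> (usize, Option<usize>) {
    let extra = usize::from(pending_low_surrogate);
    // Fewest code units: as many 3-byte sequences as possible, each
    // producing one code unit. String lengths never exceed isize::MAX,
    // so the additions below cannot overflow for real input.
    let lower = (bytes_remaining + 2) / 3 + extra;
    // Most code units: every remaining byte is a 1-byte sequence.
    let upper = bytes_remaining + extra;
    (lower, Some(upper))
}

fn main() {
    // 14 ASCII bytes, nothing consumed yet.
    println!("{:?}", utf16_size_hint("12345678901234".len(), false)); // (5, Some(14))
    // "\u{101234}" after its high surrogate was yielded: no bytes remain,
    // but one code unit is still pending.
    println!("{:?}", utf16_size_hint(0, true)); // (1, Some(1))
}
```

Both results match the expected output listed at the top of this report.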