Description
The Display
implementations for Path
, PathBuf
, OsStr
, and OsString
incorrectly handle Formatter<'_>
flags. This is visible on stable (for paths) or with #![feature(os_str_display)]
(for OS strings).
Background
On Windows, these types convert from wide strings (u16
) to WTF-8 and the only invalid sequences are unpaired surrogate halves. On other systems, these types use conventional UTF-8 and may have any kind of invalid sequence. The Display
implementations replace such invalid sequences with the replacement character U+FFFD. This is done by iterating over valid chunks, where each is followed by an invalid sequence.
Bugs
In the Windows implementation, Wtf8
, when the full string is valid UTF-8, it forwards the formatter args by calling Display::fmt(s, f)
, so it is padded as requested in most cases. However, when it is invalid, the formatter args are ignored. In the non-Windows implementation, Bytes
, the last valid chunk is formatted with Display::fmt(last_chunk, f)
, but all other chunks with f.write_str(s)
. This is not only inconsistent, but it can display padding inside the string data.
This illustrates the buggy Display
implementations in (AFAIK) all the APIs they're exposed in:
#![feature(os_str_display)]
use std::{ffi::OsString, path::PathBuf};
#[cfg(unix)]
{
use std::os::unix::ffi::OsStringExt;
display(OsString::from_vec(b"b\xFFd".to_vec()), "ab� d e"); // BUG
display(OsString::from_vec(b"bcd".to_vec()), "a bcd e");
}
#[cfg(windows)]
{
use std::os::windows::ffi::OsStringExt;
macro_rules! wtf8[($($c:literal),*) => { &[$($c as u16),*] }];
display(OsString::from_wide(wtf8![b'b', 0xD800, b'd']), "ab�de"); // BUG
display(OsString::from_wide(wtf8![b'b', b'c', b'd']), "a bcd e");
}
assert_eq!(format!("a{:^10}e", "bcd"), "a bcd e");
fn display(s: OsString, expect: &str) {
let p = PathBuf::from(s.clone());
assert_eq!(format!("a{:^10}e", s.display()), expect);
assert_eq!(format!("a{:^10}e", s.as_os_str().display()), expect);
assert_eq!(format!("a{:^10}e", p.display()), expect);
assert_eq!(format!("a{:^10}e", p.as_path().display()), expect);
}
History
- 742ca0caf25 (std: Respect formatting flags for str-like OsStr, 2017-08-12): Forward formatter flags for fully valid strings
- ac96fd77874 (Avoid allocations in Debug for os_str, 2017-06-12): Display with
Utf8Chunks
-style iteration - 45ddf50cebd (Add new path module, 2015-01-29): Display with
String::from_utf8_lossy
Fix
We cannot simply ignore the formatter args, because that would break backwards compatibility, as the common case of fully valid UTF-8 respects padding, matching str
and String
.
The simplest fix would be to bring the behavior of non-Windows in line with Windows, i.e., fix the erroneous padding for the last chunk by only forwarding formatter args if the full string is valid.
A simple fix would be to format self.to_string_lossy()
instead of doing a manual iteration. However, this incurs an allocation when invalid.
I propose doing the formatting by scanning in two passes. The first pass counts the number of Unicode scalars to render. Since there's no escaping, this is relatively inexpensive. Then the left padding is written. Then the second pass writes the string. Finally, the right padding is written. This could be abstracted by making a method on the formatter, which takes a length in Unicode scalars and a callback which writes the data: It would pad left according to the length, write via the callback, then pad right.
I would like to work on fixing this, pending discussion on which direction to take it :).
Related?
I suspect that other types could have similar issues with inconsistent or misplaced formatter flag handling, but have only surveyed those detailed here.