-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Avoid redundant WTF-8 checks in PathBuf
#140159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
rustbot has assigned @workingjubilee. Use |
72b18d0
to
bd2e262
Compare
oh dear. |
@ChrisDenton wanna review this? They've reviewed my other WTF-8 stuff, so I probably should have r?'d them. |
Failed to set assignee to
|
library/std/src/ffi/os_str.rs
Outdated
/// Extending the buffer with this slice must not require encoding-dependent | ||
/// surrogate joining. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"must not require encoding-dependent surrogate joining" is deferring the safety requirement to instead be "you must know exactly how the implementation works", so I think this is not a significant improvement on messaging for the (ahem) modularity of reasoning.
unless we can concisely describe what requires that, in which case "don't put that in" should be the requirement instead.
and I am pretty sure we can?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically "the last bytes of self
must not encode a leading/high surrogate for a surrogate pair, and the first bytes of other
must not encode a trailing/low surrogate of a surrogate pair". is there another case?
library/std/src/sys/os_str/wtf8.rs
Outdated
/// Extending the buffer with this slice must not require joining | ||
/// surrogates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, it feels like "must not require joining surrogates" kinda wants to just say what the bad input that you shouldn't give it is but is being coy instead.
/// # Safety | ||
/// | ||
/// The length must be at an `OsStr` boundary, according to | ||
/// `Slice::check_public_boundary`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having increasingly negative opinions about having an internal type named Slice
but it is not your responsibility for that, so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then what would a better name be? I might submit a PR for that. But maybe not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, honestly! it just is starting to bug me because this looked like a typo at first.
"well, my child, when a high surrogate and a low surrogate get together, they blow your code up stop doing that" |
Communicate the safety invariants of these methods with `unsafe fn` rather than privacy.
Eliminate checks for WTF-8 boundaries in `PathBuf::set_extension` and `add_extension`, where joining WTF-8 surrogate halves is impossible. Don't convert the `str` to `OsStr`, because `OsString::push` specializes to skip the joining when given strings.
bd2e262
to
0f0c0d8
Compare
I've updated the safety comments. In addition to rewording based on your feedback, I also mentioned that the slice must be valid for the given encoding like |
cool, that looks good to me. should be able to r+ it in the morning or thereabouts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/// This bypasses the encoding-dependent surrogate joining, so `self` must | ||
/// not end with a leading surrogate half and `other` must not start with | ||
/// with a trailing surrogate half. | ||
#[inline] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you to @thaliaarchi for helping explain to/remind me that the reason this is sufficient as a safety condition is that joined surrogate pairs get distinct encodings that are not identical to "a lone surrogate followed by a lone surrogate" because that's why UTF-8 is, er, UTF-8 instead of... yeah.
To reiterate for my own benefit, because I have an annoying "I understand this from the bottom up or it doesn't click" problem, this required understanding:
- UTF-16 evolved from UCS-2, an encoding for the "basic multilingual plane"
- UTF-16 encodes "astral" code points beyond the BMP using "surrogate pairs", unassigned values from UCS-2 that served as a "niche" in the encoding, to fit in each single u16 the de facto enum:
enum Utf16 {
Single(Ucs2),
Paired(Surrogate)
}
- Nominally, "UTF-16" encoded data should reject the
Utf16::Paired
variant when it isn't actually paired, but a lot of "WTF-16" encodings actually allow lone surrogates in because everyone was still tacitly assuming handling things code-unit-by-code-unit would work fine and that's not how any multi-code-unit encoding works. - WTF-8 always encodes values in UTF-8 when values can be validly encoded in UTF-8.
- WTF-8 has an encoding for lone surrogates so it can handle invalid "WTF-16", taken out of UTF-8's own encoding "niche": it is not valid UTF-8.
- This means we have to handle "this surrogate should not exist" and "this surrogate is paired with another surrogate so they're a real
boyUnicode code point". Thus, putting two surrogates "one after the other" is not always "simple" appending for WTF-8, when it can actually mean decoding the pair of code units into a Unicode code point and then encoding that into UTF-8.
( There's a detail for "trailing and leading" surrogates about the actual encoding for surrogates but I am deliberately eliding it because everyone spends five hours of explanation on exactly how surrogates are encoded which doesn't matter when thinking about the design at a higher level, such that everything that actually matters oozes out of my brain before we get to the conclusion. )
( I guess that's me not being completely bottom-up but whatever. )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...this is almost completely irrelevant to this actual review obviously I'm just trying to write this down again to try to get it to stick to my brain this time, as I have tried to learn/understand this at least 3 times.
@bors r+ |
Rollup of 12 pull requests Successful merges: - rust-lang#134034 (handle paren in macro expand for let-init-else expr) - rust-lang#137474 (pretty-print: Print shebang at the top of the output) - rust-lang#138872 (rustc_target: RISC-V `Zfinx` is incompatible with `{ILP32,LP64}[FD]` ABIs) - rust-lang#139046 (Improve `Lifetime::suggestion`) - rust-lang#139206 (std: use the address of `errno` to identify threads in `unique_thread_exit`) - rust-lang#139608 (Clarify `async` block behaviour) - rust-lang#139847 (Delegate to inner `vec::IntoIter` from `env::ArgsOs`) - rust-lang#140159 (Avoid redundant WTF-8 checks in `PathBuf`) - rust-lang#140197 (Document breaking out of a named code block) - rust-lang#140389 (Remove `avx512dq` and `avx512vl` implication for `avx512fp16`) - rust-lang#140430 (Improve test coverage of HIR pretty printing.) - rust-lang#140507 (rustc_target: RISC-V: feature addition batch 3) r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#140159 - thaliaarchi:pathbuf-extension, r=workingjubilee Avoid redundant WTF-8 checks in `PathBuf` Eliminate checks for WTF-8 boundaries in `PathBuf::set_extension` and `add_extension`, where joining WTF-8 surrogate halves is impossible. Don't convert the `str` to `OsStr`, because `OsString::push` specializes to skip the joining when given strings. To assist in this, mark the internal methods `OsString::truncate` and `extend_from_slice` as `unsafe` to communicate their safety invariants better than with module privacy. Similar to rust-lang#137777. cc `@joboet` `@ChrisDenton`
Eliminate checks for WTF-8 boundaries in
PathBuf::set_extension
andadd_extension
, where joining WTF-8 surrogate halves is impossible. Don't convert thestr
toOsStr
, becauseOsString::push
specializes to skip the joining when given strings.To assist in this, mark the internal methods
OsString::truncate
andextend_from_slice
asunsafe
to communicate their safety invariants better than with module privacy.Similar to #137777.
cc @joboet @ChrisDenton