Avoid redundant WTF-8 checks in `PathBuf` #140159

thaliaarchi · 2025-04-22T10:53:09Z

Eliminate checks for WTF-8 boundaries in PathBuf::set_extension and add_extension, where joining WTF-8 surrogate halves is impossible. Don't convert the str to OsStr, because OsString::push specializes to skip the joining when given strings.

To assist in this, mark the internal methods OsString::truncate and extend_from_slice as unsafe to communicate their safety invariants better than with module privacy.

Similar to #137777.

cc @joboet @ChrisDenton

rustbot · 2025-04-22T10:53:15Z

r? @workingjubilee

rustbot has assigned @workingjubilee.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

workingjubilee · 2025-05-01T04:39:00Z

oh dear.

thaliaarchi · 2025-05-01T04:57:54Z

@ChrisDenton wanna review this? They've reviewed my other WTF-8 stuff, so I probably should have r?'d them.

rustbot · 2025-05-01T04:57:57Z

Failed to set assignee to 'd: invalid assignee

Note: Only org members with at least the repository "read" role, users with write permissions, or people who have commented on the PR may be assigned.

workingjubilee · 2025-05-01T05:03:58Z

library/std/src/ffi/os_str.rs

+    /// Extending the buffer with this slice must not require encoding-dependent
+    /// surrogate joining.


"must not require encoding-dependent surrogate joining" is deferring the safety requirement to instead be "you must know exactly how the implementation works", so I think this is not a significant improvement on messaging for the (ahem) modularity of reasoning.

unless we can concisely describe what requires that, in which case "don't put that in" should be the requirement instead.

and I am pretty sure we can?

specifically "the last bytes of self must not encode a leading/high surrogate for a surrogate pair, and the first bytes of other must not encode a trailing/low surrogate of a surrogate pair". is there another case?

workingjubilee · 2025-05-01T05:06:28Z

library/std/src/sys/os_str/wtf8.rs

+    /// Extending the buffer with this slice must not require joining
+    /// surrogates.


yeah, it feels like "must not require joining surrogates" kinda wants to just say what the bad input that you shouldn't give it is but is being coy instead.

workingjubilee · 2025-05-01T05:07:50Z

library/std/src/sys/os_str/bytes.rs

+    /// # Safety
+    ///
+    /// The length must be at an `OsStr` boundary, according to
+    /// `Slice::check_public_boundary`.


I am having increasingly negative opinions about having an internal type named Slice but it is not your responsibility for that, so

Then what would a better name be? I might submit a PR for that. But maybe not.

I am not sure, honestly! it just is starting to bug me because this looked like a typo at first.

workingjubilee · 2025-05-01T05:20:53Z

"well, my child, when a high surrogate and a low surrogate get together, they blow your code up stop doing that"

Communicate the safety invariants of these methods with `unsafe fn` rather than privacy.

Eliminate checks for WTF-8 boundaries in `PathBuf::set_extension` and `add_extension`, where joining WTF-8 surrogate halves is impossible. Don't convert the `str` to `OsStr`, because `OsString::push` specializes to skip the joining when given strings.

thaliaarchi · 2025-05-01T06:59:04Z

I've updated the safety comments. In addition to rewording based on your feedback, I also mentioned that the slice must be valid for the given encoding like from_encoded_bytes_unchecked.

workingjubilee · 2025-05-01T07:05:52Z

cool, that looks good to me. should be able to r+ it in the morning or thereabouts.

workingjubilee

Thanks!

workingjubilee · 2025-05-01T23:08:02Z

library/std/src/ffi/os_str.rs

+    /// This bypasses the encoding-dependent surrogate joining, so `self` must
+    /// not end with a leading surrogate half and `other` must not start with
+    /// with a trailing surrogate half.
    #[inline]


Thank you to @thaliaarchi for helping explain to/remind me that the reason this is sufficient as a safety condition is that joined surrogate pairs get distinct encodings that are not identical to "a lone surrogate followed by a lone surrogate" because that's why UTF-8 is, er, UTF-8 instead of... yeah.

To reiterate for my own benefit, because I have an annoying "I understand this from the bottom up or it doesn't click" problem, this required understanding:

UTF-16 evolved from UCS-2, an encoding for the "basic multilingual plane"

UTF-16 encodes "astral" code points beyond the BMP using "surrogate pairs", unassigned values from UCS-2 that served as a "niche" in the encoding, to fit in each single u16 the de facto enum:

enum Utf16 { Single(Ucs2), Paired(Surrogate) }

Nominally, "UTF-16" encoded data should reject the Utf16::Paired variant when it isn't actually paired, but a lot of "WTF-16" encodings actually allow lone surrogates in because everyone was still tacitly assuming handling things code-unit-by-code-unit would work fine and that's not how any multi-code-unit encoding works.

WTF-8 always encodes values in UTF-8 when values can be validly encoded in UTF-8.

WTF-8 has an encoding for lone surrogates so it can handle invalid "WTF-16", taken out of UTF-8's own encoding "niche": it is not valid UTF-8.

This means we have to handle "this surrogate should not exist" and "this surrogate is paired with another surrogate so they're a real ~~boy~~ Unicode code point". Thus, putting two surrogates "one after the other" is not always "simple" appending for WTF-8, when it can actually mean decoding the pair of code units into a Unicode code point and then encoding that into UTF-8.

( There's a detail for "trailing and leading" surrogates about the actual encoding for surrogates but I am deliberately eliding it because everyone spends five hours of explanation on exactly how surrogates are encoded which doesn't matter when thinking about the design at a higher level, such that everything that actually matters oozes out of my brain before we get to the conclusion. )

( I guess that's me not being completely bottom-up but whatever. )

...this is almost completely irrelevant to this actual review obviously I'm just trying to write this down again to try to get it to stick to my brain this time, as I have tried to learn/understand this at least 3 times.

workingjubilee · 2025-05-02T00:46:15Z

@bors r+

bors · 2025-05-02T00:46:18Z

📌 Commit 0f0c0d8 has been approved by workingjubilee

It is now in the queue for this repository.

Rollup of 12 pull requests Successful merges: - rust-lang#134034 (handle paren in macro expand for let-init-else expr) - rust-lang#137474 (pretty-print: Print shebang at the top of the output) - rust-lang#138872 (rustc_target: RISC-V `Zfinx` is incompatible with `{ILP32,LP64}[FD]` ABIs) - rust-lang#139046 (Improve `Lifetime::suggestion`) - rust-lang#139206 (std: use the address of `errno` to identify threads in `unique_thread_exit`) - rust-lang#139608 (Clarify `async` block behaviour) - rust-lang#139847 (Delegate to inner `vec::IntoIter` from `env::ArgsOs`) - rust-lang#140159 (Avoid redundant WTF-8 checks in `PathBuf`) - rust-lang#140197 (Document breaking out of a named code block) - rust-lang#140389 (Remove `avx512dq` and `avx512vl` implication for `avx512fp16`) - rust-lang#140430 (Improve test coverage of HIR pretty printing.) - rust-lang#140507 (rustc_target: RISC-V: feature addition batch 3) r? `@ghost` `@rustbot` modify labels: rollup

Rollup merge of rust-lang#140159 - thaliaarchi:pathbuf-extension, r=workingjubilee Avoid redundant WTF-8 checks in `PathBuf` Eliminate checks for WTF-8 boundaries in `PathBuf::set_extension` and `add_extension`, where joining WTF-8 surrogate halves is impossible. Don't convert the `str` to `OsStr`, because `OsString::push` specializes to skip the joining when given strings. To assist in this, mark the internal methods `OsString::truncate` and `extend_from_slice` as `unsafe` to communicate their safety invariants better than with module privacy. Similar to rust-lang#137777. cc `@joboet` `@ChrisDenton`

rustbot assigned workingjubilee Apr 22, 2025

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 22, 2025

thaliaarchi force-pushed the pathbuf-extension branch from 72b18d0 to bd2e262 Compare April 22, 2025 12:54

workingjubilee reviewed May 1, 2025

View reviewed changes

thaliaarchi added 2 commits April 30, 2025 23:56

Make internal OsString::truncate and extend_from_slice unsafe

7cb357a

Communicate the safety invariants of these methods with `unsafe fn` rather than privacy.

thaliaarchi force-pushed the pathbuf-extension branch from bd2e262 to 0f0c0d8 Compare May 1, 2025 06:57

workingjubilee approved these changes May 2, 2025

View reviewed changes

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 2, 2025

Zalathar mentioned this pull request May 2, 2025

Rollup of 12 pull requests #140581

Merged

bors merged commit 5a58c7a into rust-lang:master May 2, 2025
6 checks passed

rustbot added this to the 1.88.0 milestone May 2, 2025

thaliaarchi deleted the pathbuf-extension branch May 3, 2025 03:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid redundant WTF-8 checks in `PathBuf` #140159

Avoid redundant WTF-8 checks in `PathBuf` #140159

thaliaarchi commented Apr 22, 2025

rustbot commented Apr 22, 2025

workingjubilee commented May 1, 2025

thaliaarchi commented May 1, 2025

rustbot commented May 1, 2025

workingjubilee May 1, 2025

workingjubilee May 1, 2025

workingjubilee May 1, 2025

workingjubilee May 1, 2025

thaliaarchi May 1, 2025

workingjubilee May 1, 2025

workingjubilee commented May 1, 2025

thaliaarchi commented May 1, 2025

workingjubilee commented May 1, 2025

workingjubilee left a comment

workingjubilee May 1, 2025

workingjubilee May 2, 2025

workingjubilee commented May 2, 2025

bors commented May 2, 2025

		/// Extending the buffer with this slice must not require encoding-dependent
		/// surrogate joining.

		/// Extending the buffer with this slice must not require joining
		/// surrogates.

Avoid redundant WTF-8 checks in PathBuf #140159

Avoid redundant WTF-8 checks in PathBuf #140159

Conversation

thaliaarchi commented Apr 22, 2025

rustbot commented Apr 22, 2025

workingjubilee commented May 1, 2025

thaliaarchi commented May 1, 2025

rustbot commented May 1, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

workingjubilee commented May 1, 2025

thaliaarchi commented May 1, 2025

workingjubilee commented May 1, 2025

workingjubilee left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

workingjubilee commented May 2, 2025

bors commented May 2, 2025

Avoid redundant WTF-8 checks in `PathBuf` #140159

Avoid redundant WTF-8 checks in `PathBuf` #140159