Skip to content

Avoid redundant WTF-8 checks in PathBuf #140159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 2, 2025

Conversation

thaliaarchi
Copy link
Contributor

Eliminate checks for WTF-8 boundaries in PathBuf::set_extension and add_extension, where joining WTF-8 surrogate halves is impossible. Don't convert the str to OsStr, because OsString::push specializes to skip the joining when given strings.

To assist in this, mark the internal methods OsString::truncate and extend_from_slice as unsafe to communicate their safety invariants better than with module privacy.

Similar to #137777.

cc @joboet @ChrisDenton

@rustbot
Copy link
Collaborator

rustbot commented Apr 22, 2025

r? @workingjubilee

rustbot has assigned @workingjubilee.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 22, 2025
@workingjubilee
Copy link
Member

oh dear.

@thaliaarchi
Copy link
Contributor Author

@ChrisDenton wanna review this? They've reviewed my other WTF-8 stuff, so I probably should have r?'d them.

@rustbot
Copy link
Collaborator

rustbot commented May 1, 2025

Failed to set assignee to 'd: invalid assignee

Note: Only org members with at least the repository "read" role, users with write permissions, or people who have commented on the PR may be assigned.

Comment on lines 594 to 595
/// Extending the buffer with this slice must not require encoding-dependent
/// surrogate joining.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"must not require encoding-dependent surrogate joining" is deferring the safety requirement to instead be "you must know exactly how the implementation works", so I think this is not a significant improvement on messaging for the (ahem) modularity of reasoning.

unless we can concisely describe what requires that, in which case "don't put that in" should be the requirement instead.

and I am pretty sure we can?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically "the last bytes of self must not encode a leading/high surrogate for a surrogate pair, and the first bytes of other must not encode a trailing/low surrogate of a surrogate pair". is there another case?

Comment on lines 215 to 216
/// Extending the buffer with this slice must not require joining
/// surrogates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it feels like "must not require joining surrogates" kinda wants to just say what the bad input that you shouldn't give it is but is being coy instead.

/// # Safety
///
/// The length must be at an `OsStr` boundary, according to
/// `Slice::check_public_boundary`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having increasingly negative opinions about having an internal type named Slice but it is not your responsibility for that, so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what would a better name be? I might submit a PR for that. But maybe not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, honestly! it just is starting to bug me because this looked like a typo at first.

@workingjubilee
Copy link
Member

"well, my child, when a high surrogate and a low surrogate get together, they blow your code up stop doing that"

Communicate the safety invariants of these methods with `unsafe fn`
rather than privacy.
Eliminate checks for WTF-8 boundaries in `PathBuf::set_extension` and
`add_extension`, where joining WTF-8 surrogate halves is impossible.
Don't convert the `str` to `OsStr`, because `OsString::push` specializes
to skip the joining when given strings.
@thaliaarchi thaliaarchi force-pushed the pathbuf-extension branch from bd2e262 to 0f0c0d8 Compare May 1, 2025 06:57
@thaliaarchi
Copy link
Contributor Author

I've updated the safety comments. In addition to rewording based on your feedback, I also mentioned that the slice must be valid for the given encoding like from_encoded_bytes_unchecked.

@workingjubilee
Copy link
Member

cool, that looks good to me. should be able to r+ it in the morning or thereabouts.

Copy link
Member

@workingjubilee workingjubilee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +597 to 600
/// This bypasses the encoding-dependent surrogate joining, so `self` must
/// not end with a leading surrogate half and `other` must not start with
/// with a trailing surrogate half.
#[inline]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you to @thaliaarchi for helping explain to/remind me that the reason this is sufficient as a safety condition is that joined surrogate pairs get distinct encodings that are not identical to "a lone surrogate followed by a lone surrogate" because that's why UTF-8 is, er, UTF-8 instead of... yeah.

To reiterate for my own benefit, because I have an annoying "I understand this from the bottom up or it doesn't click" problem, this required understanding:

  • UTF-16 evolved from UCS-2, an encoding for the "basic multilingual plane"
  • UTF-16 encodes "astral" code points beyond the BMP using "surrogate pairs", unassigned values from UCS-2 that served as a "niche" in the encoding, to fit in each single u16 the de facto enum:
enum Utf16 {
    Single(Ucs2),
    Paired(Surrogate)
}
  • Nominally, "UTF-16" encoded data should reject the Utf16::Paired variant when it isn't actually paired, but a lot of "WTF-16" encodings actually allow lone surrogates in because everyone was still tacitly assuming handling things code-unit-by-code-unit would work fine and that's not how any multi-code-unit encoding works.
  • WTF-8 always encodes values in UTF-8 when values can be validly encoded in UTF-8.
  • WTF-8 has an encoding for lone surrogates so it can handle invalid "WTF-16", taken out of UTF-8's own encoding "niche": it is not valid UTF-8.
  • This means we have to handle "this surrogate should not exist" and "this surrogate is paired with another surrogate so they're a real boy Unicode code point". Thus, putting two surrogates "one after the other" is not always "simple" appending for WTF-8, when it can actually mean decoding the pair of code units into a Unicode code point and then encoding that into UTF-8.

( There's a detail for "trailing and leading" surrogates about the actual encoding for surrogates but I am deliberately eliding it because everyone spends five hours of explanation on exactly how surrogates are encoded which doesn't matter when thinking about the design at a higher level, such that everything that actually matters oozes out of my brain before we get to the conclusion. )

( I guess that's me not being completely bottom-up but whatever. )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...this is almost completely irrelevant to this actual review obviously I'm just trying to write this down again to try to get it to stick to my brain this time, as I have tried to learn/understand this at least 3 times.

@workingjubilee
Copy link
Member

@bors r+

@bors
Copy link
Collaborator

bors commented May 2, 2025

📌 Commit 0f0c0d8 has been approved by workingjubilee

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 2, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request May 2, 2025
Rollup of 12 pull requests

Successful merges:

 - rust-lang#134034 (handle paren in macro expand for let-init-else expr)
 - rust-lang#137474 (pretty-print: Print shebang at the top of the output)
 - rust-lang#138872 (rustc_target: RISC-V `Zfinx` is incompatible with `{ILP32,LP64}[FD]` ABIs)
 - rust-lang#139046 (Improve `Lifetime::suggestion`)
 - rust-lang#139206 (std: use the address of `errno` to identify threads in `unique_thread_exit`)
 - rust-lang#139608 (Clarify `async` block behaviour)
 - rust-lang#139847 (Delegate to inner `vec::IntoIter` from `env::ArgsOs`)
 - rust-lang#140159 (Avoid redundant WTF-8 checks in `PathBuf`)
 - rust-lang#140197 (Document breaking out of a named code block)
 - rust-lang#140389 (Remove `avx512dq` and `avx512vl` implication for `avx512fp16`)
 - rust-lang#140430 (Improve test coverage of HIR pretty printing.)
 - rust-lang#140507 (rustc_target: RISC-V: feature addition batch 3)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 5a58c7a into rust-lang:master May 2, 2025
6 checks passed
@rustbot rustbot added this to the 1.88.0 milestone May 2, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request May 2, 2025
Rollup merge of rust-lang#140159 - thaliaarchi:pathbuf-extension, r=workingjubilee

Avoid redundant WTF-8 checks in `PathBuf`

Eliminate checks for WTF-8 boundaries in `PathBuf::set_extension` and `add_extension`, where joining WTF-8 surrogate halves is impossible. Don't convert the `str` to `OsStr`, because `OsString::push` specializes to skip the joining when given strings.

To assist in this, mark the internal methods `OsString::truncate` and `extend_from_slice` as `unsafe` to communicate their safety invariants better than with module privacy.

Similar to rust-lang#137777.

cc `@joboet` `@ChrisDenton`
@thaliaarchi thaliaarchi deleted the pathbuf-extension branch May 3, 2025 03:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants