Skip to content

Import small cold functions #82980

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 26, 2021
Merged

Conversation

tmiasko
Copy link
Contributor

@tmiasko tmiasko commented Mar 10, 2021

The Rust code is often written under an assumption that for generic
methods inline attribute is mostly unnecessary, since for optimized
builds using ThinLTO, a method will be code generated in at least one
CGU and available for import.

For example, deref implementations for Box, Vec, MutexGuard, and
MutexGuard are not currently marked as inline, neither is identity
implementation of From trait.

In PGO builds, when functions are determined to be cold, the default
multiplier of zero will stop the import, no matter how trivial the
implementation.

Increase slightly the default multiplier from 0 to 0.1.

r? @ghost

@tmiasko
Copy link
Contributor Author

tmiasko commented Mar 10, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 10, 2021
@bors
Copy link
Collaborator

bors commented Mar 10, 2021

⌛ Trying commit 95b36e9bc49fc971dbbbc328f77136be7f753b3b with merge df0df906bd16e3d83f3f52a984101dc638c848e3...

@bors
Copy link
Collaborator

bors commented Mar 10, 2021

☀️ Try build successful - checks-actions
Build commit: df0df906bd16e3d83f3f52a984101dc638c848e3 (df0df906bd16e3d83f3f52a984101dc638c848e3)

@rust-timer
Copy link
Collaborator

Queued df0df906bd16e3d83f3f52a984101dc638c848e3 with parent a4d9624, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (df0df906bd16e3d83f3f52a984101dc638c848e3): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Mar 10, 2021
@tmiasko
Copy link
Contributor Author

tmiasko commented Mar 10, 2021

The perf result look quite promising, with small improvements in instruction counts, wall-time and max-rss. The change also reduced size of librustc_driver.so from 102.5 MiB to 97.1 MiB.

The main premise behind the change is that with PGO, we will have cold functions with negative inline cost, and we should continue inlining them. I don't know if this is something we would like to land in this form, but could be interesting from Rust PGO perspective, so

r? @michaelwoerister

The Rust code is often written under an assumption that for generic
methods inline attribute is mostly unnecessary, since for optimized
builds using ThinLTO, a method will be generated in at least one CGU and
available for import.

For example, deref implementations for Box, Vec, MutexGuard, and
MutexGuard are not currently marked as inline, neither is identity
implementation of From trait.

In PGO builds, when functions are determined to be cold, the default
multiplier of zero will stop the import, even for completely trivial
functions.

Increase slightly the default multiplier from 0 to 0.1 to import them
regardless.
@tmiasko tmiasko force-pushed the import-cold-multiplier branch from 95b36e9 to 1aee808 Compare March 11, 2021 11:38
@tmiasko tmiasko changed the title Use small non-zero import-cold-multiplier Import small cold functions Mar 11, 2021
@michaelwoerister
Copy link
Member

I'll take a look at this next week. Thanks for the PR, @tmiasko!

@michaelwoerister
Copy link
Member

Hm, the numbers look like an improvement overall. However, I'm wondering, in the PGO case, wouldn't the compiler have accurate information about the coldness of a function? In which case it should actually be beneficial not to import cold functions?

On the other hand, as you say, a multiplier of 0.1 would be a strong discouragement for importing a given function while still allowing a decision to be made based on other factors such as size.

I think we can just merge this. It's just a small tweak and does not seem to a negative impact on LLVM times.

@bors r+

@bors
Copy link
Collaborator

bors commented Mar 26, 2021

📌 Commit 1aee808 has been approved by michaelwoerister

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 26, 2021
@bors
Copy link
Collaborator

bors commented Mar 26, 2021

⌛ Testing commit 1aee808 with merge e423058...

@bors
Copy link
Collaborator

bors commented Mar 26, 2021

☀️ Test successful - checks-actions
Approved by: michaelwoerister
Pushing e423058 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 26, 2021
@bors bors merged commit e423058 into rust-lang:master Mar 26, 2021
@rustbot rustbot added this to the 1.53.0 milestone Mar 26, 2021
@tmiasko
Copy link
Contributor Author

tmiasko commented Mar 26, 2021

Inlining trivial functions often reduces caller code size, and inlining heuristics for cold functions try to avoid inlining otherwise, so yes it should be beneficial even with completely accurate profile information. For example, an error handling code path using ? taken from rustc MIR interpreter, before:

             → callq  *0x20446f5(%rip)        # 3e1a678 <<rustc_middle::mir::interpret::error::InterpErrorInfo as core::convert::From<rustc_middle::mir::interpret::error::InterpError>>::from@@Base+0x1ba93a8>
               mov    %rax,%rdi
             → callq  <T as core::convert::From<T>>::from
               mov    %rax,%rdi
        1ce: → callq  <T as core::convert::From<T>>::from
               mov    %rax,%rdi
             → callq  <T as core::convert::From<T>>::from
               mov    $0x1,%sil
             ↑ jmpq   115     

and after:

             → callq  *0x1f22ee0(%rip)        # 3a0ae68 <<rustc_middle::mir::interpret::error::InterpErrorInfo as core::convert::From<rustc_middle::mir::interpret::error::InterpError>>::from@@Base+0x1a80f38>
        1c8:   mov    $0x1,%sil
             ↑ jmpq   116     

@tmiasko tmiasko deleted the import-cold-multiplier branch March 26, 2021 14:57
@rylev
Copy link
Member

rylev commented Apr 1, 2021

@tmiasko @michaelwoerister Can you talk a bit more about the performance regressions seen here? Some of the regressions are somewhat severe, so it would be nice to talk through them a bit more. Why is the performance regression worth it?

@michaelwoerister
Copy link
Member

The main change here is that we allow LLVM to import functions across module boundaries during ThinLTO even if a given function has been marked as cold. This allows LLVM to inline trivial functions (like the example in the comment above), leading to an overall reduction in code size. LLVM still has the ability to not do inlining for such functions based on its usual heuristics (in which case the effort put into import the function was wasted). The change should lead to an overall quality improvement for all machine code being produced via ThinLTO (i.e. currently all --release done via cargo).

Because this PR re-jiggers LLVM inlining decisions it can be expected to have a mixed influence on compile times. Since that effect seems to be positive overall (with the exception of two small synthetic test cases) I think this change is a net win.

@Mark-Simulacrum
Copy link
Member

Thanks @michaelwoerister! That makes sense to me. Would it be worth gating this on --release only?

@michaelwoerister
Copy link
Member

No, I don't think that would make a difference since all the non-release builds don't run ThinLTO anyway. The change only affects one of the ThinLTO phases.

@Mark-Simulacrum
Copy link
Member

Makes sense. OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants