Discard LLVM modules earlier when performing ThinLTO #56487
Conversation
@bors: try

Whoa, this is a great idea! This seems like the correct strategy for incremental too, although I forget if that takes different paths. This also makes me think that we should link in fat LTO ASAP instead of synchronizing and then linking, as ideally that would allow a bit of pipelining. In any case, that's a patch for another time! I'll take a closer look later, but I'm curious about the perf impact here too; in theory it should both make builds (slightly) faster and decrease peak memory usage.
⌛ Trying commit 94131ebd21ed6d64acdeae6d4766c5669414c488 with merge 360659bd585b2529141e8fc3228fc3bf2e1fa6d1...

☀️ Test successful - status-travis
@rust-timer build 360659bd585b2529141e8fc3228fc3bf2e1fa6d1 |
Success: Queued 360659bd585b2529141e8fc3228fc3bf2e1fa6d1 with parent 0c999ed, comparison URL.
Finished benchmarking try commit 360659bd585b2529141e8fc3228fc3bf2e1fa6d1 |
The max-rss results basically look like noise to me. The wall-time numbers seem to be a minor win for opt builds (a few percent for clean/baseline-incremental). I guess that means that LLVM memory usage is dominated by rustc memory usage, at least for the build types used here (opt w/o debuginfo), so it has no impact on max-rss. Unfortunately I was not able to get massif to run.
I've got massif to run (need to directly call the jemalloc-free rustc, no …).

Before: [massif graph]

After: [massif graph]
The first hump is peak optimization, the second hump is peak LTO. So in this case peak memory usage is moved from the optimization stage to the LTO stage, but it ultimately does not make much of a difference.

Ah, bummer! In any case this looks good to me, and the graphs do confirm a shift in peaks, so this seems good to land. r=me with a rebase!
Instead of only determining whether some form of LTO is necessary, determine whether thin, fat or no LTO is necessary. I've rewritten the conditions in a way that I think is more obvious, i.e. specified LTO type + additional preconditions.
These are going to have different intermediate artifacts, so create separate codepaths for them.
Fat LTO merges into one module, so only return one module.
Force-pushed from 94131eb to 96cc381
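To make the shape of the first commit above concrete, here is a minimal, self-contained sketch of computing an explicit three-way LTO decision rather than a single "is LTO needed?" flag. All names (`LtoMode`, `compute_lto_mode`, the precondition flags) are invented for illustration, and the conditions are heavily simplified compared to the real compiler.

```rust
// Hypothetical, simplified names -- not the actual rustc internals.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LtoMode {
    No,
    Thin,
    Fat,
}

/// "Specified LTO type + additional preconditions": map the requested mode
/// and a couple of (made-up) preconditions to the mode that will actually run.
fn compute_lto_mode(requested: LtoMode, crate_type_allows_lto: bool, single_codegen_unit: bool) -> LtoMode {
    if !crate_type_allows_lto {
        return LtoMode::No;
    }
    match requested {
        LtoMode::Fat => LtoMode::Fat,
        // In this toy model, ThinLTO over a single codegen unit has nothing
        // to link across, so fall back to no LTO.
        LtoMode::Thin if single_codegen_unit => LtoMode::No,
        LtoMode::Thin => LtoMode::Thin,
        LtoMode::No => LtoMode::No,
    }
}

fn main() {
    assert_eq!(compute_lto_mode(LtoMode::Thin, true, false), LtoMode::Thin);
    assert_eq!(compute_lto_mode(LtoMode::Thin, false, false), LtoMode::No);
    assert_eq!(compute_lto_mode(LtoMode::Fat, true, true), LtoMode::Fat);
    println!("all LTO mode decisions behaved as expected");
}
```

Representing the decision as an enum makes the later split between the fat and thin codepaths a straightforward `match` rather than a pair of booleans.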
@bors r=alexcrichton

📌 Commit 96cc381285c8b1d83aea776282232022ed949fd7 has been approved by alexcrichton
Instead of keeping all modules in memory until thin LTO and only serializing them then, serialize the module immediately after it finishes optimizing.
Force-pushed from 96cc381 to 8128d0d
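The commit above is the heart of the PR: serialize each module right after its optimization finishes, instead of holding every module in memory until a synchronized ThinLTO step. The following toy program models that flow with invented types (`OptimizedModule`, `ThinLtoBuffer`); it is a sketch of the scheduling idea, not the actual rustc worker code.

```rust
use std::thread;

// Stand-in for an optimized in-memory LLVM module (deliberately "large").
struct OptimizedModule {
    name: String,
    ir: Vec<u8>,
}

// Stand-in for the compact, serialized ThinLTO buffer.
struct ThinLtoBuffer {
    name: String,
    bytes: Vec<u8>,
}

fn optimize(name: &str) -> OptimizedModule {
    OptimizedModule {
        name: name.to_string(),
        ir: vec![0u8; 1 << 20], // pretend this is a big optimized module
    }
}

fn serialize(module: OptimizedModule) -> ThinLtoBuffer {
    // Pretend serialization produces a much smaller artifact. Because
    // `module` is taken by value, its large `ir` allocation is dropped as
    // soon as this function returns.
    let bytes: Vec<u8> = module.ir.iter().take(64).copied().collect();
    ThinLtoBuffer { name: module.name, bytes }
}

fn main() {
    let names = ["cgu.0", "cgu.1", "cgu.2"];

    // Each worker serializes its module immediately after optimizing it,
    // so only the small buffers survive until the ThinLTO phase.
    let handles: Vec<_> = names
        .iter()
        .map(|&name| {
            thread::spawn(move || {
                let module = optimize(name);
                serialize(module)
            })
        })
        .collect();

    let buffers: Vec<ThinLtoBuffer> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    for buffer in &buffers {
        println!("{}: {} serialized bytes", buffer.name, buffer.bytes.len());
    }
}
```

The point of the restructuring is visible in the worker closure: the full module never escapes the worker, so peak memory during the LTO phase is dominated by buffers rather than in-memory modules.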
@bors r=alexcrichton Rebase mistake with submodules...

📌 Commit 8128d0d has been approved by alexcrichton
Discard LLVM modules earlier when performing ThinLTO

Currently ThinLTO is performed by first compiling all modules (and keeping them in memory), and then serializing them into ThinLTO buffers in a separate, synchronized step. Modules are later read back from ThinLTO buffers when running the ThinLTO optimization pipeline. We can also find the following comment in `lto.rs`:

```rust
// FIXME: right now, like with fat LTO, we serialize all in-memory
// modules before working with them and ThinLTO. We really
// shouldn't do this, however, and instead figure out how to
// extract a summary from an in-memory module and then merge that
// into the global index. It turns out that this loop is by far
// the most expensive portion of this small bit of global
// analysis!
```

I don't think that what is suggested here is the right approach: one of the primary benefits of using ThinLTO over ordinary LTO is that it's not necessary to keep all the modules (merged or not) in memory for the duration of the linking step. However, we currently don't really make use of this (at least for crate-local ThinLTO), because we keep all modules in memory until the start of the LTO step. This PR changes the implementation to instead perform the serialization into ThinLTO buffers directly after the initial optimization step.

Most of the changes here are plumbing to separate out fat and thin LTO handling in `write.rs`, as these now use different intermediate artifacts: for fat LTO this will be in-memory modules, for thin LTO it will be ThinLTO buffers.

r? @alexcrichton
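Since fat and thin LTO now flow different intermediate artifacts through the backend, the split can be pictured as a small sum type. This is only an illustrative sketch with made-up names (`LtoInput`, `InMemoryModule`, `SerializedBuffer`), not the PR's actual types in `write.rs`.

```rust
// Invented types for illustration of the fat/thin split.
struct InMemoryModule {
    name: String,
}

struct SerializedBuffer {
    name: String,
    bytes: Vec<u8>,
}

// The per-module intermediate artifact depends on the kind of LTO: fat LTO
// still needs the whole module so everything can be merged into one, while
// ThinLTO only needs the serialized buffer.
enum LtoInput {
    Fat(InMemoryModule),
    Thin(SerializedBuffer),
}

fn run_lto(inputs: Vec<LtoInput>) {
    let mut fat_modules = Vec::new();
    let mut thin_buffers = Vec::new();
    for input in inputs {
        match input {
            LtoInput::Fat(module) => fat_modules.push(module),
            LtoInput::Thin(buffer) => thin_buffers.push(buffer),
        }
    }
    // Fat LTO would now merge `fat_modules` into a single module; ThinLTO
    // would build its combined index from `thin_buffers` and optimize each
    // buffer independently, without the original modules in memory.
    for module in &fat_modules {
        println!("fat LTO input (in-memory module): {}", module.name);
    }
    for buffer in &thin_buffers {
        println!("thin LTO input (buffer): {} ({} bytes)", buffer.name, buffer.bytes.len());
    }
}

fn main() {
    // A real session uses one LTO kind; both variants appear here only to
    // exercise the two codepaths.
    run_lto(vec![
        LtoInput::Fat(InMemoryModule { name: "cgu.0".into() }),
        LtoInput::Thin(SerializedBuffer { name: "cgu.1".into(), bytes: vec![0; 64] }),
    ]);
}
```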
☀️ Test successful - status-appveyor, status-travis