Skip to content

Tracking Issue for Incremental Compilation #47660

Open
@michaelwoerister

Description

@michaelwoerister

Incremental compilation will soon be available in stable Rust (with version 1.24) but there's still lots of room for improvement. This issue will try to give an overview of specific areas for potential optimization.

Compile Time

The main goal for incremental compilation is to decrease compile times in the common case. Here are some of things that affect it:

Caching Efficiency

The biggest topic for incremental compilation is: How efficiently are we using our caches. It can be further subdivided into the following areas:

Data Structure Stability

Caching efficiency depends on how the data in the cache is represented. If data is likely to change frequently then the cache is likely to be invalidated frequently. One example is source location information: Source location information is likely to change. Add a comment somewhere and the source location of everything below the comment has changed. As a consequence, everything in the cache that contains source location information is likely in need of frequent invalidation. It would be preferable to factor data structures in a way that confines this volatility to only few kinds of cache entries. The following issues track concrete plans to improve the situation here:

Object File Granularity and Partitioning

The problem of object file granularity can be viewed as a variation of "Data Structure Stability" but it is so important for compile times that I want to list it separately. At the moment the compiler will put machine code that comes from items in the same source-level module into the same object file. This means that changing one function in a module will lead to all functions in that module to be re-compiled. Obviously this can lead to a large amount of redundant work. For full accuracy we could in theory have one object file per function - which indeed improves re-compile times a lot in many cases - but that would increase disk space requirements to an unacceptable level and makes compilation sessions with few cache hits much more expensive (TODO: insert numbers). And as long as we don't do incremental linking it might also make linking a bottleneck.

The main goal here is to re-compile as little unchanged code as possible while keeping overhead small. This is a hard problem and some approaches are:

  • Keep using a fixed partitioning scheme but improve the partitioning algorithm
  • Implement an adaptive scheme that reacts to set of changes a user makes
  • It might even be a good idea to write a simulator that allows to test different schemes and then feed it with actual data generated by an instrumented compiler.

Adaptive schemes would require us to re-think part of our testing and benchmarking infrastructure and writing a simulator is a big project of its own, so short term we should look into improved static partitioning schemes:

Another avenue for improvement of how we handle object files:

Whole-Cache Invalidation

Currently commandline arguments are tracked at a very coarse level: Any change to a commandline argument will completely invalidate the cache. The red-green tracking system could take care of making this finer grained but quite a lot of refactoring would need to happen in order to make sure that commandline arguments are only accessible via queries.

Note that this currently completely prevents sharing a cache between cargo check and cargo build.

Avoid Redundant Work

The compiler is doing redundant work at the moment. Reducing it will also positively affect incremental compilation:

Querify More Logic, Cache More Queries

We can only cache things that are represented as queries, and we can only profit from caching for things that are (transitively) cached. There are some obvious candidates for querification:

A more ambitious goal would be to querify name resolution and macro expansion.

Framework Overhead

Doing dependency tracking will invariably introduce some overhead. We should strive to keep this overhead low. The main areas of overhead are:

  • Building the dependency graph during compiler execution
  • Computing query result hashes and dependency node identifiers
  • Loading and storing the dependency graph from/to disk

Another thing that falls into this category is:

  • Efficiency of loading and storing cache entries

The following issues track individual improvements:

We should continuously profile common use cases to find out where we can further reduce framework overhead.

Disk Space Requirements

Build directories of large C/C++ and Rust code bases can be massive, oftentimes many gigabytes. Since incremental compilation has to keep around the previous version of each object file for re-use, plus LLVM IR, plus the dependency graph and query result cache, build directories can up to triple in size when incremental compilation is turned on (depending on which crates are compiled incrementally). The best way to reduce cache size is to reduce the amount of translated code that we need to cache. Solving #47317 and #46477 would help here. MIR-only RLIBs (#38913), which are one way to solve #47317, might also obviate the need to cache LLVM IR at all.

Runtime Performance of Incrementally Compiled Code

Currently delivering good runtime performance for incrementally compiled code is only a side goal. If incrementally compiled code is "fast enough" we rather try to improve compile times. However, since ThinLTO also supports an incremental mode, we could provide a middle ground between "re-compile everything for good performance" and "re-compile incrementally and only get 50% performance".


If you have any further ideas or spot anything that I've missed, let me know in the comments below!

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-incr-compArea: Incremental compilationC-tracking-issueCategory: An issue tracking the progress of sth. like the implementation of an RFCS-tracking-impl-incompleteStatus: The implementation is incomplete.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions