incr.comp.: Improve caching efficiency by handling spans in a more robust way




The Problem
-----------

Source location information (which, for simplicity, I'll just call "spans" from here on out) are the bane of incremental compilation's existence. Why is that? Unlike most other kinds of frequent changes done to source code, changing spans has (seemingly) non-local effects.

As an example, let's first consider a "regular" change of a program, like turning a `+` in an expression into a `-`. This change means that the function containing the expression has to be re-analyzed and the object file it was instantiated in has to be re-compiled. So far, so expected. Now, in contrast, consider a change that affects the span of a function, like adding a line with a comment to it. At first glance, it looks like we haven't really changed anything -- we just added a comment after all -- but that's not true. Spans are part of the HIR, the MIR, ScopeTrees, and (via debuginfo and panic messages) even LLVM IR and object files. So, adding a comment to a function will *legitimately* cause the function and its containing object file to be re-compiled. That's a bit unexpected and sad, but how is it "non-local"? 

Remember that we added a new line with a comment to a function, thus changing the span of the function. What I didn't explicitly mention was that by adding this line, we shifted down *everything following that line* in the same source file, thus changing not only one function but potentially dozens of functions and type definitions. That's what I described as "non-local" effects (or rather "seemingly" non-local because shifting everything by a line is a legitimate, real change to everything that has been shifted, it's just easy to overlook).

"That's horrific!", you say, "We have to do something about it!" Well, I'm glad you think so too.



What can we do?
---------------

As stated above, the changes and the invalidation throughout the incr.comp. cache that they cause are legitimate. They are not false positives, since changing the source location of a function really changes (for example) the MIR of that function. So we cannot just be smarter about tracking spans or ignore them altogether. However, what we can do is refactoring the representation of HIR, MIR, etc, so that they don't actually contain spans anymore. The span information has to be somewhere, and we still have to be able to map various things in HIR, MIR, etc to spans, but spans can be split out into separate tables. As a consequence, HIR, MIR, and ScopeTrees will be represented in a way that is impervious to changes that don't affect their *structure*.

One way to achieve this (and the only way I know) is to introduce the concept of "abstract spans". An abstract span does not directly contain a source location, but it identifies a source location uniquely. For example, if we store all spans in a side table then the abstract span would be the key to this table. For this to bring any improvement over the current situation, an abstract span must be stable across changes that don't affect the structure of the thing containing it. E.g. shifting down a function by a line can change the thing the abstract span points to, but the value of the abstract span itself must not change. (This is simple to achieve by using a scheme that is similar to what we already do for `HirId`. Implementing it without increasing memory requirements is harder).


Implementation Strategies
-------------------------

There are a few prerequisites for the implementation:

- Span information must be tracked by a different `DepNode` than `Hir`, `HirBody`, `OptimizedMir`, etc, which implies that it must not be directly accessible from any of the data covered by these `DepNodes`. 
- Span information must still be tracked, but in contrast to the current situation, we only want to depend on it when the information is actually used.
- Abstract spans must be stable.

These goals can be achieved by:
- Splitting out span information during HIR lowering and storing it separately. I imagine having one table per `HirId::owner` that then corresponds to one `DepNode`, making spans be tracked at the same granularity as HIR items.
- Replacing `Span` fields with abstract spans in HIR, MIR, etc. This will mean quite a bit of refactoring everywhere these spans are used (as opposed to just being copied around)

Alternatively, this could also be achieved by:
- Making the `CodeMap` inaccessible from queries and generating a map from `Span` value to `DepNode` during HIR lowering, thus effectively making the existing `Span` type abstract. 
- Providing a query that allows to decode a `Span` to its contents.
- Making sure that none of the error reporting APIs take `Spans` directly.

I lean a bit towards alternative (1) but it's hard to gauge which one will lead to cleaner, more robust code in the end. Solution (1) would have a risk of false positives (too much invalidation), while solution (2) has the risk of false negatives (changes not detected) because existing APIs present tracking holes. Not detecting changes seems like the worse problem.

Regardless of the implementation, we will have to store additional tables in crate metadata that allow mapping from abstract spans to regular spans for upstream crates.  


Abstract Span Representation
----------------------------

Ideally, an abstract span would not take up more space than one `u32`, which is how much space a `Span` takes up. One way to achieve this would be by making abstract spans be `struct SpanId(ast::NodeId)`. Mapping from `SpanId` to `Span` would then involve mapping from `NodeId` to `HirId`, taking the `HirId::owner` to identify the correct side table *and* `DepNode`, and then the `HirId::local_id` as key into the side table. However, this only works for the current crate. In order for this to work across crates, we would either have to make `SpanId` also contain a `CrateNum` (thus doubling its size to 8 bytes), or implement a `NodeId` remapping scheme, similar to what we do for imported `FileMaps` and formerly already had for AST "inlining". With the latter in place we might be able to remove the `HirId` from some of the HIR structs again, which would help amortize its implementation effort.

`NodeId`-based abstract spans have the restriction of only being able to represent things that have a `NodeId`. However, that should be easily solved by assigning `NodeId`s to things that at the moment have a `Span` but no `NodeId`. 

`NodeId`-based abstract spans have the advantage that HIR structs would not have to store a separate span field. The `SpanId` could be generated from the already available `NodeId`. 

Abstract spans could be implemented completely separately from `NodeId` and `HirId` but there's probably little advantage to doing so while quite a bit of new infrastructure would have to be put into place.


Guarding against Regressions
----------------------------
After putting so much effort into using abstract spans, we'll want to avoid that vanilla `Span` values make their way into query results again. Luckily this should be easily achievable by adding an assertion to the `HashStable` implementation for `Span` that makes sure we don't encounter unexpected invocations. 


Abstract Spans Level 2
----------------------

The first goal would be to use abstract spans in everything up until and including MIR. An even more ambitious goal would be to also use abstract spans in cached LLVM IR and/or object files. That might allow us to skip re-optimizing code and just patch up source locations (if it's really just spans that have changed -- detecting that is another challenge).


Call for feedback
-----------------
Since this will probably result in quite a few changes, I'd like to get some feedback before jumping into an implementation. Here are some guiding questions:

 - Did I explain the problem properly?
 - Do you know an alternative to span abstraction for solving the problem?
 - Which of the two implementation approaches would you choose?
 - Is there a better way of implementing abstract spans?

Any kind of feedback is welcome!

cc @rust-lang/compiler 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incr.comp.: Improve caching efficiency by handling spans in a more robust way #47389

The Problem

What can we do?

Implementation Strategies

Abstract Span Representation

Guarding against Regressions

Abstract Spans Level 2

Call for feedback

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

incr.comp.: Improve caching efficiency by handling spans in a more robust way #47389

Description

The Problem

What can we do?

Implementation Strategies

Abstract Span Representation

Guarding against Regressions

Abstract Spans Level 2

Call for feedback

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions