-
Notifications
You must be signed in to change notification settings - Fork 60
Area proposal: validity invariants #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
# Data type validity requirements | ||
|
||
This discussion is meant to focus on the question: Which invariants derived from | ||
types are there that the compiler expects to be *always* maintained, and | ||
(equivalently) that unsafe code must *always* uphold (or else cause undefined | ||
behavior)? This is what is called "validity invariant" in | ||
[Ralf's blog post](https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html), | ||
but we might also decide to change that name. | ||
|
||
### Interactions and constraints | ||
|
||
Choices of invariants interact, in particular, with layout optimizations: For | ||
example, the fact that `Option<&T>` is pointer-sized relies on the fact that the | ||
validity invariant for `&T` rules out `0x0`, and hence we can use that value as | ||
signaling the `None` case. | ||
|
||
Moreover, the invariants are constrained by attributes that we emit when | ||
generating LLVM IR. For example, we emit `aligned` attributes pretty much any | ||
time we can, which means it is probably a good idea to say that valid references | ||
must be aligned. | ||
|
||
Finally, another consideration to take into account is that ruling out certain | ||
behavior can be great for bug finding. For example, if arithmetic overflow is | ||
defined to have two's-complement-behavior, then bug finding tools can no longer | ||
use overflow as an indication of a software bug. (This is a real problem with | ||
unsigned integer arithmetic in C/C++.) | ||
|
||
### Possible bit patterns | ||
|
||
The validity invariant of a type is, basically, a set of bit patterns that is | ||
allowed to occur at that type. ("Basically" because the invariant may also be | ||
allowed to depend on memory.) To discuss this properly, we need to first agree | ||
on what "bit patterns" even are. It is not enough to just consider sequences of | ||
0 and 1, because we also need to take uninitialized data into account. For the | ||
purpose of this discussion, I think it is sufficient to consider every bit as | ||
being either 0, 1 or uninitialized. | ||
[That is not always sufficient](https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html), | ||
but I think we can mostly ignore the extra complications introduced by pointer | ||
values. | ||
gnzlbg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
In terms of comparing with C, the "uninitialized" bit corresponds to what C | ||
calls "indeterminate" data. In particular, it is allowed to be a "trap | ||
representation". Also, observing the same indeterminate data multiple times is | ||
allowed to yield different results. That's why I am proposing we treat it as a | ||
third state a bit can be in. | ||
|
||
In terms of LLVM, the "uninitialized" bit corresponds to `poison`. It is *not* | ||
the same as `undef`! See | ||
[this paper](https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf) for some | ||
more material on the topic. | ||
|
||
### Extent of "always" | ||
|
||
One point we will have to figure out is what exactly "always" means. Thinking | ||
in terms of a semantics for MIR, data most probably needs to be valid any time | ||
it is copied, which primarily happens when executing assignment statements (the | ||
other cases are passing of function arguments and return values). However, it | ||
is less clear whether merely creating a place without accessing the data inside | ||
(such as in `&*x`) should require the data to be valid. | ||
|
||
The entire discussion here is only about validity invariants that have to hold | ||
when the compiler considers a variable initialized. For example, `let b: bool;` | ||
is completely okay to not be initialized because the compiler knows about that; | ||
`let b: bool = mem::uninitialized();` however copies uninitialized data at type | ||
`bool` and hence violates `bool`'s validity invariant. | ||
|
||
## Goals | ||
|
||
* For every primitive type, determine which assumptions (if any) the compiler | ||
makes about values *not* occurring at that type (serving as a lower bound for | ||
what to declare invalid), and determine which popular patterns in unsafe code | ||
might create "interesting" values of this type that safe code cannot create on | ||
its own (serving as an upper bound for how much we want to declare invalid). | ||
Both of these bounds are soft, but informative. | ||
* Based on that, map out a design space of invariants that seem reasonable. | ||
* Determine when exactly the validity invariant is assumed to hold. | ||
|
||
## Active threads | ||
|
||
To start, we will create threads for each major category of types. | ||
|
||
* Integers and floating point types | ||
* Do we allow values that contain uninitialized bits? If yes, what are the | ||
rules for arithmetic and logical operations involving uninitialized bits, | ||
e.g. in cases like `x * 0`? There is also some interaction with bug finding | ||
here: tools can only flag uninitialized data at integer type as a bug if we | ||
do not allow that to happen in unsafe code. | ||
|
||
* Raw pointers | ||
* Do we allow values that contain uninitialized bits? | ||
* Are there any requirements on the metadata? | ||
|
||
* References | ||
* Presumably, references must be non-NULL. | ||
* They probably also must be aligned, but is that required every time a | ||
reference is taken? Also see the [ongoing discussion in RFC 2582][RFC2582]. | ||
* Can there ever be uninitialized bits in a reference? | ||
* Do they have to be dereferencable? What exactly does that even mean? | ||
* Does `&[mut] T` have to point to data that is valid at `T`? This interacts | ||
with the question of whether `&*x` is allowed when `x` is a well-aligned | ||
non-null dereferencable pointer that points to invalid data. | ||
* Out of scope: aliasing rules | ||
|
||
* Function pointers | ||
* Presumably, these must be non-NULL. Anything else? Can there ever be | ||
uninitialized bits? | ||
|
||
* Booleans | ||
* Is there anything to say besides: A `bool` must be `0x0` or `0x1`? Do we | ||
allow the remaining bits to be uninitialized? | ||
gnzlbg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
* Unions | ||
* Do we make any restrictions here, or are unions just "bags of bits" that may | ||
contain anything? That would mean we can do no layout optimizations. | ||
|
||
* Enums | ||
* Is there anything to say besides: The discriminant must be valid, and all | ||
fields of the active variant must be valid at their respective types? | ||
* The padding between fields can be anything, including uninitialized. | ||
gnzlbg marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
* Structs, tuples, arrays and all other aggregates (closures, ...) | ||
* Is there anything to say besides: All fields must be valid at their | ||
respective types? | ||
* The padding between fields can be anything, including uninitialized. It was | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are generator fields (and more generally generator contents) exposed to user code at all? If not, this doesn't seem like something UCG have to say anything about, just pure rustc implementation details. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
They are not, AFAIK. But I think when we are in the business of type invariants, we should not forget about types like closures or generators. Just because the user cannot write them directly, doesn't mean their invariants are not important to know -- and compiler devs also need guidance for how to approach validity invariants. And, of course, unsafe code could start to play games with these fields. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's certainly a related issue, but one that interacts with generator implementation details much more than with user-facing guarantees or semantics of Rust-the-language, so I'm not sure the group of people here is the best to discuss it. So I think it would be reasonable to punt on it to reduce the scope of the validity invariant discussion and get it done more quickly. But we could also open the issue for generators and maybe end up wrapping up the discussion without any decisions being made about generators.
Regardless of how generators end up being organized internally, I am quite certain they should be opaque, i.e., user code should not be allowed to poke at their contents in such a way. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I always reasoned about closures as being equivalent to a I would find it extremely weird if the same wouldn't be true for generators. That is, the exact type of the Generator would be unspecified, but this type would be a normal Rust type, e.g., a Otherwise, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That seems reasonable. And I agree unsafe code shouldn't mess with implementation details of closures or generators, but that doesn't follow from them being in any way different than other "user code should not be allowed to poke at their contents in such a way" also applies to many, many other data types -- to pick an example, |
||
[recently determined][generators-maybe-uninit] that generators behave | ||
different from other aggregates here. Are we okay with that? Should we push | ||
for generator fields to reflect this in their types? | ||
|
||
[RFC2582]: https://github.com/rust-lang/rfcs/pull/2582 | ||
[generators-maybe-uninit]: https://github.com/rust-lang/rust/pull/56100 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious whether there exist properties that unsafe code must always uphold but which it cannot rely upon. The idea would be: for now, we require that this is true, but we expect we may not require it forever. Technically, from the compiler's POV, we can tweak these predicates because that just makes less code UB -- but unsafe code could be relying on the fact that doing otherwise would be UB, and that could be a problem.
This seems like a more general thing -- it also e.g. arises in the reprsentation discussion -- but once we start getting into invariants, I expect we should think carefully about this distinction.
I would sort of like this to be noted somewhere as a kind of "cross-cutting concern" -- i.e., for each area, we are aiming to define these two predicates (what unsafe code must uphold, and what it can rely on) somewhat separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in general. We want some "wiggle room" in our definition of what unsafe code can do, certainly when it comes to the safety invariant. I am less sure for the validity invariant.
My main problem with this is that I don't have a good understanding yet of how this works formally. That's mostly because I did not spend enough time to think about this.