Skip to content

Semantics of MIR assignments, around aliasing, ordering, and primitives. #68364

Open
@eddyb

Description

@eddyb

In today's MIR, an indirect assignment like *p = *q; is similar to, but not exactly the same as:

tmp = *q;
*p = tmp;

The differences are:

  • performance: they only produce the same codegen for primitives
    • this is assuming tmp isn't used elsewhere, allowing codegen to treat it like an SSA value, resulting in store(p, load(q)), which is also what *p = *q codegens to
    • for non-primitives, the amount of data being copied doubles, as tmp must be in memory
  • correctness: only *p = *q; is UB (AFAIK) if p..p+size overlaps q..q+size
    • this also likely only affects non-primitives, which have to copy data in memory, but we could decide to ignore the type and always make it UB when they overlap

For the purposes of this discussion, a primitive is:

  • scalar (bool, char, integer, float, or pointer/reference)
  • vector (SIMD-enabled array of scalars)

Scalar pairs likely also should/need to be included, due to how easy they are to support in any code that already handles scalars, and also due to their use in wide pointers/references.


What's interesting about primitives, though, is that some kinds of Rvalues (the RHS of the assignment) always produce primitive values, because they're primitive operations.

The Rvalue variants which are always primitive, today, are:

  • Ref (&T / &mut T - may become dependent on custom DSTs in the future)
  • AddressOf (*const T / *mut T - may become dependent on custom DSTs in the future)
  • Len (usize)
  • Cast, other than unsizing (scalar)
  • BinaryOp, UnaryOp (scalar, or maybe also vector)
  • CheckedBinaryOp (pair of integer and bool - only if we consider scalar pairs to be primitive)
  • NullaryOp(SizeOf) (usize)
  • NullaryOp(Box) (Box<T>)
  • Discriminant (integer)

Which leaves these variants as potentially relying on memory operations to write the result:

  • Use (any type, one copy)
  • Repeat ([T; N], N copies)
  • Cast, specifically unsizing (any type implementing CoerceUnsized, per-field copies)
  • Aggregate (any ADT, per-field copies)

If we want to remain conservative, we could ignore types for the latter, and just assume that the destination of the assignment cannot overlap any memory read in the Operands of the Rvalue.

We could even cement the distinction by moving the always-primitive operations into a new PrimOp enum, and/or move the other Rvalues to their own statements (e.g. introduce Copy(*p, *q)), but that's more aesthetic than anything for the time being.

At the very least, we should probably document these differences, and make sure that miri only allows overlaps in the cases we don't consider UB (either abstractly, or due to our choice of codegen).


Another interesting aspect of the always-primitive ops is that they're "pure functions" of their operands (other than NullaryOp(Box), I suppose, but that could be replaced with a call to a lang item returning a Box<MaybeUninit<T>>, instead).

This means that if we wanted to, we could replace some of the intermediary locals with an PrimOp DAG, a bit like SSA but without φ (phi) nodes or a strict instruction stream.
All of the necessary ordering would still happen at the statement level (so this is nowhere near as complex as VSDG), but we might see some benefits in scalar-heavy code.


Asides aside, cc @RalfJung @rust-lang/wg-mir-opt


The latest proposal is here

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-MIRArea: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.htmlA-miriArea: The miri toolT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.T-langRelevant to the language team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions