Description
I recently discovered this funny little intrinsic with the great comment:
Emits a !nontemporal store according to LLVM (see their docs). Probably will never become stable.
Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like _mm_stream_ps.
Why is that a problem? Well, turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:
static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);

thread::spawn(|| {
    while INIT.load(Acquire) == false {}
    assert_eq!(DATA, 42); // can this ever fail? that would be bad
});

nontemporal_store(&mut DATA, 42);
INIT.store(true, Release);
The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:
- Peter Cordes' answer here: "A mutex unlock on x86 is sometimes a lock add, in which case that's a full fence for NT stores already. But if you can't rule out a mutex implementation using a simple mov store then you need at least sfence at some point after NT stores, before unlock."
- glibc fixing their memcpy (which uses nontemporal stores) to have a trailing sfence.
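To make the point from these sources concrete, here is a rough sketch of the example above with an sfence placed between the non-temporal store and the release store. It assumes x86_64 and uses the stable vendor intrinsics `_mm_stream_si32` and `_mm_sfence` in place of the unstable `nontemporal_store`; the helper names `nt_store_fenced` and `publish_and_observe` are made up for illustration, and on other targets the helper just falls back to a plain store.

```rust
use std::ptr::{addr_of, addr_of_mut};
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static mut DATA: i32 = 0;
static INIT: AtomicBool = AtomicBool::new(false);

/// Hypothetical helper: non-temporal store followed by an sfence, so the
/// store is ordered before any later store (such as a release store).
#[cfg(target_arch = "x86_64")]
unsafe fn nt_store_fenced(dst: *mut i32, value: i32) {
    use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
    unsafe {
        _mm_stream_si32(dst, value); // compiles to MOVNTI
        _mm_sfence(); // orders the NT store before subsequent stores
    }
}

/// Fallback for other targets (assumption: a plain store is good enough
/// for this illustration, since there is no MOVNT to worry about).
#[cfg(not(target_arch = "x86_64"))]
unsafe fn nt_store_fenced(dst: *mut i32, value: i32) {
    unsafe { dst.write(value) }
}

/// Publishes 42 through DATA and returns what the reader thread observed.
fn publish_and_observe() -> i32 {
    let reader = thread::spawn(|| {
        // Wait until the writer signals that DATA is initialized.
        while !INIT.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // Read through a raw pointer to avoid taking a reference to a static mut.
        unsafe { addr_of!(DATA).read() }
    });

    unsafe { nt_store_fenced(addr_of_mut!(DATA), 42) };
    INIT.store(true, Ordering::Release);

    reader.join().unwrap()
}
```

Without the `_mm_sfence()` call, this is the same shape as the failing example: nothing would stop the MOVNTI from becoming visible after the plain MOV that implements the release store.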
This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.
So what could be done?
- Remove nontemporal_store, implement the _mm_stream intrinsics without it, and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead and then it is their responsibility to have an sfence at the end of their asm block to restore expected synchronization behavior.
- Change the way release stores are compiled such that an sfence is emitted. This is mostly a theoretical option though: this is an ABI-breaking change; at least, my understanding of the x86 ABI is that a regular mov is considered to be sufficient synchronization. To make this work reliably all compilers for all languages that have something like release writes need to emit the sfence, or else Rust code cannot be soundly linked with code produced by those other compilers.
- There's a third hypothetical option of adjusting our concurrency memory model to be able to support these operations. But that's (a) a huge design space -- which fences are supposed to interact how with this? We might even have to add a new kind of fence! And (b) modifying the C++ concurrency memory model should be done with utmost care and thorough formal analysis; the current model is the result of a decade's worth of research that shouldn't be thrown away lightly.
- Just ignore the problem and hope it doesn't explode? I am not happy with this.
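For the first option, the inline-assembly route could look roughly like the sketch below. This is only an illustration under assumptions: it targets x86_64, `stream_store_u64` is a hypothetical name, and the non-x86_64 fallback is a plain store added so the snippet compiles everywhere. The key point is that the sfence lives inside the same asm block as the MOVNTI, so by the time the block ends the usual release/acquire reasoning applies again.

```rust
/// Hypothetical helper: store `value` to `dst` with MOVNTI, then SFENCE,
/// all inside one asm block.
#[cfg(target_arch = "x86_64")]
unsafe fn stream_store_u64(dst: *mut u64, value: u64) {
    use std::arch::asm;
    unsafe {
        asm!(
            "movnti [{dst}], {val}", // non-temporal 64-bit store
            "sfence",                // order it before any later store
            dst = in(reg) dst,
            val = in(reg) value,
            // The block writes memory, so neither `nomem` nor `readonly` is given.
            options(nostack, preserves_flags),
        );
    }
}

/// Fallback for other targets (assumption: plain store, illustration only).
#[cfg(not(target_arch = "x86_64"))]
unsafe fn stream_store_u64(dst: *mut u64, value: u64) {
    unsafe { dst.write(value) }
}
```

A caller would use it like any other raw-pointer write, for example `unsafe { stream_store_u64(&mut x, 7) }`, and can follow it with an ordinary release store.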
Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.
Cc @rust-lang/lang @Amanieu
Also see the nomination comment here.