Skip to content

Soundness of std::sync::Once #65796

Closed
Closed
@pitdicker

Description

@pitdicker

While working on a refactor of std::sync::Once in #65719 I stumbled across multiple soundness issues. I now believe the current algorithm that queues waiting threads in a linked list with nodes an the stack of each waiting thread is not worth the complexity.

Thread may be parked forever

Edit: this is not a valid concern, see #65796 (comment).
Code:

while !node.signaled.load(Ordering::SeqCst) {
    // HERE
    thread::park();
}

The thread managing the waiter queue (thread 1) sets node.signaled and unparkes a thread.
The thread that wants to wait (thread 2) checks node.signaled before parking itself.
But at HERE thread 1 may set node.signaled and unpark thread 2 (which is not parked yet). Afterwards thread 2 will park itself, and never receive an unpark again.

This can be solved by using park_timeout. It does seems suboptimal to me though.

Aliasing of a mutable reference

Code

let me = &mut node as *mut Waiter as usize;
/* ... */
// the mutable reference to me is send to another thread
let old = self.state.compare_and_swap(state, me | RUNNING, Ordering::SeqCst);
/* ... */
// We are at the same time checking `node` ourselves
while !node.signaled.load(Ordering::SeqCst) { /* ... */ }

This can be solved by using shared references and interior mutability.

Use of a potentially dangling shared reference (#55005)

Code

(*queue).signaled.store(true, Ordering::SeqCst);

The waiting thread can free queue after signaled is set after a spurious wakeup. At this point store can in theory still hold a dangling reference.

There is not much the implementation of OnceCell can do here, but I suppose if will be solved in the same way as #55005.


This reason std::sync::Once does not go with the obvious implementation that uses a mutex, is because Mutex::new is not const. This is the explanation in the comments:

// So to implement this, one might first reach for a `Mutex`, but those cannot
// be put into a `static`. It also gets a lot harder with poisoning to figure
// out when the mutex needs to be deallocated because it's not after the closure
// finishes, but after the first successful closure finishes.
//
// All in all, this is instead implemented with atomics and lock-free
// operations! Whee! Each `Once` has one word of atomic state, and this state is
// CAS'd on to determine what to do.

The talk about atomics and lock-free make it sound like this implementation might be more optimal then using a mutex. But the park/unpark machinery currently relies on a mutex+condvar per thread. So instead of using one mutex, it is using as many as there are waiting threads.

Alternative implementation

While also a bit tricky, it is still possible to just use a mutex: encode a reference to an Arc<Mutex> in Once.state. The reference will be created when the Once starts running, and not be dropped until the Once is dropped.

I'd like to include the fixes in #65719, and make unother PR with an alternative implementation using a mutex. Also a thing to remember is that if #65719 ever lands, the standard library can just use the Once implementation from parking_lot.

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: This is a bug.T-libs-apiRelevant to the library API team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions