Description
Feature or enhancement
Implementing PEP 703 will require adding fine-grained locks and other synchronization mechanisms. For good performance, it's important that these locks be "lightweight" in the sense that they don't take up much space and don't require memory allocations to create. It's also important that these locks are fast in the common uncontended case, perform reasonably under contention, and avoid thread starvation.
Platform-provided mutexes like pthread_mutex_t are large (40 bytes on x86-64 Linux), and our current cross-platform wrappers ([1], [2], [3]) require additional memory allocations.
I'm proposing a lightweight mutex (PyMutex) along with internal-only APIs used for building an efficient PyMutex as well as other synchronization primitives. The design is based on WebKit's WTF::Lock and WTF::ParkingLot, which are described in detail in the Locking in WebKit blog post. (The design has also been ported to Rust in the parking_lot crate.)
Public API
The public API (in Include/cpython) would provide a PyMutex that occupies one byte and can be zero-initialized:
typedef struct PyMutex { uint8_t state; } PyMutex;
void PyMutex_Lock(PyMutex *m);
void PyMutex_Unlock(PyMutex *m);
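To illustrate how a one-byte, zero-initializable mutex might be embedded in a C struct, here is a minimal sketch. It does not use the proposed implementation: DemoMutex, DemoMutex_Lock, DemoMutex_Unlock, Counter, and run_counter_demo are hypothetical stand-ins, and the lock simply spins and yields instead of parking waiters the way the proposal describes. It assumes a C11 compiler and POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed type: one byte, valid when zeroed.
   The proposal declares a plain `uint8_t state`; _Atomic is used here only
   so that the sketch is well-defined C11. */
typedef struct DemoMutex { _Atomic uint8_t state; } DemoMutex;

static void DemoMutex_Lock(DemoMutex *m) {
    uint8_t expected = 0;
    /* Try to flip 0 -> 1; on failure, yield and retry (no parking here). */
    while (!atomic_compare_exchange_weak_explicit(
               &m->state, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;
        sched_yield();
    }
}

static void DemoMutex_Unlock(DemoMutex *m) {
    atomic_store_explicit(&m->state, 0, memory_order_release);
}

/* The mutex adds a single byte to the struct and needs no init call. */
typedef struct Counter {
    DemoMutex mutex;
    long value;
} Counter;

#define DEMO_ITERATIONS 10000
#define DEMO_MAX_THREADS 16

static void *counter_worker(void *arg) {
    Counter *c = arg;
    for (int i = 0; i < DEMO_ITERATIONS; i++) {
        DemoMutex_Lock(&c->mutex);
        c->value++;
        DemoMutex_Unlock(&c->mutex);
    }
    return NULL;
}

/* Runs `nthreads` workers against one zero-initialized Counter and
   returns the final count. */
static long run_counter_demo(int nthreads) {
    assert(nthreads > 0 && nthreads <= DEMO_MAX_THREADS);
    Counter c = { .value = 0 };  /* remaining members are zero-initialized */
    pthread_t threads[DEMO_MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, counter_worker, &c);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    return c.value;
}
```

With the proposed API, the two DemoMutex calls would presumably be replaced by PyMutex_Lock and PyMutex_Unlock, and the struct member by a PyMutex.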
I'm proposing making PyMutex public because it's useful in C extensions such as NumPy, where (unlike in C++) it can be a pain to wrap cross-platform synchronization primitives.
Internal APIs
The internal-only API (in Include/internal) would provide APIs for building PyMutex and other synchronization primitives. The main addition is a compare-and-wait primitive, like Linux's futex or Windows' WaitOnAddress.
int _PyParkingLot_Park(const void *address, const void *expected, size_t address_size,
                       _PyTime_t timeout_ns, void *arg, int detach);
The API closely matches WaitOnAddress but with two additions: arg is an optional, arbitrary pointer passed to the waking thread, and detach indicates whether to release the GIL (or detach in --disable-gil builds) while waiting. The additional arg pointer allows the locks to be only one byte (instead of at least pointer-sized), since it allows passing additional (stack-allocated) data between the waiting and the waking thread.
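The compare-and-wait semantics can be sketched with a plain pthread mutex and condition variable. This is only an illustration under simplifying assumptions, not the proposed implementation: demo_park, demo_unpark_all, Waiter, and demo_run are hypothetical names, the waited-on value is fixed to an int rather than an arbitrary address_size, there is a single global wait list instead of hashed buckets, and timeout_ns, arg, and detach are omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Each parked thread keeps its wait record on its own stack. */
typedef struct Waiter {
    const void *address;   /* address this thread is parked on */
    int woken;             /* set by the waking thread, under lot_mutex */
    struct Waiter *next;
} Waiter;

static pthread_mutex_t lot_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t lot_cond = PTHREAD_COND_INITIALIZER;
static Waiter *lot_waiters = NULL;

/* Parks only if *address still equals `expected`. The comparison happens
   under lot_mutex, so a wake-up issued after the value changes is not lost.
   Returns 0 if parked and later woken, -1 if the value did not match. */
static int demo_park(_Atomic int *address, int expected) {
    pthread_mutex_lock(&lot_mutex);
    if (atomic_load(address) != expected) {
        pthread_mutex_unlock(&lot_mutex);
        return -1;
    }
    Waiter self = { (const void *)address, 0, lot_waiters };
    lot_waiters = &self;
    while (!self.woken) {
        pthread_cond_wait(&lot_cond, &lot_mutex);
    }
    pthread_mutex_unlock(&lot_mutex);
    return 0;
}

/* Unlinks and wakes every thread parked on `address`. */
static void demo_unpark_all(const void *address) {
    pthread_mutex_lock(&lot_mutex);
    Waiter **link = &lot_waiters;
    while (*link != NULL) {
        Waiter *w = *link;
        if (w->address == address) {
            *link = w->next;
            w->woken = 1;
        } else {
            link = &w->next;
        }
    }
    pthread_cond_broadcast(&lot_cond);
    pthread_mutex_unlock(&lot_mutex);
}

static _Atomic int demo_flag;

static void *demo_waiter(void *unused) {
    (void)unused;
    /* Park until the flag is no longer 0. */
    while (atomic_load(&demo_flag) == 0) {
        demo_park(&demo_flag, 0);
    }
    return NULL;
}

/* Starts a waiter, changes the flag, wakes the waiter, and returns the
   final flag value. */
static int demo_run(void) {
    pthread_t t;
    atomic_store(&demo_flag, 0);
    int rc = pthread_create(&t, NULL, demo_waiter, NULL);
    assert(rc == 0);
    atomic_store(&demo_flag, 1);      /* change the value first... */
    demo_unpark_all(&demo_flag);      /* ...then wake anyone parked on it */
    pthread_join(t, NULL);
    return atomic_load(&demo_flag);
}
```

The waker changes the value before calling demo_unpark_all, and the waiter compares under the same lock the waker takes, so the waiter either sees the new value and returns immediately or is already on the wait list when the wake-up runs.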
The wakeup API looks like:
// wake up all threads waiting on `address`
void _PyParkingLot_UnparkAll(const void *address);
// or wake up a single thread
_PyParkingLot_Unpark(address, unpark, {
// code here is executed after the thread to be woken up is identified but before we wake it up
void *arg = unpark->arg;
int more_waiters = unpark->more_waiters;
...
});
_PyParkingLot_Unpark is currently a macro that takes a code block. For PyMutex we need to update the mutex bits after we identify the thread but before we actually wake it up.
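The ordering requirement (identify the waiter, run caller code, then wake) can also be expressed with a callback instead of a macro. The following single-threaded simulation is a sketch of that ordering only: DemoWaiter, DemoUnpark, demo_unpark_one, demo_update_mutex, and the DEMO_LOCKED and DEMO_HAS_PARKED bit names are all hypothetical, and a real parking lot would perform these steps while holding a per-bucket lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state bits for a one-byte mutex. */
#define DEMO_LOCKED     1
#define DEMO_HAS_PARKED 2

typedef struct DemoWaiter {
    const void *address;  /* address the waiter is parked on */
    void *arg;            /* data supplied by the waiter when it parked */
    int woken;
} DemoWaiter;

/* Information handed to the caller's code before the wake-up. */
typedef struct DemoUnpark {
    void *arg;            /* the selected waiter's arg, or NULL if none */
    int more_waiters;     /* nonzero if others remain parked on address */
} DemoUnpark;

typedef void (*demo_unpark_fn)(DemoUnpark *unpark, void *ctx);

/* Selects the first waiter parked on `address`, runs `fn`, then wakes it. */
static void demo_unpark_one(DemoWaiter *queue, size_t n, const void *address,
                            demo_unpark_fn fn, void *ctx) {
    assert(fn != NULL);
    DemoWaiter *first = NULL;
    int more = 0;
    for (size_t i = 0; i < n; i++) {
        if (queue[i].address != address || queue[i].woken) {
            continue;
        }
        if (first == NULL) {
            first = &queue[i];
        } else {
            more = 1;
        }
    }
    DemoUnpark unpark = { first != NULL ? first->arg : NULL, more };
    fn(&unpark, ctx);          /* waiter identified, not yet woken */
    if (first != NULL) {
        first->woken = 1;      /* the actual wake-up happens last */
    }
}

/* Example callback: release the mutex but keep the "has parked" bit set
   while other waiters remain, so the next unlock knows to wake someone. */
static void demo_update_mutex(DemoUnpark *unpark, void *ctx) {
    uint8_t *state = ctx;
    *state = unpark->more_waiters ? DEMO_HAS_PARKED : 0;
}
```

Because the callback runs before the wake-up, the woken thread cannot observe stale mutex bits when it retries the lock.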