[libc] add a simple TTAS spin lock #98846
Conversation
@llvm/pr-subscribers-libc

Author: Schrodinger ZHU Yifan (SchrodingerZhu)

Changes

This is to be used in pthread_spinlock.

Full diff: https://github.com/llvm/llvm-project/pull/98846.diff

2 Files Affected:
diff --git a/libc/src/__support/threads/CMakeLists.txt b/libc/src/__support/threads/CMakeLists.txt
index 9ea0b59befe7a..d2e46b8e2574e 100644
--- a/libc/src/__support/threads/CMakeLists.txt
+++ b/libc/src/__support/threads/CMakeLists.txt
@@ -10,6 +10,15 @@ add_header_library(
sleep.h
)
+add_header_library(
+ spin_lock
+ HDRS
+ spin_lock.h
+ DEPENDS
+ .sleep
+ libc.src.__support.CPP.atomic
+)
+
if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${LIBC_TARGET_OS})
add_subdirectory(${LIBC_TARGET_OS})
endif()
diff --git a/libc/src/__support/threads/spin_lock.h b/libc/src/__support/threads/spin_lock.h
new file mode 100644
index 0000000000000..8fd6fb960f3ce
--- /dev/null
+++ b/libc/src/__support/threads/spin_lock.h
@@ -0,0 +1,56 @@
+//===-- TTAS Spin Lock ----------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIBC_SRC___SUPPORT_THREADS_SPIN_LOCK_H
+#define LLVM_LIBC_SRC___SUPPORT_THREADS_SPIN_LOCK_H
+
+#include "src/__support/CPP/atomic.h"
+#include "src/__support/macros/attributes.h"
+#include "src/__support/threads/sleep.h"
+namespace LIBC_NAMESPACE_DECL {
+class SpinLock {
+ cpp::Atomic<bool> flag;
+
+public:
+ LIBC_INLINE constexpr SpinLock() : flag{false} {}
+ LIBC_INLINE bool try_lock() {
+ return !flag.exchange(true, cpp::MemoryOrder::ACQUIRE);
+ }
+ LIBC_INLINE void lock() {
+ // clang-format off
+ // this compiles to the following on armv9a and x86_64:
+ // mov w8, #1 | .LBB0_1:
+ // .LBB0_1: | mov al, 1
+ // swpab w8, w9, [x0] | xchg byte ptr [rdi], al
+ // tbnz w9, #0, .LBB0_3 | test al, 1
+ // b .LBB0_4 | jne .LBB0_3
+ // .LBB0_2: | jmp .LBB0_4
+ // isb | .LBB0_2:
+ // .LBB0_3: | pause
+ // ldrb w9, [x0] | .LBB0_3:
+ // tbnz w9, #0, .LBB0_2 | movzx eax, byte ptr [rdi]
+ // b .LBB0_1 | test al, 1
+ // .LBB0_4: | jne .LBB0_2
+ // ret | jmp .LBB0_1
+ // | .LBB0_4:
+ // | ret
+ // clang-format on
+ // Notice that inside the busy-wait loop (.LBB0_2 and .LBB0_3), only
+ // instructions with load semantics are used. swpab/xchg is only issued in
+ // the outer loop .LBB0_1. This is useful to avoid extra writer traffic.
+ // Cache coherence guarantees "write propagation", so even if the inner loop
+ // only reads with relaxed ordering, the thread will eventually see the write.
+ while (!try_lock())
+ while (flag.load(cpp::MemoryOrder::RELAXED))
+ sleep_briefly();
+ }
+ LIBC_INLINE void unlock() { flag.store(false, cpp::MemoryOrder::RELEASE); }
+};
+} // namespace LIBC_NAMESPACE_DECL
+
+#endif // LLVM_LIBC_SRC___SUPPORT_THREADS_SPIN_LOCK_H
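For contrast with the structure above, a plain test-and-set lock would issue the read-modify-write on every spin iteration, which is exactly the writer traffic the load-only inner loop avoids. A minimal sketch of that naive variant, reusing the same libc support headers purely for comparison (naive_tas_lock is not proposed code):

#include "src/__support/CPP/atomic.h"
#include "src/__support/macros/attributes.h"
#include "src/__support/threads/sleep.h"

namespace LIBC_NAMESPACE_DECL {
// Naive TAS spin loop for comparison: every iteration performs an atomic
// exchange, so each waiting core repeatedly pulls the cache line into
// exclusive state instead of spinning on a shared, read-only copy.
LIBC_INLINE void naive_tas_lock(cpp::Atomic<bool> &flag) {
  while (flag.exchange(true, cpp::MemoryOrder::ACQUIRE))
    sleep_briefly();
}
} // namespace LIBC_NAMESPACE_DECL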
I've found using fetch_or and fetch_and works better on the GPU targets. That is,
cpp::Atomic<int32_t> lock{};
bool has_lock = !lock.fetch_or(1, cpp::MemoryOrder::ACQUIRE);
...
lock.fetch_and(0, cpp::MemoryOrder::RELEASE);
Any clue how this performs on non-GPU targets?
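Filled out into the same shape as the patch's SpinLock, and assuming cpp::Atomic exposes fetch_or/fetch_and as in the snippet above, a lock built on those bitwise RMW operations might look roughly like this sketch (the class name BitSpinLock and the relaxed-load inner loop are illustrative additions, not part of the suggestion):

#include "src/__support/CPP/atomic.h"
#include "src/__support/macros/attributes.h"
#include "src/__support/threads/sleep.h"

namespace LIBC_NAMESPACE_DECL {
// Illustrative sketch only: acquire by setting the low bit with fetch_or,
// release by clearing it with fetch_and, and spin on relaxed loads in
// between so the busy wait stays read-only.
class BitSpinLock {
  cpp::Atomic<int32_t> flag;

public:
  LIBC_INLINE constexpr BitSpinLock() : flag{0} {}
  LIBC_INLINE bool try_lock() {
    return !flag.fetch_or(1, cpp::MemoryOrder::ACQUIRE);
  }
  LIBC_INLINE void lock() {
    while (!try_lock())
      while (flag.load(cpp::MemoryOrder::RELAXED))
        sleep_briefly();
  }
  LIBC_INLINE void unlock() { flag.fetch_and(0, cpp::MemoryOrder::RELEASE); }
};
} // namespace LIBC_NAMESPACE_DECL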
@jhuber6 I think this technique is called CTR, as in https://arxiv.org/pdf/2102.03863, section 2.1. But the TTAS semantics should be more general. Frankly speaking, I am not sure how it performs relative to TTAS locks on modern hardware.
So according to the discussion in that paper, CTR is considered more performant for 1-to-1 communication.
@jhuber6 Updated. Please take a look.
Spin locks in general don't work on the GPU due to the lack of a fair scheduler; you need to make sure there are enough resources that any thread can make progress even if every other thread is suspended and never comes back. I'm not too concerned about the GPU implementation for this, since I don't think it'll be used in anything. My questions were mostly about whether those bitwise operations worked better, since in my experience they're considerably faster for the kind of 1-to-1 waiting that's done on GPUs. However, looking at the generated assembly for x64, that's definitely not the case there. I think I'm fine with either this or the one you had before.
Co-authored-by: Joseph Huber <[email protected]>
Yeah, spinlocks are quite useless on GPUs. Warp/wavefront divergence plus other scheduling issues can easily turn a spinlock into a deadlock. And for many architectures, there is no forward-progress guarantee for cross-block/CTA synchronization, so spinlock variants cannot be used across blocks/CTAs either. I will use the updated version anyway, just to leave the potential open for the future.
This is to be used in pthread_spinlock.
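For a sense of that intended use, here is a minimal sketch of how the POSIX spin-lock entrypoints could be layered on top of SpinLock. None of this is in the diff; the entrypoint wiring, error handling, and the assumption that pthread_spinlock_t can hold a SpinLock in place are made purely for illustration.

#include "src/__support/threads/spin_lock.h"

#include <errno.h>   // EBUSY
#include <new>       // placement new
#include <pthread.h> // pthread_spinlock_t

// Assumption for this sketch: pthread_spinlock_t is large and aligned enough
// to hold a SpinLock, so we can construct one in place and cast back to it.
static LIBC_NAMESPACE::SpinLock *as_spin_lock(pthread_spinlock_t *lock) {
  return reinterpret_cast<LIBC_NAMESPACE::SpinLock *>(lock);
}

int pthread_spin_init(pthread_spinlock_t *lock, int /*pshared*/) {
  new (lock) LIBC_NAMESPACE::SpinLock();
  return 0;
}

int pthread_spin_lock(pthread_spinlock_t *lock) {
  as_spin_lock(lock)->lock();
  return 0;
}

int pthread_spin_trylock(pthread_spinlock_t *lock) {
  return as_spin_lock(lock)->try_lock() ? 0 : EBUSY;
}

int pthread_spin_unlock(pthread_spinlock_t *lock) {
  as_spin_lock(lock)->unlock();
  return 0;
}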