
[libc] add a simple TTAS spin lock #98846


Merged: 4 commits merged into llvm:main on Jul 16, 2024

Conversation

SchrodingerZhu
Contributor

@SchrodingerZhu SchrodingerZhu commented Jul 15, 2024

This is to be used in pthread_spinlock.

@SchrodingerZhu SchrodingerZhu requested a review from jhuber6 July 15, 2024 00:04
@llvmbot llvmbot added the libc label Jul 15, 2024
@llvmbot
Member

llvmbot commented Jul 15, 2024

@llvm/pr-subscribers-libc

Author: Schrodinger ZHU Yifan (SchrodingerZhu)

Changes

This is to be used in pthread_spinlock.


Full diff: https://github.com/llvm/llvm-project/pull/98846.diff

2 Files Affected:

  • (modified) libc/src/__support/threads/CMakeLists.txt (+9)
  • (added) libc/src/__support/threads/spin_lock.h (+56)
diff --git a/libc/src/__support/threads/CMakeLists.txt b/libc/src/__support/threads/CMakeLists.txt
index 9ea0b59befe7a..d2e46b8e2574e 100644
--- a/libc/src/__support/threads/CMakeLists.txt
+++ b/libc/src/__support/threads/CMakeLists.txt
@@ -10,6 +10,15 @@ add_header_library(
     sleep.h
 )
 
+add_header_library(
+  spin_lock
+  HDRS
+    spin_lock.h
+  DEPENDS
+    .sleep
+    libc.src.__support.CPP.atomic
+)
+
 if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${LIBC_TARGET_OS})
   add_subdirectory(${LIBC_TARGET_OS})
 endif()
diff --git a/libc/src/__support/threads/spin_lock.h b/libc/src/__support/threads/spin_lock.h
new file mode 100644
index 0000000000000..8fd6fb960f3ce
--- /dev/null
+++ b/libc/src/__support/threads/spin_lock.h
@@ -0,0 +1,56 @@
+//===-- TTAS Spin Lock ----------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIBC_SRC___SUPPORT_THREADS_SPIN_LOCK_H
+#define LLVM_LIBC_SRC___SUPPORT_THREADS_SPIN_LOCK_H
+
+#include "src/__support/CPP/atomic.h"
+#include "src/__support/macros/attributes.h"
+#include "src/__support/threads/sleep.h"
+namespace LIBC_NAMESPACE_DECL {
+class SpinLock {
+  cpp::Atomic<bool> flag;
+
+public:
+  LIBC_INLINE constexpr SpinLock() : flag{false} {}
+  LIBC_INLINE bool try_lock() {
+    return !flag.exchange(true, cpp::MemoryOrder::ACQUIRE);
+  }
+  LIBC_INLINE void lock() {
+    // clang-format off
+    // this compiles to the following on armv9a and x86_64:
+    //         mov     w8, #1            |          .LBB0_1:
+    // .LBB0_1:                          |                  mov     al, 1
+    //         swpab   w8, w9, [x0]      |                  xchg    byte ptr [rdi], al
+    //         tbnz    w9, #0, .LBB0_3   |                  test    al, 1
+    //         b       .LBB0_4           |                  jne     .LBB0_3
+    // .LBB0_2:                          |                  jmp     .LBB0_4
+    //         isb                       |         .LBB0_2:
+    // .LBB0_3:                          |                  pause
+    //         ldrb    w9, [x0]          |         .LBB0_3:       
+    //         tbnz    w9, #0, .LBB0_2   |                  movzx   eax, byte ptr [rdi]
+    //         b       .LBB0_1           |                  test    al, 1
+    // .LBB0_4:                          |                  jne     .LBB0_2
+    //         ret                       |                  jmp     .LBB0_1
+    //                                   |          .LBB0_4:
+    //                                   |                  ret
+    // clang-format on
+    // Notice that inside the busy loop .LBB0_2 and .LBB0_3, only instructions
+    // with load semantics are used. swpab/xchg is only issued in the outer loop
+    // .LBB0_1. This is useful to avoid extra writer traffic. The cache
+    // coherence guarantees "write propagation", so even if the inner loop only
+    // reads with relaxed ordering, the thread will eventually see the write.
+    while (!try_lock())
+      while (flag.load(cpp::MemoryOrder::RELAXED))
+        sleep_briefly();
+  }
+  LIBC_INLINE void unlock() { flag.store(false, cpp::MemoryOrder::RELEASE); }
+};
+} // namespace LIBC_NAMESPACE_DECL
+
+#endif // LLVM_LIBC_SRC___SUPPORT_THREADS_SPIN_LOCK_H
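
The comment block above is the gist of TTAS: the atomic exchange is confined to the outer loop, and the inner wait loop only issues plain loads, relying on cache coherence to observe the eventual release. For reference, a minimal usage sketch of the new class (hypothetical example, not part of the PR; the guarded counter and the function names are made up):

#include "src/__support/threads/spin_lock.h"

namespace LIBC_NAMESPACE_DECL {

// Hypothetical shared state guarded by the new lock.
static SpinLock counter_lock;
static int counter = 0;

void increment_counter() {
  counter_lock.lock();   // TTAS acquire: exchange in the outer loop, read-only waiting inside
  ++counter;             // critical section
  counter_lock.unlock(); // plain store with RELEASE ordering
}

bool try_increment_counter() {
  // Non-blocking variant: give up immediately if the lock is already held.
  if (!counter_lock.try_lock())
    return false;
  ++counter;
  counter_lock.unlock();
  return true;
}

} // namespace LIBC_NAMESPACE_DECL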

Contributor

@jhuber6 jhuber6 left a comment

I've found that using fetch_or and fetch_and works better on the GPU targets. That is,

cpp::Atomic<int32_t> lock{};

bool has_lock = !lock.fetch_or(1, cpp::MemoryOrder::ACQUIRE);
...
lock.fetch_and(0, cpp::MemoryOrder::RELEASE);

Any clue how this performs on non-GPU targets?
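
For comparison, a self-contained sketch of that fetch_or/fetch_and approach written against the same cpp::Atomic API the PR uses (the class name is made up and this is only illustrative, not a proposed implementation). Unlike the TTAS lock above, the busy wait here re-issues the read-modify-write on every iteration:

#include "src/__support/CPP/atomic.h"
#include "src/__support/macros/attributes.h"
#include "src/__support/threads/sleep.h"

namespace LIBC_NAMESPACE_DECL {
// Hypothetical test-and-set lock built on fetch_or/fetch_and.
class FetchOrSpinLock {
  cpp::Atomic<int32_t> flag;

public:
  LIBC_INLINE constexpr FetchOrSpinLock() : flag{0} {}
  LIBC_INLINE bool try_lock() {
    // fetch_or returns the previous value; 0 means the lock was free.
    return !flag.fetch_or(1, cpp::MemoryOrder::ACQUIRE);
  }
  LIBC_INLINE void lock() {
    // Spin on the read-modify-write itself; there is no read-only inner loop.
    while (!try_lock())
      sleep_briefly();
  }
  LIBC_INLINE void unlock() { flag.fetch_and(0, cpp::MemoryOrder::RELEASE); }
};
} // namespace LIBC_NAMESPACE_DECL

The repeated RMW in the wait loop generates more write traffic on the lock's cache line than the TTAS pattern, which is the trade-off the CTR discussion below touches on.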

@SchrodingerZhu
Contributor Author

@jhuber6 I think this technique is called CTR, as in https://arxiv.org/pdf/2102.03863, section 2.1. But the TTAS semantics should be more general. Frankly speaking, I am not sure how CTR performs on modern hardware when used to implement TTAS-style locks.

@SchrodingerZhu
Contributor Author

So according to the discussion in that paper, CTR is considered more performant in 1-to-1 communication:

in our case with the 1-to-1 communication protocol used on the Grant field in Hemlock, busy-waiting via read-modify-write atomic operations provides a performance benefit. Because of the simple communication pattern, back-off in the busy-waiting loop is not useful.

@SchrodingerZhu SchrodingerZhu requested a review from jhuber6 July 16, 2024 03:15
@SchrodingerZhu
Contributor Author

@jhuber6 Updated. Please take a look.

Contributor

@jhuber6 jhuber6 left a comment

Spin locks in general don't work on the GPU due to the lack of a fair scheduler: you need to make sure there are enough resources that any thread can make progress even if every other thread is suspended and never comes back. I'm not too concerned about the GPU implementation for this, since I don't think it'll be used in anything. My questions were mostly about whether those bitwise operations work better, since in my experience they're considerably faster for the type of 1-to-1 waiting that's done on GPUs. However, looking at the generated assembly for x64, that's definitely not the case there. I think I'm fine with either this or the one you had before.

Co-authored-by: Joseph Huber <[email protected]>
@SchrodingerZhu
Contributor Author

Yeah, spinlocks are quite useless on GPUs: warp/wavefront divergence plus other scheduling issues will easily turn a spinlock into a deadlock.

And for many architectures there is no forward-progress guarantee for cross-block/CTA synchronization, so spinlock variants cannot be used across blocks/CTAs either.

I will use the updated version anyway, just to leave the option open for the future.

@SchrodingerZhu SchrodingerZhu merged commit 408a351 into llvm:main Jul 16, 2024
4 of 5 checks passed
@SchrodingerZhu SchrodingerZhu deleted the libc/spin_lock branch July 16, 2024 04:15