Skip to content

[lldb-dap] Fix a race during shutdown #91591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 10, 2024
Merged

[lldb-dap] Fix a race during shutdown #91591

merged 1 commit into from
May 10, 2024

Conversation

labath
Copy link
Collaborator

@labath labath commented May 9, 2024

lldb-dap was setting a flag which was meant to shut it down as soon as it sent a terminated event. The problem with this flag is two-fold:

  • as far as I can tell (definitely not an expert here), there's no justification for this in the protocol spec. The only way I found to shut the server down was to send it a disconnect request.
  • the flag did not actually work most of the time, because it's only checked between requests so nothing will happen if the server starts listening for a new request before a different thread manages to send the terminated event. And since the next request is usually the disconnect request, everything will operate normally.

The combination of these two things meant that the issue was largely unnoticable, except for rare flaky test failures, which happened when the handler thread was too slow, and checked the flag after it has already been said. This caused the test suite to complain as it did not get a response to the disconnect request. This situation could be s(t)imulated by adding a sleep to the and of the main loop, which delayed the flag check, and caused the DAP tests to fail reliably.

This patch changes the shutdown condition to only trigger when the disconnect request has been received. Since the flag can now only be set from the handler thread, it no longer needs to be atomic.

lldb-dap was setting a flag which was meant to shut it down as soon as
it sent a terminated event. The problem with this flag is two-fold:
- as far as I can tell (definitely not an expert here), there's no
  justification for this in the protocol spec. The only way I found to
  shut the server down was to send it a disconnect request.
- the flag did not actually work most of the time, because it's only
  checked between requests so nothing will happen if the server starts
  listening for a new request before a different thread manages to send
  the terminated event. And since the next request is usually the
  disconnect request, everything will operate normally.

The combination of these two things meant that the issue was largely
unnoticable, except for rare flaky test failures, which happened when
the handler thread was too slow, and checked the flag after it has
already been said. This caused the test suite to complain as it did not
get a response to the disconnect request. This situation could be
s(t)imulated by adding a sleep to the and of the main loop, which
delayed the flag check, and caused the DAP tests to fail reliably.

This patch changes the shutdown condition to only trigger when the
disconnect request has been received. Since the flag can now only be set
from the handler thread, it no longer needs to be atomic.
@labath labath requested a review from JDevlieghere as a code owner May 9, 2024 13:24
@llvmbot llvmbot added the lldb label May 9, 2024
@labath
Copy link
Collaborator Author

labath commented May 9, 2024

https://lab.llvm.org/buildbot/#/builders/68/builds/73871 is an example of such a failure.

@llvmbot
Copy link
Member

llvmbot commented May 9, 2024

@llvm/pr-subscribers-lldb

Author: Pavel Labath (labath)

Changes

lldb-dap was setting a flag which was meant to shut it down as soon as it sent a terminated event. The problem with this flag is two-fold:

  • as far as I can tell (definitely not an expert here), there's no justification for this in the protocol spec. The only way I found to shut the server down was to send it a disconnect request.
  • the flag did not actually work most of the time, because it's only checked between requests so nothing will happen if the server starts listening for a new request before a different thread manages to send the terminated event. And since the next request is usually the disconnect request, everything will operate normally.

The combination of these two things meant that the issue was largely unnoticable, except for rare flaky test failures, which happened when the handler thread was too slow, and checked the flag after it has already been said. This caused the test suite to complain as it did not get a response to the disconnect request. This situation could be s(t)imulated by adding a sleep to the and of the main loop, which delayed the flag check, and caused the DAP tests to fail reliably.

This patch changes the shutdown condition to only trigger when the disconnect request has been received. Since the flag can now only be set from the handler thread, it no longer needs to be atomic.


Full diff: https://github.com/llvm/llvm-project/pull/91591.diff

3 Files Affected:

  • (modified) lldb/tools/lldb-dap/DAP.cpp (+2-3)
  • (modified) lldb/tools/lldb-dap/DAP.h (+1-1)
  • (modified) lldb/tools/lldb-dap/lldb-dap.cpp (+7-18)
diff --git a/lldb/tools/lldb-dap/DAP.cpp b/lldb/tools/lldb-dap/DAP.cpp
index b254ddfef0d5f..55ff1493c1011 100644
--- a/lldb/tools/lldb-dap/DAP.cpp
+++ b/lldb/tools/lldb-dap/DAP.cpp
@@ -39,8 +39,7 @@ DAP::DAP()
            {"objc_throw", "Objective-C Throw", lldb::eLanguageTypeObjC},
            {"swift_catch", "Swift Catch", lldb::eLanguageTypeSwift},
            {"swift_throw", "Swift Throw", lldb::eLanguageTypeSwift}}),
-      focus_tid(LLDB_INVALID_THREAD_ID), sent_terminated_event(false),
-      stop_at_entry(false), is_attach(false),
+      focus_tid(LLDB_INVALID_THREAD_ID), stop_at_entry(false), is_attach(false),
       enable_auto_variable_summaries(false),
       enable_synthetic_child_debugging(false),
       restarting_process_id(LLDB_INVALID_PROCESS_ID),
@@ -623,7 +622,7 @@ bool DAP::HandleObject(const llvm::json::Object &object) {
 }
 
 llvm::Error DAP::Loop() {
-  while (!sent_terminated_event) {
+  while (!disconnecting) {
     llvm::json::Object object;
     lldb_dap::PacketStatus status = GetNextObject(object);
 
diff --git a/lldb/tools/lldb-dap/DAP.h b/lldb/tools/lldb-dap/DAP.h
index 5c70a056fea4b..bbd9d46ba3a04 100644
--- a/lldb/tools/lldb-dap/DAP.h
+++ b/lldb/tools/lldb-dap/DAP.h
@@ -168,7 +168,7 @@ struct DAP {
   // arguments if we get a RestartRequest.
   std::optional<llvm::json::Object> last_launch_or_attach_request;
   lldb::tid_t focus_tid;
-  std::atomic<bool> sent_terminated_event;
+  bool disconnecting = false;
   bool stop_at_entry;
   bool is_attach;
   bool enable_auto_variable_summaries;
diff --git a/lldb/tools/lldb-dap/lldb-dap.cpp b/lldb/tools/lldb-dap/lldb-dap.cpp
index f35abd665e844..96da458be21d1 100644
--- a/lldb/tools/lldb-dap/lldb-dap.cpp
+++ b/lldb/tools/lldb-dap/lldb-dap.cpp
@@ -226,26 +226,14 @@ void SendContinuedEvent() {
 // Send a "terminated" event to indicate the process is done being
 // debugged.
 void SendTerminatedEvent() {
-  // If an inferior exits prior to the processing of a disconnect request, then
-  // the threads executing EventThreadFunction and request_discontinue
-  // respectively may call SendTerminatedEvent simultaneously. Without any
-  // synchronization, the thread executing EventThreadFunction may set
-  // g_dap.sent_terminated_event before the thread executing
-  // request_discontinue has had a chance to test it, in which case the latter
-  // would move ahead to issue a response to the disconnect request. Said
-  // response may get dispatched ahead of the terminated event compelling the
-  // client to terminate the debug session without consuming any console output
-  // that might've been generated by the execution of terminateCommands. So,
-  // synchronize simultaneous calls to SendTerminatedEvent.
+  // Prevent races if the process exits while we're being asked to disconnect.
   static std::mutex mutex;
   std::lock_guard<std::mutex> locker(mutex);
-  if (!g_dap.sent_terminated_event) {
-    g_dap.sent_terminated_event = true;
-    g_dap.RunTerminateCommands();
-    // Send a "terminated" event
-    llvm::json::Object event(CreateTerminatedEventObject());
-    g_dap.SendJSON(llvm::json::Value(std::move(event)));
-  }
+
+  g_dap.RunTerminateCommands();
+  // Send a "terminated" event
+  llvm::json::Object event(CreateTerminatedEventObject());
+  g_dap.SendJSON(llvm::json::Value(std::move(event)));
 }
 
 // Send a thread stopped event for all threads as long as the process
@@ -1003,6 +991,7 @@ void request_disconnect(const llvm::json::Object &request) {
     g_dap.broadcaster.BroadcastEventByType(eBroadcastBitStopProgressThread);
     g_dap.progress_event_thread.join();
   }
+  g_dap.disconnecting = true;
 }
 
 void request_exceptionInfo(const llvm::json::Object &request) {

Copy link
Member

@walter-erquinigo walter-erquinigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all makes sense to me. Thank you!

Copy link
Collaborator

@clayborg clayborg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember: does terminate come before disconnecting? Or is it the other way around?

@labath
Copy link
Collaborator Author

labath commented May 10, 2024

I can't remember: does terminate come before disconnecting? Or is it the other way around?

Disconnect docs say "The disconnect request asks the debug adapter to disconnect from the debuggee (thus ending the debug session) and then to shut down itself (the debug adapter).", so it has to be the last request. The terminate just terminates the debuggee and, if the client&server support it, the debuggee can be restarted after that.

Copy link
Collaborator

@clayborg clayborg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks for the clarification.

@labath labath merged commit 871f483 into llvm:main May 10, 2024
6 checks passed
@kusmour
Copy link
Contributor

kusmour commented May 22, 2024

I can't remember: does terminate come before disconnecting? Or is it the other way around?

Disconnect docs say "The disconnect request asks the debug adapter to disconnect from the debuggee (thus ending the debug session) and then to shut down itself (the debug adapter).", so it has to be the last request. The terminate just terminates the debuggee and, if the client&server support it, the debuggee can be restarted after that.

This patch resulted in terminated event being sent twice for every session. One from exited event one from disconnect request. Before this patch the behavior was:

  • Debuggee program exits
    DAP msg: exited event --> terminated event --> disconnect request --> disconnect response
  • User initiated stop debugging
    DAP msg: disconnect request --> exited event --> terminated event --> disconnect response

Now

  • Debuggee program exits
    DAP msg: exited event --> terminated event --> disconnect request --> terminated event --> disconnect response
  • User initiated stop debugging
    DAP msg: disconnect request --> exited event --> terminated event --> disconnect response --> terminated event

(Also not sure why the terminated event comes after disconnect response in the second scenario)

@labath
Copy link
Collaborator Author

labath commented May 22, 2024

Ah yes, of course, I've kept the mutex thinking that its sufficient to make it run only once, but of course that's only true if the critical section actually contains a "run once" check. I'll create a patch for that tomorrow. Feel free to revert in the mean time if this is blocking something.

labath added a commit to labath/llvm-project that referenced this pull request May 23, 2024
This is a regression from llvm#91591. Turns out std::mutex does not prevent
code from running twice 🤦.
labath added a commit that referenced this pull request May 24, 2024
This is a regression from #91591. Turns out std::mutex does not prevent
code from running twice. 🤦
@labath labath deleted the disconnect branch June 7, 2024 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants