[lldb-dap] Fix a race during shutdown #91591

labath · 2024-05-09T13:24:03Z

lldb-dap was setting a flag which was meant to shut it down as soon as it sent a terminated event. The problem with this flag is two-fold:

as far as I can tell (definitely not an expert here), there's no justification for this in the protocol spec. The only way I found to shut the server down was to send it a disconnect request.
the flag did not actually work most of the time, because it's only checked between requests so nothing will happen if the server starts listening for a new request before a different thread manages to send the terminated event. And since the next request is usually the disconnect request, everything will operate normally.

The combination of these two things meant that the issue was largely unnoticable, except for rare flaky test failures, which happened when the handler thread was too slow, and checked the flag after it has already been said. This caused the test suite to complain as it did not get a response to the disconnect request. This situation could be s(t)imulated by adding a sleep to the and of the main loop, which delayed the flag check, and caused the DAP tests to fail reliably.

This patch changes the shutdown condition to only trigger when the disconnect request has been received. Since the flag can now only be set from the handler thread, it no longer needs to be atomic.

lldb-dap was setting a flag which was meant to shut it down as soon as it sent a terminated event. The problem with this flag is two-fold: - as far as I can tell (definitely not an expert here), there's no justification for this in the protocol spec. The only way I found to shut the server down was to send it a disconnect request. - the flag did not actually work most of the time, because it's only checked between requests so nothing will happen if the server starts listening for a new request before a different thread manages to send the terminated event. And since the next request is usually the disconnect request, everything will operate normally. The combination of these two things meant that the issue was largely unnoticable, except for rare flaky test failures, which happened when the handler thread was too slow, and checked the flag after it has already been said. This caused the test suite to complain as it did not get a response to the disconnect request. This situation could be s(t)imulated by adding a sleep to the and of the main loop, which delayed the flag check, and caused the DAP tests to fail reliably. This patch changes the shutdown condition to only trigger when the disconnect request has been received. Since the flag can now only be set from the handler thread, it no longer needs to be atomic.

labath · 2024-05-09T13:24:24Z

https://lab.llvm.org/buildbot/#/builders/68/builds/73871 is an example of such a failure.

llvmbot · 2024-05-09T13:24:37Z

@llvm/pr-subscribers-lldb

Author: Pavel Labath (labath)

Changes

lldb-dap was setting a flag which was meant to shut it down as soon as it sent a terminated event. The problem with this flag is two-fold:

as far as I can tell (definitely not an expert here), there's no justification for this in the protocol spec. The only way I found to shut the server down was to send it a disconnect request.
the flag did not actually work most of the time, because it's only checked between requests so nothing will happen if the server starts listening for a new request before a different thread manages to send the terminated event. And since the next request is usually the disconnect request, everything will operate normally.

The combination of these two things meant that the issue was largely unnoticable, except for rare flaky test failures, which happened when the handler thread was too slow, and checked the flag after it has already been said. This caused the test suite to complain as it did not get a response to the disconnect request. This situation could be s(t)imulated by adding a sleep to the and of the main loop, which delayed the flag check, and caused the DAP tests to fail reliably.

This patch changes the shutdown condition to only trigger when the disconnect request has been received. Since the flag can now only be set from the handler thread, it no longer needs to be atomic.

Full diff: https://github.com/llvm/llvm-project/pull/91591.diff

3 Files Affected:

(modified) lldb/tools/lldb-dap/DAP.cpp (+2-3)
(modified) lldb/tools/lldb-dap/DAP.h (+1-1)
(modified) lldb/tools/lldb-dap/lldb-dap.cpp (+7-18)

diff --git a/lldb/tools/lldb-dap/DAP.cpp b/lldb/tools/lldb-dap/DAP.cpp
index b254ddfef0d5f..55ff1493c1011 100644
--- a/lldb/tools/lldb-dap/DAP.cpp
+++ b/lldb/tools/lldb-dap/DAP.cpp
@@ -39,8 +39,7 @@ DAP::DAP()
            {"objc_throw", "Objective-C Throw", lldb::eLanguageTypeObjC},
            {"swift_catch", "Swift Catch", lldb::eLanguageTypeSwift},
            {"swift_throw", "Swift Throw", lldb::eLanguageTypeSwift}}),
-      focus_tid(LLDB_INVALID_THREAD_ID), sent_terminated_event(false),
-      stop_at_entry(false), is_attach(false),
+      focus_tid(LLDB_INVALID_THREAD_ID), stop_at_entry(false), is_attach(false),
       enable_auto_variable_summaries(false),
       enable_synthetic_child_debugging(false),
       restarting_process_id(LLDB_INVALID_PROCESS_ID),
@@ -623,7 +622,7 @@ bool DAP::HandleObject(const llvm::json::Object &object) {
 }
 
 llvm::Error DAP::Loop() {
-  while (!sent_terminated_event) {
+  while (!disconnecting) {
     llvm::json::Object object;
     lldb_dap::PacketStatus status = GetNextObject(object);
 
diff --git a/lldb/tools/lldb-dap/DAP.h b/lldb/tools/lldb-dap/DAP.h
index 5c70a056fea4b..bbd9d46ba3a04 100644
--- a/lldb/tools/lldb-dap/DAP.h
+++ b/lldb/tools/lldb-dap/DAP.h
@@ -168,7 +168,7 @@ struct DAP {
   // arguments if we get a RestartRequest.
   std::optional<llvm::json::Object> last_launch_or_attach_request;
   lldb::tid_t focus_tid;
-  std::atomic<bool> sent_terminated_event;
+  bool disconnecting = false;
   bool stop_at_entry;
   bool is_attach;
   bool enable_auto_variable_summaries;
diff --git a/lldb/tools/lldb-dap/lldb-dap.cpp b/lldb/tools/lldb-dap/lldb-dap.cpp
index f35abd665e844..96da458be21d1 100644
--- a/lldb/tools/lldb-dap/lldb-dap.cpp
+++ b/lldb/tools/lldb-dap/lldb-dap.cpp
@@ -226,26 +226,14 @@ void SendContinuedEvent() {
 // Send a "terminated" event to indicate the process is done being
 // debugged.
 void SendTerminatedEvent() {
-  // If an inferior exits prior to the processing of a disconnect request, then
-  // the threads executing EventThreadFunction and request_discontinue
-  // respectively may call SendTerminatedEvent simultaneously. Without any
-  // synchronization, the thread executing EventThreadFunction may set
-  // g_dap.sent_terminated_event before the thread executing
-  // request_discontinue has had a chance to test it, in which case the latter
-  // would move ahead to issue a response to the disconnect request. Said
-  // response may get dispatched ahead of the terminated event compelling the
-  // client to terminate the debug session without consuming any console output
-  // that might've been generated by the execution of terminateCommands. So,
-  // synchronize simultaneous calls to SendTerminatedEvent.
+  // Prevent races if the process exits while we're being asked to disconnect.
   static std::mutex mutex;
   std::lock_guard<std::mutex> locker(mutex);
-  if (!g_dap.sent_terminated_event) {
-    g_dap.sent_terminated_event = true;
-    g_dap.RunTerminateCommands();
-    // Send a "terminated" event
-    llvm::json::Object event(CreateTerminatedEventObject());
-    g_dap.SendJSON(llvm::json::Value(std::move(event)));
-  }
+
+  g_dap.RunTerminateCommands();
+  // Send a "terminated" event
+  llvm::json::Object event(CreateTerminatedEventObject());
+  g_dap.SendJSON(llvm::json::Value(std::move(event)));
 }
 
 // Send a thread stopped event for all threads as long as the process
@@ -1003,6 +991,7 @@ void request_disconnect(const llvm::json::Object &request) {
     g_dap.broadcaster.BroadcastEventByType(eBroadcastBitStopProgressThread);
     g_dap.progress_event_thread.join();
   }
+  g_dap.disconnecting = true;
 }
 
 void request_exceptionInfo(const llvm::json::Object &request) {

walter-erquinigo

This all makes sense to me. Thank you!

clayborg

I can't remember: does terminate come before disconnecting? Or is it the other way around?

labath · 2024-05-10T06:32:12Z

I can't remember: does terminate come before disconnecting? Or is it the other way around?

Disconnect docs say "The disconnect request asks the debug adapter to disconnect from the debuggee (thus ending the debug session) and then to shut down itself (the debug adapter).", so it has to be the last request. The terminate just terminates the debuggee and, if the client&server support it, the debuggee can be restarted after that.

clayborg

Sounds good, thanks for the clarification.

kusmour · 2024-05-22T14:07:13Z

I can't remember: does terminate come before disconnecting? Or is it the other way around?

Disconnect docs say "The disconnect request asks the debug adapter to disconnect from the debuggee (thus ending the debug session) and then to shut down itself (the debug adapter).", so it has to be the last request. The terminate just terminates the debuggee and, if the client&server support it, the debuggee can be restarted after that.

This patch resulted in terminated event being sent twice for every session. One from exited event one from disconnect request. Before this patch the behavior was:

Debuggee program exits
DAP msg: exited event --> terminated event --> disconnect request --> disconnect response
User initiated stop debugging
DAP msg: disconnect request --> exited event --> terminated event --> disconnect response

Now

Debuggee program exits
DAP msg: exited event --> terminated event --> disconnect request --> terminated event --> disconnect response
User initiated stop debugging
DAP msg: disconnect request --> exited event --> terminated event --> disconnect response --> terminated event

(Also not sure why the terminated event comes after disconnect response in the second scenario)

labath · 2024-05-22T14:34:11Z

Ah yes, of course, I've kept the mutex thinking that its sufficient to make it run only once, but of course that's only true if the critical section actually contains a "run once" check. I'll create a patch for that tomorrow. Feel free to revert in the mean time if this is blocking something.

This is a regression from llvm#91591. Turns out std::mutex does not prevent code from running twice 🤦.

This is a regression from #91591. Turns out std::mutex does not prevent code from running twice. 🤦

labath requested review from walter-erquinigo and clayborg May 9, 2024 13:24

labath requested a review from JDevlieghere as a code owner May 9, 2024 13:24

llvmbot added the lldb label May 9, 2024

walter-erquinigo approved these changes May 9, 2024

View reviewed changes

clayborg reviewed May 10, 2024

View reviewed changes

clayborg approved these changes May 10, 2024

View reviewed changes

labath merged commit 871f483 into llvm:main May 10, 2024
6 checks passed

labath added a commit to labath/llvm-project that referenced this pull request May 23, 2024

[lldb-dap] Send terminated event only once

2661141

This is a regression from llvm#91591. Turns out std::mutex does not prevent code from running twice 🤦.

labath mentioned this pull request May 23, 2024

[lldb-dap] Send terminated event only once #93172

Merged

labath added a commit that referenced this pull request May 24, 2024

[lldb-dap] Send terminated event only once (#93172)

77ae18b

This is a regression from #91591. Turns out std::mutex does not prevent code from running twice. 🤦

labath deleted the disconnect branch June 7, 2024 10:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[lldb-dap] Fix a race during shutdown #91591

[lldb-dap] Fix a race during shutdown #91591

Uh oh!

labath commented May 9, 2024

Uh oh!

labath commented May 9, 2024

Uh oh!

llvmbot commented May 9, 2024

Uh oh!

walter-erquinigo left a comment

Uh oh!

clayborg left a comment

Uh oh!

labath commented May 10, 2024

Uh oh!

clayborg left a comment

Uh oh!

Uh oh!

kusmour commented May 22, 2024

Uh oh!

labath commented May 22, 2024

Uh oh!

Uh oh!

[lldb-dap] Fix a race during shutdown #91591

[lldb-dap] Fix a race during shutdown #91591

Uh oh!

Conversation

labath commented May 9, 2024

Uh oh!

labath commented May 9, 2024

Uh oh!

llvmbot commented May 9, 2024

Uh oh!

walter-erquinigo left a comment

Choose a reason for hiding this comment

Uh oh!

clayborg left a comment

Choose a reason for hiding this comment

Uh oh!

labath commented May 10, 2024

Uh oh!

clayborg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kusmour commented May 22, 2024

Uh oh!

labath commented May 22, 2024

Uh oh!

Uh oh!