[OpenMP] Fix work-stealing stack clobber with taskwait #126049
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@shiltian I think I saw you've done some work in this area too?
TBH, at first glance I think the fix makes sense, but after looking at it more, I'm not sure I understand the underlying issue. I don't follow why and when `node` could be "overridden". Isn't it a stack corruption?
Yes, it's a stack-corruption bug. Two different tasks on two different threads access the same, supposedly thread-local stack belonging to one of those threads. For example: in kmp_tasking.cpp:__kmp_execute_tasks_template, the first thread steals a task (__kmp_steal_task), then immediately executes it (__kmp_invoke_task). __kmp_invoke_task calls __kmp_release_deps via __kmp_task_finish. The second thread executes via the non-task-stealing path: __kmp_remove_my_task called from __kmp_execute_tasks_template, then __kmp_invoke_task, and so on as above. But on the task-stealing path, the task's depnode pointer still points to the stack of its original thread, not the one it actually executes on (the stack frame that actually lives in __kmpc_omp_taskwait_deps_51, which the fix for PR85963 took pains to keep alive). So when the second thread comes along and allocates a "new" depnode, it is actually using a chunk of (stack) memory that is still in use by the first thread. An alternative fix might be to add more locking in an appropriate place, but that seems like it would be more error-prone and probably slower.
I don't see any demonstration of the issue. Is there any test code to reproduce it? __kmpc_omp_taskwait_deps_51 will not return until all dependencies on the depnode on the stack are complete. At that time, the ref count of the depnode must be 1. If you run into issues at this point in the code, the depnode on the stack is not the problem; something elsewhere in the code is causing an error in the reference counting.
If a task has a pointer to this depnode object on the stack, it means the taskwait depends on that task, and the task needs to finish before the taskwait can complete.
Please explain where allocating a new depnode would use the stack memory.
That is exactly what I think on second thought. If allocating stack memory for a later task can overwrite the existing stack, that is definitely wrong, and a stack corruption.
I've been using this test case from OpenMP_VV: if you modify the test to increase the number of iterations (N) to 1024000, the test fails some percentage of the time -- with an unmodified compiler, I see about a 0.1% failure rate.
Everything works fine if task stealing is forcibly disabled. The problem is that two different threads can access the same depnode, and hence the same refcounts -- that makes things go haywire. Quite rarely, but still. One giveaway is that __kmp_node_deref sometimes tries to free a stack-allocated depnode. That happens because two different threads can be executing __kmp_release_deps at the same time, both pointing at the same depnode on what is supposed to be one thread's local stack. I can try to paste more debug logs from my investigation if that would be helpful, but they're not easy to follow...
Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that nothing blocks some "new" task from running that function while another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It would kill any benefit from task stealing, since the stolen task would block.)
Possibly a dependency on an undefined value, but I'm not sure there's a race condition, and I also saw failures with a simplified version of the test. Can you explain where the race condition is?
Each active instance of __kmpc_omp_taskwait_deps_51 works on its own part of the stack. For these stack-allocated depnode objects, the cleanup code in __kmp_release_deps should never trigger. __kmpc_omp_taskwait_deps_51 owns one reference to the object, which is never explicitly released but simply dropped on return.
Yes, all of that is true if you don't take task stealing into account. That's where the problem lies! The cleanup code in __kmp_release_deps does trigger, even though it should not, because another thread is interfering with the reference counts. See the first patch, and maybe try applying it by itself and running the sollve test. One thread's taskdep node is definitely accessed by a different thread after task stealing takes place. With that patch (which pretty much just verifies the supposed invariants that you describe), crashes can be seen far more frequently.
I still can't understand the issue. Even with task stealing, the stack memory for the depnode remains valid until __kmpc_omp_taskwait_deps_51 returns.
I don't think exiting the region is the issue as such, but there might be something missing from my explanation. I'll try to figure out a better way to present the evidence.
Using the linked test code from sollve, I triggered the assertion at kmp_taskdeps.h:31:
KMP_DEBUG_ASSERT(n >= 0);
This patch does not fix the fundamental issue. Triggering the cleanup code for stack nodes is just a symptom.
Is that with or without the patch?
It took me an embarrassingly long time to find this, but the timing hole is this sequence of events:

1. THREAD 1: A regular task with dependences is created, call it T1
2. THREAD 1: Call into `__kmpc_omp_taskwait_deps_51()` and create a stack based depnode (`NULL` task), call it T2 (stack)
3. THREAD 2: Steals task T1 and executes it, getting to the `__kmp_release_deps()` region.
4. THREAD 1: During processing of dependences for T2 (stack) (within the `__kmp_check_deps()` region), a link is created T1 -> T2. This increases T2's (stack) `nrefs` count.
5. THREAD 2: Iterates through the successors list: decrements T2's (stack) npredecessor count, BUT HASN'T YET `__kmp_node_deref()`-ed it.
6. THREAD 1: When finished with `__kmp_check_deps()`, it returns false because the npredecessor count is 0, but T2's (stack) `nrefs` count is 2 because THREAD 2 still references it!
7. THREAD 1: Because `__kmp_check_deps()` returns false, it takes the early exit. _Now the stack based depnode is invalid, but THREAD 2 still references it._

We've reached improper stack referencing behavior. Varied results/crashes/asserts can occur if THREAD 1 comes back and recreates the exact same depnode at the exact same stack address while THREAD 2 calls `__kmp_node_deref()`. One solution is along the lines of this patch, which is to allocate all depnodes on the heap -- you may still need a deref() on the early exit as well. The other is to stick another wait loop right before the early exit taken when `__kmp_check_deps()` returns false.
Thanks @jpeyton52 for digging into the issue. |
Thanks @jpeyton52 for figuring out a more coherent explanation for what's going on with the bug, and to @jprotze and @shiltian for review! I'll prepare a new version of the patch fixing the early-exit path and adding some assertions. |
Force-pushed from 2a9522d to 323ab24.
This version adjusts the assertions to avoid conditionally growing the kmp_depnode_t type, and just adds another wait loop to __kmpc_omp_taskwait_deps_51, at @jpeyton52's and @jprotze's suggestion. Borrowing bit 0 of the refcount was an idea I had while investigating this bug earlier, but the last version of the patch didn't need it. Doing that (vs. adding a new "on_stack" field) allows the new assertions to have essentially zero extra cost.
This patch fixes a bug that causes crashes with OpenMP 'taskwait' directives in heavily multi-threaded scenarios. Task stealing can lead to a situation where references to an on-stack 'taskwait' dependency node remain even for the early-exit path in __kmpc_omp_taskwait_deps_51. This patch adds a wait loop to ensure the function does not return before such references are decremented to 1, along similar lines to the fix for PR85963. Several new assertions are also added for safety, borrowing bit 0 of the depnode refcount as a low-cost way of distinguishing heap-allocated from stack-allocated depnodes.
Force-pushed from 323ab24 to df7ffe8.
lgtm
Thank you! I don't have write access, so if someone could merge this for me, that'd be much appreciated.
This patch series demonstrates and fixes a bug that causes crashes with OpenMP 'taskwait' directives in heavily multi-threaded scenarios. TLDR: The early return from __kmpc_omp_taskwait_deps_51 missed the synchronization mechanism in place for the late return. Additional debug assertions check for the implied invariants of the code. @jpeyton52 found the timing hole as this sequence of events:

> 1. THREAD 1: A regular task with dependences is created, call it T1
> 2. THREAD 1: Call into `__kmpc_omp_taskwait_deps_51()` and create a stack based depnode (`NULL` task), call it T2 (stack)
> 3. THREAD 2: Steals task T1 and executes it getting to `__kmp_release_deps()` region.
> 4. THREAD 1: During processing of dependences for T2 (stack) (within `__kmp_check_deps()` region), a link is created T1 -> T2. This increases T2's (stack) `nrefs` count.
> 5. THREAD 2: Iterates through the successors list: decrement the T2's (stack) npredecessor count. BUT HASN'T YET `__kmp_node_deref()`-ed it.
> 6. THREAD 1: Now when finished with `__kmp_check_deps()`, it returns false because npredecessor count is 0, but T2's (stack) `nrefs` count is 2 because THREAD 2 still references it!
> 7. THREAD 1: Because `__kmp_check_deps()` returns false, early exit.
>
> _Now the stack based depnode is invalid, but THREAD 2 still references it._
>
> We've reached improper stack referencing behavior. Varied results/crashes/asserts can occur if THREAD 1 comes back and recreates the exact same depnode in the exact same stack address during the same time THREAD 2 calls `__kmp_node_deref()`.
Merging with the web UI failed twice, so I manually pushed the commit.
Thank you! |
This patch series demonstrates and fixes a bug that causes crashes with OpenMP 'taskwait' directives in heavily multi-threaded scenarios.
The implementation of taskwait builds a graph of dependency nodes for tasks. Some of those dependency nodes (kmp_depnode_t) are allocated on the stack, and some on the heap. In the former case, the stack is specific to a given thread, and the task associated with the node is initially bound to the same thread. This works as long as there is a 1:1 mapping between tasks and the per-thread stack.
However, kmp_tasking.cpp:__kmp_execute_tasks_template implements a work-stealing algorithm that can take a task 'T1' from one thread's ready queue (say, thread 'A') and execute it on another thread (say, thread 'B').
If that happens, task T1 may have a dependency node on thread A's stack, and that will not be moved to thread B's stack when the work-stealing takes place.
Now, in a heavily multi-threaded program, another task, T2, can be invoked on thread 'A', reusing thread A's stack slot at the same time that T1 is using that slot from thread 'B'. This leads to random crashes, often (but not always) during dependency-node cleanup (__kmp_release_deps).