[OpenMP] Fix work-stealing stack clobber with taskwait #126049

Closed
wants to merge 1 commit

Conversation

jtb20 (Contributor) commented Feb 6, 2025

This patch series demonstrates and fixes a bug that causes crashes with OpenMP 'taskwait' directives in heavily multi-threaded scenarios.

The implementation of taskwait builds a graph of dependency nodes for tasks. Some of those dependency nodes (kmp_depnode_t) are allocated on the stack, and some on the heap. In the former case, the stack is specific to a given thread, and the task associated with the node is initially bound to that same thread. This works as long as each task runs on the thread whose stack holds its node.

However, kmp_tasking.cpp:__kmp_execute_tasks_template implements a work-stealing algorithm that can take a task 'T1' from one thread's ready queue (say, thread 'A') and execute it on another thread (say, thread 'B').

If that happens, task T1 may have a dependency node on thread A's stack, and that will not be moved to thread B's stack when the work-stealing takes place.

Now, in a heavily multi-threaded program, another task, T2, can be invoked on thread 'A', reusing the stack slot on thread A at the same time that T1, now running on thread 'B', is still using that slot. This leads to random crashes, often (but not always) during dependency-node cleanup (__kmp_release_deps).
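To make the clobber concrete, here is a minimal, self-contained C++ model of the failure mode (the names Depnode, slot, etc. are illustrative, not the libomp source): thread A reuses a "stack slot" for a new node while thread B, which stole the first task, still writes through a stale pointer into it.

    // Minimal model of the stack-slot clobber (illustrative names only).
    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct Depnode {
      std::atomic<int> npredecessors{0};
    };

    int main() {
      Depnode slot;            // plays the role of a stack slot on thread A
      slot.npredecessors = 1;  // depnode for task T1

      std::thread b([&slot] {  // thread B stole T1 and now releases its deps
        slot.npredecessors.fetch_sub(1); // writes into thread A's "stack"
      });

      slot.npredecessors = 1;  // thread A reuses the slot for task T2;
                               // this races with B's decrement above
      b.join();
      // Depending on the interleaving, T2's node has 0 or 1 predecessors.
      std::printf("npredecessors = %d\n", slot.npredecessors.load());
      return 0;
    }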

llvmbot added the openmp:libomp (OpenMP host runtime) label Feb 6, 2025

github-actions bot commented Feb 6, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

jtb20 (Contributor, Author) commented Feb 6, 2025

@shiltian I think I saw you've done some work in this area too?

shiltian (Contributor) left a comment

TBH, at first glance I thought the fix made sense, but after looking at it more, I'm not sure I understand the underlying issue. I don't follow why and when a node could be "overridden". Isn't that stack corruption?

jtb20 (Contributor, Author) commented Feb 6, 2025

> TBH, at first glance I thought the fix made sense, but after looking at it more, I'm not sure I understand the underlying issue. I don't follow why and when a node could be "overridden". Isn't that stack corruption?

Yes, it's a stack-corruption bug. Two different tasks on two different threads access the same, supposedly thread-local stack belonging to one of those threads. E.g.:

In kmp_tasking.cpp:__kmp_execute_tasks_template, the first thread steals a task (__kmp_steal_task), then immediately executes it (__kmp_invoke_task). __kmp_invoke_task calls __kmp_release_deps via __kmp_task_finish.

The second thread executes via the non-task-stealing path: __kmp_remove_my_task called from __kmp_execute_tasks_template, then __kmp_invoke_task, etc., as above.

But, on the task-stealing path, the task's depnode pointer still points to the stack of its original thread, not the thread it actually executes on (the stack frame lives in __kmpc_omp_taskwait_deps_51, the very frame that the fix for PR85963 took pains to keep alive). So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.

An alternative fix might be to add more locking in an appropriate place, but that seems like it'd be more error-prone and probably slower.

jprotze (Collaborator) commented Feb 6, 2025

I don't see any demonstration of the issue. Is there any test code to reproduce the issue?

__kmpc_omp_taskwait_deps_51 will not return until all dependencies on the stack depnode are complete. At that point, the ref count of the depnode must be 1.

If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.

jprotze (Collaborator) commented Feb 6, 2025

> But, on the task-stealing path, the task's depnode pointer still points to the stack of its original thread, not the thread it actually executes on (the stack frame lives in __kmpc_omp_taskwait_deps_51, the very frame that the fix for PR85963 took pains to keep alive).

If a task has a pointer to this depnode object on the stack, it means the taskwait depends on that task and the task needs to finish before the taskwait can complete.

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.

Please explain where allocating a new depnode will use the stack memory.

shiltian (Contributor) commented Feb 6, 2025

> If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.

That is exactly what I think on second thought. If allocating stack memory for a later task can overwrite the existing stack, that is definitely wrong: a stack corruption.

jtb20 (Contributor, Author) commented Feb 6, 2025

> I don't see any demonstration of the issue. Is there any test code to reproduce the issue?
>
> __kmpc_omp_taskwait_deps_51 will not return until all dependencies on the stack depnode are complete. At that point, the ref count of the depnode must be 1.
>
> If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.

I've been using this test case from OpenMP_VV:

https://github.com/OpenMP-Validation-and-Verification/OpenMP_VV/blob/master/tests/5.0/taskwait/test_taskwait_depend.c

If you modify the test to increase the number of iterations (N) to 1024000, the test fails some percentage of the time; with an unmodified compiler, I see about a 0.1% failure rate.
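For reference, here is a minimal sketch in the spirit of that test (hedged: this is my simplification, not the verbatim OpenMP_VV code; the shape of the tasks and the error check are illustrative):

    // Sketch of a taskwait-depend reproducer under heavy task creation.
    #include <cstdio>

    int main() {
      const int N = 1024000;  // the increased iteration count from above
      int errors = 0;
    #pragma omp parallel
    #pragma omp single
      for (int i = 0; i < N; ++i) {
        int a = 0, b = 0;
    #pragma omp task depend(out : a) shared(a)
        a = 1;
    #pragma omp task depend(out : b) shared(b)
        b = 1;
        // Waits only on tasks with dependences on 'a'; this is the path
        // that reaches __kmpc_omp_taskwait_deps_51 with an on-stack depnode.
    #pragma omp taskwait depend(in : a)
        if (a != 1)
          ++errors;
        // Plain taskwait so 'b' cannot go out of scope while its task runs.
    #pragma omp taskwait
      }
      std::printf("errors: %d\n", errors);
      return errors != 0;
    }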

jtb20 (Contributor, Author) commented Feb 6, 2025

> If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.
>
> That is exactly what I think on second thought. If allocating stack memory for a later task can overwrite the existing stack, that is definitely wrong: a stack corruption.

Everything works fine if task stealing is forcibly disabled. The problem is that two different threads can access the same depnode, and hence the same refcounts; that makes things go haywire. Quite rarely, but still.

One giveaway is that __kmp_node_deref sometimes tries to free a stack-allocated depnode. That happens because two different threads can be executing __kmp_release_deps at the same time, pointing at the same depnode on what is supposed to be one of the threads' local stack.

I can try to paste more debug logs from my investigation if that would be helpful, but they're not easy to follow...

jtb20 (Contributor, Author) commented Feb 6, 2025

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.
>
> Please explain where allocating a new depnode will use the stack memory.

Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that there is nothing blocking some "new" task from running that function whilst another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It'd kill any benefit from task stealing, since the stolen task would block.)

jtb20 (Contributor, Author) commented Feb 6, 2025

> The test code has a race condition. That's an issue of the application, not an issue of OpenMP.

Possibly a dependency on an undefined value, but I'm not sure there's a race condition, and I also saw failures with a simplified version of the test. Can you explain where the race condition is?

jprotze (Collaborator) commented Feb 6, 2025

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.
>
> Please explain where allocating a new depnode will use the stack memory.
>
> Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that there is nothing blocking some "new" task from running that function whilst another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It'd kill any benefit from task stealing, since the stolen task would block.)

Each active instance of __kmpc_omp_taskwait_deps_51 works on its own part of the stack. For these stack-allocated depnode objects, the cleanup code in __kmp_release_deps should never trigger. __kmpc_omp_taskwait_deps_51 owns one reference to the object, which is never explicitly released but simply dropped on return.
If the cleanup code is triggered, the root cause is somewhere else.

jtb20 (Contributor, Author) commented Feb 6, 2025

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.
>
> Please explain where allocating a new depnode will use the stack memory.
>
> Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that there is nothing blocking some "new" task from running that function whilst another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It'd kill any benefit from task stealing, since the stolen task would block.)
>
> Each active instance of __kmpc_omp_taskwait_deps_51 works on its own part of the stack. For these stack-allocated depnode objects, the cleanup code in __kmp_release_deps should never trigger. __kmpc_omp_taskwait_deps_51 owns one reference to the object, which is never explicitly released but simply dropped on return. If the cleanup code is triggered, the root cause is somewhere else.

Yes, all that is true if you don't take task stealing into account. That's where the problem lies! The cleanup code in __kmp_release_deps does trigger, even though it should not, because another thread is interfering with the reference counts.

See the first patch, and maybe try applying it by itself and running the SOLLVE test. One thread's taskdep node is definitely accessed by a different thread after task stealing takes place. With that patch (which pretty much just verifies the supposed invariants you describe), crashes can be seen far more frequently.

shiltian (Contributor) commented Feb 6, 2025

I still can't understand the issue. Even with task stealing, the stack memory for the node should still be valid, because that thread has not exited the region, so the stack variable will not be released.

jhuber6 requested a review from jpeyton52, February 6, 2025
jtb20 (Contributor, Author) commented Feb 6, 2025

> I still can't understand the issue. Even with task stealing, the stack memory for the node should still be valid, because that thread has not exited the region, so the stack variable will not be released.

Exiting the region isn't the issue as such, I don't think, but there might be something missing in my explanation. I'll try to figure out a better way to present the evidence.

jprotze (Collaborator) left a comment

Using the linked test code from SOLLVE, I triggered the assertion at kmp_taskdeps.h:31:

KMP_DEBUG_ASSERT(n >= 0);

This patch does not fix the fundamental issue. Triggering the cleanup code for stack nodes is just a symptom.

jtb20 (Contributor, Author) commented Feb 6, 2025

> Using the linked test code from SOLLVE, I triggered the assertion at kmp_taskdeps.h:31:
>
> KMP_DEBUG_ASSERT(n >= 0);
>
> This patch does not fix the fundamental issue. Triggering the cleanup code for stack nodes is just a symptom.

Is that with or without the patch?

jpeyton52 (Contributor) commented

It took me an embarrassingly long time to find this, but the timing hole is this sequence of events:

  1. THREAD 1: A regular task with dependences is created, call it T1.

  2. THREAD 1: Call into __kmpc_omp_taskwait_deps_51() and create a stack-based depnode (NULL task), call it T2 (stack).

  3. THREAD 2: Steals task T1 and executes it, getting to the __kmp_release_deps() region.

  4. THREAD 1: During processing of dependences for T2 (stack) (within the __kmp_check_deps() region), a link is created T1 -> T2. This increases T2's (stack) nrefs count.

  5. THREAD 2: Iterates through the successors list: decrements T2's (stack) npredecessor count, BUT HASN'T YET __kmp_node_deref()-ed it.

  6. THREAD 1: When finished with __kmp_check_deps(), it returns false because the npredecessor count is 0, but T2's (stack) nrefs count is 2 because THREAD 2 still references it!

  7. THREAD 1: Because __kmp_check_deps() returns false, early exit.
     Now the stack-based depnode is invalid, but THREAD 2 still references it.

We've reached improper stack-referencing behavior. Varied results/crashes/asserts can occur if THREAD 1 comes back and recreates the exact same depnode at the exact same stack address at the same time as THREAD 2 calls __kmp_node_deref().

One solution is along the lines of this patch, which is to allocate all depnodes on the heap; you may still need a deref() on the early-exit path as well.

The other is to stick another:

    // Wait until the last __kmp_release_deps is finished before we free the
    // current stack frame holding the "node" variable; once its nrefs count
    // reaches 1, we're sure nobody else can try to reference it again.
    while (node.dn.nrefs > 1)
      KMP_YIELD(TRUE);

right before the early exit if __kmp_check_deps() returns false. This isn't ideal, but that's the current state of things.
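To make the protocol concrete, here is a self-contained C++ model of the sequence above (a sketch with invented names, not the libomp source): the releasing thread has already decremented npredecessors but has not yet dropped its reference, so the early-exit path must also wait for nrefs to fall back to 1 before the frame holding the node can die.

    // Model of the depnode refcount protocol and the proposed wait loop.
    #include <atomic>
    #include <cassert>
    #include <thread>

    struct Depnode {
      std::atomic<int> nrefs{1};         // the taskwait frame owns 1 ref
      std::atomic<int> npredecessors{1};
    };

    // Models __kmp_node_deref: drop the reference held via a successor link.
    void node_deref(Depnode *n) {
      int remaining = n->nrefs.fetch_sub(1) - 1;
      assert(remaining >= 0);            // the assertion jprotze triggered
    }

    // Models THREAD 2 finishing the stolen task T1 (__kmp_release_deps).
    void release_deps(Depnode *t2) {
      t2->npredecessors.fetch_sub(1);    // step 5: predecessors hit 0 ...
      std::this_thread::yield();         // ... but the deref comes later
      node_deref(t2);
    }

    int main() {
      Depnode node;             // T2, living on the taskwait frame's stack
      node.nrefs.fetch_add(1);  // step 4: the link T1 -> T2 takes a ref
      std::thread thread2(release_deps, &node);
      // Early-exit path (npredecessors may already be 0 here). Without this
      // loop the frame, and 'node' with it, dies while thread2 holds a ref.
      while (node.nrefs.load() > 1)
        std::this_thread::yield();
      thread2.join();
      return 0;
    }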

jprotze (Collaborator) commented Feb 8, 2025

Thanks @jpeyton52 for digging into the issue.
My suggestion is to add the assertions from 812b492 (but not the fprintf).
I would prefer to keep the current stack object and add the while loop before the other return statement. The advantage is that issues like the one observed here break more obviously and can be fixed, rather than failing silently. The current patch would lead to silent memory leaks.

jtb20 (Contributor, Author) commented Feb 10, 2025

Thanks @jpeyton52 for figuring out a more coherent explanation for what's going on with the bug, and to @jprotze and @shiltian for review!

I'll prepare a new version of the patch fixing the early-exit path and adding some assertions.

jtb20 (Contributor, Author) commented Feb 10, 2025

This version adjusts the assertions to avoid conditionally growing the kmp_depnode_t type, and just adds another wait loop to __kmpc_omp_taskwait_deps_51, at @jpeyton52's and @jprotze's suggestion. Borrowing bit 0 of the refcount was an idea I had whilst investigating this bug earlier, but the last version of the patch didn't need it. Doing that (vs. adding a new "on_stack" field) allows the new assertions to have essentially zero extra cost.
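As a sketch of the bit-0 idea (hedged: the constant and helper names here are illustrative, not necessarily those in the final patch), count real references in units of 2 and keep bit 0 as a persistent "allocated on stack" marker, so the assertions need no extra field:

    // Illustrative sketch of tagging depnode refcounts with an on-stack bit.
    #include <cassert>
    #include <cstdint>

    enum : int32_t {
      NODE_REF = 2,     // one reference, counted in units of 2
      ON_STACK_BIT = 1, // bit 0: node lives on a thread's stack
    };

    static void init_node(int32_t &nrefs, bool on_stack) {
      // Initial owner reference, plus the marker for stack-allocated nodes.
      nrefs = NODE_REF | (on_stack ? ON_STACK_BIT : 0);
    }

    static bool node_is_on_stack(int32_t nrefs) {
      return (nrefs & ON_STACK_BIT) != 0; // survives adds/subs of NODE_REF
    }

    int main() {
      int32_t stack_refs, heap_refs;
      init_node(stack_refs, true);
      init_node(heap_refs, false);
      stack_refs += NODE_REF; // taking a reference preserves the marker bit
      assert(node_is_on_stack(stack_refs));
      // A free path must never see the stack bit set on a heap node:
      assert(!node_is_on_stack(heap_refs));
      return 0;
    }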

This patch fixes a bug that causes crashes with OpenMP 'taskwait'
directives in heavily multi-threaded scenarios.

Task stealing can lead to a situation where references to an on-stack
'taskwait' dependency node remain even for the early-exit path in
__kmpc_omp_taskwait_deps_51.  This patch adds a wait loop to ensure the
function does not return before such references are decremented to 1,
along similar lines to the fix for PR85963.

Several new assertions are also added for safety, borrowing bit 0 of the
depnode refcount as a low-cost way of distinguishing heap-allocated from
stack-allocated depnodes.
jtb20 requested a review from jprotze, February 11, 2025
jprotze (Collaborator) left a comment

lgtm

jtb20 (Contributor, Author) commented Feb 11, 2025

> lgtm

Thank you! I don't have write access, so if someone could merge this for me, that'd be much appreciated.

jprotze pushed a commit that referenced this pull request Feb 14, 2025
This patch series demonstrates and fixes a bug that causes crashes with
OpenMP 'taskwait' directives in heavily multi-threaded scenarios.

TLDR: The early return from __kmpc_omp_taskwait_deps_51 missed the
synchronization mechanism in place for the late return.

Additional debug assertions check for the implied invariants of the code.

@jpeyton52 found the timing hole as this sequence of events:

> 1. THREAD 1: A regular task with dependences is created, call it T1.
> 2. THREAD 1: Call into `__kmpc_omp_taskwait_deps_51()` and create a stack-based depnode (`NULL` task), call it T2 (stack).
> 3. THREAD 2: Steals task T1 and executes it, getting to the `__kmp_release_deps()` region.
> 4. THREAD 1: During processing of dependences for T2 (stack) (within the `__kmp_check_deps()` region), a link is created T1 -> T2. This increases T2's (stack) `nrefs` count.
> 5. THREAD 2: Iterates through the successors list: decrements T2's (stack) npredecessor count, BUT HASN'T YET `__kmp_node_deref()`-ed it.
> 6. THREAD 1: When finished with `__kmp_check_deps()`, it returns false because the npredecessor count is 0, but T2's (stack) `nrefs` count is 2 because THREAD 2 still references it!
> 7. THREAD 1: Because `__kmp_check_deps()` returns false, early exit.
>    _Now the stack-based depnode is invalid, but THREAD 2 still references it._
>
> We've reached improper stack-referencing behavior. Varied results/crashes/asserts can occur if THREAD 1 comes back and recreates the exact same depnode at the exact same stack address at the same time as THREAD 2 calls `__kmp_node_deref()`.
jprotze (Collaborator) commented Feb 14, 2025

Merging with the web UI failed twice, so I manually pushed the commit.

jprotze closed this Feb 14, 2025
jtb20 (Contributor, Author) commented Feb 14, 2025

Thank you!

joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025