[OpenMP] Fix work-stealing stack clobber with taskwait #126049

Closed
wants to merge 1 commit

Conversation

jtb20 (Contributor) commented Feb 6, 2025

This patch series demonstrates and fixes a bug that causes crashes with OpenMP 'taskwait' directives in heavily multi-threaded scenarios.

The implementation of taskwait builds a graph of dependency nodes for tasks. Some of those dependency nodes (kmp_depnode_t) are allocated on the stack, and some on the heap. In the former case, the stack is specific to a given thread, and the task associated with the node is initially bound to that same thread. This works as long as each task runs on the thread whose stack holds its node.

However, kmp_tasking.cpp:__kmp_execute_tasks_template implements a work-stealing algorithm that can take a task 'T1' from one thread's ready queue (say, thread 'A') and execute it on another thread (say, thread 'B').

If that happens, task T1 may have a dependency node on thread A's stack, and that will not be moved to thread B's stack when the work-stealing takes place.

Now, in a heavily multi-threaded program, another task, T2, can be invoked on thread 'A', reusing the stack slot on thread A at the same time that T1, now running on thread 'B', is still using that slot. This leads to random crashes, often (but not always) during dependency-node cleanup (__kmp_release_deps).
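To make the clobber concrete, here is a minimal, self-contained C++ model of the failure mode (the names Depnode, slot, etc. are illustrative, not the libomp source): thread A reuses a "stack slot" for a new node while thread B, which stole the first task, still writes through a stale pointer into it.

    // Minimal model of the stack-slot clobber (illustrative names only).
    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct Depnode {
      std::atomic<int> npredecessors{0};
    };

    int main() {
      Depnode slot;            // plays the role of a stack slot on thread A
      slot.npredecessors = 1;  // depnode for task T1

      std::thread b([&slot] {  // thread B stole T1 and now releases its deps
        slot.npredecessors.fetch_sub(1); // writes into thread A's "stack"
      });

      slot.npredecessors = 1;  // thread A reuses the slot for task T2;
                               // this races with B's decrement above
      b.join();
      // Depending on the interleaving, T2's node has 0 or 1 predecessors.
      std::printf("npredecessors = %d\n", slot.npredecessors.load());
      return 0;
    }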

llvmbot added the openmp:libomp (OpenMP host runtime) label Feb 6, 2025

github-actions bot commented Feb 6, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

jtb20 (Contributor, Author) commented Feb 6, 2025

@shiltian I think I saw you've done some work in this area too?

shiltian (Contributor) left a comment

TBH, at first glance I thought the fix made sense, but after looking at it more, I'm not sure I understand the underlying issue. I don't follow why and when a node could be "overridden". Isn't that stack corruption?

jtb20 (Contributor, Author) commented Feb 6, 2025

> TBH, at first glance I thought the fix made sense, but after looking at it more, I'm not sure I understand the underlying issue. I don't follow why and when a node could be "overridden". Isn't that stack corruption?

Yes, it's a stack-corruption bug. Two different tasks on two different threads access the same, supposedly thread-local stack belonging to one of those threads. E.g.:

In kmp_tasking.cpp:__kmp_execute_tasks_template, the first thread steals a task (__kmp_steal_task), then immediately executes it (__kmp_invoke_task). __kmp_invoke_task calls __kmp_release_deps via __kmp_task_finish.

The second thread executes via the non-task-stealing path: __kmp_remove_my_task called from __kmp_execute_tasks_template, then __kmp_invoke_task, etc., as above.

But, on the task-stealing path, the task's depnode pointer still points to the stack of its original thread, not the thread it actually executes on (the stack frame lives in __kmpc_omp_taskwait_deps_51, the very frame that the fix for PR85963 took pains to keep alive). So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.

An alternative fix might be to add more locking in an appropriate place, but that seems like it'd be more error-prone and probably slower.

jprotze (Collaborator) commented Feb 6, 2025

I don't see any demonstration of the issue. Is there any test code to reproduce the issue?

__kmpc_omp_taskwait_deps_51 will not return until all dependencies on the stack depnode are complete. At that point, the ref count of the depnode must be 1.

If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.

jprotze (Collaborator) commented Feb 6, 2025

> But, on the task-stealing path, the task's depnode pointer still points to the stack of its original thread, not the thread it actually executes on (the stack frame lives in __kmpc_omp_taskwait_deps_51, the very frame that the fix for PR85963 took pains to keep alive).

If a task has a pointer to this depnode object on the stack, it means the taskwait depends on that task and the task needs to finish before the taskwait can complete.

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.

Please explain where allocating a new depnode will use the stack memory.

shiltian (Contributor) commented Feb 6, 2025

> If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.

That is exactly what I think on second thought. If allocating stack memory for a later task can overwrite the existing stack, that is definitely wrong: a stack corruption.

jtb20 (Contributor, Author) commented Feb 6, 2025

> I don't see any demonstration of the issue. Is there any test code to reproduce the issue?
>
> __kmpc_omp_taskwait_deps_51 will not return until all dependencies on the stack depnode are complete. At that point, the ref count of the depnode must be 1.
>
> If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.

I've been using this test case from OpenMP_VV:

https://github.com/OpenMP-Validation-and-Verification/OpenMP_VV/blob/master/tests/5.0/taskwait/test_taskwait_depend.c

If you modify the test to increase the number of iterations (N) to 1024000, the test fails some percentage of the time; with an unmodified compiler, I see about a 0.1% failure rate.
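For reference, here is a minimal sketch in the spirit of that test (hedged: this is my simplification, not the verbatim OpenMP_VV code; the shape of the tasks and the error check are illustrative):

    // Sketch of a taskwait-depend reproducer under heavy task creation.
    #include <cstdio>

    int main() {
      const int N = 1024000;  // the increased iteration count from above
      int errors = 0;
    #pragma omp parallel
    #pragma omp single
      for (int i = 0; i < N; ++i) {
        int a = 0, b = 0;
    #pragma omp task depend(out : a) shared(a)
        a = 1;
    #pragma omp task depend(out : b) shared(b)
        b = 1;
        // Waits only on tasks with dependences on 'a'; this is the path
        // that reaches __kmpc_omp_taskwait_deps_51 with an on-stack depnode.
    #pragma omp taskwait depend(in : a)
        if (a != 1)
          ++errors;
        // Plain taskwait so 'b' cannot go out of scope while its task runs.
    #pragma omp taskwait
      }
      std::printf("errors: %d\n", errors);
      return errors != 0;
    }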

jtb20 (Contributor, Author) commented Feb 6, 2025

> If you run into issues at this part of the code, the depnode on the stack is not the issue; something elsewhere in the code is causing a reference-counting error.
>
> That is exactly what I think on second thought. If allocating stack memory for a later task can overwrite the existing stack, that is definitely wrong: a stack corruption.

Everything works fine if task stealing is forcibly disabled. The problem is that two different threads can access the same depnode, and hence the same refcounts; that makes things go haywire. Quite rarely, but still.

One giveaway is that __kmp_node_deref sometimes tries to free a stack-allocated depnode. That happens because two different threads can be executing __kmp_release_deps at the same time, pointing at the same depnode on what is supposed to be one of the threads' local stack.

I can try to paste more debug logs from my investigation if that would be helpful, but they're not easy to follow...

jtb20 (Contributor, Author) commented Feb 6, 2025

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.
>
> Please explain where allocating a new depnode will use the stack memory.

Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that there is nothing blocking some "new" task from running that function whilst another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It'd kill any benefit from task stealing, since the stolen task would block.)

jtb20 (Contributor, Author) commented Feb 6, 2025

> The test code has a race condition. That's an issue of the application, not an issue of OpenMP.

Possibly a dependency on an undefined value, but I'm not sure there's a race condition, and I also saw failures with a simplified version of the test. Can you explain where the race condition is?

jprotze (Collaborator) commented Feb 6, 2025

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.
>
> Please explain where allocating a new depnode will use the stack memory.
>
> Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that there is nothing blocking some "new" task from running that function whilst another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It'd kill any benefit from task stealing, since the stolen task would block.)

Each active instance of __kmpc_omp_taskwait_deps_51 works on its own part of the stack. For these stack-allocated depnode objects, the cleanup code in __kmp_release_deps should never trigger. __kmpc_omp_taskwait_deps_51 owns one reference to the object, which is never explicitly released but simply dropped on return.
If the cleanup code is triggered, the root cause is somewhere else.

jtb20 (Contributor, Author) commented Feb 6, 2025

> So when the second thread comes along and allocates a "new" depnode, it's actually using a chunk of (stack) memory that is still in use by the first thread.
>
> Please explain where allocating a new depnode will use the stack memory.
>
> Exactly in __kmpc_omp_taskwait_deps_51, which can be invoked simultaneously by any of the threads executing tasks. The trouble is that there is nothing blocking some "new" task from running that function whilst another (stolen) task is, e.g., performing cleanup in __kmp_release_deps on another thread. (Again, adding that locking explicitly is probably possible, but I don't think it's a good idea. It'd kill any benefit from task stealing, since the stolen task would block.)
>
> Each active instance of __kmpc_omp_taskwait_deps_51 works on its own part of the stack. For these stack-allocated depnode objects, the cleanup code in __kmp_release_deps should never trigger. __kmpc_omp_taskwait_deps_51 owns one reference to the object, which is never explicitly released but simply dropped on return. If the cleanup code is triggered, the root cause is somewhere else.

Yes, all that is true if you don't take task stealing into account. That's where the problem lies! The cleanup code in __kmp_release_deps does trigger, even though it should not, because another thread is interfering with the reference counts.

See the first patch, and maybe try applying it by itself and running the SOLLVE test. One thread's taskdep node is definitely accessed by a different thread after task stealing takes place. With that patch (which pretty much just verifies the supposed invariants you describe), crashes can be seen far more frequently.

shiltian (Contributor) commented Feb 6, 2025

I still can't understand the issue. Even with task stealing, the stack memory for the node should still be valid, because that thread has not exited the region, so the stack variable will not be released.

jhuber6 requested a review from jpeyton52, February 6, 2025
jtb20 (Contributor, Author) commented Feb 6, 2025

> I still can't understand the issue. Even with task stealing, the stack memory for the node should still be valid, because that thread has not exited the region, so the stack variable will not be released.

Exiting the region isn't the issue as such, I don't think, but there might be something missing in my explanation. I'll try to figure out a better way to present the evidence.

jprotze (Collaborator) left a comment

Using the linked test code from SOLLVE, I triggered the assertion at kmp_taskdeps.h:31:

KMP_DEBUG_ASSERT(n >= 0);

This patch does not fix the fundamental issue. Triggering the cleanup code for stack nodes is just a symptom.

jtb20 (Contributor, Author) commented Feb 6, 2025

> Using the linked test code from SOLLVE, I triggered the assertion at kmp_taskdeps.h:31:
>
> KMP_DEBUG_ASSERT(n >= 0);
>
> This patch does not fix the fundamental issue. Triggering the cleanup code for stack nodes is just a symptom.

Is that with or without the patch?

jpeyton52 (Contributor) commented

It took me an embarrassingly long time to find this, but the timing hole is this sequence of events:

  1. THREAD 1: A regular task with dependences is created, call it T1.

  2. THREAD 1: Call into __kmpc_omp_taskwait_deps_51() and create a stack-based depnode (NULL task), call it T2 (stack).

  3. THREAD 2: Steals task T1 and executes it, getting to the __kmp_release_deps() region.

  4. THREAD 1: During processing of dependences for T2 (stack) (within the __kmp_check_deps() region), a link is created T1 -> T2. This increases T2's (stack) nrefs count.

  5. THREAD 2: Iterates through the successors list: decrements T2's (stack) npredecessor count, BUT HASN'T YET __kmp_node_deref()-ed it.

  6. THREAD 1: When finished with __kmp_check_deps(), it returns false because the npredecessor count is 0, but T2's (stack) nrefs count is 2 because THREAD 2 still references it!

  7. THREAD 1: Because __kmp_check_deps() returns false, early exit.
     Now the stack-based depnode is invalid, but THREAD 2 still references it.

We've reached improper stack-referencing behavior. Varied results/crashes/asserts can occur if THREAD 1 comes back and recreates the exact same depnode at the exact same stack address at the same time as THREAD 2 calls __kmp_node_deref().

One solution is along the lines of this patch, which is to allocate all depnodes on the heap; you may still need a deref() on the early-exit path as well.

The other is to stick another:

    // Wait until the last __kmp_release_deps is finished before we free the
    // current stack frame holding the "node" variable; once its nrefs count
    // reaches 1, we're sure nobody else can try to reference it again.
    while (node.dn.nrefs > 1)
      KMP_YIELD(TRUE);

right before the early exit if __kmp_check_deps() returns false. This isn't ideal, but that's the current state of things.
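To make the protocol concrete, here is a self-contained C++ model of the sequence above (a sketch with invented names, not the libomp source): the releasing thread has already decremented npredecessors but has not yet dropped its reference, so the early-exit path must also wait for nrefs to fall back to 1 before the frame holding the node can die.

    // Model of the depnode refcount protocol and the proposed wait loop.
    #include <atomic>
    #include <cassert>
    #include <thread>

    struct Depnode {
      std::atomic<int> nrefs{1};         // the taskwait frame owns 1 ref
      std::atomic<int> npredecessors{1};
    };

    // Models __kmp_node_deref: drop the reference held via a successor link.
    void node_deref(Depnode *n) {
      int remaining = n->nrefs.fetch_sub(1) - 1;
      assert(remaining >= 0);            // the assertion jprotze triggered
    }

    // Models THREAD 2 finishing the stolen task T1 (__kmp_release_deps).
    void release_deps(Depnode *t2) {
      t2->npredecessors.fetch_sub(1);    // step 5: predecessors hit 0 ...
      std::this_thread::yield();         // ... but the deref comes later
      node_deref(t2);
    }

    int main() {
      Depnode node;             // T2, living on the taskwait frame's stack
      node.nrefs.fetch_add(1);  // step 4: the link T1 -> T2 takes a ref
      std::thread thread2(release_deps, &node);
      // Early-exit path (npredecessors may already be 0 here). Without this
      // loop the frame, and 'node' with it, dies while thread2 holds a ref.
      while (node.nrefs.load() > 1)
        std::this_thread::yield();
      thread2.join();
      return 0;
    }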

jprotze (Collaborator) commented Feb 8, 2025

Thanks @jpeyton52 for digging into the issue.
My suggestion is to add the assertions from 812b492 (but not the fprintf).
I would prefer to keep the current stack object and add the while loop before the other return statement. The advantage is that issues like the one observed here break more obviously and can be fixed, rather than failing silently. The current patch would lead to silent memory leaks.

jtb20 (Contributor, Author) commented Feb 10, 2025

Thanks @jpeyton52 for figuring out a more coherent explanation for what's going on with the bug, and to @jprotze and @shiltian for review!

I'll prepare a new version of the patch fixing the early-exit path and adding some assertions.

jtb20 (Contributor, Author) commented Feb 10, 2025

This version adjusts the assertions to avoid conditionally growing the kmp_depnode_t type, and just adds another wait loop to __kmpc_omp_taskwait_deps_51, at @jpeyton52's and @jprotze's suggestion. Borrowing bit 0 of the refcount was an idea I had whilst investigating this bug earlier, but the last version of the patch didn't need it. Doing that (vs. adding a new "on_stack" field) allows the new assertions to have essentially zero extra cost.
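As a sketch of the bit-0 idea (hedged: the constant and helper names here are illustrative, not necessarily those in the final patch), count real references in units of 2 and keep bit 0 as a persistent "allocated on stack" marker, so the assertions need no extra field:

    // Illustrative sketch of tagging depnode refcounts with an on-stack bit.
    #include <cassert>
    #include <cstdint>

    enum : int32_t {
      NODE_REF = 2,     // one reference, counted in units of 2
      ON_STACK_BIT = 1, // bit 0: node lives on a thread's stack
    };

    static void init_node(int32_t &nrefs, bool on_stack) {
      // Initial owner reference, plus the marker for stack-allocated nodes.
      nrefs = NODE_REF | (on_stack ? ON_STACK_BIT : 0);
    }

    static bool node_is_on_stack(int32_t nrefs) {
      return (nrefs & ON_STACK_BIT) != 0; // survives adds/subs of NODE_REF
    }

    int main() {
      int32_t stack_refs, heap_refs;
      init_node(stack_refs, true);
      init_node(heap_refs, false);
      stack_refs += NODE_REF; // taking a reference preserves the marker bit
      assert(node_is_on_stack(stack_refs));
      // A free path must never see the stack bit set on a heap node:
      assert(!node_is_on_stack(heap_refs));
      return 0;
    }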

This patch fixes a bug that causes crashes with OpenMP 'taskwait'
directives in heavily multi-threaded scenarios.

Task stealing can lead to a situation where references to an on-stack
'taskwait' dependency node remain even for the early-exit path in
__kmpc_omp_taskwait_deps_51.  This patch adds a wait loop to ensure the
function does not return before such references are decremented to 1,
along similar lines to the fix for PR85963.

Several new assertions are also added for safety, borrowing bit 0 of the
depnode refcount as a low-cost way of distinguishing heap-allocated from
stack-allocated depnodes.
jtb20 requested a review from jprotze, February 11, 2025
jprotze (Collaborator) left a comment

lgtm

jtb20 (Contributor, Author) commented Feb 11, 2025

> lgtm

Thank you! I don't have write access, so if someone could merge this for me, that'd be much appreciated.

jprotze pushed a commit that referenced this pull request Feb 14, 2025
This patch series demonstrates and fixes a bug that causes crashes with
OpenMP 'taskwait' directives in heavily multi-threaded scenarios.

TLDR: The early return from __kmpc_omp_taskwait_deps_51 missed the
synchronization mechanism in place for the late return.

Additional debug assertions check for the implied invariants of the code.

@jpeyton52 found the timing hole as this sequence of events:

> 1. THREAD 1: A regular task with dependences is created, call it T1.
> 2. THREAD 1: Call into `__kmpc_omp_taskwait_deps_51()` and create a stack-based depnode (`NULL` task), call it T2 (stack).
> 3. THREAD 2: Steals task T1 and executes it, getting to the `__kmp_release_deps()` region.
> 4. THREAD 1: During processing of dependences for T2 (stack) (within the `__kmp_check_deps()` region), a link is created T1 -> T2. This increases T2's (stack) `nrefs` count.
> 5. THREAD 2: Iterates through the successors list: decrements T2's (stack) npredecessor count, BUT HASN'T YET `__kmp_node_deref()`-ed it.
> 6. THREAD 1: When finished with `__kmp_check_deps()`, it returns false because the npredecessor count is 0, but T2's (stack) `nrefs` count is 2 because THREAD 2 still references it!
> 7. THREAD 1: Because `__kmp_check_deps()` returns false, early exit.
>    _Now the stack-based depnode is invalid, but THREAD 2 still references it._
>
> We've reached improper stack-referencing behavior. Varied results/crashes/asserts can occur if THREAD 1 comes back and recreates the exact same depnode at the exact same stack address at the same time as THREAD 2 calls `__kmp_node_deref()`.
jprotze (Collaborator) commented Feb 14, 2025

Merging with the web UI failed twice, so I manually pushed the commit.

jprotze closed this Feb 14, 2025
jtb20 (Contributor, Author) commented Feb 14, 2025

Thank you!

joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025