Description
I may have stumbled across a nasty lifetime issue in the handling of stop callbacks in the when_all
/split
algorithms. But this would apply to any algorithm using a inplace_stop_source
inside its operation state.
when_all
has a inplace_stop_source
inside its operation state, and then a stop callback, like this:
struct when_all_opstate {
// ...
inplace_stop_source stop_src{};
// ...
optional<stop_callback> on_stop{nullopt};
// ...
};
In on_stop
, a stop callback is registered, which calls stop_src.request_stop()
when there is a stop request on the receiver's stop token. All child senders of the when_all
are registered on the stop_src
. This propagates the stop request from the receiver to all children of the when_all
sender.
Now, what I observed is the following chain of events:
- receiver's stop token triggers
on_stop
callback callsstop_src.stop_requested()
- the
stop_src
now iterates over its list of registered stop callbacks (those are the ones from the children) (*) - the last child calls
set_stopped
synchronously from inside its stop callback. In my case it's from deregistering a sleep timer from an self-written epoll-based "io context", something like this:struct stop_cb { opstate* op; void operator()() noexcept { if (op->loop_->deregister_deadline(op)) { op->set_stopped(); } else { // lost the race, main loop will call `set_value()` } } };
- now,
arrive()
of thewhen_all
opstate gets called complete()
of thewhen_all
opstate gets calledex::set_stopped
is called, satisfying the receiver contract of thewhen_all
sender- the
when_all
opstate is synchronously (!) destroyed from inside theset_stopped
callback by some follow up work.
-> UB, since we are still iterating over the list of registered stop callbacks ofstop_src
! (the line marked with "*" above)
I've managed to "hack around" it by doing some checking of thread id's and deferring the completion to the stop callback if I detect that the completion is called synchronously from inside a stop callback. So something like this:
template <typename Opstate>
struct on_stop_request_with_thread_id {
public:
void operator()() const noexcept
{
id_->store(std::this_thread::get_id(), std::memory_order::relaxed);
op_->request_stop();
if (id_->load(std::memory_order::relaxed) == std::thread::id{}) {
op_->deferred_complete();
return;
}
id_->store(std::thread::id{}, std::memory_order::relaxed);
}
public: // NOLINT enable aggregate initialization
Opstate* op_;
std::atomic<std::thread::id>* id_;
};
export template <typename Opstate, typename Token>
struct stop_callback_with_thread_id {
public:
bool should_defer_completion()
{
if (id_.load(std::memory_order::relaxed) != std::this_thread::get_id()) {
return false;
}
id_.store(std::thread::id{}, std::memory_order::relaxed);
return true;
}
void reset() { on_stop_.reset(); }
void emplace(Token token, Opstate* op)
{
on_stop_.emplace(std::move(token), on_stop_request_with_thread_id<Opstate>{op, &id_});
}
private:
using stop_callback_t = ex::stop_callback_for_t<Token, on_stop_request_with_thread_id<Opstate>>;
std::atomic<std::thread::id> id_{};
std::optional<stop_callback_t> on_stop_ = {};
};
...and then using stop_callback_with_thread_id
instead of optional<stop_callback>
inside when_all
's opstate, and having a complete()
like this:
void complete(Rcvr& rcvr) noexcept
{
// If we are completing synchronously from inside a stop
// callback, defer completion.
if (on_stop.should_defer_completion()) {
this->deferred_rcvr = &rcvr;
return;
}
complete_impl(rcvr);
}
This all feels very hacky to me, though.
I haven't deeply investigated split
yet, but I think the solution could be a bit simpler there, by using a stop callback like this instead of on-stop-request
:
template <typename SharedState>
struct split_on_stop_request {
void operator()() const noexcept
{
shared_state_->inc_ref();
shared_state_->stop_src.request_stop();
shared_state_->dec_ref();
}
SharedState* shared_state_;
};
...i.e. just wrapping the request_stop()
between inc_ref/dec_ref
to ensure the opstate object stays alive long enough.
I do wonder if there is a more elegant way to solve this issue. I don't think synchronous completions from stop callbacks should be outlawed -- it seems "natural" to me to do the set_stopped
right inside the stop callback if possible. Or maybe synchronous destruction from inside the set_stopped
completion of when_all
is the problem? I've thought that you have to assume the lifetime of the opstate may end when calling the completion, though.