`when_all`/`split`: operation state might get prematurely destroyed when child completes synchronously inside its stop callback

I may have stumbled across a nasty lifetime issue in the handling of stop callbacks in the `when_all`/`split` algorithms. But this would apply to any algorithm using a `inplace_stop_source` inside its operation state.

`when_all` has a `inplace_stop_source` inside its operation state, and then a stop callback, like this:

```c++
struct when_all_opstate {
    // ...
    inplace_stop_source stop_src{};
    // ...
    optional<stop_callback> on_stop{nullopt};
    // ...
};
```

In `on_stop`, a stop callback is registered, which calls `stop_src.request_stop()` when there is a stop request on the receiver's stop token. All child senders of the `when_all` are registered on the `stop_src`. This propagates the stop request from the receiver to all children of the `when_all` sender.

Now, what I observed is the following chain of events:

- receiver's stop token triggers
- `on_stop` callback calls `stop_src.stop_requested()`
- the `stop_src` now iterates over its list of registered stop callbacks (those are the ones from the children) (*)
- the last child calls `set_stopped` _synchronously_ from inside its stop callback. In my case it's from deregistering a sleep timer from an self-written epoll-based "io context", something like this:
    ```c++
                struct stop_cb {
                    opstate* op;
                    void operator()() noexcept
                    {
                        if (op->loop_->deregister_deadline(op)) {
                            op->set_stopped();
                        } else {
                            // lost the race, main loop will call `set_value()`
                        }
                    }
                };
    ```
- now, `arrive()` of the `when_all` opstate gets called
- `complete()` of the `when_all` opstate gets called
- `ex::set_stopped` is called, satisfying the receiver contract of the `when_all` sender
- the `when_all` opstate is _synchronously_ (!) destroyed from inside the `set_stopped` callback by some follow up work.
-> UB, since we are still iterating over the list of registered stop callbacks of `stop_src`! (the line marked with "*" above)

I've managed to "hack around" it by doing some checking of thread id's and deferring the completion to the stop callback if I detect that the completion is called synchronously from inside a stop callback. So something like this:

```c++
template <typename Opstate>
struct on_stop_request_with_thread_id {
public:
    void operator()() const noexcept
    {
        id_->store(std::this_thread::get_id(), std::memory_order::relaxed);
        op_->request_stop();
        if (id_->load(std::memory_order::relaxed) == std::thread::id{}) {
            op_->deferred_complete();
            return;
        }
        id_->store(std::thread::id{}, std::memory_order::relaxed);
    }

public: // NOLINT enable aggregate initialization
    Opstate* op_;
    std::atomic<std::thread::id>* id_;
};

export template <typename Opstate, typename Token>
struct stop_callback_with_thread_id {
public:
    bool should_defer_completion()
    {
        if (id_.load(std::memory_order::relaxed) != std::this_thread::get_id()) {
            return false;
        }

        id_.store(std::thread::id{}, std::memory_order::relaxed);
        return true;
    }

    void reset() { on_stop_.reset(); }

    void emplace(Token token, Opstate* op)
    {
        on_stop_.emplace(std::move(token), on_stop_request_with_thread_id<Opstate>{op, &id_});
    }

private:
    using stop_callback_t = ex::stop_callback_for_t<Token, on_stop_request_with_thread_id<Opstate>>;

    std::atomic<std::thread::id> id_{};
    std::optional<stop_callback_t> on_stop_ = {};
};
```

...and then using `stop_callback_with_thread_id` instead of `optional<stop_callback>` inside `when_all`'s opstate, and having a `complete()` like this:

```c++
            void complete(Rcvr& rcvr) noexcept
            {
                // If we are completing synchronously from inside a stop
                // callback, defer completion.
                if (on_stop.should_defer_completion()) {
                    this->deferred_rcvr = &rcvr;
                    return;
                }

                complete_impl(rcvr);
            }
```

This all feels very hacky to me, though.

I haven't deeply investigated `split` yet, but I think the solution could be a bit simpler there, by using a stop callback like this instead of `on-stop-request`:

```c++
template <typename SharedState>
struct split_on_stop_request {
    void operator()() const noexcept
    {
        shared_state_->inc_ref();
        shared_state_->stop_src.request_stop();
        shared_state_->dec_ref();
    }

    SharedState* shared_state_;
};
```
...i.e. just wrapping the `request_stop()` between `inc_ref/dec_ref` to ensure the opstate object stays alive long enough.

I do wonder if there is a more elegant way to solve this issue. I don't think synchronous completions from stop callbacks should be outlawed -- it seems "natural" to me to do the `set_stopped` right inside the stop callback if possible. Or maybe synchronous destruction from inside the `set_stopped` completion of `when_all` is the problem? I've thought that you _have to_ assume the lifetime of the opstate may end when calling the completion, though.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`when_all`/`split`: operation state might get prematurely destroyed when child completes synchronously inside its stop callback #300

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

when_all/split: operation state might get prematurely destroyed when child completes synchronously inside its stop callback #300

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`when_all`/`split`: operation state might get prematurely destroyed when child completes synchronously inside its stop callback #300