Description
Issue by gevtushenko
Sunday Feb 27, 2022 at 01:05 GMT
Originally opened as NVIDIA/stdexec#475
A few important use cases for P2300 require clarification of its cooperative semantics. A cooperative API involves multiple threads working together towards a shared goal. For instance, consider the following function:
```cpp
void f(int tid, auto &scheduler) {
  auto snd = schedule(scheduler)
           | then([tid] { printf("{t%d}", tid); })
           | bulk(2, [tid](int i) { printf("{b%d:%d}", tid, i); });
  printf("~");
  sync_wait(snd);
}
```
If two threads execute the code above with an inline scheduler, `f(tid, inline_scheduler)`, we'll get some interleaving of the following characters:

```
~~{t0}{t1}{b0:0}{b0:1}{b1:0}{b1:1}
```
In other words, `then` is executed by each thread, as is `bulk`, which is expected. In contrast, an inline cooperative scheduler, `f(tid, inline_coop_scheduler)`, would lead to the following result:

```
~~{t0}{b0:0}{b1:1}
```

Here `then` is specialized to execute its work only once, and `bulk` distributes its iterations between the participating threads. This approach allows representing cooperating threads as a single execution context without the overhead of maintaining a task queue.
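To make these semantics concrete, the following is a minimal sketch in plain C++20 of what an inline cooperative scheduler effectively does, with no P2300 machinery involved: the `then` work runs on exactly one participant, and the two `bulk` iterations are split statically between the participants. The round-robin split and the barrier-based synchronization are assumptions of the sketch, not requirements:

```cpp
// Sketch only: models the cooperative semantics with raw threads.
#include <barrier>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Each participating thread runs this with its id in [0, n_threads).
void coop_task(int tid, int n_threads, std::barrier<> &sync) {
  // `then` semantics: the continuation runs exactly once.
  if (tid == 0)
    std::printf("{t%d}", tid);
  sync.arrive_and_wait(); // all participants observe `then`'s effect

  // `bulk(2, f)` semantics: iterations are statically partitioned
  // between participants (round-robin here, as an assumption).
  const int n_items = 2;
  for (int i = tid; i < n_items; i += n_threads)
    std::printf("{b%d:%d}", tid, i);
}

int main() {
  const int n_threads = 2;
  std::barrier<> sync(n_threads);
  std::vector<std::jthread> threads;
  for (int tid = 0; tid < n_threads; ++tid)
    threads.emplace_back(coop_task, tid, n_threads, std::ref(sync));
}
```

Each participant derives its share of the work from its id alone, which is why no task queue is needed.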
motivation
- distributed context:

  Let's consider the following sender adaptor:

  ```cpp
  sender auto compute(auto &computer) {
    return schedule(computer)
         | bulk(n_cells, process_cell)
         | then(done_once)
         | transfer(inline_scheduler{})
         | then(write);
  }
  ```
  If it adapts an inline scheduler, the calling thread processes all `n_cells`. A thread pool scheduler represents a set of threads as a single execution resource, so `then` would be executed once and `bulk` would process `n_cells` in a federated manner to achieve some speedup. Extending this idea, we arrive at a distributed scheduler, which would partition `n_cells` between multiple nodes of a distributed system. Although the task-based programming model is a known approach to distributed programming, static information can improve performance by reducing the amount of task distribution. This leads us to a cooperative distributed scheduler:

  ```cpp
  int main() {
    // Access runtime to query process id and number of processes
    coop_distributed_scheduler scheduler{};
    sync_wait(compute(scheduler));
  }
  // mpirun -np 2 ./compute
  ```
  Note that we can still achieve the effect of performing `then` on each cooperating executor by `transfer`-ing to an `inline_scheduler` (this is what `compute` does before `then(write)`). The static partitioning described above is sketched below with raw MPI.
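  As a hypothetical illustration (neither `coop_distributed_scheduler` nor the block partition below is prescribed by P2300; both are assumptions), this is the division of labor such a scheduler could perform, expressed with plain MPI:

  ```cpp
  // Sketch only: what the cooperative distributed bulk boils down to.
  #include <mpi.h>
  #include <cstdio>

  void process_cell(int /*cell*/) { /* stand-in for real work */ }

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, n_ranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    const int n_cells = 1024;
    // Static block partition: each rank derives its slice from the
    // runtime-provided (rank, size) pair; no tasks are distributed.
    const int begin = rank * n_cells / n_ranks;
    const int end = (rank + 1) * n_cells / n_ranks;
    for (int cell = begin; cell < end; ++cell)
      process_cell(cell);

    if (rank == 0)
      std::printf("done once\n"); // `then(done_once)` runs on one rank

    MPI_Finalize();
  }
  // mpirun -np 2 ./compute
  ```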
. -
locality:
Assigning a thread to a particular execution resource might reduce the number of context switches, which affects performance. For the code above, we might use a multi-GPU scheduler:
int main() { // Switches between GPUs internally multi_gpu_scheduler scheduler{}; sync_wait(scheduler); }
  Performance might be improved if we assign each thread to a particular GPU (a raw CUDA/OpenMP sketch of this pinning follows):

  ```cpp
  int main() {
    #pragma omp parallel
    {
      // No GPU context switches
      coop_multi_gpu_scheduler scheduler{};
      sync_wait(compute(scheduler));
    }
  }
  ```
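  Concretely, the per-thread GPU binding such a scheduler could perform internally might look like the following sketch. This is a minimal sketch assuming one OpenMP thread per device, not an implementation of `coop_multi_gpu_scheduler`; only the `cudaSetDevice`/`omp_get_thread_num` pairing is the point:

  ```cpp
  // Sketch only: bind each CPU thread to one GPU exactly once.
  #include <cuda_runtime.h>
  #include <omp.h>

  int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    #pragma omp parallel num_threads(n_gpus)
    {
      // Each thread selects "its" device up front, so later submissions
      // from this thread incur no GPU context switches.
      cudaSetDevice(omp_get_thread_num());
      // ... submit this thread's share of the work to its GPU ...
    }
  }
  ```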
- nesting:

  The following code represents a case of executing a sender in both cooperative and inline contexts. It's expected to produce the same result in cases (1) and (2), without the overhead of dealing with a task queue.

  ```cpp
  assert(get_forward_progress_guarantee(scheduler) ==
         forward_progress_guarantee::concurrent);
  sync_wait(
    schedule(scheduler)
    | bulk(2, [](int thread_num) {
        inline_cooperative_scheduler sub_scheduler{thread_num, 2};
        // per-thread prologue
        sync_wait(schedule(sub_scheduler) | compute()); // 1
        // per-thread epilogue
      }));
  sync_wait(schedule(scheduler) | compute()); // 2
  ```
  Having inline behavior in those contexts would change the sender's behavior. Providing separate cooperative versions of `then` and `bulk` would limit code reuse, since a sender author would have to know whether they are developing code for a cooperative context.
goals
- collect feedback and use cases from other fields
- find out if P2300 usage is limited in cooperative contexts
- find out if P2300 should express cooperative guarantees explicitly