Description
Issue by gevtushenko
Sunday Feb 27, 2022 at 01:05 GMT
Originally opened as NVIDIA/stdexec#475
A few important use cases for P2300 require clarification of its cooperative semantics. A cooperative API involves multiple threads working together towards a shared goal. For instance, consider the following function:
```cpp
void f(int tid, auto &scheduler) {
  auto snd = schedule(scheduler)
           | then([tid] { printf("{t%d}", tid); })
           | bulk(2, [tid](int i) { printf("{b%d:%d}", tid, i); });
  printf("~");
  sync_wait(snd);
}
```
If two threads execute the code above with an inline scheduler, `f(tid, inline_scheduler)`, we'll get some interleaving of the following characters:

```
~~{t0}{t1}{b0:0}{b0:1}{b1:0}{b1:1}
```
In other words, `then` is executed by each thread, as is `bulk`, which is expected. In contrast, an inline cooperative scheduler, `f(tid, inline_coop_scheduler)`, would lead to the following result:

```
~~{t0}{b0:0}{b1:1}
```

Here `then` is specialized to execute its work only once, and `bulk` distributes its iterations between the participating threads. This approach allows representing cooperating threads as a single execution context without the overhead of maintaining a task queue.
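To make these semantics concrete, the following is a minimal sketch in plain C++20 of what an inline cooperative scheduler effectively does, with no P2300 machinery involved: the `then` work runs on exactly one participant, and the two `bulk` iterations are split statically between the participants. The round-robin split and the barrier-based synchronization are assumptions of the sketch, not requirements:

```cpp
// Sketch only: models the cooperative semantics with raw threads.
#include <barrier>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Each participating thread runs this with its id in [0, n_threads).
void coop_task(int tid, int n_threads, std::barrier<> &sync) {
  // `then` semantics: the continuation runs exactly once.
  if (tid == 0)
    std::printf("{t%d}", tid);
  sync.arrive_and_wait(); // all participants observe `then`'s effect

  // `bulk(2, f)` semantics: iterations are statically partitioned
  // between participants (round-robin here, as an assumption).
  const int n_items = 2;
  for (int i = tid; i < n_items; i += n_threads)
    std::printf("{b%d:%d}", tid, i);
}

int main() {
  const int n_threads = 2;
  std::barrier<> sync(n_threads);
  std::vector<std::jthread> threads;
  for (int tid = 0; tid < n_threads; ++tid)
    threads.emplace_back(coop_task, tid, n_threads, std::ref(sync));
}
```

Each participant derives its share of the work from its id alone, which is why no task queue is needed.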
motivation
- distributed context:

  Let's consider the following sender adaptor:

  ```cpp
  sender auto compute(auto &computer) {
    return schedule(computer)
         | bulk(n_cells, process_cell)
         | then(done_once)
         | transfer(inline_scheduler{})
         | then(write);
  }
  ```
  If it adapts an inline scheduler, the calling thread processes all `n_cells`. A thread pool scheduler represents a set of threads as a single execution resource, so `then` would be executed once and `bulk` would process `n_cells` in a federated manner to achieve some speedup. Extending this idea, we arrive at a distributed scheduler, which would partition `n_cells` between multiple nodes of a distributed system. Although the task-based programming model is a known approach to distributed programming, static information can improve performance by reducing the amount of task distribution. This leads us to a cooperative distributed scheduler:

  ```cpp
  int main() {
    // Access runtime to query process id and number of processes
    coop_distributed_scheduler scheduler{};
    sync_wait(compute(scheduler));
  }
  // mpirun -np 2 ./compute
  ```
  Note that we can still achieve the effect of performing `then` on each cooperating executor by `transfer`-ing to an `inline_scheduler` (this is what `compute` does before `then(write)`). The static partitioning described above is sketched below with raw MPI.
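  As a hypothetical illustration (neither `coop_distributed_scheduler` nor the block partition below is prescribed by P2300; both are assumptions), this is the division of labor such a scheduler could perform, expressed with plain MPI:

  ```cpp
  // Sketch only: what the cooperative distributed bulk boils down to.
  #include <mpi.h>
  #include <cstdio>

  void process_cell(int /*cell*/) { /* stand-in for real work */ }

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, n_ranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    const int n_cells = 1024;
    // Static block partition: each rank derives its slice from the
    // runtime-provided (rank, size) pair; no tasks are distributed.
    const int begin = rank * n_cells / n_ranks;
    const int end = (rank + 1) * n_cells / n_ranks;
    for (int cell = begin; cell < end; ++cell)
      process_cell(cell);

    if (rank == 0)
      std::printf("done once\n"); // `then(done_once)` runs on one rank

    MPI_Finalize();
  }
  // mpirun -np 2 ./compute
  ```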
. -
locality:
Assigning a thread to a particular execution resource might reduce the number of context switches, which affects performance. For the code above, we might use a multi-GPU scheduler:
int main() { // Switches between GPUs internally multi_gpu_scheduler scheduler{}; sync_wait(scheduler); }
  Performance might be improved if we assign each thread to a particular GPU (a raw CUDA/OpenMP sketch of this pinning follows):

  ```cpp
  int main() {
    #pragma omp parallel
    {
      // No GPU context switches
      coop_multi_gpu_scheduler scheduler{};
      sync_wait(compute(scheduler));
    }
  }
  ```
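  Concretely, the per-thread GPU binding such a scheduler could perform internally might look like the following sketch. This is a minimal sketch assuming one OpenMP thread per device, not an implementation of `coop_multi_gpu_scheduler`; only the `cudaSetDevice`/`omp_get_thread_num` pairing is the point:

  ```cpp
  // Sketch only: bind each CPU thread to one GPU exactly once.
  #include <cuda_runtime.h>
  #include <omp.h>

  int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    #pragma omp parallel num_threads(n_gpus)
    {
      // Each thread selects "its" device up front, so later submissions
      // from this thread incur no GPU context switches.
      cudaSetDevice(omp_get_thread_num());
      // ... submit this thread's share of the work to its GPU ...
    }
  }
  ```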
- nesting:

  The following code represents a case of executing a sender in both cooperative and inline contexts. It's expected to produce the same result in cases (1) and (2), without the overhead of dealing with a task queue.

  ```cpp
  assert(get_forward_progress_guarantee(scheduler) ==
         forward_progress_guarantee::concurrent);
  sync_wait(
    schedule(scheduler)
    | bulk(2, [](int thread_num) {
        inline_cooperative_scheduler sub_scheduler{thread_num, 2};
        // per-thread prologue
        sync_wait(schedule(sub_scheduler) | compute()); // 1
        // per-thread epilogue
      }));
  sync_wait(schedule(scheduler) | compute()); // 2
  ```
  Having inline behavior in those contexts would change the sender's behavior. Providing separate cooperative versions of `then` and `bulk` would limit code reuse, since a sender author would have to know whether they are developing code for a cooperative context.
goals
- collect feedback and use cases from other fields
- find out if P2300 usage is limited in cooperative contexts
- find out if P2300 should express cooperative guarantees explicitly