
Cooperative semantics #91

Open

Description

@ericniebler

Issue by gevtushenko
Sunday Feb 27, 2022 at 01:05 GMT
Originally opened as NVIDIA/stdexec#475


A few crucial areas that P2300 can cover require clarification of cooperative semantics. A cooperative API involves multiple threads working together toward a shared goal. For instance, consider the following function:

void f(int tid, auto &scheduler) {
  auto snd = schedule(scheduler)
           | then([tid] { printf("{t%d}", tid); })
           | bulk(2, [tid](int i) { printf("{b%d:%d}", tid, i); });

  printf("~");
  sync_wait(snd);
}

If two threads execute the code above with an inline scheduler (f(tid, inline_scheduler)), we'll get some interleaving of the following characters:

~~{t0}{t1}{b0:0}{b0:1}{b1:0}{b1:1}

In other words, both then and bulk are executed by each thread, which is expected. By contrast, an inline cooperative scheduler (f(tid, inline_coop_scheduler)) would lead to the following result:

~~{t0}{b0:0}{b1:1}

Here, then is specialized to execute its work only once, and bulk distributes its work among the participating threads. This approach makes it possible to represent cooperating threads as a single execution context without the overhead of maintaining a task queue.
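
To make the intended specialization concrete, here is a minimal sketch of the dispatch rule an inline cooperative scheduler could use (not a P2300 implementation; coop_context, coop_then, and coop_bulk are hypothetical names standing in for the scheduler's internals):

struct coop_context {
  int tid;        // this thread's id within the cooperating group
  int n_threads;  // number of cooperating threads
};

// `then` work runs exactly once per group; a real scheduler would also
// need a barrier here so that subsequent work observes the result
template <class F>
void coop_then(const coop_context &ctx, F f) {
  if (ctx.tid == 0) f();
}

// `bulk` iterations are statically partitioned across the group
template <class F>
void coop_bulk(const coop_context &ctx, int n, F f) {
  for (int i = ctx.tid; i < n; i += ctx.n_threads) f(i);
}

With tid in {0, 1} and n == 2, this reproduces the {t0}{b0:0}{b1:1} output above.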

motivation

  1. distributed context:

    Let's consider the following sender adaptor:

    sender auto compute(auto &computer) {
        return schedule(computer) 
             | bulk(n_cells, process_cell)
             | then(done_once)
             | transfer(inline_scheduler{})
             | then(write);
    }

    If it adapts an inline scheduler, the calling thread processes all n_cells. A thread pool scheduler represents a set of threads as a single execution resource, so then would be executed once and bulk would process n_cells in a federated manner to achieve some speedup. Extending this idea, we arrive at a distributed scheduler, which would partition n_cells between multiple nodes of a distributed system. Although a task-based programming model is a known approach to distributed programming, static information can improve performance by reducing task distribution overhead. This leads us to a cooperative distributed scheduler:

    int main() {
        // Access the runtime to query the process id and number of processes
        coop_distributed_scheduler scheduler{};
        sync_wait(compute(scheduler));
    }
    
    // mpirun -np 2 ./compute
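
    As a sketch of what the constructor comment above implies, assuming MPI as the runtime (only the rank/size query is shown; the scheduler customization machinery is omitted):

    #include <mpi.h>

    struct coop_distributed_scheduler {
      int rank = 0;     // this process's id within the group
      int n_ranks = 1;  // number of cooperating processes

      // assumes MPI_Init has already been called
      coop_distributed_scheduler() {
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
      }

      // schedule() and the then/bulk customizations would use rank and
      // n_ranks to partition work statically, as in the earlier sketch
    };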

    Note that we can achieve the effect of executing then on each cooperating executor by transfer-ing to an inline_scheduler.
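
    For example, annotating the compute adaptor from above (the per-stage comments describe the intended cooperative semantics and are an interpretation, not normative):

    sender auto compute(auto &computer) {
        return schedule(computer)
             | bulk(n_cells, process_cell)   // partitioned across executors
             | then(done_once)               // runs once for the whole group
             | transfer(inline_scheduler{})  // leave the cooperative context
             | then(write);                  // runs on every calling executor
    }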

  2. locality:

    Assigning a thread to a particular execution resource might reduce the number of context switches, which can improve performance. For the code above, we might use a multi-GPU scheduler:

    int main() {
        // Switches between GPUs internally
        multi_gpu_scheduler scheduler{};
        sync_wait(compute(scheduler));
    }

    Performance might be improved if we assign a thread to a particular GPU:

    int main() {
        #pragma omp parallel
        {
            // No GPU context switches: each thread stays on one device
            coop_multi_gpu_scheduler scheduler{};
            sync_wait(compute(scheduler));
        }
    }
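
    As a minimal sketch of how coop_multi_gpu_scheduler could avoid context switches, assuming CUDA and OpenMP (the one-thread-per-GPU mapping is an assumption, not part of this issue):

    #include <omp.h>
    #include <cuda_runtime.h>

    struct coop_multi_gpu_scheduler {
      int device;

      coop_multi_gpu_scheduler()
        : device(omp_get_thread_num()) {  // hypothetical 1:1 thread-to-GPU mapping
        cudaSetDevice(device);            // bind once; no context switches later
      }

      // schedule() and the bulk customization would submit work only to `device`
    };
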
  3. nesting:

    The following code shows a sender being executed in both cooperative and inline contexts. The same result is expected in cases (1) and (2), without the overhead of dealing with a task queue.

    assert(get_forward_progress_guarantee(scheduler)
           == forward_progress_guarantee::concurrent);
    
    sync_wait(
      schedule(scheduler)
    | bulk(2, [](int thread_num) {
        inline_cooperative_scheduler sub_scheduler{thread_num, 2};
    
        // per-thread prologue
        sync_wait(compute(sub_scheduler));                 // 1
        // per-thread epilogue
      })
    );
    
    sync_wait(compute(scheduler));                         // 2

Having inline behavior in these contexts would change the sender's behavior. Providing separate cooperative versions of then and bulk would limit code reuse, since a sender author would have to know whether they are developing code for a cooperative context.

goals

  • collect feedback and use cases from other fields
  • find out if P2300 usage is limited in cooperative contexts
  • find out if P2300 should express cooperative guarantees explicitly
