@@ -329,6 +329,107 @@ and as such, libc++ does not go out of its way to support them. The library may
compiler extensions which would then be documented explicitly, but the basic expectation should be
that no special support is provided for arbitrary compiler extensions.

+ Offloading C++ Parallel Algorithms to GPUs
+ ------------------------------------------
+
+ Experimental support for GPU offloading has been added to ``libc++``. The
+ implementation uses OpenMP target offloading to leverage GPU compute resources.
+ The OpenMP PSTL backend can target both NVIDIA and AMD GPUs.
+ However, the implementation only supports contiguous iterators, such as
+ iterators for ``std::vector`` or ``std::array``.
+ To enable the OpenMP offloading backend, it must be selected with
+ ``LIBCXX_PSTL_BACKEND=openmp`` when installing ``libc++``. Further, when
+ compiling a program, the user must specify the command line options
+ ``-fopenmp -fexperimental-library``. To install LLVM with OpenMP offloading
+ enabled, please read
+ `the LLVM OpenMP FAQ <https://openmp.llvm.org/SupportAndFAQ.html>`_.
+ You may also want to visit
+ `the OpenMP offloading command-line argument reference <https://openmp.llvm.org/CommandLineArgumentReference.html#offload-command-line-arguments>`_.
+
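+ As a rough illustration only (the runtimes layout, offload architecture, and
+ exact flags below are assumptions that depend on your system; the linked FAQ is
+ the authoritative reference), selecting the backend and compiling with
+ offloading enabled might look as follows:
+
+ .. code-block:: bash
+
+    # Configure the LLVM runtimes with the OpenMP PSTL backend selected.
+    # The OpenMP offloading runtime itself must also be built; see the FAQ above.
+    cmake -S runtimes -B build -G Ninja \
+      -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \
+      -DLIBCXX_PSTL_BACKEND=openmp
+    ninja -C build install
+
+    # Compile a program with OpenMP offloading, here targeting an NVIDIA GPU.
+    # -fno-exceptions is currently required for GPU offloading (see the note below).
+    clang++ -stdlib=libc++ -fexperimental-library -fno-exceptions \
+            -fopenmp --offload-arch=sm_80 axpy.cpp -o axpy
+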
+ Example
+ ~~~~~~~
+
+ The following is an example of offloading vector addition to a GPU using our
+ standard library extension. It implements the classical vector addition from
+ BLAS that overwrites the vector ``y`` with ``y=a*x+y``. Thus ``y.begin()`` is
+ used as both an input and an output iterator in this example.
+
+ .. code-block:: cpp
+
+    #include <algorithm>
+    #include <execution>
+    #include <vector>
+
+    template <typename T1, typename T2, typename T3>
+    void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+      std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                     y.begin(), [=](T2 xi, T3 yi) { return a * xi + yi; });
+    }
+
+ The execution policy ``std::execution::par_unseq`` states that the algorithm's
+ execution may be parallelized, vectorized, and migrated across threads. This is
+ the only execution mode that is safe to offload to GPUs; for all other
+ execution modes the algorithms will execute on the CPU.
+ Special attention must be paid to the lambda captures when enabling GPU
+ offloading: if the lambda captures by reference, the user must manually map the
+ captured variables to the device. With a by-reference capture, the above
+ example could be implemented in the following way.
+
+ .. code-block:: cpp
+
+    template <typename T1, typename T2, typename T3>
+    void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+      #pragma omp target data map(to : a)
+      std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                     y.begin(), [&](T2 xi, T3 yi) { return a * xi + yi; });
+    }
+
+ However, if unified shared memory (USM) is enabled, no additional data mapping
+ is necessary when capturing variables by reference.
+
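+ For instance, one way to assert USM (a sketch, assuming hardware and an OpenMP
+ runtime that actually provide unified shared memory) is the OpenMP ``requires``
+ directive, after which the by-reference capture needs no explicit mapping:
+
+ .. code-block:: cpp
+
+    #include <algorithm>
+    #include <execution>
+    #include <vector>
+
+    // Assert that host pointers remain valid on the device; this is only valid
+    // on systems that support unified shared memory.
+    #pragma omp requires unified_shared_memory
+
+    template <typename T1, typename T2, typename T3>
+    void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+      // Under USM, no "map" clauses are needed for the captured variables.
+      std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                     y.begin(), [&](T2 xi, T3 yi) { return a * xi + yi; });
+    }
+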
+ Compiling functions for GPUs with OpenMP
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The C++ standard assumes that all memory accesses happen within a single
+ address space. However, discrete GPU systems have distinct address spaces. A
+ single address space can be emulated if your system supports unified shared
+ memory. However, many discrete GPU systems do not, and in those cases it is
+ important to pass device function pointers to the parallel algorithms. Below is
+ an example of how the OpenMP ``declare target`` directive with the ``indirect``
+ clause can be used to mark that a function should be compiled for both host and
+ device.
+
+ .. code-block:: cpp
+
+    #include <cassert>
+    #include <cmath>
+    #include <execution>
+    #include <functional>
+    #include <numeric>
+    #include <vector>
+
+    // This function computes the squared difference of two floating-point numbers
+    float squared(float a, float b) { return a * a - 2.0f * a * b + b * b; }
+
+    // Declare that the function must be compiled for both host and device
+    #pragma omp declare target indirect to(squared)
+
+    int main() {
+      std::vector<float> a(100, 1.0);
+      std::vector<float> b(100, 1.25);
+
+      // Pass the host function pointer to the parallel algorithm and let OpenMP
+      // translate it to the device function pointer internally
+      float sum =
+          std::transform_reduce(std::execution::par_unseq, a.begin(), a.end(),
+                                b.begin(), 0.0f, std::plus{}, squared);
+
+      // Validate that the result is approximately 6.25
+      assert(std::abs(sum - 6.25f) < 1e-10);
+      return 0;
+    }
+
+ Without unified shared memory, the above example will not work if the host
+ function pointer ``squared`` is passed to the parallel algorithm without the
+ ``declare target indirect`` directive, since OpenMP could not translate it to
+ the corresponding device function pointer.
+
+ Important notes about exception handling
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ GPU architectures do not support exception handling and, for now,
+ ``-fno-exceptions`` is required to offload to the GPU. Parallel CPU fallback
+ is available without restrictions.
+
Platform specific behavior
==========================