@@ -633,7 +633,7 @@ specified by the ``i32 %d0 ... i32 %d4`` arguments.
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.

-'``llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[1-5]d``'
+'``llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
@@ -648,7 +648,7 @@ Syntax:
Overview:
"""""""""

-The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[1-5]d``' intrinsics
+The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``' intrinsics
correspond to the ``cp.async.bulk.prefetch.tensor.[1-5]d.L2.global*`` set
of PTX instructions. These instructions initiate an asynchronous prefetch
of tensor data from global memory to the L2 cache. In im2col mode, some
@@ -663,6 +663,76 @@ the same functionality as described in the ``tile`` mode intrinsics above.
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.

+'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.min.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.max.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.inc.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.dec.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.and.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.or.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.xor.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.2d(..., i32 %d0, i32 %d1, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with the tensor data in shared{::cta} memory, using ``tile`` mode.
+The dimension of the tensor data ranges from 1d to 5d with the coordinates
+specified by the ``i32 %d0 ... i32 %d4`` arguments. The supported reduction
+operations are {add, min, max, inc, dec, and, or, xor}, as listed in the
+``tile.1d`` intrinsics above.
+
+* The last argument to these intrinsics is a boolean flag
+  indicating support for cache_hint. This flag argument must
+  be a compile-time constant. When set, it indicates a valid
+  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
+  variant of the PTX instruction.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
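+A minimal IR sketch of a call site (the function name, the zero tile
+coordinates, and the ``i64 0``/``i1 0`` cache-hint operands are illustrative
+assumptions, not part of the intrinsic definitions above):
+
+.. code-block:: llvm
+
+  ; Sketch: reduce-add a 2d tile from shared::cta memory into the global
+  ; tensor described by %tensor_map. The flag is 0 (no cache_hint), so the
+  ; %ch operand is ignored and the non-.L2::cache_hint PTX variant is used.
+  define void @reduce_add_2d_sketch(ptr addrspace(3) %src, ptr %tensor_map) {
+    call void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.2d(
+        ptr addrspace(3) %src, ptr %tensor_map, i32 0, i32 0, i64 0, i1 0)
+    ret void
+  }
+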
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].im2col.[3-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.[3-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[3-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with the tensor data in shared{::cta} memory, using ``im2col`` mode.
+In this mode, the tensor has to be at least three-dimensional. The supported
+reduction operations are the same as in ``tile`` mode. The last argument to
+these intrinsics is a boolean flag with the same functionality as described in the
+``tile`` mode intrinsics above.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
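+A minimal IR sketch of an im2col-mode call site (the function name and the
+zero cache-hint operands are illustrative assumptions):
+
+.. code-block:: llvm
+
+  ; Sketch: reduce-min a 3d im2col tensor; %flag_ch is 0 (no cache_hint),
+  ; so the %ch operand is ignored.
+  define void @reduce_min_im2col_3d_sketch(ptr addrspace(3) %src,
+                                           ptr %tensor_map,
+                                           i32 %d0, i32 %d1, i32 %d2) {
+    call void @llvm.nvvm.cp.async.bulk.tensor.reduce.min.im2col.3d(
+        ptr addrspace(3) %src, ptr %tensor_map,
+        i32 %d0, i32 %d1, i32 %d2, i64 0, i1 0)
+    ret void
+  }
+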
Other Intrinsics
----------------