
Commit 38bb3b4

[Docs] Update performance tuning guide
Added CUDA graph explanation
Added core pinning section
Added tensor core usage section
1 parent 6537199 commit 38bb3b4

1 file changed: +38 -0 lines changed

recipes_source/recipes/tuning_guide.py

@@ -213,6 +213,7 @@ def gelu(x):
 
 ###############################################################################
 # Typically, the following environment variables are used to set CPU affinity with the GNU OpenMP implementation. ``OMP_PROC_BIND`` specifies whether threads may be moved between processors. Setting it to ``CLOSE`` keeps OpenMP threads close to the primary thread in contiguous place partitions. ``OMP_SCHEDULE`` determines how OpenMP threads are scheduled. ``GOMP_CPU_AFFINITY`` binds threads to specific CPUs.
+# An important tuning parameter is core pinning, which prevents threads from migrating between CPUs, improving data locality and minimizing inter-core communication.
 #
 # .. code-block:: sh
 #
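
For context, a minimal sketch of applying these variables from Python; the values are illustrative (the CPU range must match the target machine) and must be set before ``torch`` is imported, since the OpenMP runtime reads them at initialization:

import os

# Illustrative affinity settings; adjust the CPU list to the machine.
os.environ.setdefault("OMP_PROC_BIND", "CLOSE")
os.environ.setdefault("OMP_SCHEDULE", "STATIC")
os.environ.setdefault("GOMP_CPU_AFFINITY", "0-11")

import torch  # import after setting the variables so OpenMP picks them up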
@@ -318,6 +319,43 @@ def gelu(x):
 # GPU specific optimizations
 # --------------------------
 
+###############################################################################
+# Enable Tensor cores
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# Tensor cores are specialized hardware for computing matrix-matrix
+# multiplications, an operation that neural network workloads rely on heavily.
+#
+# Hardware tensor core operations tend to use a different floating point format
+# which sacrifices precision in exchange for speed gains.
+# Prior to PyTorch 1.12 this was enabled by default, but since that version
+# it must be set explicitly, as it can conflict with operations that do not
+# benefit from tensor core computations.
+
+# Tensor core computation can be enabled manually by modifying the matrix
+# multiplication precision. The default precision, "highest", performs the
+# operation according to the input dtype.
+
+# The "high" and "medium" precisions can be hardware accelerated via tensor
+# cores and will set torch.backends.cuda.matmul.allow_tf32 = True if available.
+
+# Carefully consider the tradeoff between speed and precision when evaluating your models!
+torch.set_float32_matmul_precision("high")
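
For illustration, a minimal sketch (assuming a CUDA device whose tensor cores support TF32) comparing the two modes on the same inputs:

import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.set_float32_matmul_precision("highest")  # full float32 matmul
exact = a @ b

torch.set_float32_matmul_precision("high")     # may use TF32 tensor cores
fast = a @ b

# The difference shows the precision given up in exchange for the speedup.
print((exact - fast).abs().max().item())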
+
+###############################################################################
+# Use CUDA Graphs
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# When using a GPU, work must first be launched from the CPU, and in some
+# cases the context switch between CPU and GPU can lead to poor resource
+# utilization. CUDA graphs are a way to keep computation within the GPU without
+# paying the extra cost of kernel launches and host synchronization.
+
+# They can be enabled using
+torch.compile(m, mode="reduce-overhead")
+# or
+torch.compile(m, mode="max-autotune")
+
+###############################################################################
+# Special care must be taken when using CUDA graphs, as they can lead to
+# increased memory consumption and some models might not compile.
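
For illustration, a minimal sketch of the reduce-overhead mode (the model and shapes are illustrative; note that ``mode`` is a keyword-only argument of ``torch.compile``, so the positional form in the original diff would raise a TypeError):

import torch

m = torch.nn.Linear(64, 64).cuda()
compiled = torch.compile(m, mode="reduce-overhead")

x = torch.randn(32, 64, device="cuda")
for _ in range(3):  # warm-up iterations let the backend capture CUDA graphs
    y = compiled(x)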
+
 ###############################################################################
 # Enable cuDNN auto-tuner
 # ~~~~~~~~~~~~~~~~~~~~~~~
