Commit cbde504

ZailiWang and svekars authored
Adding xeon_run_cpu.rst doc (#2931)
* adding tutorial for xeon.run_cpu script usage.

Co-authored-by: Svetlana Karslioglu <[email protected]>
1 parent 3b97695 commit cbde504

File tree

2 files changed

+372 -0 lines changed

recipes_source/recipes_index.rst (+8)

@@ -274,6 +274,13 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch features
     :link: ../recipes/recipes/tuning_guide.html
     :tags: Model-Optimization

+.. customcarditem::
+   :header: CPU launcher script for optimal performance on Intel® Xeon
+   :card_description: How to use launcher script for optimal runtime configurations on Intel® Xeon CPUs.
+   :image: ../_static/img/thumbnails/cropped/profiler.png
+   :link: ../recipes/recipes/xeon_run_cpu.html
+   :tags: Model-Optimization
+
 .. customcarditem::
     :header: PyTorch Inference Performance Tuning on AWS Graviton Processors
     :card_description: Tips for achieving the best inference performance on AWS Graviton CPUs

@@ -424,6 +431,7 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch features
    /recipes/recipes/dynamic_quantization
    /recipes/recipes/amp_recipe
    /recipes/recipes/tuning_guide
+   /recipes/recipes/xeon_run_cpu
    /recipes/recipes/intel_extension_for_pytorch
    /recipes/compiling_optimizer
    /recipes/torch_compile_backend_ipex

recipes_source/xeon_run_cpu.rst (+364)

@@ -0,0 +1,364 @@

Optimizing PyTorch Inference with Intel® Xeon® Scalable Processors
===================================================================

Several configuration options can impact the performance of PyTorch inference when executed on Intel® Xeon® Scalable Processors.
To get peak performance, the ``torch.backends.xeon.run_cpu`` script is provided to optimize the configuration of thread and memory management.
For thread management, the script configures thread affinity and preloads the Intel® OpenMP library.
For memory management, it configures NUMA binding and preloads optimized memory allocation libraries, such as TCMalloc and JeMalloc.
In addition, the script provides tunable parameters for compute resource allocation in both single-instance and multi-instance scenarios,
helping users find an optimal coordination of resource utilization for their specific workloads.

What You Will Learn
-------------------

* How to utilize tools like ``numactl``, ``taskset``, the Intel® OpenMP Runtime Library, and optimized memory
  allocators such as ``TCMalloc`` and ``JeMalloc`` for enhanced performance.
* How to configure CPU resources and memory management to maximize PyTorch inference performance on Intel® Xeon® processors.

Introduction of the Optimizations
---------------------------------

Applying NUMA Access Control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is beneficial that an increasing number of CPU cores are being provided to users within a single socket, as this offers greater computational resources.
However, this also leads to competition for memory access, which can cause programs to stall because the memory is busy.
To address this problem, Non-Uniform Memory Access (NUMA) was introduced.
Unlike Uniform Memory Access (UMA), where all memory is equally accessible to all cores,
NUMA organizes memory into multiple groups: a certain amount of memory is directly attached to each socket's integrated memory controller and becomes the local memory of that socket.
Local memory access is much faster than remote memory access.

Users can get CPU information with the ``lscpu`` command on Linux to learn how many cores and sockets there are on the machine.
Additionally, this command provides NUMA information, such as the distribution of CPU cores.
Below is an example of executing ``lscpu`` on a machine equipped with an Intel® Xeon® CPU Max 9480:

.. code-block:: console

   $ lscpu
   ...
   CPU(s):                   224
   On-line CPU(s) list:      0-223
   Vendor ID:                GenuineIntel
   Model name:               Intel (R) Xeon (R) CPU Max 9480
   CPU family:               6
   Model:                    143
   Thread(s) per core:       2
   Core(s) per socket:       56
   Socket(s):                2
   ...
   NUMA:
     NUMA node(s):           2
     NUMA node0 CPU(s):      0-55,112-167
     NUMA node1 CPU(s):      56-111,168-223
   ...

* Two sockets were detected, each containing 56 physical cores. With Hyper-Threading enabled, each core can handle 2 threads, resulting in 56 logical cores per socket. Therefore, the machine has a total of 224 CPU cores in service.
* Typically, physical cores are indexed before logical cores. In this scenario, cores 0-55 are the physical cores on the first NUMA node, and cores 56-111 are the physical cores on the second NUMA node.
* Logical cores are indexed subsequently: cores 112-167 correspond to the logical cores on the first NUMA node, and cores 168-223 to those on the second NUMA node.

Typically, PyTorch programs with compute-intensive workloads should avoid using logical cores to get good performance.

Linux provides a tool called ``numactl`` that allows user control of NUMA policy for processes or shared memory.
It runs processes with a specific NUMA scheduling or memory placement policy.
As described above, cores in one socket share high-speed cache, so it is a good idea to avoid cross-socket computation.
From a memory access perspective, binding memory access locally is much faster than accessing remote memory.
The ``numactl`` command should already be installed in recent Linux distributions. If it is missing, you can install it manually; for example, on Ubuntu:

.. code-block:: console

   $ apt-get install numactl

On CentOS you can run the following command:

.. code-block:: console

   $ yum install numactl
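
As an illustration only (a sketch, where ``<program.py>`` and ``[program_args]`` are placeholders for your own workload), the command below binds both computation and memory allocation to NUMA node 0. The ``run_cpu`` script described later builds a similar ``numactl`` prefix for you when NUMA control is enabled:

.. code-block:: console

   # Sketch: run the workload with CPU and memory bound to NUMA node 0.
   $ numactl --cpunodebind=0 --membind=0 python <program.py> [program_args]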

The ``taskset`` command in Linux is another powerful utility that allows you to set or retrieve the CPU affinity of a running process.
``taskset`` is pre-installed in most Linux distributions; in case it is not, on Ubuntu you can install it with the command:

.. code-block:: console

   $ apt-get install util-linux

On CentOS you can run the following command:

.. code-block:: console

   $ yum install util-linux
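
For example, on the machine profiled above (a sketch; the core range depends on your own ``lscpu`` output), ``taskset`` can pin a workload to the physical cores of the first NUMA node:

.. code-block:: console

   # Sketch: pin the process to physical cores 0-55 (NUMA node 0 on the example machine).
   $ taskset -c 0-55 python <program.py> [program_args]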

Using Intel® OpenMP Runtime Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenMP is an implementation of multithreading, a method of parallelizing where a primary thread (a series of instructions executed consecutively) forks a specified number of sub-threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.
Users can control OpenMP behavior with environment variable settings that fit their workloads; the settings are read and applied by the OpenMP library. By default, PyTorch uses the GNU OpenMP library (GNU libgomp) for parallel computation. On Intel® platforms, the Intel® OpenMP Runtime Library (libiomp) provides OpenMP API specification support. It usually brings more performance benefits than libgomp.

The Intel® OpenMP Runtime Library can be installed using one of these commands:

.. code-block:: console

   $ pip install intel-openmp

or

.. code-block:: console

   $ conda install mkl
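
The ``run_cpu`` script preloads ``libiomp5.so`` and sets the related environment variables for you (see the table of environment variables at the end of this recipe). For reference, doing the same by hand would look roughly like the sketch below, where ``<path_to_lib>`` is a placeholder for the directory the library was installed into:

.. code-block:: console

   # Sketch: manually preload Intel OpenMP and set typical thread-affinity knobs.
   # <path_to_lib> is a placeholder; adjust OMP_NUM_THREADS to your core count.
   $ export LD_PRELOAD=<path_to_lib>/libiomp5.so:$LD_PRELOAD
   $ export KMP_AFFINITY=granularity=fine,compact,1,0
   $ export KMP_BLOCKTIME=1
   $ export OMP_NUM_THREADS=56
   $ python <program.py> [program_args]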

Choosing an Optimized Memory Allocator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The memory allocator also plays an important role from a performance perspective. More efficient memory usage reduces the overhead of unnecessary memory allocations and deallocations, resulting in faster execution. In practice, for deep learning workloads, ``TCMalloc`` or ``JeMalloc`` can achieve better performance than the default malloc by reusing memory as much as possible.

You can install ``TCMalloc`` by running the following command on Ubuntu:

.. code-block:: console

   $ apt-get install google-perftools

On CentOS, you can install it by running:

.. code-block:: console

   $ yum install gperftools

In a conda environment, it can also be installed by running:

.. code-block:: console

   $ conda install conda-forge::gperftools

On Ubuntu ``JeMalloc`` can be installed by this command:

.. code-block:: console

   $ apt-get install libjemalloc2

On CentOS it can be installed by running:

.. code-block:: console

   $ yum install jemalloc

In a conda environment, it can also be installed by running:

.. code-block:: console

   $ conda install conda-forge::jemalloc
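
The ``run_cpu`` script can preload these allocators for you via the ``--enable-tcmalloc`` and ``--enable-jemalloc`` knobs described below. If you want to try an allocator without the script, a rough sketch (with ``<path_to_lib>`` as a placeholder for wherever the library was installed) is:

.. code-block:: console

   # Sketch: preload TCMalloc for a single run.
   $ LD_PRELOAD=<path_to_lib>/libtcmalloc.so python <program.py> [program_args]

   # Sketch: preload JeMalloc with the tuning the run_cpu script would apply.
   $ export LD_PRELOAD=<path_to_lib>/libjemalloc.so:$LD_PRELOAD
   $ export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto"
   $ python <program.py> [program_args]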

Quick Start Example Commands
----------------------------

1. To run single-instance inference with 1 thread on 1 CPU core (only Core #0 would be used):

   .. code-block:: console

      $ python -m torch.backends.xeon.run_cpu --ninstances 1 --ncores-per-instance 1 <program.py> [program_args]

2. To run single-instance inference on a single CPU node (NUMA socket):

   .. code-block:: console

      $ python -m torch.backends.xeon.run_cpu --node-id 0 <program.py> [program_args]

3. To run multi-instance inference, 8 instances with 14 cores per instance, on a 112-core CPU:

   .. code-block:: console

      $ python -m torch.backends.xeon.run_cpu --ninstances 8 --ncores-per-instance 14 <program.py> [program_args]

4. To run inference in throughput mode, in which each CPU node (NUMA node) runs one instance using all of its cores:

   .. code-block:: console

      $ python -m torch.backends.xeon.run_cpu --throughput-mode <program.py> [program_args]

.. note::

   The term "instance" here does not refer to a cloud instance. The script is executed as a single process which invokes multiple "instances", each formed from multiple threads. An "instance" is essentially a group of threads in this context.
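
In addition to the quick sets above, ``--latency-mode`` (described in the knob tables below) benchmarks in latency mode, splitting all physical cores into instances of 4 cores each:

.. code-block:: console

   $ python -m torch.backends.xeon.run_cpu --latency-mode <program.py> [program_args]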

Using ``torch.backends.xeon.run_cpu``
-------------------------------------

The argument list and usage guidance can be shown with the following command:

.. code-block:: console

   $ python -m torch.backends.xeon.run_cpu -h
   usage: run_cpu.py [-h] [--multi-instance] [-m] [--no-python] [--enable-tcmalloc] [--enable-jemalloc] [--use-default-allocator] [--disable-iomp] [--ncores-per-instance] [--ninstances] [--skip-cross-node-cores] [--rank] [--latency-mode] [--throughput-mode] [--node-id] [--use-logical-core] [--disable-numactl] [--disable-taskset] [--core-list] [--log-path] [--log-file-prefix] <program> [program_args]

The command above has the following positional arguments:

.. list-table::
   :widths: 25 50
   :header-rows: 1

   * - knob
     - help
   * - ``program``
     - The full path of the program/script to be launched.
   * - ``program_args``
     - The input arguments for the program/script to be launched.

Explanation of the options
~~~~~~~~~~~~~~~~~~~~~~~~~~

The generic option settings (knobs) include the following:

.. list-table::
   :widths: 25 10 15 50
   :header-rows: 1

   * - knob
     - type
     - default value
     - help
   * - ``-h``, ``--help``
     -
     -
     - To show the help message and exit.
   * - ``-m``, ``--module``
     -
     -
     - To change each process to interpret the launch script as a Python module, executing with the same behavior as ``python -m``.
   * - ``--no-python``
     - bool
     - False
     - To avoid prepending the program with ``python`` and just execute it directly. Useful when the script is not a Python script.
   * - ``--log-path``
     - str
     - ``''``
     - To specify the log file directory. The default is ``''``, which means logging to files is disabled.
   * - ``--log-file-prefix``
     - str
     - ``"run"``
     - Prefix of the log file name.
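
As a sketch of how these generic knobs can be combined (``<module_name>`` is a placeholder for your own module), the command below runs a workload as a Python module and writes log files with a custom prefix into ``./logs``:

.. code-block:: console

   # Sketch: run a module (as "python -m" would) and keep logs under ./logs.
   $ python -m torch.backends.xeon.run_cpu --log-path ./logs --log-file-prefix myrun -m <module_name> [program_args]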

Knobs for applying or disabling optimizations are:

.. list-table::
   :widths: 25 10 15 50
   :header-rows: 1

   * - knob
     - type
     - default value
     - help
   * - ``--enable-tcmalloc``
     - bool
     - False
     - To enable the ``TCMalloc`` memory allocator.
   * - ``--enable-jemalloc``
     - bool
     - False
     - To enable the ``JeMalloc`` memory allocator.
   * - ``--use-default-allocator``
     - bool
     - False
     - To use the default memory allocator; neither ``TCMalloc`` nor ``JeMalloc`` will be used.
   * - ``--disable-iomp``
     - bool
     - False
     - By default, the Intel® OpenMP library is used if installed. Setting this flag disables its usage.

.. note::

   Memory allocators influence performance. If the user does not specify a desired memory allocator, the ``run_cpu`` script searches for installed allocators in the order TCMalloc > JeMalloc > PyTorch default memory allocator, and uses the first one found.
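
These knobs are handy for A/B comparisons. For example, here is a sketch of a run on NUMA node 0 that deliberately skips both the optimized allocators and Intel® OpenMP, to serve as a baseline:

.. code-block:: console

   # Sketch: baseline run without TCMalloc/JeMalloc and without Intel OpenMP preloading.
   $ python -m torch.backends.xeon.run_cpu --node-id 0 --use-default-allocator --disable-iomp <program.py> [program_args]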

Knobs for controlling instance number and compute resource allocation are:

.. list-table::
   :widths: 25 10 15 50
   :header-rows: 1

   * - knob
     - type
     - default value
     - help
   * - ``--ninstances``
     - int
     - 0
     - Number of instances.
   * - ``--ncores-per-instance``
     - int
     - 0
     - Number of cores used by each instance.
   * - ``--node-id``
     - int
     - -1
     - The node ID to be used for multi-instance runs; by default all nodes will be used.
   * - ``--core-list``
     - str
     - ``''``
     - To specify the core list as ``'core_id, core_id, ....'`` or a core range as ``'core_id-core_id'``. By default all the cores will be used.
   * - ``--use-logical-core``
     - bool
     - False
     - By default only physical cores are used. Specifying this flag enables logical core usage.
   * - ``--skip-cross-node-cores``
     - bool
     - False
     - To prevent the workload from being executed on cores across NUMA nodes.
   * - ``--rank``
     - int
     - -1
     - To specify the instance index when assigning ``ncores_per_instance`` to a single rank; otherwise ``ncores_per_instance`` will be assigned sequentially to the instances.
   * - ``--multi-instance``
     - bool
     - False
     - A quick set to invoke multiple instances of the workload on multi-socket CPU servers.
   * - ``--latency-mode``
     - bool
     - False
     - A quick set to invoke benchmarking in latency mode, in which all physical cores are used, with 4 cores per instance.
   * - ``--throughput-mode``
     - bool
     - False
     - A quick set to invoke benchmarking in throughput mode, in which all physical cores are used, with 1 NUMA node per instance.
   * - ``--disable-numactl``
     - bool
     - False
     - By default the ``numactl`` command is used to control NUMA access. Setting this flag disables it.
   * - ``--disable-taskset``
     - bool
     - False
     - To disable the usage of the ``taskset`` command.
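
To illustrate how these knobs can be combined (a sketch; the core IDs refer to the example machine above), the first command below pins a single instance to an explicit core range, and the second runs only the instance with index 0 of an 8-instance, 14-cores-per-instance layout:

.. code-block:: console

   # Sketch: one instance pinned to an explicit core range.
   $ python -m torch.backends.xeon.run_cpu --ninstances 1 --core-list "0-13" <program.py> [program_args]

   # Sketch: run only the instance with rank 0 out of an 8-instance layout.
   $ python -m torch.backends.xeon.run_cpu --ninstances 8 --rank 0 --ncores-per-instance 14 <program.py> [program_args]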

.. note::

   Environment variables that will be set by this script include the following:

   .. list-table::
      :widths: 25 50
      :header-rows: 1

      * - Environment Variable
        - Value
      * - LD_PRELOAD
        - Depending on the knobs you set, <lib>/libiomp5.so, <lib>/libjemalloc.so and <lib>/libtcmalloc.so might be appended to LD_PRELOAD.
      * - KMP_AFFINITY
        - If libiomp5.so is preloaded, KMP_AFFINITY could be set to ``"granularity=fine,compact,1,0"``.
      * - KMP_BLOCKTIME
        - If libiomp5.so is preloaded, KMP_BLOCKTIME is set to "1".
      * - OMP_NUM_THREADS
        - Value of ``ncores_per_instance``.
      * - MALLOC_CONF
        - If libjemalloc.so is preloaded, MALLOC_CONF will be set to ``"oversize_threshold:1,background_thread:true,metadata_thp:auto"``.

Please note that the script respects environment variables that have already been set. For example, if you set the environment variables mentioned above before running the script, their values will not be overwritten by the script.
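
For instance, a preset ``OMP_NUM_THREADS`` survives the launch (a sketch; 28 is an arbitrary value):

.. code-block:: console

   # Sketch: the preset OMP_NUM_THREADS=28 is kept instead of the value the script would compute.
   $ OMP_NUM_THREADS=28 python -m torch.backends.xeon.run_cpu --node-id 0 <program.py> [program_args]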

Conclusion
----------

In this tutorial, we explored a variety of advanced configurations and tools designed to optimize PyTorch inference performance on Intel® Xeon® Scalable Processors.
By leveraging the ``torch.backends.xeon.run_cpu`` script, we demonstrated how to fine-tune thread and memory management to achieve peak performance.
We covered essential concepts such as NUMA access control, optimized memory allocators like ``TCMalloc`` and ``JeMalloc``, and the use of Intel® OpenMP for efficient multithreading.

Additionally, we provided practical command-line examples to guide you through setting up single-instance and multi-instance scenarios, ensuring optimal resource utilization tailored to specific workloads.
By understanding and applying these techniques, users can significantly enhance the efficiency and speed of their PyTorch applications on Intel® Xeon® platforms.

See also:

* `PyTorch Performance Tuning Guide <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations>`__
* `PyTorch Multiprocessing Best Practices <https://pytorch.org/docs/stable/notes/multiprocessing.html#cpu-in-multiprocessing>`__
* Grokking PyTorch Intel CPU performance: `Part 1 <https://pytorch.org/tutorials/intermediate/torchserve_with_ipex>`__ and `Part 2 <https://pytorch.org/tutorials/intermediate/torchserve_with_ipex_2>`__
