Commit 2f3f3fa

Updated Doc for Intel XPU Profile (#3013)
* add xpu profiling files and recipes_source/profile_with_itt.rst
* Update profiler_recipe.py to unify the accelerator Python code
* Update en-wordlist.txt
1 parent 37d8ddb commit 2f3f3fa

File tree: 6 files changed, +108 -22 lines

(binary image, 93.3 KB)

_static/img/trace_xpu_img.png (binary image, 88.3 KB)

en-wordlist.txt (+8 -1)

@@ -647,4 +647,11 @@ url
 colab
 sharders
 Criteo
-torchrec
+torchrec
+_batch_norm_impl_index
+convolution_overrideable
+aten
+XPU
+XPUs
+impl
+overrideable

recipes_source/profile_with_itt.rst (+33 -3)

@@ -58,6 +58,10 @@ Launch Intel® VTune™ Profiler
 
 To verify the functionality, you need to start an Intel® VTune™ Profiler instance. Please check the `Intel® VTune™ Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/launch.html>`__ for steps to launch Intel® VTune™ Profiler.
 
+.. note::
+
+   You can also use the web server UI by following the `Intel® VTune™ Profiler Web Server UI Guide <https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2024-1/web-server-ui.html>`__,
+   for example: ``vtune-backend --web-port=8080 --allow-remote-access --enable-server-profiling``
+
 Once you get the Intel® VTune™ Profiler GUI launched, you should see a user interface as below:
 
 .. figure:: /_static/img/itt_tutorial/vtune_start.png
@@ -66,8 +70,8 @@ Once you get the Intel® VTune™ Profiler GUI launched, you should see a user interface as below:
 
 Three sample results are available on the left-side navigation bar under the `sample (matrix)` project. If you do not want profiling results to appear in this default sample project, you can create a new project via the `New Project...` button under the blue `Configure Analysis...` button. To start a new profiling session, click the blue `Configure Analysis...` button to initiate the configuration.
 
-Configure Profiling
-~~~~~~~~~~~~~~~~~~~
+Configure Profiling for CPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Once you click the `Configure Analysis...` button, you should see the screen below:
 
@@ -77,6 +81,16 @@ Once you click the `Configure Analysis...` button, you should see the screen below:
 
 The right side of the window is split into 3 parts: `WHERE` (top left), `WHAT` (bottom left), and `HOW` (right). With `WHERE`, you can assign the machine that you want to run the profiling on. With `WHAT`, you can set the path of the application that you want to profile. To profile a PyTorch script, it is recommended to wrap all manual steps, including activating a Python environment and setting required environment variables, into a bash script, then profile this bash script. In the screenshot above, we wrapped all steps into the `launch.sh` bash script and profile `bash` with the parameter `<path_of_launch.sh>`. On the right side `HOW`, you can choose whichever analysis type you would like to run. Intel® VTune™ Profiler provides a number of profiling types that you can choose from. Details can be found in the `Intel® VTune™ Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance.html>`__.
 
+
+Configure Profiling for XPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Pick the `GPU Offload` profiling type instead of `Hotspots`, and follow the same instructions as for CPU to launch the application.
+
+.. figure:: /_static/img/itt_tutorial/vtune_xpu_config.png
+   :width: 100%
+   :align: center
+
 Read Profiling Result
 ~~~~~~~~~~~~~~~~~~~~~
 
@@ -101,6 +115,18 @@ As illustrated on the right side navigation bar, brown portions in the timeline
 
 Of course, Intel® VTune™ Profiler provides many more profiling features to help you understand a performance issue. Once you understand the root cause of a performance issue, you can get it fixed. More detailed usage instructions are available in the `Intel® VTune™ Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance.html>`__.
 
+Read XPU Profiling Result
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After a successful profiling run with ITT, you can open the `Platform` tab of the profiling result to see labels in the Intel® VTune™ Profiler timeline.
+
+.. figure:: /_static/img/itt_tutorial/vtune_xpu_timeline.png
+   :width: 100%
+   :align: center
+
+
+The timeline shows the main thread as a `python` thread at the top, with labeled PyTorch operators and customized regions shown in the main thread row. All operators starting with `aten::` are labeled implicitly by the ITT feature in PyTorch. The timeline also shows the GPU Computing Queue, where you can see the XPU kernels that were dispatched to it.
+
 A short sample code showcasing how to use PyTorch ITT APIs
 ----------------------------------------------------------

@@ -128,8 +154,12 @@ The topology is formed by two operators, `Conv2d` and `Linear`. Three iterations
         return x
 
 def main():
-    m = ITTSample()
+    m = ITTSample()
+    # uncomment the line below for XPU
+    # m = m.to("xpu")
     x = torch.rand(10, 3, 244, 244)
+    # uncomment the line below for XPU
+    # x = x.to("xpu")
     with torch.autograd.profiler.emit_itt():
         for i in range(3):
             # Labeling a region with a pair of range_push and range_pop
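
For orientation, here is a minimal, self-contained sketch of the ITT labeling pattern that the truncated sample above illustrates. The model and tensor sizes are placeholders rather than the tutorial's exact values, and the XPU line assumes an XPU-enabled PyTorch build; the labels only show up in Intel® VTune™ Profiler when the script runs under an ITT-aware collection:

    import torch
    import torch.nn as nn

    # Placeholder two-operator model; sizes are illustrative, not the tutorial's
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten(), nn.LazyLinear(10))
    x = torch.rand(1, 3, 32, 32)
    # model, x = model.to("xpu"), x.to("xpu")  # uncomment for XPU (assumes an XPU build)

    with torch.autograd.profiler.emit_itt():
        for i in range(3):
            # Label one iteration with an explicit range_push/range_pop pair
            torch.profiler.itt.range_push(f"iteration_{i}")
            model(x)
            torch.profiler.itt.range_pop()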

recipes_source/recipes/profiler_recipe.py (+67 -18)

@@ -70,6 +70,7 @@
 # - ``ProfilerActivity.CPU`` - PyTorch operators, TorchScript functions and
 #   user-defined code labels (see ``record_function`` below);
 # - ``ProfilerActivity.CUDA`` - on-device CUDA kernels;
+# - ``ProfilerActivity.XPU`` - on-device XPU kernels;
 # - ``record_shapes`` - whether to record shapes of the operator inputs;
 # - ``profile_memory`` - whether to report amount of memory consumed by
 #   model's Tensors;
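
To make the argument list above concrete, here is a minimal CPU-only sketch (so it runs on any machine) that combines ``record_shapes`` and ``profile_memory`` with a user-defined ``record_function`` label; the resnet18 model mirrors the recipe:

    import torch
    import torchvision.models as models
    from torch.profiler import profile, record_function, ProfilerActivity

    model = models.resnet18()
    inputs = torch.randn(5, 3, 224, 224)

    with profile(
        activities=[ProfilerActivity.CPU],  # add CUDA/XPU activities on accelerators
        record_shapes=True,                 # record operator input shapes
        profile_memory=True,                # report memory consumed by tensors
    ) as prof:
        with record_function("model_inference"):  # user-defined code label
            model(inputs)

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))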
@@ -160,17 +161,28 @@
 # Note the occurrence of ``aten::convolution`` twice with different input shapes.
 
 ######################################################################
-# Profiler can also be used to analyze performance of models executed on GPUs:
-
-model = models.resnet18().cuda()
-inputs = torch.randn(5, 3, 224, 224).cuda()
-
-with profile(activities=[
-        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
+# Profiler can also be used to analyze the performance of models executed on GPUs and XPUs.
+# Users can switch between cpu, cuda and xpu.
+if torch.cuda.is_available():
+    device = 'cuda'
+elif torch.xpu.is_available():
+    device = 'xpu'
+else:
+    print('Neither CUDA nor XPU devices are available to demonstrate profiling on acceleration devices')
+    import sys
+    sys.exit(0)
+
+activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA, ProfilerActivity.XPU]
+sort_by_keyword = device + "_time_total"
+
+model = models.resnet18().to(device)
+inputs = torch.randn(5, 3, 224, 224).to(device)
+
+with profile(activities=activities, record_shapes=True) as prof:
     with record_function("model_inference"):
         model(inputs)
 
-print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
+print(prof.key_averages().table(sort_by=sort_by_keyword, row_limit=10))
 
 ######################################################################
 # (Note: the first use of CUDA profiling may bring an extra overhead.)
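
One thing worth noting in the hunk above: ``activities`` always lists all three activity types and relies on the profiler to tolerate absent devices. A more defensive variant (a sketch of an alternative, not what the commit does) builds the list from what is actually available:

    import torch
    from torch.profiler import ProfilerActivity

    # Only request activities for devices that are actually present
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    elif torch.xpu.is_available():
        activities.append(ProfilerActivity.XPU)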
@@ -197,6 +209,36 @@
 # Self CPU time total: 23.015ms
 # Self CUDA time total: 11.666ms
 #
+######################################################################
+
+
+######################################################################
+# (Note: the first use of XPU profiling may bring an extra overhead.)
+
+######################################################################
+# The resulting table output (omitting some columns):
+#
+# .. code-block:: sh
+#
+# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------
+#                                                    Name      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls
+# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------
+#                                         model_inference       0.000us         0.00%       2.567ms       2.567ms             1
+#                                            aten::conv2d       0.000us         0.00%       1.871ms      93.560us            20
+#                                       aten::convolution       0.000us         0.00%       1.871ms      93.560us            20
+#                                      aten::_convolution       0.000us         0.00%       1.871ms      93.560us            20
+#                          aten::convolution_overrideable       1.871ms        72.89%       1.871ms      93.560us            20
+#                                                gen_conv       1.484ms        57.82%       1.484ms      74.216us            20
+#                                        aten::batch_norm       0.000us         0.00%     432.640us      21.632us            20
+#                            aten::_batch_norm_impl_index       0.000us         0.00%     432.640us      21.632us            20
+#                                 aten::native_batch_norm     432.640us        16.85%     432.640us      21.632us            20
+#                                            conv_reorder     386.880us        15.07%     386.880us       6.448us            60
+# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------
+# Self CPU time total: 712.486ms
+# Self XPU time total: 2.567ms
+
+#
+
 
 ######################################################################
 # Note the occurrence of on-device kernels in the output (e.g. ``sgemm_32x32x32_NN``).
@@ -266,17 +308,22 @@
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #
 # Profiling results can be output as a ``.json`` trace file:
+# Tracing CUDA or XPU kernels.
+# Users can switch between cpu, cuda and xpu.
+device = 'cuda'
+
+activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA, ProfilerActivity.XPU]
 
-model = models.resnet18().cuda()
-inputs = torch.randn(5, 3, 224, 224).cuda()
+model = models.resnet18().to(device)
+inputs = torch.randn(5, 3, 224, 224).to(device)
 
-with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
+with profile(activities=activities) as prof:
     model(inputs)
 
 prof.export_chrome_trace("trace.json")
 
 ######################################################################
-# You can examine the sequence of profiled operators and CUDA kernels
+# You can examine the sequence of profiled operators and CUDA/XPU kernels
 # in Chrome trace viewer (``chrome://tracing``):
 #
 # .. image:: ../../_static/img/trace_img.png
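
Before opening ``trace.json`` in ``chrome://tracing``, a quick sanity check can confirm that the export produced events. This sketch assumes the standard Chrome trace JSON layout with a top-level ``traceEvents`` array, which is what ``export_chrome_trace`` writes:

    import json

    # Peek at the exported trace to confirm it contains events
    with open("trace.json") as f:
        trace = json.load(f)
    print(len(trace["traceEvents"]), "events recorded")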
@@ -287,15 +334,16 @@
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #
 # Profiler can be used to analyze Python and TorchScript stack traces:
+sort_by_keyword = "self_" + device + "_time_total"
 
 with profile(
-    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
+    activities=activities,
     with_stack=True,
 ) as prof:
     model(inputs)
 
 # Print aggregated stats
-print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cuda_time_total", row_limit=2))
+print(prof.key_averages(group_by_stack_n=5).table(sort_by=sort_by_keyword, row_limit=2))
 
 #################################################################################
 # The output might look like this (omitting some columns):
@@ -384,15 +432,17 @@
 # To send the signal to the profiler that the next step has started, call the ``prof.step()`` function.
 # The current profiler step is stored in ``prof.step_num``.
 #
-# The following example shows how to use all of the concepts above:
+# The following example shows how to use all of the concepts above for CUDA and XPU kernels:
+
+sort_by_keyword = "self_" + device + "_time_total"
 
 def trace_handler(p):
-    output = p.key_averages().table(sort_by="self_cuda_time_total", row_limit=10)
+    output = p.key_averages().table(sort_by=sort_by_keyword, row_limit=10)
     print(output)
     p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")
 
 with profile(
-    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
+    activities=activities,
     schedule=torch.profiler.schedule(
         wait=1,
         warmup=1,
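
The hunk above is cut off inside ``torch.profiler.schedule``; for orientation, here is a hedged reconstruction of the complete pattern. The ``active`` and ``repeat`` values and the loop length are assumptions chosen to exercise one full schedule cycle, not necessarily the recipe's exact numbers; ``activities``, ``model``, ``inputs``, and ``trace_handler`` are as defined in the hunks above:

    with profile(
        activities=activities,
        schedule=torch.profiler.schedule(
            wait=1,      # skip the first step
            warmup=1,    # profiler runs, but results are discarded
            active=2,    # record two steps
            repeat=1,    # run the wait/warmup/active cycle once
        ),
        on_trace_ready=trace_handler,
    ) as p:
        for idx in range(8):
            model(inputs)
            p.step()  # signal the profiler that the next step has started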
@@ -403,7 +453,6 @@ def trace_handler(p):
         model(inputs)
         p.step()
 
-
 ######################################################################
 # Learn More
 # ----------
