@@ -302,11 +302,13 @@ Note that there exists a debugging tool provided by PyTorch, called `Minifier <h
Performance profiling
---------------------
+
In this part, we describe how to analyze TorchInductor model performance.
First, we choose an eager-mode model as the baseline and set up a benchmark to compare
the end-to-end performance of the eager model and the inductor model.

.. code-block:: python
+
    from transformers import T5ForConditionalGeneration
    # init an eager model
    eager_model = T5ForConditionalGeneration.from_pretrained("t5-small")
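
The benchmark body between this hunk and the next is elided by the diff. For readers following along, here is a minimal sketch of such an eager-vs-inductor comparison; the ``torch.compile`` call, the input construction, and the warm-up and iteration counts are illustrative assumptions rather than the tutorial's exact code:

.. code-block:: python

    import timeit
    import torch

    # compile the eager model with the TorchInductor backend (the default)
    inductor_model = torch.compile(eager_model)

    # hypothetical inputs; the real benchmark would tokenize actual text
    input_ids = torch.randint(0, 32128, (1, 64))
    decoder_input_ids = torch.randint(0, 32128, (1, 64))

    with torch.no_grad():
        # warm up both models so one-time compilation cost is excluded
        for _ in range(10):
            eager_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            inductor_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
        eager_t = timeit.timeit(
            lambda: eager_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids),
            number=100,
        )
        inductor_t = timeit.timeit(
            lambda: inductor_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids),
            number=100,
        )

    print("eager use:", eager_t)
    print("inductor use:", inductor_t)
    print("ratio:", eager_t / inductor_t)
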
@@ -343,21 +345,28 @@ the end to end performance between eager model and inductor model.
    print("ratio:", eager_t / inductor_t)

Output:
+
.. code-block:: shell
+
    eager use: 410.12550354003906
    inductor use: 478.59081745147705
    ratio: 0.8569439458198976

We see that the inductor model spent more time than the eager model, which does not meet our expectations.
-To deep dive op-level performance, we can use `Pytorch Profiler<https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`
-We can enable kernel profile in inductor by:
+To dive deeper into op-level performance, we can use the `PyTorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_.
+
+To enable kernel profiling in inductor, we need to set ``enable_kernel_profile``:
+

.. code-block:: python
+
    from torch._inductor import config
    config.cpp.enable_kernel_profile = True
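
(An aside, based on the generated code shown later in this section: with this flag set, the Inductor C++ backend wraps each generated kernel in a profiler record, note the ``#include <ATen/record_function.h>`` below, which is why fused kernels such as ``graph_0_cpp_fused__softmax_7`` appear by name in the profiler table.)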

-Following the steps in `Pytorch Profiler<https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`
-we can get the profiling table and trace files.
+Following the steps in the `PyTorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_ recipe,
+we can obtain the profiling table and trace files.
+

.. code-block:: python
+
    from torch.profiler import profile, schedule, ProfilerActivity
    my_schedule = schedule(
        skip_first=10,
@@ -388,8 +397,10 @@ we can get the profiling table and trace files.
            p.step()
    print("latency: {} ms".format(1000*(total)/100))

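The profiling loop itself is mostly elided by the hunk above. To also produce the trace files mentioned earlier, the profiler accepts an ``on_trace_ready`` callback. Below is a sketch in the spirit of the PyTorch Profiler recipe, reusing names from the benchmark sketch above; the handler body, schedule values, and output path are illustrative assumptions:

.. code-block:: python

    from torch.profiler import profile, schedule, ProfilerActivity

    def trace_handler(prof):
        # print an op-level summary and dump a Chrome trace for this cycle;
        # the .json file can be opened in chrome://tracing or Perfetto
        print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
        prof.export_chrome_trace("trace_step{}.json".format(prof.step_num))

    with profile(
        activities=[ProfilerActivity.CPU],
        schedule=schedule(skip_first=10, wait=5, warmup=5, active=1),
        on_trace_ready=trace_handler,
    ) as p:
        for _ in range(100):
            inductor_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            p.step()
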
-We can get following profile tables for eager model
+We will get the following profile table for the eager model:
+

.. code-block:: shell
+
    ----------------------- ------------ ------------ ------------ ------------ ------------ ------------
                       Name   Self CPU %     Self CPU  CPU total %    CPU total CPU time avg   # of Calls
    ----------------------- ------------ ------------ ------------ ------------ ------------ ------------
@@ -415,8 +426,11 @@ We can get following profile tables for eager model
                aten::fill_        0.15%    613.000us        0.15%    613.000us     15.718us           39
    ----------------------- ------------ ------------ ------------ ------------ ------------ ------------
    Self CPU time total: 415.949ms
-And for inductor model
+
+And we get the following table for the inductor model:
+

.. code-block:: shell
+
    ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
                                                       Name   Self CPU %     Self CPU  CPU total %    CPU total CPU time avg   # of Calls
    ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
@@ -443,8 +457,10 @@ And for inductor model
    ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
    Self CPU time total: 474.360ms

-We can search the most time consuming `graph_0_cpp_fused__softmax_7` in `output_code.py` to see the generated code:
+We can search for the most time-consuming ``graph_0_cpp_fused__softmax_7`` in ``output_code.py`` to see the generated code:
+

.. code-block:: python
+
    cpp_fused__softmax_7 = async_compile.cpp('''
    #include <ATen/record_function.h>
    #include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
@@ -584,8 +600,9 @@ We can search the most time consuming `graph_0_cpp_fused__softmax_7` in `output_
        }
    }
    ''')
-With the kernel name `cpp_fused__softmax_*` and considering the profile
-results together, we may suspect the generated code for 'softmax' is
+
+With the kernel name ``cpp_fused__softmax_*`` and considering the profile
+results together, we may suspect the generated code for ``softmax`` is
inefficient. We encourage you to report an issue with all your findings above.
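
A note for readers who cannot find ``output_code.py``: one way to dump the Inductor-generated code is to rerun the script with the ``TORCH_COMPILE_DEBUG=1`` environment variable set, which writes a debug trace, including ``output_code.py``, under a ``torch_compile_debug/`` directory (the exact layout may vary across PyTorch versions).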