Commit c06809f

zhuhaozhe authored and Valentine233 committed
fix format (pytorch#2)
1 parent 5117de6 commit c06809f

File tree

1 file changed

intermediate_source/inductor_debug_cpu.rst

Lines changed: 26 additions & 9 deletions
@@ -302,11 +302,13 @@ Note that there exists a debugging tool provided by PyTorch, called `Minifier <h

Performance profiling
---------------------

In this part, we describe how to analyze the performance of a TorchInductor model.
First, we choose an eager-mode model as a baseline and set up a benchmark to compare
the end-to-end performance of the eager model against the inductor model.

.. code-block:: python

    from transformers import T5ForConditionalGeneration
    # init an eager model
    eager_model = T5ForConditionalGeneration.from_pretrained("t5-small")
@@ -343,21 +345,28 @@ the end to end performance between eager model and inductor model.
    print("ratio:", eager_t / inductor_t)
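
The timing code between these two snippets is elided in this diff. As a rough
illustration only, a minimal timing loop producing ``eager_t`` and
``inductor_t`` might look like the sketch below; the inputs, the iteration
counts, and the use of ``torch.compile`` are our assumptions, not the
tutorial's exact benchmark:

.. code-block:: python

    import timeit
    import torch

    # hypothetical inputs; a real benchmark should use representative data
    input_ids = torch.randint(0, 512, (1, 512))
    decoder_input_ids = torch.randint(0, 512, (1, 128))

    # assumed compilation step for the inductor model
    inductor_model = torch.compile(eager_model)

    with torch.no_grad():
        # warm up both models so one-time compilation cost is excluded
        for _ in range(10):
            eager_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            inductor_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
        # average milliseconds over 100 runs
        eager_t = timeit.timeit(
            lambda: eager_model(input_ids=input_ids,
                                decoder_input_ids=decoder_input_ids),
            number=100) * 1000 / 100
        inductor_t = timeit.timeit(
            lambda: inductor_model(input_ids=input_ids,
                                   decoder_input_ids=decoder_input_ids),
            number=100) * 1000 / 100

    print("eager use:", eager_t)
    print("inductor use:", inductor_t)
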
Output:
.. code-block:: shell

    eager use: 410.12550354003906
    inductor use: 478.59081745147705
    ratio: 0.8569439458198976

We see that the inductor model spends more time than the eager model, which does not meet our expectation.
To dive into op-level performance, we can use the `PyTorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_.

To enable kernel profiling in inductor, we need to set ``enable_kernel_profile``:

.. code-block:: python

    from torch._inductor import config
    config.cpp.enable_kernel_profile = True
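
Note that this flag is read when inductor generates code, so it should be set
before the model is compiled. A short sketch; the ``torch.compile`` usage here
is our assumption:

.. code-block:: python

    import torch
    from torch._inductor import config

    # set the flag first, then compile: the generated C++ kernels will
    # then be wrapped in profiler record functions
    config.cpp.enable_kernel_profile = True
    inductor_model = torch.compile(eager_model)
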

Following the steps in the `PyTorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_ tutorial,
we are able to get the profiling table and trace files.

.. code-block:: python

    from torch.profiler import profile, schedule, ProfilerActivity
    my_schedule = schedule(
        skip_first=10,
@@ -388,8 +397,10 @@ we can get the profiling table and trace files.
            p.step()
    print("latency: {} ms".format(1000*(total)/100))
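
The body of the profiling run is elided in this hunk. A minimal sketch of how
the pieces could fit together is given below; the trace handler, the iteration
count, and the model inputs are our assumptions, not the tutorial's exact code:

.. code-block:: python

    import time

    def trace_handler(p):
        # print an op-level summary and export a trace file for chrome://tracing
        print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
        p.export_chrome_trace("trace_{}.json".format(p.step_num))

    total = 0
    with profile(
            activities=[ProfilerActivity.CPU],
            schedule=my_schedule,
            on_trace_ready=trace_handler) as p:
        for _ in range(100):
            begin = time.time()
            inductor_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            total += time.time() - begin
            p.step()
    print("latency: {} ms".format(1000 * total / 100))
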

We will get the following profile table for the eager model:

.. code-block:: shell

    ----------------------- ------------ ------------ ------------ ------------ ------------ ------------
                       Name   Self CPU %     Self CPU  CPU total %    CPU total  CPU time avg    # of Calls
    ----------------------- ------------ ------------ ------------ ------------ ------------ ------------
@@ -415,8 +426,11 @@ We can get following profile tables for eager model
                aten::fill_        0.15%    613.000us        0.15%    613.000us     15.718us           39
    ----------------------- ------------ ------------ ------------ ------------ ------------ ------------
    Self CPU time total: 415.949ms

And the following table for the inductor model:

.. code-block:: shell

    ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
                                                        Name   Self CPU %     Self CPU  CPU total %    CPU total  CPU time avg    # of Calls
    ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
@@ -443,8 +457,10 @@ And for inductor model
    ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
    Self CPU time total: 474.360ms

We can search for the most time-consuming kernel, ``graph_0_cpp_fused__softmax_7``, in ``output_code.py`` to see the generated code:

.. code-block:: python

    cpp_fused__softmax_7 = async_compile.cpp('''
    #include <ATen/record_function.h>
    #include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
@@ -584,8 +600,9 @@ We can search the most time consuming `graph_0_cpp_fused__softmax_7` in `output_
        }
    }
    ''')
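
If you are not sure where ``output_code.py`` was written, one way to locate it
is to run the script with debug artifacts enabled (``TORCH_COMPILE_DEBUG=1``
dumps them under a ``torch_compile_debug`` folder) and search for the kernel
name. A small sketch, with the directory layout being an assumption:

.. code-block:: python

    import glob

    # search every generated output_code.py under the debug directory for
    # the hot kernel's name and print the matching files
    for path in glob.glob("torch_compile_debug/**/output_code.py", recursive=True):
        with open(path) as f:
            if "cpp_fused__softmax_7" in f.read():
                print(path)
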

Considering the kernel name ``cpp_fused__softmax_*`` and the profile
results together, we may suspect that the generated code for ``softmax`` is
inefficient. We encourage you to report an issue with all your findings above.
