
Commit 71d3e3e

Fix italics in blog post
1 parent f08d45b commit 71d3e3e

1 file changed (+6 -6 lines)


_posts/2023-12-18-training-production-ai-models.md (+6 -6)
@@ -8,20 +8,20 @@ author: CK Luk, Daohang Shi, Yuzhen Huang, Jackie (Jiaqi) Xu, Jade Nie, Zhou Wan

## 1. Introduction

-[PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) (abbreviated as PT2) can significantly improve the training and inference performance of an AI model using a compiler called_ torch.compile_ while being 100% backward compatible with PyTorch 1.x. There have been reports on how PT2 improves the performance of common _benchmarks_ (e.g., [huggingface’s diffusers](https://huggingface.co/docs/diffusers/optimization/torch2.0)). In this blog, we discuss our experiences in applying PT2 to _production _AI models_ _at Meta.
+[PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) (abbreviated as PT2) can significantly improve the training and inference performance of an AI model using a compiler called _torch.compile_ while being 100% backward compatible with PyTorch 1.x. There have been reports on how PT2 improves the performance of common _benchmarks_ (e.g., [huggingface’s diffusers](https://huggingface.co/docs/diffusers/optimization/torch2.0)). In this blog, we discuss our experiences in applying PT2 to _production AI models_ at Meta.

## 2. Background

### 2.1 Why is automatic performance optimization important for production?

-Performance is particularly important for production—e.g, even a 5% reduction in the training time of a heavily used model can translate to substantial savings in GPU cost and data-center _power_. Another important metric is _development efficiency_, which measures how many engineer-months are required to bring a model to production. Typically, a significant part of this bring-up effort is spent on _manual _performance tuning such as rewriting GPU kernels to improve the training speed. By providing _automatic _performance optimization, PT2 can improve _both_ cost and development efficiency.
+Performance is particularly important for production—e.g, even a 5% reduction in the training time of a heavily used model can translate to substantial savings in GPU cost and data-center _power_. Another important metric is _development efficiency_, which measures how many engineer-months are required to bring a model to production. Typically, a significant part of this bring-up effort is spent on _manual_ performance tuning such as rewriting GPU kernels to improve the training speed. By providing _automatic_ performance optimization, PT2 can improve _both_ cost and development efficiency.

### 2.2 How PT2 improves performance

-As a compiler, PT2 can view_ multiple_ operations in the training graph captured from a model (unlike in PT1.x, where only one operation is executed at a time). Consequently, PT2 can exploit a number of performance optimization opportunities, including:
+As a compiler, PT2 can view _multiple_ operations in the training graph captured from a model (unlike in PT1.x, where only one operation is executed at a time). Consequently, PT2 can exploit a number of performance optimization opportunities, including:
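As a rough illustration of the torch.compile API that the diffed paragraphs above refer to, here is a minimal sketch of compiling a toy module whose forward pass chains several operations. The module, shapes, and device handling are illustrative assumptions and are not code from the blog post or this commit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy module whose forward pass chains a matmul with several pointwise ops.
# Because torch.compile captures the whole graph, a backend such as
# TorchInductor can fuse the pointwise tail into fewer GPU kernels instead of
# launching a separate kernel per op, as eager execution would.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)

    def forward(self, x):
        y = self.linear(x)
        return F.gelu(y) * 2.0 + 1.0  # pointwise chain that is a fusion candidate

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyModel().to(device)

# Drop-in: the compiled module keeps the same call signature as the original.
compiled_model = torch.compile(model)

x = torch.randn(32, 1024, device=device)
out = compiled_model(x)  # first call triggers graph capture and compilation
```

The point of the sketch is only that the compiled module is a drop-in replacement for the eager one, which is what the "100% backward compatible" claim in the quoted paragraph refers to.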

@@ -124,9 +124,9 @@ In this section, we use three production models to evaluate PT2. First we show t

Figure 7 reports the training-time speedup with PT2. For each model, we show four cases: (i) no-compile with bf16, (ii) compile with fp32, (iii) compile with bf16, (iv) compile with bf16 and autotuning. The y-axis is the speedup over the baseline, which is no-compile with fp32. Note that no-compile with bf16 is actually slower than no-compile with fp32, due to the type conversion overhead. In contrast, compiling with bf16 achieves much larger speedups by reducing much of this overhead. Overall, given that these models are already heavily optimized by hand, we are excited to see that torch.compile can still provide 1.14-1.24x speedup.

-![Fig.7 Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is_ omitted _in this figure).](/assets/images/training-production-ai-models/blog-fig7.jpg){:style="width:100%;"}
+![Fig.7 Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is _omitted_ in this figure).](/assets/images/training-production-ai-models/blog-fig7.jpg){:style="width:100%;"}

-<p style="line-height: 1.05"><small><em><strong>Fig. 7</strong>: Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is_ omitted _in this figure).</em></small></p>
+<p style="line-height: 1.05"><small><em><strong>Fig. 7</strong>: Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is omitted in this figure).</em></small></p>
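The Figure 7 paragraph quoted in this hunk names four configurations but the blog does not show the corresponding code. The sketch below is one plausible way to express them with torch.compile and bf16 autocast; the model, optimizer, and data are placeholders, and the mapping of the blog's "autotuning" onto mode="max-autotune" is an assumption rather than a detail confirmed by the post.

```python
import torch
import torch.nn as nn

# One plausible mapping of the four Figure 7 cases onto PT2 settings.
# Placeholder model/optimizer/data; not Meta's production training code.
def make_model(use_compile: bool, use_autotune: bool) -> nn.Module:
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1)).cuda()
    if use_compile:
        # Assumption: the blog's "autotuning" case uses a more aggressive
        # Inductor search, here expressed as mode="max-autotune".
        mode = "max-autotune" if use_autotune else "default"
        model = torch.compile(model, mode=mode)
    return model

def train_step(model, optimizer, x, y, use_bf16: bool):
    # bf16 cases wrap the forward pass in autocast; backward() is called
    # outside the autocast context, as is standard practice.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=use_bf16):
        loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

# Case (iv) from Figure 7: compile + bf16 + autotuning.
model = make_model(use_compile=True, use_autotune=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 512, device="cuda")
y = torch.randn(64, 1, device="cuda")
loss = train_step(model, optimizer, x, y, use_bf16=True)
```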

@@ -148,4 +148,4 @@ In this blog, we demonstrate that PT2 can significantly accelerate the training

## 6. Acknowledgements

-Many thanks to [Mark Saroufim](mailto:[email protected]), [Adnan Aziz](mailto:[email protected]), and [Gregory Chanan](mailto:[email protected]) for their detailed and insightful reviews.
+Many thanks to [Mark Saroufim](mailto:[email protected]), [Adnan Aziz](mailto:[email protected]), and [Gregory Chanan](mailto:[email protected]) for their detailed and insightful reviews.
