_posts/2023-12-18-training-production-ai-models.md (+6 −6)
@@ -8,20 +8,20 @@ author: CK Luk, Daohang Shi, Yuzhen Huang, Jackie (Jiaqi) Xu, Jade Nie, Zhou Wan

## 1. Introduction

-[PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) (abbreviated as PT2) can significantly improve the training and inference performance of an AI model using a compiler called_ torch.compile_ while being 100% backward compatible with PyTorch 1.x. There have been reports on how PT2 improves the performance of common _benchmarks_ (e.g., [huggingface’s diffusers](https://huggingface.co/docs/diffusers/optimization/torch2.0)). In this blog, we discuss our experiences in applying PT2 to _production _AI models__at Meta.
+[PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) (abbreviated as PT2) can significantly improve the training and inference performance of an AI model using a compiler called _torch.compile_ while being 100% backward compatible with PyTorch 1.x. There have been reports on how PT2 improves the performance of common _benchmarks_ (e.g., [huggingface’s diffusers](https://huggingface.co/docs/diffusers/optimization/torch2.0)). In this blog, we discuss our experiences in applying PT2 to _production AI models_ at Meta.
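
The one-line opt-in mentioned above can be illustrated with a minimal, hypothetical sketch (the toy module, shapes, and the assumption of an available CUDA GPU are ours, not from the post):

```python
import torch
import torch.nn as nn

# Hypothetical toy model; any nn.Module or plain Python function can be compiled the same way.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 128)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = ToyModel().cuda()              # assumes a CUDA GPU is available
compiled_model = torch.compile(model)  # one-line opt-in; the surrounding code stays unchanged

x = torch.randn(16, 128, device="cuda")
out = compiled_model(x)                # the first call triggers graph capture and compilation
```

Because the compiled module is a drop-in replacement for the original one, an existing PyTorch 1.x training loop can adopt it with just this one extra call.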
## 2. Background

### 2.1 Why is automatic performance optimization important for production?

-Performance is particularly important for production—e.g, even a 5% reduction in the training time of a heavily used model can translate to substantial savings in GPU cost and data-center _power_. Another important metric is _development efficiency_, which measures how many engineer-months are required to bring a model to production. Typically, a significant part of this bring-up effort is spent on _manual _performance tuning such as rewriting GPU kernels to improve the training speed. By providing _automatic _performance optimization, PT2 can improve _both_ cost and development efficiency.
+Performance is particularly important for production—e.g, even a 5% reduction in the training time of a heavily used model can translate to substantial savings in GPU cost and data-center _power_. Another important metric is _development efficiency_, which measures how many engineer-months are required to bring a model to production. Typically, a significant part of this bring-up effort is spent on _manual_ performance tuning such as rewriting GPU kernels to improve the training speed. By providing _automatic_ performance optimization, PT2 can improve _both_ cost and development efficiency.

### 2.2 How PT2 improves performance

-As a compiler, PT2 can view_ multiple_ operations in the training graph captured from a model (unlike in PT1.x, where only one operation is executed at a time). Consequently, PT2 can exploit a number of performance optimization opportunities, including:
+As a compiler, PT2 can view _multiple_ operations in the training graph captured from a model (unlike in PT1.x, where only one operation is executed at a time). Consequently, PT2 can exploit a number of performance optimization opportunities, including:
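
The list of such opportunities continues outside this hunk; one of them, fusing several element-wise operations into a single generated kernel, can be illustrated with a hypothetical sketch (the function, shapes, and the assumption of a CUDA GPU are ours, not from the post):

```python
import torch

def pointwise_chain(x, y):
    # A chain of element-wise ops; eager mode dispatches them one kernel at a time,
    # whereas the captured graph exposes the whole chain to the compiler at once.
    z = torch.sin(x) + torch.cos(y)
    return torch.relu(z) * 2.0

compiled_chain = torch.compile(pointwise_chain)

x = torch.randn(1024, 1024, device="cuda")  # assumes a CUDA GPU is available
y = torch.randn(1024, 1024, device="cuda")
out = compiled_chain(x, y)  # compiled on first call; later calls reuse the generated kernel
```

Because the function is traced into a graph on the first call, the compiler can consider all of these operations together rather than one at a time.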
@@ -124,9 +124,9 @@ In this section, we use three production models to evaluate PT2. First we show t
Figure 7 reports the training-time speedup with PT2. For each model, we show four cases: (i) no-compile with bf16, (ii) compile with fp32, (iii) compile with bf16, (iv) compile with bf16 and autotuning. The y-axis is the speedup over the baseline, which is no-compile with fp32. Note that no-compile with bf16 is actually slower than no-compile with fp32, due to the type conversion overhead. In contrast, compiling with bf16 achieves much larger speedups by reducing much of this overhead. Overall, given that these models are already heavily optimized by hand, we are excited to see that torch.compile can still provide 1.14-1.24x speedup.
-{:style="width:100%;"}
+{:style="width:100%;"}
-<p style="line-height: 1.05"><small><em><strong>Fig. 7</strong>: Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is_ omitted _in this figure).</em></small></p>
+<p style="line-height: 1.05"><small><em><strong>Fig. 7</strong>: Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is omitted in this figure).</em></small></p>
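
The "compile with bf16 and autotuning" configuration in Figure 7 plausibly corresponds to combining a torch.compile autotuning mode with bf16 autocast; the sketch below is only a guess at a minimal version of that setup (the stand-in model, optimizer, and shapes are ours, not the production models measured in the post):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a production model; only the compile/autocast pattern matters here.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# mode="max-autotune" asks the compiler to benchmark candidate kernel configurations
# and keep the fastest ones, at the cost of longer compile time.
compiled_model = torch.compile(model, mode="max-autotune")

x = torch.randn(1024, 256, device="cuda")
y = torch.randn(1024, 1, device="cuda")

optimizer.zero_grad(set_to_none=True)
# bf16 autocast runs eligible ops in bfloat16; under torch.compile much of the
# eager-mode type-conversion overhead noted above can be optimized away.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(compiled_model(x), y)

loss.backward()
optimizer.step()
```

Compared with the plain `torch.compile(model)` call shown earlier, only the `mode` argument and the autocast context change.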
@@ -148,4 +148,4 @@ In this blog, we demonstrate that PT2 can significantly accelerate the training