---
layout: blog_detail
title: "PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing"
author: Team PyTorch
---

We are excited to announce the release of PyTorch® 2.1 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.1.0))! PyTorch 2.1 offers automatic dynamic shape support in _torch.compile_, _torch.distributed.checkpoint_ for saving/loading distributed training jobs on multiple ranks in parallel, and _torch.compile_ support for the NumPy API.

In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of _torch.export_, a sound full-graph capture mechanism, and _torch.export_-based quantization.

Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.

Summary:
- _torch.compile_ now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using _automatic dynamic shapes._
- _torch.distributed.checkpoint_ enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology.
- _torch.compile_ can now compile NumPy operations via translating them into PyTorch-equivalent operations.
- _torch.compile_ now includes improved support for Python 3.11.
- New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
- _torch.export_, a sound full-graph capture mechanism, is introduced as a prototype feature, as well as _torch.export_-based quantization.
- _torch.sparse_ now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.

| **Stable** | **Beta** | **Prototype** | **Performance Improvements** |
|------------|-----------------------------------------------|------------------------------------|-----------------------------------------------------------|
|            | Automatic Dynamic Shapes                       | _torch.export()_                   | AVX512 kernel support                                      |
|            | _torch.distributed.checkpoint_                 | _torch.export_-based Quantization  | CPU optimizations for scaled-dot-product-attention (SDPA)  |
|            | _torch.compile_ + NumPy                        | semi-structured (2:4) sparsity     | CPU optimizations for bfloat16                             |
|            | _torch.compile_ + Python 3.11                  | _cpp\_wrapper_ for torchinductor   |                                                            |
|            | _torch.compile_ + _autograd.Function_          |                                    |                                                            |
|            | third-party device integration: _PrivateUse1_  |                                    |                                                            |

\*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).

## **Beta Features**

**\[Beta] Automatic Dynamic Shapes**

Dynamic shapes is functionality built into _torch.compile_ that can minimize recompilations by tracking and generating code based on the symbolic shape of a tensor rather than the static shape (e.g. _\[B, 128, 4]_ rather than _\[64, 128, 4]_). This allows _torch.compile_ to generate a single kernel that can work for many sizes, at only a modest cost to efficiency. Dynamic shapes has been greatly stabilized in PyTorch 2.1, and is now automatically enabled if _torch.compile_ notices recompilation due to varying input shapes. You can disable automatic dynamic by passing _dynamic=False_ to _torch.compile_, or by setting _torch.\_dynamo.config.automatic\_dynamic\_shapes = False_.
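
A minimal sketch of the behavior (the function and shapes here are arbitrary examples): on the second call with a different batch size, _torch.compile_ recompiles once with a symbolic batch dimension instead of specializing again, so later batch sizes reuse the same kernel.

```
import torch

@torch.compile  # automatic dynamic shapes is enabled by default in 2.1
def f(x):
    return torch.nn.functional.relu(x) * 2

f(torch.randn(64, 128, 4))   # first call: compiled for the static shape [64, 128, 4]
f(torch.randn(32, 128, 4))   # shape change triggers one recompile with a symbolic batch dim
f(torch.randn(16, 128, 4))   # typically reuses the dynamic kernel, no further recompilation

# To opt out, compile with torch.compile(f, dynamic=False) or set
# torch._dynamo.config.automatic_dynamic_shapes = False
```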

In PyTorch 2.1, we have shown good performance with dynamic shapes enabled on a variety of model types, including large language models, on both CUDA and CPU.

For more information on dynamic shapes, see [this documentation](https://pytorch.org/docs/2.1/torch.compiler_dynamic_shapes.html).

**\[Beta] _torch.distributed.checkpoint_**

_torch.distributed.checkpoint_ enables saving and loading models from multiple ranks in parallel. In addition, checkpointing automatically handles fully-qualified-name (FQN) mappings across models and optimizers, enabling load-time resharding across differing cluster topologies.
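
A condensed sketch of the save/load path, loosely following the linked recipe. It assumes the process group is already initialized (e.g. via _torchrun_) and that `model` is an FSDP-wrapped module; the checkpoint directory name is a placeholder.

```
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

CHECKPOINT_DIR = "checkpoint"  # hypothetical shared path visible to all ranks

# save: every rank writes its own shard of the sharded state dict in parallel
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    dist_cp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dist_cp.FileSystemWriter(CHECKPOINT_DIR),
    )

# load: possibly on a different number of ranks; tensors are resharded in place
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    dist_cp.load_state_dict(
        state_dict=state_dict,
        storage_reader=dist_cp.FileSystemReader(CHECKPOINT_DIR),
    )
    model.load_state_dict(state_dict["model"])
```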

For more information, see _torch.distributed.checkpoint_ [documentation](https://pytorch.org/docs/2.1/distributed.checkpoint.html) and [tutorial](https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html).

**\[Beta] _torch.compile_ + _NumPy_**

_torch.compile_ now understands how to compile NumPy operations via translating them into PyTorch-equivalent operations. Because this integration operates in a device-agnostic manner, you can now GPU-accelerate NumPy programs – or even mixed NumPy/PyTorch programs – just by using _torch.compile_.
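
A small sketch (the function is an arbitrary example): the NumPy calls inside the compiled region are traced into PyTorch operations, and the result comes back as a regular ndarray.

```
import numpy as np
import torch

@torch.compile
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # these np.* calls are translated to PyTorch ops under torch.compile
    return np.sum(x * y, axis=-1)

x = np.random.randn(1024, 64)
y = np.random.randn(1024, 64)
out = numpy_fn(x, y)          # returns a NumPy array
print(type(out), out.shape)   # <class 'numpy.ndarray'> (1024,)
```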

Please see [this section](https://pytorch.org/docs/2.1/torch.compiler_faq.html#does-numpy-work-with-torch-compile) in the _torch.compile_ FAQ for more information about the _torch.compile_ + NumPy interaction, and follow the [PyTorch Blog](https://pytorch.org/blog/) for a forthcoming blog post about this feature.

**\[Beta] _torch.compile_ + Python 3.11**

_torch.compile_ previously only supported Python versions 3.8-3.10. Users can now optimize models with _torch.compile_ in Python 3.11.

**\[Beta] _torch.compile_ + _autograd.Function_**

_torch.compile_ can now trace and optimize the backward function of user-defined [autograd Functions](https://pytorch.org/docs/stable/autograd.html#function), which unlocks training optimizations for models that make heavier use of extension mechanisms.
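
A minimal sketch with an arbitrary custom Function: when the compiled code is used for training, the hand-written backward can now be traced and optimized as well.

```
import torch

class MyExp(torch.autograd.Function):
    """Toy custom op: exp(x) with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        return grad_out * y  # d/dx exp(x) = exp(x)

@torch.compile
def loss_fn(x):
    return MyExp.apply(x).sum()

x = torch.randn(8, requires_grad=True)
loss_fn(x).backward()   # the custom backward is traced/optimized, not run eagerly around a graph break
print(x.grad.shape)
```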

**\[Beta] Improved third-party device support: _PrivateUse1_**

Third-party device types can now be registered to PyTorch using the privateuse1 dispatch key. This allows device extensions to register new kernels to PyTorch and to associate them with the new key, allowing user code to work equivalently to built-in device types. For example, to register _“my\_hardware\_device”_, one can do the following:

```
# assumes a compiled device extension has registered its kernels under the PrivateUse1 dispatch key
torch.rename_privateuse1_backend("my_hardware_device")   # expose PrivateUse1 under the new device name
torch.utils.generate_methods_for_privateuse1_backend()   # generate convenience Tensor methods for the renamed backend
x = torch.randn((2, 3), device='my_hardware_device')
y = x + x # run add kernel on 'my_hardware_device'
```

To validate this feature, the OSS team from _Ascend NPU_ has successfully integrated [**torch\_npu**](https://github.com/Ascend/pytorch) into PyTorch as a plug-in through the _PrivateUse1_ functionality.

For more information, please see the PrivateUse1 tutorial [here](https://pytorch.org/tutorials/advanced/privateuseone.html).

## **Prototype Features**

**\[Prototype] _torch.export()_**

_torch.export()_ provides a sound tracing mechanism to capture a full graph from a PyTorch program based on new technologies provided by PT2.0.

Users can extract a clean representation (Export IR) of a PyTorch program in the form of a dataflow graph, consisting mostly of straight-line calls to PyTorch operators. Export IR can then be transformed, serialized, saved to file, transferred, and loaded back for execution in an environment with or without Python.
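
A minimal sketch of the flow (the module is an arbitrary example, and the inspection/execution calls reflect the 2.1 prototype API as we understand it): `torch.export.export` returns an `ExportedProgram` whose captured graph can be printed or run directly.

```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x).sum(dim=-1)

example_args = (torch.randn(4, 8),)

# capture a single full-graph representation of the program
ep = torch.export.export(M(), example_args)

print(ep.graph_module.code)   # inspect the captured Export IR as straight-line operator calls
out = ep(*example_args)       # the exported program can be executed directly
```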

For more information, please see the tutorial [here](https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html).

**\[Prototype] _torch.export_-based Quantization**

_torch.ao.quantization_ now supports post-training static quantization on PyTorch2-based _torch.export_ flows. This includes support for built-in _XNNPACK_ and _X64Inductor_ _Quantizer_, as well as the ability to specify one’s own _Quantizer_.
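
A condensed sketch of the post-training static quantization flow, loosely following the linked PTQ tutorial; the toy model, example inputs, and single-batch "calibration" are placeholders.

```
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(2, 16),)

# 1. capture the model with the pre-autograd torch.export-based tracer
exported = capture_pre_autograd_graph(model, example_inputs)

# 2. pick a quantizer (the built-in XNNPACK one here) and insert observers
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(exported, quantizer)

# 3. calibrate with representative data, then convert to a quantized model
prepared(*example_inputs)
quantized = convert_pt2e(prepared)
```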

For an explanation on post-training static quantization with torch.export, see [this tutorial](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html); for quantization-aware training for static quantization with torch.export, see [this tutorial](https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html).

For an explanation on how to write one’s own Quantizer, see [this tutorial](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html).

**\[Prototype] semi-structured (2:4) sparsity for NVIDIA® GPUs**

_torch.sparse_ now supports creating and accelerating compute over semi-structured sparse (2:4) tensors. For more information on the format, see [this](https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/) blog from NVIDIA. A minimal example introducing semi-structured sparsity is as follows:

```
import torch
import torch.nn as nn
from torch.sparse import to_sparse_semi_structured

x = torch.rand(64, 64).half().cuda()
mask = torch.tensor([0, 0, 1, 1]).tile((64, 16)).cuda().bool()
linear = nn.Linear(64, 64).half().cuda()

# prune the weight to the 2:4 pattern, then convert it to the semi-structured format
linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0)))
linear(x)
```

To learn more, please see the [documentation](https://pytorch.org/docs/2.1/sparse.html#sparse-semi-structured-tensors) and accompanying [tutorial](https://pytorch.org/tutorials/prototype/semi_structured_sparse.html).

**\[Prototype] _cpp\_wrapper_ for _torchinductor_**

_cpp\_wrapper_ can reduce the Python overhead for invoking kernels in torchinductor by generating the kernel wrapper code in C++. This feature is still in the prototype phase; it does not support all programs that successfully compile in PT2 today. Please file issues if you discover limitations for your use case to help us prioritize.

The API to turn this feature on is:

```
import torch
import torch._inductor.config as config
config.cpp_wrapper = True
```
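
With the flag set, subsequent _torch.compile_ invocations pick up the C++ wrapper. A minimal sketch of putting the two together (the function is an arbitrary example):

```
import torch
import torch._inductor.config as config

config.cpp_wrapper = True  # must be set before compilation is triggered

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(1024))  # first call compiles; inductor emits a C++ kernel wrapper instead of Python
```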

For more information, please see the [tutorial](https://pytorch.org/tutorials/prototype/inductor_cpp_wrapper_tutorial.html).

## **Performance Improvements**

**AVX512 kernel support**

In PyTorch 2.0, AVX2 kernels would be used even if the CPU supported AVX512 instructions. Now, PyTorch defaults to using AVX512 CPU kernels if the CPU supports those instructions, equivalent to setting _ATEN\_CPU\_CAPABILITY=avx512_ in previous releases. The previous behavior can be restored by setting _ATEN\_CPU\_CAPABILITY=avx2_.
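
A sketch of forcing the pre-2.1 behavior from Python; the environment variable must be set before PyTorch initializes its CPU kernels (exporting it in the shell before launching works equally well). The capability query at the end is an assumption about what your build exposes and is only there to confirm the setting took effect.

```
import os

# set before importing torch so ATen picks it up when CPU kernels are initialized
os.environ["ATEN_CPU_CAPABILITY"] = "avx2"   # restore the pre-2.1 default dispatch

import torch

# assumption: this query is available in your build; it reports which
# instruction set the CPU kernels will dispatch to (e.g. "AVX2" or "AVX512")
print(torch.backends.cpu.get_cpu_capability())
```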

**CPU optimizations for scaled-dot-product-attention (SDPA)**

Previous versions of PyTorch provided optimized CUDA implementations for transformer primitives via _torch.nn.functional.scaled\_dot\_product\_attention_. PyTorch 2.1 includes optimized FlashAttention-based CPU routines.
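
A minimal sketch on CPU (the shapes are arbitrary); the same call that previously targeted CUDA now benefits from the optimized CPU routines in 2.1.

```
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim) query/key/value tensors on CPU
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# in 2.1 the CPU path includes FlashAttention-based kernels
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```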

See the documentation [here](https://pytorch.org/docs/2.1/generated/torch.nn.functional.scaled_dot_product_attention.html).

**CPU optimizations for bfloat16**

PyTorch 2.1 includes CPU optimizations for bfloat16, including improved vectorization support and _torchinductor_ codegen.
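
A small sketch of one common way to exercise these paths, combining CPU autocast to bfloat16 with _torch.compile_ (the model here is an arbitrary placeholder):

```
import torch

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU()).eval()
compiled = torch.compile(model)  # inductor codegen covers the bfloat16 CPU path

x = torch.randn(32, 256)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = compiled(x)

print(out.dtype)  # torch.bfloat16
```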