
Commit 201fcfa

c-p-i-o and svekars committed
[doc] Fix to DDP tutorial (#3120)
Summary:

1. Add "set_device" call to keep things consistent between all DDP tutorials. This was inspired by the following change in the PyTorch repo: pytorch/examples#1285 (review)
2. Fix up the tutorial and add additional prints when the model exits.

Test Plan: Ran tutorial with the applied changes.

Co-authored-by: Svetlana Karslioglu <[email protected]>
1 parent 5a7f1e4 commit 201fcfa
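For reference, the ``set_device`` change described above boils down to pinning each process to its local GPU before the process group is created, which is what the updated ``elastic_ddp.py`` code in the diff below does. A minimal sketch of that pattern, assuming a ``torchrun``-style launch that sets ``LOCAL_RANK`` (the function and model here are illustrative, not part of the commit):

.. code:: python

    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # Pin this process to its own GPU *before* creating the process group,
        # so every CUDA operation and collective lands on the intended device.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        dist.init_process_group("nccl")
        rank = dist.get_rank()

        model = torch.nn.Linear(10, 10).to(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])

        # ... training loop using ddp_model ...

        dist.destroy_process_group()
        print(f"Finished running DDP example on rank {rank}.")


    if __name__ == "__main__":
        main()

Such a script would be launched with something like ``torchrun --nproc_per_node=<num_gpus> <script>.py``.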

File tree

1 file changed: +53 -46 lines changed


intermediate_source/ddp_tutorial.rst

+53 -46
@@ -2,7 +2,7 @@ Getting Started with Distributed Data Parallel
 =================================================
 **Author**: `Shen Li <https://mrshenli.github.io/>`_

-**Edited by**: `Joe Zhu <https://github.com/gunandrose4u>`_
+**Edited by**: `Joe Zhu <https://github.com/gunandrose4u>`_, `Chirag Pandya <https://github.com/c-p-i-o>`__

 .. note::
 |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/intermediate_source/ddp_tutorial.rst>`__.
@@ -15,24 +15,30 @@ Prerequisites:


 `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#module-torch.nn.parallel>`__
-(DDP) implements data parallelism at the module level which can run across
-multiple machines. Applications using DDP should spawn multiple processes and
-create a single DDP instance per process. DDP uses collective communications in the
+(DDP) is a powerful module in PyTorch that allows you to parallelize your model across
+multiple machines, making it perfect for large-scale deep learning applications.
+To use DDP, you'll need to spawn multiple processes and create a single instance of DDP per process.
+
+But how does it work? DDP uses collective communications from the
 `torch.distributed <https://pytorch.org/tutorials/intermediate/dist_tuto.html>`__
-package to synchronize gradients and buffers. More specifically, DDP registers
-an autograd hook for each parameter given by ``model.parameters()`` and the
-hook will fire when the corresponding gradient is computed in the backward
-pass. Then DDP uses that signal to trigger gradient synchronization across
-processes. Please refer to
-`DDP design note <https://pytorch.org/docs/master/notes/ddp.html>`__ for more details.
+package to synchronize gradients and buffers across all processes. This means that each process will have
+its own copy of the model, but they'll all work together to train the model as if it were on a single machine.
+
+To make this happen, DDP registers an autograd hook for each parameter in the model.
+When the backward pass is run, this hook fires and triggers gradient synchronization across all processes.
+This ensures that each process has the same gradients, which are then used to update the model.
+
+For more information on how DDP works and how to use it effectively, be sure to check out the
+`DDP design note <https://pytorch.org/docs/master/notes/ddp.html>`__.
+With DDP, you can train your models faster and more efficiently than ever before!
+
+The recommended way to use DDP is to spawn one process for each model replica. The model replica can span
+multiple devices. DDP processes can be placed on the same machine or across machines. Note that GPU devices
+cannot be shared across DDP processes (i.e. one GPU for one DDP process).


-The recommended way to use DDP is to spawn one process for each model replica,
-where a model replica can span multiple devices. DDP processes can be
-placed on the same machine or across machines, but GPU devices cannot be
-shared across processes. This tutorial starts from a basic DDP use case and
-then demonstrates more advanced use cases including checkpointing models and
-combining DDP with model parallel.
+In this tutorial, we'll start with a basic DDP use case and then demonstrate more advanced use cases,
+including checkpointing models and combining DDP with model parallel.


 .. note::
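To make the rewritten introduction concrete: one process is spawned per model replica, each process wraps its own copy of the model in DDP, and gradients are all-reduced automatically during ``backward()``. The following is an illustrative, CPU-friendly sketch using the ``gloo`` backend, not code taken from the tutorial:

.. code:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP


    def worker(rank, world_size):
        # One process per model replica; all processes join the same group.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # Each process holds its own copy of the model, wrapped in DDP.
        model = torch.nn.Linear(10, 5)
        ddp_model = DDP(model)

        out = ddp_model(torch.randn(20, 10))
        # DDP's autograd hooks fire during backward() and all-reduce the
        # gradients, so every rank ends up with identical param.grad values.
        out.sum().backward()

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = 2
        mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)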
@@ -43,25 +49,22 @@ combining DDP with model parallel.
 Comparison between ``DataParallel`` and ``DistributedDataParallel``
 -------------------------------------------------------------------

-Before we dive in, let's clarify why, despite the added complexity, you would
-consider using ``DistributedDataParallel`` over ``DataParallel``:
+Before we dive in, let's clarify why you would consider using ``DistributedDataParallel``
+over ``DataParallel``, despite its added complexity:

-- First, ``DataParallel`` is single-process, multi-thread, and only works on a
-single machine, while ``DistributedDataParallel`` is multi-process and works
-for both single- and multi- machine training. ``DataParallel`` is usually
-slower than ``DistributedDataParallel`` even on a single machine due to GIL
-contention across threads, per-iteration replicated model, and additional
-overhead introduced by scattering inputs and gathering outputs.
+- First, ``DataParallel`` is single-process, multi-threaded, but it only works on a
+single machine. In contrast, ``DistributedDataParallel`` is multi-process and supports
+both single- and multi- machine training.
+Due to GIL contention across threads, per-iteration replicated model, and additional overhead introduced by
+scattering inputs and gathering outputs, ``DataParallel`` is usually
+slower than ``DistributedDataParallel`` even on a single machine.
 - Recall from the
 `prior tutorial <https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html>`__
 that if your model is too large to fit on a single GPU, you must use **model parallel**
 to split it across multiple GPUs. ``DistributedDataParallel`` works with
-**model parallel**; ``DataParallel`` does not at this time. When DDP is combined
+**model parallel**, while ``DataParallel`` does not at this time. When DDP is combined
 with model parallel, each DDP process would use model parallel, and all processes
 collectively would use data parallel.
-- If your model needs to span multiple machines or if your use case does not fit
-into data parallelism paradigm, please see `the RPC API <https://pytorch.org/docs/stable/rpc.html>`__
-for more generic distributed training support.

 Basic Use Case
 --------------
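As a quick illustration of the comparison above (not part of the diff): ``DataParallel`` is a single-process wrapper applied once to a model on one machine, whereas ``DistributedDataParallel`` is constructed inside each spawned process after the process group has been initialized.

.. code:: python

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 5)

    # DataParallel: one process, multiple threads, single machine only.
    # Inputs are scattered across GPUs and outputs gathered on every forward call.
    if torch.cuda.is_available():
        dp_model = nn.DataParallel(model.cuda())

    # DistributedDataParallel: created once inside *each* spawned process,
    # after torch.distributed.init_process_group(); one GPU per process,
    # on a single machine or across machines (see the basic use case below).
    # ddp_model = DistributedDataParallel(model.to(rank), device_ids=[rank])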
@@ -141,6 +144,7 @@ different DDP processes starting from different initial model parameter values.
 optimizer.step()

 cleanup()
+print(f"Finished running basic DDP example on rank {rank}.")


 def run_demo(demo_fn, world_size):
@@ -154,7 +158,7 @@ provides a clean API as if it were a local model. Gradient synchronization
 communications take place during the backward pass and overlap with the
 backward computation. When the ``backward()`` returns, ``param.grad`` already
 contains the synchronized gradient tensor. For basic use cases, DDP only
-requires a few more LoCs to set up the process group. When applying DDP to more
+requires a few more lines of code to set up the process group. When applying DDP to more
 advanced use cases, some caveats require caution.

 Skewed Processing Speeds
@@ -179,13 +183,14 @@ It's common to use ``torch.save`` and ``torch.load`` to checkpoint modules
 during training and recover from checkpoints. See
 `SAVING AND LOADING MODELS <https://pytorch.org/tutorials/beginner/saving_loading_models.html>`__
 for more details. When using DDP, one optimization is to save the model in
-only one process and then load it to all processes, reducing write overhead.
-This is correct because all processes start from the same parameters and
+only one process and then load it on all processes, reducing write overhead.
+This works because all processes start from the same parameters and
 gradients are synchronized in backward passes, and hence optimizers should keep
-setting parameters to the same values. If you use this optimization, make sure no process starts
+setting parameters to the same values.
+If you use this optimization (i.e. save on one process but restore on all), make sure no process starts
 loading before the saving is finished. Additionally, when
 loading the module, you need to provide an appropriate ``map_location``
-argument to prevent a process from stepping into others' devices. If ``map_location``
+argument to prevent processes from stepping into others' devices. If ``map_location``
 is missing, ``torch.load`` will first load the module to CPU and then copy each
 parameter to where it was saved, which would result in all processes on the
 same machine using the same set of devices. For more advanced failure recovery
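A condensed sketch of the save-on-one-rank, load-on-all pattern described in this hunk (the tutorial's full checkpoint example appears in the hunks below); the ``map_location`` mapping assumes the checkpoint was written from rank 0's GPU, and the helper name is illustrative rather than the commit's code:

.. code:: python

    import os
    import tempfile

    import torch
    import torch.distributed as dist

    CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "model.checkpoint")


    def save_and_restore(ddp_model, rank):
        if rank == 0:
            # Only one process writes the checkpoint, reducing write overhead.
            torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

        # Ensure saving has finished before any process starts loading.
        dist.barrier()

        # Remap tensors saved from rank 0's device onto this rank's own device;
        # without map_location, every rank would load onto the saving device.
        map_location = {"cuda:0": f"cuda:{rank}"}
        ddp_model.load_state_dict(
            torch.load(CHECKPOINT_PATH, map_location=map_location)
        )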
@@ -218,7 +223,7 @@ and elasticity support, please refer to `TorchElastic <https://pytorch.org/elast

 loss_fn = nn.MSELoss()
 optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
-
+
 optimizer.zero_grad()
 outputs = ddp_model(torch.randn(20, 10))
 labels = torch.randn(20, 5).to(rank)
@@ -234,6 +239,7 @@ and elasticity support, please refer to `TorchElastic <https://pytorch.org/elast
 os.remove(CHECKPOINT_PATH)

 cleanup()
+print(f"Finished running DDP checkpoint example on rank {rank}.")

 Combining DDP with Model Parallelism
 ------------------------------------
@@ -285,6 +291,7 @@ either the application or the model ``forward()`` method.
 optimizer.step()

 cleanup()
+print(f"Finished running DDP with model parallel example on rank {rank}.")


 if __name__ == "__main__":
@@ -323,15 +330,14 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.


 def demo_basic():
+torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
 dist.init_process_group("nccl")
 rank = dist.get_rank()
 print(f"Start running basic DDP example on rank {rank}.")
-
 # create model and move it to GPU with id rank
 device_id = rank % torch.cuda.device_count()
 model = ToyModel().to(device_id)
 ddp_model = DDP(model, device_ids=[device_id])
-
 loss_fn = nn.MSELoss()
 optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

@@ -341,22 +347,23 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
 loss_fn(outputs, labels).backward()
 optimizer.step()
 dist.destroy_process_group()
-
+print(f"Finished running basic DDP example on rank {rank}.")
+
 if __name__ == "__main__":
 demo_basic()

-One can then run a `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
+One can then run a `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
 on all nodes to initialize the DDP job created above:

 .. code:: bash

 torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py

-We are running the DDP script on two hosts, and each host we run with 8 processes, aka, we
-are running it on 16 GPUs. Note that ``$MASTER_ADDR`` must be the same across all nodes.
+In the example above, we are running the DDP script on two hosts and we run with 8 processes on each host. That is, we
+are running this job on 16 GPUs. Note that ``$MASTER_ADDR`` must be the same across all nodes.

-Here torchrun will launch 8 process and invoke ``elastic_ddp.py``
-on each process on the node it is launched on, but user also needs to apply cluster
+Here ``torchrun`` will launch 8 processes and invoke ``elastic_ddp.py``
+on each process on the node it is launched on, but user also needs to apply cluster
 management tools like slurm to actually run this command on 2 nodes.

 For example, on a SLURM enabled cluster, we can write a script to run the command above
@@ -368,8 +375,8 @@ and set ``MASTER_ADDR`` as:


 Then we can just run this script using the SLURM command: ``srun --nodes=2 ./torchrun_script.sh``.
-Of course, this is just an example; you can choose your own cluster scheduling tools
-to initiate the torchrun job.

-For more information about Elastic run, one can check this
-`quick start document <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ to learn more.
+This is just an example; you can choose your own cluster scheduling tools to initiate the ``torchrun`` job.
+
+For more information about Elastic run, please see the
+`quick start document <https://pytorch.org/docs/stable/elastic/quickstart.html>`__.
