Description
In this tutorial there are three demos on distributed training; the last one is the model-parallel use case. In the final code block, under the `if __name__ == "__main__":` statement, the `world_size` argument of the call `run_demo(demo_model_parallel, world_size)` has to be `n_gpus // 2`, because in the model-parallel demo two exclusive GPUs are assigned to every process, so there must be half as many processes as GPUs.
In the demo, however, the last code block sets `world_size = n_gpus`. This assignment is correct for the calls `run_demo(demo_basic, world_size)` and `run_demo(demo_checkpoint, world_size)`, but not for `run_demo(demo_model_parallel, world_size)`.
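To illustrate the arithmetic, here is a minimal sketch (the helper `devices_for_rank` is hypothetical, not from the tutorial) of how each model-parallel process claims two exclusive GPUs, which is why the process count must be half the GPU count:

```python
def devices_for_rank(rank):
    # Hypothetical mapping: each model-parallel process owns two
    # exclusive GPUs: rank 0 -> (0, 1), rank 1 -> (2, 3), ...
    return 2 * rank, 2 * rank + 1

n_gpus = 8  # example count; in the tutorial this is torch.cuda.device_count()
world_size = n_gpus // 2  # half as many processes as GPUs
assignments = [devices_for_rank(r) for r in range(world_size)]
print(assignments)  # every GPU index appears exactly once
```

With `world_size = n_gpus` instead, ranks 4 and up would ask for GPU indices 8 through 15, which do not exist on an 8-GPU machine.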
I propose editing the `if` block at the end of the tutorial to be:
```python
if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    world_size = n_gpus // 2
    run_demo(demo_model_parallel, world_size)
```