Fix Model Parallel demo world_size parameter in DDP Tutorial #1750

Closed
@sadra-barikbin

Description

In this tutorial, there are three demos of distributed training; the last one covers the model-parallel use case. In the final code block, under the if statement, the world_size argument of the call run_demo(demo_model_parallel, world_size) has to be n_gpus // 2, because the model-parallel demo assigns two exclusive GPUs to every process, so there must be half as many processes as GPUs.

In the demo, though, we see world_size = n_gpus in the last code block. This assignment is correct for the calls run_demo(demo_basic, world_size) and run_demo(demo_checkpoint, world_size), but not for run_demo(demo_model_parallel, world_size).

I propose editing the if statement in the last block to be:

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    world_size = n_gpus//2
    run_demo(demo_model_parallel, world_size)
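
To see why the halving matters: the model-parallel demo gives each rank a pair of consecutive devices (roughly dev0 = rank * 2 and dev1 = rank * 2 + 1 in the tutorial), so with world_size = n_gpus the upper half of the ranks would request device indices that don't exist. A minimal sketch of the device-index arithmetic (no GPUs needed; device_pairs is a hypothetical helper, not from the tutorial):

```python
# Sketch of which device indices each rank would claim in the
# model-parallel demo, assuming the pairing dev0 = rank * 2,
# dev1 = rank * 2 + 1 used in the tutorial.
def device_pairs(world_size):
    return [(rank * 2, rank * 2 + 1) for rank in range(world_size)]

n_gpus = 4

# world_size = n_gpus: ranks 2 and 3 would ask for devices 4..7,
# which do not exist on a 4-GPU machine.
print(device_pairs(n_gpus))        # [(0, 1), (2, 3), (4, 5), (6, 7)]

# world_size = n_gpus // 2: every requested device index is < n_gpus.
print(device_pairs(n_gpus // 2))   # [(0, 1), (2, 3)]
```

With n_gpus // 2 processes, the highest device index requested is 2 * (world_size - 1) + 1 = n_gpus - 1, which is exactly the last valid GPU.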

cc @mrshenli @osalpekar @H-Huang @kwen2501
