Description
Describe the bug
Thanks for this awesome project!
When I run the script "train_dreambooth_lora.py" without accelerate, it works fine. But when I launch it with "accelerate launch", it fails as soon as the step count reaches "checkpointing_steps".
I am running the script in a Docker container with 4 * RTX 3090 vGPUs. I also ran "accelerate test" and it passed.
I am new to this and would appreciate any guidance or suggestions you can offer.
Reproduction
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="/diffusers/examples/dreambooth/dunhuang512"
export OUTPUT_DIR="path-to-save-model"
cd /diffusers/examples/dreambooth/
accelerate launch train_dreambooth_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--logging_dir='./logs' \
--instance_prompt="dhstyle_test" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=100 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="dhstyle_test" \
--validation_epochs=50 \
--seed="0" \
--enable_xformers_memory_efficient_attention \
--use_8bit_adam
Logs
File "/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1093, in <module>
main(args)
File "/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 972, in main
LoraLoaderMixin.save_lora_weights(
File "/diffusers/src/diffusers/loaders.py", line 1111, in save_lora_weights
for module_name, param in unet_lora_layers.state_dict().items()
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1820, in state_dict
hook_result = hook(self, destination, prefix, local_metadata)
File "/diffusers/src/diffusers/loaders.py", line 74, in map_to
num = int(key.split(".")[1]) # 0 is always "layers"
ValueError: invalid literal for int() with base 10: 'layers'
Steps: 20%|████████████████████▊ | 100/500 [03:35<14:20, 2.15s/it, loss=0.217, lr=0.0001]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63644 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63641) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 914, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_lora.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-29_00:59:00
host : sd-5b564dfd58-7v76h
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 63641)
error_file: <N/A>
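From the traceback, the crash happens inside the state_dict hook in diffusers/src/diffusers/loaders.py, which parses a layer index out of each key. Below is a minimal standalone sketch of just that parsing logic (my assumption based only on the traceback; the example key names and the "module." prefix are made up for illustration, guessing that the multi-GPU run wraps the LoRA layers so their state_dict keys gain an extra prefix):

```python
# Minimal sketch of the key parsing that fails (illustrative only; key names
# and the "module." prefix are assumptions, not taken from the actual run).

def parse_layer_index(key: str) -> int:
    # Mirrors `num = int(key.split(".")[1])  # 0 is always "layers"`
    # from diffusers/src/diffusers/loaders.py, line 74.
    return int(key.split(".")[1])

# Expected key shape: "layers.<index>...." -> the index parses fine.
print(parse_layer_index("layers.0.to_q_lora.up.weight"))  # 0

# If the key gains an extra prefix (e.g. "module." from DDP wrapping),
# field 1 is the literal string "layers" and int() raises exactly the
# ValueError shown in the log above.
try:
    parse_layer_index("module.layers.0.to_q_lora.up.weight")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: 'layers'
```

If that guess is right, it would also explain why the failure only appears when save_lora_weights runs at the first checkpointing step, which matches what I see.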
System Info
- diffusers version: 0.17.0.dev0
- Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
- Python version: 3.10.9
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Huggingface_hub version: 0.14.0
- Transformers version: 4.25.1
- Accelerate version: 0.18.0
- xFormers version: 0.0.19
- Using GPU in script?: yes, 4 * RTX 3090 vGPUs (inside Docker)
- Using distributed or parallel set-up in script?: yes, multi-GPU via accelerate launch
- Accelerate default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []