
run train_dreambooth_lora.py failed with accelerate #3284

Closed
@webliupeng

Description

Describe the bug

Thanks for this awesome project!
When I run the script "train_dreambooth_lora.py" without accelerate, it works fine. But when I launch it with accelerate, it fails as soon as the step count reaches "checkpointing_steps".
I am running the script in a Docker container with 4 × 3090 vGPUs. I also ran accelerate test, and it succeeded.
I am new to this and would appreciate any guidance or suggestions you can offer.

Reproduction

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="/diffusers/examples/dreambooth/dunhuang512"
export OUTPUT_DIR="path-to-save-model"
cd /diffusers/examples/dreambooth/
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --logging_dir='./logs' \
  --instance_prompt="dhstyle_test" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="dhstyle_test" \
  --validation_epochs=50 \
  --seed="0"\
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam

Logs

  File "/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1093, in <module>
    main(args)
  File "/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 972, in main
    LoraLoaderMixin.save_lora_weights(
  File "/diffusers/src/diffusers/loaders.py", line 1111, in save_lora_weights
    for module_name, param in unet_lora_layers.state_dict().items()
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1820, in state_dict
    hook_result = hook(self, destination, prefix, local_metadata)
  File "/diffusers/src/diffusers/loaders.py", line 74, in map_to
    num = int(key.split(".")[1])  # 0 is always "layers"
ValueError: invalid literal for int() with base 10: 'layers'
Steps:  20%|████████████████████▊                                                                                   | 100/500 [03:35<14:20,  2.15s/it, loss=0.217, lr=0.0001]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63644 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63641) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_lora.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-29_00:59:00
  host      : sd-5b564dfd58-7v76h
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 63641)
  error_file: <N/A>
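
For what it's worth, the ValueError looks consistent with the LoRA layers' state_dict keys gaining an extra prefix under the multi-GPU launch. Below is a minimal sketch of the parsing that appears to break, using hypothetical key names (the "module." prefix is what DistributedDataParallel normally adds to state_dict keys; I have not confirmed these are the exact keys in my run):

```python
# Minimal sketch with hypothetical key names, not taken from the actual run.
# The map_to hook in diffusers/loaders.py expects keys of the form
# "layers.<N>....", so element 1 of the split is a numeric layer index:
key_single_process = "layers.0.to_q_lora.down.weight"
print(int(key_single_process.split(".")[1]))  # -> 0, parses fine

# Under a multi-GPU accelerate launch the LoRA layers are presumably wrapped
# in DistributedDataParallel, whose state_dict() prefixes every key with
# "module.", so element 1 becomes the literal string "layers":
key_multi_gpu = "module.layers.0.to_q_lora.down.weight"
try:
    print(int(key_multi_gpu.split(".")[1]))
except ValueError as e:
    # Reproduces the error in the traceback above:
    # ValueError: invalid literal for int() with base 10: 'layers'
    print(e)
```

If that is indeed the cause, unwrapping the layers (for example with accelerator.unwrap_model) before save_lora_weights is called might avoid the crash, but I have not verified this.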

System Info

  • diffusers version: 0.17.0.dev0

  • Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31

  • Python version: 3.10.9

  • PyTorch version (GPU?): 2.0.0+cu117 (True)

  • Huggingface_hub version: 0.14.0

  • Transformers version: 4.25.1

  • Accelerate version: 0.18.0

  • xFormers version: 0.0.19

  • Using GPU in script?:

  • Using distributed or parallel set-up in script?:

  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Labels

bug (Something isn't working), stale (Issues that haven't received updates)