train_dreambooth.py DeepSpeed offloading stage 3 seems broken #3177

Closed
@IMbackK

Description

Describe the bug

Unfortunately, training with train_dreambooth.py at commit 3045fb2 fails with:

Traceback (most recent call last):
  File "/media/sharedHome/machine-lerning/Diffuserspayground/rwomen/../train_dreambooth.py", line 1039, in <module>
    main(args)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/rwomen/../train_dreambooth.py", line 844, in main
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1118, in prepare
    result = self._prepare_deepspeed(*args)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1599, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 259, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 571, in _create_fp16_partitions_with_defragmentation
    param_groups: List[List[Parameter]] = tuple(
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 572, in <genexpr>
    self._create_fp16_sub_groups(param_group["params"])
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 825, in _create_fp16_sub_groups
    params_group_numel = sum([param.partition_numel() for param in params_group])
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 825, in <listcomp>
    params_group_numel = sum([param.partition_numel() for param in params_group])
AttributeError: 'Parameter' object has no attribute 'partition_numel'
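
The traceback shows the ZeRO stage 3 optimizer calling partition_numel() on every parameter in the optimizer's parameter groups, and at least one of them is a plain Parameter without that method. A possible way to narrow this down is to list which parameters lack it. The helper below is only a sketch: the name find_unpartitioned is hypothetical, and it deliberately checks nothing beyond the attribute named in the error message.

```python
from typing import Iterable, List, Tuple


def find_unpartitioned(named_params: Iterable[Tuple[str, object]]) -> List[str]:
    """Return the names of parameters that lack a `partition_numel` attribute.

    `named_params` is any iterable of (name, parameter) pairs, for example
    the result of `model.named_parameters()`.
    """
    return [
        name
        for name, param in named_params
        if not hasattr(param, "partition_numel")
    ]
```

It could be called as find_unpartitioned(unet.named_parameters()) and find_unpartitioned(text_encoder.named_parameters()). Since the script passes two models to accelerator.prepare, comparing the two results might show whether only one of them carries the ZeRO-3 attributes. That is a guess, not something this report confirms.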

Additionally, adding --mixed_precision="fp16" to train in 16-bit, as suggested in the examples, causes the script to fail even earlier with:

ValueError: Text encoder loaded as datatype torch.float16. Please make sure to always have all model weights in full float32 precision when starting training - even if doing mixed precision training. copy of the weights should still be float32.
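
The message says the text encoder's weights were already float16 when training started. To see how many parameters of a model are in each precision, a small tally like the one below may help. It is a sketch only: count_dtypes is a hypothetical helper, and it assumes nothing more than each parameter exposing a dtype attribute.

```python
from collections import Counter
from typing import Dict, Iterable


def count_dtypes(params: Iterable[object]) -> Dict[str, int]:
    """Count parameters by the string form of their `dtype` attribute."""
    return dict(Counter(str(p.dtype) for p in params))
```

For example, count_dtypes(text_encoder.parameters()) called right after the model is loaded would show whether the weights are half precision before the script touches them.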

Reproduction

Here is the command used:

accelerate launch \
  --config_file=../accelerate_config_offload_all.yaml \
  --num_processes=2 \
  ../train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --train_text_encoder \
  --instance_data_dir=proc \
  --class_data_dir=class \
  --output_dir=trained \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="$PROMPT" \
  --class_prompt="$CLASSPROMPT" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

Here is the accelerate config accelerate_config_offload_all.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
use_cpu: false
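
For readers more familiar with DeepSpeed's own JSON config, the deepspeed_config section above corresponds roughly to the dictionary built below. This is a sketch under assumptions: to_deepspeed_config is a hypothetical helper, it covers only the fields present in this report, and the config Accelerate actually generates may contain further keys.

```python
from typing import Any, Dict


def to_deepspeed_config(accel: Dict[str, Any]) -> Dict[str, Any]:
    """Build a DeepSpeed-style config dict from Accelerate `deepspeed_config` fields."""
    zero: Dict[str, Any] = {"stage": accel["zero_stage"]}
    # Offload sections are only added when a device other than "none" is set.
    if accel.get("offload_optimizer_device", "none") != "none":
        zero["offload_optimizer"] = {"device": accel["offload_optimizer_device"]}
    if accel.get("offload_param_device", "none") != "none":
        zero["offload_param"] = {"device": accel["offload_param_device"]}
    return {
        "gradient_accumulation_steps": accel["gradient_accumulation_steps"],
        "gradient_clipping": accel["gradient_clipping"],
        "zero_optimization": zero,
    }


# Values copied from accelerate_config_offload_all.yaml in this report.
config = to_deepspeed_config(
    {
        "gradient_accumulation_steps": 1,
        "gradient_clipping": 1.0,
        "offload_optimizer_device": "cpu",
        "offload_param_device": "cpu",
        "zero_stage": 3,
    }
)
```

Here config["zero_optimization"] has stage 3 with both offload_optimizer and offload_param set to the CPU, which is the combination the failing run uses.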

Logs

No response

System Info

  • diffusers version: 0.16.0.dev0
  • Platform: Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
  • Python version: 3.10.9
  • PyTorch version (GPU?): 1.13.1+rocm5.2 (True)
  • Huggingface_hub version: 0.13.4
  • Transformers version: 4.28.1
  • Accelerate version: 0.18.0
  • xFormers version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

DeepSpeed 0.8.3 was used; DeepSpeed 0.9.0 was also tried.

Metadata

Labels: bug, stale
