Describe the bug
Unfortunately, training with train_dreambooth.py @ commit 3045fb2 fails with (both processes print the same traceback; one copy shown):
Traceback (most recent call last):
  File "/media/sharedHome/machine-lerning/Diffuserspayground/rwomen/../train_dreambooth.py", line 1039, in <module>
    main(args)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/rwomen/../train_dreambooth.py", line 844, in main
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1118, in prepare
    result = self._prepare_deepspeed(*args)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1599, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 259, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 571, in _create_fp16_partitions_with_defragmentation
    param_groups: List[List[Parameter]] = tuple(
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 572, in <genexpr>
    self._create_fp16_sub_groups(param_group["params"])
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 825, in _create_fp16_sub_groups
    params_group_numel = sum([param.partition_numel() for param in params_group])
  File "/media/sharedHome/machine-lerning/Diffuserspayground/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 825, in <listcomp>
    params_group_numel = sum([param.partition_numel() for param in params_group])
AttributeError: 'Parameter' object has no attribute 'partition_numel'
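To illustrate what the final AttributeError is complaining about (a sketch only, under the assumption that the parameters reaching the ZeRO stage-3 optimizer here were never converted into DeepSpeed's partitioned parameters, which is what normally provides partition_numel()):

import torch

# A plain torch Parameter has no partition_numel(); ZeRO stage 3 attaches that
# helper when it converts parameters for partitioning. Calling it on an
# unconverted parameter reproduces the error seen in the traceback.
p = torch.nn.Parameter(torch.zeros(4))
print(hasattr(p, "partition_numel"))  # False
p.partition_numel()                   # AttributeError: 'Parameter' object has no attribute 'partition_numel'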
Additionally, adding --mixed_precision="fp16" to train in 16-bit, as suggested in the examples, makes the script fail even earlier with:
ValueError: Text encoder loaded as datatype torch.float16. Please make sure to always have all model weights in full float32 precision when starting training - even if doing mixed precision training. copy of the weights should still be float32.
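For reference, this error means the text encoder's stored weights are already torch.float16 when training starts. A minimal sketch of loading the text encoder in full float32, which is what the check expects even with mixed-precision training; the model name is taken from the command below, and whether this alone would satisfy the DeepSpeed setup is an assumption:

import torch
from transformers import CLIPTextModel

# Keep the stored weights in float32; fp16 is then only used for the runtime
# mixed-precision computation, not for the master copy of the weights.
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    subfolder="text_encoder",
    torch_dtype=torch.float32,
)
print(text_encoder.dtype)  # torch.float32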
Reproduction
Here is the command used ($PROMPT and $CLASSPROMPT are shell variables holding the instance and class prompts):
accelerate launch \
--config_file=../accelerate_config_offload_all.yaml \
--num_processes=2 \
../train_dreambooth.py \
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
--train_text_encoder \
--instance_data_dir=proc \
--class_data_dir=class \
--output_dir=trained \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt=$PROMPT \
--class_prompt=$CLASSPROMPT \
--resolution=512 \
--train_batch_size=1 \
--sample_batch_size=1 \
--gradient_accumulation_steps=1 --gradient_checkpointing \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
Here is the accelerate config accelerate_config_offload_all.yaml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
use_cpu: false
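A quick way to double-check how this file is parsed (using PyYAML; the path is assumed to be relative to the current working directory):

import pprint
import yaml

# Print the parsed config exactly as a YAML loader sees it, to confirm the
# deepspeed_config options are nested as intended.
with open("accelerate_config_offload_all.yaml") as f:
    pprint.pprint(yaml.safe_load(f))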
Logs
No response
System Info
- diffusers version: 0.16.0.dev0
- Platform: Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
- Python version: 3.10.9
- PyTorch version (GPU?): 1.13.1+rocm5.2 (True)
- Huggingface_hub version: 0.13.4
- Transformers version: 4.28.1
- Accelerate version: 0.18.0
- xFormers version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
deepspeed 0.8.3 was used; deepspeed 0.9.0 was also tried.
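The versions above can be reproduced from inside the venv with a short snippet (each of these packages exposes a __version__ attribute):

import accelerate
import deepspeed
import diffusers
import torch
import transformers

# Print the package versions relevant to this report.
for mod in (diffusers, torch, transformers, accelerate, deepspeed):
    print(mod.__name__, mod.__version__)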