Closed
Description
I am trying to resume training an SDXL LoRA from a checkpoint, but resuming fails with ValueError: Attempting to unscale FP16 gradients.
Training works on the first run; the error only appears when I resume with --resume_from_checkpoint. This is the command I run:
!accelerate launch --mixed_precision="fp16" /content/train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path ${MODEL_NAME} \
--train_data_dir images/ \
--resolution ${RESOLUTION} \
--train_batch_size ${BATCH_SIZE} \
--num_train_epochs ${NUM_STEPS} \
--gradient_accumulation ${GRADIENT_ACCUMULATION} \
--checkpointing_steps 5 \
--resume_from_checkpoint "latest" \
--mixed_precision "fp16" \
--caption_column 'text'
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-01-03 21:54:11.134347: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-03 21:54:11.134394: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-03 21:54:11.135989: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-03 21:54:12.546627: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
01/03/2024 21:54:13 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
{'reverse_transformer_layers_per_block', 'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
01/03/2024 21:55:36 - INFO - __main__ - ***** Running training *****
01/03/2024 21:55:36 - INFO - __main__ - Num examples = 1
01/03/2024 21:55:36 - INFO - __main__ - Num Epochs = 50
01/03/2024 21:55:36 - INFO - __main__ - Instantaneous batch size per device = 1
01/03/2024 21:55:36 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/03/2024 21:55:36 - INFO - __main__ - Gradient Accumulation steps = 4
01/03/2024 21:55:36 - INFO - __main__ - Total optimization steps = 50
Resuming from checkpoint checkpoint-35
01/03/2024 21:55:36 - INFO - accelerate.accelerator - Loading states from sd-model-finetuned-lora/checkpoint-35
Loading unet.
01/03/2024 21:55:36 - INFO - peft.tuners.tuners_utils - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All model weights loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All dataloader sampler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All random states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps: 70% 35/50 [00:00<?, ?it/s]Traceback (most recent call last):
File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
main(args)
File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 70% 35/50 [00:05<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir', 'images/', '--resolution', '1024', '--train_batch_size', '1', '--num_train_epochs', '50', '--gradient_accumulation', '4', '--checkpointing_steps', '5', '--resume_from_checkpoint', 'latest', '--mixed_precision', 'fp16', '--caption_column', 'text']' returned non-zero exit status 1.
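A note on the likely cause, with a sketch of a possible workaround. PyTorch's GradScaler raises this ValueError when a parameter it is asked to unscale has an fp16 gradient, so the traceback suggests that the trainable LoRA parameters are in fp16 after the checkpoint is loaded. The helper below is hypothetical (`upcast_trainable_params` is not part of the training script or any library) and assumes the fix is to cast only the trainable parameters back to fp32 after resuming, leaving the frozen base weights in fp16. The two-layer model is a stand-in for the UNet and is only used to show the dtype change.

```python
import torch
from torch import nn


def upcast_trainable_params(model: nn.Module, dtype: torch.dtype = torch.float32) -> int:
    """Cast every trainable parameter of `model` to `dtype`.

    Hypothetical helper: frozen parameters are left untouched, so base
    weights can stay in fp16 while the trainable (e.g. LoRA) weights get
    fp32 storage and therefore fp32 gradients.
    Returns the number of parameters that were converted.
    """
    converted = 0
    for param in model.parameters():
        if param.requires_grad and param.dtype != dtype:
            param.data = param.data.to(dtype)
            converted += 1
    return converted


# Stand-in model: first layer plays the frozen fp16 base weights,
# second layer plays the trainable adapter weights.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2)).half()
for p in model[0].parameters():
    p.requires_grad_(False)

changed = upcast_trainable_params(model)
print(changed, model[0].weight.dtype, model[1].weight.dtype)
```

In the training script, a call like this would presumably go after the checkpoint state has been loaded and before the optimizer step, so that the parameters passed to accelerator.clip_grad_norm_ are fp32. Whether that is enough depends on how the script builds params_to_optimize, which I have not verified.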