ValueError: Attempting to unscale FP16 gradients. #6442

Closed
@loboere

Description

I am trying to resume training an SDXL LoRA, but resuming fails with ValueError: Attempting to unscale FP16 gradients.
Training works on the first run; the error only appears when I resume from a checkpoint.

!accelerate launch --mixed_precision="fp16" /content/train_text_to_image_lora_sdxl.py   \
--pretrained_model_name_or_path ${MODEL_NAME} \
--train_data_dir images/ \
--resolution ${RESOLUTION} \
--train_batch_size ${BATCH_SIZE} \
--num_train_epochs ${NUM_STEPS} \
--gradient_accumulation ${GRADIENT_ACCUMULATION} \
--checkpointing_steps 5 \
--resume_from_checkpoint "latest" \
--mixed_precision "fp16" \
--caption_column 'text'

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-01-03 21:54:11.134347: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-03 21:54:11.134394: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-03 21:54:11.135989: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-03 21:54:12.546627: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
01/03/2024 21:54:13 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
{'reverse_transformer_layers_per_block', 'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
01/03/2024 21:55:36 - INFO - __main__ - ***** Running training *****
01/03/2024 21:55:36 - INFO - __main__ -   Num examples = 1
01/03/2024 21:55:36 - INFO - __main__ -   Num Epochs = 50
01/03/2024 21:55:36 - INFO - __main__ -   Instantaneous batch size per device = 1
01/03/2024 21:55:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
01/03/2024 21:55:36 - INFO - __main__ -   Gradient Accumulation steps = 4
01/03/2024 21:55:36 - INFO - __main__ -   Total optimization steps = 50
Resuming from checkpoint checkpoint-35
01/03/2024 21:55:36 - INFO - accelerate.accelerator - Loading states from sd-model-finetuned-lora/checkpoint-35
Loading unet.
01/03/2024 21:55:36 - INFO - peft.tuners.tuners_utils - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All model weights loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All dataloader sampler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All random states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps:  70% 35/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
    main(args)
  File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
    accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:  70% 35/50 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir', 'images/', '--resolution', '1024', '--train_batch_size', '1', '--num_train_epochs', '50', '--gradient_accumulation', '4', '--checkpointing_steps', '5', '--resume_from_checkpoint', 'latest', '--mixed_precision', 'fp16', '--caption_column', 'text']' returned non-zero exit status 1.
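
For context, PyTorch's GradScaler raises this error when a parameter handed to the optimizer has a float16 gradient, which suggests the trainable LoRA weights end up in fp16 after the checkpoint is loaded. A possible workaround, sketched here on a toy model, is to cast only the trainable parameters back to float32 after resuming and before the first optimizer step. The helper name upcast_trainable_params and the toy model are hypothetical and not part of the training script, and where exactly such a cast belongs in the script is an assumption, not a confirmed fix.

```python
import torch
from torch import nn


def upcast_trainable_params(model: nn.Module, dtype: torch.dtype = torch.float32) -> int:
    """Cast every trainable parameter of `model` to `dtype` in place.

    Hypothetical helper: returns the number of parameters that were cast.
    Frozen parameters (requires_grad=False) are left untouched.
    """
    num_cast = 0
    for param in model.parameters():
        if param.requires_grad and param.dtype != dtype:
            # Reassigning .data keeps the same Parameter object, so an
            # optimizer that already references it stays valid.
            param.data = param.data.to(dtype)
            num_cast += 1
    return num_cast


# Toy stand-in for a half-precision model with a small trainable part
# (assumption: layer 0 plays the frozen base model, layer 1 the adapter).
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4)).half()
for frozen in model[0].parameters():
    frozen.requires_grad_(False)

num_cast = upcast_trainable_params(model)
```

In this toy setup only the weight and bias of the second layer are cast; the frozen first layer stays in fp16, which mirrors keeping the base model in half precision while training the adapter weights in full precision.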
