We have been seeing a lot of issues on this topic:
File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
main(args)
File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
- ValueError: Attempting to unscale FP16 gradients. #6442
- ValueError: Attempting to unscale FP16 gradients #6098
- SDXL dreambooth can't be resumed from a checkpoint at fp16 training #5004
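For context, the failure boils down to `torch.cuda.amp.GradScaler` refusing to unscale fp16 gradients: if the trainable (LoRA) parameters get cast to fp16 along with the frozen weights, their gradients are fp16 too, and `unscale_()` raises. A minimal standalone repro, assuming a CUDA device:

```python
import torch

# A trainable fp16 parameter produces fp16 gradients, which
# GradScaler.unscale_() refuses to handle.
param = torch.nn.Parameter(torch.zeros(4, device="cuda", dtype=torch.float16))
optimizer = torch.optim.SGD([param], lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

loss = scaler.scale(param.sum())
loss.backward()
scaler.unscale_(optimizer)  # ValueError: Attempting to unscale FP16 gradients.
```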
#6514 introduces a fix for the SDXL DreamBooth LoRA training script, and #6553 is a follow-up that cleans things up a bit. However, it would be nice to extend the same fix to the other scripts too (a sketch of the underlying pattern follows the list below). This issue tracks the integration:
- DreamBooth LoRA SD
- Text-to-image LoRA SDXL
- SDXL Consistency Distillation
- Advanced LoRA trainer (cc: @linoytsaban)
- Advanced LoRA trainer SD v1.5 (cc: @linoytsaban)
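For anyone picking one of these up: the gist of the fix is to keep the frozen base weights in half precision while upcasting only the trainable LoRA parameters to fp32, so the scaler never sees fp16 gradients. A minimal sketch of that pattern, with a hypothetical helper name (each script has its own plumbing around where this hook goes):

```python
import torch

def upcast_trainable_params(models, dtype=torch.float32):
    # Frozen weights stay in fp16/bf16 to save memory. Only parameters with
    # requires_grad=True (the LoRA weights) are moved to fp32, so their
    # gradients are fp32 and GradScaler.unscale_() no longer raises.
    for model in models:
        for param in model.parameters():
            if param.requires_grad:
                param.data = param.data.to(dtype)
```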
Feel free to claim any of these and submit PRs. Do tag me in those PRs and focus on fixing only one script at a time. I know that @linoytsaban is already working on the last one.
When submitting a PR, please also provide example commands, as I did in #6514 (comment), so that we can quickly verify the fix.