
[Tracker] fix training resuming problem when using FP16 in the examples #6552

Closed
@sayakpaul

Description

We have been getting a lot of issues on this topic:

  File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
    main(args)
  File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
    accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
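For context: `GradScaler.unscale_()` raises this error when the gradients being unscaled are themselves fp16, which happens if the trainable (e.g. LoRA) parameters are kept in fp16 under mixed-precision training. The usual remedy, and to my understanding the approach taken in #6514, is to upcast only the trainable parameters to fp32 while the frozen weights stay in fp16. A minimal sketch of that pattern (the helper name and the `args`/`unet` usage below are just illustrative, not the exact code from the PR):

```python
import torch

def upcast_trainable_params(model: torch.nn.Module) -> None:
    # GradScaler.unscale_() refuses to unscale fp16 gradients, so the
    # parameters the optimizer updates must live in fp32 even when the
    # frozen parts of the model run in fp16.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)

# Illustrative usage in a training script (argument name assumed to match
# the example scripts):
# if args.mixed_precision == "fp16":
#     upcast_trainable_params(unet)
```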

#6514 introduces a fix for the SDXL DreamBooth LoRA training script, and #6553 is a follow-up that cleans things up a bit. It would be nice to extend the same fix to the other scripts as well. This issue tracks that integration:

Feel free to claim any of these and submit PRs. Do tag me in those PRs and focus on fixing only one script at a time. I know that @linoytsaban is already working on the last one.

When submitting a PR, please also provide example commands, as I did in #6514 (comment), so that the fix can be verified quickly.
