
[Tracker] fix training resuming problem when using FP16 in the examples #6552

Closed
@sayakpaul

Description

We have been getting a lot of issues on this topic:

  File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
    main(args)
  File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
    accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
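For context: `GradScaler.unscale_()` raises this error when the gradients being unscaled are themselves fp16, which happens if the trainable (e.g. LoRA) parameters are kept in fp16 under mixed-precision training. The usual remedy, and to my understanding the approach taken in #6514, is to upcast only the trainable parameters to fp32 while the frozen weights stay in fp16. A minimal sketch of that pattern (the helper name and the `args`/`unet` usage below are just illustrative, not the exact code from the PR):

```python
import torch

def upcast_trainable_params(model: torch.nn.Module) -> None:
    # GradScaler.unscale_() refuses to unscale fp16 gradients, so the
    # parameters the optimizer updates must live in fp32 even when the
    # frozen parts of the model run in fp16.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)

# Illustrative usage in a training script (argument name assumed to match
# the example scripts):
# if args.mixed_precision == "fp16":
#     upcast_trainable_params(unet)
```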

#6514 introduces a fix for the SDXL DreamBooth LoRA training script, and #6553 is a follow-up that cleans things up a bit. It would be nice to extend the same fix to the other scripts as well. This issue tracks that integration:

Feel free to claim any of these and submit PRs. Do tag me in those PRs and focus on fixing only one script at a time. I know that @linoytsaban is already working on the last one.

When submitting a PR, please also provide example commands, as I did in #6514 (comment), so that the fix can be verified quickly.
