SDXL dreambooth can't be resumed from a checkpoint at fp16 training #5004

Closed
@epi-morphism

Describe the bug

train_dreambooth_lora_sdxl.py can't be resumed from a checkpoint when training with fp16 mixed precision. The error in the log is "Attempting to unscale FP16 gradients."

This is a major blocker for training on the free Colab tier: you need fp16 to fit in VRAM, and you also need to resume from checkpoints because the session can time out at any moment.
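
For context, PyTorch's gradient scaler raises this error when the gradients it is asked to unscale are stored in fp16, which happens when the trainable parameters themselves end up in fp16. One common workaround, sketched here under assumptions and not necessarily the fix adopted for this script, is to upcast only the trainable parameters to fp32 after the checkpoint is loaded and before the first optimizer step. The helper name upcast_trainable_params and the toy two-layer model are hypothetical and stand in for the real LoRA layers.

```python
import torch
from torch import nn


def upcast_trainable_params(model: nn.Module, dtype: torch.dtype = torch.float32) -> int:
    """Cast every trainable parameter of `model` to `dtype` in place.

    Frozen parameters keep their current dtype. Returns the number of
    parameter tensors that were cast. (Hypothetical helper, for illustration.)
    """
    cast = 0
    for param in model.parameters():
        if param.requires_grad and param.dtype != dtype:
            param.data = param.data.to(dtype)
            cast += 1
    return cast


# Toy stand-in: a half-precision model whose second layer plays the role
# of the trainable adapter, as if it had just been restored in fp16.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2)).half()
model[0].requires_grad_(False)              # frozen "base" weights stay in fp16
num_cast = upcast_trainable_params(model)   # trainable weights go to fp32
```

With the trainable weights in fp32, their gradients are fp32 as well, so the scaler has no fp16 gradients to reject. The frozen weights stay in fp16, so the memory cost is limited to the small trainable subset.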

Reproduction

Reproduce with: https://colab.research.google.com/drive/15woNcXcpsa3GDGk6cmDtIL2V8zRtOOj3

Logs

No response

System Info

latest diffusers, system is whatever is on colab (see linked colab above)

Who can help?

@patrickvonplaten @sayakpaul

Labels

bug, stale
