Description
Describe the bug
Hi there,
I have concerns with this line of code (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L282).
Specifically, grid_size is the tuple consisting of the height H and width W of the image. grid, computed in L280, should have the shape 2*H*W, and L282 reshapes it into 2*1*W*H. The dimensions W*H are later flattened to match the dimensions of the latent.
However, if you continue to PatchEmbed (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L549), you will notice that the latent with shape BCHW is flattened into B(H*W)C. This flattening operation does not seem to match grid in L282. I think the reshape will mess up the ordering of the positions once they are flattened whenever H and W are not equal.
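A quick way to test this concern is the NumPy sketch below. It does not call diffusers. It rebuilds the grid under the assumption that L280 stacks np.meshgrid(grid_w, grid_h) into shape 2*H*W, applies the reshape to 2*1*W*H, and compares the flattened (w, h) coordinates against the row-major order that flattening BCHW into B(H*W)C would produce. The helper name flat_orders and the integer coordinates are hypothetical and only for illustration.

```python
import numpy as np


def flat_orders(H, W):
    """Return the flattened (w, h) coordinates of the grid and of the latent."""
    # Assumed grid construction (not copied from diffusers): shape (2, H, W),
    # where grid[0] holds the w coordinate and grid[1] the h coordinate.
    grid_h = np.arange(H, dtype=np.float32)
    grid_w = np.arange(W, dtype=np.float32)
    grid = np.stack(np.meshgrid(grid_w, grid_h), axis=0)

    # The reshape under discussion: (2, H, W) -> (2, 1, W, H), then flatten.
    pos_order = grid.reshape(2, 1, W, H).reshape(2, -1)

    # Row-major flattening of an (H, W) latent: token k sits at row k // W
    # and column k % W.
    rows, cols = np.divmod(np.arange(H * W), W)
    latent_order = np.stack([cols, rows]).astype(np.float32)
    return pos_order, latent_order


# Non-square case, H != W.
pos_order, latent_order = flat_orders(2, 3)
orders_match = bool(np.array_equal(pos_order, latent_order))
print(orders_match)
```

If this prints True, the two orderings agree for that H and W. Keep in mind that reshape, unlike transpose, does not move elements in memory, so the intermediate 2*1*W*H shape may turn out to be misleading rather than harmful.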
Reproduction
This potential bug is conceptual, so no reproduction is needed.
Logs
System Info
Current diffusers implementation.