
Potential incorrect reshaping in 2D positional embedding #11309

Open
@jinhong-ni

Description


Describe the bug

Hi there,

I have concerns about this line of code (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L282).

Specifically, grid_size is the tuple (H, W) holding the height and width of the image grid. The grid computed at L280 has shape (2, H, W), and L282 reshapes it to (2, 1, W, H). The trailing W, H dimensions are later flattened to match the latent's sequence dimension.
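As a minimal numpy sketch of the concern (H = 2, W = 3 are arbitrary values chosen only so that H != W): relabeling a (H, W) array as (W, H) via reshape keeps the row-major data order, which is not the same as transposing it.

```python
import numpy as np

H, W = 2, 3  # deliberately non-square to expose the ordering issue

grid_h = np.arange(H, dtype=np.float32)
grid_w = np.arange(W, dtype=np.float32)

# As in embeddings.py L278-L280: meshgrid with default 'xy' indexing
# returns arrays of shape (H, W).
grid_x, grid_y = np.meshgrid(grid_w, grid_h)

# Relabeling the axes as (W, H), as the L282 reshape does, keeps the
# same row-major data and therefore is NOT a transpose.
relabeled = grid_x.reshape(W, H)
print(relabeled.tolist())  # [[0.0, 1.0], [2.0, 0.0], [1.0, 2.0]]
print(grid_x.T.tolist())   # [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
assert not np.array_equal(relabeled, grid_x.T)
```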

However, if you continue to PatchEmbed (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L549), you will notice that the latent of shape (B, C, H, W) is flattened into (B, H*W, C), i.e. row-major over H then W. This does not match the (W, H) ordering of grid from L282, and since a reshape only relabels axes without moving any data, I believe the position ordering gets scrambled after flattening whenever H and W are not equal.
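For reference, here is a numpy analogue of the flatten in PatchEmbed (the torch code does latent.flatten(2).transpose(1, 2)); the tensor contents and shapes below are arbitrary illustration values:

```python
import numpy as np

B, C, H, W = 1, 4, 2, 3
latent = np.arange(B * C * H * W).reshape(B, C, H, W)

# numpy analogue of the torch ops in PatchEmbed (embeddings.py#L549):
# latent.flatten(2).transpose(1, 2) -> (B, H*W, C)
flat = latent.reshape(B, C, H * W).transpose(0, 2, 1)

# The resulting sequence is ordered row-major over H then W:
# sequence index i corresponds to (h, w) = (i // W, i % W).
for i in range(H * W):
    h, w = divmod(i, W)
    assert np.array_equal(flat[0, i], latent[0, :, h, w])
```

So a positional embedding flattened in (W, H) order would pair positions with the wrong patches here.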

Reproduction

This potential bug is conceptual, so no reproduction script is needed.

Logs

System Info

Current diffusers implementation.

Who can help?

@yiyixuxu @sayakpaul

Metadata

Labels

bug (Something isn't working)
