
fix compatibility issue between PAG and IP-adapter #8379


Merged
merged 2 commits on Jun 5, 2024

Conversation

sunovivid (Contributor)

What does this PR do?

I fixed the IP-adapter compatibility issue of the proposed PAG Mixin.

First, I found that load_ip_adapter overwrites the loaded PAGIdentitySelfAttnProcessor2_0 with AttnProcessor or AttnProcessor2_0 (see unet.py). So I changed the code to keep the original processor if it is not a cross-attention processor.
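The idea behind this first fix can be sketched roughly as follows. The class and variable names below are illustrative stand-ins, not the exact diffusers internals (in diffusers, self-attention layers are named `attn1` and cross-attention layers `attn2`):

```python
# Sketch of the first fix: when load_ip_adapter rebuilds the attention
# processors, keep whatever processor is already set on self-attention layers
# (e.g. a PAG processor) instead of overwriting it with the default.
# All names here are illustrative placeholders.

class AttnProcessor2_0:
    """Stand-in for the default processor."""

class PAGIdentitySelfAttnProcessor2_0:
    """Stand-in for the PAG self-attention processor."""

def rebuild_attn_procs(current_procs):
    new_procs = {}
    for name, proc in current_procs.items():
        if "attn2" in name:
            # cross-attention layers: replace with the (IP-adapter aware)
            # default processor
            new_procs[name] = AttnProcessor2_0()
        else:
            # self-attention layers: keep the existing processor, so a
            # previously installed PAG processor survives load_ip_adapter
            new_procs[name] = proc
    return new_procs

procs = {
    "mid_block.attn1.processor": PAGIdentitySelfAttnProcessor2_0(),
    "mid_block.attn2.processor": AttnProcessor2_0(),
}
rebuilt = rebuild_attn_procs(procs)
```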

Second, I also found that even if I use PAG only (not using CFG), the image embeddings of the IP-adapter are only applied to one of noise_pred_uncond or noise_pred_perturb (I can't remember the exact variable 😭). I checked this by changing _apply_perturbed_attention_guidance in pag_utils.py line 133 to use only noise_pred_uncond or noise_pred_perturb. If I do not use classifier-free guidance, the results should be the same, but the final results are completely different: one has the image condition applied, and the other does not. So, I copied image_embeds in pipeline_stable_diffusion_xl.py similar to how latents are copied when using CFG (latent_model_input = torch.cat([latents] * (prompt_embeds.shape[0] // latents.shape[0]))). I'm not sure this is the right approach because I couldn't identify the exact location where image_embeds are only applied to single latents.

Finally, I changed do_perturbed_attention_guidance of PAGMixin to work consistently even when pag_scale is 0. This is because if we use enable_pag(...), the attention processor is changed to PAGIdentitySelfAttnProcessor2_0 even though pag_scale is 0. This causes errors when a single latent passes through PAGIdentitySelfAttnProcessor2_0, which expects copied and concatenated latents.
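The third change can be sketched with a toy mixin (attribute names here are hypothetical; the point is that the flag tracks whether the PAG processors are installed, not whether pag_scale is positive):

```python
# Toy sketch of the third fix: do_perturbed_attention_guidance should reflect
# whether the PAG attention processors are active, even when pag_scale == 0,
# so the latents are still duplicated to match the concatenated input that
# PAGIdentitySelfAttnProcessor2_0 expects. Attribute names are hypothetical.

class PAGMixinSketch:
    def __init__(self):
        self._pag_enabled = False
        self._pag_scale = 0.0

    def enable_pag(self, pag_scale):
        # in the real pipeline, the PAG attention processors are swapped in here
        self._pag_scale = pag_scale
        self._pag_enabled = True

    def disable_pag(self):
        self._pag_enabled = False

    @property
    def do_perturbed_attention_guidance(self):
        # True whenever PAG processors are installed, regardless of pag_scale
        return self._pag_enabled

pipe = PAGMixinSketch()
pipe.enable_pag(pag_scale=0.0)  # still requires the duplicated batch
```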

Example code and results

I attached the example code and the results. In the grid image, the IP adapter scale increases to the right, and the PAG scale increases downward.

IP-adapter + PAG (without CFG)

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=False)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("finally.png")

(result grid: showcase_nocfg)

IP-adapter + PAG (with CFG)

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=True)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=3.0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("finally.png")

(result grid: showcase_cfg)

In my opinion, using PAG reduces artifacts and improves the overall composition both with and without CFG. It is very encouraging to see that PAG works well with the IP-adapter.

sunovivid mentioned this pull request on Jun 2, 2024
asomoza (Member) commented Jun 3, 2024

Really nice too! I'm loving PAG and what it does to the generations; thank you for your work.

IP Adapter VIT-H without CFG

(images: IP Image · No PAG · standard PAG · PAG + Custom layers)

IP Adapter VIT-H with CFG

(images: No PAG · standard PAG · PAG + Custom layers)

IP Adapter PLUS without CFG

(images: No PAG · standard PAG · PAG + Custom layers)

IP Adapter PLUS with CFG

(images: No PAG · standard PAG · PAG + Custom layers)

@@ -1172,6 +1172,10 @@ def __call__(
self.do_classifier_free_guidance,
)

        # expand the image embeddings if we are using perturbed-attention guidance
        for i in range(len(image_embeds)):
            image_embeds[i] = image_embeds[i].repeat(prompt_embeds.shape[0] // latents.shape[0], 1, 1)
asomoza (Member) commented Jun 3, 2024

This throws an error with the PLUS versions of IP-Adapters, because each image_embeds entry is a 4D tensor.
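The failure mode can be reproduced in isolation: `Tensor.repeat` requires at least as many repeat sizes as the tensor has dimensions, so the three-argument `repeat(n, 1, 1)` from the diff raises on 4D embeddings. The shapes below are illustrative, not the exact shapes the adapters produce:

```python
import torch

embed_3d = torch.randn(1, 4, 768)     # plain IP-Adapter style embedding (illustrative shape)
embed_4d = torch.randn(1, 2, 4, 768)  # IP-Adapter PLUS style 4D embedding (illustrative shape)

ok = embed_3d.repeat(2, 1, 1)  # works: batch dim doubled to (2, 4, 768)
try:
    embed_4d.repeat(2, 1, 1)   # fails: repeat needs >= tensor.dim() sizes
    failed = False
except RuntimeError:
    failed = True
```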

sunovivid (Contributor, Author)

Thank you for finding the error! I found the cause of the error and, thanks to this, came up with a more elegant design. I will upload the revised code with the results soon!

sunovivid (Contributor, Author)

Hi @asomoza. Thank you for your thorough review and awesome showcases!

As you detected errors when using IP-adapter PLUS, I dug into the code and found the cause of the problem. The issue was that the IP-adapter image embeddings were not properly copied, unlike prompt_embeds. In my previous implementation, I copied image_embeds heuristically, but as you pointed out, this causes an error when IP-adapter PLUS produces a 4D tensor. I noticed there is already a function to handle this, prepare_ip_adapter_image_embeds, so I implemented the copying of the image embeddings in that function. Now it works nicely with both IP-adapter and IP-adapter PLUS. Thank you for notifying me about this!
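A rank-agnostic way to do this duplication is to concatenate along the batch dimension, which handles both 3D and 4D embeddings uniformly. The helper name and shapes below are illustrative; the actual diffusers implementation may differ:

```python
import torch

def expand_image_embeds(image_embeds, repeats):
    # Concatenating along dim 0 duplicates the batch regardless of tensor rank,
    # so plain (3D) and PLUS-style (4D) IP-adapter embeddings are both handled.
    return [torch.cat([e] * repeats, dim=0) for e in image_embeds]

plain = torch.randn(1, 4, 768)     # illustrative 3D embedding
plus = torch.randn(1, 2, 4, 768)   # illustrative 4D (PLUS) embedding
expanded = expand_image_embeds([plain, plus], repeats=2)
```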

I have attached the example code and the results. In the grid image, the IP adapter scale increases to the right, and the PAG scale increases downward.

Example code and results

PAG only

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=False)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("showcase-ipadapterplus-nocfg.png")

Result:
(result grid: showcase-ipadapterplus-nocfg)

PAG with CFG

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=True)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=3.0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("showcase-ipadapterplus-cfg.png")

Results:
(result grid: showcase-ipadapterplus-cfg)

It would be fantastic to see PAG easily usable in Diffusers in the near future. If there's anything I can assist with, please let me know. Thank you!
