
fix compatibility issue between PAG and IP-adapter #8379


Merged
merged 2 commits on Jun 5, 2024

Conversation

sunovivid (Contributor)

What does this PR do?

I fixed the IP-adapter compatibility issue of the proposed PAG Mixin.

First, I found that load_ip_adapter overwrites the loaded PAGIdentitySelfAttnProcessor2_0 with AttnProcessor or AttnProcessor2_0 (see unet.py). So I changed the code to keep the original processor if it is not a cross-attention processor.
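The idea behind this first fix can be sketched roughly as follows. The class and variable names below are illustrative stand-ins, not the exact diffusers internals (in diffusers, self-attention layers are named `attn1` and cross-attention layers `attn2`):

```python
# Sketch of the first fix: when load_ip_adapter rebuilds the attention
# processors, keep whatever processor is already set on self-attention layers
# (e.g. a PAG processor) instead of overwriting it with the default.
# All names here are illustrative placeholders.

class AttnProcessor2_0:
    """Stand-in for the default processor."""

class PAGIdentitySelfAttnProcessor2_0:
    """Stand-in for the PAG self-attention processor."""

def rebuild_attn_procs(current_procs):
    new_procs = {}
    for name, proc in current_procs.items():
        if "attn2" in name:
            # cross-attention layers: replace with the (IP-adapter aware)
            # default processor
            new_procs[name] = AttnProcessor2_0()
        else:
            # self-attention layers: keep the existing processor, so a
            # previously installed PAG processor survives load_ip_adapter
            new_procs[name] = proc
    return new_procs

procs = {
    "mid_block.attn1.processor": PAGIdentitySelfAttnProcessor2_0(),
    "mid_block.attn2.processor": AttnProcessor2_0(),
}
rebuilt = rebuild_attn_procs(procs)
```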

Second, I also found that even if I use PAG only (not using CFG), the image embeddings of the IP-adapter are only applied to one of noise_pred_uncond or noise_pred_perturb (I can't remember the exact variable 😭). I checked this by changing _apply_perturbed_attention_guidance in pag_utils.py line 133 to use only noise_pred_uncond or noise_pred_perturb. If I do not use classifier-free guidance, the results should be the same, but the final results are completely different: one has the image condition applied, and the other does not. So, I copied image_embeds in pipeline_stable_diffusion_xl.py similar to how latents are copied when using CFG (latent_model_input = torch.cat([latents] * (prompt_embeds.shape[0] // latents.shape[0]))). I'm not sure this is the right approach because I couldn't identify the exact location where image_embeds are only applied to single latents.

Finally, I changed do_perturbed_attention_guidance of PAGMixin to work consistently even when pag_scale is 0. This is because if we use enable_pag(...), the attention processor is changed to PAGIdentitySelfAttnProcessor2_0 even though pag_scale is 0. This causes errors when a single latent passes through PAGIdentitySelfAttnProcessor2_0, which expects copied and concatenated latents.
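The third change can be sketched with a toy mixin (attribute names here are hypothetical; the point is that the flag tracks whether the PAG processors are installed, not whether pag_scale is positive):

```python
# Toy sketch of the third fix: do_perturbed_attention_guidance should reflect
# whether the PAG attention processors are active, even when pag_scale == 0,
# so the latents are still duplicated to match the concatenated input that
# PAGIdentitySelfAttnProcessor2_0 expects. Attribute names are hypothetical.

class PAGMixinSketch:
    def __init__(self):
        self._pag_enabled = False
        self._pag_scale = 0.0

    def enable_pag(self, pag_scale):
        # in the real pipeline, the PAG attention processors are swapped in here
        self._pag_scale = pag_scale
        self._pag_enabled = True

    def disable_pag(self):
        self._pag_enabled = False

    @property
    def do_perturbed_attention_guidance(self):
        # True whenever PAG processors are installed, regardless of pag_scale
        return self._pag_enabled

pipe = PAGMixinSketch()
pipe.enable_pag(pag_scale=0.0)  # still requires the duplicated batch
```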

Example code and results

I attached the example code and the results. In the grid image, the IP adapter scale increases to the right, and the PAG scale increases downward.

IP-adapter + PAG (without CFG)

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=False)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("finally.png")

(result grid: showcase_nocfg)

IP-adapter + PAG (with CFG)

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=True)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=3.0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("finally.png")

(result grid: showcase_cfg)

In my opinion, using PAG reduces artifacts and improves the overall composition both with and without CFG. It is very encouraging to see that PAG works well with the IP-adapter.

sunovivid mentioned this pull request on Jun 2, 2024
asomoza (Member) commented Jun 3, 2024

Really nice too! I'm loving PAG and what it does to the generations; thank you for your work.

IP Adapter VIT-H without CFG

(images: IP Image · No PAG · standard PAG · PAG + Custom layers)

IP Adapter VIT-H with CFG

(images: No PAG · standard PAG · PAG + Custom layers)

IP Adapter PLUS without CFG

(images: No PAG · standard PAG · PAG + Custom layers)

IP Adapter PLUS with CFG

(images: No PAG · standard PAG · PAG + Custom layers)

@@ -1172,6 +1172,10 @@ def __call__(
self.do_classifier_free_guidance,
)

        # expand the image embeddings if we are using perturbed-attention guidance
        for i in range(len(image_embeds)):
            image_embeds[i] = image_embeds[i].repeat(prompt_embeds.shape[0] // latents.shape[0], 1, 1)
asomoza (Member) commented Jun 3, 2024

This throws an error with the PLUS versions of IP-Adapters, because each image_embeds entry is a 4D tensor.
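The failure mode can be reproduced in isolation: `Tensor.repeat` requires at least as many repeat sizes as the tensor has dimensions, so the three-argument `repeat(n, 1, 1)` from the diff raises on 4D embeddings. The shapes below are illustrative, not the exact shapes the adapters produce:

```python
import torch

embed_3d = torch.randn(1, 4, 768)     # plain IP-Adapter style embedding (illustrative shape)
embed_4d = torch.randn(1, 2, 4, 768)  # IP-Adapter PLUS style 4D embedding (illustrative shape)

ok = embed_3d.repeat(2, 1, 1)  # works: batch dim doubled to (2, 4, 768)
try:
    embed_4d.repeat(2, 1, 1)   # fails: repeat needs >= tensor.dim() sizes
    failed = False
except RuntimeError:
    failed = True
```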

sunovivid (Contributor, Author)

Thank you for finding the error! I found the cause of the error and, thanks to this, came up with a more elegant design. I will upload the revised code with the results soon!

sunovivid (Contributor, Author)

Hi @asomoza. Thank you for your thorough review and awesome showcases!

As you detected errors when using IP-adapter PLUS, I dug into the code and found the cause of the problem. The issue was that the IP-adapter image embeddings were not properly copied, unlike prompt_embeds. In my previous implementation, I copied image_embeds heuristically, but as you pointed out, this causes an error when IP-adapter PLUS produces a 4D tensor. I noticed there is already a function to handle this, prepare_ip_adapter_image_embeds, so I implemented the copying of the image embeddings in that function. Now it works nicely with both IP-adapter and IP-adapter PLUS. Thank you for notifying me about this!
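A rank-agnostic way to do this duplication is to concatenate along the batch dimension, which handles both 3D and 4D embeddings uniformly. The helper name and shapes below are illustrative; the actual diffusers implementation may differ:

```python
import torch

def expand_image_embeds(image_embeds, repeats):
    # Concatenating along dim 0 duplicates the batch regardless of tensor rank,
    # so plain (3D) and PLUS-style (4D) IP-adapter embeddings are both handled.
    return [torch.cat([e] * repeats, dim=0) for e in image_embeds]

plain = torch.randn(1, 4, 768)     # illustrative 3D embedding
plus = torch.randn(1, 2, 4, 768)   # illustrative 4D (PLUS) embedding
expanded = expand_image_embeds([plain, plus], repeats=2)
```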

I have attached the example code and the results. In the grid image, the IP adapter scale increases to the right, and the PAG scale increases downward.

Example code and results

PAG only

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=False)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("showcase-ipadapterplus-nocfg.png")

Result:
(result grid: showcase-ipadapterplus-nocfg)

PAG with CFG

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

pag_scales = [0.0, 1.5, 3.0, 5.0, 7.0]
ip_adapter_scales = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.enable_pag(pag_scale=pag_scale, pag_applied_layers=["mid"], pag_cfg=True)
        pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")
        pipeline.set_ip_adapter_scale(ip_adapter_scale)

        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=3.0,
            generator=generator,
        ).images

        grid.append(images[0])
        pipeline.disable_pag()

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("showcase-ipadapterplus-cfg.png")

Results:
(result grid: showcase-ipadapterplus-cfg)

It would be fantastic to see PAG easily usable in Diffusers in the near future. If there's anything I can assist with, please let me know. Thank you!
