
Flux Redux #9988


Closed
wants to merge 8 commits into from

Conversation

yiyixuxu
Collaborator

@yiyixuxu yiyixuxu commented Nov 22, 2024

part of #9985

This PR adds Flux Redux

TO-DO

  • test whether using all-zero prompt embeds makes a difference in the output; if not, we do not need T5 when Redux is used
  • docs and tests
  • test Flux Redux with all other pipelines, including the control and fill models that were just released (this will be in a different PR)

I tested an empty prompt vs. zero prompt embeds, and the results look similar. In that case, we can recommend running Redux without the text encoders. Here are the results: the left is with text_encoders and the right is without.

cc @asomoza here, can you test a little bit and let me know what you think

To use with T5 (same as in the original implementation, prompt=""):

# test 1
import torch
from PIL import Image

from diffusers import FluxPriorReduxPipeline, FluxPipeline

device = "cuda"
dtype = torch.bfloat16

repo_redux = "YiYiXu/yiyi-redux"
repo_base = "black-forest-labs/FLUX.1-dev"

pipe = FluxPipeline.from_pretrained(repo_base, torch_dtype=torch.bfloat16)
pipe_prior_redux = FluxPriorReduxPipeline.from_pretrained(
    repo_redux, 
    text_encoder=pipe.text_encoder,
    tokenizer=pipe.tokenizer,
    text_encoder_2=pipe.text_encoder_2,
    tokenizer_2=pipe.tokenizer_2,
    torch_dtype=dtype
)
pipe_prior_redux.to(device)

img_path = "/raid/yiyi/flux-new/assets/robot.webp"
image = Image.open(img_path).convert("RGB")

pipe_prior_output = pipe_prior_redux(image)


pipe.enable_model_cpu_offload() #save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
# load precomputed initial latents (local file); fixing the starting noise keeps the two tests comparable
latents = torch.load("/raid/yiyi/flux-new/redux_latents.pt")
print(latents.shape)
image = pipe(
    guidance_scale=2.5,
    height=768,
    width=1360,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
    latents=latents,
    **pipe_prior_output,
).images[0]
image.save("yiyi_test_5_out.png")

# Clean up memory
del pipe
del pipe_prior_redux
import gc
gc.collect()
torch.cuda.empty_cache()

To run without T5 (use zero prompt embeds):

# test 2 (zero prompt embeds)

import torch
from PIL import Image
from diffusers import FluxPriorReduxPipeline, FluxPipeline

device = "cuda"
dtype = torch.bfloat16


repo_redux = "YiYiXu/yiyi-redux"
repo_base = "black-forest-labs/FLUX.1-dev"

pipe_prior_redux = FluxPriorReduxPipeline.from_pretrained(repo_redux, torch_dtype=dtype)
pipe_prior_redux.to(device)

img_path = "/raid/yiyi/flux-new/assets/robot.webp"
image = Image.open(img_path).convert("RGB")
pipe_prior_output = pipe_prior_redux(image)

pipe = FluxPipeline.from_pretrained(
    repo_base, 
    text_encoder=None,
    tokenizer=None,
    text_encoder_2=None,
    tokenizer_2=None,
    torch_dtype=torch.bfloat16
)
pipe.to(device)  # or use pipe.enable_model_cpu_offload() instead to save some VRAM
# load the same precomputed initial latents (local file) used in test 1 so the outputs can be compared
latents = torch.load("/raid/yiyi/flux-new/redux_latents.pt")
print(latents.shape)
image = pipe(
    guidance_scale=2.5,
    height=768,
    width=1360,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
    latents=latents,
    **pipe_prior_output,
).images[0]
image.save("yiyi_test_5_out_2.png")
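
As an optional sanity check of the "results are similar" claim (not part of the PR), here is a minimal sketch that compares the two saved outputs pixel-wise; it assumes both scripts above have been run so that yiyi_test_5_out.png and yiyi_test_5_out_2.png exist and have the same size:

# illustrative sanity check: pixel-wise difference between the two outputs
import numpy as np
from PIL import Image

with_te = np.asarray(Image.open("yiyi_test_5_out.png"), dtype=np.float32)
without_te = np.asarray(Image.open("yiyi_test_5_out_2.png"), dtype=np.float32)
print("mean absolute pixel difference:", np.abs(with_te - without_te).mean())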

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@wsxwd

wsxwd commented Nov 22, 2024

black-forest-labs/FLUX.1-Redux-dev only supports image prompts, right?

@yiyixuxu
Collaborator Author

@wsxwd yes

pooled_prompt_embeds,
_,
) = self.encode_prompt(
prompt=[""] * batch_size,
Collaborator Author

@yiyixuxu yiyixuxu Nov 22, 2024


Even though flux-dev-redux is an image variation pipeline and does not take text input, the original implementation still puts an empty string through the text encoders to create the prompt embeds, then concatenates them with the image embeds to create the model inputs.

However, the model itself does not seem to be controlled by text prompts at all, i.e., if you change the empty prompt to a different prompt, the generation does not seem to be affected by the prompt itself, even though the output might be slightly different.

So here, for now, we do that only if the text encoders are explicitly added to the Redux pipeline; otherwise, we create zero prompt embeds. @asomoza, could you test it a bit and let me know if it affects the generation?
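
For reference, the zero-prompt-embeds fallback amounts to something like the following sketch; the shapes are assumptions (T5-XXL hidden size 4096 for prompt_embeds, CLIP-L pooled size 768 for pooled_prompt_embeds), not the pipeline's actual code:

# illustrative sketch of the zero prompt embeds fallback (assumed shapes)
import torch

batch_size, max_sequence_length = 1, 512
prompt_embeds = torch.zeros(batch_size, max_sequence_length, 4096, dtype=torch.bfloat16)  # T5 stream
pooled_prompt_embeds = torch.zeros(batch_size, 768, dtype=torch.bfloat16)  # CLIP pooled stream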

If it turns out that the prompt embeds generated from an empty string are better, we can add a prompt/prompt_embeds argument to the Redux pipeline; that way users can save the prompt embeds and pass them as inputs, so they don't have to load the text encoders (a rough sketch of that idea follows below).

I added the scripts for both use cases in the PR description.
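
As a rough sketch of the save-and-reuse idea (not part of this PR; the file name empty_prompt_embeds.pt is just an example), the empty-prompt embeds could be computed once with FluxPipeline.encode_prompt and stored. Reusing them would then depend on the proposed prompt_embeds argument being added to the Redux pipeline:

# illustrative sketch: precompute and save the empty-prompt embeds once,
# so later Redux runs would not need to load the text encoders
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")
prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt="", prompt_2=None, max_sequence_length=512
)
# example file name; these could later be passed to the Redux pipeline
# if a prompt_embeds argument is added as discussed above
torch.save(
    {"prompt_embeds": prompt_embeds.cpu(), "pooled_prompt_embeds": pooled_prompt_embeds.cpu()},
    "empty_prompt_embeds.pt",
)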

Contributor

@bghira bghira Nov 23, 2024


Repeat the prompt a lot and it starts to impact things. The token count of the image space is just very large, so its importance very much outweighs the smaller text input.

Collaborator Author


Can you share an example?

Contributor


One was not provided by the user who reported this.

Member

@a-r-r-o-w a-r-r-o-w left a comment


The changes look great and seem to be working (I haven't tried matching numerically to the original, but I believe you've covered it)! We can handle tests, docs, and the example usage docstring in the combined PR. Thanks!


logger = logging.get_logger(__name__) # pylint: disable=invalid-name

EXAMPLE_DOC_STRING = """
Member


Example needs update with correct pipeline usage

Collaborator

@DN6 DN6 left a comment


Nice 👍🏽

@asomoza
Member

asomoza commented Nov 22, 2024

With more complex images the difference is also minimal and the image quality is the same, so with the added benefit of using less VRAM, running without the T5 is the way to go if people want to use this.

[Image comparison grid: source / with TE / without TE results for cat_square, capyrabbit, and ip_image]

The different variations and image sizes work flawlessly too.

Base automatically changed from flux-new to main November 23, 2024 11:41
@yiyixuxu yiyixuxu closed this Nov 24, 2024
@yiyixuxu yiyixuxu deleted the flux-redux branch November 24, 2024 21:18
@lhjlhj11

So does Redux support text prompt input now?

@bghira
Contributor

bghira commented Jan 10, 2025

Yes, it works with prompts. You need attention bias to upweight the importance of the prompt vs. image tokens.
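
The attention-bias approach mentioned above is not shown in this thread. As a crude, editor-added stand-in (an assumption, not this PR's API and not attention bias), one could damp the image tokens the Redux prior appends after the text tokens before handing the embeds to the base pipeline. The sketch continues from the "test 1" script in the PR description (before its cleanup step) and assumes the prior concatenates 512 T5 text tokens first, followed by the image tokens:

# editor sketch: crude rebalancing by damping the image tokens from the Redux prior
# (not the attention-bias approach; the 512 split and 0.8 factor are assumptions)
prompt_embeds = pipe_prior_output.prompt_embeds.clone()
pooled_prompt_embeds = pipe_prior_output.pooled_prompt_embeds

text_len = 512  # assumed length of the T5 text portion (max_sequence_length)
prompt_embeds[:, text_len:] = prompt_embeds[:, text_len:] * 0.8  # hypothetical damping factor

image = pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    guidance_scale=2.5,
    height=768,
    width=1360,
    num_inference_steps=50,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("yiyi_test_rebalanced_out.png")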
