Commit a6ffbd1

Merge branch 'main' of https://github.com/huggingface/diffusers into noise-autocorr-loss
2 parents 400d2f4 + 0a73b4d commit a6ffbd1

File tree: 153 files changed (+2872 / -741 lines)


.github/workflows/pr_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ jobs:
 framework: pytorch_examples
 runner: docker-cpu
 image: diffusers/diffusers-pytorch-cpu
-report: torch_cpu
+report: torch_example_cpu

 name: ${{ matrix.config.name }}

.github/workflows/push_tests_fast.yml

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ jobs:
 framework: pytorch_examples
 runner: docker-cpu
 image: diffusers/diffusers-pytorch-cpu
-report: torch_cpu
+report: torch_example_cpu

 name: ${{ matrix.config.name }}

docs/source/en/api/loaders.mdx

Lines changed: 8 additions & 0 deletions
@@ -28,3 +28,11 @@ API to load such adapter neural networks via the [`loaders.py` module](https://g
 ### UNet2DConditionLoadersMixin

 [[autodoc]] loaders.UNet2DConditionLoadersMixin
+
+### TextualInversionLoaderMixin
+
+[[autodoc]] loaders.TextualInversionLoaderMixin
+
+### LoraLoaderMixin
+
+[[autodoc]] loaders.LoraLoaderMixin
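
For orientation, a minimal, hedged sketch of how the two newly documented mixins are typically used from a Stable Diffusion pipeline (the LoRA weights path below is a placeholder for illustration, not something referenced in this commit):

```python
import torch
from diffusers import StableDiffusionPipeline

# StableDiffusionPipeline inherits TextualInversionLoaderMixin and LoraLoaderMixin,
# so both loader methods are available directly on the pipeline instance.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# TextualInversionLoaderMixin: add a learned placeholder token to the text encoder.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# LoraLoaderMixin: load LoRA attention weights (placeholder path, for illustration only).
pipe.load_lora_weights("path/to/lora_weights")

image = pipe("A <cat-toy> backpack").images[0]
image.save("cat-backpack.png")
```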

docs/source/en/api/pipelines/stable_diffusion/self_attention_guidance.mdx

Lines changed: 5 additions & 4 deletions
@@ -14,25 +14,26 @@ specific language governing permissions and limitations under the License.

 ## Overview

-[Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
+[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.

 The abstract of the paper is the following:

-*Denoising diffusion models (DDMs) have been drawing much attention for their appreciable sample quality and diversity. Despite their remarkable performance, DDMs remain black boxes on which further study is necessary to take a profound step. Motivated by this, we delve into the design of conventional U-shaped diffusion models. More specifically, we investigate the self-attention modules within these models through carefully designed experiments and explore their characteristics. In addition, inspired by the studies that substantiate the effectiveness of the guidance schemes, we present plug-and-play diffusion guidance, namely Self-Attention Guidance (SAG), that can drastically boost the performance of existing diffusion models. Our method, SAG, extracts the intermediate attention map from a diffusion model at every iteration and selects tokens above a certain attention score for masking and blurring to obtain a partially blurred input. Subsequently, we measure the dissimilarity between the predicted noises obtained from feeding the blurred and original input to the diffusion model and leverage it as guidance. With this guidance, we observe apparent improvements in a wide range of diffusion models, e.g., ADM, IDDPM, and Stable Diffusion, and show that the results further improve by combining our method with the conventional guidance scheme. We provide extensive ablation studies to verify our choices.*
+*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.*

 Resources:

 * [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance).
 * [Paper](https://arxiv.org/abs/2210.00939).
 * [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance).
-* [Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
+* [Hugging Face Demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance).
+* [Colab Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).


 ## Available Pipelines:

 | Pipeline | Tasks | Demo
 |---|---|:---:|
-| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [Colab](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb) |
+| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [🤗 Space](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) |

 ## Usage example
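
The diff's trailing context cuts off at the `## Usage example` heading. For reference, a minimal hedged sketch of how `StableDiffusionSAGPipeline` is typically invoked (the prompt and `sag_scale` value are illustrative, not taken from this commit):

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
# sag_scale controls how strongly self-attention guidance steers denoising;
# 0.75 is a common starting point, and sag_scale=0 disables SAG.
image = pipe(prompt, sag_scale=0.75).images[0]
image.save("astronaut_sag.png")
```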

docs/source/en/api/pipelines/text_to_video_zero.mdx

Lines changed: 7 additions & 2 deletions
@@ -61,13 +61,15 @@ Resources:
 To generate a video from prompt, run the following python command
 ```python
 import torch
+import imageio
 from diffusers import TextToVideoZeroPipeline

 model_id = "runwayml/stable-diffusion-v1-5"
 pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

 prompt = "A panda is playing guitar on times square"
 result = pipe(prompt=prompt).images
+result = [(r * 255).astype("uint8") for r in result]
 imageio.mimsave("video.mp4", result, fps=4)
 ```
 You can change these parameters in the pipeline call:
@@ -95,6 +97,7 @@ To generate a video from prompt with additional pose control

 2. Read video containing extracted pose images
 ```python
+from PIL import Image
 import imageio

 reader = imageio.get_reader(video_path, "ffmpeg")
@@ -151,6 +154,7 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/

 2. Read video from path
 ```python
+from PIL import Image
 import imageio

 reader = imageio.get_reader(video_path, "ffmpeg")
@@ -174,14 +178,14 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/
 ```


-### Dreambooth specialization
+### DreamBooth specialization

 Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control**
 can run with custom [DreamBooth](../training/dreambooth) models, as shown below for
 [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
 [Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model

-1. Download demo video from huggingface
+1. Download a demo video

 ```python
 from huggingface_hub import hf_hub_download
@@ -193,6 +197,7 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below

 2. Read video from path
 ```python
+from PIL import Image
 import imageio

 reader = imageio.get_reader(video_path, "ffmpeg")
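
The hunks above repeatedly add `from PIL import Image` next to `import imageio`. A self-contained, hedged sketch (not part of the commit) of the frame-reading pattern those imports support, with a placeholder video path:

```python
# Read frames from a video with imageio and convert each one to a PIL image,
# which is the format video-conditioning pipelines typically expect.
from PIL import Image
import imageio

video_path = "input_video.mp4"  # placeholder path
reader = imageio.get_reader(video_path, "ffmpeg")
frames = [Image.fromarray(frame) for frame in reader]
print(f"read {len(frames)} frames, first frame size: {frames[0].size}")
```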

docs/source/en/conceptual/contribution.mdx

Lines changed: 2 additions & 2 deletions
@@ -170,7 +170,7 @@ please have a look at the next sections.

 For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull requst](#how-to-open-a-pr) section.

-### 4. Fixing a "Good first issue"
+### 4. Fixing a `Good first issue`

 *Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already
 explains how a potential solution should look so that it is easier to fix.
@@ -275,7 +275,7 @@ Once an example script works, please make sure to add a comprehensive `README.md`

 If you are contributing to the official training examples, please also make sure to add a test to [examples/test_examples.py](https://github.com/huggingface/diffusers/blob/main/examples/test_examples.py). This is not necessary for non-official training examples.

-### 8. Fixing a "Good second issue"
+### 8. Fixing a `Good second issue`

 *Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are
 usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).

docs/source/en/index.mdx

Lines changed: 2 additions & 2 deletions
@@ -73,7 +73,7 @@ The library has three main components:
 | [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing|
 | [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
 | [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
-| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
+| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation |
 | [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
 | [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing |
@@ -90,4 +90,4 @@ The library has three main components:
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
-| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
+| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |

docs/source/en/training/text_inversion.mdx

Lines changed: 41 additions & 4 deletions
@@ -157,24 +157,61 @@ If you're interested in following along with your model training progress, you c

 ## Inference

-Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`]. Make sure you include the `placeholder_token` in your prompt, in this case, it is `<cat-toy>`.
+Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`].
+
+The textual inversion script will by default only save the textual inversion embedding vector(s) that have
+been added to the text encoder embedding matrix and consequently been trained.

 <frameworkcontent>
 <pt>
+<Tip>
+
+💡 The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library).
+Instead of training textual inversion embeddings from scratch you can also see whether a fitting textual inversion embedding has already been added to the libary.
+
+</Tip>
+
+To load the textual inversion embeddings you first need to load the base model that was used when training
+your textual inversion embedding vectors. Here we assume that [`runwayml/stable-diffusion-v1-5`](runwayml/stable-diffusion-v1-5)
+was used as a base model so we load it first:
 ```python
 from diffusers import StableDiffusionPipeline
+import torch

-model_id = "path-to-your-trained-model"
+model_id = "runwayml/stable-diffusion-v1-5"
 pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+```

-prompt = "A <cat-toy> backpack"
+Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`]
+function. Here we'll load the embeddings of the "<cat-toy>" example from before.
+```python
+pipe.load_textual_inversion("sd-concepts-library/cat-toy")
+```

-image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
+Now we can run the pipeline making sure that the placeholder token `<cat-toy>` is used in our prompt.

+```python
+prompt = "A <cat-toy> backpack"
+
+image = pipe(prompt, num_inference_steps=50).images[0]
 image.save("cat-backpack.png")
 ```
+
+The function [`TextualInversionLoaderMixin.load_textual_inversion`] can not only
+load textual embedding vectors saved in Diffusers' format, but also embedding vectors
+saved in [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format.
+To do so, you can first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387)
+and then load it locally:
+```python
+pipe.load_textual_inversion("./charturnerv2.pt")
+```
 </pt>
 <jax>
+Currently there is no `load_textual_inversion` function for Flax so one has to make sure the textual inversion
+embedding vector is saved as part of the model after training.
+
+The model can then be run just like any other Flax model:
+
 ```python
 import jax
 import numpy as np

docs/source/en/tutorials/basic_training.mdx

Lines changed: 1 addition & 1 deletion
@@ -344,7 +344,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce

 ...         # Sample a random timestep for each image
 ...         timesteps = torch.randint(
-...             0, noise_scheduler.num_train_timesteps, (bs,), device=clean_images.device
+...             0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device
 ...         ).long()

 ...         # Add noise to the clean images according to the noise magnitude at each timestep
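
This hunk is one instance of a pattern applied throughout the commit: hyperparameters are read from an object's `config` rather than as direct attributes. A minimal sketch of the same access pattern in isolation (the scheduler choice is only illustrative):

```python
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Preferred access: read the value from the frozen config namespace.
print(noise_scheduler.config.num_train_timesteps)  # 1000
```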

docs/source/en/using-diffusers/contribute_pipeline.mdx

Lines changed: 2 additions & 2 deletions
@@ -62,7 +62,7 @@ class UnetSchedulerOneForwardPipeline(DiffusionPipeline):

     def __call__(self):
         image = torch.randn(
-            (1, self.unet.in_channels, self.unet.sample_size, self.unet.sample_size),
+            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
         )
         timestep = 1

@@ -108,7 +108,7 @@ class UnetSchedulerOneForwardPipeline(DiffusionPipeline):

     def __call__(self):
         image = torch.randn(
-            (1, self.unet.in_channels, self.unet.sample_size, self.unet.sample_size),
+            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
         )
         timestep = 1
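
Both hunks apply the same `.config` pattern to the UNet. A small, self-contained sketch of reading the shape-related values that way (the toy UNet configuration is only for illustration; real pipelines load a pretrained UNet):

```python
import torch
from diffusers import UNet2DModel

# Toy UNet built locally so the snippet needs no download.
unet = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)

# Shape arguments come from the UNet's config, the supported way to read them.
noise = torch.randn(
    (1, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)
)
print(noise.shape)  # torch.Size([1, 3, 64, 64])
```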

docs/source/en/using-diffusers/custom_pipeline_overview.mdx

Lines changed: 3 additions & 1 deletion
@@ -89,7 +89,9 @@ class MyPipeline(DiffusionPipeline):
     @torch.no_grad()
     def __call__(self, batch_size: int = 1, num_inference_steps: int = 50):
         # Sample gaussian noise to begin loop
-        image = torch.randn((batch_size, self.unet.in_channels, self.unet.sample_size, self.unet.sample_size))
+        image = torch.randn(
+            (batch_size, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size)
+        )

         image = image.to(self.device)

docs/source/en/using-diffusers/loading.mdx

Lines changed: 20 additions & 1 deletion
@@ -123,7 +123,7 @@ stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, safety_checker=Non

 ### Reuse components across pipelines

-You can also reuse the same components in multiple pipelines without loading the weights into RAM twice. Use the [`DiffusionPipeline.components`] method to save the components in `components`:
+You can also reuse the same components in multiple pipelines to avoid loading the weights into RAM twice. Use the [`~DiffusionPipeline.components`] method to save the components:

 ```python
 from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
@@ -140,6 +140,25 @@ Then you can pass the `components` to another pipeline without reloading the wei
 stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(**components)
 ```

+You can also pass the components individually to the pipeline if you want more flexibility over which components to reuse or disable. For example, to reuse the same components in the text-to-image pipeline, except for the safety checker and feature extractor, in the image-to-image pipeline:
+
+```py
+from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)
+stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(
+    vae=stable_diffusion_txt2img.vae,
+    text_encoder=stable_diffusion_txt2img.text_encoder,
+    tokenizer=stable_diffusion_txt2img.tokenizer,
+    unet=stable_diffusion_txt2img.unet,
+    scheduler=stable_diffusion_txt2img.scheduler,
+    safety_checker=None,
+    feature_extractor=None,
+    requires_safety_checker=False,
+)
+```
+
 ## Checkpoint variants

 A checkpoint variant is usually a checkpoint where it's weights are:
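
The second hunk above builds the img2img pipeline by passing each component explicitly; for comparison, here is a hedged sketch of the dictionary-based `components` shortcut that the surrounding file describes (the full code between the two hunks is not shown in this diff):

```python
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)

# .components returns a dict of every registered module (vae, unet, scheduler, ...),
# so a second pipeline can be built without loading the weights again.
components = stable_diffusion_txt2img.components
stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(**components)
```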

examples/community/bit_diffusion.py

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ def __call__(
         **kwargs,
     ) -> Union[Tuple, ImagePipelineOutput]:
         latents = torch.randn(
-            (batch_size, self.unet.in_channels, height, width),
+            (batch_size, self.unet.config.in_channels, height, width),
             generator=generator,
         )
         latents = decimal_to_bits(latents) * self.bit_scale

examples/community/clip_guided_stable_diffusion.py

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ def __call__(
         # Unlike in other pipelines, latents need to be generated in the target device
         # for 1-to-1 results reproducibility with the CompVis implementation.
         # However this currently doesn't work in `mps`.
-        latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8)
+        latents_shape = (batch_size * num_images_per_prompt, self.unet.config.in_channels, height // 8, width // 8)
         latents_dtype = text_embeddings.dtype
         if latents is None:
             if self.device.type == "mps":

examples/community/clip_guided_stable_diffusion_img2img.py

Lines changed: 1 addition & 1 deletion
@@ -414,7 +414,7 @@ def __call__(
         # Unlike in other pipelines, latents need to be generated in the target device
         # for 1-to-1 results reproducibility with the CompVis implementation.
         # However this currently doesn't work in `mps`.
-        latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8)
+        latents_shape = (batch_size * num_images_per_prompt, self.unet.config.in_channels, height // 8, width // 8)
         latents_dtype = text_embeddings.dtype
         if latents is None:
             if self.device.type == "mps":

0 commit comments
