Commit a6ffbd1

Merge branch 'main' of https://github.com/huggingface/diffusers into noise-autocorr-loss
2 parents 400d2f4 + 0a73b4d commit a6ffbd1

File tree: 153 files changed (+2872 / -741 lines)


.github/workflows/pr_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ jobs:
 framework: pytorch_examples
 runner: docker-cpu
 image: diffusers/diffusers-pytorch-cpu
-report: torch_cpu
+report: torch_example_cpu

 name: ${{ matrix.config.name }}

.github/workflows/push_tests_fast.yml

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ jobs:
 framework: pytorch_examples
 runner: docker-cpu
 image: diffusers/diffusers-pytorch-cpu
-report: torch_cpu
+report: torch_example_cpu

 name: ${{ matrix.config.name }}

docs/source/en/api/loaders.mdx

Lines changed: 8 additions & 0 deletions
@@ -28,3 +28,11 @@ API to load such adapter neural networks via the [`loaders.py` module](https://g
 ### UNet2DConditionLoadersMixin

 [[autodoc]] loaders.UNet2DConditionLoadersMixin
+
+### TextualInversionLoaderMixin
+
+[[autodoc]] loaders.TextualInversionLoaderMixin
+
+### LoraLoaderMixin
+
+[[autodoc]] loaders.LoraLoaderMixin
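
For orientation, a minimal, hedged sketch of how the two newly documented mixins are typically used from a Stable Diffusion pipeline (the LoRA weights path below is a placeholder for illustration, not something referenced in this commit):

```python
import torch
from diffusers import StableDiffusionPipeline

# StableDiffusionPipeline inherits TextualInversionLoaderMixin and LoraLoaderMixin,
# so both loader methods are available directly on the pipeline instance.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# TextualInversionLoaderMixin: add a learned placeholder token to the text encoder.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# LoraLoaderMixin: load LoRA attention weights (placeholder path, for illustration only).
pipe.load_lora_weights("path/to/lora_weights")

image = pipe("A <cat-toy> backpack").images[0]
image.save("cat-backpack.png")
```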

docs/source/en/api/pipelines/stable_diffusion/self_attention_guidance.mdx

Lines changed: 5 additions & 4 deletions
@@ -14,25 +14,26 @@ specific language governing permissions and limitations under the License.

 ## Overview

-[Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
+[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.

 The abstract of the paper is the following:

-*Denoising diffusion models (DDMs) have been drawing much attention for their appreciable sample quality and diversity. Despite their remarkable performance, DDMs remain black boxes on which further study is necessary to take a profound step. Motivated by this, we delve into the design of conventional U-shaped diffusion models. More specifically, we investigate the self-attention modules within these models through carefully designed experiments and explore their characteristics. In addition, inspired by the studies that substantiate the effectiveness of the guidance schemes, we present plug-and-play diffusion guidance, namely Self-Attention Guidance (SAG), that can drastically boost the performance of existing diffusion models. Our method, SAG, extracts the intermediate attention map from a diffusion model at every iteration and selects tokens above a certain attention score for masking and blurring to obtain a partially blurred input. Subsequently, we measure the dissimilarity between the predicted noises obtained from feeding the blurred and original input to the diffusion model and leverage it as guidance. With this guidance, we observe apparent improvements in a wide range of diffusion models, e.g., ADM, IDDPM, and Stable Diffusion, and show that the results further improve by combining our method with the conventional guidance scheme. We provide extensive ablation studies to verify our choices.*
+*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.*

 Resources:

 * [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance).
 * [Paper](https://arxiv.org/abs/2210.00939).
 * [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance).
-* [Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
+* [Hugging Face Demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance).
+* [Colab Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).


 ## Available Pipelines:

 | Pipeline | Tasks | Demo
 |---|---|:---:|
-| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [Colab](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb) |
+| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [🤗 Space](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) |

 ## Usage example
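
The diff's trailing context cuts off at the `## Usage example` heading. For reference, a minimal hedged sketch of how `StableDiffusionSAGPipeline` is typically invoked (the prompt and `sag_scale` value are illustrative, not taken from this commit):

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
# sag_scale controls how strongly self-attention guidance steers denoising;
# 0.75 is a common starting point, and sag_scale=0 disables SAG.
image = pipe(prompt, sag_scale=0.75).images[0]
image.save("astronaut_sag.png")
```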

docs/source/en/api/pipelines/text_to_video_zero.mdx

Lines changed: 7 additions & 2 deletions
@@ -61,13 +61,15 @@ Resources:
 To generate a video from prompt, run the following python command
 ```python
 import torch
+import imageio
 from diffusers import TextToVideoZeroPipeline

 model_id = "runwayml/stable-diffusion-v1-5"
 pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

 prompt = "A panda is playing guitar on times square"
 result = pipe(prompt=prompt).images
+result = [(r * 255).astype("uint8") for r in result]
 imageio.mimsave("video.mp4", result, fps=4)
 ```
 You can change these parameters in the pipeline call:
@@ -95,6 +97,7 @@ To generate a video from prompt with additional pose control

 2. Read video containing extracted pose images
 ```python
+from PIL import Image
 import imageio

 reader = imageio.get_reader(video_path, "ffmpeg")
@@ -151,6 +154,7 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/

 2. Read video from path
 ```python
+from PIL import Image
 import imageio

 reader = imageio.get_reader(video_path, "ffmpeg")
@@ -174,14 +178,14 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/
 ```


-### Dreambooth specialization
+### DreamBooth specialization

 Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control**
 can run with custom [DreamBooth](../training/dreambooth) models, as shown below for
 [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
 [Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model

-1. Download demo video from huggingface
+1. Download a demo video

 ```python
 from huggingface_hub import hf_hub_download
@@ -193,6 +197,7 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below

 2. Read video from path
 ```python
+from PIL import Image
 import imageio

 reader = imageio.get_reader(video_path, "ffmpeg")
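
The hunks above repeatedly add `from PIL import Image` next to `import imageio`. A self-contained, hedged sketch (not part of the commit) of the frame-reading pattern those imports support, with a placeholder video path:

```python
# Read frames from a video with imageio and convert each one to a PIL image,
# which is the format video-conditioning pipelines typically expect.
from PIL import Image
import imageio

video_path = "input_video.mp4"  # placeholder path
reader = imageio.get_reader(video_path, "ffmpeg")
frames = [Image.fromarray(frame) for frame in reader]
print(f"read {len(frames)} frames, first frame size: {frames[0].size}")
```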

docs/source/en/conceptual/contribution.mdx

Lines changed: 2 additions & 2 deletions
@@ -170,7 +170,7 @@ please have a look at the next sections.

 For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull requst](#how-to-open-a-pr) section.

-### 4. Fixing a "Good first issue"
+### 4. Fixing a `Good first issue`

 *Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already
 explains how a potential solution should look so that it is easier to fix.
@@ -275,7 +275,7 @@ Once an example script works, please make sure to add a comprehensive `README.md`

 If you are contributing to the official training examples, please also make sure to add a test to [examples/test_examples.py](https://github.com/huggingface/diffusers/blob/main/examples/test_examples.py). This is not necessary for non-official training examples.

-### 8. Fixing a "Good second issue"
+### 8. Fixing a `Good second issue`

 *Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are
 usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).

docs/source/en/index.mdx

Lines changed: 2 additions & 2 deletions
@@ -73,7 +73,7 @@ The library has three main components:
 | [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing|
 | [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
 | [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
-| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
+| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation |
 | [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
 | [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing |
@@ -90,4 +90,4 @@ The library has three main components:
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
-| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
+| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |

docs/source/en/training/text_inversion.mdx

Lines changed: 41 additions & 4 deletions
@@ -157,24 +157,61 @@ If you're interested in following along with your model training progress, you c

 ## Inference

-Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`]. Make sure you include the `placeholder_token` in your prompt, in this case, it is `<cat-toy>`.
+Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`].
+
+The textual inversion script will by default only save the textual inversion embedding vector(s) that have
+been added to the text encoder embedding matrix and consequently been trained.

 <frameworkcontent>
 <pt>
+<Tip>
+
+💡 The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library).
+Instead of training textual inversion embeddings from scratch you can also see whether a fitting textual inversion embedding has already been added to the libary.
+
+</Tip>
+
+To load the textual inversion embeddings you first need to load the base model that was used when training
+your textual inversion embedding vectors. Here we assume that [`runwayml/stable-diffusion-v1-5`](runwayml/stable-diffusion-v1-5)
+was used as a base model so we load it first:
 ```python
 from diffusers import StableDiffusionPipeline
+import torch

-model_id = "path-to-your-trained-model"
+model_id = "runwayml/stable-diffusion-v1-5"
 pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+```

-prompt = "A <cat-toy> backpack"
+Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`]
+function. Here we'll load the embeddings of the "<cat-toy>" example from before.
+```python
+pipe.load_textual_inversion("sd-concepts-library/cat-toy")
+```

-image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
+Now we can run the pipeline making sure that the placeholder token `<cat-toy>` is used in our prompt.

+```python
+prompt = "A <cat-toy> backpack"
+
+image = pipe(prompt, num_inference_steps=50).images[0]
 image.save("cat-backpack.png")
 ```
+
+The function [`TextualInversionLoaderMixin.load_textual_inversion`] can not only
+load textual embedding vectors saved in Diffusers' format, but also embedding vectors
+saved in [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format.
+To do so, you can first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387)
+and then load it locally:
+```python
+pipe.load_textual_inversion("./charturnerv2.pt")
+```
 </pt>
 <jax>
+Currently there is no `load_textual_inversion` function for Flax so one has to make sure the textual inversion
+embedding vector is saved as part of the model after training.
+
+The model can then be run just like any other Flax model:
+
 ```python
 import jax
 import numpy as np

docs/source/en/tutorials/basic_training.mdx

Lines changed: 1 addition & 1 deletion
@@ -344,7 +344,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce

 ...         # Sample a random timestep for each image
 ...         timesteps = torch.randint(
-...             0, noise_scheduler.num_train_timesteps, (bs,), device=clean_images.device
+...             0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device
 ...         ).long()

 ...         # Add noise to the clean images according to the noise magnitude at each timestep
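
This hunk is one instance of a pattern applied throughout the commit: hyperparameters are read from an object's `config` rather than as direct attributes. A minimal sketch of the same access pattern in isolation (the scheduler choice is only illustrative):

```python
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Preferred access: read the value from the frozen config namespace.
print(noise_scheduler.config.num_train_timesteps)  # 1000
```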

docs/source/en/using-diffusers/contribute_pipeline.mdx

Lines changed: 2 additions & 2 deletions
@@ -62,7 +62,7 @@ class UnetSchedulerOneForwardPipeline(DiffusionPipeline):

     def __call__(self):
         image = torch.randn(
-            (1, self.unet.in_channels, self.unet.sample_size, self.unet.sample_size),
+            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
         )
         timestep = 1

@@ -108,7 +108,7 @@ class UnetSchedulerOneForwardPipeline(DiffusionPipeline):

     def __call__(self):
         image = torch.randn(
-            (1, self.unet.in_channels, self.unet.sample_size, self.unet.sample_size),
+            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
         )
         timestep = 1
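
Both hunks apply the same `.config` pattern to the UNet. A small, self-contained sketch of reading the shape-related values that way (the toy UNet configuration is only for illustration; real pipelines load a pretrained UNet):

```python
import torch
from diffusers import UNet2DModel

# Toy UNet built locally so the snippet needs no download.
unet = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)

# Shape arguments come from the UNet's config, the supported way to read them.
noise = torch.randn(
    (1, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)
)
print(noise.shape)  # torch.Size([1, 3, 64, 64])
```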

docs/source/en/using-diffusers/custom_pipeline_overview.mdx

Lines changed: 3 additions & 1 deletion
@@ -89,7 +89,9 @@ class MyPipeline(DiffusionPipeline):
     @torch.no_grad()
     def __call__(self, batch_size: int = 1, num_inference_steps: int = 50):
         # Sample gaussian noise to begin loop
-        image = torch.randn((batch_size, self.unet.in_channels, self.unet.sample_size, self.unet.sample_size))
+        image = torch.randn(
+            (batch_size, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size)
+        )

         image = image.to(self.device)

docs/source/en/using-diffusers/loading.mdx

Lines changed: 20 additions & 1 deletion
@@ -123,7 +123,7 @@ stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, safety_checker=Non

 ### Reuse components across pipelines

-You can also reuse the same components in multiple pipelines without loading the weights into RAM twice. Use the [`DiffusionPipeline.components`] method to save the components in `components`:
+You can also reuse the same components in multiple pipelines to avoid loading the weights into RAM twice. Use the [`~DiffusionPipeline.components`] method to save the components:

 ```python
 from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
@@ -140,6 +140,25 @@ Then you can pass the `components` to another pipeline without reloading the wei
 stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(**components)
 ```

+You can also pass the components individually to the pipeline if you want more flexibility over which components to reuse or disable. For example, to reuse the same components in the text-to-image pipeline, except for the safety checker and feature extractor, in the image-to-image pipeline:
+
+```py
+from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)
+stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(
+    vae=stable_diffusion_txt2img.vae,
+    text_encoder=stable_diffusion_txt2img.text_encoder,
+    tokenizer=stable_diffusion_txt2img.tokenizer,
+    unet=stable_diffusion_txt2img.unet,
+    scheduler=stable_diffusion_txt2img.scheduler,
+    safety_checker=None,
+    feature_extractor=None,
+    requires_safety_checker=False,
+)
+```
+
 ## Checkpoint variants

 A checkpoint variant is usually a checkpoint where it's weights are:
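
The second hunk above builds the img2img pipeline by passing each component explicitly; for comparison, here is a hedged sketch of the dictionary-based `components` shortcut that the surrounding file describes (the full code between the two hunks is not shown in this diff):

```python
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)

# .components returns a dict of every registered module (vae, unet, scheduler, ...),
# so a second pipeline can be built without loading the weights again.
components = stable_diffusion_txt2img.components
stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(**components)
```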

examples/community/bit_diffusion.py

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ def __call__(
         **kwargs,
     ) -> Union[Tuple, ImagePipelineOutput]:
         latents = torch.randn(
-            (batch_size, self.unet.in_channels, height, width),
+            (batch_size, self.unet.config.in_channels, height, width),
             generator=generator,
         )
         latents = decimal_to_bits(latents) * self.bit_scale

examples/community/clip_guided_stable_diffusion.py

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ def __call__(
         # Unlike in other pipelines, latents need to be generated in the target device
         # for 1-to-1 results reproducibility with the CompVis implementation.
         # However this currently doesn't work in `mps`.
-        latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8)
+        latents_shape = (batch_size * num_images_per_prompt, self.unet.config.in_channels, height // 8, width // 8)
         latents_dtype = text_embeddings.dtype
         if latents is None:
             if self.device.type == "mps":

examples/community/clip_guided_stable_diffusion_img2img.py

Lines changed: 1 addition & 1 deletion
@@ -414,7 +414,7 @@ def __call__(
         # Unlike in other pipelines, latents need to be generated in the target device
         # for 1-to-1 results reproducibility with the CompVis implementation.
         # However this currently doesn't work in `mps`.
-        latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8)
+        latents_shape = (batch_size * num_images_per_prompt, self.unet.config.in_channels, height // 8, width // 8)
         latents_dtype = text_embeddings.dtype
         if latents is None:
             if self.device.type == "mps":

0 commit comments
