[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components #9840
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the PR! Left a suggestion.
@SunMarc WDYT now?
Thanks for adding this! LGTM! I'll merge the PR on accelerate also.
Have run the integration tests and they are passing.
@SunMarc yes, on
No, I read that as a question, my bad ;)
pipeline_has_bnb = any(
    (_check_bnb_status(module)[1] or _check_bnb_status(module)[-1]) for _, module in self.components.items()
)
IMO cleaner.
-pipeline_has_bnb = any(
-    (_check_bnb_status(module)[1] or _check_bnb_status(module)[-1]) for _, module in self.components.items()
-)
+pipeline_has_bnb = any(
+    any((_check_bnb_status(module))) for _, module in self.components.items()
+)
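For illustration, a minimal self-contained sketch of why the nested `any()` reads the same, assuming `_check_bnb_status` returns a tuple of boolean flags (the stub below is hypothetical and not the real diffusers helper):

```python
# Hypothetical stub standing in for diffusers' internal _check_bnb_status helper,
# assumed to return a tuple of boolean flags: (loaded_in_bnb, loaded_in_4bit, loaded_in_8bit).
def _check_bnb_status_stub(module):
    in_4bit = getattr(module, "is_loaded_in_4bit", False)
    in_8bit = getattr(module, "is_loaded_in_8bit", False)
    return (in_4bit or in_8bit, in_4bit, in_8bit)


class _FakeQuantizedModule:
    is_loaded_in_4bit = True
    is_loaded_in_8bit = False


components = {"transformer": _FakeQuantizedModule(), "vae": object()}

# Indexing individual tuple positions ...
verbose = any(
    (_check_bnb_status_stub(m)[1] or _check_bnb_status_stub(m)[-1]) for _, m in components.items()
)
# ... versus collapsing the whole tuple with a nested any()
concise = any(any(_check_bnb_status_stub(m)) for _, m in components.items())
assert verbose == concise
```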
If this check is placed after the sequential offloading check, placement would still fail right?
Running the test gives:
E ValueError: It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading.
src/diffusers/pipelines/pipeline_utils.py:417: ValueError
If this check is placed after the sequential offloading check, placement would still fail right?
I have modified the placement of the logic. Could you check again?
Re. tests, I just ran `pytest tests/quantization/bnb/test_4bit.py::SlowBnb4BitTests` and `pytest tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests` and everything passed.
You need this PR huggingface/accelerate#3223 for this to work.
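For anyone following along, a small hedged sketch (not part of this PR) for checking whether the installed `accelerate` is recent enough; the threshold mirrors the `is_accelerate_version("<", "1.1.0.dev0")` guard discussed below:

```python
# A hedged sketch: verify the installed accelerate version before relying on direct
# CUDA placement of bnb-quantized pipelines. The 1.1.0.dev0 threshold mirrors the
# guard used in this PR; upgrade (or install accelerate from source) if it fails.
import accelerate
from packaging import version

if version.parse(accelerate.__version__) < version.parse("1.1.0.dev0"):
    raise RuntimeError(
        "accelerate is too old for `.to('cuda')` on a bnb-quantized pipeline; "
        "please upgrade accelerate (it needs the huggingface/accelerate#3223 fix)."
    )
```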
@@ -389,6 +392,13 @@ def to(self, *args, **kwargs):

        device = device or device_arg

        pipeline_has_bnb = any(any((_check_bnb_status(module))) for _, module in self.components.items())
It seems to have some overlapping logic with the code just a little bit below this, no?
if is_loaded_in_8bit_bnb and device is not None:
Good point.
However, the LoC you pointed out is relevant when we're transferring an 8-bit quantized model from one device to another. It's a log to let users know that the model has already been placed on a GPU and will remain so; requesting to put it on a CPU will be ineffective.
We call `self.to("cpu")` when doing `enable_model_cpu_offload()`:
self.to("cpu", silence_dtype_warnings=True)
So, this kind of log becomes informative in the context of using `enable_model_cpu_offload()`, for example.
This PR, however, allows users to move an entire pipeline to a GPU when memory permits. Previously this wasn't possible.
So, maybe this apparent overlap is justified. LMK.
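For context, a rough self-contained sketch of the per-module behavior described above (a hypothetical standalone function; the real logic lives inside `DiffusionPipeline.to()` and may differ in detail):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def maybe_move_module(module: torch.nn.Module, device, is_loaded_in_8bit_bnb: bool, is_offloaded: bool):
    # An 8-bit bnb module stays on its current device; the user only gets an informative
    # log, which is useful e.g. when enable_model_cpu_offload() calls self.to("cpu").
    if is_loaded_in_8bit_bnb and not is_offloaded and device is not None:
        logger.warning(
            f"The module '{module.__class__.__name__}' has been loaded in `bitsandbytes` 8bit "
            f"and moving it to {device} via `.to()` is not supported."
        )
        return module
    return module.to(device)
```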
This PR, however, allows users to move an entire pipeline to a GPU when the memory permits. Previously it wasn't possible.
Did I miss something?
This PR adds a check that throws a ValueError under certain conditions; it doesn't enable a new use case like you described here, no?
Well, the enablement comes from the `accelerate` fix huggingface/accelerate#3223, and this PR adds a check for that, as you described. Sorry for the wrong order of words 😅
If you have other comments on the PR, I'm happy to address them.
My previous comment stands: it has overlapping logic with other checks you have below and is very confusing.
You're not enabling a new use case here; this PR corrects a previously wrong error message and allows the user to take the correct action. I would simply update the warning message here to cover the other possible scenario: they are trying to call `to("cuda")` on a quantized model without offloading and need to upgrade `accelerate` in order to do that.
if pipeline_is_offloaded and device and torch.device(device).type == "cuda":
this PR correct a previous wrong error message
What was the wrong error message?
IIUC the line you're pointing to has nothing to do with the changes introduced in this PR and has been in the codebase for quite a while.
The problem line (fixed by the `accelerate` PR) was this:
if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":
So, what I have done in commit 1779093 is as follows:
Updated the condition that raises the error message:
"You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation."
to:
if (
not pipeline_is_offloaded
and not pipeline_is_sequentially_offloaded
and pipeline_has_bnb
and torch.device(device).type == "cuda"
and is_accelerate_version("<", "1.1.0.dev0")
):
This now also considers the case when the pipeline is not offloaded. Additionally, the warning
f"The module '{module.__class__.__name__}' has been loaded in `bitsandbytes` 8bit and moving it to {device} via `.to()` is not supported. Module is still on {module.device}."
is now also gated on the pipeline not being offloaded:
if is_loaded_in_8bit_bnb and not is_offloaded and device is not None:
    and torch.device(device).type == "cuda"
    and is_accelerate_version("<", "1.1.0.dev0")
):
    raise ValueError(
The error message you want to throw is for this scenario, no?
- accelerate < 1.1.0.dev0
- you call `pipeline.to("cuda")` on a pipeline that has bnb

But if these 2 conditions are met (older accelerate version + bnb), `not pipeline_is_sequentially_offloaded` will be `False` here and you will not reach the ValueError; you will reach this check first and get an error message. This is the wrong error message I was talking about:
if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":
if (
not pipeline_is_offloaded
and not pipeline_is_sequentially_offloaded
and pipeline_has_bnb
and torch.device(device).type == "cuda"
and is_accelerate_version("<", "1.1.0.dev0")
):
Yeah this makes a ton of sense. Thanks for the elaborate clarification. I have reflected this in my latest commits.
I have also run most of the SLOW tests and they are passing. This is to ensure existing functionality doesn't break with the current changes.
LMK.
"It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading." | ||
) | ||
if device and torch.device(device).type == "cuda": | ||
if pipeline_is_sequentially_offloaded and not pipeline_has_bnb: |
My previous comment applies almost exactly here, so I will just repeat it:
#9840
The error message you want to throw is for this scenario:
- accelerate < 1.1.0.dev0
- you call `pipeline.to("cuda")` on a pipeline that has bnb

If these 2 conditions are met (older accelerate version + bnb), it will not reach the error message you intended; it will be caught at this first check, and the error message is the same as before this PR (about offloading).
Can you do this? #9840 (comment)
If not, please remove the changes to `pipeline_utils.py` and we can merge (I will work on it in a separate PR). I think the added tests are fine without the changes: if the accelerate version is new, it is not affected by the changes in this PR; if it is not, it throws a different error, that's all.
ok I was wrong! will merge
Sure, that works, but here's my last try.
If these 2 conditions are met (older accelerate version + bnb), it will not reach the error message you intended; it will be caught at this first check, and the error message is the same as before this PR (about offloading).
When you have:
model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
text_encoder_2=t5_4bit,
transformer=transformer_4bit,
torch_dtype=torch.float16,
)
In `if pipeline_is_sequentially_offloaded and not pipeline_has_bnb`, `pipeline_is_sequentially_offloaded` will be True (older accelerate version); however, `not pipeline_has_bnb` will be False (as expected). So, the following error won't be raised:
"It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."
And it will hit the `else`.
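Concretely, the intended flow is roughly the following (a sketch with boolean flags standing in for the real module inspection; the exact merged code may differ):

```python
def check_cuda_placement(
    device_type: str,
    pipeline_is_sequentially_offloaded: bool,
    pipeline_has_bnb: bool,
    accelerate_is_old: bool,  # stands in for is_accelerate_version("<", "1.1.0.dev0")
) -> None:
    if device_type != "cuda":
        return
    # The offloading error is now gated on `not pipeline_has_bnb` ...
    if pipeline_is_sequentially_offloaded and not pipeline_has_bnb:
        raise ValueError(
            "It seems like you have activated sequential model offloading by calling "
            "`enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU."
        )
    # ... so the bnb + old-accelerate case falls through to the dedicated upgrade error.
    elif pipeline_has_bnb and accelerate_is_old:
        raise ValueError(
            "You are trying to call `.to('cuda')` on a pipeline that has models quantized with "
            "`bitsandbytes`. Your current `accelerate` installation does not support it. "
            "Please upgrade the installation."
        )
```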
To test, you can run the following with accelerate 1.0.1:
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch
model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
text_encoder_2=t5_4bit,
transformer=transformer_4bit,
torch_dtype=torch.float16,
).to("cuda")
It throws:
ValueError: You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation.
Isn't this what we expect or am I missing something?
Yeah, I missed that `not pipeline_has_bnb` in the statement; it works.
Saw your comment. Thanks for beating it with me :)
[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components (#9840)
* allow device placement when using bnb quantization.
* warning.
* tests
* fixes
* docs.
* require accelerate version.
* remove print.
* revert to()
* tests
* fixes
* fix: missing AutoencoderKL lora adapter (#9807)
  * fix: missing AutoencoderKL lora adapter
  * fix
  ---------
  Co-authored-by: Sayak Paul <[email protected]>
* fixes
* fix condition test
* updates
* updates
* remove is_offloaded.
* fixes
* better
* empty
---------
Co-authored-by: Emmanuel Benazera <[email protected]>
What does this PR do?
When a pipeline is loaded with models that have a quantization config, we should still be able to call `to("cuda")` on the pipeline object. For GPUs with enough memory (such as a 4090), this has performance benefits (as demonstrated below).
Flux.1 Dev, steps: 30
Currently, calling `to("cuda")` is not possible, which is why this line complains:
diffusers/src/diffusers/pipelines/pipeline_utils.py
Line 413 in c10f875
This PR fixes that behavior.
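For reference, a hedged usage sketch of the behavior this PR targets (model ids follow the test snippet above; the prompt and generation arguments are illustrative, and enough GPU memory plus a recent `accelerate` are assumed):

```python
import torch
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")

pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")  # previously raised an offloading-related ValueError; now works with a recent accelerate

image = pipeline_4bit(
    "a photo of a dog in a field",  # illustrative prompt
    num_inference_steps=30,
    guidance_scale=3.5,
).images[0]
image.save("flux_nf4.png")
```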
Benchmarking code: