[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components #9840
Changes from all commits
35b4cf2
ec4d422
2afa9b0
3679ebd
79633ee
876cd13
a28c702
ad1584d
34d0925
d713c41
e9ef6ea
6ce560e
329b32e
2f6b07d
fdeb500
53bc502
f81b71e
8e1b6f5
e3e3a96
9e9561b
2ddcbf1
5130cc3
e76f93a
1963b5c
a799ba8
7d47364
ebfec45
1fe8a79
f05d81d
6e17cad
ea09eb2
1779093
6ff53e3
7b73dc2
729acea
3d3aab4
c033816
b5cffab
662868b
3fc15fe
Diff: src/diffusers/pipelines/pipeline_utils.py

```diff
@@ -66,7 +66,6 @@
 if is_torch_npu_available():
     import torch_npu  # noqa: F401
 
-
 from .pipeline_loading_utils import (
     ALL_IMPORTABLE_CLASSES,
     CONNECTED_PIPES_KEYS,
@@ -388,6 +387,7 @@ def to(self, *args, **kwargs):
         )
 
         device = device or device_arg
+        pipeline_has_bnb = any(any((_check_bnb_status(module))) for _, module in self.components.items())
 
         # throw warning if pipeline is in "offloaded"-mode but user tries to manually set to GPU.
         def module_is_sequentially_offloaded(module):
@@ -410,10 +410,16 @@ def module_is_offloaded(module):
         pipeline_is_sequentially_offloaded = any(
             module_is_sequentially_offloaded(module) for _, module in self.components.items()
         )
-        if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":
-            raise ValueError(
-                "It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."
-            )
+        if device and torch.device(device).type == "cuda":
+            if pipeline_is_sequentially_offloaded and not pipeline_has_bnb:
```

Inline review thread on the `if pipeline_is_sequentially_offloaded and not pipeline_has_bnb:` line:
Reviewer: My previous comments apply almost exactly here, so I will just repeat them. About the error message you want to throw against this scenario: if these two conditions are met (older `accelerate` version + bnb), it will not reach the error message you intended; it will be caught here at this first check, and the error message is the same as before this PR (about offloading). Can you do this? #9840 (comment) If not, please remove the changes to …

Reviewer: Ok, I was wrong! Will merge.

Author: Sure, that works, but here's my last try. When you have:

```python
model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
)
```

calling `.to("cuda")` does not result in:

"It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."

It will hit the `elif` branch (the `accelerate` version check) instead. To test, you can run the following with an older `accelerate` installed:

```python
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")
```

It throws:

ValueError: You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation.

Isn't this what we expect, or am I missing something?

Reviewer: Yeah, I missed that.

Author: Saw your comment. Thanks for bearing with me :)
The new block in the diff continues:

```diff
+                raise ValueError(
+                    "It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."
+                )
+            # PR: https://github.com/huggingface/accelerate/pull/3223/
+            elif pipeline_has_bnb and is_accelerate_version("<", "1.1.0.dev0"):
+                raise ValueError(
+                    "You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation."
+                )
 
         is_pipeline_device_mapped = self.hf_device_map is not None and len(self.hf_device_map) > 1
         if is_pipeline_device_mapped:
```
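For completeness, here is a hedged sketch of the placement this change is meant to allow: with `accelerate` at or above `1.1.0.dev0`, the bnb-quantized pipeline from the thread above can be moved to CUDA directly. The model IDs come from the test snippet above; the prompt and step count are illustrative assumptions, not taken from the PR.

```python
# Sketch only: direct CUDA placement of a pipeline with bitsandbytes-quantized
# components, assuming accelerate >= 1.1.0.dev0 is installed.
import torch
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")

pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")  # with an older accelerate this raises the "please upgrade" ValueError instead

# Illustrative inference call; prompt and settings are made up.
image = pipeline_4bit("a photo of a dog", num_inference_steps=10).images[0]
image.save("flux_nf4.png")
```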
Conversation

Reviewer: It seems to have some overlapping logic with the code just a little bit below this, no? (diffusers/src/diffusers/pipelines/pipeline_utils.py, line 444 at 6db3333)
Author: Good point. However, the LoC you pointed out is relevant when we're transferring an 8-bit quantized model from one device to the other. It's a log to let the users know that this model has already been placed on a GPU and will remain so; requesting to put it on a CPU will be ineffective. We call `self.to("cpu")` when doing `enable_model_cpu_offload()` (diffusers/src/diffusers/pipelines/pipeline_utils.py, line 1039 at 963ffca). So this kind of log becomes informative in the context of using `enable_model_cpu_offload()`, for example. This PR, however, allows users to move an entire pipeline to a GPU when the memory permits, which previously wasn't possible. So maybe this apparent overlap is justified. LMK.
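To make the contrast concrete, here is a minimal sketch, assuming a generic checkpoint and dtype, of the two flows being discussed: `enable_model_cpu_offload()`, whose internal `self.to("cpu")` call is where the "already on GPU" log for 8-bit models is informative, versus moving the whole pipeline to the GPU up front.

```python
# Minimal sketch of the two placement flows discussed above; checkpoint and dtype
# are illustrative assumptions.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16)

# Flow 1: model CPU offload. enable_model_cpu_offload() first calls self.to("cpu")
# internally, then moves each component to the accelerator only while it is in use.
pipe.enable_model_cpu_offload()

# Flow 2 (the placement discussed in this PR): skip offloading and move the entire
# pipeline to the GPU up front when memory permits. Use one flow or the other,
# not both on the same pipeline object:
# pipe.to("cuda")
```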
Reviewer: Did I miss something? This PR adds a check which throws a ValueError under certain conditions; it doesn't enable a new use case like you described here, no?
Author: Well, the enablement comes from the `accelerate` fix huggingface/accelerate#3223, and this PR adds a check for that, as you described. Sorry for the wrong order of words 😅 If you have other comments on the PR, happy to address them.
Reviewer: My previous comment stands: it has overlapping logic with the other checks you have below and is very, very confusing. You're not enabling a new use case here; this PR corrects a previously wrong error message and lets the user take the correct action. I would simply update the warning message here to add the other possible scenario, namely that they are trying to call `to("cuda")` on a quantized model without offloading and need to upgrade `accelerate` in order to do that. (diffusers/src/diffusers/pipelines/pipeline_utils.py, line 426 at 8421c14)
Author: What was the wrong error message? IIUC, the line you're pointing to has nothing to do with the changes introduced in this PR and has been in the codebase for quite a while. The problem line (fixed by the `accelerate` PR) was this one: diffusers/src/diffusers/pipelines/pipeline_utils.py, line 413 at c10f875.

So, what I have done in 1779093 is as follows. Updated the condition for the error message:

"You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation."

to the condition shown in the diff above, so that it now also considers when the pipeline is not offloaded. Additionally, the check at diffusers/src/diffusers/pipelines/pipeline_utils.py, line 446 at 8421c14 now also considers if the pipeline is not offloaded (diffusers/src/diffusers/pipelines/pipeline_utils.py, line 460 at 1779093).
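As a summary of the resolved behavior, here is a hedged, simplified sketch of the check ordering described above. The boolean arguments stand in for the pipeline's real helpers (`_check_bnb_status`, `module_is_sequentially_offloaded`); this is an illustration, not the actual code in `pipeline_utils.py`.

```python
# Simplified illustration of the check ordering discussed in this thread; not the
# actual implementation. Assumes diffusers exposes is_accelerate_version in
# diffusers.utils, as used in the diff above.
import torch
from diffusers.utils import is_accelerate_version


def validate_cuda_move(device, sequentially_offloaded: bool, has_bnb: bool) -> None:
    """Raise if moving the pipeline to `device` is not supported."""
    if device is None or torch.device(device).type != "cuda":
        return  # only CUDA placements are gated

    # Do not raise the offloading error when the pipeline has bnb-quantized
    # components; those are handled by the accelerate version check below.
    if sequentially_offloaded and not has_bnb:
        raise ValueError("Sequential CPU offload is active; move the pipeline to 'cpu' instead.")

    # bnb-quantized components can only be moved to CUDA directly with a recent
    # accelerate (see https://github.com/huggingface/accelerate/pull/3223/).
    if has_bnb and is_accelerate_version("<", "1.1.0.dev0"):
        raise ValueError("Please upgrade `accelerate` to move a bnb-quantized pipeline to CUDA.")
```

In the real `to()` method these flags are computed from `self.components`, as shown in the diff at the top of this page.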