[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components #9840


Merged: 40 commits into main from allow-device-placement-bnb on Dec 4, 2024

Conversation

sayakpaul (Member)

What does this PR do?

When a pipeline is loaded with models that have a quantization config, we should still be able to call `to("cuda")` on the pipeline object. On GPUs with enough memory (such as a 4090), this has clear performance benefits over CPU offloading (as demonstrated below).

| Model CPU Offload | Batch Size | Time (seconds) | Memory (GB) |
|---|---|---|---|
| False | 1 | 19.316 | 14.935 |
| True | 1 | 36.746 | 12.139 |
| False | 4 | 80.665 | 20.576 |
| True | 4 | 98.612 | 12.138 |

Flux.1 Dev, steps: 30

Currently, calling `to("cuda")` is not possible, because the quantized component comes with an accelerate hook attached:

from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as BnbConfig
import torch

ckpt_id = "black-forest-labs/FLUX.1-dev"

# Load the T5 text encoder in 4-bit NF4 via bitsandbytes.
text_encoder_2_config = BnbConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
text_encoder_2 = T5EncoderModel.from_pretrained(
    ckpt_id,
    subfolder="text_encoder_2",
    quantization_config=text_encoder_2_config,
    torch_dtype=torch.bfloat16,
)
# accelerate attaches a device-placement hook to the quantized model.
print(text_encoder_2._hf_hook)

has:

AlignDevicesHook(execution_device=0, offload=False, io_same_device=True, offload_buffers=False, place_submodules=True, skip_keys=None)
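
For context, the hook can be inspected directly (a minimal sketch continuing the snippet above; it only looks at the attribute whose repr is printed here and is not the exact check diffusers performs):

from accelerate.hooks import AlignDevicesHook

# The 4-bit model ends up with an AlignDevicesHook attached, the same kind of
# hook that sequential CPU offloading installs, so the pipeline looks offloaded.
hook = getattr(text_encoder_2, "_hf_hook", None)
print(isinstance(hook, AlignDevicesHook))  # True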

This is why this line complains:

if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":

This PR fixes that behavior.
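
With the fix in place (together with the companion accelerate change, huggingface/accelerate#3223), the following pattern, mirroring the repro snippet later in this thread, is expected to work:

from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")

pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")  # no longer raises with a new-enough accelerate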

Benchmarking code:

from diffusers import DiffusionPipeline, FluxTransformer2DModel, BitsAndBytesConfig
from transformers import T5EncoderModel
from transformers import BitsAndBytesConfig as BnbConfig
import torch.utils.benchmark as benchmark
import torch 
import fire

# Time f(*args, **kwargs) with torch.utils.benchmark and report the mean runtime in seconds.
def benchmark_fn(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

def load_pipeline(model_cpu_offload=False):
    ckpt_id = "black-forest-labs/FLUX.1-dev"

    transformer_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    transformer = FluxTransformer2DModel.from_pretrained(
        ckpt_id, 
        subfolder="transformer",
        quantization_config=transformer_config,
        torch_dtype=torch.bfloat16
    )

    text_encoder_2_config = BnbConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    text_encoder_2 = T5EncoderModel.from_pretrained(
        ckpt_id,
        subfolder="text_encoder_2",
        quantization_config=text_encoder_2_config,
        torch_dtype=torch.bfloat16
    )

    pipeline = DiffusionPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        text_encoder_2=text_encoder_2,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    # Either offload submodules to the CPU on demand, or place the whole pipeline on the GPU.
    if model_cpu_offload:
        pipeline.enable_model_cpu_offload()
    else:
        pipeline = pipeline.to("cuda")

    pipeline.set_progress_bar_config(disable=True)
    return pipeline

def run_pipeline(pipeline, batch_size=1):
    _ = pipeline(
        prompt="a dog sitting besides a sea", 
        guidance_scale=3.5, 
        max_sequence_length=512, 
        num_inference_steps=30,
        num_images_per_prompt=batch_size
    )


def main(batch_size: int = 1, model_cpu_offload: bool = False):
    pipeline = load_pipeline(model_cpu_offload=model_cpu_offload)

    # Warm-up runs before timing.
    for _ in range(5):
        run_pipeline(pipeline)

    time = benchmark_fn(run_pipeline, pipeline, batch_size)
    # Peak GPU memory over the run, reported in GB.
    memory = torch.cuda.max_memory_allocated() / 1024 / 1024 / 1024
    print(f"{model_cpu_offload=}, {batch_size=} {time=} seconds {memory=} GB.")

    image = pipeline(
        prompt="a dog sitting besides a sea", 
        guidance_scale=3.5, 
        max_sequence_length=512, 
        num_inference_steps=30,
        num_images_per_prompt=1
    ).images[0]
    img_name = f"mco@{model_cpu_offload}-bs@{batch_size}.png"
    image.save(img_name)


if __name__ == "__main__":
    fire.Fire(main)
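
Assuming the script is saved as benchmark_flux_bnb.py (the filename is arbitrary), python-fire exposes main's arguments as CLI flags, e.g.:

python benchmark_flux_bnb.py --batch_size=4 --model_cpu_offload=True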

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul requested review from DN6, yiyixuxu and SunMarc on November 2, 2024 at 04:34
@SunMarc (Member) left a comment:

Thanks for the PR! Left a suggestion.

@sayakpaul (Member Author)

@SunMarc WDYT now?

@sayakpaul requested a review from SunMarc on November 11, 2024 at 11:33
@SunMarc (Member) left a comment:

Thanks for adding this! LGTM! I'll merge the PR on accelerate as well.

@sayakpaul (Member Author)

Have run the integration tests and they are passing.

@SunMarc (Member) commented Nov 18, 2024

> Have run the integration tests and they are passing.

On diffusers?

@sayakpaul (Member Author)

@SunMarc yes, on diffusers. Anywhere else they need to be run?

@SunMarc (Member) commented Nov 18, 2024

No, I read that as a question, my bad ;)

@sayakpaul requested a review from yiyixuxu on November 26, 2024 at 06:11
Comment on lines 427 to 429
pipeline_has_bnb = any(
    (_check_bnb_status(module)[1] or _check_bnb_status(module)[-1]) for _, module in self.components.items()
)
Collaborator:

IMO cleaner.

Suggested change, from:

pipeline_has_bnb = any(
    (_check_bnb_status(module)[1] or _check_bnb_status(module)[-1]) for _, module in self.components.items()
)

to:

pipeline_has_bnb = any(
    any((_check_bnb_status(module))) for _, module in self.components.items()
)
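
A quick illustration of why the suggested spelling covers the original one (a sketch; the exact tuple returned by `_check_bnb_status` is assumed here to be a triple of booleans, which this thread does not spell out):

# Stand-in for _check_bnb_status(module); the assumed shape is
# (is_loaded_in_bnb, is_loaded_in_4bit, is_loaded_in_8bit).
status = (True, True, False)

print(status[1] or status[-1])  # original: looks at two specific flags
print(any(status))              # suggestion: True if any flag is set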

Collaborator:

If this check is placed after the sequential offloading check, placement would still fail right?

Collaborator:

Running the test gives:

E           ValueError: It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading.

src/diffusers/pipelines/pipeline_utils.py:417: ValueError

@sayakpaul (Member Author):

> If this check is placed after the sequential offloading check, placement would still fail right?

I have modified the placement of the logic. Could you check again?

Re. tests, I just ran `pytest tests/quantization/bnb/test_4bit.py::SlowBnb4BitTests` and `pytest tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests` and everything passed.

You need this PR huggingface/accelerate#3223 for this to work.

@sayakpaul requested a review from DN6 on December 2, 2024 at 10:29
@@ -389,6 +392,13 @@ def to(self, *args, **kwargs):

device = device or device_arg

pipeline_has_bnb = any(any((_check_bnb_status(module))) for _, module in self.components.items())
Collaborator:

It seems to have some overlapping logic with the code just a little bit below this, no?

if is_loaded_in_8bit_bnb and device is not None:

@sayakpaul (Member Author) commented Dec 3, 2024:

Good point.

However, the line you pointed out is relevant when we're transferring an 8-bit quantized model from one device to another. It's a log to let users know that the model has already been placed on a GPU and will remain there; requesting to put it on the CPU will be ineffective.

We call `self.to("cpu")` when doing `enable_model_cpu_offload()`:

self.to("cpu", silence_dtype_warnings=True)

So, this kind of log becomes informative in the context of using `enable_model_cpu_offload()`, for example.

This PR, however, allows users to move an entire pipeline to a GPU when the memory permits. Previously it wasn't possible.

So, maybe this apparent overlap is justified. LMK.

Collaborator:

> This PR, however, allows users to move an entire pipeline to a GPU when the memory permits. Previously it wasn't possible.

Did I miss something? This PR adds a check that throws a ValueError under certain conditions; it doesn't enable a new use case like you described here, no?

@sayakpaul (Member Author):

Well, the enablement comes from the accelerate fix huggingface/accelerate#3223, and this PR adds a check for that, as you described. Sorry for the wrong order of words 😅

If you have other comments on the PR happy to address them.

Collaborator:

My previous comment stands: this has overlapping logic with the other checks you have below and is very confusing.

You're not enabling a new use case here; this PR corrects a previously wrong error message and allows users to take the correct action. I would simply update the warning message here to cover the other possible scenario: they are trying to call `to("cuda")` on a quantized model without offloading, and they need to upgrade accelerate in order to do that.

if pipeline_is_offloaded and device and torch.device(device).type == "cuda":

@sayakpaul (Member Author) commented Dec 4, 2024:

> this PR corrects a previously wrong error message

What was the wrong error message?

IIUC, the line you're pointing to has nothing to do with the changes introduced in this PR and has been in the codebase for quite a while.

The problem line (fixed by the accelerate PR) was this:

if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":

So, what I have done in 1779093 is as follows:

Updated the condition of the error message:

"You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation."

to:

if (
      not pipeline_is_offloaded
      and not pipeline_is_sequentially_offloaded
      and pipeline_has_bnb
      and torch.device(device).type == "cuda"
      and is_accelerate_version("<", "1.1.0.dev0")
):

This now also covers the case where the pipeline is not offloaded. Additionally,

f"The module '{module.__class__.__name__}' has been loaded in `bitsandbytes` 8bit and moving it to {device} via `.to()` is not supported. Module is still on {module.device}."

now also takes into account whether the pipeline is offloaded:

if is_loaded_in_8bit_bnb and not is_offloaded and device is not None:

and torch.device(device).type == "cuda"
and is_accelerate_version("<", "1.1.0.dev0")
):
raise ValueError(
Collaborator:

This is the error message you want to throw for this scenario, no?

  1. accelerate < 1.1.0.dev0
  2. you call pipeline.to("cuda") on a pipeline that has bnb

But if these two conditions are met (older accelerate version + bnb):

  1. `not pipeline_is_sequentially_offloaded` will be `False` here, so you will not reach the ValueError
  2. you will reach this check first and get an error message; this is the wrong error message I was talking about:
    if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":
 if (
            not pipeline_is_offloaded
            and not pipeline_is_sequentially_offloaded
            and pipeline_has_bnb
            and torch.device(device).type == "cuda"
            and is_accelerate_version("<", "1.1.0.dev0")
        ):

@sayakpaul (Member Author):

Yeah, this makes a ton of sense. Thanks for the elaborate clarification. I have reflected this in my latest commits.

I have also run most of the SLOW tests and they are passing. This is to ensure existing functionality doesn't break with the current changes.

LMK.

"It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."
)
if device and torch.device(device).type == "cuda":
if pipeline_is_sequentially_offloaded and not pipeline_has_bnb:
Collaborator:

My previous comment applies almost exactly here, so I will just repeat it: #9840

The error message you want to throw for this scenario:

  • accelerate < 1.1.0.dev0
  • you call pipeline.to("cuda") on a pipeline that has bnb

If these two conditions are met (older accelerate version + bnb), it will not reach the error message you intended; it will be caught here at this first check, and the error message is the same as before this PR (about offloading).

Can you do this? #9840 (comment)

If not, please remove the changes to pipeline_utils.py and we can merge (I will work on it in a separate PR). I think the added tests are fine without the changes: if the accelerate version is new, it is not affected by the changes in this PR; if it is not, it throws a different error, that's all.

Collaborator:

OK, I was wrong! Will merge.

@sayakpaul (Member Author) commented Dec 4, 2024:

Sure, that works, but here's my last try.

> If these two conditions are met (older accelerate version + bnb), it will not reach the error message you intended; it will be caught here at this first check, and the error message is the same as before this PR (about offloading).

When you have:

from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
)

In `if pipeline_is_sequentially_offloaded and not pipeline_has_bnb`, `pipeline_is_sequentially_offloaded` will be True (older accelerate version); however, `not pipeline_has_bnb` will be False (as expected). So, the following error won't be raised:

"It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."

And it will hit the else branch instead.
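
In other words, the control flow being described looks roughly like this (a simplified sketch, not the exact pipeline_utils.py code; names and error texts follow the snippets quoted earlier in this thread):

if device and torch.device(device).type == "cuda":
    if pipeline_is_sequentially_offloaded and not pipeline_has_bnb:
        # sequential offloading without bnb: keep the original offloading error
        raise ValueError("... enable_sequential_cpu_offload ... not compatible with offloading ...")
    elif pipeline_has_bnb and is_accelerate_version("<", "1.1.0.dev0"):
        # bnb components with an old accelerate: ask the user to upgrade
        raise ValueError("... upgrade your `accelerate` installation ...")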

To test, you can run the following with accelerate 1.0.1:

from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch 

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")

It throws:

ValueError: You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation.

Isn't this what we expect, or am I missing something?

Collaborator:

Yeah, I missed that `not pipeline_has_bnb` in the statement; it works.

@sayakpaul (Member Author):

Saw your comment. Thanks for bearing with me :)

@sayakpaul merged commit e8da75d into main on Dec 4, 2024
18 checks passed
@sayakpaul deleted the allow-device-placement-bnb branch on December 4, 2024 at 16:57
sayakpaul added a commit that referenced this pull request Dec 23, 2024
[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components (#9840)

* allow device placement when using bnb quantization.

* warning.

* tests

* fixes

* docs.

* require accelerate version.

* remove print.

* revert to()

* tests

* fixes

* fix: missing AutoencoderKL lora adapter (#9807)

* fix: missing AutoencoderKL lora adapter

* fix

---------

Co-authored-by: Sayak Paul <[email protected]>

* fixes

* fix condition test

* updates

* updates

* remove is_offloaded.

* fixes

* better

* empty

---------

Co-authored-by: Emmanuel Benazera <[email protected]>