Now that we have a working PoC (#9165) of NF4 quantization through bitsandbytes, and another through optimum.quanto, it's time to bring quantization into diffusers more formally 🎸

In this issue, I want to devise a rough plan of attack for the integration. We will start with bitsandbytes and then gradually grow the list of supported quantizers based on community interest. This integration will also let us do LoRA fine-tuning of large models like Flux through peft (guide).
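For context, the bitsandbytes route amounts to swapping `torch.nn.Linear` modules for 4-bit counterparts. A minimal, self-contained sketch with toy shapes (this mirrors the bitsandbytes docs, not the PoC code):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

# A toy fp32 model standing in for a large transformer.
fp_model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))

# Same architecture, but with 4-bit linear layers; quant_type="nf4"
# selects the NormalFloat4 data type used in the PoC.
nf4_model = nn.Sequential(
    bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16, quant_type="nf4"),
    bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16, quant_type="nf4"),
)

# Weights are quantized to NF4 when the module is moved to the GPU.
nf4_model.load_state_dict(fp_model.state_dict())
nf4_model = nf4_model.to("cuda")

x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
print(nf4_model(x).shape)
```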
Three PRs are expected:

- Introduce a base quantization config class like we have in `transformers` (see the config sketch after this list).
- Introduce `bitsandbytes`-related utilities to handle the processing and post-processing of layers needed to inject `bitsandbytes` layers. An example is here (see the injection sketch below).
- Introduce a `bitsandbytes` config (example) and a quantization loader mixin, aka `QuantizationLoaderMixin`. This loader will enable passing a quantization config to `from_pretrained()` of a `ModelMixin` and will handle modifying and preparing the model for the provided quantization config. It will also allow us to serialize the model according to the quantization config (see the usage sketch below).
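To make the first PR concrete, here is a rough sketch of what the config hierarchy could look like. All class and field names are hypothetical; they only mirror the transformers design, and the actual PR may differ:

```python
from dataclasses import dataclass

import torch


@dataclass
class QuantizationConfigMixin:
    """Hypothetical base class every quantizer-specific config derives
    from, mirroring transformers' QuantizationConfigMixin."""

    quant_method: str = "none"

    def to_dict(self) -> dict:
        # Serialized alongside the model so from_pretrained() can
        # re-create the quantized modules on load.
        return self.__dict__.copy()


@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
    """Hypothetical bitsandbytes config; field names follow the
    transformers equivalent."""

    quant_method: str = "bitsandbytes"
    load_in_4bit: bool = False
    bnb_4bit_quant_type: str = "nf4"
    bnb_4bit_compute_dtype: torch.dtype = torch.float16
    bnb_4bit_use_double_quant: bool = False
```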
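For the second PR, the layer-injection utility could look roughly like this. The function name follows the transformers convention and is an assumption for diffusers:

```python
import bitsandbytes as bnb
import torch
import torch.nn as nn


def replace_with_bnb_linear(
    model: nn.Module,
    compute_dtype: torch.dtype = torch.float16,
    quant_type: str = "nf4",
) -> nn.Module:
    """Recursively swap every nn.Linear for bnb.nn.Linear4bit.

    Simplified sketch: the real utility would also honor a skip-list
    (e.g. keeping certain projections in higher precision) and handle
    already-quantized checkpoints."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            setattr(
                model,
                name,
                bnb.nn.Linear4bit(
                    module.in_features,
                    module.out_features,
                    bias=module.bias is not None,
                    compute_dtype=compute_dtype,
                    quant_type=quant_type,
                ),
            )
        else:
            # Recurse into child modules (attention blocks, MLPs, ...).
            replace_with_bnb_linear(module, compute_dtype, quant_type)
    return model
```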
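Finally, once the loader mixin from the third PR is in place, the end-user API could look like the following, using the hypothetical `BitsAndBytesConfig` sketched above. Passing `quantization_config` to `from_pretrained()` is the envisioned interface, not a shipped one:

```python
import torch
from diffusers import FluxTransformer2DModel

# Envisioned API: pass the quantization config straight to
# from_pretrained() of a ModelMixin subclass.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

# Serialization should round-trip the NF4 weights.
transformer.save_pretrained("flux-transformer-nf4")
```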
Notes:

- We could have done this with `accelerate` (guide), but it doesn't yet support NF4 serialization (see the sketch after this list).
- Good example PR: Add TorchAOHfQuantizer (transformers#32306).