Description
What API design would you like to have changed or added to the library? Why?
Could the library allow moving every tensor attribute of a scheduler to a CUDA device?
In https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lcm.py
It looks like attributes such as scheduler.alphas_cumprod are tensors on the CPU, while scheduler.set_timesteps() allows placing scheduler.timesteps on a GPU/CUDA device. Doesn't this cause a device mismatch when indexing scheduler.alphas_cumprod with scheduler.timesteps? The debug output below shows the pipeline indexing a CPU tensor (alphas_cumprod) with a GPU tensor (timestep).
I simply added the following lines at the beginning of scheduler.step() to print the type and device of timestep and self.alphas_cumprod:
print("Printing scheduler.step() timestep")
print(type(timestep))
print(isinstance(timestep, torch.Tensor))
print(timestep.device)
print("Printing scheduler.step() self.alphas_cumprod")
print(type(self.alphas_cumprod))
print(isinstance(self.alphas_cumprod, torch.Tensor))
print(self.alphas_cumprod.device)
Output when running text-to-image:
Printing scheduler.step() timestep
<class 'torch.Tensor'>
True
cuda:0
Printing scheduler.step() self.alphas_cumprod
<class 'torch.Tensor'>
True
cpu
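For reference, here is a minimal standalone reproduction of the pattern (the variable names and stand-in values are hypothetical; they mirror what the scheduler does, not the actual diffusers internals):

```python
import torch

# Mimic the scheduler's state: alphas_cumprod is built on the CPU by default.
alphas_cumprod = torch.linspace(1.0, 0.0, 1000)  # stand-in values

# set_timesteps() can place the timestep schedule on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
timesteps = torch.arange(999, -1, -20, device=device)

# Inside step(), a single 0-dim timestep tensor indexes the CPU tensor.
timestep = timesteps[0]
alpha_prod_t = alphas_cumprod[timestep]  # cross-device indexing when timestep is on CUDA
print(alpha_prod_t.device)  # the result lives on alphas_cumprod's device, i.e. cpu
```

PyTorch accepts a 0-dim CUDA tensor as an index into a CPU tensor, which is why this runs without an error, but the index value has to be brought back to the host first, forcing a GPU-to-CPU synchronization on every call.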
What use case would this enable or better enable? Can you give us a code example?
We are using a modified LCMScheduler (99% identical to the original LCMScheduler) for video generation; it generates frames repeatedly in a loop. Most of the time this step does not cause a performance issue, but we have seen intermittent high CPU usage and latency at alpha_prod_t = self.alphas_cumprod[timestep]. The torch.profiler and tracing output also show high latency for this specific step, so we are wondering whether it is the performance bottleneck.
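The workaround we are considering is to move the scheduler's tensor attributes to the sampling device once, before the loop, so that the indexing stays entirely on-device and avoids the per-step host/device synchronization. A sketch of the helper we have in mind (move_scheduler_tensors is a hypothetical name, and reassigning the attributes this way is not an official diffusers API):

```python
import torch

def move_scheduler_tensors(scheduler, device):
    """Move every tensor attribute of `scheduler` to `device` (sketch).

    Assumes tensor attributes are plain instance attributes that can be
    reassigned; non-tensor attributes are left untouched.
    """
    for name, value in vars(scheduler).items():
        if torch.is_tensor(value):
            setattr(scheduler, name, value.to(device))
    return scheduler
```

After calling this with the pipeline's device, self.alphas_cumprod[timestep] inside step() would index two tensors on the same device. Having an equivalent supported by the schedulers themselves (e.g. a scheduler.to(device) method) is what this issue is asking for.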