Description
Is your feature request related to a problem? Please describe.
I found that the `__call__` method of Stable Video Diffusion keeps doing async memcpy between host and device inside the denoising loop, as attached.
Describe the solution you'd like.
The cause is that every time we read `self.do_classifier_free_guidance`, we compare a tensor with an `int` -> get a boolean on the device -> memcpy that boolean from GPU to CPU.
It would be better to evaluate it once and store the result in a local variable before the loop, since the value does not change during the loop.
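To illustrate the idea without needing a GPU, here is a minimal sketch in plain Python. Everything in it is hypothetical and is not the diffusers code: `FakeDeviceBool` stands in for a boolean tensor on the device and counts how often it is truth-tested (which is what would trigger the device-to-host copy), and `SketchPipeline` models only the guidance check.

```python
class FakeDeviceBool:
    """Hypothetical stand-in for a boolean tensor living on the GPU."""

    sync_count = 0  # number of simulated device-to-host copies

    def __init__(self, value):
        self._value = value

    def __bool__(self):
        # Truth-testing a real device tensor forces a device-to-host copy;
        # here we only count how often that would happen.
        FakeDeviceBool.sync_count += 1
        return self._value


class SketchPipeline:
    """Hypothetical pipeline; only the guidance check is modeled."""

    def __init__(self, max_guidance_scale):
        self.max_guidance_scale = max_guidance_scale

    @property
    def do_classifier_free_guidance(self):
        # Simulates comparing a device tensor with an int.
        return FakeDeviceBool(self.max_guidance_scale > 1)

    def run_uncached(self, num_steps):
        for _ in range(num_steps):
            if self.do_classifier_free_guidance:  # one simulated copy per step
                pass  # the denoising step would go here

    def run_cached(self, num_steps):
        # Evaluate once before the loop: one simulated copy in total.
        do_cfg = bool(self.do_classifier_free_guidance)
        for _ in range(num_steps):
            if do_cfg:
                pass  # the denoising step would go here


def count_syncs(run, num_steps):
    """Reset the counter, run the loop, and report the simulated copies."""
    FakeDeviceBool.sync_count = 0
    run(num_steps)
    return FakeDeviceBool.sync_count


pipe = SketchPipeline(max_guidance_scale=3.0)
uncached_syncs = count_syncs(pipe.run_uncached, 25)
cached_syncs = count_syncs(pipe.run_cached, 25)
print(uncached_syncs, cached_syncs)  # prints "25 1"
```

In the real pipeline the change would presumably be the same shape as `run_cached`: read the property once before the denoising loop and use the resulting Python `bool` inside it.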
Additional context.
I'd be glad to contribute this by opening a PR.