Description
Model/Pipeline/Scheduler description
The authors propose a novel inference technique built on a pretrained diffusion model for text-conditional video generation. Their approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without any additional training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue: at each denoising step, the method dequeues a fully denoised frame at the head while enqueuing a new frame of pure noise at the tail.
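As a rough illustration of the queue mechanics, here is a minimal sketch in Python. `denoise_step` is a hypothetical callable standing in for one denoising pass of the pretrained model, and the pure-noise queue initialization simplifies the paper's setup, which warm-starts the queue from a short clip generated by the base model:

```python
from collections import deque

import torch


def fifo_diagonal_denoising(denoise_step, timesteps, latent_shape, total_frames):
    """Sketch of diagonal denoising (illustrative, not the authors' code).

    `timesteps` holds one noise level per queue position, sorted from
    nearly clean at the head to pure noise at the tail; `denoise_step`
    advances every frame in the stack one noise level.
    """
    num_frames = len(timesteps)
    # Simplification: the paper warm-starts the queue from a clip generated
    # by the base model; here we just fill it with pure noise.
    queue = deque(torch.randn(latent_shape) for _ in range(num_frames))

    frames = []
    while len(frames) < total_frames:
        latents = torch.stack(tuple(queue))         # (num_frames, ...) stack
        latents = denoise_step(latents, timesteps)  # one step per frame

        queue = deque(latents.unbind(0))
        frames.append(queue.popleft())              # head is fully denoised
        queue.append(torch.randn(latent_shape))     # fresh noise at the tail
    return frames
```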
However, diagonal denoising is a double-edged sword: frames near the tail can exploit cleaner frames ahead of them via forward reference, but the strategy introduces a discrepancy between training and inference. To address this, the authors propose latent partitioning, which reduces the training-inference gap, and lookahead denoising, which preserves the benefit of forward referencing.
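A compact way to picture the two fixes, reusing the hypothetical `denoise_step` from the sketch above: partition the long diagonal queue into blocks of the model's native frame count `f`, so each model call sees a narrower band of noise levels, then slide the block window by `f // 2` and keep only the noisier back half of each result, so every frame (apart from the head block) is denoised while referencing cleaner frames ahead of it:

```python
def partitioned_lookahead_step(denoise_step, latents, timesteps, f):
    """Sketch of latent partitioning + lookahead denoising (illustrative).

    Assumes `f` is even and the queue length is a multiple of f // 2,
    so the stride-(f // 2) windows tile the queue exactly.
    """
    half = f // 2
    out = latents.clone()
    for start in range(0, latents.shape[0] - f + 1, half):
        window = denoise_step(latents[start:start + f],
                              timesteps[start:start + f])
        if start == 0:
            # Head of the queue has no cleaner frames to look ahead to;
            # keep the whole first window.
            out[:f] = window
        else:
            # Keep only the noisier back half, which was denoised while
            # attending to the cleaner front half of the window.
            out[start + half:start + f] = window[half:]
    return out
```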
The authors demonstrate promising results with existing pretrained text-to-video generation models such as VideoCrafter, Open-Sora Plan, and zeroscope.
Open source status
- The model implementation is available.
- The model weights are available (Only relevant if addition is not a scheduler).
Provide useful links for the implementation
Project Page: https://jjihwan.github.io/projects/FIFO-Diffusion
Code: https://github.com/jjihwan/FIFO-Diffusion_public
Arxiv: https://arxiv.org/abs/2405.11473
Contact: @jjihwan