
FIFO-Diffusion: Generating Infinite Videos from Text without Training through Rolling Video Denoising #8274


Model/Pipeline/Scheduler description

The authors propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Their approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue. Specifically, at each denoising step, this method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail.
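
For illustration, here is a minimal sketch of the rolling queue, assuming a hypothetical `denoise_step(latents, timesteps)` helper that advances every frame one denoising step at its per-position noise level. The helper and all names here are illustrative, not an actual diffusers or reference API:

```python
from collections import deque
import torch

def fifo_generate(denoise_step, num_output_frames, timesteps, latent_shape):
    # The queue holds one latent per noise level: index 0 (head) is the
    # cleanest, index -1 (tail) is pure noise; `timesteps` is ordered
    # accordingly, one timestep per queue position.
    queue = deque(torch.randn(latent_shape) for _ in timesteps)
    frames = []
    while len(frames) < num_output_frames:
        latents = torch.stack(tuple(queue))         # (len(timesteps), C, H, W)
        latents = denoise_step(latents, timesteps)  # diagonal denoising: every
        queue = deque(latents.unbind(0))            # frame advances one level
        frames.append(queue.popleft())              # head is now fully denoised
        queue.append(torch.randn(latent_shape))     # fresh noise joins the tail
    return frames
```

In the paper, the queue length equals the number of denoising timesteps, so each frame traverses the queue exactly once on its way from pure noise at the tail to a finished frame at the head.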

However, diagonal denoising is a double-edged sword: frames near the tail can exploit the cleaner frames ahead of them via forward reference, but the per-frame noise levels it implies differ from what the model saw during training, inducing a training-inference discrepancy. To address this, the authors introduce latent partitioning, which narrows the range of noise levels handled in each model call, and lookahead denoising, which preserves the benefit of forward referencing (both sketched below).
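
Below is an equally rough sketch of the two techniques, reusing the same hypothetical `denoise_step` as above. This is one interpretation of the paper's description, not the authors' implementation:

```python
import torch

def partitioned_step(denoise_step, latents, timesteps, f):
    # Latent partitioning (sketch): split the length-(n*f) queue into n
    # blocks of f consecutive frames and denoise each block in its own
    # model call, so each call spans a narrower range of noise levels.
    out = torch.empty_like(latents)
    for start in range(0, latents.shape[0], f):
        block = slice(start, start + f)
        out[block] = denoise_step(latents[block], timesteps[block])
    return out

def lookahead_step(denoise_step, latents, timesteps, f):
    # Lookahead denoising (sketch): slide a window of f frames forward
    # in strides of f//2 and keep only the latter half of each
    # prediction, so every updated frame is denoised alongside f//2
    # cleaner frames ahead of it. (Special handling of the frames at
    # the queue head is omitted here; the paper treats them separately.)
    half = f // 2
    out = latents.clone()
    for start in range(0, latents.shape[0] - f + 1, half):
        window = slice(start, start + f)
        pred = denoise_step(latents[window], timesteps[window])
        out[start + half : start + f] = pred[half:]
    return out
```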

The authors demonstrate promising results on existing pretrained text-to-video generation models such as VideoCrafter, Open-Sora Plan, and zeroscope.

Open source status

  • The model implementation is available.
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

Project Page: https://jjihwan.github.io/projects/FIFO-Diffusion
Code: https://github.com/jjihwan/FIFO-Diffusion_public
Arxiv: https://arxiv.org/abs/2405.11473
Contact: @jjihwan
