TL;DR: We introduce Zero4D, a novel approach that generates synchronized multi-view videos from a single video using an off-the-shelf video diffusion model, without any training.
(a) Novel view video: Generating new videos from target camera views.
(b) Bullet time video: Generating a video in which time is frozen at a single instant while the camera viewpoint moves.
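Both output types correspond to slices of the completed camera-time 4D grid described below. The following minimal sketch is only an illustration of that indexing (the tensor shape and indices are assumptions, not the released interface):

```python
import torch

# Hypothetical completed camera-time grid of frames:
# shape (V views, T time steps, C, H, W); sizes are placeholders.
grid = torch.randn(8, 25, 3, 64, 64)

# (a) Novel view video: fix a camera index and play all time steps.
novel_view_video = grid[3]        # shape (T, C, H, W)

# (b) Bullet time video: freeze one time step and sweep the camera.
bullet_time_video = grid[:, 12]   # shape (V, C, H, W)
```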
Reconstruction pipeline of Zero4D:
(a) Key frame generation step: (a1) Given a fixed-viewpoint input video, our goal is to generate synchronized multi-view videos with an I2V diffusion model. (a2) We first synthesize the key frames of the 4D grid through diffusion sampling, guided by warped views. (a3) Next, we generate the end-view video frames using the same process as (a2), again guided by warped views. (a4) Finally, we complete the rightmost column using diffusion-based interpolation sampling.
(b) Spatio-temporal bidirectional interpolation step: Starting from the initial noise of (a4), we denoise the remaining frames with a camera-axis interpolation denoising step, in which the noisy frame x_t[:, i] is updated conditioned on the edge frames along the camera axis. The time-axis interpolation step then follows, in which the perturbed frames are denoised by interpolation along the time axis of the 4D grid, conditioned on the edge frames along the time axis. Spatio-temporal bidirectional sampling alternates between camera-axis and time-axis interpolation, progressively refining the noisy latents into clean frames. Through this process, we obtain a globally coherent spatio-temporal 4D grid of videos.
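To make the alternating structure of step (b) concrete, here is a minimal runnable sketch. Everything in it is an assumption for illustration: `x` is the noisy 4D latent grid of shape (V, T, C, H, W), the edge rows and columns hold the clean key frames from step (a), and `denoise_step` stands in for one guided denoising step of the underlying I2V diffusion model (this is not the authors' released code).

```python
import torch

def spatio_temporal_bidirectional_sampling(x, denoise_step, timesteps):
    """Sketch of the alternating camera-axis / time-axis interpolation.

    x: noisy 4D grid latents, shape (V, T, C, H, W).
    denoise_step(frames, t, cond): hypothetical stand-in for one
    interpolation-guided denoising step of the video diffusion model.
    """
    V, T = x.shape[:2]
    for t in timesteps:
        # Camera-axis interpolation: for each time index i, update the
        # noisy column x[:, i] conditioned on its edge frames along the
        # camera axis (in practice the clean edge key frames stay fixed).
        for i in range(T):
            x[:, i] = denoise_step(x[:, i], t, cond=(x[0, i], x[-1, i]))
        # Time-axis interpolation: for each view index v, denoise the
        # row x[v] conditioned on its edge frames along the time axis.
        for v in range(V):
            x[v] = denoise_step(x[v], t, cond=(x[v, 0], x[v, -1]))
    return x  # noisy latents progressively refined into clean frames

# Toy usage: a dummy denoiser that nudges frames toward the mean of the
# two conditioning edge frames (illustration only, not a real sampler).
def dummy_denoise(frames, t, cond):
    target = (cond[0] + cond[1]) / 2
    return 0.9 * frames + 0.1 * target

grid = torch.randn(5, 9, 3, 32, 32)  # (V, T, C, H, W)
clean = spatio_temporal_bidirectional_sampling(grid, dummy_denoise, range(10))
```

The alternation is the key design point: each camera-axis pass enforces consistency across viewpoints at a fixed time, each time-axis pass enforces temporal coherence within a view, and interleaving the two propagates the key-frame constraints across the whole grid.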
For novel view generation, we compare with Generative Camera Dolly and Stable Video 4D, which take the same input frames as our method.