Zero4D: Training-Free 4D Video Generation From Single Video
Using Off-the-Shelf Video Diffusion Model


Jangho Park, Taesung Kwon, Jong Chul Ye
KAIST AI


TL;DR

We introduce Zero4D, a training-free approach that generates synchronized multi-view videos from a single video using an off-the-shelf video diffusion model.



(a) Novel view video: generating new videos from target camera views.
[Video panels: input video, view from orbit left, view from orbit right]

(b) Bullet time video: generating a time-frozen video at a single time point.
[Video panels: five input videos, each with bullet-time frames at time indices 1-3]



Method
Reconstruction Pipeline of Zero4D

Reconstruction pipeline of Zero4D:

(a) Key frame generation: (a1) Given a fixed-viewpoint grayscale input video, we generate synchronized multi-view videos with the I2V diffusion model. (a2) We first synthesize the key frames of the 4D grid through diffusion sampling, guided by warped views. (a3) Next, we generate the end-view video frames with a process similar to (a2), again guided by warped views. (a4) Finally, we complete the rightmost column of the grid using diffusion-based interpolation sampling.

(b) Spatio-temporal bidirectional interpolation: Starting from the initial noise in (a4), we denoise the remaining frames with alternating interpolation steps. In the camera-axis step, the noisy frame column x_t[:, i] is updated conditioned on the edge frames along the camera axis. In the time-axis step, the perturbed frames are denoised through interpolation along the time axis of the 4D grid, conditioned on the edge frames along the time axis. The sampling alternates between camera-axis and time-axis interpolation, progressively refining the noisy latents into clean frames and yielding a globally coherent spatio-temporal 4D grid video.
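As a concrete illustration of step (b), the sketch below implements the alternating camera-axis/time-axis interpolation loop in NumPy. It assumes the 4D grid is an array x of shape (V, T, C, H, W) whose border frames (the input video, the end-view video, and the first and last time columns) already hold clean key frames from step (a); the function denoise_interp is a hypothetical stand-in for one interpolation denoising step of the off-the-shelf I2V diffusion model, not the authors' actual implementation.

```python
import numpy as np

def bidirectional_sample(x, num_steps, denoise_interp):
    """Minimal sketch of spatio-temporal bidirectional interpolation.

    x : latent 4D grid of shape (V, T, C, H, W); the border of the
        (camera, time) grid is assumed to already hold clean key frames
        from the key frame generation step.
    denoise_interp(noisy, edge_a, edge_b, step) : hypothetical one-step
        interpolation denoiser conditioned on two clean edge frames
        (a stand-in for the I2V diffusion model).
    """
    V, T = x.shape[:2]
    for step in range(num_steps):
        # Camera-axis interpolation: update the interior of each time
        # column x[:, i], conditioned on the edge frames along the
        # camera axis.
        for i in range(1, T - 1):
            x[1:V - 1, i] = denoise_interp(x[1:V - 1, i], x[0, i], x[V - 1, i], step)
        # Time-axis interpolation: update the interior of each camera
        # row x[v, :], conditioned on the edge frames along the time axis.
        for v in range(1, V - 1):
            x[v, 1:T - 1] = denoise_interp(x[v, 1:T - 1], x[v, 0], x[v, T - 1], step)
    return x

if __name__ == "__main__":
    V, T, C, H, W = 5, 7, 4, 8, 8
    x = np.random.randn(V, T, C, H, W)

    # Toy stand-in denoiser: pull each noisy frame toward the mean of its
    # two conditioning edge frames. A real sampler would invoke the video
    # diffusion model here.
    def toy_interp(noisy, edge_a, edge_b, step):
        return 0.9 * noisy + 0.05 * (edge_a + edge_b)

    out = bidirectional_sample(x, num_steps=10, denoise_interp=toy_interp)
    print(out.shape)  # (5, 7, 4, 8, 8)
```

Alternating the two axes, rather than denoising each frame independently, lets information from the clean border propagate into the interior along both the camera and time dimensions, which is the source of the global coherence described above.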





Comparisons

For novel view generation, we compare with Generative Camera Dolly and Stable Video 4D, which use the same input frames as ours.


[Comparison video panels, four examples. Each example shows the input video alongside orbit-left and orbit-right novel views generated by Zero4D (Ours), Generative Camera Dolly, and Stable Video 4D.]