TL;DR: We introduce Zero4D, a novel approach that generates synchronized multi-view videos from a single video using an off-the-shelf video diffusion model, without any training.
(a) Novel view video: Generating new videos from target camera views.
[Video panels: Input Video, View from Orbit left, View from Orbit right]
(b) Bullet time video: Generating a time-frozen video at a single time point.
[Video panels: Input Video, Time index 1, Time index 2, Time index 3]
Method
Reconstruction pipeline of Zero4D.
(a) Key frame generation step:
(a1) Given a fixed-viewpoint input video, we generate synchronized multi-view videos using an image-to-video (I2V) diffusion model.
(a2) We first synthesize key frames of the 4D grid through diffusion sampling, guided by warped views.
(a3) Next, we generate the end-view video frames using a process similar to (a2), guided by warped views.
(a4) Finally, we complete the rightmost column using diffusion-based interpolation sampling.
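The order in which key frame generation fills the 4D grid can be pictured with a small sketch. It assumes that rows index camera views and columns index time, that the input video occupies the first row, and that the key frames of (a2) form the first column; the page does not state these details, so they are a reading of the description rather than a confirmed layout. The function name `keyframe_fill_order` is hypothetical.

```python
import numpy as np

def keyframe_fill_order(num_views: int, num_frames: int) -> np.ndarray:
    """Label each grid cell with the key frame step (1-4) that fills it.

    Cells labelled 0 are not touched by step (a) and are left for the
    interpolation step (b). Layout assumption: rows = camera views,
    columns = time indices.
    """
    if num_views < 2 or num_frames < 2:
        raise ValueError("need at least 2 views and 2 frames")
    step = np.zeros((num_views, num_frames), dtype=int)
    step[0, :] = 1        # (a1) input video: first row (assumed)
    step[1:, 0] = 2       # (a2) key frames: first column (assumed)
    step[-1, 1:] = 3      # (a3) end-view video: last row
    step[1:-1, -1] = 4    # (a4) rightmost column, filled by interpolation
    return step

if __name__ == "__main__":
    print(keyframe_fill_order(4, 5))
    # [[1 1 1 1 1]
    #  [2 0 0 0 4]
    #  [2 0 0 0 4]
    #  [2 3 3 3 3]]
```

Under these assumptions, step (a) produces the full border of the grid, so every interior row and column has known edge frames at both ends.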
(b) Spatio-temporal bidirectional interpolation step:
Starting from the initial noise in (a4), we denoise the remaining frames through a camera-axis interpolation denoising step.
At this stage, the noisy frames x_t[:, i] are updated based on the edge frames along the camera axis.
Then, the time-axis interpolation step follows, where perturbed frames are denoised through interpolation along the time axis of the 4D grid, conditioned on edge frames in the time axis.
The spatio-temporal bidirectional sampling alternates between camera-axis interpolation and time-axis interpolation, progressively refining noisy latents into clean frames.
Through this process, we obtain globally coherent spatio-temporal 4D grid videos.
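The alternating schedule described above can be illustrated with a toy loop. This is only a sketch of the control flow: `toy_interp_denoise` is a hypothetical stand-in that blends interior frames toward a linear mix of the two edge frames, not a video diffusion model, and the `strength` schedule, the array layout (views, frames, feature dimension), and all names are assumptions made for illustration.

```python
import numpy as np

def toy_interp_denoise(seq: np.ndarray, strength: float) -> np.ndarray:
    """Stand-in for one interpolation denoising step (NOT a real model).

    Pulls the interior frames of `seq` (shape [n, d]) toward the linear
    blend of its two edge frames; the edge frames are kept fixed.
    """
    n = seq.shape[0]
    w = np.linspace(0.0, 1.0, n)[:, None]
    target = (1.0 - w) * seq[0] + w * seq[-1]
    out = (1.0 - strength) * seq + strength * target
    out[0] = seq[0]
    out[-1] = seq[-1]
    return out

def bidirectional_sampling(grid: np.ndarray, num_steps: int = 10) -> np.ndarray:
    """Alternate camera-axis and time-axis passes over a [views, frames, d] grid.

    Border cells are assumed to hold key frames; interior cells hold noise.
    """
    x = grid.copy()
    num_views, num_frames, _ = x.shape
    for k in range(num_steps):
        strength = (k + 1) / num_steps  # assumed schedule: reaches 1 at the end
        # Camera-axis pass: each time column x[:, i] is conditioned on
        # its first-view and last-view edge frames.
        for i in range(1, num_frames - 1):
            x[:, i] = toy_interp_denoise(x[:, i], strength)
        # Time-axis pass: each view row x[v, :] is conditioned on
        # its first-time and last-time edge frames.
        for v in range(1, num_views - 1):
            x[v, :] = toy_interp_denoise(x[v, :], strength)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.normal(size=(4, 5, 3))
    refined = bidirectional_sampling(grid, num_steps=8)
    print(refined.shape)
```

In this toy version the border of the grid is never modified, and only interior cells are refined, mirroring the idea that both passes are conditioned on edge frames. A real implementation would replace the stand-in with diffusion-based interpolation along each axis.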
Comparisons
For novel view generation, we compare with Generative Camera Dolly and Stable Video 4D, which employs the same input frame as ours.