Zero4D: Training-Free 4D Video Generation From Single Video
Using Off-the-Shelf Video Diffusion Model


Jangho Park, Taesung Kwon, Jong Chul Ye
KAIST AI


TL;DR

We introduce Zero4D, a training-free approach that generates synchronized multi-view videos from a single video using an off-the-shelf video diffusion model.



(a) Novel view video: generating new videos from target camera views.
[Video panels: input video, view from orbit left, view from orbit right]

(b) Bullet time video: generating a time-frozen video at a single time point.
[Video panels: five input videos, each with bullet-time frames at time indices 1-3]



Method
Reconstruction Pipeline of Zero4D

Reconstruction pipeline of Zero4D:

(a) Key frame generation: (a1) Given a fixed-viewpoint grayscale input video, we generate synchronized multi-view videos with the I2V diffusion model. (a2) We first synthesize the key frames of the 4D grid through diffusion sampling, guided by warped views. (a3) Next, we generate the end-view video frames with a process similar to (a2), again guided by warped views. (a4) Finally, we complete the rightmost column of the grid using diffusion-based interpolation sampling.

(b) Spatio-temporal bidirectional interpolation: Starting from the initial noise in (a4), we denoise the remaining frames with alternating interpolation steps. In the camera-axis step, the noisy frame column x_t[:, i] is updated conditioned on the edge frames along the camera axis. In the time-axis step, the perturbed frames are denoised through interpolation along the time axis of the 4D grid, conditioned on the edge frames along the time axis. The sampling alternates between camera-axis and time-axis interpolation, progressively refining the noisy latents into clean frames and yielding a globally coherent spatio-temporal 4D grid video.
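As a concrete illustration of step (b), the sketch below implements the alternating camera-axis/time-axis interpolation loop in NumPy. It assumes the 4D grid is an array x of shape (V, T, C, H, W) whose border frames (the input video, the end-view video, and the first and last time columns) already hold clean key frames from step (a); the function denoise_interp is a hypothetical stand-in for one interpolation denoising step of the off-the-shelf I2V diffusion model, not the authors' actual implementation.

```python
import numpy as np

def bidirectional_sample(x, num_steps, denoise_interp):
    """Minimal sketch of spatio-temporal bidirectional interpolation.

    x : latent 4D grid of shape (V, T, C, H, W); the border of the
        (camera, time) grid is assumed to already hold clean key frames
        from the key frame generation step.
    denoise_interp(noisy, edge_a, edge_b, step) : hypothetical one-step
        interpolation denoiser conditioned on two clean edge frames
        (a stand-in for the I2V diffusion model).
    """
    V, T = x.shape[:2]
    for step in range(num_steps):
        # Camera-axis interpolation: update the interior of each time
        # column x[:, i], conditioned on the edge frames along the
        # camera axis.
        for i in range(1, T - 1):
            x[1:V - 1, i] = denoise_interp(x[1:V - 1, i], x[0, i], x[V - 1, i], step)
        # Time-axis interpolation: update the interior of each camera
        # row x[v, :], conditioned on the edge frames along the time axis.
        for v in range(1, V - 1):
            x[v, 1:T - 1] = denoise_interp(x[v, 1:T - 1], x[v, 0], x[v, T - 1], step)
    return x

if __name__ == "__main__":
    V, T, C, H, W = 5, 7, 4, 8, 8
    x = np.random.randn(V, T, C, H, W)

    # Toy stand-in denoiser: pull each noisy frame toward the mean of its
    # two conditioning edge frames. A real sampler would invoke the video
    # diffusion model here.
    def toy_interp(noisy, edge_a, edge_b, step):
        return 0.9 * noisy + 0.05 * (edge_a + edge_b)

    out = bidirectional_sample(x, num_steps=10, denoise_interp=toy_interp)
    print(out.shape)  # (5, 7, 4, 8, 8)
```

Alternating the two axes, rather than denoising each frame independently, lets information from the clean border propagate into the interior along both the camera and time dimensions, which is the source of the global coherence described above.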





Comparisons

For novel view generation, we compare with Generative Camera Dolly and Stable Video 4D, which use the same input frames as ours.


[Comparison video panels, four examples. Each example shows the input video alongside orbit-left and orbit-right novel views generated by Zero4D (Ours), Generative Camera Dolly, and Stable Video 4D.]