Zero4D: Training-Free 4D Video Generation From Single Video
Using Off-the-Shelf Video Diffusion Models


Jangho Park, Taesung Kwon, Jong Chul Ye
KAIST AI


TL;DR

We introduce Zero4D, a training-free approach that generates synchronized multi-view videos from a single video using an off-the-shelf video diffusion model.



(a) Novel-view video: Generating new videos from target camera views.
(Dynamic-view and fixed-view videos are synchronized.)

[Video gallery: each example pairs the input video with a synchronized dynamic-view video and fixed-view videos (orbit left/right or dolly in/out).]


(b) Bullet-time video: Generating synchronized time-frozen videos at multiple time indices.

[Video gallery: each example pairs the input video with synchronized time-frozen videos at time indices 1-3.]



Method
Generation Pipeline of Zero4D

Generation pipeline of Zero4D. (a) Key-frame generation: starting from the input video (shown as the gray-shaded row), we sequentially generate the boundary frames of the camera-time grid via novel-view synthesis, end-view video generation, and end-frame view synthesis, where each step leverages the results of the previous one. (b) Spatio-temporal bidirectional interpolation: starting from the noisy frames, we alternately perform camera-axis and time-axis interpolation, each conditioned on the boundary frames, to progressively denoise the 4D grid. Through this bidirectional process, the noisy latents are refined into globally coherent spatio-temporal videos.
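For intuition, below is a minimal, hypothetical sketch of this alternating denoising schedule over the camera-time grid. Everything here is an illustrative assumption rather than the authors' implementation: the grid dimensions, the 50-step schedule, and the denoise_step stub, which merely stands in for one conditional sampling step of the pretrained video diffusion model.

import torch

# Hypothetical camera-time grid of noisy latents: grid[view, time] with
# V camera views, T time steps, and latent shape (C, H, W). All sizes are
# illustrative assumptions, not the paper's settings.
V, T, C, H, W = 5, 13, 4, 64, 64
grid = torch.randn(V, T, C, H, W)

# (a) Key-frame generation, assumed already done: the input video fills one
# row, and the remaining grid boundaries come from novel-view synthesis,
# end-view video generation, and end-frame view synthesis (random stand-ins).
grid[0] = torch.randn(T, C, H, W)      # input video row
grid[:, 0] = torch.randn(V, C, H, W)   # first-frame novel views
grid[:, -1] = torch.randn(V, C, H, W)  # end-frame novel views
grid[-1] = torch.randn(T, C, H, W)     # end-view video

def denoise_step(latents, boundary, t):
    # Stand-in for one conditional denoising step of the pretrained video
    # model; a real implementation would call its sampler here. It only
    # shrinks the latents and re-imposes the clean boundary frames.
    first, last = boundary
    latents = latents * (1.0 - 0.02 * t)   # placeholder "denoising"
    latents[0], latents[-1] = first, last  # keep boundary frames fixed
    return latents

# (b) Spatio-temporal bidirectional interpolation: alternate camera-axis and
# time-axis passes, each conditioned on the clean boundary frames; interior
# slices only, so the key frames on the grid boundary never change.
for i, t in enumerate(torch.linspace(1.0, 0.0, steps=50)):
    if i % 2 == 0:
        for k in range(1, T - 1):  # camera-axis pass over each time slice
            grid[:, k] = denoise_step(grid[:, k], (grid[0, k], grid[-1, k]), t)
    else:
        for v in range(1, V - 1):  # time-axis pass over each view's video
            grid[v] = denoise_step(grid[v], (grid[v, 0], grid[v, -1]), t)

# Slicing the denoised grid yields the outputs shown above:
#   grid[v]    -> fixed-view (novel-view) video at camera v
#   grid[:, k] -> bullet-time video frozen at time index k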



Comparison with baselines
Fixed novel-view videos rendered from bullet-time videos, compared across Zero4D (Ours), TrajectoryCrafter, TrajectoryAttention, and CameraCtrl.
[Video gallery: each comparison shows the resulting bullet-time videos at Time 1 and Time 2 for every method.]




More Results
(Bullet-time videos and novel-view videos are synchronized.)
[Video gallery: input video, fixed-time videos (time 1 and time 2), and the novel-view video.]