Object-centric Video Prediction with Mask-guided Spatiotemporal Diffusion
Abstract
Accurately modelling object structure and temporally coherent dynamics remains a fundamental challenge in video prediction. In this paper, we propose an object-centric framework that integrates a high-fidelity diffusion-based spatial decoder with a diffusion-style temporal prediction module. Unlike prior work, we formulate temporal dynamics as a denoising process and employ a transformer as the denoising network, enabling progressive refinement of long-range motion trajectories across object slots. To support multiple video understanding tasks within a unified architecture, we introduce a slotwise masking mechanism: by selectively masking object slots during training, our spatiotemporal model learns to jointly perform video prediction, frame interpolation, and unconditional video generation with shared parameters. Built upon the SlotDiffusion decoder for spatial reconstruction and extended with our temporal diffusion transformer, the proposed framework shows consistent improvements over representative baselines across multiple benchmarks, in both spatial fidelity and temporal coherence, for object-centric video modelling.
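To illustrate how a single masking scheme could select among the three tasks, the following is a minimal sketch rather than our actual implementation. It assumes that masking is applied to whole frames of slots, that masked slots are replaced by Gaussian noise for the temporal denoiser to recover, and that unmasked slots are kept as conditioning. The function names `make_slot_mask` and `apply_mask`, the `num_context` parameter, and the task labels are hypothetical and introduced only for illustration.

```python
import random


def make_slot_mask(task, num_frames, num_slots, num_context=2):
    """Build a (num_frames x num_slots) mask; True marks a slot to be denoised.

    Hypothetical task-to-mask mapping (an assumption, not the paper's spec):
      prediction    -> the first `num_context` frames are visible
      interpolation -> only the first and last frames are visible
      generation    -> no frame is visible (unconditional)
    """
    mask = [[True] * num_slots for _ in range(num_frames)]
    if task == "prediction":
        visible = range(min(num_context, num_frames))
    elif task == "interpolation":
        visible = {0, num_frames - 1}
    elif task == "generation":
        visible = ()
    else:
        raise ValueError(f"unknown task: {task!r}")
    for t in visible:
        mask[t] = [False] * num_slots
    return mask


def apply_mask(slots, mask, rng):
    """Replace masked slot vectors with Gaussian noise; copy visible ones.

    `slots[t][k]` is the feature vector (list of floats) of slot k in frame t.
    """
    return [
        [
            [rng.gauss(0.0, 1.0) for _ in vec] if mask[t][k] else list(vec)
            for k, vec in enumerate(frame)
        ]
        for t, frame in enumerate(slots)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # 5 frames, 3 slots per frame, 4-dimensional slot vectors (all zeros).
    slots = [[[0.0] * 4 for _ in range(3)] for _ in range(5)]
    mask = make_slot_mask("interpolation", num_frames=5, num_slots=3)
    noisy = apply_mask(slots, mask, rng)
    print([any(row) for row in mask])
```

Under these assumptions, sampling a different task label per training example would expose the same denoising network to all three conditioning patterns, which is one way the shared-parameter behaviour described above could arise.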