Object-centric Video Prediction with Mask-guided Spatiotemporal Diffusion

Chenchen Han; Jiayi Fan; Na Wu; Jinpeng Dai; Hanbin Bao; Xiankai Lu

doi:10.1007/s11633-025-1584-y

Citation: Chenchen Han, Jiayi Fan, Na Wu, Jinpeng Dai, Hanbin Bao, Xiankai Lu. Object-centric Video Prediction with Mask-guided Spatiotemporal Diffusion[J]. Machine Intelligence Research. DOI: 10.1007/s11633-025-1584-y

Abstract

Accurately modelling object structures and their temporally coherent dynamics remains a fundamental challenge in video prediction. In this paper, we propose an object-centric framework that integrates a high-fidelity diffusion-based spatial decoder with a diffusion-style temporal prediction module. Unlike prior works, we formulate temporal dynamics as a denoising process and employ a transformer as the denoising network, enabling progressive refinement of long-range motion trajectories across object slots. To support multiple video understanding tasks within a unified architecture, we introduce a slotwise masking mechanism. By selectively masking object slots during training, our spatiotemporal model learns to jointly perform video prediction, frame interpolation, and unconditional video generation with shared parameters. Built upon the SlotDiffusion decoder for spatial reconstruction and extended with our temporal diffusion transformer, the proposed framework demonstrates consistent improvements over representative baselines across multiple benchmarks, ensuring spatial fidelity and temporal coherence in object-centric video modelling.
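The slotwise masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `slot_mask` and `noise_masked_slots`, the per-frame masking granularity, the `num_context` parameter, and the DDPM-style noising with a single `alpha_bar` value are all assumptions made for illustration. The sketch only shows how one mask layout over a (frames, slots) grid could select between prediction, interpolation, and unconditional generation, with observed slots left clean as conditioning.

```python
import numpy as np


def slot_mask(task: str, num_frames: int, num_slots: int, num_context: int = 2) -> np.ndarray:
    """Return a boolean (num_frames, num_slots) array; True marks slots to generate.

    Hypothetical helper: the task names and per-frame masking are assumptions.
    """
    mask = np.ones((num_frames, num_slots), dtype=bool)
    if task == "prediction":
        # The first num_context frames are observed; later frames are generated.
        mask[:num_context] = False
    elif task == "interpolation":
        # The first and last frames are observed; frames in between are generated.
        mask[0] = False
        mask[-1] = False
    elif task == "unconditional":
        # Nothing is observed; every slot is generated.
        pass
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


def noise_masked_slots(slots: np.ndarray, mask: np.ndarray, alpha_bar: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Apply DDPM-style forward noising to masked slots only (an assumed formulation).

    slots has shape (num_frames, num_slots, slot_dim); unmasked slots pass through unchanged.
    """
    noise = rng.standard_normal(slots.shape)
    noisy = np.sqrt(alpha_bar) * slots + np.sqrt(1.0 - alpha_bar) * noise
    # Broadcast the (frames, slots) mask over the slot feature dimension.
    return np.where(mask[..., None], noisy, slots)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slots = rng.standard_normal((6, 4, 8))  # 6 frames, 4 slots, 8-dim slot vectors
    mask = slot_mask("prediction", num_frames=6, num_slots=4, num_context=2)
    noisy = noise_masked_slots(slots, mask, alpha_bar=0.5, rng=rng)
    print(mask.astype(int))
```

Under these assumptions, a denoising network would receive the partially noised slot sequence together with the mask and be trained to recover the masked entries, so the same parameters serve all three tasks and only the mask changes.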
