Training-free Dense Video Captioning with Large-scale Pretrained Models

Yue Feng; Ziyi Yan; Yizhen Jia; Ethan Q. Chen; Jie Qin

doi:10.1007/s11633-025-1585-x

Yue Feng, Ziyi Yan, Yizhen Jia, Ethan Q. Chen, Jie Qin. Training-free Dense Video Captioning with Large-scale Pretrained Models. Machine Intelligence Research. DOI: 10.1007/s11633-025-1585-x

Abstract

Dense video captioning (DVC) aims to locate important segments in videos and generate captions for them simultaneously. Despite relying on large datasets for pretraining, traditional methods struggle in zero-shot scenarios, which limits their real-world applicability. In this paper, we propose a training-free dense video captioning framework (FreeDVC) that leverages the general capabilities of pretrained large models to achieve accurate zero-shot DVC. Our framework incorporates three key architectural contributions: 1) a shot segment description module that generates internally consistent semantic descriptions of shot segments via an improved shot boundary detection model and a visual-language model; 2) a shot-speech fusion module that adaptively integrates transcribed speech into shot segment descriptions via a large language model, aligning semantic and temporal information; and 3) a caption retrieval alignment module that uses image-text feature similarity to retrieve the frames most similar to each caption, filtering out incorrect captions caused by shot-speech misalignment. We conduct extensive experiments on two DVC datasets featuring real-world internet videos with abundant shots and speech (i.e., YouCook2 and ViTT), which demonstrate strong zero-shot performance in both visual and multimodal settings.
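
The caption retrieval alignment step described above can be illustrated with a minimal sketch. It assumes that caption and frame features have already been extracted by some image-text model and are given as plain vectors. The function names, the `top_k` and `min_sim` parameters, and the rule of dropping a caption whose best match falls below a threshold are hypothetical choices made for illustration; the abstract does not specify how FreeDVC implements this module.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align_captions(caption_feats, frame_feats, frame_times, top_k=2, min_sim=0.5):
    """Retrieve the top_k most similar frames per caption and drop weak matches.

    top_k and min_sim are illustrative assumptions, not values from the paper.
    Returns a list of (caption_index, start_time, end_time, best_similarity).
    """
    if not frame_feats:
        return []
    kept = []
    for ci, cfeat in enumerate(caption_feats):
        sims = [cosine(cfeat, f) for f in frame_feats]
        # Indices of the frames most similar to this caption, best first.
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
        best = sims[ranked[0]]
        if best < min_sim:
            # Assumed filter: a caption with no sufficiently similar frame is discarded.
            continue
        times = [frame_times[i] for i in ranked]
        kept.append((ci, min(times), max(times), best))
    return kept


# Toy 2-D features standing in for real image-text embeddings.
frame_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
frame_times = [0.0, 1.0, 2.0, 3.0]
caption_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

result = align_captions(caption_feats, frame_feats, frame_times)
for caption_index, start, end, sim in result:
    print(caption_index, start, end, round(sim, 3))
```

In this toy input the third caption resembles no frame, so the assumed threshold removes it, while the first two captions are each assigned the time span covered by their retrieved frames.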
