Training-free Dense Video Captioning with Large-scale Pretrained Models
Abstract
Dense video captioning (DVC) aims to localize important segments in a video and generate a caption for each of them. Despite relying on large datasets for pretraining, traditional methods struggle in zero-shot scenarios, which limits their real-world applicability. In this paper, we propose a training-free dense video captioning framework (FreeDVC) that leverages the general capabilities of large pretrained models to achieve accurate zero-shot DVC. Our framework comprises three key modules: 1) a shot segment description module that generates internally consistent semantic descriptions of shot segments using an improved shot boundary detection model and a visual-language model; 2) a shot-speech fusion module that uses a large language model to adaptively integrate transcribed speech into the shot segment descriptions, aligning semantic and temporal information; and 3) a caption retrieval alignment module that uses image-text feature similarity to retrieve the frames most similar to each caption, filtering out incorrect captions caused by shot-speech misalignment. We conduct extensive experiments on two DVC datasets of real-world internet videos with abundant shots and speech (YouCook2 and ViTT), which demonstrate strong zero-shot performance in both the visual and multimodal settings.
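As an illustration of the third module, the following is a minimal sketch of caption-to-frame retrieval by feature similarity. It is not the FreeDVC implementation. It assumes that caption and frame features have already been extracted by some image-text model and are given as plain vectors. The function name `align_captions`, the parameters `top_k` and `min_sim`, and the rule that turns retrieved frames into a time span are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_captions(
    caption_feats: List[Sequence[float]],
    frame_feats: List[Sequence[float]],
    fps: float,
    top_k: int = 2,        # assumption: number of frames retrieved per caption
    min_sim: float = 0.5,  # assumption: captions below this score are dropped
) -> List[Tuple[int, float, float]]:
    """Return (caption_index, start_sec, end_sec) for each retained caption."""
    results = []
    for idx, cap in enumerate(caption_feats):
        sims = [cosine(cap, frame) for frame in frame_feats]
        # Indices of the top_k most similar frames, best first.
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
        # Drop captions whose best-matching frame is still a poor match.
        if not ranked or sims[ranked[0]] < min_sim:
            continue
        # Assumption: the span covers the earliest to the latest retrieved frame.
        start = min(ranked) / fps
        end = (max(ranked) + 1) / fps
        results.append((idx, start, end))
    return results


# Toy features: frames 0-1 show one scene, frames 2-3 another.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # the last matches no frame
aligned = align_captions(captions, frames, fps=1.0)
```

In this toy input, the third caption has no sufficiently similar frame and is filtered out, which mirrors the idea of discarding captions produced by shot-speech misalignment. A real system would need to choose its threshold and span rule empirically.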