Yue Feng, Ziyi Yan, Yizhen Jia, Ethan Q. Chen, Jie Qin. Training-free Dense Video Captioning with Large-scale Pretrained Models. Machine Intelligence Research. DOI: 10.1007/s11633-025-1585-x

Training-free Dense Video Captioning with Large-scale Pretrained Models

  • Dense video captioning (DVC) aims to locate important segments in videos and generate captions for them simultaneously. Despite relying on large datasets for pretraining, traditional methods struggle in zero-shot scenarios, which limits their real-world applicability. In this paper, we propose a training-free dense video captioning framework (FreeDVC) that leverages the general capabilities of pretrained large models to achieve accurate zero-shot DVC. Our framework incorporates three key architectural contributions: 1) a shot segment description module that generates internally consistent semantic descriptions of shot segments via an improved shot boundary detection model and a visual-language model; 2) a shot-speech fusion module that adaptively integrates transcribed speech into shot segment descriptions via a large language model, aligning semantic and temporal information; and 3) a caption retrieval alignment module that uses image-text feature similarity to retrieve the frames most similar to each caption, filtering out incorrect captions caused by shot-speech misalignment. We conduct extensive experiments on two DVC datasets featuring real-world internet videos with abundant shots and speech (i.e., YouCook2 and ViTT), which demonstrate strong zero-shot performance in both visual and multimodal settings.
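
The caption retrieval alignment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which feature extractor, similarity measure, or filtering rule FreeDVC uses. The sketch assumes that caption and frame features are already available as plain vectors in a shared embedding space, that similarity is cosine similarity, and that a caption is dropped when its best-matching frame scores below a fixed threshold. The names `cosine`, `retrieve_and_filter`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_and_filter(caption_feats, frame_feats, threshold=0.5):
    """For each caption, find its most similar frame; keep the caption only if
    that similarity reaches `threshold`.

    Assumption: the fixed threshold is an illustrative filtering rule, not
    something stated in the source.
    Returns a list of (caption_index, best_frame_index, similarity) tuples.
    """
    if not frame_feats:
        return []
    kept = []
    for ci, cap in enumerate(caption_feats):
        sims = [cosine(cap, frame) for frame in frame_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            kept.append((ci, best, sims[best]))
    return kept


if __name__ == "__main__":
    # Toy 2-D features: caption 0 points toward frame 0, caption 1 toward neither.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.1], [-1.0, -1.0]]
    for ci, fi, sim in retrieve_and_filter(captions, frames):
        print(f"caption {ci} -> frame {fi} (similarity {sim:.3f})")
```

In this toy example the second caption has negative similarity to every frame and is filtered out, which mirrors the stated goal of discarding captions that do not match the visual content.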