Summarizing Video Content with a Single Image Using Large Language Models

EasyChair Preprint 14970
6 pages • Date: September 21, 2024

Abstract

Generating thumbnails for news videos plays an important role in helping viewers understand their content efficiently. Prior techniques mostly handle this task by selecting one keyframe as a representative image. However, this approach cannot effectively handle a video whose key content is distributed across different frames. In this paper, we propose summarizing a news video by composing its key content into a single image that serves as a thumbnail. To achieve this, our method starts by extracting text from each scene in the video using OCR, speech recognition, and existing image captioning models. We then group these texts by similarity and leverage large language models to score the significance of each group. Next, for each group, a keyframe is selected by jointly considering importance and content quality. Finally, we compose the objects in these keyframes into a single thumbnail image without overlap and use diffusion-based generative models to further refine its quality. Experiments on real-world news videos demonstrate that our method can effectively extract key video content and generate natural and informative video thumbnails.

Keyphrases: large language models, video semantic analysis, video thumbnail generation
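The grouping and keyframe-selection steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Scene` fields, the token-level Jaccard similarity, the greedy grouping rule, the `threshold` value, and the hard-coded `importance` scores (which stand in for scores a large language model would return). The joint consideration of importance and quality is reduced to ranking groups by importance and taking the best-quality frame within each group.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    frame_id: int   # hypothetical index of the scene's candidate keyframe
    text: str       # text merged from OCR, speech recognition, and captioning
    quality: float  # hypothetical content-quality score in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a stand-in for the paper's similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def group_scenes(scenes, threshold=0.3):
    """Greedy grouping: a scene joins the first group whose first member is similar enough."""
    groups = []
    for scene in scenes:
        for group in groups:
            if jaccard(scene.text, group[0].text) >= threshold:
                group.append(scene)
                break
        else:
            # No existing group matched, so this scene starts a new one.
            groups.append([scene])
    return groups


def select_keyframes(groups, importance):
    """Take the best-quality frame in each group, then order groups by importance."""
    picks = []
    for group, score in zip(groups, importance):
        best = max(group, key=lambda s: s.quality)
        picks.append((best.frame_id, score))
    return sorted(picks, key=lambda p: p[1], reverse=True)


scenes = [
    Scene(0, "flood hits city center", 0.4),
    Scene(1, "flood water rises in city center", 0.9),
    Scene(2, "mayor announces relief fund", 0.7),
]
groups = group_scenes(scenes)
importance = [0.9, 0.6]  # assumed LLM significance scores, one per group
picks = select_keyframes(groups, importance)
print(picks)  # [(1, 0.9), (2, 0.6)]
```

The resulting list of frame ids, ordered by group importance, would then feed the composition stage, which the abstract describes as placing objects from the selected keyframes into one image without overlap before diffusion-based refinement.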