LLMVS/index.html at main · POSTECH-CVLab/LLMVS · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Video Summarization with Large Language Models</title>
  <!-- Bootstrap -->
  <link href="css/bootstrap-4.4.1.css" rel="stylesheet">
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
  <!-- <link href='http://fonts.googleapis.com/css?family=Open+Sans:400italic,700italic,800italic,400,700,800' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" type="text/css" href="css/project.css" media="screen" />
    <link rel="stylesheet" type="text/css" media="screen" href="css/iconize.css" />
    <script src="js/google-code-prettify/prettify.js"></script> -->


  <script>
    MathJax = {
      tex: {
        inlineMath: [['$', '$'], ['\\(', '\\)']]
      },
      svg: {
        fontCache: 'global'
      }
    };
    </script>
    <script type="text/javascript" id="MathJax-script" async
      src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
    </script>
</head>

<!-- cover -->
<section>
  <div class="jumbotron text-center mt-0">
    <div class="container">
      <div class="row">
        <div class="col-12">
          <h2>Video Summarization with Large Language Models</h2>
          <h4 style="color:#5a6268;">CVPR 2025</h4>
          <hr>
          <h6>
            <a href="https://mlee47.github.io/" target="_blank">Min Jung Lee</a>,
            <a href="https://gongda0e.github.io/" target="_blank">Dayoung Gong</a>,
            <a href="https://cvlab.postech.ac.kr/~mcho/" target="_blank">Minsu Cho</a>
          </h6>
          <p>
            Pohang University of Science and Technology (POSTECH), GenGenAI
          </p>

          <div class="row justify-content-center">
            <div class="column">
              <p class="mb-5"><a class="btn btn-large btn-light" href="https://arxiv.org/pdf/2504.11199" role="button"
                  target="_blank">
                  <i class="fa fa-file"></i> Paper</a> </p>
            </div>
            <div class="column">
              <p class="mb-5"><a class="btn btn-large btn-light" href="https://github.com/mlee47/LLMVS" role="button"
                  target="_blank">
                  <i class="fa fa-file"></i> Code</a> </p>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>


<!-- abstract -->
<section>
  <div class="container">
    <div class="row">
      <div class="col-12 text-center">
        <br>
        <h2>Abstract</h2>
        <!-- <hr style="margin-top:0px"> -->

        <p class="text-left">
          The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes.
Our method, dubbed  <b>LLM</b>-based <b>V</b>ideo <b>S</b>ummarization (<b>LLMVS</b>), translates video frames into a sequence of captions using a Muti-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.
        </p>
        <br>
      </div>
    </div>
  </div>
</section>
<br>
<br>

<section>
  <div class="container">
    <div class="row">
      <div class="col-12 text-center">
        <h2>Contributions</h2>
        <!-- <hr style="margin-top:0px"> -->
      </div>
        <br>
        <div class="col-12 text-center">
        </div>
        <div class="col-12" style="display:flex; align-items:center; margin-bottom:20px;">
          <img src="images/teaser.png" style="width:50%; margin-top:10px; margin-right:20px;">
          <div>
            <ol>
              <li><b>(M-)LLM-centric summarization</b> <br>
                We introduce LLMVS, a novel video summarization framework that leverages (M-)LLMs to utilize textual data and general knowledge in video summarization effectively.
              </li>
              <li><b>Local-to-global architecture</b> <br>
                The proposed local-to-global video summarization framework integrates local context via window-based aggregation and global context through self-attention, enabling a comprehensive understanding of video content.
              </li>
              <li><b>Leveraging LLM Embeddings</b> <br>
                Experimental results show that using output embeddings from LLMs is more effective for video summarization than using direct answers generated by LLMs.
              </li>
            </ol>
          </div>
        </div>
      </div>
    </div>
</section>
<br>
<br>
<br>


<section>
  <div class="container">
    <div class="row">
      <div class="col-12 text-center">
        <h2>Method</h2>
      </div>

      <div class="col-12">
        <p class="text-left">
          <!-- <h4>Evaluation on GCD</h4> -->
          <img src="images/method.png" style="width:100%; margin-top:10px; margin-bottom:10px;">
          <br>
          Our LLMVS framework consists of three key components: text description generation, local importance scoring, and global context aggregation.
          First, captions for each video frame are generated using a pre-trained Multi-modal Large Language Model (M-LLM).
          These captions are then incorporated into the query component of an LLM by segmenting through a sliding window local context, while instructions and examples are provided as part of the in-context learning prompt.
          We obtain the output embeddings from an intermediate layer of the LLM, categorized into instructions, examples, queries, and answers. The query and answer embeddings are pooled and passed through an MLP to produce inputs for the global context aggregator, which encodes the overall context of the input video.
          Finally, we obtain the output score vectors for the corresponding input video frames.
          The output score vectors are then optimized using the Mean Squared Error (MSE) loss function.
          <br>
        </p>
      </div>
    </div>
  </div>
</section>
<br>
<br>
<br>

<section>
  <div class="container">
    <div class="row">
      <div class="col-12 text-center">
        <h2>Experiments</h2>
      </div>

      <div class="col-12">
        <p class="text-left">

          <div class="col-12 text-center">
            <h4>Comparison with the state of the arts</h4>
          </div>
          <img src="images/experiment1.png" style="width:60%; margin-top:10px; display: block; margin-left: auto; margin-right: auto;">
          <br>
          LLMVS achieves state-of-the-art performance on both datasets, significantly outperforming the zero-shot LLM by effectively addressing both general and subjective summarization aspects.
          The global context aggregator \(\psi\) plays a key role by enhancing LLM's sequence-level reasoning. Unlike prior models that treat text as auxiliary, LLMVS places language at the core, leveraging LLMs' reasoning to generate more coherent and semantically rich summaries.
          <br>
          <br>
          <br>

          <div class="col-12 text-center">
            <h4>Finetuning (M-)LLM, \(\phi\) and \(\pi\)</h4>
          </div>
          <img src="images/experiment2.png" style="width:60%; margin-top:10px; display: block; margin-left: auto; margin-right: auto;">
          <br>
          We conduct experiments using M-LLM and LLM in both zero-shot and finetuned settings.
          In zero-shot, providing captions from M-LLM to the LLM improved results over direct scoring by M-LLM alone.
          For finetuned models, we applied LoRA, with improvements shown in the first and second rows, and third and fourth rows.
          LLMVS showed greater improvements in the fifth row, demonstrating its effectiveness beyond simple finetuning.
          <br>
          <br>
          <br>

          <div class="col-12 text-center">
            <h4>Prompting to (M-)LLM, \(\phi\) and \(\pi\)</h4>
          </div>
          <img src="images/experiment3.png" style="width:60%; margin-top:10px; display: block; margin-left: auto; margin-right: auto;">
          <br>
          We evaluate different prompting styles for M-LLM and LLM.
          For M-LLM, a generic prompt outperforms region-specific prompts, suggesting that broader descriptions better capture scene dynamics.
          For LLM, direct numerical scoring of frame importance is more effective than textual summarization, indicating that explicit importance scores provide clearer evaluations of frame significance.
        </p>
      </div>
    </div>
  </div>
</section>
<br>
<br>
<br>

<section>
  <div class="container">
    <div class="row">
      <div class="col-12 text-center">
        <h2>Qualitative results</h2>
      </div>

      <div class="col-12">
        <p class="text-left">
        <img src="images/qual.png" style="width:100%; margin-top:10px; margin-bottom:10px;">
        The x- and y-axes are time step \(t\) and importance score \(s\), respectively.
        In this figure, the blue line represents the average user scores, while the orange line shows the normalized predicted scores from our model.
        All scores are in the range of 0 to 1.
        Green areas indicate segments that received high importance scores, while pink areas correspond to segments with low scores.
        <br>
        LLMVS produces importance scores that closely follow human annotations, particularly emphasizing action-related scenes.
        Qualitative examples show that static or interview segments receive lower scores, while dynamic actions like riding or performing stunts are rated higher, indicating the model's ability to highlight narratively significant, high-energy content.

        <div class="col-12">
        <p class="text-left">
        </p>
      </div>
    </div>
  </div>
</section>
<br>


<!-- citing -->
<div class="container">
  <div class="row ">
    <div class="col-12">
      <h3>Citation</h3>
      <hr style="margin-top:0px">
      <pre style="background-color: #e9eeef;padding: 1.25em 1.5em">
<code>
  @inproceedings{lee2025video,
    title={Video Summarization with Large Language Models},
    author={Lee, Minjung and Gong, Dayoung and Cho, Minsu},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2025}
  }
</code>
              </pre>
      <hr>
    </div>
  </div>
</div>

<footer class="text-center" style="margin-bottom:10px">
  Thanks to <a href="https://lioryariv.github.io/" target="_blank">Lior Yariv</a> for the website template.
</footer>

</body>

</html>